// Edge LLM Runtime

The Edge LLM Runtime Stack 2026: llama.cpp, Ollama, TensorRT Edge-LLM, ExecuTorch, vLLM, MLX, LiteRT-LM

Updated May 2026

Picking the wrong runtime can leave 3× throughput on the table — and trap your model in a format that doesn't port to your next hardware. Here's a decision framework for the seven edge LLM runtimes that actually matter in 2026.

7 runtimes profiled

Hardware pairing matrix

Jetson · Apple · Mobile

Production + prototyping

Quick Answer

llama.cpp — CPU-first, GGUF, the most portable runtime. Pick this when portability matters more than peak throughput.
Ollama — convenience wrapper over llama.cpp with an HTTP API. Best for prototypes and developer workflows; not for production serving.
TensorRT Edge-LLM SDK — NVIDIA's production runtime for Jetson Thor / Orin. Pick this on Jetson when you need predictable latency under load.
ExecuTorch — PyTorch-native, 50KB base footprint, 12+ hardware backends. Best for mobile and microcontroller deployments shipping from a PyTorch training pipeline.
vLLM — server-class, multi-tenant, PagedAttention. Best for an edge server with multiple users or batch jobs.
MLX — Apple Silicon only. Best for Mac-based development or M-series edge deployments.
LiteRT-LM — Google's mobile-first runtime. Best for Android / iOS apps shipping Gemma-class models.

Who This Page Is For

Choosing a runtime stack after picking the edge AI hardware
Evaluating whether to port from Ollama prototypes to a production runtime
Sizing edge LLM workloads where quantization format and KV cache behavior decide throughput
Planning a multi-platform deployment (Jetson + mobile + Mac) without rewriting serving code

// Why It Matters

Why Runtime Choice Actually Matters

The hardware conversation around edge AI gets most of the attention — Jetson Thor, Hailo-8, Coral, Apple Neural Accelerators. The runtime conversation gets almost none of it. That's a mistake. The runtime is the layer that determines whether your 70B model actually hits production latency, whether it survives a JetPack upgrade, and whether you can port to a second platform without rewriting your serving code.

Three reasons it matters more than people assume:

Throughput swings of 3× are normal. The same Llama 3.3 70B model on the same Jetson Thor delivers very different throughput under llama.cpp (GGUF Q4_K_M), TensorRT-LLM with NVFP4, and TensorRT-LLM with EAGLE-3 speculative decoding. NVIDIA's blog documents a 2.5× uplift from speculative decoding alone. None of that is silicon; it's all runtime.
Quantization formats don't port cleanly. A GGUF Q4_K_M file is portable across CPUs, Jetsons, and Apple Silicon — but you'll leave Blackwell-specific NVFP4 throughput on the table. A model converted to NVFP4 won't run on Orin. ExecuTorch .pte files won't run in llama.cpp. Plan for this at design time, not deployment time.
KV cache management is where production deployments live or die. For long-context workloads, the KV cache often exceeds model weight memory. Runtimes differ wildly in how they handle it — PagedAttention (vLLM), KV cache quantization (TensorRT-LLM, llama.cpp), and attention-sink eviction (StreamingLLM, DuoAttention) are all on the table.
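
To make the KV cache point concrete, here is a rough sizing sketch. It assumes the publicly documented Llama 3 70B attention shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache. The function name and the 4-bit weight estimate are illustrative and not taken from any runtime.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `batch` sequences of `seq_len` tokens."""
    # Factor 2 = one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

GIB = 1024 ** 3

# Assumed Llama 3 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **LLAMA_70B)          # FP16 cache
at_128k = kv_cache_bytes(seq_len=128 * 1024, **LLAMA_70B)   # one 128k-token sequence
weights_4bit = 70e9 * 0.5                                   # ~4 bits per weight, ignoring scales

print(f"KV per token:       {per_token / 1024:.0f} KiB")
print(f"KV at 128k context: {at_128k / GIB:.1f} GiB")
print(f"4-bit weights:      {weights_4bit / GIB:.1f} GiB")
```

Under these assumptions a single 128k-token sequence needs about 40 GiB of FP16 KV cache, more than the roughly 33 GiB of 4-bit weights. Passing bytes_per_elem=1 models an 8-bit KV cache and halves the figure.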

// Runtime Profiles

The Seven Runtime Profiles

llama.cpp — CPU king, max portability

The de facto standard for CPU-based LLM inference, which has evolved from a proof of concept into a core production tool. Written in C++ with extensive SIMD optimization, it runs on x86, ARM, Apple Silicon, and even Raspberry Pi. GGUF is now the lingua franca for quantized model distribution on Hugging Face, and hundreds of community model ports use it.

Language	C++
Format	GGUF
Hardware	CPU, CUDA, Metal, Vulkan
License	MIT
Strengths	Most portable runtime in the ecosystem. Zero dependencies. Hundreds of model architectures supported. Active daily development.
Limitations	Leaves silicon-specific throughput on the table: Jetson Thor's NVFP4, Apple's ANE, Qualcomm's NPU. Its single-process server is not designed for multi-tenant serving.
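
Part of GGUF's portability comes from a simple, self-describing container. As a small illustration, the sketch below reads the fixed-size header fields (magic, version, tensor count, metadata key-value count) in the little-endian layout used by GGUF version 2 and later. The helper name read_gguf_header and the synthetic bytes are made up for this example; real files would normally be handled by llama.cpp's own tooling.

```python
import struct
from typing import NamedTuple

class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int

def read_gguf_header(data: bytes) -> GGUFHeader:
    """Parse the fixed-size GGUF header (little-endian, GGUF v2+ layout)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes for a GGUF header")
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file (bad magic)")
    # uint32 version, uint64 tensor count, uint64 metadata key-value count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return GGUFHeader(version, tensor_count, kv_count)

# Synthetic header for demonstration only. For a real file, read the first
# 24 bytes with open(path, "rb").read(24).
fake = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
header = read_gguf_header(fake)
print(header)
```

The metadata key-value section that follows the header carries the architecture and tokenizer details, so a GGUF file can be loaded without a separate config.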

Ollama — developer experience and prototyping

A lightweight Go wrapper around llama.cpp with an HTTP API, model management, and a one-line install. Ollama crossed 165k GitHub stars in 2026 because ollama run deepseek-r1 just works. Most developer prototypes and self-hosted AI stacks (Open WebUI, Dify, n8n) sit on top of Ollama rather than llama.cpp directly.

Language	Go
Format	GGUF (via llama.cpp)
Hardware	CPU, CUDA, Metal
License	MIT
Strengths	Zero-friction local inference. Massive model library. Excellent for prototyping, agent workflows, and personal AI assistants.
Limitations	Inherits llama.cpp's serving limits — not designed for high-concurrency production. Adds latency overhead vs calling llama.cpp directly.
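
Because Ollama exposes a plain HTTP API, a prototype client needs nothing beyond the standard library. The sketch below builds a non-streaming request against Ollama's /api/generate endpoint on its default port 11434. The helper names are illustrative, and the generate call only succeeds if an Ollama server is running locally with the model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request (no network I/O happens here)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text; needs a running server."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Usage, with a local server running:
#   print(generate("deepseek-r1", "Summarize PagedAttention in one sentence."))
```

Keeping request construction separate from the network call makes the client easy to unit-test and easy to repoint at a different backend when the prototype moves off Ollama.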

NVIDIA TensorRT Edge-LLM SDK — Jetson production runtime

New in JetPack 7.1 (Jan 2026). An open-source C++ runtime built specifically for Jetson-class devices that operate under tight memory budgets, hard latency constraints, and shared GPU/CPU pressure from perception and control workloads. Typical flow: export PyTorch model → ONNX → TensorRT optimization → deploy engine. Supports NVFP4, FP8, W4A16, and EAGLE-3 speculative decoding on Thor.

Language	C++
Format	ONNX → TensorRT engine
Hardware	Jetson Thor, Orin, T4000
License	Open-source (Apache-2)
Strengths	Highest measured throughput on Jetson hardware. Native NVFP4 unlocks 70B+ models on Thor. Predictable latency under load — designed for real-time systems.
Limitations	Jetson-only. ONNX conversion can be a sharp edge for non-standard architectures. Community model ports lag llama.cpp by weeks to months.
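
A back-of-the-envelope weight budget shows why low-bit formats decide how comfortably a 70B model fits on a Jetson-class device. The numbers below are assumptions for illustration: NVFP4 is modeled as 4-bit values plus one 8-bit scale per 16-weight block (about 4.5 bits per weight), Thor is assumed to have 128 GB of unified memory, and the 16 GB reserve for KV cache, OS, and other workloads is an arbitrary placeholder.

```python
# Assumed effective bits per weight (illustrative, not measured).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16,  # 4-bit value + one 8-bit scale shared by 16 weights
}

def weight_gb(n_params: float, fmt: str) -> float:
    """Approximate weight memory in GB (1e9 bytes) for a given format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

def fits(n_params: float, fmt: str, device_gb: float, reserve_gb: float = 16.0) -> bool:
    """True if the weights fit after reserving memory for everything else."""
    return weight_gb(n_params, fmt) <= device_gb - reserve_gb

THOR_GB = 128.0  # assumed unified memory on Jetson AGX Thor

for fmt in BITS_PER_WEIGHT:
    gb = weight_gb(70e9, fmt)
    print(f"70B @ {fmt:5s}: {gb:6.1f} GB  fits={fits(70e9, fmt, THOR_GB)}")
```

Under these assumptions FP16 weights (140 GB) do not fit, FP8 (70 GB) fits with limited headroom, and NVFP4 (about 39 GB) leaves most of the memory free for KV cache and the perception stack.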

Meta ExecuTorch — PyTorch-native for mobile and embedded

Meta's production runtime for on-device PyTorch, which reached v1.0 in late 2025. Its 50KB base footprint lets it run on everything from microcontrollers to flagship smartphones, across twelve hardware backends. It now powers AI features in Instagram, WhatsApp, Messenger, and Facebook at billions-of-users scale. Around 80% of the most popular edge LLMs on Hugging Face have working ExecuTorch exports.

Language	C++ / Python
Format	.pte (PyTorch export)
Hardware	Apple, Qualcomm, Arm, MediaTek, Vulkan, + 8 more
License	BSD-3
Strengths	Direct PyTorch export — no ONNX or TFLite conversion step. Smallest runtime footprint in the field. Best cross-vendor mobile NPU support.
Limitations	Only meaningful if your training pipeline is PyTorch. Newer than llama.cpp, so tooling and community knowledge are thinner. Quantization workflow is more involved than GGUF.

vLLM — edge server, multi-tenant

Server-class LLM serving built around PagedAttention, the algorithm that raised serving throughput 2–3× by treating the KV cache like virtual memory pages. Originally a datacenter project, it is increasingly used on edge servers and even on Jetson Thor for multi-user scenarios. NVIDIA's Cat AI Assistant demo at CES 2026 used vLLM serving Qwen3 4B on Jetson Thor.

Language	Python + CUDA kernels
Format	Hugging Face safetensors
Hardware	NVIDIA, AMD, Intel Arc, TPU
License	Apache-2
Strengths	Best concurrency story in the field — handles many simultaneous requests without thrashing KV cache. OpenAI-compatible API. Wide hardware support beyond NVIDIA.
Limitations	Heavier than embedded-class runtimes — assumes a real server environment, Python, and a few hundred MB of RAM just to start. Not designed for single-process, latency-critical robotics.
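
The virtual-memory analogy behind PagedAttention can be sketched in a few lines. This toy allocator is not vLLM's implementation, and the class and method names are invented for illustration. It shows only the bookkeeping idea: each sequence owns a block table that maps logical token positions to fixed-size physical blocks, so memory is claimed one block at a time and returned to a shared pool when a request finishes.

```python
class PagedKVAllocator:
    """Toy block-table bookkeeping in the spirit of PagedAttention (no tensors)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # sequence id -> list of physical blocks
        self.lengths = {}                           # sequence id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:           # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("req-a")      # 20 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token("req-b")      # 5 tokens occupy 1 block
print(len(alloc.free_blocks))        # 1 block left in the pool
alloc.free("req-a")
print(len(alloc.free_blocks))        # 3 blocks free again
```

Because no sequence reserves its maximum context length up front, wasted memory is bounded by one partially filled block per sequence, which is what lets many concurrent requests share a fixed KV cache budget.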

Apple MLX — Apple Silicon only

Apple's array framework, optimized for the unified memory architecture of M-series chips. It offers a NumPy-like API in Python, with a Swift binding for native iOS/macOS apps. MLX-VLM extends it to vision-language models. On an M3 Max with 64GB unified memory, a 4-bit quantized Llama 3.1 70B runs at ~8 tokens/sec, which is practical for local development.

Language	C++ / Python / Swift
Format	MLX-native, GGUF
Hardware	M-series (M1/M2/M3/M4/M5)
License	MIT
Strengths	Best throughput on Apple Silicon. Unified memory means no host-device copies. Excellent developer experience for Mac-based ML workflows. Native Swift integration for shipping apps.
Limitations	Apple Silicon only. Smaller community than llama.cpp or ExecuTorch. Not relevant for non-Apple edge deployments.
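
The ~8 tokens/sec figure quoted above for a 70B model on an M3 Max is roughly what a memory-bandwidth estimate predicts. During single-stream decoding, each generated token requires reading essentially all model weights once, so bandwidth divided by weight size gives an upper bound. The 400 GB/s bandwidth and the 4.5 bits-per-weight figure below are assumptions for illustration, not values from this page.

```python
def decode_tokens_per_sec_bound(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed for a dense model.

    Assumes every weight is read once per token and ignores KV cache reads,
    compute time, and kernel overhead, so real throughput is lower.
    """
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes

# Assumed: 70B parameters, ~4.5 bits/weight for a 4-bit format with scales,
# and 400 GB/s of unified-memory bandwidth.
bound = decode_tokens_per_sec_bound(70e9, 4.5, 400.0)
print(f"upper bound: {bound:.1f} tokens/sec")
```

Under these assumptions the ceiling is about 10 tokens/sec, so ~8 tokens/sec is already close to what the memory system allows. For single-user decoding, fewer bits per weight usually buys more speed than extra compute.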

Google LiteRT-LM — Android / iOS mobile

Google's LLM runtime built on LiteRT (the successor to TFLite), tuned for running Gemma-class small models on phones. The Google AI Edge Gallery reference app on the Play Store and App Store is built on LiteRT-LM; its open-source codebase shows the full model-management, inference, and agentic tool-call pipeline for production mobile AI. Real-world performance: a small Gemma variant generates draft text on an iPhone in airplane mode in a few seconds.

Language	C++ / Kotlin / Swift
Format	LiteRT (TFLite successor)
Hardware	Android NNAPI, iOS Core ML, Apple ANE
License	Apache-2
Strengths	Best mobile NPU support across both Android and iOS. Google-maintained, ships in Pixel and Android AI features. Reference app gives you a working production pipeline to fork.
Limitations	Mobile-only — not for Jetson or edge servers. Model support is narrower than llama.cpp; you're mostly running Google-published variants. Agentic tool-call reliability is "fine for demo, not yet fine for production."

// Pairing Matrix

Hardware × Runtime Pairing Matrix

The right runtime is almost always determined by the hardware you've already chosen. Use this as a sanity check:

Hardware	Best (production)	Best (prototyping)	Notes
Jetson AGX Thor	TensorRT Edge-LLM SDK	Ollama (via llama.cpp)	NVFP4 requires TensorRT. vLLM viable for multi-user.
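Jetson AGX Orin 64GB	TensorRT-LLM / Edge-LLM	Ollama	No FP4 hardware, and Orin's Ampere-generation GPU lacks native FP8 tensor cores; INT8 and 4-bit weight-only (W4A16) formats are the practical ceiling.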
Jetson Orin Nano / NX	llama.cpp (CUDA)	Ollama	Small models only — 7B class at Q4_K_M is the realistic ceiling.
Raspberry Pi 5 / SBC (ARM)	llama.cpp (CPU)	Ollama	1B–3B models. Offline assistants, sensor fusion narration.
x86 edge server (no GPU)	llama.cpp	Ollama	CPU-only inference — 7B Q4 hits usable token rates on modern Xeons.
x86 edge server (NVIDIA GPU)	vLLM	Ollama	Multi-tenant serving is where vLLM beats everything.
Apple Mac (M-series)	MLX	Ollama (Metal)	Unified memory makes Macs surprisingly capable edge dev boxes.
iPhone / iPad	LiteRT-LM or ExecuTorch	LiteRT-LM (Edge Gallery)	ExecuTorch wins if your training pipeline is PyTorch.
Android phone	LiteRT-LM	LiteRT-LM	Best NPU support via NNAPI / vendor delegates.
Qualcomm RB5 / RB3	ExecuTorch (QNN backend)	llama.cpp	ExecuTorch's QNN backend unlocks the Hexagon NPU.
Hailo-8 (PCIe accelerator)	Hailo Model Zoo SDK	n/a	LLM support is limited; primarily a vision accelerator.
Coral TPU	TFLite (LiteRT)	n/a	Not a real LLM platform. Vision-only for production purposes.
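
If the matrix needs to live in deployment tooling rather than in a document, it can be encoded as a small lookup table. The sketch below copies a subset of rows from the matrix above; the dictionary layout, the short hardware keys, and the pick_runtime helper are illustrative choices.

```python
# (production, prototyping) per hardware target, copied from the matrix above.
# The short keys are made up for this sketch.
PAIRING = {
    "jetson-agx-thor": ("TensorRT Edge-LLM SDK", "Ollama (via llama.cpp)"),
    "jetson-orin-nano": ("llama.cpp (CUDA)", "Ollama"),
    "raspberry-pi-5": ("llama.cpp (CPU)", "Ollama"),
    "x86-server-cpu": ("llama.cpp", "Ollama"),
    "x86-server-nvidia": ("vLLM", "Ollama"),
    "apple-mac": ("MLX", "Ollama (Metal)"),
    "android-phone": ("LiteRT-LM", "LiteRT-LM"),
    "qualcomm-rb5": ("ExecuTorch (QNN backend)", "llama.cpp"),
}

def pick_runtime(hardware: str, stage: str = "production") -> str:
    """Return the suggested runtime for a hardware key and project stage."""
    if stage not in ("production", "prototyping"):
        raise ValueError(f"unknown stage: {stage!r}")
    try:
        production, prototyping = PAIRING[hardware]
    except KeyError:
        raise KeyError(f"no pairing recorded for {hardware!r}") from None
    return production if stage == "production" else prototyping

print(pick_runtime("jetson-agx-thor"))
print(pick_runtime("apple-mac", stage="prototyping"))
```

Failing loudly on an unknown hardware key is deliberate: a missing row should prompt a conscious runtime decision rather than a silent default.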

// Common Mistakes

Three Mistakes We See Most Often

1. Picking a runtime before picking the model

Teams pick "we'll use llama.cpp" before knowing what model they'll deploy, then end up needing a multimodal model that's only well-supported in ExecuTorch, or a 70B that needs NVFP4 via TensorRT. The runtime should come after model selection. Pick the model your task needs, then pick the runtime that runs it best on your hardware.

2. Ignoring the KV cache for long-context workloads

The KV cache grows linearly with sequence length and batch size, and can exceed model weight memory for long-context inference. For RAG, agent workflows, or document summarization at the edge, KV cache management is often more impactful than weight quantization. Use a runtime that supports KV cache quantization (TensorRT-LLM does; llama.cpp partially does) or attention-sink eviction (StreamingLLM-style approaches, now in vLLM).
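
As an illustration of the StreamingLLM-style approach mentioned above, the sketch below keeps a handful of initial "sink" tokens plus a sliding window of recent tokens and evicts everything in between, which bounds cache size no matter how long generation runs. It tracks token positions only; a real runtime would evict the matching key/value tensors and handle positional encoding. The class name and the default sizes are arbitrary choices for this example.

```python
from collections import deque

class SinkWindowCache:
    """Keep the first `n_sink` token positions plus the most recent `window`."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                      # earliest positions, never evicted
        self.recent = deque(maxlen=window)   # oldest entry is dropped automatically

    def append(self, position):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(position)
        else:
            self.recent.append(position)

    def kept_positions(self):
        return self.sinks + list(self.recent)


cache = SinkWindowCache(n_sink=4, window=8)
for pos in range(100):
    cache.append(pos)
print(cache.kept_positions())
# [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

The trade-off is that evicted middle tokens are gone, so this suits streaming or chat-style workloads better than retrieval tasks that need exact recall from anywhere in a long document.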

3. Treating "it runs in Ollama on my laptop" as a deployment plan

Ollama is excellent for prototyping. It's not a production runtime for embedded devices. The latency overhead, the always-on HTTP server, the model-management daemon — none of it survives a real industrial deployment. Prototype in Ollama, then port to llama.cpp directly or to a platform-native runtime (TensorRT Edge-LLM on Jetson, ExecuTorch on mobile, MLX on Apple) before you ship.

// Forecast

What We Expect to Change in 2026

llama.cpp Vulkan backend matures — making it the default cross-vendor GPU path and reducing CUDA's lock-in for community models.
ExecuTorch picks up share on Jetson — as the PyTorch export workflow tightens, expect ExecuTorch to compete with TensorRT Edge-LLM for non-NVFP4 workloads.
vLLM lands on smaller hardware — work on a "vLLM Edge" variant for single-device, low-concurrency edge servers is already in flight in the community.
1-bit and BitNet runtimes go from research to early production — particularly for ultra-low-power deployments where weight memory dominates power draw.

// FAQ

Frequently Asked Questions

Which edge LLM runtime is best for Jetson in 2026?

For production on Jetson AGX Thor or AGX Orin, NVIDIA's TensorRT Edge-LLM SDK delivers the highest measured throughput — it's the only runtime that unlocks NVFP4 on Thor and supports EAGLE-3 speculative decoding. For prototyping on smaller Jetsons (Orin Nano, Orin NX), llama.cpp with CUDA or Ollama is the pragmatic choice.

Should I use Ollama in production?

No. Ollama is an excellent prototyping and developer-experience tool, but it's a thin wrapper over llama.cpp with an HTTP API and model-management daemon. For embedded production, port to llama.cpp directly or to a platform-native runtime (TensorRT Edge-LLM on Jetson, ExecuTorch on mobile, MLX on Apple Silicon) to avoid the latency overhead and always-on server.

What's the difference between llama.cpp and ExecuTorch?

llama.cpp is C++ first, CPU-first, with optional CUDA/Metal/Vulkan backends, consuming the portable GGUF format. ExecuTorch is PyTorch-native, with twelve hardware backends including Apple, Qualcomm, MediaTek, Arm, and Vulkan; it consumes PyTorch's own .pte export format. Pick llama.cpp for maximum hardware portability; pick ExecuTorch when your training pipeline is PyTorch and you want the smallest possible runtime footprint (50KB base) on mobile or microcontrollers.

Why does runtime choice affect throughput by up to 3×?

Throughput for the same model on the same hardware can differ dramatically between runtimes because of three factors: (1) quantization format support (NVFP4 on Blackwell, FP8 in TensorRT, GGUF Q4 elsewhere); (2) speculative decoding, where EAGLE-3 alone delivers up to 2.5× uplift on certain models; and (3) KV cache management, where PagedAttention (vLLM), KV cache quantization (TensorRT, llama.cpp), and attention-sink eviction all reshape throughput on long contexts.
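
For the speculative decoding factor, a simple model shows where multi-× gains can come from. It uses the standard analysis for draft-and-verify decoding, in which each drafted token is accepted independently with probability alpha; this is a simplification and not a description of EAGLE-3 specifically. With gamma drafted tokens per step, the expected number of tokens produced per target-model pass is (1 - alpha^(gamma+1)) / (1 - alpha). The acceptance rates below are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    alpha: probability that a drafted token is accepted (assumed independent)
    gamma: number of tokens drafted per step
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Hypothetical acceptance rates with 4 drafted tokens per step.
for alpha in (0.6, 0.7, 0.8):
    print(alpha, round(expected_tokens_per_step(alpha, gamma=4), 2))
```

At an acceptance rate of 0.7 this gives about 2.8 tokens per verification pass. That is an upper bound on the speedup rather than the speedup itself, because running the draft model also costs time.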

Which runtime is best for shipping LLM features in a mobile app?

For Android, Google LiteRT-LM is the default — best NPU support via NNAPI / vendor delegates, and Google maintains the Edge Gallery reference app. For iOS, both LiteRT-LM and ExecuTorch are viable; ExecuTorch wins if your training pipeline is PyTorch. Don't ship Ollama in a mobile app — the model-management daemon doesn't fit the mobile sandbox model.

Bottom line

Hardware decides the shortlist; the model decides the runtime; KV cache behavior decides whether your deployment scales. Start with hardware and model, then pick the runtime from the pairing matrix — don't reverse the order.

Match runtime to hardware

Start with the Hardware Selector to pin down the platform, then use the pairing matrix above to pick the runtime stack. For end-to-end sizing including BOM and capacity, run the System Designer.
