The Edge LLM Runtime Stack 2026: llama.cpp, Ollama, TensorRT Edge-LLM, ExecuTorch, vLLM, MLX, LiteRT-LM
Updated May 2026
Picking the wrong runtime can leave 3× throughput on the table — and trap your model in a format that doesn't port to your next hardware. Here's a decision framework for the seven edge LLM runtimes that actually matter in 2026.
Quick Answer
- llama.cpp — CPU-first, GGUF, the most portable runtime. Pick this when portability matters more than peak throughput.
- Ollama — convenience wrapper over llama.cpp with an HTTP API. Best for prototypes and developer workflows; not for production serving.
- TensorRT Edge-LLM SDK — NVIDIA's production runtime for Jetson Thor / Orin. Pick this on Jetson when you need predictable latency under load.
- ExecuTorch — PyTorch-native, 50KB base footprint, 12+ hardware backends. Best for mobile and microcontroller deployments shipping from a PyTorch training pipeline.
- vLLM — server-class, multi-tenant, PagedAttention. Best for an edge server with multiple users or batch jobs.
- MLX — Apple Silicon only. Best for Mac-based development or M-series edge deployments.
- LiteRT-LM — Google's mobile-first runtime. Best for Android / iOS apps shipping Gemma-class models.
Who This Page Is For
- Choosing a runtime stack after picking the edge AI hardware
- Evaluating whether to port from Ollama prototypes to a production runtime
- Sizing edge LLM workloads where quantization format and KV cache behavior decide throughput
- Planning a multi-platform deployment (Jetson + mobile + Mac) without rewriting serving code
Why Runtime Choice Actually Matters
The hardware conversation around edge AI gets most of the attention — Jetson Thor, Hailo-8, Coral, Apple Neural Accelerators. The runtime conversation gets almost none of it. That's a mistake. The runtime is the layer that determines whether your 70B model actually hits production latency, whether it survives a JetPack upgrade, and whether you can port to a second platform without rewriting your serving code.
Three reasons it matters more than people assume:
- Throughput swings of 3× are normal. The same Llama 3.3 70B model on the same Jetson Thor moves dramatically between llama.cpp (GGUF Q4_K_M), TensorRT-LLM with NVFP4, and TensorRT-LLM with EAGLE-3 speculative decoding. A 2.5× uplift from speculative decoding alone is documented on NVIDIA's blog. None of that is silicon; it's all runtime.
- Quantization formats don't port cleanly. A GGUF Q4_K_M file is portable across CPUs, Jetsons, and Apple Silicon — but you'll leave Blackwell-specific NVFP4 throughput on the table. A model converted to NVFP4 won't run on Orin. ExecuTorch .pte files won't run in llama.cpp. Plan for this at design time, not deployment time.
- KV cache management is where production deployments live or die. For long-context workloads, the KV cache often exceeds model weight memory. Runtimes differ wildly in how they handle it — PagedAttention (vLLM), KV cache quantization (TensorRT-LLM, llama.cpp), and attention-sink eviction (StreamingLLM, DuoAttention) are all on the table.
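To see why speculative decoding alone can account for an uplift of the size mentioned in the first point, a back-of-the-envelope model helps. The sketch below uses the standard expected-tokens formula for simple chain drafting, assuming each drafted token is accepted independently with the same probability. That is a simplification and does not model EAGLE-3's tree-style drafting; `accept_rate`, `draft_len`, and `draft_cost_ratio` are illustrative parameters, not measured values.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass with chain drafting.

    Assumes every drafted token is accepted independently with probability
    `accept_rate` (a simplifying assumption, not a measured property).
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding.

    `draft_cost_ratio` is the cost of one draft-model step relative to one
    target-model step (hypothetical value supplied by the caller).
    """
    cost_per_step = draft_len * draft_cost_ratio + 1
    return expected_tokens_per_step(accept_rate, draft_len) / cost_per_step


# Illustrative sweep: how acceptance rate drives the estimate.
for rate in (0.6, 0.7, 0.8, 0.9):
    print(f"accept={rate:.1f}  speedup~{estimated_speedup(rate, 4, 0.05):.2f}x")
```

Under these assumptions the estimate is dominated by the acceptance rate, which is why the uplift varies by model and workload rather than being a fixed property of the hardware.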
The Seven Runtime Profiles
llama.cpp — CPU king, max portability
The de facto standard for CPU-based LLM inference, evolved from a proof-of-concept into a core production tool. C++ with extensive SIMD optimization. Runs on x86, ARM, Apple Silicon, even Raspberry Pi. GGUF is now the lingua franca for quantized model distribution on Hugging Face — hundreds of community model ports use it.
| Attribute | Details |
|---|---|
| Language | C++ |
| Format | GGUF |
| Hardware | CPU, CUDA, Metal, Vulkan |
| License | MIT |
| Strengths | Most portable runtime in the ecosystem. Zero dependencies. Hundreds of model architectures supported. Active daily development. |
| Limitations | Leaves silicon-specific throughput on the table (Jetson Thor's NVFP4, Apple's ANE, Qualcomm's NPU). Single-process serving is not designed for multi-tenant use. |
Ollama — developer experience and prototyping
A lightweight Go wrapper around llama.cpp with HTTP API, model management, and a one-line install. Ollama crossed 165k GitHub stars in 2026 because ollama run deepseek-r1 just works. Most developer prototypes and self-hosted AI stacks (Open WebUI, Dify, n8n) sit on top of Ollama rather than llama.cpp directly.
| Attribute | Details |
|---|---|
| Language | Go |
| Format | GGUF (via llama.cpp) |
| Hardware | CPU, CUDA, Metal |
| License | MIT |
| Strengths | Zero-friction local inference. Massive model library. Excellent for prototyping, agent workflows, and personal AI assistants. |
| Limitations | Inherits llama.cpp's serving limits — not designed for high-concurrency production. Adds latency overhead vs calling llama.cpp directly. |
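As a rough illustration of the HTTP API described above, the sketch below builds a non-streaming request against Ollama's `/api/generate` endpoint using only the Python standard library. The default port 11434 and the `deepseek-r1` model tag are assumptions about a local install, and the helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text.

    Requires a running Ollama server with the model already pulled.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With a server running, `generate("deepseek-r1", "Explain GGUF in one sentence.")` would return the completion as a string. Every call goes through an HTTP hop to a separate daemon, which is part of the overhead the Limitations row refers to.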
NVIDIA TensorRT Edge-LLM SDK — Jetson production runtime
New in JetPack 7.1 (Jan 2026). An open-source C++ runtime built specifically for Jetson-class devices that operate under tight memory budgets, hard latency constraints, and shared GPU/CPU pressure from perception and control workloads. Typical flow: export PyTorch model → ONNX → TensorRT optimization → deploy engine. Supports NVFP4, FP8, W4A16, and EAGLE-3 speculative decoding on Thor.
| Attribute | Details |
|---|---|
| Language | C++ |
| Format | ONNX → TensorRT engine |
| Hardware | Jetson Thor, Orin, T4000 |
| License | Open-source (Apache-2) |
| Strengths | Highest measured throughput on Jetson hardware. Native NVFP4 unlocks 70B+ models on Thor. Predictable latency under load — designed for real-time systems. |
| Limitations | Jetson-only. ONNX conversion can be a sharp edge for non-standard architectures. Community model ports lag llama.cpp by weeks to months. |
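The claim that NVFP4 unlocks 70B+ models comes down to weight-memory arithmetic. The sketch below estimates weight footprint from parameter count and effective bits per weight. The 4.5-bit figure for NVFP4 is an assumption (4-bit values plus one 8-bit scale per 16-weight block), and the estimate ignores activations, KV cache, and runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Effective bits per weight. The NVFP4 value is an approximation that
# assumes one 8-bit scale shared by each block of 16 weights.
FORMATS = {
    "FP16": 16.0,
    "FP8": 8.0,
    "NVFP4 (approx.)": 4.5,
}

for name, bits in FORMATS.items():
    print(f"{name:>16}: {weight_memory_gb(70e9, bits):6.1f} GB")
```

For a 70B model this gives about 140 GB at FP16 and roughly 40 GB at 4 bits, which is the difference between not fitting and fitting with room left for the KV cache on a memory-constrained device.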
Meta ExecuTorch — PyTorch-native for mobile and embedded
Meta's production runtime for on-device PyTorch hit v1.0 in late 2025. Its 50KB base footprint lets it run on everything from microcontrollers to flagship smartphones. It supports 12+ hardware backends and now powers AI features in Instagram, WhatsApp, Messenger, and Facebook at billions-of-users scale. Around 80% of the most popular edge LLMs on Hugging Face have working ExecuTorch exports.
| Attribute | Details |
|---|---|
| Language | C++ / Python |
| Format | .pte (PyTorch export) |
| Hardware | Apple, Qualcomm, Arm, MediaTek, Vulkan, + 8 more |
| License | BSD-3 |
| Strengths | Direct PyTorch export — no ONNX or TFLite conversion step. Smallest runtime footprint in the field. Best cross-vendor mobile NPU support. |
| Limitations | Only meaningful if your training pipeline is PyTorch. Newer than llama.cpp, so tooling and community knowledge are thinner. Quantization workflow is more involved than GGUF. |
vLLM — edge server, multi-tenant
Server-class LLM serving built on PagedAttention, the technique that raised throughput 2–3× by treating the KV cache like virtual memory pages. Originally a datacenter project, but increasingly used on edge servers and even on Jetson Thor for multi-user scenarios. NVIDIA's Cat AI Assistant demo at CES 2026 used vLLM serving Qwen3 4B on Jetson Thor.
| Attribute | Details |
|---|---|
| Language | Python + CUDA kernels |
| Format | Hugging Face safetensors |
| Hardware | NVIDIA, AMD, Intel Arc, TPU |
| License | Apache-2 |
| Strengths | Best concurrency story in the field — handles many simultaneous requests without thrashing KV cache. OpenAI-compatible API. Wide hardware support beyond NVIDIA. |
| Limitations | Heavier than embedded-class runtimes — assumes a real server environment, Python, and a few hundred MB of RAM just to start. Not designed for single-process, latency-critical robotics. |
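The "virtual memory pages" idea behind PagedAttention can be illustrated with a toy block allocator. This is a simplified sketch of the bookkeeping only, not vLLM's implementation: the class name `PagedKVAllocator`, the block size of 16 tokens, and the sequence IDs are all hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator: each sequence maps logical blocks to physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real scheduler would preempt or swap")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("user-a")  # 20 tokens span 2 blocks
for _ in range(5):
    alloc.append_token("user-b")  # 5 tokens fit in 1 block
print(len(alloc.block_tables["user-a"]), len(alloc.block_tables["user-b"]), len(alloc.free_blocks))
alloc.free("user-a")
print(len(alloc.free_blocks))
```

Because memory is handed out in small fixed-size blocks rather than reserved up front for the maximum context length, waste per sequence is bounded by one partially filled block, which is what lets many concurrent requests share a limited cache.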
Apple MLX — Apple Silicon only
Apple's array framework optimized for M-series unified memory architecture. NumPy-like API in Python, with a Swift binding for native iOS/macOS apps. MLX-VLM extends it to vision-language models. On an M3 Max with 64GB unified memory, a 4-bit quantized Llama 3.1 70B runs at ~8 tokens/sec — practical for local development.
| Attribute | Details |
|---|---|
| Language | C++ / Python / Swift |
| Format | MLX-native, GGUF |
| Hardware | M-series (M1/M2/M3/M4/M5) |
| License | MIT |
| Strengths | Best throughput on Apple Silicon. Unified memory means no host-device copies. Excellent developer experience for Mac-based ML workflows. Native Swift integration for shipping apps. |
| Limitations | Apple Silicon only. Smaller community than llama.cpp or ExecuTorch. Not relevant for non-Apple edge deployments. |
Google LiteRT-LM — Android / iOS mobile
Google's successor to TFLite, tuned for running Gemma-class small models on phones. The Google AI Edge Gallery reference app on Play Store and App Store is built on LiteRT-LM, and its open-source codebase shows the full model-management, inference, and agentic-tool-call pipeline for production mobile AI. Real-world performance: a small Gemma variant generates draft text on an iPhone in airplane mode in a few seconds.
| Attribute | Details |
|---|---|
| Language | C++ / Kotlin / Swift |
| Format | LiteRT (TFLite successor) |
| Hardware | Android NNAPI, iOS Core ML, Apple ANE |
| License | Apache-2 |
| Strengths | Best mobile NPU support across both Android and iOS. Google-maintained, ships in Pixel and Android AI features. Reference app gives you a working production pipeline to fork. |
| Limitations | Mobile-only — not for Jetson or edge servers. Model support is narrower than llama.cpp; you're mostly running Google-published variants. Agentic tool-call reliability is "fine for demo, not yet fine for production." |
Hardware × Runtime Pairing Matrix
The right runtime is almost always determined by the hardware you've already chosen. Use this as a sanity check:
| Hardware | Best (production) | Best (prototyping) | Notes |
|---|---|---|---|
| Jetson AGX Thor | TensorRT Edge-LLM SDK | Ollama (via llama.cpp) | NVFP4 requires TensorRT. vLLM viable for multi-user. |
| Jetson AGX Orin 64GB | TensorRT-LLM / Edge-LLM | Ollama | No FP4 hardware; FP8/INT8 are the ceiling. |
| Jetson Orin Nano / NX | llama.cpp (CUDA) | Ollama | Small models only — 7B class at Q4_K_M is the realistic ceiling. |
| Raspberry Pi 5 / SBC (ARM) | llama.cpp (CPU) | Ollama | 1B–3B models. Offline assistants, sensor fusion narration. |
| x86 edge server (no GPU) | llama.cpp | Ollama | CPU-only inference — 7B Q4 hits usable token rates on modern Xeons. |
| x86 edge server (NVIDIA GPU) | vLLM | Ollama | Multi-tenant serving is where vLLM beats everything. |
| Apple Mac (M-series) | MLX | Ollama (Metal) | Unified memory makes Macs surprisingly capable edge dev boxes. |
| iPhone / iPad | LiteRT-LM or ExecuTorch | LiteRT-LM (Edge Gallery) | ExecuTorch wins if your training pipeline is PyTorch. |
| Android phone | LiteRT-LM | LiteRT-LM | Best NPU support via NNAPI / vendor delegates. |
| Qualcomm RB5 / RB3 | ExecuTorch (QNN backend) | llama.cpp | ExecuTorch's QNN backend unlocks the Hexagon NPU. |
| Hailo-8 (PCIe accelerator) | Hailo Model Zoo SDK | n/a | LLM support is limited; primarily a vision accelerator. |
| Coral TPU | TFLite (LiteRT) | n/a | Not a real LLM platform. Vision-only for production purposes. |
Three Mistakes We See Most Often
1. Picking a runtime before picking the model
Teams pick "we'll use llama.cpp" before knowing what model they'll deploy, then end up needing a multimodal model that's only well-supported in ExecuTorch, or a 70B that needs NVFP4 via TensorRT. The runtime should come after model selection. Pick the model your task needs, then pick the runtime that runs it best on your hardware.
2. Ignoring the KV cache for long-context workloads
The KV cache grows linearly with sequence length and can exceed model weight memory for long-context inference. For RAG, agent workflows, or document summarization at the edge, KV cache management often matters more than weight quantization. Use a runtime that supports KV cache quantization (TensorRT-LLM does; llama.cpp partially does) or attention-sink eviction (StreamingLLM-style approaches, now in vLLM).
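The linear growth is easy to quantify. The sketch below computes KV cache size from model shape and context length. The configuration is an assumed Llama-3-8B-style layout with grouped-query attention (32 layers, 8 KV heads, head dimension 128), and the 8-bit figure ignores any per-block scale overhead a real quantized cache would add.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """KV cache size: keys and values (factor 2) for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Assumed example configuration (Llama-3-8B-style GQA), used for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 32_768, 131_072):
    fp16 = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    int8 = kv_cache_bytes(seq_len=seq_len, bytes_per_elem=1, **cfg) / 2**30
    print(f"{seq_len:>7} tokens: FP16 {fp16:5.1f} GiB, 8-bit {int8:5.1f} GiB")
```

With this assumed configuration the cache costs 128 KiB per token at FP16, so it reaches 1 GiB at 8K tokens and 16 GiB at 128K tokens, well past the roughly 4 GB that 8B weights occupy at 4 bits. Halving the cache precision halves those numbers, which is why KV cache quantization can matter more than squeezing the weights further.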
3. Treating "it runs in Ollama on my laptop" as a deployment plan
Ollama is excellent for prototyping. It's not a production runtime for embedded devices. The latency overhead, the always-on HTTP server, the model-management daemon — none of it survives a real industrial deployment. Prototype in Ollama, then port to llama.cpp directly or to a platform-native runtime (TensorRT Edge-LLM on Jetson, ExecuTorch on mobile, MLX on Apple) before you ship.
What We Expect to Change in 2026
- llama.cpp Vulkan backend matures — making it the default cross-vendor GPU path and reducing CUDA's lock-in for community models.
- ExecuTorch picks up share on Jetson — as the PyTorch export workflow tightens, expect ExecuTorch to compete with TensorRT Edge-LLM for non-NVFP4 workloads.
- vLLM lands on smaller hardware — work on a "vLLM Edge" variant for single-device, low-concurrency edge servers is already in flight in the community.
- 1-bit and BitNet runtimes go from research to early production — particularly for ultra-low-power deployments where weight memory dominates power draw.
Frequently Asked Questions
Which edge LLM runtime is best for Jetson in 2026?
For production on Jetson AGX Thor or AGX Orin, NVIDIA's TensorRT Edge-LLM SDK delivers the highest measured throughput. On Thor it's the only runtime that unlocks NVFP4, and it supports EAGLE-3 speculative decoding. On smaller Jetsons (Orin Nano, Orin NX), llama.cpp with CUDA is the pragmatic production choice, with Ollama for prototyping.
Should I use Ollama in production?
No. Ollama is an excellent prototyping and developer-experience tool, but it's a thin wrapper over llama.cpp with an HTTP API and model-management daemon. For embedded production, port to llama.cpp directly or to a platform-native runtime (TensorRT Edge-LLM on Jetson, ExecuTorch on mobile, MLX on Apple Silicon) to avoid the latency overhead and always-on server.
What's the difference between llama.cpp and ExecuTorch?
llama.cpp is C++-first and CPU-first, with optional CUDA/Metal/Vulkan backends, and consumes the portable GGUF format. ExecuTorch is PyTorch-native, with 12+ hardware backends including Apple, Qualcomm, MediaTek, Arm, and Vulkan; it consumes PyTorch's own .pte export format. Pick llama.cpp for maximum hardware portability; pick ExecuTorch when your training pipeline is PyTorch and you want the smallest possible runtime footprint (50KB base) on mobile or microcontrollers.
Why does runtime choice affect throughput by up to 3×?
The same model on the same hardware can move dramatically between runtimes because of: (1) quantization format support (NVFP4 on Blackwell, FP8 in TensorRT, GGUF Q4 elsewhere); (2) speculative decoding, where EAGLE-3 alone delivers up to 2.5× uplift on certain models; and (3) KV cache management, where PagedAttention (vLLM), KV cache quantization (TensorRT, llama.cpp), and attention-sink eviction all reshape throughput on long contexts.
Which runtime is best for shipping LLM features in a mobile app?
For Android, Google LiteRT-LM is the default — best NPU support via NNAPI / vendor delegates, and Google maintains the Edge Gallery reference app. For iOS, both LiteRT-LM and ExecuTorch are viable; ExecuTorch wins if your training pipeline is PyTorch. Don't ship Ollama in a mobile app — the model-management daemon doesn't fit the mobile sandbox model.
Bottom line
Hardware decides the shortlist; the model decides the runtime; KV cache behavior decides whether your deployment scales. Start with hardware and model, then pick the runtime from the pairing matrix — don't reverse the order.