Running vLLM on Jetson Orin: The Unified-Memory Traps Nobody Warns You About
Updated June 2026
vLLM is the throughput king for concurrent LLM serving, and it fights you at every step on Jetson: x86-only wheels, 15 GB+ containers, and a silent CPU fallback that cuts throughput by roughly 100×. This is the field guide for getting it running on Orin, and for knowing when not to bother.
Quick Answer
Don't pip install vllm on Jetson — the wheel is x86-only. Use NVIDIA's JetPack-matched container, run vLLM on AGX Orin or Thor (not the 8GB Orin Nano), load your largest model first, and verify GPU utilization so you don't get silently dropped to CPU.
vLLM's advantage is PagedAttention throughput under concurrent load; if you're a single user prototyping, Ollama is the saner choice. The three traps that burn people: the aarch64 packaging gap, out-of-memory at launch on unified memory, and a silent CPU fallback that collapses throughput by ~100× with no error message.
Who This Page Is For
- Engineers serving an LLM or VLM to multiple users from a Jetson and hitting vLLM's packaging and memory walls.
- Anyone who ran pip install vllm on Orin and watched it fail and wants the supported path.
- Teams debugging a mysterious throughput collapse that turns out to be silent CPU fallback on unified memory.
- Architects deciding between Ollama, vLLM, and TensorRT for an edge serving deployment and weighing the operational cost.
Why vLLM At All
vLLM exists for one reason: throughput under concurrent load. Its core trick is PagedAttention. A traditional inference engine reserves one contiguous block of memory for each request's KV cache, sized for the longest sequence that request might reach, like reserving an entire hotel floor for one guest. Most of it sits empty, and you can't fit many concurrent guests. PagedAttention borrows from operating-system virtual memory: the KV cache is broken into small fixed-size pages that are allocated on demand and can sit anywhere in memory. The result is near-zero waste, more concurrent sessions, and longer conversations without out-of-memory errors, which is why vLLM delivers dramatically higher throughput than naive serving when multiple users hit the endpoint at once.
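To make the hotel analogy concrete, here is a toy accounting sketch in Python. It is not vLLM's allocator: the 16-token page size, the 4096-token reservation, and the request lengths are all illustrative assumptions, chosen only to show where the waste comes from.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV page (illustrative assumption)


def contiguous_slots(request_lengths, max_len):
    """Token slots reserved when every request gets a max_len-sized region."""
    return len(request_lengths) * max_len


def paged_slots(request_lengths, block_tokens=BLOCK_TOKENS):
    """Token slots reserved when each request holds only the pages it touched."""
    return sum(math.ceil(n / block_tokens) * block_tokens for n in request_lengths)


def waste_fraction(reserved, used):
    """Share of reserved slots that hold no tokens."""
    return (reserved - used) / reserved


lengths = [37, 512, 90, 1200]  # tokens actually held per request (made up)
max_len = 4096                 # per-request reservation in the contiguous scheme
used = sum(lengths)

reserved_contiguous = contiguous_slots(lengths, max_len)  # 16384 slots
reserved_paged = paged_slots(lengths)                     # 1856 slots

print(f"contiguous waste: {waste_fraction(reserved_contiguous, used):.1%}")
print(f"paged waste:      {waste_fraction(reserved_paged, used):.1%}")
```

With these numbers the contiguous scheme leaves most of its reservation empty, while the paged scheme wastes only the unused tail of each request's last page. Memory that is not wasted is memory that can hold another concurrent session.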
That payoff is real on an edge server: NVIDIA has demonstrated vLLM serving multi-user LLM workloads on Thor. But the benefit is concentrated under concurrency. For a single user asking one question at a time, the difference versus a simpler runtime is small, and the operational cost of vLLM on Jetson is large. Know which regime you're in before you commit.
Trap 1: The aarch64 Packaging Gap
The first wall everyone hits: pip install vllm fails on Jetson. The published wheel is compiled for x86_64 only, and Jetson is aarch64 (ARM64). pip can't find a matching wheel, so it errors out or tries to build from source. This is not a configuration mistake on your end — it's the same ecosystem gap that shows up with PyTorch and JetPack: ARM64 edge deployment is still a second-class citizen in the ML infrastructure world.
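A pre-flight check can catch this before pip spends time failing. The sketch below uses Python's standard platform module. The install_advice helper and its return strings are hypothetical names for illustration, not part of vLLM or JetPack.

```python
import platform


def install_advice(machine: str) -> str:
    """Map a CPU architecture string to the install route this guide recommends."""
    machine = machine.lower()
    if machine in ("aarch64", "arm64"):
        # Jetson reports aarch64: skip pip and use a JetPack-matched container.
        return "container"
    if machine in ("x86_64", "amd64"):
        # The published wheel targets this architecture.
        return "pip"
    return "unknown"


print(platform.machine(), "->", install_advice(platform.machine()))
```

Run it on the target device itself, not on your x86 workstation.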
Your two real options:
- Prebuilt JetPack-matched container (recommended). NVIDIA publishes vLLM images for Jetson via NGC and the jetson-containers project. They work, but they're 15 GB or more and are pinned to specific JetPack versions — pull the wrong tag for your JetPack release and it won't run.
- Build from source on aarch64 (last resort). Possible, but it requires resolving CUDA kernel compilation for your exact JetPack version. It's a multi-hour process with no guaranteed success. Budget the afternoon and have a fallback.
The honest framing: vLLM is not a first-class, pip-installable deployment target on Jetson the way it is on a datacenter GPU. Plan for containers and version-matching from the start.
Trap 2: Out-of-Memory on Unified Memory
Jetson's unified memory is shared between CPU, GPU, OS, and every workload on the box. vLLM was designed for datacenter GPUs with dedicated VRAM: its gpu_memory_utilization setting defaults to 0.9, so out of the box it tries to claim most of the memory for weights and KV cache, and on a shared pool that can end in a CUDA out-of-memory crash at launch. The fixes are launch-time discipline:
- Lower the GPU memory utilization target so vLLM leaves room for the OS and co-tenants.
- Reduce max model length and context — the KV cache scales with context, and on the edge you rarely need a huge window.
- Drop the OS page cache before launch to free memory the kernel is holding for file caching.
- Quantize aggressively — FP8 or W4A16 weights cut the footprint substantially. On the smaller Orin modules this is the difference between fits and doesn't.
- For VLM workloads, cut max tokens and raise the frame interval so the model isn't constantly re-running and ballooning activations.
The mental model: on unified memory, fitting often means trading speed for stability. A model that loads and stays loaded at modest throughput beats one that OOMs at launch.
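The "KV cache scales with context" point is easy to check with arithmetic. Below is a rough worst-case estimate in Python that assumes every sequence reaches the full context length. The model shape (28 layers, 8 KV heads, head dimension 128, 16-bit cache), the 2.5 GiB of weights, and the 3 GiB reserved for the OS and co-tenants are hypothetical values for illustration, not measurements of any particular model or module.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """One key vector and one value vector per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(max_model_len, max_seqs, **shape):
    """Worst case: every sequence grows to the full context length."""
    return kv_bytes_per_token(**shape) * max_model_len * max_seqs / GIB


def fits(total_gib, reserved_gib, weights_gib, kv_gib):
    """True if weights plus KV cache fit in what the OS and co-tenants leave free."""
    return weights_gib + kv_gib <= total_gib - reserved_gib


shape = dict(num_layers=28, num_kv_heads=8, head_dim=128)  # hypothetical model
full = kv_cache_gib(4096, 8, **shape)  # 3.5 GiB for 8 sequences at 4096 tokens
half = kv_cache_gib(2048, 8, **shape)  # 1.75 GiB at half the context

# An 8 GiB module with 3 GiB held back and 2.5 GiB of quantized weights:
print(fits(8, 3, 2.5, full))  # False
print(fits(8, 3, 2.5, half))  # True
```

Halving the context halves the cache, which in this made-up case is the whole difference between fitting and not fitting. Treat the result as a planning number only: activations and runtime overhead are not counted.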
Trap 3: The Silent CPU Fallback
This is the one that costs people days, because nothing tells you it happened. On unified memory, the order you load models in matters. A documented example: a 1B model loads and runs cleanly, with fast output, near-full GPU utilization, and stable behavior. Then a 3B model is loaded while the 1B is still resident. There isn't enough GPU headroom left, so the runtime silently falls back to CPU inference. No error. No warning. Throughput craters from roughly 15 tokens/sec to about 0.3 (about 50× on those two figures, the order-of-magnitude collapse this guide rounds to ~100×), and the only symptom is that it "feels slow."
Three defenses:
- Load the largest model first. Claim the GPU headroom before smaller models nibble it away.
- Verify GPU utilization, don't assume it. Watch tegrastats or your GPU monitor. If a model is "running" but GPU utilization is near zero, you're on CPU.
- Treat sudden slowdowns as fallback until proven otherwise. A 100× slowdown is never a model getting "tired"; it's almost always a silent device switch.
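The "verify, don't assume" defense can be automated. The Python sketch below assumes tegrastats reports GPU load as a GR3D_FREQ field followed by a percentage; check that against the output on your own JetPack release. The sample lines are abbreviated and made up, and the 5% idle threshold is an arbitrary assumption.

```python
import re

# Assumed tegrastats field: "GR3D_FREQ <percent>%"
GR3D_PATTERN = re.compile(r"GR3D_FREQ (\d+)%")


def gpu_util_percent(line):
    """Extract the GPU load percentage from one tegrastats line, or None."""
    match = GR3D_PATTERN.search(line)
    return int(match.group(1)) if match else None


def looks_like_cpu_fallback(lines, idle_below=5):
    """True if the GPU stayed near idle across all samples taken during generation."""
    samples = [u for u in (gpu_util_percent(line) for line in lines) if u is not None]
    if not samples:
        raise ValueError("no GR3D_FREQ samples found")
    return max(samples) < idle_below


# Abbreviated, illustrative lines (real output carries many more fields).
healthy = ["RAM 21034/62841MB CPU [35%@2201] GR3D_FREQ 98%",
           "RAM 21040/62841MB CPU [33%@2201] GR3D_FREQ 95%"]
suspect = ["RAM 30112/62841MB CPU [99%@2201] GR3D_FREQ 0%",
           "RAM 30115/62841MB CPU [98%@2201] GR3D_FREQ 1%"]

print(looks_like_cpu_fallback(healthy))  # False
print(looks_like_cpu_fallback(suspect))  # True
```

Only feed it samples captured while a request is actually generating tokens. An idle GPU on an idle server proves nothing.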
This is a real operational risk for unattended edge deployments: a fleet device can quietly degrade to CPU performance after a restart loads models in a different order, and your monitoring won't flag an error because there isn't one.
The Supported Path
Putting it together, the path that actually works on JetPack 6:
- Confirm your JetPack version and pick the matching NVIDIA vLLM container tag (e.g. a tegra aarch64 build pinned to your JetPack/CUDA release). Don't pull 15 GB on a guess.
- Run on AGX Orin or Thor for anything beyond a toy. The 8 GB Orin Nano lacks the headroom that gives vLLM its concurrency advantage.
- Quantize to FP8 or W4A16 and serve with a conservative GPU-memory-utilization fraction and a bounded max model length.
- Load largest-first and verify the GPU is doing the work before you put it in front of users.
- Benchmark under realistic concurrency — vLLM's value only shows up when multiple sessions or batch jobs are live, so test that, not a single prompt.
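For the last step, a small harness that sweeps the concurrency level is enough to see whether throughput scales. The sketch below is a minimal Python version with a stand-in sender, so it runs without a server. Both run_benchmark and fake_send are hypothetical helpers, and the fake scales perfectly, which a real device will not.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(send, prompts, concurrency):
    """Send prompts with `concurrency` workers; report total tokens and tokens/sec.

    `send(prompt)` must return the number of tokens generated for that prompt.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return {"tokens": total_tokens, "tokens_per_sec": total_tokens / elapsed}


def fake_send(prompt):
    # Stand-in for a real request: pretend each call takes 50 ms and yields 32 tokens.
    time.sleep(0.05)
    return 32


prompts = ["hello"] * 8
for level in (1, 4, 8):
    result = run_benchmark(fake_send, prompts, level)
    print(f"concurrency={level}: {result['tokens_per_sec']:.0f} tokens/sec")
```

To measure a real deployment, replace fake_send with a function that posts to your serving endpoint and returns the completion token count from the response. If aggregate tokens/sec barely moves between concurrency 1 and 8, you are not getting what you paid the setup cost for.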
For the broader runtime trade-offs and where vLLM sits next to TensorRT Edge-LLM, llama.cpp, and others, see the Edge LLM Runtime Stack (2026).
When to Use Ollama Instead
For a large fraction of edge deployments, vLLM is the wrong tool and Ollama is the right one. Ollama installs on Jetson with a single command, has CUDA support, pulls models trivially (ollama pull gemma3:4b), and is a convenience wrapper over llama.cpp with an HTTP API. For prototypes, single-user assistants, and any workload that isn't fielding concurrent requests, it gets you running in minutes instead of an afternoon of container-wrangling.
Ollama isn't immune to the unified-memory traps — it has the same silent CPU fallback behavior on bad load order — but it sidesteps the entire packaging problem and the OOM-at-launch tuning dance. The rule: prototype on Ollama, graduate to vLLM only when you need concurrent throughput on AGX Orin or Thor, and reach for TensorRT Edge-LLM when you need predictable latency under load on a deployed device.
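For a sense of how little code the Ollama route needs, here is a minimal Python sketch that builds a request for Ollama's HTTP generate endpoint using only the standard library. It assumes a local server on Ollama's default port 11434 and that the gemma3:4b model from the pull command above is already present. build_generate_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_generate_request(prompt, model="gemma3:4b", host="http://localhost:11434"):
    """Build (but do not send) a non-streaming request to Ollama's generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("Why is the sky blue?")
print(req.get_method(), req.full_url)

# Sending it requires a running Ollama server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

The commented-out lines are what you would uncomment on the device once the server is up.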
Decision Framework
Use vLLM if
- You're serving multiple concurrent users or batch jobs from one endpoint
- You're on AGX Orin 64GB or Thor with memory headroom for the KV cache
- You can run JetPack-matched containers and verify GPU utilization
Use Ollama if
- You're single-user, prototyping, or fielding one request at a time
- You're on an Orin Nano 8GB where vLLM is impractical
- You want to be running in minutes, not an afternoon
Use TensorRT Edge-LLM if
- You need predictable, low-jitter latency on a deployed robot or industrial device
- You're on Thor and want native FP8/FP4 with speculative decoding
- The deployment is production, not exploration
Stop and re-check if
- Throughput suddenly dropped ~100×: assume silent CPU fallback, fix load order
- pip install vllm failed: that's expected on aarch64; switch to a container
- You OOM at launch: lower GPU-mem fraction, context, and quantize harder
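The first three branches of the framework reduce to a few lines of Python. This is a deliberately crude sketch: pick_runtime is a hypothetical helper, and the 32 GB floor for vLLM is an assumption standing in for "AGX Orin class or larger", so tune it to your own models.

```python
def pick_runtime(concurrent_requests, memory_gb, needs_predictable_latency=False,
                 min_vllm_memory_gb=32):
    """Suggest a runtime using the decision framework's three main branches."""
    if needs_predictable_latency:
        # Deployed robot or industrial device with tight latency requirements.
        return "TensorRT Edge-LLM"
    if concurrent_requests > 1 and memory_gb >= min_vllm_memory_gb:
        # Concurrent load plus headroom for the KV cache.
        return "vLLM"
    # Single user, prototyping, or a small module.
    return "Ollama"


print(pick_runtime(concurrent_requests=1, memory_gb=8))    # Ollama
print(pick_runtime(concurrent_requests=12, memory_gb=64))  # vLLM
print(pick_runtime(concurrent_requests=12, memory_gb=8))   # Ollama
print(pick_runtime(4, 128, needs_predictable_latency=True))  # TensorRT Edge-LLM
```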
Frequently Asked Questions
Why does pip install vllm fail on Jetson?
The standard vLLM wheel on PyPI is compiled for x86_64 only. Jetson is aarch64 (ARM64), so pip cannot find a matching wheel and either fails or tries to build from source. This is the same ecosystem gap that affects PyTorch and JetPack: ARM64 edge deployment still lags x86 in ML tooling. The supported path is NVIDIA's JetPack-specific NGC or jetson-containers vLLM images, not pip.
Can I run vLLM on a Jetson Orin Nano 8GB?
It is not a well-supported target. Building vLLM from source on aarch64 is a multi-hour process with no guaranteed success, and the prebuilt containers are 15GB or more and target specific JetPack versions. On 8GB unified memory, even when it runs, there is little headroom for the KV cache that gives vLLM its advantage. For the Orin Nano, Ollama or llama.cpp are the practical runtimes; reserve vLLM for AGX Orin or Thor.
What is the silent CPU fallback problem on Jetson?
On unified-memory Jetsons, if you load a second model after one is already resident and there is not enough GPU headroom left, the runtime can silently fall back to CPU inference, with no error and no warning. Throughput can collapse by up to around 100×; in the example cited here it fell from roughly 15 tokens per second to about 0.3, a drop of about 50×. The lesson is that model load order matters on unified memory: load the largest model first, and verify GPU utilization rather than trusting that it is being used.
Why is vLLM worth the trouble over Ollama?
vLLM implements PagedAttention, which treats the KV cache like virtual-memory pages allocated on demand instead of one big pre-reserved block. That eliminates most cache waste and delivers far higher throughput under concurrent load — commonly cited as 10× or more versus naive serving. Under a single-user prototype the difference is small; the payoff appears when multiple sessions or batch jobs hit the same endpoint. For single-user edge use, Ollama's convenience usually wins.
How do I avoid vLLM CUDA out-of-memory errors on Jetson?
Lower the GPU memory utilization target, reduce max model length and context, drop the OS page cache before launch, and prefer aggressive quantization (FP8 or W4A16). On the smaller Orin modules, cut max tokens and increase any frame-processing interval for VLM workloads so the model is not constantly re-running. Fitting in memory often means trading speed for stability.
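Those knobs map onto vLLM server options. The Python sketch below assembles a conservative launch command so the settings live in one reviewable place. The flag names are vLLM's own; the default values (0.6, 4096, 8) are illustrative starting points rather than recommendations, the model name is a placeholder, and vllm_serve_argv is a hypothetical helper.

```python
import shlex


def vllm_serve_argv(model, gpu_memory_utilization=0.6, max_model_len=4096,
                    max_num_seqs=8, quantization=None):
    """Build a `vllm serve` command line with conservative memory settings."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    argv = [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),  # leave room for OS and co-tenants
        "--max-model-len", str(max_model_len),                    # bound context, and so KV cache
        "--max-num-seqs", str(max_num_seqs),                      # cap concurrent sequences
    ]
    if quantization:
        argv += ["--quantization", quantization]
    return argv


# Dropping the page cache is a separate root-level step on Linux, run before launch:
#   sync; echo 3 > /proc/sys/vm/drop_caches
argv = vllm_serve_argv("your-org/your-model", quantization="fp8")  # placeholder model
print(shlex.join(argv))
```

Inside a container, the printed line is what you would pass as the container command. If launch still fails, lower the fraction and the context length before touching anything else.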
Which vLLM container should I use on JetPack 6?
Use the NVIDIA-published JetPack-matched images rather than building from source — for example the jetson-ai-iot vLLM container tagged for your JetPack and CUDA version (such as an r36.x tegra aarch64 build on JetPack 6). The exact tag must match your JetPack release; a mismatch is a common cause of failed launches. Always confirm the container tag against your installed JetPack version before pulling 15 GB.
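Checking the tag before the pull can be scripted. The Python sketch below parses the L4T release string that Jetson images keep in /etc/nv_tegra_release and tests whether a container tag mentions the same release. The sample text and both image tags are made up, and matching on an rNN.N substring is an assumption about how tags are named, so confirm it against the registry you actually use.

```python
import re

RELEASE_PATTERN = re.compile(r"R(\d+) \(release\), REVISION: (\d+)\.(\d+)")


def l4t_version(release_text):
    """Parse (major, minor, patch) from the contents of /etc/nv_tegra_release."""
    match = RELEASE_PATTERN.search(release_text)
    return tuple(int(part) for part in match.groups()) if match else None


def tag_matches(tag, version):
    """True if the tag names the same L4T major.minor (assumed rNN.N naming)."""
    major, minor, _patch = version
    return re.search(rf"r{major}\.{minor}(?!\d)", tag) is not None


# Illustrative contents; on a device use open("/etc/nv_tegra_release").read()
sample = "# R36 (release), REVISION: 4.0, GCID: 0, BOARD: generic, EABI: aarch64"
version = l4t_version(sample)

print(version)
print(tag_matches("example/vllm:r36.4.0", version))  # hypothetical tag
print(tag_matches("example/vllm:r35.4.1", version))  # hypothetical tag
```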
Bottom line
vLLM is the right tool for concurrent LLM serving and the wrong tool for almost everything else on Jetson. The three traps — the x86-only wheel, OOM at launch on shared unified memory, and the silent ~100× CPU fallback on bad load order — are all avoidable once you know they exist: use a JetPack-matched container, run on AGX Orin or Thor, quantize and cap memory, and load largest-first while verifying the GPU is actually doing the work. If you're single-user, save yourself the afternoon and use Ollama. If you're serving a crowd, vLLM's PagedAttention is worth the setup — just respect the unified-memory rules.
Size your Jetson LLM deployment
Use the GPU Sizing Tool to confirm weights + KV cache fit on your target module, and the System Designer for end-to-end compute, memory, and power planning before you fight the runtime.