// VLM Sizing

Which Jetson Runs Which VLM? Sizing Vision-Language Models on Orin & Thor (2026)

Updated June 2026

Vision-language models turn a camera feed into answers — "is the print failing?", "describe the traffic." But VLM-fit on Jetson is a memory question first and a throughput question second. Here's which model lands on which module, with the memory math and real tokens-per-second figures behind it.

4B on Orin Nano

7B on Orin NX

13–20B on AGX Orin

70B+ on Thor

Quick Answer

Match the VLM to the module's unified memory:

Orin Nano 8GB handles VLMs up to ~4B (Qwen2.5-VL-3B, VILA 1.5-3B, Gemma 3 4B).
Orin NX 16GB extends to ~7B comfortably (Qwen2.5-VL-7B, Phi-3.5-Vision).
AGX Orin 64GB runs 4B–20B (LLaVA-13B, gpt-oss-20b) and concurrent pipelines.
Thor 128GB targets 20B–120B and multiple concurrent models.

Throughput is the second filter: even a module that can load the model is slow if the model barely fits. VLMs are best used event-driven (trigger on motion or a question, sample a frame, reason), not as frame-by-frame real-time video on smaller Jetsons.

Memory ceilings assume 4-bit / W4A16 for the larger models and INT8 or GGUF Q4 for the smaller ones. Thor's NVFP4 is what lets 70B-class multimodal fit in 128GB with KV-cache headroom.

Who This Page Is For

Builders adding scene understanding to a camera system — "describe what changed," visual Q&A, anomaly flagging — and choosing the Jetson to run it on.
Robotics and physical-AI teams evaluating VLMs for perception and weighing Orin against Thor.
Engineers who already run object detection and want to layer a VLM on top without buying more hardware than necessary.
Anyone sizing a local multimodal assistant (wildlife monitor, smart camera, visual agent) on a fixed power and memory budget.

// VLMs vs LLMs

Why VLMs Size Differently Than LLMs

A vision-language model is a text LLM with two extra parts bolted on: a vision encoder that turns an image into embeddings, and a projection layer that maps those embeddings into the language model's token space. The consequence for sizing is that a 7B VLM costs more memory than a 7B text LLM. The vision tower adds its own weights, and — more importantly — each image is expanded into hundreds or thousands of image tokens that flow through the KV cache exactly like text tokens. High-resolution inputs and multi-frame video multiply that cost.

This is why VLM-fit on Jetson is a memory question first. The parameter count tells you roughly which module is in play; the real footprint — weights plus KV cache plus activations plus the vision encoder plus image-token expansion — tells you whether it actually runs with headroom to spare on a unified-memory device that's also running the OS and possibly a detection pipeline.
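To make the image-token cost concrete, here is a minimal sketch that estimates how many tokens one image adds and what those tokens cost in KV cache. The patch size, merge factor, layer count, KV-head count, and head dimension are illustrative assumptions, not values taken from any specific model on this page; substitute the numbers from your model's config.

```python
def image_tokens(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Tokens from a ViT-style encoder that merges merge x merge patches into one token."""
    cell = patch * merge  # pixels covered by one output token along each axis
    return (width // cell) * (height // cell)


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: one key vector and one value vector per token, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed (hypothetical) decoder config: 28 layers, 4 KV heads, head_dim 128, FP16 cache.
tokens = image_tokens(1120, 560)
cache = kv_cache_bytes(tokens, layers=28, kv_heads=4, head_dim=128)
print(f"{tokens} image tokens, {cache / 2**20:.2f} MiB of KV cache per image")
```

Under these assumptions a single 1120×560 frame costs 800 tokens and 43.75 MiB of cache. Feeding several frames of video multiplies that figure by the number of frames kept in context.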

// Memory Math

The Memory Math

The same back-of-envelope formula used across the Jetson line applies here, with a VLM surcharge:

weight memory (GB) ≈ (params in billions) × (bits per weight) ÷ 8

So a 7B model at 4-bit weights is roughly 3.5 GB just for weights. A 13B at 4-bit is ~6.5 GB. Then:

Add 30–50% for KV cache and activations at your batch size and context length. Longer prompts and more image tokens push this up.
Add the vision encoder. Typically a few hundred MB to ~1 GB depending on the model and input resolution.
Add OS and co-tenant headroom. On a unified-memory Jetson, the model competes with the OS, the display stack, and any detection or recording pipeline sharing the box.

A worked example: Qwen2.5-VL-7B at W4A16 weights is ~3.5 GB, plus ~2 GB KV/activations at modest context, plus the vision tower, plus OS overhead — call it 7–8 GB working set. That fits an Orin NX 16GB or AGX Orin with room; it's tight-to-impractical on an 8GB Orin Nano, which is why the Nano's sweet spot is the 3–4B class.
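The same arithmetic can be sketched as a small calculator. The weight formula and the 30–50% KV/activation surcharge come from the text above; the 1 GB vision-encoder figure is the upper end of the range quoted, and the 1.5 GB OS allowance is an assumption chosen for illustration, so measure your own system before relying on it.

```python
def weight_gb(params_b: float, bits: int) -> float:
    """Weight memory in GB: params (billions) x bits per weight / 8."""
    return params_b * bits / 8


def working_set_gb(params_b: float, bits: int, kv_fraction: float,
                   vision_gb: float = 1.0, os_gb: float = 1.5) -> float:
    """Weights + KV/activations (as a fraction of weights) + vision encoder + OS.

    vision_gb and os_gb are assumed placeholders, not measured values.
    """
    weights = weight_gb(params_b, bits)
    return weights + weights * kv_fraction + vision_gb + os_gb


# 7B model at 4-bit, bracketing the 30-50% KV/activation rule of thumb.
low = working_set_gb(7, 4, kv_fraction=0.3)
high = working_set_gb(7, 4, kv_fraction=0.5)
print(f"7B @ 4-bit working set: {low:.2f} to {high:.2f} GB")
```

With these inputs the estimate lands between roughly 7 and 8 GB, which is why a 7B VLM belongs on a 16GB module rather than an 8GB one. Long contexts or many image tokens push the KV term past the rule-of-thumb range.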

// Fit Table

VLM-to-Jetson Fit Table

Practical model classes per module, based on NVIDIA's published guidance and community deployments as of mid-2026. "Comfortable" means it fits with headroom for a real workload; treat the edges as test-before-you-commit.

Module	Unified memory	Comfortable VLM class	Example models
Orin Nano 8GB (Super)	8 GB	Up to ~4B	Qwen2.5-VL-3B, VILA 1.5-3B, Gemma 3 4B, Cosmos Reason 2B
Orin NX 16GB	16 GB	Up to ~7B	Qwen2.5-VL-7B, Phi-3.5-Vision, Cosmos Reason 8B (quantized)
AGX Orin 64GB	64 GB	4B–20B	LLaVA-13B, Qwen2.5-VL-7B, Phi-3.5-Vision, gpt-oss-20b
AGX Thor 128GB	128 GB	20B–120B + concurrent	Llama 3.2 Vision 90B, Nemotron Nano Omni, multiple models at once

Memory ceilings assume 4-bit / W4A16 weights for the larger models and INT8 or GGUF Q4 for the smaller ones. Thor adds native NVFP4, which is what lets 70B-class multimodal models fit in 128GB with KV-cache headroom. Quantization format determines the ceiling as much as the module does.
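If you want the fit table as a lookup in a provisioning script, a sketch like the following works. The ceilings are copied from the "Comfortable VLM class" column and inherit its quantization assumptions; the function name and data layout are illustrative.

```python
from typing import Optional

# (module, unified memory in GB, comfortable ceiling in billions of parameters),
# ordered smallest to largest. Ceilings assume the quantization noted in the text.
FIT_TABLE = [
    ("Orin Nano 8GB", 8, 4),
    ("Orin NX 16GB", 16, 7),
    ("AGX Orin 64GB", 64, 20),
    ("AGX Thor 128GB", 128, 120),
]


def smallest_module_for(params_b: float) -> Optional[str]:
    """Return the smallest module whose comfortable ceiling covers the model."""
    for name, _memory_gb, ceiling_b in FIT_TABLE:
        if params_b <= ceiling_b:
            return name
    return None  # larger than anything the table lists


for size in (3, 7, 13, 70):
    print(f"{size}B -> {smallest_module_for(size)}")
```

This only answers "what is the smallest module that holds one model". Co-tenant workloads such as detection or speech should push you up a tier.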

// Throughput

Real Throughput Numbers

Fit gets the model loaded; throughput decides whether it's usable. A few measured reference points from NVIDIA demos and community testing, as of mid-2026:

Gemma 3 4B on Orin Nano Super: around 15 tokens/sec for text generation via the Live VLM tooling. For live vision, each frame takes several seconds to process, which is practical for continuous-but-slow monitoring but not for real-time video.
Qwen2.5-VL-3B on AGX Orin 64GB: community testing via vLLM with W4A16 reports roughly 30 tokens/sec in real configurations, well below idealized single-stream benchmark claims. Keep context length and image resolution in check to hold throughput up.
gpt-oss-20b on AGX Orin via vLLM: about 40 tokens/sec generation through Open WebUI, which shows that a 20B-class model is genuinely interactive on AGX Orin.

The takeaway: a model "running" and a model "running fast enough for your interaction" are different bars. Smaller Jetsons running VLMs near their memory ceiling slow down sharply. Design around event-driven invocation — sample a frame on a trigger, reason, return — rather than streaming every frame through the model.
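One way to express the trigger-then-reason pattern is a small gate that sits between the camera and the model. This is a sketch under simple assumptions: frames are flat sequences of grayscale pixel values, motion is a mean absolute difference above a threshold, and a cooldown keeps the VLM from being called faster than it can answer. The class name, threshold, and cooldown values are all hypothetical.

```python
import time
from typing import Callable, Optional, Sequence


def frame_delta(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


class EventGate:
    """Let a frame through only on motion, and at most once per cooldown window."""

    def __init__(self, threshold: float = 10.0, cooldown_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._prev: Optional[Sequence[int]] = None
        self._last_fire = float("-inf")

    def should_invoke(self, frame: Sequence[int]) -> bool:
        prev, self._prev = self._prev, frame
        if prev is None:
            return False  # first frame: nothing to compare against yet
        moved = frame_delta(prev, frame) >= self.threshold
        ready = self.clock() - self._last_fire >= self.cooldown_s
        if moved and ready:
            self._last_fire = self.clock()
            return True
        return False


# Synthetic demo: two still frames, one changed frame, then back to still.
still, changed = [0] * 16, [50] * 16
gate = EventGate()
for frame in (still, still, changed, still):
    if gate.should_invoke(frame):
        print("trigger: send this frame to the VLM")  # call your runtime here
```

The demo prints once: the changed frame fires the gate, and the return to stillness falls inside the cooldown. In a real pipeline a detector or a user question would usually serve as the trigger instead of raw pixel differences.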

// Module-by-Module

Module-by-Module Guidance

Orin Nano 8GB (Super) — the entry point

The cheapest way into edge VLMs. It reliably runs 3–4B vision models (Qwen2.5-VL-3B, VILA 1.5-3B, Gemma 3 4B, Cosmos Reason 2B), which are enough for basic monitoring, simple visual queries ("is there a person at the door?"), and wildlife or process snapshots. JetPack 6.2's Super Mode delivers up to a 2× inference boost over the original Orin Nano. Treat live-vision frame rates as seconds-per-frame, and invoke the model on event triggers rather than on every frame.

Orin NX 16GB — the value sweet spot for 7B

Doubling memory to 16GB lifts the practical ceiling to the 7B class, which is where VLM quality takes a real step up — Qwen2.5-VL-7B and Phi-3.5-Vision handle document, chart, and multi-object reasoning that 3B models fumble. For a single-model visual assistant that needs to be genuinely useful rather than just present, Orin NX is the price/performance pick.

AGX Orin 64GB — multi-model and concurrency

64GB lets you run a 7–13B VLM alongside a perception stack, or stand up vLLM for a few concurrent users. This is the platform for a visual AI agent that combines detection, a VLM, and speech without constantly hitting memory limits. A 20B-class model like gpt-oss-20b runs interactively here. It's the production workhorse for edge multimodal up to ~20B.

AGX Thor 128GB — foundation-model and VLA

Thor is for the workloads Orin can't hold: 20B–120B VLMs, multiple concurrent models with hard isolation, and Vision-Language-Action robotics models. Its native NVFP4 is what makes 70B-class multimodal models fit in 128GB with KV-cache headroom. If you're building a humanoid, a multi-agent perception system, or running a 70B vision model on-device, Thor is the answer — and the only Jetson that is. For everything smaller, it's overkill at 40–130W and $3,499.

// Runtimes

Runtimes: Ollama, vLLM, TensorRT

The same model can be fast or painfully slow depending on the runtime. Three you'll actually use on Jetson:

Ollama: the fast path for experimentation. One-command install with CUDA support on Jetson, easy pulls (ollama pull gemma3:4b), and clean pairing with the Live VLM WebUI for a webcam sandbox. Best for prototypes and single-user use. Watch out for silent CPU fallback on unified memory when models are loaded in the wrong order; see the unified-memory traps guide.
vLLM: server-class throughput via PagedAttention, the choice for multi-user or batch serving on AGX Orin and Thor. Higher ceiling, but harder to stand up on Jetson and impractical on the 8GB Orin Nano.
TensorRT (Edge-LLM): NVIDIA's production runtime for predictable latency under load on Jetson, with quantization down to FP8/FP4 on Thor. The path for a deployed robot or industrial system where jitter matters.
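For the Ollama path, a single-frame query can be sketched with only the standard library. It assumes an Ollama server on its default local port with a vision-capable model already pulled. The request fields (model, prompt, images, stream) and the eval_count / eval_duration response fields follow Ollama's /api/generate API; the helper names and the image path are placeholders.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Payload for one non-streaming generation with a single attached image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (reported in nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def ask(model: str, prompt: str, image_path: str) -> dict:
    """Send one frame and one question to the local server; return the parsed JSON."""
    with open(image_path, "rb") as f:
        payload = build_request(model, prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling ask("gemma3:4b", "Is there a person at the door?", "frame.jpg") against a running server returns a dict whose "response" key holds the answer. Passing that dict to tokens_per_second gives a figure you can compare with the throughput numbers quoted earlier, measured on your own module and image resolution.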

For a deeper runtime breakdown, see the Edge LLM Runtime Stack (2026).

// Decision Framework

Decision Framework

Choose Orin Nano 8GB if

Your VLM task is basic monitoring or simple visual Q&A on a 2–4B model
Event-driven (trigger-then-reason) interaction is acceptable; you don't need real-time video
Cost and power are the dominant constraints

Choose Orin NX 16GB if

You want genuinely useful 7B-class reasoning (documents, charts, multi-object scenes)
It's a single-model visual assistant, not a concurrent multi-agent system
You want the best quality-per-dollar without jumping to AGX

Choose AGX Orin 64GB if

You're running a VLM alongside detection, speech, or other models
You need a few concurrent users via vLLM
Your largest model is in the 13–20B range

Choose AGX Thor 128GB if

You need 20B–120B VLMs or multiple concurrent foundation models
You're deploying Vision-Language-Action robotics models
You have the 40–130W power budget and active cooling Thor requires

// FAQ

Frequently Asked Questions

What is the smallest Jetson that can run a useful VLM?

The Jetson Orin Nano 8GB (especially the Super variant) is the entry point. It comfortably runs VLMs up to roughly 4B parameters such as Qwen2.5-VL-3B, VILA 1.5-3B, or Gemma 3 4B. Expect around 15 tokens per second on Gemma 3 4B for text, and frame analysis on the order of several seconds per frame for live vision — usable for event-driven monitoring, not real-time video understanding.

How much memory does a VLM need on Jetson?

Estimate weight memory as parameters in billions times bits-per-weight divided by 8, in gigabytes. A 7B model at 4-bit needs about 3.5 GB for weights, then budget another 30–50% for the KV cache and activations, plus extra headroom for the vision encoder and image tokens, which VLMs add on top of a same-size text LLM. On unified-memory Jetsons that pool is shared with the OS and any other workloads, so leave margin.

Can a Jetson VLM do real-time video?

Not in the frame-by-frame sense at full rate on smaller Jetsons. Even a capable Orin Nano running Gemma 3 4B processes a frame in several seconds, so VLMs are best used event-driven — trigger on motion or a question, sample a frame or short clip, and reason about it. For higher frame rates, multiple streams, or concurrent agents, AGX Orin or Thor provide the memory and compute headroom to scale.

Do I need Thor to run vision-language models?

No. Most practical edge VLM work (scene description, visual Q&A, monitoring agents up to about 7B parameters) runs on Orin Nano, Orin NX, or AGX Orin. Thor, with its 128GB unified memory and native NVFP4, is for the largest workloads: 20B to 120B-class VLMs, multiple concurrent models, or Vision-Language-Action robotics models. If those aren't on your roadmap, Orin-class hardware is the cheaper answer.

Ollama or vLLM for VLMs on Jetson?

Ollama is the fast path for experimentation and single-user use: one command, CUDA support on Jetson, and easy model pulls like gemma3:4b. vLLM delivers far higher throughput under concurrent load via PagedAttention and is the choice for multi-user or batch serving, but it is harder to stand up on Jetson and impractical on the 8GB Orin Nano. Prototype on Ollama, then move to vLLM on AGX Orin or Thor when you need concurrency.

What is the difference between a VLM and just running object detection?

An object detector outputs bounding boxes for a fixed set of classes. A VLM reasons about an entire scene in natural language: it can answer open-ended questions like "is the 3D print failing?" or "describe the traffic pattern" without being trained on those specific classes. VLMs are slower and heavier per inference, so a common pattern is a fast detector handling the real-time stream and a VLM invoked on demand for richer understanding.

Bottom line

VLM selection on Jetson is memory-first: the module's unified memory sets the parameter ceiling, and the vision encoder plus image-token expansion make a VLM heavier than a same-size text model. Orin Nano covers 3–4B monitoring tasks, Orin NX hits the 7B quality sweet spot, AGX Orin runs 13–20B and concurrency, and Thor is the only option for 70B-class and VLA workloads. Throughput is the second gate — design for event-driven invocation rather than streaming every frame, and prototype on Ollama before scaling to vLLM or TensorRT.

Size your VLM deployment

Use the GPU Sizing Tool to confirm weights, KV cache, and vision-encoder footprint fit on your target Jetson, and the System Designer for end-to-end compute, memory, and power planning.
