Estimate inference memory for your edge AI workload

Memory planning for edge AI deployments determines whether a model fits on-device without swapping or OOM errors. This tool calculates VRAM and system RAM requirements across quantization levels (FP32, FP16, INT8) and hardware platforms: NVIDIA Jetson (unified memory constraints), Google Coral (on-chip SRAM limits), and Hailo-8 / Hailo-8L (on-chip buffers).


Memory Planning for Edge AI Deployments

Unified memory vs. discrete VRAM

Jetson modules use a unified memory pool shared between the CPU, GPU, and OS; there is no dedicated VRAM. On an 8 GB Orin Nano, the OS and runtime consume roughly 1.5–2 GB before inference begins, and memory planning for edge AI deployments must account for this overhead. Discrete GPU cards (e.g. a desktop RTX) keep VRAM separate from system RAM, but Jetson has no such separation: model weights, activations, and the OS all compete for the same pool.
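The budget arithmetic above can be sketched as a simple check. The 2 GB OS reserve and the container overhead figure below are assumptions for illustration, not measured values; measure your own baseline with jtop or tegrastats.

```python
# Unified-memory budget sketch for a Jetson-class SoC.
# os_reserve_gb (~1.5-2 GB) and container_overhead_gb are assumptions,
# not published specs -- validate on-device before relying on them.

def usable_inference_memory_gb(total_gb: float,
                               os_reserve_gb: float = 2.0,
                               container_overhead_gb: float = 0.0) -> float:
    """Memory left for weights + activations on a unified-memory SoC."""
    return total_gb - os_reserve_gb - container_overhead_gb

# Hypothetical 8 GB Orin Nano running inside a container (~0.1 GB assumed):
budget = usable_inference_memory_gb(8.0, os_reserve_gb=2.0,
                                    container_overhead_gb=0.1)
print(f"{budget:.1f} GB available for inference")  # prints "5.9 GB available for inference"
```

The point of the sketch: the number you size the model against is the residual budget, never the module's headline capacity.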

Quantization and activation memory

INT8 quantization reduces weight storage 4× versus FP32, but activation memory — which scales with input resolution and batch size — is computed in FP16 even in INT8 networks. This means VRAM and RAM sizing for INT8 models at 640×640 or higher resolution still requires careful accounting of activation buffers, not just weight size.
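A rough sizing sketch of the weight-versus-activation split follows. The 25M-parameter detector and the activation multiplier are hypothetical placeholders, and real activation footprints depend on the network architecture and the runtime's buffer reuse; treat this as an order-of-magnitude estimate only.

```python
# Coarse memory split: weights vs. activations.
# Assumptions (illustrative, not vendor figures):
# - weights = parameter count * bytes per weight
# - activations approximated as the FP16 input tensor (batch x 3 x H x W)
#   scaled by a multiplier standing in for intermediate feature maps

def estimate_memory_mb(params_m: float, bytes_per_weight: int,
                       height: int, width: int, batch: int,
                       activation_multiplier: float = 20.0) -> dict:
    weights_mb = params_m * 1e6 * bytes_per_weight / 1e6
    # 2 bytes per element: activations held in FP16 even for INT8 weights
    activations_mb = (batch * 3 * height * width * 2
                      * activation_multiplier) / 1e6
    return {"weights_mb": weights_mb,
            "activations_mb": activations_mb,
            "total_mb": weights_mb + activations_mb}

# Hypothetical 25M-parameter detector, INT8 weights, 640x640, batch 1:
est = estimate_memory_mb(params_m=25, bytes_per_weight=1,
                         height=640, width=640, batch=1)
# weights ~25 MB, activations ~49 MB: activations dominate despite INT8
```

Note how, under these assumptions, activation buffers exceed the quantized weights; this is exactly why INT8 at high resolution still needs careful accounting.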

Related tools

Module Power Calculator — size PSU and thermal budget alongside memory.
Inference Throughput Estimator — estimate FPS and latency once memory fit is confirmed.
Full Deployment Planner — combine memory, power, and throughput into an end-to-end edge AI BOM.

FAQ
What is unified memory on Jetson?

Jetson modules use a unified memory architecture — there is no separate VRAM. CPU processes, the OS, and GPU inference all share the same physical memory pool. This means your 8 GB Orin Nano isn't 8 GB dedicated to inference; the OS alone uses ~1.5–2 GB.

Why does INT8 not reduce memory 4×?

INT8 reduces weight storage 4× vs FP32. But activation memory — the largest component at high resolutions — is computed in FP16 even in INT8 networks. Runtime activation memory reduction is ~2×, not 4×.
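The arithmetic can be made concrete with hypothetical numbers (the 400 MB / 600 MB split is illustrative only):

```python
# Why INT8 gives ~2x total savings at runtime, not 4x.
# Hypothetical model: 400 MB FP32 weights, 600 MB FP32-equivalent
# activations at the chosen resolution (both figures illustrative).
weights_fp32, acts_fp32 = 400.0, 600.0

weights_int8 = weights_fp32 / 4   # weight storage shrinks 4x
acts_int8 = acts_fp32 / 2         # activations run in FP16: only 2x

total_fp32 = weights_fp32 + acts_fp32   # 1000 MB
total_int8 = weights_int8 + acts_int8   # 400 MB
print(total_fp32 / total_int8)          # prints 2.5
```

The more activation-dominated the workload (high resolution, large batch), the closer the overall saving sits to 2× rather than 4×.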

What is TensorRT build workspace?

During engine building with trtexec, TensorRT allocates 1–4 GB of temporary workspace for kernel selection and layer fusion. This is a one-time cost at build time; it does not consume memory during inference.

How accurate are these estimates?

±30% for activations and runtime overhead. Weights are exact (calculated from verified parameter counts). Always validate with jtop or tegrastats on device before finalising memory specifications.
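One way to apply that ±30% band in practice is to require the worst-case estimate to fit, as in this minimal sketch (the margin and the example figures are assumptions, not measurements):

```python
# Fit check under the stated +/-30% uncertainty on activations/overhead.
def fits_with_margin(estimated_gb: float, available_gb: float,
                     margin: float = 0.30) -> bool:
    """True only if the worst-case (+margin) estimate still fits."""
    return estimated_gb * (1 + margin) <= available_gb

print(fits_with_margin(4.0, 5.9))  # True:  4.0 * 1.3 = 5.2 <= 5.9
print(fits_with_margin(5.0, 5.9))  # False: 5.0 * 1.3 = 6.5 >  5.9
```

A configuration that only fits at the nominal estimate, not at +30%, should be treated as a fail until verified on hardware with jtop or tegrastats.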