RAM Sizing for Edge AI: How Much Memory Do You Really Need?
Last updated: March 2026
Choosing the wrong RAM tier can break an edge AI deployment before it ships. This guide shows when 8 GB is enough, when 16 GB is the practical default, and when 32 GB or 64 GB is justified for multi-stream, multi-model, or transformer-heavy pipelines.
Quick Answer
8 GB is enough for light single-model pipelines and up to about 4 streams when models stay compact. 16 GB is the practical default for most real edge AI deployments because it gives headroom for tracking, secondary models, logging, and stream growth. 32 GB is for transformer-heavy or multi-task pipelines. 64 GB is typically reserved for inference servers, onsite evaluation, or R&D nodes. Because Jetson memory is soldered, profile full-load runtime usage before locking the module SKU.
Scope of This Page
This guide explains general RAM sizing principles for edge AI inference systems across different frameworks and platforms. It covers memory planning fundamentals that apply to any inference workload: model memory, pipeline buffering, system overhead, and multi-model strategies.
This page does NOT cover: YOLOv8-specific memory requirements or Jetson platform details beyond general principles. For YOLOv8-specific guidance on Jetson, see YOLOv8 RAM Requirements on Jetson.
Use this page when sizing RAM for any inference framework. For YOLOv8 deployment guidance with concrete memory numbers, see the YOLOv8-specific article.
Planning Takeaway
The most common RAM sizing mistake is budgeting for the model only. In production, OS overhead, frame buffers, tracking state, logging, and secondary models usually consume more memory than expected. For most deployments, RAM headroom matters more than theoretical minimums.
Who This Page Is For
- Choosing between 8 GB, 16 GB, 32 GB, and 64 GB edge AI hardware
- Sizing Jetson or RK3588 memory for multi-camera inference
- Understanding when model size is not the real memory bottleneck
- Planning for tracking, secondary classification, and logging overhead
- Avoiding swap, OOM crashes, and undersized module purchases
RAM Tier Quick Reference (2026)
- 8 GB: 1–4 cameras, single detection model (YOLOv8s or smaller), no large transformers
- 16 GB: 4–8 cameras, detection + tracking + secondary classification, medium models
- 32 GB: Multi-task pipelines, large transformers (SAM, ViT), 8–12 concurrent streams
- 64 GB: Inference servers, onsite model evaluation, R&D workloads
- Always measure: Profile actual RSS + GPU allocation under maximum stream count and full production load
Rule of thumb: OS overhead (3 GB) + model footprints × 3–5x activation multiplier + frame buffers + 30% headroom = minimum RAM tier.
Why this matters: RAM on Jetson and SoM-based platforms is soldered at manufacture—there is no upgrade path. A node deployed with insufficient RAM requires a replacement module or a new unit. This is one of the few hardware decisions that cannot be corrected in the field.
Engineering Summary
- Runtime footprint is not model file size: A 100 MB TensorRT engine can consume 300–500 MB during inference at 1080p due to activation memory. Size from profiled runtime usage, not weight file size.
- Unified memory means GPU and CPU compete for the same pool: On Jetson, every megabyte the inference engine allocates is a megabyte not available to the OS, Docker, and application stack. Monitor both sides under full load.
- Stream count scales memory non-linearly: Frame buffer pools, decoder state, and tracking buffers all grow with stream count. Profile at maximum production stream count, not development-time subsets.
- Swap is not a safety net for real-time inference: Swap events cause latency spikes and frame drops. Size RAM to avoid swap entirely in production; disable zRAM where latency is a hard requirement.
- The full production pipeline consumes more RAM than the prototype: Tracking, alerting, logging, and secondary classification are added after initial validation. Budget for the complete pipeline from day one—not just the detection model.
Quick RAM Budget Formula
Minimum RAM = OS overhead + Σ(model weights × activation multiplier) + frame buffers + application stack + 30% headroom
Example: 8-camera warehouse node — OS + DeepStream: 3.5 GB, YOLOv8m + tracking: 500 MB, re-ID model: 200 MB, frame buffers: 200 MB, application: 400 MB = 4.8 GB base. Add 30% headroom: ~6.3 GB. 8 GB is tight; 16 GB is recommended for secondary processing headroom.
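A minimal sketch of the same arithmetic is below. The component names and figures are taken from the warehouse example above and are illustrative assumptions; replace them with profiled values for your own pipeline.

```python
# Minimal RAM budget sketch for the 8-camera warehouse example above.
# All figures are illustrative assumptions, not measured values.

components_gb = {
    "os_and_deepstream": 3.5,
    "yolov8m_plus_tracking": 0.5,   # weights + activations + tracker state
    "reid_model": 0.2,
    "frame_buffers": 0.2,
    "application_stack": 0.4,
}

HEADROOM = 0.30            # 30% margin above the estimated base footprint
RAM_TIERS_GB = [8, 16, 32, 64]

base_gb = sum(components_gb.values())
required_gb = base_gb * (1 + HEADROOM)
tier_gb = next(t for t in RAM_TIERS_GB if t >= required_gb)

print(f"Base footprint : {base_gb:.1f} GB")
print(f"With headroom  : {required_gb:.1f} GB")
print(f"Smallest tier  : {tier_gb} GB")
```

For this example the smallest tier that clears the 30% headroom is 8 GB, which is why the guide calls 8 GB "tight" and recommends 16 GB once secondary processing and future growth are factored in.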
Recommendation: For most production edge AI nodes, buy the smallest RAM tier that still leaves at least 30% headroom at full stream count and full pipeline load. This avoids paying for unused memory while protecting against swap, OOM crashes, and later feature growth.
Complementary guides: NVMe SSD endurance for Jetson Orin Nano and PoE power budget calculator for complete system sizing.
Why RAM Matters for Inference
RAM is the working memory of the inference pipeline. Every model loaded for inference, every frame buffer holding camera input, every decoded video frame, every intermediate tensor in the inference graph, and the OS and application stack all compete for the same pool of memory. When memory pressure is too high, the OS starts swapping to storage — and on an edge node doing real-time inference, even a brief swap event can cause frame drops, latency spikes, or pipeline stalls.
Unlike servers where you can add DIMM slots, embedded and SoM-based edge AI platforms have fixed RAM soldered at manufacture. Selecting the wrong RAM tier at procurement means a hardware revision to fix it. This decision is worth getting right.
Model Memory Footprint
TensorRT engine files loaded into GPU memory (or shared Jetson unified memory) consume RAM proportional to model size and precision:
- YOLOv8n (INT8, TensorRT): ~25–40 MB
- YOLOv8s (INT8, TensorRT): ~50–80 MB
- YOLOv8m (INT8, TensorRT): ~100–160 MB
- YOLOv8l / YOLOv8x (INT8): 200–400 MB
- Large transformer (ViT-B, FP16): 700 MB – 2 GB
- Segment Anything Model (SAM, FP16): 2–4 GB
These are loaded model sizes. During inference, additional memory is allocated for input tensors, output tensors, and intermediate activation layers. Activation memory scales with batch size and input resolution. A model with 100 MB of weights may allocate 300–500 MB total during inference at 1080p input.
OS and Runtime Overhead
A minimal JetPack (Ubuntu) image consumes approximately 1.5–2.5 GB of RAM at idle; with the inference runtime stack running, typical overheads break down as:
- Kernel and system services: ~400–600 MB
- Docker daemon (if in use): ~200–400 MB
- CUDA runtime and shared libraries: ~300–500 MB
- DeepStream pipeline overhead: ~500 MB – 1.5 GB depending on stream count
- Application-layer processes (logging, networking, alerting): 100–300 MB
Budget a minimum of 3 GB for OS and runtime overhead on any Jetson-based node before counting model or frame buffer memory. On non-Jetson ARM platforms with lighter OS configurations, 1.5 GB is achievable.
Frame Buffers and Stream Count
Each decoded camera stream requires frame buffer memory. A 1080p frame in YUV420 format (common RTSP output) is approximately 3 MB. With decode pipelines maintaining a buffer queue of 4–8 frames per stream:
- 1 camera: ~12–24 MB frame buffer
- 4 cameras: ~50–100 MB frame buffer
- 8 cameras: ~100–200 MB frame buffer
Frame buffers alone are not the limiting factor for RAM. However, if pre-processing (resize, normalize, letterbox) is performed on the CPU before GPU handoff, additional copies may exist in CPU memory simultaneously. Zero-copy pipelines using unified memory (Jetson) eliminate this duplication.
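A back-of-the-envelope calculation of the buffer pool sizes listed above; the queue depth, resolution, and pixel format are assumptions you should adjust to match your decode pipeline.

```python
# Estimate decoded frame buffer memory per stream and per node.
# Assumes 1080p YUV420 (1.5 bytes per pixel) and a 6-frame queue per stream.

def frame_buffer_mb(width=1920, height=1080, bytes_per_pixel=1.5,
                    queue_depth=6, streams=1):
    frame_bytes = width * height * bytes_per_pixel
    return frame_bytes * queue_depth * streams / (1024 ** 2)

for n in (1, 4, 8):
    print(f"{n} camera(s): ~{frame_buffer_mb(streams=n):.0f} MB")
```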
For the full picture of how stream count drives hardware requirements beyond RAM, see the 8-camera reference architecture.
Multi-Model Concurrency
Running multiple models simultaneously multiplies memory requirements:
- Detection + classification pipeline: Primary detector (YOLOv8s, ~80 MB) + secondary classifier (MobileNet, ~15 MB) = ~95 MB model memory. Manageable on 16 GB.
- Detection + tracking + re-ID: Adds DeepSORT or ByteTrack memory overhead (~100–200 MB state buffers) and a re-ID model (ResNet50 variant, ~100–200 MB). Total model + state: 400–600 MB. Still feasible on 16 GB.
- Multi-task with large transformer: Detection + SAM-based segmentation on detected objects. SAM at FP16 alone requires 2–4 GB. This configuration requires 32 GB minimum.
- Parallel independent inference servers: If the node serves multiple inference API endpoints simultaneously (each loading its own model instance), multiply model memory by concurrent instance count. 4 instances of YOLOv8s = ~400 MB; 4 instances of a 500 MB model = 2 GB just for models.
Unified Memory Architecture on Jetson
Jetson's unified memory architecture means CPU and GPU share the same physical DRAM pool. There is no separate GPU VRAM — the 16 GB or 32 GB figure is the total pool used by both CPU and GPU simultaneously. This simplifies zero-copy tensor passing between CPU preprocessing and GPU inference, but it also means GPU memory pressure directly reduces available system RAM.
On discrete GPU systems (x86 + NVIDIA GPU), GPU VRAM is separate from system RAM. A 16 GB system RAM + 8 GB GPU VRAM node effectively has 8 GB for the OS/CPU side and 8 GB for GPU inference, with transfer overhead for any data crossing the PCIe bus. Jetson's unified approach eliminates the bus but means all consumers compete for one pool.
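One practical consequence of the shared pool is that both sides are visible from /proc/meminfo on a Jetson. The sketch below reads the standard Linux fields plus NvMapMemUsed, which is a Jetson-specific field for GPU/nvmap allocations; its presence varies by L4T release, so treat it as an assumption.

```python
# Read system-wide memory figures on a Jetson-class device.
# MemTotal/MemAvailable are standard Linux fields; NvMapMemUsed (GPU/nvmap
# allocations) is Jetson-specific and may be absent on some releases.

def read_meminfo():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key.strip()] = int(value.split()[0])  # most values in kB
    return fields

info = read_meminfo()
total_gb = info["MemTotal"] / 1024 / 1024
avail_gb = info["MemAvailable"] / 1024 / 1024
gpu_mb = info.get("NvMapMemUsed", 0) / 1024

print(f"Total pool      : {total_gb:.1f} GB")
print(f"Available (CPU) : {avail_gb:.1f} GB")
print(f"nvmap (GPU)     : {gpu_mb:.0f} MB")
```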
RAM Tier Comparison
Strategic summary: 8 GB works for compact pipelines, 16 GB is the default buying decision, 32 GB is where complex multi-model workloads become comfortable, and 64 GB is usually excessive unless the node also acts as a local inference server.
On Jetson-class systems, these RAM tiers reflect total shared system memory, not separate CPU RAM plus GPU VRAM.
| RAM Tier | Typical Platform | Max Concurrent Models | Max Streams (Practical) | Large Transformer Support | Best For |
|---|---|---|---|---|---|
| 8 GB | Jetson Orin Nano | 1–2 small models | 2–4 | No | Single-model, 1–4 camera pipelines |
| 16 GB | Jetson Orin NX 16GB | 2–4 medium models | 4–8 | Marginal | Multi-camera detection and tracking |
| 32 GB | Jetson AGX Orin 32GB | 4–8 models | 8–12 | Yes (FP16) | Complex pipelines, multi-task inference |
| 64 GB | Jetson AGX Orin 64GB | 8+ models | 12–16 | Yes (FP32 + FP16) | Inference server, onsite model evaluation, R&D nodes |
Sizing Examples
These examples are meant to show order-of-magnitude sizing logic, not exact platform benchmarks.
Example 1: Retail foot traffic node, 2 cameras
- OS overhead: 2.5 GB
- YOLOv8s detection model: 120 MB (with activation memory)
- DeepSORT tracking state: 50 MB
- Frame buffers (2 cameras): 30 MB
- Logging and application: 200 MB
- Total: ~3.0 GB — 8 GB is comfortable, 16 GB has significant headroom
Example 2: Warehouse safety monitoring, 8 cameras, detection + tracking + zone alerts
- OS and DeepStream overhead: 3.5 GB
- YOLOv8m detection (INT8): 300 MB
- Person re-ID model: 200 MB
- Tracking state (8 streams): 400 MB
- Frame buffers (8 cameras): 200 MB
- Application, logging, alerting: 400 MB
- Total: ~5 GB — 8 GB marginal, 16 GB recommended for headroom
Example 3: Multi-task node, detection + segmentation + re-ID, 4 cameras
- OS overhead: 2.5 GB
- YOLOv8l detection: 400 MB
- SAM segmentation (FP16): 3 GB
- Re-ID model: 200 MB
- Frame buffers and state: 300 MB
- Application: 300 MB
- Total: ~6.7 GB — 8 GB is too tight; 16 GB is minimum; 32 GB preferred
For enclosure and thermal implications of higher-RAM platforms (which often have higher TDP), see fanless mini PC thermal constraints. For the full deployment workflow once hardware is selected, see the Jetson deployment checklist.
Common Pitfalls
- Sizing from model weights only: Model file size (e.g., a 50 MB TensorRT engine) is not the same as runtime memory usage. Activation memory during inference can be 3–5x the weight size depending on input resolution and batch size.
- Not accounting for Docker layer memory: Running inference in Docker containers adds 100–300 MB of container runtime overhead per container instance. Multiple containers multiply this overhead.
- Assuming shared memory is free: On Jetson's unified memory, every byte allocated by the GPU inference engine is a byte not available to the CPU-side application. Monitor both sides of memory usage, not just GPU allocation.
- Forgetting swap configuration: By default, Jetson enables a zRAM swap partition. While useful for burst handling, sustained swapping degrades real-time inference performance significantly. Disable swap or size RAM to avoid it in production; a quick check for active swap devices is sketched after this list.
- Testing with a single model and then adding more: Prototype memory footprints often represent a single inference path. Production pipelines commonly add tracking, alerting, logging, and secondary classification after initial validation. Budget for the full pipeline from day one.
- Not profiling at maximum camera count: Memory usage scales non-linearly with stream count due to decoder buffer pools and pipeline state. Profile at the maximum production stream count, not a development-time subset.
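The swap check referenced above can be as simple as reading /proc/swaps; zRAM devices show up with "zram" in their name. This is a minimal sketch, not a full swap audit.

```python
# Quick check: is any swap device active, and is it zRAM?
# /proc/swaps columns: Filename  Type  Size  Used  Priority (sizes in kB)

with open("/proc/swaps") as f:
    lines = f.read().splitlines()[1:]   # skip the header row

if not lines:
    print("No active swap devices.")
for line in lines:
    name, swap_type, size_kb, used_kb, _priority = line.split()
    kind = "zRAM" if "zram" in name else swap_type
    print(f"{name} ({kind}): {int(used_kb) / 1024:.0f} MB used "
          f"of {int(size_kb) / 1024:.0f} MB")
```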
Decision Checklist
- ☐ Profiled actual runtime memory (tegrastats) at maximum stream count under full production load?
- ☐ Accounted for activation memory (3–5x model weight size), not just model file size?
- ☐ Budgeted for the full production pipeline: tracking, alerting, logging—not just the detection model?
- ☐ Added ≥30% headroom above measured peak to the RAM requirement?
- ☐ Verified swap configuration: disabled or sized to prevent latency-disrupting swap events in production?
Frequently Asked Questions
Can I add RAM to a Jetson module after purchase?
No. Jetson modules use LPDDR5 memory soldered directly to the SoM during manufacturing. The memory configuration (8 GB, 16 GB, 32 GB, 64 GB) is fixed at the factory. Select the correct module variant at procurement time.
How do I measure actual runtime memory usage on a Jetson?
Use tegrastats for combined CPU+GPU memory reporting, or free -h for system RAM. For detailed GPU memory allocation, use nvidia-smi or the Nsight Systems profiler. Monitor under full production load for at least 10 minutes to catch steady-state usage.
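A minimal sampling loop along these lines can capture the tightest memory point during a soak run; the interval and duration are assumptions, and tegrastats or jtop give a richer per-subsystem view.

```python
# Sample MemAvailable every few seconds and report the tightest point
# observed during a soak run under full production load.
import time

def mem_available_mb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024
    raise RuntimeError("MemAvailable not found")

DURATION_S = 600      # at least 10 minutes, per the guidance above
INTERVAL_S = 5

lowest = float("inf")
end = time.time() + DURATION_S
while time.time() < end:
    lowest = min(lowest, mem_available_mb())
    time.sleep(INTERVAL_S)

print(f"Lowest MemAvailable during soak: {lowest:.0f} MB")
```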
Does increasing batch size increase memory usage?
Yes, approximately linearly. Batch size 1 requires one set of input/output tensor allocations. Batch size 4 requires four. For real-time single-stream inference, batch size 1 is standard. Batching across streams is possible but increases latency for individual frames.
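The linear scaling is easy to sanity-check for the input tensor alone; the resolution, channel count, and dtype below are assumptions for a typical FP16 detector input, and real allocations also include outputs and intermediate activations.

```python
# Input tensor memory vs batch size for a 640x640 RGB detector input.
# FP16 = 2 bytes per element; treat this as a lower bound on allocation.

def input_tensor_mb(batch, channels=3, height=640, width=640, dtype_bytes=2):
    return batch * channels * height * width * dtype_bytes / (1024 ** 2)

for b in (1, 4, 8):
    print(f"batch {b}: ~{input_tensor_mb(b):.1f} MB")
```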
Is 8 GB enough for YOLOv8 on 4 cameras?
YOLOv8s or smaller at INT8 precision on 4 streams is feasible on 8 GB with careful pipeline optimization. YOLOv8m and above at 4 streams is marginal — expect limited headroom for secondary processing or tracking state.
What happens when a Jetson runs out of RAM?
The kernel's OOM (out-of-memory) killer terminates the highest-memory process, which is typically the inference application. This causes a pipeline crash. Production systems should monitor RSS memory usage and implement a watchdog to restart the pipeline if it terminates unexpectedly.
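A minimal watchdog along these lines can restart the pipeline after an OOM kill. The process name and service name below are hypothetical placeholders; on systemd-managed nodes, setting Restart=on-failure in the service unit achieves the same result without a separate script.

```python
# Minimal watchdog sketch: restart the inference pipeline if its process dies.
# "inference_pipeline" and the service name are hypothetical placeholders.
import subprocess
import time

PROCESS_NAME = "inference_pipeline"
RESTART_CMD = ["systemctl", "restart", "inference-pipeline.service"]

def is_running(name):
    # pgrep exits with code 0 if at least one process matches the pattern.
    return subprocess.run(["pgrep", "-f", name],
                          capture_output=True).returncode == 0

while True:
    if not is_running(PROCESS_NAME):
        print("Pipeline down; restarting")
        subprocess.run(RESTART_CMD, check=False)
    time.sleep(10)
```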
Does quantization (INT8 vs FP16 vs FP32) affect RAM usage?
Yes. FP32 uses 4 bytes per parameter, FP16 uses 2 bytes, INT8 uses 1 byte. A model with 10 million parameters uses 40 MB at FP32, 20 MB at FP16, and 10 MB at INT8 for weights alone. Activation memory is similarly reduced. INT8 quantization roughly halves memory usage compared to FP16.
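The weight-only arithmetic from the answer above, as a quick sketch:

```python
# Weight memory for a model with a given parameter count at each precision.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_mb(params, precision):
    return params * BYTES_PER_PARAM[precision] / 1e6

params = 10_000_000   # 10 million parameters, as in the example above
for p in ("fp32", "fp16", "int8"):
    print(f"{p}: {weight_mb(params, p):.0f} MB")
```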
The Bottom Line
For most edge AI nodes, 16 GB is the safest default because it absorbs the difference between a lab prototype and a real production pipeline. Buy 8 GB only when the workload is tightly bounded. Move to 32 GB or 64 GB only when model complexity, concurrency, or evaluation workloads clearly justify it.