Estimate inference throughput for your edge AI workload

Estimate real-world inference throughput for vision models on edge AI hardware. Configure runtime, precision, batch size, and concurrent streams to compare FPS, per-image latency, and maximum stream capacity. Supports NVIDIA Jetson, Google Coral, and Hailo platforms.

Configuration
Platform Family
Platform
Power Mode
Runtime
Model Family
Model Variant
Precision
Resolution
224×224
320×320
416×416
640×640
1280×720
Advanced
Batch Size
1
2
4
8
16
Streams
1
2
4
8
16
32
Quick Results
Estimated FPS
Latency / batch
Latency / image
Accelerator util.
Multi-Stream Capacity
FPS per stream
Max streams @ 30fps
Max streams @ 15fps
Total FPS (all streams)
Planning Notes
Configure inputs to see planning recommendations.
Assumptions
Configure the system to see detailed assumptions.
// RELATED TOOLS
→ Tool 07: Module Power Calculator
→ Tool 06: Full Deployment Planner
FAQ
What inference engines does this tool support?

The estimator supports NVIDIA TensorRT, PyTorch, ONNX Runtime, Google Coral Edge TPU SDK, and Hailo Runtime. Runtime availability depends on the selected hardware platform — unsupported runtimes are disabled in the selector.

What is estimated FPS?

Estimated frames per second — how many inference passes the hardware can complete per second for the selected model, precision, and runtime. Higher FPS is better for real-time inference.

What is the difference between latency/batch and latency/image?

Latency/batch is the time to process a full batch of frames. Latency/image divides that by batch size — the per-frame processing time. For real-time streaming, latency/image is the relevant metric.
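The arithmetic behind these metrics is straightforward; a minimal sketch (the 40 ms batch-of-8 numbers are illustrative, not measured benchmarks):

```python
def per_image_latency_ms(batch_latency_ms: float, batch_size: int) -> float:
    """Per-frame processing time: batch latency divided by batch size."""
    return batch_latency_ms / batch_size

def throughput_fps(batch_latency_ms: float, batch_size: int) -> float:
    """Frames completed per second for this batch configuration."""
    return batch_size * 1000.0 / batch_latency_ms

def max_streams(total_fps: float, per_stream_fps: float = 30.0) -> int:
    """How many concurrent streams fit at a target per-stream frame rate."""
    return int(total_fps // per_stream_fps)

# Illustrative: a batch of 8 frames processed in 40 ms.
print(per_image_latency_ms(40.0, 8))  # 5.0 ms per image
print(throughput_fps(40.0, 8))        # 200.0 FPS
print(max_streams(200.0))             # 6 streams at 30 fps each
```

This mirrors the Quick Results and Multi-Stream Capacity panels: per-image latency drives real-time feasibility, while total FPS divided by the target per-stream rate bounds how many streams the device can carry.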

What does the confidence score mean?

High (90%): exact published vendor benchmark. Medium (65%): interpolated from GFLOPs across known variants. Low (40%): theoretical TOPS heuristic with no benchmark data. Always validate Low-confidence estimates on device.
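One way the Medium-confidence path can work is linear interpolation of throughput against model compute cost between benchmarked variants; a hypothetical sketch (the GFLOPs/FPS points are invented for illustration, not vendor data):

```python
def interpolate_fps(gflops: float, known: list[tuple[float, float]]) -> float:
    """Estimate FPS for a variant from (gflops, fps) pairs of benchmarked variants.

    Throughput falls roughly linearly as compute cost rises, so we
    interpolate between the two nearest benchmarked points.
    """
    pts = sorted(known)
    for (g0, f0), (g1, f1) in zip(pts, pts[1:]):
        if g0 <= gflops <= g1:
            t = (gflops - g0) / (g1 - g0)
            return f0 + t * (f1 - f0)
    raise ValueError("outside benchmarked range; fall back to TOPS heuristic")

# e.g. variants benchmarked at 10 GFLOPs -> 120 FPS and 30 GFLOPs -> 60 FPS:
print(interpolate_fps(20.0, [(10.0, 120.0), (30.0, 60.0)]))  # 90.0
```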

Why is TensorRT so much faster than PyTorch?

TensorRT performs layer fusion, precision calibration, and kernel auto-tuning at build time — extracting 1.5–2.5× more throughput than vanilla PyTorch inference on Jetson hardware. The build step (trtexec) takes minutes but runs once.
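As a rough planning aid, the 1.5–2.5× range quoted above can bound what a TensorRT engine might reach from a measured PyTorch baseline (the multipliers are that published range, not a guarantee for any specific model):

```python
def tensorrt_fps_range(pytorch_fps: float) -> tuple[float, float]:
    """Bound expected TensorRT throughput from a PyTorch baseline,
    using the 1.5-2.5x speedup range typical on Jetson hardware."""
    return pytorch_fps * 1.5, pytorch_fps * 2.5

# e.g. a model measured at 40 FPS under vanilla PyTorch inference:
print(tensorrt_fps_range(40.0))  # (60.0, 100.0)
```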

What is DLA (Deep Learning Accelerator)?

DLA is a fixed-function neural network processor on Jetson Orin NX and AGX Orin (2 DLA cores each). It runs supported layers alongside the GPU, freeing GPU headroom for other tasks. Not all YOLO11 ops are DLA-compatible; unsupported layers fall back to GPU automatically.

How accurate are these estimates?

Benchmark-backed estimates typically land within ±10–15% of real measured throughput under similar conditions. GFLOPs-interpolated estimates carry medium confidence and wider error bars; theoretical-TOPS estimates are for planning only (±30–50%). Always measure on target hardware before finalising a deployment design.