Highlights
- Edge AI deployments face major latency challenges when running multiple computer vision models in real time on devices like the NVIDIA Jetson AGX Orin.
- NVIDIA TensorRT emerges as a powerful inference optimizer, using techniques like layer fusion, precision calibration (FP16/INT8), and kernel auto-tuning to accelerate deep learning models.
- Converting YOLO models to TensorRT drastically reduced inference time, achieving up to 2.6ms per image and enabling smooth, real-time video analytics.
When we first started working with edge computing and TensorRT video analytics, we knew little about inference optimization for real-time applications. Our project required deploying AI models on the NVIDIA Jetson AGX Orin device, where speed was critical. Not only did we have to process large models individually, but we also needed to run the feature extraction and inference pipelines in real time. The challenge? High latency.
Hitting a wall: The latency struggle in video analytics
We were building a model that extracted multiple features from an image—pose landmarks, facial landmarks, segmentation masks, depth estimation, and face detection. While the models worked well, the inference time was painfully slow. Every millisecond counted, and we needed to find a way to accelerate the processing without compromising accuracy.
One of the biggest bottlenecks we faced was depth estimation. The inference time of our depth estimation model was too high for real-time processing, making it a major roadblock. While searching for solutions, we came across DepthAnythingV2’s TensorRT GitHub page. Curious about its potential, we explored the repository, examined the optimizations, and decided to give it a shot. The results were impressive—switching to the TensorRT version significantly reduced inference time. That’s when we started wondering—what if we could convert all our models to TensorRT?
What is TensorRT?
NVIDIA TensorRT is an ecosystem of APIs and a high-performance deep learning inference optimizer that helps developers deploy AI models faster and more efficiently. It works by optimizing trained neural networks using techniques such as layer fusion, precision calibration (like FP16 and INT8), and kernel auto-tuning to achieve the best balance between speed and accuracy. TensorRT compiles these optimized models into lightweight inference engines that run with minimal latency on NVIDIA GPUs, from powerful data center cards to compact edge devices like the Jetson AGX Orin. The latest versions also extend support for large language models through TensorRT-LLM, enabling faster, more scalable generative AI inference across diverse applications.
The breakthrough: How TensorRT transformed our AI inference
We started researching TensorRT, NVIDIA’s inference optimization engine built for accelerating deep learning models on GPUs. Unlike traditional frameworks, TensorRT compiles models into highly optimized engines, applying a set of optimizations that make inference lightning-fast. As we explored its optimizations and different conversion methods, we began converting models one by one—YOLO (for pose, object detection, segmentation), depth estimation, and face detection—ensuring each integrated smoothly into the pipeline for real-time performance. Some of the key techniques TensorRT uses include layer fusion, precision conversion, kernel auto-tuning, and efficient memory management.

This article is the first in a series where we’ll walk you through how we optimized our model using TensorRT, the methods we explored, and the incredible improvements in inference speed. If you’ve ever struggled with model latency, this might just save you hours (or even days) of frustration!
Setting up TensorRT for high-performance video processing
The first step was to get TensorRT running. We found that it can be installed in two ways:
- JetPack SDK (for Jetson devices): Comes with TensorRT pre-installed when flashing Jetson boards.
- Manual Installation: If TensorRT isn’t included in the system, it can be installed by following NVIDIA’s Installing TensorRT guide.
For our Jetson AGX Orin, we reflashed it with JetPack 6.0, and everything was set up right out of the box. It came pre-installed with TensorRT, CUDA, cuDNN, DeepStream, and OpenCV, making the setup seamless and saving us the hassle of manual installations.
TensorRT hardware and software compatibility: What you need to know
Before optimizing, we needed to ensure TensorRT compatibility. A key requirement is a compute capability of 7.5 or higher to unlock TensorRT’s full potential. You can verify hardware and software compatibility using the TensorRT Support Matrix.
To check whether our GPU was compatible, we queried its compute capability from Python; the snippet below is a minimal sketch using PyTorch (any CUDA query tool works):
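```python
# Minimal sketch: query the GPU's CUDA compute capability (assumes PyTorch with CUDA support)
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")  # prints 8.7 on the Jetson AGX Orin
```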
Here’s a quick reference table to check whether your device is compatible with TensorRT:
| Hardware | GPU Series | Compute Capability | JetPack Compatibility | Server Compatibility |
|---|---|---|---|---|
| NVIDIA Data Center GPUs | A100, H100 | 8.0+ | N/A | Supported |
| Jetson AGX Orin | Orin | 8.7 | JetPack 6.0+ | N/A |
| Jetson Xavier | Xavier Series | 7.2 | JetPack 4.5+ | N/A |
| GeForce RTX Series | RTX 30xx, 40xx | 8.0+ | N/A | Supported (workstations) |
With compatibility confirmed, it was time to move forward. Our system—Jetson AGX Orin—has a compute capability of 8.7, meeting the requirements for advanced TensorRT optimizations. Now, the real challenge began—converting our model into a TensorRT-optimized engine to maximize inference performance.
Optimizing video models: Streamlined techniques for TensorRT conversion
TensorRT supports multiple methods for converting models into optimized engines. We explored several approaches to find the one that best suited our model’s needs. Here’s a quick rundown:
- trtexec CLI Tool: A command-line tool to quickly convert ONNX models to TensorRT engines and benchmark their performance.
- torch2trt: A lightweight Python library to convert PyTorch models to TensorRT engines, simplifying deployment (see the sketch after this list).
- TensorRT Python API: Provides granular control to parse, build, and optimize models programmatically for TensorRT.
- Torch-TensorRT: An extension of PyTorch to directly compile and optimize models for TensorRT.
- DeepStream SDK: Automates the conversion and deployment of models for video analytics applications.
- NVIDIA TAO Toolkit: A toolkit for optimizing and deploying pretrained models directly with TensorRT.
- TensorFlow-TensorRT Integration (TF-TRT): A deep-learning compiler for TensorFlow that optimizes TF models for inference on NVIDIA devices.
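To give a feel for the library-based routes, here is a minimal torch2trt sketch. The ResNet-18 model is only a stand-in for illustration (our pipeline models are different), and fp16_mode mirrors the FP16 precision we use later with trtexec:

```python
# Minimal torch2trt sketch: convert a PyTorch model into a TensorRT engine.
# ResNet-18 is just a placeholder; substitute your own model.
import torch
from torch2trt import torch2trt
from torchvision.models import resnet18

model = resnet18(weights=None).eval().cuda()   # any eval-mode model on the GPU
x = torch.randn(1, 3, 224, 224).cuda()         # example input used for tracing

# fp16_mode builds the engine with FP16 kernels where the hardware supports them
model_trt = torch2trt(model, [x], fp16_mode=True)

y = model_trt(x)                               # inference now runs through TensorRT
torch.save(model_trt.state_dict(), "resnet18_trt.pth")  # reload later via TRTModule
```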
Step-by-step: Converting our YOLO model to TensorRT
After successfully optimizing a few models with TensorRT, we ran our first experiments on the YOLO models used for pose estimation, object detection, and segmentation. We chose trtexec, a command-line tool included in NVIDIA’s TensorRT SDK that lets us benchmark networks, generate optimized TensorRT engines, and create serialized timing caches. On JetPack installs it is typically located at:
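```bash
# Default trtexec location on JetPack-based Jetson devices
/usr/src/tensorrt/bin/trtexec
```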
Fixing "trtexec: command not found"
While running trtexec, we encountered the classic error:
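```
trtexec: command not found
```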
This usually means the binary isn’t in the system’s PATH. A solution was creating a symbolic link (from the default JetPack location) to make it accessible globally:
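```bash
# Assumes the default JetPack path shown above
sudo ln -s /usr/src/tensorrt/bin/trtexec /usr/local/bin/trtexec
```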
After this, we verified the installation, for example with:
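```bash
# Any trtexec invocation works as a quick check; --help prints the usage text
trtexec --help
```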
With TensorRT and trtexec ready to go, we moved on to converting our YOLO model into a TensorRT engine.
Prerequisites
- YOLO Python library: ultralytics.
- TensorRT with the trtexec CLI tool.
- Required Python libraries: opencv-python, onnx, onnxruntime.
Step 1: Export the YOLOv11n model to ONNX format
To convert YOLO to TensorRT, we first needed to export it to ONNX format, and the Ultralytics Python library makes this easy. We saved the following minimal export script as export_yolov11n.py (point YOLO() at whichever checkpoint you actually use):
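```python
# export_yolov11n.py
# Minimal export script using the Ultralytics API. The checkpoint name follows
# this post's "yolov11n" naming; Ultralytics publishes the nano weights as yolo11n.pt.
from ultralytics import YOLO

model = YOLO("yolov11n.pt")   # load the YOLOv11n weights
model.export(format="onnx")   # writes yolov11n.onnx next to the weights
```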
Then we ran:
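```bash
python3 export_yolov11n.py
```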
This produced a yolov11n.onnx file, which is the intermediate format before conversion to TensorRT.
Step 2: Convert the ONNX model to TensorRT using trtexec
With the ONNX model ready, we used trtexec to generate an optimized TensorRT engine. Since we wanted to maximize speed while keeping precision, we enabled FP16 mode:
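```bash
# Build an FP16-optimized TensorRT engine from the ONNX export
trtexec --onnx=yolov11n.onnx --saveEngine=yolov11n.engine --fp16
```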
Explanation of parameters:
- --onnx: Path to the ONNX model file.
- --saveEngine: Saves the optimized TensorRT engine.
- --fp16: Enables FP16 precision for faster inference.
Once the process was complete, we had a yolov11n.engine file—an optimized, serialized TensorRT model ready for deployment.
Step 3: Running inference on the TensorRT engine
The final step was testing the model's performance. We loaded the engine and ran inference with batch size 8 to leverage GPU parallelism:
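```bash
# Benchmark the prebuilt engine with a batch size of 8
trtexec --loadEngine=yolov11n.engine --batch=8
```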
Explanation of parameters:
- --loadEngine=yolov11n.engine: Loads the prebuilt TensorRT engine generated in Step 2.
- --batch=8: Processes 8 inputs simultaneously, improving efficiency by utilizing the GPU more effectively.
More options for trtexec can be found in the TensorRT trtexec Documentation.
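Beyond benchmarking with trtexec, the same engine file can be loaded directly in application code. Here is a minimal sketch with the TensorRT Python API (assuming the tensorrt bindings shipped with JetPack; CUDA buffer management and pre/post-processing are omitted):

```python
# Minimal sketch: deserialize the engine built above with the TensorRT Python API
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("yolov11n.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# The execution context holds per-inference state; wire it to input/output
# buffers (e.g., via cuda-python or PyCUDA) to run inference in your own app.
context = engine.create_execution_context()
print("Engine deserialized; execution context ready")
```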
The results: A massive speed boost in video analytics with TensorRT
After conversion, we measured the performance and the improvements were immediate: the model delivered an average response time of just 2.6 milliseconds per image over 10 iterations and 10,592 total queries, an impressive gain in speed and efficiency.
Conclusion: Key takeaways from our TensorRT optimization journey
With TensorRT, we efficiently processed multiple streams, running vision models for pose estimation, facial landmarks, object detection, segmentation, and depth estimation with minimal latency. Optimizing models while maintaining accuracy enabled smooth real-time performance, making the system ideal for edge deployment. If you’re working on edge AI applications, we highly recommend experimenting with TensorRT for inference optimization. Try different conversion methods, test various precision modes like FP16 and INT8, and see how much you can improve inference speed and memory usage.
This is just the first step—next, we’ll dive deeper into advanced TensorRT techniques and how to fine-tune performance even further. Stay tuned!
At KeyValue, we believe every millisecond matters. Our AI lab is built to push edge AI and GPU optimization to their absolute limits. This experiment with TensorRT is just one of many steps in our journey to build smarter, faster, and more adaptive AI systems.
FAQs
- How do I optimize inference speed with TensorRT?
Use TensorRT to build an optimized engine with tactics like layer fusion, reduced precision (FP16/INT8 with calibration), kernel auto-tuning, and efficient batching/dynamic shapes; profile and benchmark with trtexec to pick the fastest plan for your target GPU.
- What is TensorRT used for?
TensorRT, NVIDIA’s high-performance inference SDK, compiles trained models into optimized runtime engines for NVIDIA GPUs, delivering low-latency, high-throughput deployment via C++/Python APIs and integrations (Torch-TensorRT, TF-TRT).
- What is TensorRT-LLM?
TensorRT-LLM is NVIDIA’s open-source library that optimizes large language model inference on GPUs using custom kernels, caching, and quantization for faster, memory-efficient performance across single and multi-GPU systems.
- Which GPUs support TensorRT?
TensorRT supports NVIDIA RTX (desktop/workstation), Data Center (e.g., A100/H100), and Jetson (edge/embedded)—with precision features (FP16/INT8, DLA, etc.) varying by architecture. Always check the TensorRT Support Matrix for your exact GPU and features.
- Can I use TensorRT without a GPU?
No. For inference, TensorRT is designed to run optimized engines on NVIDIA GPUs. (You can develop in containers or use cloud/RTX/Jetson targets, but CPU-only execution isn’t supported by TensorRT’s inference runtime.)