Deploying deep learning models to production requires speed and efficiency. TensorRT is a powerful SDK from NVIDIA that can optimize, quantize, and accelerate inference on NVIDIA GPUs. In this article, we’ll walk through how to convert a PyTorch model into a TensorRT-optimized engine and benchmark its performance.
What Is TensorRT?
TensorRT is a high-performance deep learning inference optimizer and runtime library developed by NVIDIA. It supports reduced-precision inference (FP16 and INT8) as well as dynamic tensor shapes, and typically delivers significant latency and throughput improvements for inference workloads on supported NVIDIA hardware.
Installation
TensorRT can be installed from NVIDIA’s developer site, through NVIDIA’s Docker images, or (on Linux) directly via pip. To get started with the PyTorch → ONNX → TensorRT workflow used in this article:
pip install torch torchvision
pip install onnx onnxruntime
pip install tensorrt
Make sure your environment includes CUDA, cuDNN, and a compatible NVIDIA driver.
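As a quick sanity check before moving on, you can verify that both PyTorch and TensorRT are installed and that PyTorch can see the GPU. This is just a minimal sketch using the packages installed above:

import torch
import tensorrt as trt

print("CUDA available:", torch.cuda.is_available())      # should print True
print("CUDA version seen by PyTorch:", torch.version.cuda)
print("TensorRT version:", trt.__version__)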
Step 1: Export Your PyTorch Model to ONNX
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    input_names=["input"],
    output_names=["output"],
)
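Since onnx and onnxruntime were installed earlier, it is worth confirming that the exported graph is valid and produces an output of the expected shape before handing it to TensorRT. A minimal check might look like this:

import numpy as np
import onnx
import onnxruntime as ort

# Structural validation of the exported graph
onnx.checker.check_model(onnx.load("resnet50.onnx"))

# Run one inference on CPU through onnxruntime as a smoke test
session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
(logits,) = session.run(None, {"input": dummy})
print(logits.shape)  # expected: (1, 1000)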
Step 2: Optimize the ONNX Model With TensorRT
Use NVIDIA’s `trtexec` tool to convert the model into an engine:
trtexec --onnx=resnet50.onnx --saveEngine=resnet50.trt --fp16
This creates a serialized TensorRT engine file with FP16 precision enabled, which typically improves throughput on GPUs with Tensor Core support.
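If you prefer to build the engine from Python instead of the trtexec CLI, the TensorRT builder API can do the same conversion. The sketch below assumes a TensorRT 8.x/9.x-style API (in TensorRT 10 the explicit-batch flag is implied and network creation differs slightly):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
# The ONNX parser requires an explicit-batch network (TensorRT 8.x/9.x)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # same effect as trtexec --fp16

serialized_engine = builder.build_serialized_network(network, config)
with open("resnet50.trt", "wb") as f:
    f.write(serialized_engine)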
Step 3: Load and Run the TensorRT Engine
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    with open(path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("resnet50.trt")
context = engine.create_execution_context()

# Allocate device buffers for the input image and the 1000-class output
input_shape = (1, 3, 224, 224)
input_nbytes = int(np.prod(input_shape)) * np.dtype(np.float32).itemsize
output_nbytes = 1000 * np.dtype(np.float32).itemsize
d_input = cuda.mem_alloc(input_nbytes)
d_output = cuda.mem_alloc(output_nbytes)

# Prepare data and copy it to the GPU
input_data = np.random.random(input_shape).astype(np.float32)
cuda.memcpy_htod(d_input, input_data)

# Run inference
context.execute_v2([int(d_input), int(d_output)])

# Copy the result back to the host
output_data = np.empty(1000, dtype=np.float32)
cuda.memcpy_dtoh(output_data, d_output)
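At this point output_data holds the 1,000 raw class logits from ResNet-50. To turn them into a prediction, apply a softmax and take the arg max. This continues the script above (with real images you would also need ImageNet preprocessing rather than the random tensor used here):

import numpy as np

# Numerically stable softmax over the logits
probs = np.exp(output_data - output_data.max())
probs /= probs.sum()

top1 = int(np.argmax(probs))
print(f"Predicted class index: {top1}, confidence: {probs[top1]:.3f}")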
Benchmarking Inference Performance
Use the `trtexec` tool to benchmark your model:
trtexec --loadEngine=resnet50.trt --fp16
This will provide latency, throughput, and memory usage metrics that help you tune your deployment for production use.
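If you also want a quick latency number from Python (for example inside the script from Step 3, which already defines context, d_input, and d_output), a simple wall-clock loop around execute_v2 gives a rough estimate; trtexec remains the more rigorous tool since it handles warmup, CUDA streams, and averaging for you. A minimal sketch:

import time

# Warm up so one-time initialization costs are excluded from the measurement
for _ in range(10):
    context.execute_v2([int(d_input), int(d_output)])

runs = 100
start = time.perf_counter()
for _ in range(runs):
    context.execute_v2([int(d_input), int(d_output)])
elapsed = time.perf_counter() - start

print(f"Average latency: {elapsed / runs * 1000:.2f} ms")
print(f"Throughput: {runs / elapsed:.1f} inferences/sec")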
Use Cases
- Accelerating object detection or classification models
- Serving real-time models in cloud-native environments
- Deploying edge AI applications with Jetson devices
Conclusion
TensorRT enables fast and efficient deployment of deep learning models. By converting models to ONNX, optimizing with TensorRT, and running them on NVIDIA hardware, you can drastically improve performance. It’s a critical part of any production-ready ML pipeline targeting low-latency inference.
If this post helped you, consider buying me a coffee: buymeacoffee.com/hexshift