Environment
- TensorRT Version: 8.6.2.3
- Device: NVIDIA Jetson Orin NX (16 GB RAM)
- CUDA Version: 12.2.140
- CUDNN Version: 8.9.4.25
- Operating System: Ubuntu 22.04 (Jammy Jellyfish)
- Python Version (if applicable): 3.10.12
- PyTorch Version (if applicable): 2.6.0
- Baremetal or Container: nvcr.io/nvidia/deepstream-l4t:7.0-samples-multiarch
- ONNX: 1.19.0
Description
I converted my PyTorch model to a TensorRT engine via torch.onnx.export followed by trtexec. However, I observe a severe performance drop between bs=1 and bs=2: at bs=2 the reported QPS (queries per second) is almost 50% lower than at bs=1.
This issue appears very similar to #976.
I ran the following tests:
Inference with static batch size = 1
- Throughput: 70.0427 QPS
- Latency: min = 14.6602 ms, max = 19.5797 ms, mean = 14.7998 ms
- Enqueue Time: min = 0.8222 ms, max = 1.1145 ms, mean = 0.9361 ms
- H2D Latency: min = 0.2893 ms, max = 0.5101 ms, mean = 0.2997 ms
- GPU Compute Time: min = 14.1738 ms, max = 18.7857 ms, mean = 14.2679 ms
- D2H Latency: min = 0.1887 ms, max = 0.2986 ms, mean = 0.2322 ms
- Total Host Walltime: 3.01245 s
- Total GPU Compute Time: 3.01053 s
Inference with dynamic batch size = 1
- Throughput: 70.1527 QPS
- Latency: min = 14.6797 ms, max = 19.5344 ms, mean = 14.7761 ms
- Enqueue Time: min = 0.6501 ms, max = 1.4023 ms, mean = 0.9709 ms
- H2D Latency: min = 0.2877 ms, max = 0.4823 ms, mean = 0.2998 ms
- GPU Compute Time: min = 14.1780 ms, max = 18.7908 ms, mean = 14.2463 ms
- D2H Latency: min = 0.1880 ms, max = 0.2996 ms, mean = 0.2300 ms
- Total Host Walltime: 3.02198 s
- Total GPU Compute Time: 3.02021 s
Inference with static batch size = 2
- Throughput: 36.9994 QPS
- Latency: min = 27.7139 ms, max = 35.2312 ms, mean = 27.9782 ms
- Enqueue Time: min = 0.6567 ms, max = 1.3977 ms, mean = 0.9839 ms
- H2D Latency: min = 0.5500 ms, max = 0.9541 ms, mean = 0.5673 ms
- GPU Compute Time: min = 26.7891 ms, max = 33.8774 ms, mean = 27.0105 ms
- D2H Latency: min = 0.3606 ms, max = 0.4104 ms, mean = 0.4004 ms
- Total Host Walltime: 3.0541 s
- Total GPU Compute Time: 3.05219 s
Inference with dynamic batch size = 2
- Throughput: 36.9886 QPS
- Latency: min = 27.7246 ms, max = 35.6218 ms, mean = 27.9854 ms
- Enqueue Time: min = 0.6264 ms, max = 1.2593 ms, mean = 0.9366 ms
- H2D Latency: min = 0.5500 ms, max = 0.9653 ms, mean = 0.5660 ms
- GPU Compute Time: min = 26.7764 ms, max = 34.2528 ms, mean = 27.0181 ms
- D2H Latency: min = 0.3611 ms, max = 0.4097 ms, mean = 0.4013 ms
- Total Host Walltime: 3.0550 s
- Total GPU Compute Time: 3.05305 s
Python Script to Export PyTorch → ONNX (static batch size)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.eval()

dummy_input = torch.randn(size=(1, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)
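Before looking at TensorRT, a quick sanity check I'd run on the export itself: validate the graph and compare ONNX Runtime output to the PyTorch output on the same input. This is only a sketch, assuming onnxruntime is installed and that the model returns a single tensor; model and save_path are taken from the script above.

import numpy as np
import onnx
import onnxruntime as ort
import torch

# Structural validity of the exported graph
onnx.checker.check_model(str(save_path))

# Compare ONNX Runtime against PyTorch on the same random input
session = ort.InferenceSession(str(save_path), providers=["CPUExecutionProvider"])
x = torch.randn(1, 3, 416, 608)
with torch.no_grad():
    torch_out = model.cpu()(x).numpy()
ort_out = session.run(None, {"input": x.numpy()})[0]

# FP32 comparison only; the FP16 TensorRT engine will diverge somewhat more
print("max abs diff:", np.abs(torch_out - ort_out).max())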
Python Script to Export PyTorch → ONNX (dynamic batch size)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.eval()

dummy_input = torch.randn(size=(2, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
)
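To confirm the dynamic axis actually made it into the graph, a small inspection sketch (assuming the onnx package; save_path is the dynamic export above):

import onnx

graph = onnx.load(str(save_path)).graph
for tensor in list(graph.input) + list(graph.output):
    dims = [d.dim_param or d.dim_value for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)
# Expected: the first dimension of both "input" and "output" prints as the
# symbolic name "batch_size" rather than a fixed 1 or 2.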
TRT Commands
Static batch size
trtexec --onnx=test_static_bs_1.onnx --saveEngine=test.engine --fp16
trtexec --onnx=test_static_bs_2.onnx --saveEngine=test.engine --fp16
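As a sanity check on what batch size each engine was actually built with, this sketch deserializes an engine and prints its I/O tensor shapes (TensorRT Python bindings assumed available in the container; the engine path is a placeholder):

import tensorrt as trt

engine_path = "test.engine"  # placeholder: one of the engines built above
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(engine_path, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(engine.get_tensor_mode(name), name, engine.get_tensor_shape(name))
# A static bs=2 engine should report input shape (2, 3, 416, 608);
# a dynamic engine reports (-1, 3, 416, 608).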
Dynamic batch size
trtexec --onnx=model.onnx \
    --saveEngine=model.engine \
    --minShapes=input:1x3x416x608 \
    --optShapes=input:2x3x416x608 \
    --maxShapes=input:16x3x416x608 \
    --fp16
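For experimenting with different optimization profiles programmatically, this is a rough Python-API equivalent of the trtexec build above (paths are placeholders, error handling is minimal, and it is only a sketch):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# One profile covering batch 1..16, tuned (opt shape) for batch 2
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 416, 608), (2, 3, 416, 608), (16, 3, 416, 608))
config.add_optimization_profile(profile)

with open("model.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))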
Inference
trtexec --loadEngine=file.engine --shapes=input:Nx3x416x608 --fp16
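To cross-check the trtexec numbers, a minimal timing sketch with the TensorRT Python API and pycuda (both assumed available in the container; engine_path and the batch size are placeholders, and the dynamic engine from above is assumed). It prints milliseconds per query and per sample, which is exactly the distinction I ask about below.

import time
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

engine_path = "model.engine"  # placeholder: dynamic engine built above
batch = 2

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(engine_path, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
context.set_input_shape("input", (batch, 3, 416, 608))

# Allocate a device buffer for every I/O tensor and register its address
stream = cuda.Stream()
buffers = {}
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    dtype = np.dtype(trt.nptype(engine.get_tensor_dtype(name)))
    buffers[name] = cuda.mem_alloc(trt.volume(context.get_tensor_shape(name)) * dtype.itemsize)
    context.set_tensor_address(name, int(buffers[name]))

x = np.random.rand(batch, 3, 416, 608).astype(np.float32)
cuda.memcpy_htod_async(buffers["input"], x, stream)

# Warm-up, then time GPU compute only (no H2D/D2H inside the loop)
for _ in range(10):
    context.execute_async_v3(stream.handle)
stream.synchronize()

n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    context.execute_async_v3(stream.handle)
stream.synchronize()
elapsed = time.perf_counter() - start

ms_per_query = 1000.0 * elapsed / n_iters
print(f"{ms_per_query:.2f} ms/query, {ms_per_query / batch:.2f} ms/sample")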
Observations
- Throughput drops almost 2× when moving from batch 1 (~70 QPS) to batch 2 (~37 QPS); see the quick arithmetic after this list.
- Latency roughly doubles with batch 2 (~28 ms vs ~14.8 ms).
- GPU compute time scales roughly linearly with batch size (~14 ms → ~27 ms), i.e. batching gives essentially no per-sample speedup.
- H2D and D2H transfer times slightly increase with batch.
- Enqueue time is similar across batch sizes.
- Static vs dynamic batching shows almost identical performance, so it’s not the source of the issue.
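To make the ~2× comparison concrete, here is the arithmetic under the two possible readings of trtexec's QPS (per query of batch N vs. per single sample):

# Measured trtexec throughput (queries per second)
qps_bs1 = 70.0  # batch size 1
qps_bs2 = 37.0  # batch size 2

# Reading 1: a "query" is one whole batch of N -> samples/s = QPS * N
print("samples/s if QPS is per batch: ", qps_bs1 * 1, "vs", qps_bs2 * 2)  # 70.0 vs 74.0

# Reading 2: QPS is already per single sample -> samples/s = QPS
print("samples/s if QPS is per sample:", qps_bs1, "vs", qps_bs2)          # 70.0 vs 37.0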
Questions
- In this answer to a very similar issue, it was stated that “trtexec returns the runtime per inference, where an inference is a query of batch_size=N which you specified.” On the other hand, in this reply, the QPS was multiplied by the batch size, which suggests a performance improvement with bs > 1. So I’m not sure whether trtexec reports the inference time per query with bs=N, or whether it reports the inference time for bs=1 that should then be multiplied by N.
- Is it normal to observe such a severe performance drop between bs=1 and bs=2?
- Could this be caused by an issue in the ONNX export or the TensorRT engine conversion?
- Are there any recommended steps or best practices to improve performance for dynamic batch sizes in TensorRT?