Inference time in trtexec scales linearly with batch size

Environment

  • TensorRT Version: 8.6.2.3
  • Device: NVIDIA Jetson Orin NX (16 GB RAM)
  • CUDA Version: 12.2.140
  • CUDNN Version: 8.9.4.25
  • Operating System: Ubuntu 22.04 (Jammy Jellyfish)
  • Python Version (if applicable): 3.10.12
  • PyTorch Version (if applicable): 2.6.0
  • Baremetal or Container: nvcr.io/nvidia/deepstream-l4t:7.0-samples-multiarch
  • ONNX: 1.19.0

Description

I converted my PyTorch model to a TensorRT engine using torch.onnx.export and trtexec, but I observed a severe performance drop between bs=1 and bs=2: the throughput reported by trtexec (QPS, queries per second) at bs=2 is almost 50% lower than at bs=1.

This issue appears very similar to #976.

I ran the following tests:

Inference with static batch size = 1

  • Throughput: 70.0427 QPS
  • Latency: min = 14.6602 ms, max = 19.5797 ms, mean = 14.7998 ms
  • Enqueue Time: min = 0.8222 ms, max = 1.1145 ms, mean = 0.9361 ms
  • H2D Latency: min = 0.2893 ms, max = 0.5101 ms, mean = 0.2997 ms
  • GPU Compute Time: min = 14.1738 ms, max = 18.7857 ms, mean = 14.2679 ms
  • D2H Latency: min = 0.1887 ms, max = 0.2986 ms, mean = 0.2322 ms
  • Total Host Walltime: 3.01245 s
  • Total GPU Compute Time: 3.01053 s

Inference with dynamic batch size = 1

  • Throughput: 70.1527 QPS
  • Latency: min = 14.6797 ms, max = 19.5344 ms, mean = 14.7761 ms
  • Enqueue Time: min = 0.6501 ms, max = 1.4023 ms, mean = 0.9709 ms
  • H2D Latency: min = 0.2877 ms, max = 0.4823 ms, mean = 0.2998 ms
  • GPU Compute Time: min = 14.1780 ms, max = 18.7908 ms, mean = 14.2463 ms
  • D2H Latency: min = 0.1880 ms, max = 0.2996 ms, mean = 0.2300 ms
  • Total Host Walltime: 3.02198 s
  • Total GPU Compute Time: 3.02021 s

Inference with static batch size = 2

  • Throughput: 36.9994 QPS
  • Latency: min = 27.7139 ms, max = 35.2312 ms, mean = 27.9782 ms
  • Enqueue Time: min = 0.6567 ms, max = 1.3977 ms, mean = 0.9839 ms
  • H2D Latency: min = 0.5500 ms, max = 0.9541 ms, mean = 0.5673 ms
  • GPU Compute Time: min = 26.7891 ms, max = 33.8774 ms, mean = 27.0105 ms
  • D2H Latency: min = 0.3606 ms, max = 0.4104 ms, mean = 0.4004 ms
  • Total Host Walltime: 3.0541 s
  • Total GPU Compute Time: 3.05219 s

Inference with dynamic batch size = 2

  • Throughput: 36.9886 QPS
  • Latency: min = 27.7246 ms, max = 35.6218 ms, mean = 27.9854 ms
  • Enqueue Time: min = 0.6264 ms, max = 1.2593 ms, mean = 0.9366 ms
  • H2D Latency: min = 0.5500 ms, max = 0.9653 ms, mean = 0.5660 ms
  • GPU Compute Time: min = 26.7764 ms, max = 34.2528 ms, mean = 27.0181 ms
  • D2H Latency: min = 0.3611 ms, max = 0.4097 ms, mean = 0.4013 ms
  • Total Host Walltime: 3.0550 s
  • Total GPU Compute Time: 3.05305 s

Python Script to Export PyTorch → ONNX (static batch size)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load trained weights and put the model in inference mode
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()

# Fixed batch size of 1 baked into the exported graph
dummy_input = torch.randn(size=(1, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)

Python Script to Export PyTorch → ONNX (dynamic batch size)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()

# Batch size 2 in the dummy input; the batch dimension is marked dynamic below
dummy_input = torch.randn(size=(2, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
)
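
Before converting with trtexec, I also sanity-check the dynamic export itself. This is a minimal sketch (assuming onnx and onnxruntime are installed and that the model returns a single output tensor matching output_names); it reuses model, device, and save_path from the script above:

import numpy as np
import onnx
import onnxruntime as ort
import torch

onnx_path = str(save_path)  # the file written by the export script above

# Structural check of the exported graph
onnx.checker.check_model(onnx.load(onnx_path))

# Compare ONNX Runtime output against the PyTorch model for bs=1 and bs=2
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
for bs in (1, 2):
    x = torch.randn(bs, 3, 416, 608)
    with torch.no_grad():
        ref = model(x.to(device)).cpu().numpy()  # `model` / `device` from the script above
    out = sess.run(None, {"input": x.numpy()})[0]
    print(bs, np.abs(ref - out).max())

The maximum absolute differences are tiny, so the exported graph itself seems consistent with the PyTorch model.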

TRT Commands

Static batch size

trtexec --onnx=test_static_bs_1.onnx --saveEngine=test.engine --fp16
trtexec --onnx=test_static_bs_2.onnx --saveEngine=test.engine --fp16

Dynamic batch size

trtexec --onnx=model.onnx \
        --saveEngine=model.engine \
        --minShapes=input:1x3x416x608 \
        --optShapes=input:2x3x416x608 \
        --maxShapes=input:16x3x416x608 \
        --fp16

Inference

trtexec --loadEngine=file.engine --shapes=input:Nx3x416x608 --fp16 

Observations

  • Throughput drops almost 2× when moving from batch 1 (~70 QPS) to batch 2 (~37 QPS).
  • Latency roughly doubles with batch 2 (~28 ms vs ~14.8 ms).
  • GPU compute time scales linearly with batch size (~14 ms → ~27 ms), i.e., batching gives no per-image speed-up (a per-image conversion is sketched after this list).
  • H2D and D2H transfer times slightly increase with batch.
  • Enqueue time is similar across batch sizes.
  • Static vs dynamic batching shows almost identical performance, so it’s not the source of the issue.
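
For reference, here are the reported QPS numbers converted to per-image figures, under the interpretation from the linked answer that one trtexec "query" is one execution of the full batch:

# Per-image throughput, assuming one trtexec "query" = one execution at the full batch size
reported_qps = {1: 70.04, 2: 37.00}  # batch size -> QPS reported by trtexec
for bs, qps in reported_qps.items():
    print(f"bs={bs}: {qps * bs:.1f} images/s, {1000.0 / (qps * bs):.2f} ms/image")

Under that reading the per-image cost is essentially flat (~14.3 ms vs ~13.5 ms), which would mean the GPU is already close to saturation at bs=1 rather than there being an actual regression. But that hinges on how trtexec counts a query, which is exactly what I am unsure about (see the first question below).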

Questions

  • In this answer to a very similar issue, it was stated that “trtexec returns the runtime per inference, where an inference is a query of batch_size=N which you specified.”
    On the other hand, in this reply the QPS was multiplied by the batch size, which would imply a throughput improvement at bs > 1. So I am not sure whether trtexec reports the time per query of batch size N (in which case images/s = QPS × N), or the time per single image (in which case images/s = QPS).
  • Is it normal to observe such a severe performance drop between bs=1 and bs=2 when using batch sizes > 1?
  • Could this be caused by an issue in the ONNX export or TensorRT engine conversion? (A cross-check outside TensorRT is sketched after these questions.)
  • Are there any recommended steps or best practices to improve performance for dynamic batch sizes in TensorRT?
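
To help answer the last two questions, here is a cross-check I intend to run outside TensorRT: timing the original PyTorch model at different batch sizes. This is a minimal sketch, not a rigorous benchmark (it reuses `model` from the export scripts and assumes the model runs in FP16; if it does not, drop `.half()` and use float32 inputs):

import time
import torch

def time_batch(net, bs, iters=50):
    # Rough GPU latency for one forward pass at batch size `bs`
    x = torch.randn(bs, 3, 416, 608, device="cuda", dtype=torch.half)
    with torch.no_grad():
        for _ in range(10):  # warm-up
            net(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            net(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters

net = model.half().to("cuda").eval()  # `model` from the export scripts; assumes FP16 inference works
for bs in (1, 2, 4):
    ms = time_batch(net, bs)
    print(f"bs={bs}: {ms:.2f} ms/batch, {ms / bs:.2f} ms/image")

If ms/image stays roughly constant here as well, the near-linear scaling is a property of the model and the Orin NX (i.e., the GPU is already saturated at bs=1) rather than something introduced by the ONNX export or the TensorRT conversion.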