Inference time in trtexec scales linearly with batch size

Environment

  • TensorRT Version: 8.6.2.3
  • Device: NVIDIA Jetson Orin NX (16 GB RAM)
  • CUDA Version: 12.2.140
  • CUDNN Version: 8.9.4.25
  • Operating System: Ubuntu 22.04 (Jammy Jellyfish)
  • Python Version (if applicable): 3.10.12
  • PyTorch Version (if applicable): 2.6.0
  • Baremetal or Container: nvcr.io/nvidia/deepstream-l4t:7.0-samples-multiarch
  • ONNX: 1.19.0

Description

I converted my PyTorch model to a TensorRT engine using torch.onnx.export and trtexec, but I observed a severe performance drop between bs=1 and bs=2: the throughput reported by trtexec (QPS, queries per second) at bs=2 is almost 50% lower than at bs=1.

This issue appears very similar to #976.

I ran the following tests:

Inference with static batch size = 1

  • Throughput: 70.0427 QPS
  • Latency: min = 14.6602 ms, max = 19.5797 ms, mean = 14.7998 ms
  • Enqueue Time: min = 0.8222 ms, max = 1.1145 ms, mean = 0.9361 ms
  • H2D Latency: min = 0.2893 ms, max = 0.5101 ms, mean = 0.2997 ms
  • GPU Compute Time: min = 14.1738 ms, max = 18.7857 ms, mean = 14.2679 ms
  • D2H Latency: min = 0.1887 ms, max = 0.2986 ms, mean = 0.2322 ms
  • Total Host Walltime: 3.01245 s
  • Total GPU Compute Time: 3.01053 s

Inference with dynamic batch size = 1

  • Throughput: 70.1527 QPS
  • Latency: min = 14.6797 ms, max = 19.5344 ms, mean = 14.7761 ms
  • Enqueue Time: min = 0.6501 ms, max = 1.4023 ms, mean = 0.9709 ms
  • H2D Latency: min = 0.2877 ms, max = 0.4823 ms, mean = 0.2998 ms
  • GPU Compute Time: min = 14.1780 ms, max = 18.7908 ms, mean = 14.2463 ms
  • D2H Latency: min = 0.1880 ms, max = 0.2996 ms, mean = 0.2300 ms
  • Total Host Walltime: 3.02198 s
  • Total GPU Compute Time: 3.02021 s

Inference with static batch size = 2

  • Throughput: 36.9994 QPS
  • Latency: min = 27.7139 ms, max = 35.2312 ms, mean = 27.9782 ms
  • Enqueue Time: min = 0.6567 ms, max = 1.3977 ms, mean = 0.9839 ms
  • H2D Latency: min = 0.5500 ms, max = 0.9541 ms, mean = 0.5673 ms
  • GPU Compute Time: min = 26.7891 ms, max = 33.8774 ms, mean = 27.0105 ms
  • D2H Latency: min = 0.3606 ms, max = 0.4104 ms, mean = 0.4004 ms
  • Total Host Walltime: 3.0541 s
  • Total GPU Compute Time: 3.05219 s

Inference with dynamic batch size = 2

  • Throughput: 36.9886 QPS
  • Latency: min = 27.7246 ms, max = 35.6218 ms, mean = 27.9854 ms
  • Enqueue Time: min = 0.6264 ms, max = 1.2593 ms, mean = 0.9366 ms
  • H2D Latency: min = 0.5500 ms, max = 0.9653 ms, mean = 0.5660 ms
  • GPU Compute Time: min = 26.7764 ms, max = 34.2528 ms, mean = 27.0181 ms
  • D2H Latency: min = 0.3611 ms, max = 0.4097 ms, mean = 0.4013 ms
  • Total Host Walltime: 3.0550 s
  • Total GPU Compute Time: 3.05305 s

Python Script to Export PyTorch → ONNX (static batch size)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load trained weights and put the model in inference mode
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()

# Fixed batch size of 1 baked into the exported graph
dummy_input = torch.randn(size=(1, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)

Python Script to Export PyTorch → ONNX (dynamic batch size)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()

# Batch size 2 in the dummy input; the batch dimension is marked dynamic below
dummy_input = torch.randn(size=(2, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
)
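
Before converting with trtexec, I also sanity-check the dynamic export itself. This is a minimal sketch (assuming onnx and onnxruntime are installed and that the model returns a single output tensor matching output_names); it reuses model, device, and save_path from the script above:

import numpy as np
import onnx
import onnxruntime as ort
import torch

onnx_path = str(save_path)  # the file written by the export script above

# Structural check of the exported graph
onnx.checker.check_model(onnx.load(onnx_path))

# Compare ONNX Runtime output against the PyTorch model for bs=1 and bs=2
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
for bs in (1, 2):
    x = torch.randn(bs, 3, 416, 608)
    with torch.no_grad():
        ref = model(x.to(device)).cpu().numpy()  # `model` / `device` from the script above
    out = sess.run(None, {"input": x.numpy()})[0]
    print(bs, np.abs(ref - out).max())

The maximum absolute differences are tiny, so the exported graph itself seems consistent with the PyTorch model.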

TRT Commands

Static batch size

trtexec --onnx=test_static_bs_1.onnx --saveEngine=test.engine --fp16
trtexec --onnx=test_static_bs_2.onnx --saveEngine=test.engine --fp16

Dynamic batch size

trtexec --onnx=model.onnx \
        --saveEngine=model.engine \
        --minShapes=input:1x3x416x608 \
        --optShapes=input:2x3x416x608 \
        --maxShapes=input:16x3x416x608 \
        --fp16

Inference

trtexec --loadEngine=file.engine --shapes=input:Nx3x416x608 --fp16 

Observations

  • Throughput drops almost 2× when moving from batch 1 (~70 QPS) to batch 2 (~37 QPS).
  • Latency roughly doubles with batch 2 (~28 ms vs ~14.8 ms).
  • GPU compute time scales linearly with batch size (~14 ms → ~27 ms), i.e., batching gives no per-image speed-up (a per-image conversion is sketched after this list).
  • H2D and D2H transfer times slightly increase with batch.
  • Enqueue time is similar across batch sizes.
  • Static vs dynamic batching shows almost identical performance, so it’s not the source of the issue.
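
For reference, here are the reported QPS numbers converted to per-image figures, under the interpretation from the linked answer that one trtexec "query" is one execution of the full batch:

# Per-image throughput, assuming one trtexec "query" = one execution at the full batch size
reported_qps = {1: 70.04, 2: 37.00}  # batch size -> QPS reported by trtexec
for bs, qps in reported_qps.items():
    print(f"bs={bs}: {qps * bs:.1f} images/s, {1000.0 / (qps * bs):.2f} ms/image")

Under that reading the per-image cost is essentially flat (~14.3 ms vs ~13.5 ms), which would mean the GPU is already close to saturation at bs=1 rather than there being an actual regression. But that hinges on how trtexec counts a query, which is exactly what I am unsure about (see the first question below).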

Questions

  • In this answer to a very similar issue, it was stated that “trtexec returns the runtime per inference, where an inference is a query of batch_size=N which you specified.”
    On the other hand, in this reply the QPS was multiplied by the batch size, which would imply a throughput improvement at bs > 1. So I am not sure whether trtexec reports the time per query of batch size N (in which case images/s = QPS × N), or the time per single image (in which case images/s = QPS).
  • Is it normal to observe such a severe performance drop between bs=1 and bs=2 when using batch sizes > 1?
  • Could this be caused by an issue in the ONNX export or TensorRT engine conversion? (A cross-check outside TensorRT is sketched after these questions.)
  • Are there any recommended steps or best practices to improve performance for dynamic batch sizes in TensorRT?
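
To help answer the last two questions, here is a cross-check I intend to run outside TensorRT: timing the original PyTorch model at different batch sizes. This is a minimal sketch, not a rigorous benchmark (it reuses `model` from the export scripts and assumes the model runs in FP16; if it does not, drop `.half()` and use float32 inputs):

import time
import torch

def time_batch(net, bs, iters=50):
    # Rough GPU latency for one forward pass at batch size `bs`
    x = torch.randn(bs, 3, 416, 608, device="cuda", dtype=torch.half)
    with torch.no_grad():
        for _ in range(10):  # warm-up
            net(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            net(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters

net = model.half().to("cuda").eval()  # `model` from the export scripts; assumes FP16 inference works
for bs in (1, 2, 4):
    ms = time_batch(net, bs)
    print(f"bs={bs}: {ms:.2f} ms/batch, {ms / bs:.2f} ms/image")

If ms/image stays roughly constant here as well, the near-linear scaling is a property of the model and the Orin NX (i.e., the GPU is already saturated at bs=1) rather than something introduced by the ONNX export or the TensorRT conversion.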