How can I optimize multi-batch and parallel inference in TensorRT for faster performance on high-resolution image patches?

nizamudeen · November 7, 2024, 8:00am

Description

I am encountering performance bottlenecks while running multi-threaded inference on high-resolution images using TensorRT. The pipeline splits each image into patches to fit within GPU memory, runs inference on each patch, and then merges the results. However, the inference time per patch is still high, even when increasing the batch size. Additionally, loading multiple engines onto the GPU to parallelize the inference does not yield the expected speedup. I am seeking advice on optimizing the inference process for faster execution, either by improving batch processing or by enabling better parallelism in TensorRT.
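For reference, the patch-and-batch step described above can be sketched in a few lines. This is only an illustration under assumptions: the helper names `tile_boxes` and `batches`, the 512x768 image size, and the batch size of 4 are hypothetical, and only the 256x256 patch size comes from the post.

```python
from typing import Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def tile_boxes(height: int, width: int, patch: int = 256) -> List[Box]:
    """Cover a height x width image with patch x patch boxes; edge boxes are clipped."""
    boxes = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            boxes.append((top, left, min(top + patch, height), min(left + patch, width)))
    return boxes


def batches(items: List[Box], batch_size: int) -> Iterator[List[Box]]:
    """Yield consecutive groups of at most batch_size boxes."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical 512x768 image, grouped into batches of 4 patches.
boxes = tile_boxes(512, 768, patch=256)
groups = list(batches(boxes, batch_size=4))
print(len(boxes), [len(g) for g in groups])  # prints: 6 [4, 2]
```

Each group would become one batched call to the engine, so the last group may be smaller than the others and the optimization profile has to allow that batch size.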

Environment

TensorRT Version: 10.5.0
GPU Type: RTX 3050 Ti 4GB
Nvidia Driver Version: 535.183.01
CUDA Version: 12.2
CUDNN Version: N/A
Operating System + Version: Ubuntu 20.04
Python Version: 3.11

Relevant Files

`build_engine.py`

    import os

    import tensorrt as trt

    # NOTE: min_batch, max_batch, and shape are not defined in the posted snippet;
    # they are presumably module-level values set elsewhere in the script.


    def build_engine(onnx_file_path, engine_file_path):
        logger = trt.Logger(trt.Logger.ERROR)
        builder = trt.Builder(logger)
        network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        profile = builder.create_optimization_profile()
        config = builder.create_builder_config()
        parser = trt.OnnxParser(network, logger)

        if not os.path.exists(onnx_file_path):
            print("Failed finding ONNX file!")
            return
        print("Succeeded finding ONNX file!")

        with open(onnx_file_path, 'rb') as model:
            if not parser.parse(model.read()):
                print('Failed parsing the ONNX file')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return
        print('Completed parsing of ONNX file')

        # Configure input profile: (min, opt, max) shapes differ only in batch size
        input_tensor = network.get_input(0)
        profile.set_shape(input_tensor.name,
                          (min_batch, shape[1], shape[2], shape[3]),
                          shape,
                          (max_batch, shape[1], shape[2], shape[3]))
        config.add_optimization_profile(profile)

        # Build the serialized engine
        engine_string = builder.build_serialized_network(network, config)
        if engine_string is None:
            print("Failed building engine!")
            return
        print("Succeeded building engine!")

        with open(engine_file_path, "wb") as f:
            f.write(engine_string)
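The snippet above does not show how `shape`, `min_batch`, and `max_batch` are defined. As a rough illustration of what the three arguments to `profile.set_shape` look like, the hypothetical helper below builds the (min, opt, max) shapes and checks the ordering that an optimization profile needs (min <= opt <= max in the batch dimension). The concrete values (3 channels, batch sizes 1 to 8, opt batch 4) are assumptions, not taken from the post.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, ...]


def profile_shapes(shape: Sequence[int], min_batch: int, max_batch: int) -> Tuple[Shape, Shape, Shape]:
    """Return (min, opt, max) shapes that differ only in the batch dimension."""
    opt = tuple(shape)
    if not (1 <= min_batch <= opt[0] <= max_batch):
        raise ValueError("expected 1 <= min_batch <= shape[0] <= max_batch")
    rest = opt[1:]
    return (min_batch, *rest), opt, (max_batch, *rest)


# Hypothetical values: 3-channel 256x256 patches, batch sizes 1..8, tuned for 4.
min_shape, opt_shape, max_shape = profile_shapes((4, 3, 256, 256), min_batch=1, max_batch=8)
print(min_shape, opt_shape, max_shape)
```

The batch size used at inference time has to fall inside the min to max range of the profile the engine was built with.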

`inference.py`

    import numpy as np
    import tensorrt as trt
    from cuda import cudart  # assumed: the cuda-python package


    class TRTModel:
        def __init__(self, trt_path):
            self.trt_path = trt_path
            trt.init_libnvinfer_plugins(None, "")
            self.logger = trt.Logger(trt.Logger.ERROR)
            with open(self.trt_path, "rb") as f:
                engine_data = f.read()
            self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(engine_data)

        def create_execution_context(self):
            return self.engine.create_execution_context()

        def process_async(self, input_data):
            # A new stream and execution context are created on every call
            _, stream = cudart.cudaStreamCreate()
            context = self.create_execution_context()

            # The output buffer is sized like the input (same shape assumed)
            input_size = input_data.nbytes
            output_size = input_data.nbytes
            input_device = cudart.cudaMallocAsync(input_size, stream)[1]
            output_device = cudart.cudaMallocAsync(output_size, stream)[1]

            # Host-to-device copy of the input tensor
            input_data_np = input_data.cpu().numpy()
            cudart.cudaMemcpyAsync(input_device, input_data_np.ctypes.data, input_data.nbytes,
                                   cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)

            # Bind the device buffers and enqueue inference on the stream
            context.set_tensor_address('images', int(input_device))
            context.set_tensor_address('output', int(output_device))
            context.execute_async_v3(stream_handle=int(stream))

            # Device-to-host copy of the result, then wait for the stream to finish
            output_host = np.empty_like(input_data_np, dtype=np.float32)
            cudart.cudaMemcpyAsync(output_host.ctypes.data, output_device, output_host.nbytes,
                                   cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
            cudart.cudaStreamSynchronize(stream)

            # Release per-call resources
            cudart.cudaFree(input_device)
            cudart.cudaFree(output_device)
            cudart.cudaStreamDestroy(stream)
            return output_host
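`process_async` above creates an execution context on every call. TensorRT execution contexts are not safe to share between threads, but a single engine can back several contexts, so one pattern worth trying is to build one context per worker thread and reuse it for every patch that thread handles. The sketch below shows only that pattern in plain Python: `PerThreadResource`, `run_all`, `factory`, and `infer` are hypothetical names, with `factory` standing in for something like `engine.create_execution_context` and `infer` for a function wrapping the `execute_async_v3` call. Whether this helps in this case would need to be measured.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerThreadResource:
    """Create one long-lived resource per worker thread and reuse it."""

    def __init__(self, factory):
        self._factory = factory        # hypothetical: builds a context, stream, buffers, ...
        self._local = threading.local()
        self._lock = threading.Lock()
        self.created = 0               # total number of resources built so far

    def get(self):
        res = getattr(self._local, "res", None)
        if res is None:
            # First call on this thread: build the resource once and cache it.
            res = self._factory()
            self._local.res = res
            with self._lock:
                self.created += 1
        return res


def run_all(patches, factory, infer, workers=2):
    """Run infer(resource, patch) for every patch, preserving input order."""
    resources = PerThreadResource(factory)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: infer(resources.get(), p), patches))
    return results, resources.created


# Stand-in usage: the "resource" is a dummy object and "inference" doubles the input.
results, created = run_all(list(range(10)), factory=object, infer=lambda res, p: p * 2)
print(results, created)
```

With two workers, at most two resources are built no matter how many patches are processed, instead of one per patch.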

Steps To Reproduce

Build the Engine: Use build_engine to convert an ONNX model into a TensorRT engine.
Run Inference: Use TRTModel to perform inference on cropped image patches.
Observed Result: Even as the batch size is increased, the inference time per patch remains high. Running multiple engines for parallel inference also does not improve performance.
Profiling Results:
- Transfer to device: 0.48 ms
- Inference time: 784.75 ms
- Transfer to host: 0.67 ms
- Total time for a single patch (256x256): 19-22 seconds on average
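A quick piece of arithmetic on the numbers above shows how they relate. The three measured stages add up to well under one second, while the reported total is 19 to 22 seconds, so most of the total falls outside those stages. The post does not say where that remaining time goes. The variable names below are illustrative only.

```python
# Per-stage timings copied from the profiling results (milliseconds).
stages_ms = {
    "transfer_to_device": 0.48,
    "inference": 784.75,
    "transfer_to_host": 0.67,
}
reported_total_s = (19.0, 22.0)  # reported total time per patch (seconds)

measured_ms = sum(stages_ms.values())
inference_share = stages_ms["inference"] / measured_ms
# Time in the reported total that the three measured stages do not cover.
unaccounted_s = tuple(t - measured_ms / 1000.0 for t in reported_total_s)

print(f"measured stages: {measured_ms:.2f} ms")
print(f"inference share of measured time: {inference_share:.1%}")
print(f"unaccounted: {unaccounted_s[0]:.1f}-{unaccounted_s[1]:.1f} s")
```

Within the measured stages, inference accounts for nearly all of the time and the host/device copies are negligible, so transfer overhead is unlikely to be the bottleneck.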

I am seeking optimization suggestions for improving multi-batch processing or multi-threaded parallel inference in TensorRT.

AakankshaS · November 30, 2024, 9:45am

Hi @nizamudeen,
Could you please share your ONNX model with us?

Thanks

nizamudeen · December 2, 2024, 5:49am
