Performance Difference Between Custom CUDNN Wrapper and PyTorch

I’ve implemented a wrapper function for calling the cuDNN convolution API, shown in the code snippet below. After running it and profiling with nsys, I noticed a significant performance gap compared to PyTorch’s implementation.

When profiling my implementation, I found that the core convolution computation takes more than 20 milliseconds:

cudnn_infer_ampere_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
Begins: 3.61082s
Ends: 3.6346s (+23.778 ms)
grid: <<<1024, 4, 2>>>  block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 49,152 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 126
Local Memory Per Thread: 0 bytes
Local Memory Total: 127,401,984 bytes
Shared Memory executed: 102,400 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 25 %
Launched from thread: 3428407
Latency: ←109.155 ms
Correlation ID: 2594
Stream: Stream 22

However, when I wrote the same convolution operation in PyTorch and profiled it, the kernel took only about 100 microseconds:

cudnn_ampere_scudnn_128x32_relu_small_nn_v1
Begins: 13.3173s
Ends: 13.3174s (+93.952 μs)
grid: <<<3136, 2, 1>>>  block: <<<64, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 5,376 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 128
Local Memory Per Thread: 0 bytes
Local Memory Total: 205,258,752 bytes
Shared Memory executed: 102,400 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 25 %
Launched from thread: 3417000
Latency: ←124.023 ms
Correlation ID: 2300
Stream: Default stream 7

This performance difference is more than two orders of magnitude!

I suspect that PyTorch’s backend has some compilation or tuning optimizations. However, looking at the kernel execution, PyTorch still ultimately calls pre-written cuDNN kernels rather than generating them through compilation techniques. The kernel names also show that a different algorithm was chosen in each case: a Winograd tile kernel in my run versus a small_nn kernel in PyTorch’s.
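One thing I considered (but have not verified) is that part of the gap comes from how the algorithm is chosen: my wrapper simply takes the first result of the heuristic cudnnGetConvolutionForwardAlgorithm_v7 query, while PyTorch’s benchmark mode, as far as I understand, actually times the candidates. Below is a rough sketch of measurement-based selection; it reuses the descriptors and device buffers my wrapper already sets up, and the helper name and workspace handling are hypothetical:

#include <cudnn.h>

// Sketch only: pick the forward algorithm by measuring rather than by heuristic.
// Note that cudnnFindConvolutionForwardAlgorithmEx runs the candidates for real,
// so it overwrites y and needs a scratch workspace big enough for every algorithm.
static cudnnConvolutionFwdAlgo_t pickFwdAlgoByTiming(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc, const void* x,
    cudnnFilterDescriptor_t wDesc, const void* w,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnTensorDescriptor_t yDesc, void* y,
    void* maxWorkspace, size_t maxWorkspaceSize) {
  int returned = 0;
  cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
  cudnnFindConvolutionForwardAlgorithmEx(
      handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
      CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &returned, perf,
      maxWorkspace, maxWorkspaceSize);
  // Results come back sorted by measured time; take the fastest one that succeeded.
  for (int i = 0; i < returned; ++i)
    if (perf[i].status == CUDNN_STATUS_SUCCESS)
      return perf[i].algo;
  return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;  // conservative fallback
}

I haven’t wired this in yet, so I don’t know how much of the gap it would actually close.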

Question:
Is it possible for my wrapper-function approach to achieve execution efficiency similar to PyTorch’s? If so, what additional work do I need to do?

Here’s my wrapper function implementation:

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuCudnnConv2dForward(
    int n, int c, int h, int w_in,                 // Input dimensions
    int k, int r, int s,                           // Kernel dimensions
    int pad_h, int pad_w,                          // Padding
    int stride_h, int stride_w,                    // Stride
    int dilation_h, int dilation_w,                // Dilation
    void* x_data, void* w_data, void* bias_data,   // Input, weight, and bias pointers
    void* y_data,                                  // Output pointer
    CUstream stream                                // CUDA stream
) {
  // Ensure global context
  mgpuEnsureContext();

  // Get cuDNN handle for this stream
  cudnnHandle_t handle = mgpuCudnnGetHandle(stream);

  // Create descriptors
  cudnnTensorDescriptor_t xDesc, yDesc, biasDesc;
  cudnnFilterDescriptor_t wDesc;
  cudnnConvolutionDescriptor_t convDesc;
  CUDNN_REPORT_IF_ERROR(cudnnCreateTensorDescriptor(&xDesc));
  CUDNN_REPORT_IF_ERROR(cudnnCreateFilterDescriptor(&wDesc));
  CUDNN_REPORT_IF_ERROR(cudnnCreateTensorDescriptor(&yDesc));
  CUDNN_REPORT_IF_ERROR(cudnnCreateTensorDescriptor(&biasDesc));
  CUDNN_REPORT_IF_ERROR(cudnnCreateConvolutionDescriptor(&convDesc));

  // Set input descriptor
  CUDNN_REPORT_IF_ERROR(cudnnSetTensor4dDescriptor(
      xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w_in));

  // Set weight descriptor
  CUDNN_REPORT_IF_ERROR(cudnnSetFilter4dDescriptor(
      wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, k, c, r, s));

  // Set convolution descriptor
  CUDNN_REPORT_IF_ERROR(cudnnSetConvolution2dDescriptor(
      convDesc, pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w,
      CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

  // Enable Tensor Cores
  CUDNN_REPORT_IF_ERROR(cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH));

  // Get output dimensions
  int out_n, out_c, out_h, out_w;
  CUDNN_REPORT_IF_ERROR(cudnnGetConvolution2dForwardOutputDim(
      convDesc, xDesc, wDesc, &out_n, &out_c, &out_h, &out_w));
  fprintf(stderr, "Output dimensions: n=%d, c=%d, h=%d, w=%d\n",
          out_n, out_c, out_h, out_w);

  CUDNN_REPORT_IF_ERROR(cudnnSetTensor4dDescriptor(
      yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, out_n, out_c, out_h, out_w));

  // Set bias descriptor (1 x K x 1 x 1, one value per output channel)
  CUDNN_REPORT_IF_ERROR(cudnnSetTensor4dDescriptor(
      biasDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, k, 1, 1));

  int requestedAlgoCount = 10;
  int returnedAlgoCount;
  cudnnConvolutionFwdAlgoPerf_t perfResults[10];
  CUDNN_REPORT_IF_ERROR(cudnnGetConvolutionForwardAlgorithm_v7(
      handle, xDesc, wDesc, convDesc, yDesc,
      requestedAlgoCount, &returnedAlgoCount, perfResults));
  cudnnConvolutionFwdAlgo_t algo = perfResults[0].algo;

  // Get workspace size
  size_t workspaceSize = 0;
  CUDNN_REPORT_IF_ERROR(cudnnGetConvolutionForwardWorkspaceSize(
      handle, xDesc, wDesc, convDesc, yDesc, algo, &workspaceSize));

  // Allocate workspace
  void* workspace = nullptr;
  if (workspaceSize > 0) {
    CUdeviceptr wsPtr = 0;
    CUDA_REPORT_IF_ERROR(cuMemAlloc(&wsPtr, workspaceSize));
    workspace = reinterpret_cast<void*>(wsPtr);
  }

  // Execute convolution
  const float alpha = 1.0f;
  const float beta = 0.0f;
  cudnnStatus_t status = cudnnConvolutionForward(
      handle, &alpha, xDesc, x_data, wDesc, w_data, convDesc, algo,
      workspace, workspaceSize, &beta, yDesc, y_data);

  // Report errors (if any)
  CUDNN_REPORT_IF_ERROR(status);

  // Add bias (if provided)
  if (bias_data != nullptr) {
    const float alpha_bias = 1.0f;
    const float beta_bias = 1.0f;
    CUDNN_REPORT_IF_ERROR(cudnnAddTensor(
        handle, &alpha_bias, biasDesc, bias_data, &beta_bias, yDesc, y_data));
  }

  // Free workspace
  if (workspace != nullptr) {
    CUDA_REPORT_IF_ERROR(cuMemFree(reinterpret_cast<CUdeviceptr>(workspace)));
  }

  // Clean up descriptors
  CUDNN_REPORT_IF_ERROR(cudnnDestroyTensorDescriptor(xDesc));
  CUDNN_REPORT_IF_ERROR(cudnnDestroyFilterDescriptor(wDesc));
  CUDNN_REPORT_IF_ERROR(cudnnDestroyTensorDescriptor(yDesc));
  CUDNN_REPORT_IF_ERROR(cudnnDestroyTensorDescriptor(biasDesc));
  CUDNN_REPORT_IF_ERROR(cudnnDestroyConvolutionDescriptor(convDesc));
}
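For context, this is roughly how the wrapper ends up being invoked. My real call site is not hand-written, so the driver below is only an illustrative stand-in; the shapes, buffer names, and the runConvOnce helper are placeholders (error checks omitted for brevity):

#include <cuda.h>

extern "C" void mgpuCudnnConv2dForward(int, int, int, int, int, int, int,
                                       int, int, int, int, int, int,
                                       void*, void*, void*, void*, CUstream);

// Run one 1x64x56x56 conv with 64 3x3 filters, pad 1, stride 1, dilation 1.
void runConvOnce(CUstream stream, const float* hostX, const float* hostW,
                 const float* hostBias, float* hostY) {
  const int n = 1, c = 64, h = 56, w = 56, k = 64, r = 3, s = 3;
  const size_t xBytes = (size_t)n * c * h * w * sizeof(float);
  const size_t wBytes = (size_t)k * c * r * s * sizeof(float);
  const size_t bBytes = (size_t)k * sizeof(float);
  const size_t yBytes = (size_t)n * k * h * w * sizeof(float);  // same H/W with pad 1

  // All tensors live in ordinary device memory and are filled with explicit copies.
  CUdeviceptr dX, dW, dB, dY;
  cuMemAlloc(&dX, xBytes);  cuMemAlloc(&dW, wBytes);
  cuMemAlloc(&dB, bBytes);  cuMemAlloc(&dY, yBytes);
  cuMemcpyHtoDAsync(dX, hostX, xBytes, stream);
  cuMemcpyHtoDAsync(dW, hostW, wBytes, stream);
  cuMemcpyHtoDAsync(dB, hostBias, bBytes, stream);

  mgpuCudnnConv2dForward(n, c, h, w, k, r, s,
                         /*pad=*/1, 1, /*stride=*/1, 1, /*dilation=*/1, 1,
                         (void*)dX, (void*)dW, (void*)dB, (void*)dY, stream);

  cuMemcpyDtoHAsync(hostY, dY, yBytes, stream);
  cuStreamSynchronize(stream);
  cuMemFree(dX); cuMemFree(dW); cuMemFree(dB); cuMemFree(dY);
}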

Any suggestions or insights would be greatly appreciated!

Hi @lilil111 ,
Let me fetch the detailed answer for this and get back to you.

Thanks

I’ve run into a similar issue before, and it troubled me as well.

I don’t know whether you have solved this problem yet.
In my case, though, I found something I had previously overlooked. When I called the cuDNN API myself, for example to run a convolution, I had allocated all of the input tensors in host-visible memory (memory that both the host and the device can access). That forced the kernel to keep transferring data across the bus while it executed, which caused a huge performance difference and left my own calls nowhere near the same order of magnitude as PyTorch.
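Concretely, I mean something like mapped (“zero-copy”) host memory versus plain device memory. A rough before/after sketch with placeholder names (not my exact code):

#include <cuda.h>

// Hypothetical helper: allocate one conv input of xBytes and fill it from hostX.
void allocateConvInput(size_t xBytes, const float* hostX, CUdeviceptr* outX) {
  // What I did originally: host-visible mapped memory. Every access the conv
  // kernel makes then goes over the PCIe bus, which is what made it so slow.
  //   float* mapped = nullptr;
  //   cuMemHostAlloc((void**)&mapped, xBytes, CU_MEMHOSTALLOC_DEVICEMAP);
  //   cuMemHostGetDevicePointer(outX, mapped, 0);

  // The fix: ordinary device memory, filled once with an explicit copy.
  cuMemAlloc(outX, xBytes);
  cuMemcpyHtoD(*outX, hostX, xBytes);
}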
After I fixed that, my own calls were still slower than PyTorch, but at least on the same order of magnitude. I think the remaining gap is understandable, because PyTorch does a lot of preprocessing and selects the best execution algorithm at runtime.
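If it helps, the kind of caching I have in mind looks roughly like the sketch below: decide the algorithm and workspace once per convolution configuration and reuse them, instead of redoing the query and the cuMemAlloc/cuMemFree on every call. This is only an illustration (ConvKey, ConvPlan, and getOrCreatePlan are names I made up, it is not thread-safe, and it is certainly not PyTorch’s actual mechanism); the descriptors could be cached the same way.

#include <cudnn.h>
#include <cuda.h>
#include <map>
#include <tuple>

struct ConvPlan {
  cudnnConvolutionFwdAlgo_t algo;
  size_t workspaceSize = 0;
  CUdeviceptr workspace = 0;   // reused across calls instead of alloc/free each time
};

// Key: n, c, h, w, k, r, s, pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w
using ConvKey = std::tuple<int, int, int, int, int, int, int,
                           int, int, int, int, int, int>;

static std::map<ConvKey, ConvPlan> gPlanCache;

ConvPlan& getOrCreatePlan(const ConvKey& key, cudnnHandle_t handle,
                          cudnnTensorDescriptor_t xDesc,
                          cudnnFilterDescriptor_t wDesc,
                          cudnnConvolutionDescriptor_t convDesc,
                          cudnnTensorDescriptor_t yDesc) {
  auto it = gPlanCache.find(key);
  if (it != gPlanCache.end())
    return it->second;                    // fast path: everything already decided

  ConvPlan plan;
  int returned = 0;
  cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
  cudnnGetConvolutionForwardAlgorithm_v7(handle, xDesc, wDesc, convDesc, yDesc,
                                         CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                         &returned, perf);
  plan.algo = perf[0].algo;
  cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc,
                                          plan.algo, &plan.workspaceSize);
  if (plan.workspaceSize > 0)
    cuMemAlloc(&plan.workspace, plan.workspaceSize);
  return gPlanCache.emplace(key, plan).first->second;
}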