I'm currently writing code for a system that has multiple processes, some of them CPU- and GPU-intensive. I noticed that the performance I achieve for CUDA kernel functions at runtime is noticeably worse (tens of percent slower) than the results when I simply run those functions by themselves. One of the causes I found is that calls to functions such as "std::this_thread::sleep_for" and "condition_variable::wait" cause subsequent runs of CUDA kernel functions to be much slower.
I created a code snippet for example:
CMakeLists.txt
cmake_minimum_required(VERSION 3.22)
project(CudaSlow)

# nvcc location; must be set before the CUDA language is enabled.
# (Prefer passing -DCMAKE_CUDA_COMPILER=... on the command line.)
set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
enable_language(CUDA)

find_package(Threads REQUIRED)
# The FindCUDA module (FIND_PACKAGE(CUDA ...), ${CUDA_LIBRARIES},
# ${CUDA_INCLUDE_DIRS}) is deprecated since CMake 3.10. With
# enable_language(CUDA), use the CUDAToolkit package and its
# imported targets instead.
find_package(CUDAToolkit 10.2 REQUIRED)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Keep host (C++) and device (CUDA) language standards in sync.
set(CMAKE_CUDA_STANDARD 17)

add_executable(CudaSlow main.cu)
# nvcc already provides the CUDA runtime include paths for .cu files;
# CUDA::cudart links the CUDA runtime library.
target_link_libraries(CudaSlow PUBLIC
        Threads::Threads
        CUDA::cudart
        )
main.cu
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <chrono>
#include <thread>

// Abort with file/line and the CUDA error string if a runtime call fails.
// Every CUDA API call returns cudaError_t; silently ignoring it hides
// failures (e.g. a bad launch config) until much later.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error %s:%d: %s\n", __FILE__,        \
                         __LINE__, cudaGetErrorString(err_));               \
            std::abort();                                                   \
        }                                                                   \
    } while (0)

// Element-wise product C[j] = A[j] * B[j].
// Intended for a single-block launch: each thread strides by blockDim.x
// over [threadIdx.x, N), so <<<1, T>>> covers all N elements.
__global__ void Func(float* A, float* B, float* C, int N) {
    for (int j = threadIdx.x; j < N; j += blockDim.x) {
        C[j] = A[j] * B[j];
    }
}

int main() {
    // Example function variables. The device buffers are deliberately left
    // uninitialized — only launch/sync latency is being measured, not the
    // numeric results.
    const int N = 100000;
    float* A;
    float* B;
    float* C;
    CUDA_CHECK(cudaMalloc(&A, N * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&B, N * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&C, N * sizeof(float)));

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    const int iterations = 2000;

    // Runs `iterations` launches and returns the average wall time per
    // launch + stream-synchronize, in seconds. When `sleepBetween` is true,
    // sleeps 1 ms after each synchronized iteration — outside the timed
    // region — to demonstrate the slowdown the sleep induces on later runs.
    // steady_clock is used because it is monotonic; system_clock can jump
    // (NTP adjustments) and is unsuitable for interval timing.
    auto averageSeconds = [&](bool sleepBetween) {
        double totalTime = 0;
        for (int i = 0; i < iterations; ++i) {
            auto start = std::chrono::steady_clock::now();
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            CUDA_CHECK(cudaGetLastError());  // catch launch-config errors
            CUDA_CHECK(cudaStreamSynchronize(stream));
            totalTime += std::chrono::duration_cast<std::chrono::duration<double>>(
                    std::chrono::steady_clock::now() - start).count();
            if (sleepBetween) {
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
        return totalTime / static_cast<double>(iterations);
    };

    // Run multiple iterations without std::this_thread::sleep_for
    std::cout << "Average time without std::this_thread::sleep_for is "
              << averageSeconds(false) << std::endl;

    // Run multiple iterations with std::this_thread::sleep_for
    std::cout << "Average time with std::this_thread::sleep_for is "
              << averageSeconds(true) << std::endl;

    // Release device resources (the original leaked these).
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(A));
    CUDA_CHECK(cudaFree(B));
    CUDA_CHECK(cudaFree(C));
    return 0;
}
The output when I run this program on a GeForce RTX 3090 with Cuda 11.4:
Average time without std::this_thread::sleep_for is 7.91676e-05 Average time with std::this_thread::sleep_for is 0.000129487