Why does cublasSgemm use `f16` for `float`?

It uses something called “automatic down-conversion”: when the math mode is set to `CUBLAS_TENSOR_OP_MATH`, the library is allowed to convert your FP32 inputs to FP16 so the GEMM can run on Tensor Cores.

I don’t have a detailed description, but I believe the expectation is that your input data must fit within the representable range of the FP16 type in order for the calculation to produce the expected results.
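For reference, the largest finite FP16 value is 65504, so anything larger overflows to infinity when down-converted. A minimal host-side sketch (assuming the `cuda_fp16.h` conversion helpers, which are host-callable in recent CUDA toolkits) illustrates this:

```cpp
#include <cuda_fp16.h>
#include <limits>
#include <iostream>
int main(){
  // 65504 is the largest finite FP16 value, so it survives the round trip:
  std::cout << __half2float(__float2half(65504.0f)) << std::endl;  // 65504
  // FLT_MAX (~3.4e38) is far outside the FP16 range and overflows to inf:
  std::cout << __half2float(__float2half(std::numeric_limits<float>::max()))
            << std::endl;  // inf
}
```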

Example:

```
$ cat t2208.cu
#include <cublas_v2.h>
#include <limits>
#include <iostream>

int main(){

  cublasHandle_t handle;
  cublasCreate(&handle);
#ifdef USE_TC
  cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
#endif
  float *h_X, *d_A = 0, *d_B = 0, *d_C = 0, alpha = 1.0f;
  const int N = 1024;
  const int n2 = N*N;
  h_X = new float[n2]();
  cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0]));
  // A: 0.1 on the diagonal, zero everywhere else
  for (int i = 0; i < n2; i+=N+1) h_X[i] = 0.1f;
  cudaMemcpy(d_A, h_X, n2*sizeof(h_X[0]), cudaMemcpyHostToDevice);
  // B: every element set to FLT_MAX
  for (int i = 0; i < n2; i++) h_X[i] = std::numeric_limits<float>::max();
  cudaMemcpy(d_B, h_X, n2*sizeof(h_X[0]), cudaMemcpyHostToDevice);
  float beta = 0.0f;
  cublasStatus_t s = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                                 &alpha, d_A, N, d_B, N, &beta, d_C, N);
  cudaMemcpy(h_X, d_C, n2*sizeof(h_X[0]), cudaMemcpyDeviceToHost);
  std::cout << std::numeric_limits<float>::max() << std::endl;
  std::cout << "status: " << (int) s << std::endl;
  std::cout << "result[0]: " << h_X[0] << std::endl;
}
$ nvcc -o t2208 t2208.cu -lcublas
$ ./t2208
3.40282e+38
status: 0
result[0]: 3.40282e+37
$ nvcc -o t2208 t2208.cu -lcublas -DUSE_TC
$ ./t2208
3.40282e+38
status: 0
result[0]: nan
$
```
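The `nan` in the Tensor Core case follows directly from the down-conversion: `FLT_MAX` overflows to infinity in FP16, each dot product then multiplies that infinity by the zeros off the diagonal of `A`, and 0 × inf is NaN. In the default math mode the computation stays in FP32, where 0 × `FLT_MAX` is 0 and the result is 0.1 × `FLT_MAX` ≈ 3.4e+37, as expected.

Incidentally, `CUBLAS_TENSOR_OP_MATH` is deprecated in cuBLAS 11 and later in favor of explicit per-call compute types; if my understanding is correct, the same down-converting behavior can be requested via `cublasGemmEx` with `CUBLAS_COMPUTE_32F_FAST_16F`. A sketch, reusing the handle and buffers from the example above:

```cpp
// FP32 data in and out, but the library is allowed to down-convert the
// inputs to FP16 and use Tensor Cores for the multiply:
cublasStatus_t s = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                                &alpha, d_A, CUDA_R_32F, N,
                                        d_B, CUDA_R_32F, N,
                                &beta,  d_C, CUDA_R_32F, N,
                                CUBLAS_COMPUTE_32F_FAST_16F,
                                CUBLAS_GEMM_DEFAULT);
```

With `CUBLAS_COMPUTE_32F` instead, the multiply stays in FP32 and the overflow above does not occur.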

If you’d like to see an improvement in the documentation, please file a bug.