Memory errors on Tesla K20c, GTX Titan (but not on GTX 680)

So I ran into problems with several Tesla K20c GPUs in Linux machines like this one:

$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

This is my GPU (there are two GPUs per computer):

$ nvidia-smi -a
Driver Version                      : 331.67

GPU 0000:03:00.0
    Product Name                    : Tesla K20c
    ...
    FB Memory Usage
        Total                       : 4799 MiB
        Used                        : 12 MiB
        Free                        : 4787 MiB
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled

When I try to execute one of the sample applications, I run into a huge number of memory errors:

$ /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul -wA=100 -hA=100 -wB=100 -hB=100 | head
[Matrix Multiply Using CUDA] - Starting...
GPU Device 1: "Tesla K20c" with compute capability 3.5
MatrixA(100,100), MatrixB(100,100)
Computing result using CUDA Kernel...
done
Performance= 91.03 GFlop/s, Time= 0.022 msec, Size= 2000000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Error! Matrix[00000]=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
Error! Matrix[00001]=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
Error! Matrix[00002]=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
...
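
To be clear about what that check means: the sample compares every element of the result matrix against a fixed reference value with a relative-error tolerance of 1e-6, and here a huge number of elements are wildly off. A rough illustration of that kind of per-element check (my own sketch; the names, reference value and tolerance are assumptions, not the sample's source code):

#include <math.h>
#include <stdio.h>

/* Rough illustration only: a per-element comparison against a known
 * reference value with a relative-error tolerance, similar in spirit to
 * the check whose output is shown above. Not the sample's actual code. */
static int check_result(const float *c, int n, float ref, double eps)
{
    for (int i = 0; i < n; ++i) {
        double rel_err = fabs((double)c[i] - ref) / fabs(ref);
        if (rel_err > eps) {
            printf("Error! Matrix[%05d]=%.8f, ref=%.8f error term is > %E\n",
                   i, c[i], ref, eps);
            return 0;
        }
    }
    return 1;
}

int main(void)
{
    float c[100 * 100];
    for (int i = 0; i < 100 * 100; ++i)
        c[i] = 1.0f;                    /* stand-in for a 100x100 result */
    return check_result(c, 100 * 100, 1.0f, 1.0e-6) ? 0 : 1;
}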

If I run the same binary under cuda-memcheck, I get errors like this:

$ /usr/local/cuda/bin/cuda-memcheck /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul -wA=100 -hA=100 -wB=100 -hB=100 | head
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000158 in void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
=========     by thread (11,5,0) in block (0,0,0)
=========     Address 0xb00213bfc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:/usr/local/cuda/samples/0_Simple/matrixMul/matrixMul [0x1b5b8]

Please note: the same binary, the same Linux, the same NVIDIA driver and the same CUDA installation work flawlessly with a different GPU (GTX 680), even with larger matrices. Only the Tesla K20c and the GTX Titan show this problem in my system.
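
Since there are two GPUs per node, I also wonder whether device selection plays a role. To test one card in isolation, the process can be pinned to a specific device before any allocation; a minimal sketch of how that could be done (the device index passed on the command line is purely illustrative, not my actual mapping):

// Minimal sketch: enumerate the GPUs in the node and pin the process to one
// of them before doing any other CUDA work. The index is illustrative only.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }

    int dev = (argc > 1) ? atoi(argv[1]) : 0;
    err = cudaSetDevice(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d): %s\n", dev, cudaGetErrorString(err));
        return 1;
    }
    printf("Using device %d\n", dev);
    return 0;
}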

Also, the log file in /var/log/messages has a huge number of lines like this:

Jun 11 21:43:40 myhost kernel: [16942.564565] init: Handling drivers-device-added event
Jun 11 21:43:41 myhost kernel: [16942.641190] init: Handling drivers-device-removed event

And when I run some self-written CUDA code (a kind of hello-world example), cudaMemcpy returns this error when copying from device to host:

77: an illegal memory access was encountered 
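
For reference, a stripped-down version of such a hello-world test looks roughly like this (a sketch with an assumed trivial kernel, not my exact code); in my program it is the device-to-host cudaMemcpy at the end that returns error 77:

// Minimal sketch of the kind of "hello world" test that fails for me
// (assumed trivial kernel, not my exact code).
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main(void)
{
    const int n = 256;
    int h[n];
    for (int i = 0; i < n; ++i)
        h[i] = i;

    int *d = NULL;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<(n + 127) / 128, 128>>>(d, n);

    // In my actual program, the equivalent of this device-to-host copy is
    // the call that returns 77 ("an illegal memory access was encountered").
    cudaError_t err = cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        printf("%d: %s\n", (int)err, cudaGetErrorString(err));
    else
        printf("OK, h[1]=%d\n", h[1]);

    cudaFree(d);
    return 0;
}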

There are no ECC errors in nvidia-smi’s output.
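
Besides nvidia-smi, the ECC counters can also be read programmatically through NVML. A minimal sketch of such a check (assuming nvml.h from the driver package / GPU Deployment Kit is available and linking with -lnvidia-ml):

// Minimal sketch: read the volatile ECC error counters of every GPU via NVML.
// Assumes nvml.h is available; build with -lnvidia-ml.
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[64];
        unsigned long long corrected = 0, uncorrected = 0;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                                    NVML_VOLATILE_ECC, &corrected);
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                    NVML_VOLATILE_ECC, &uncorrected);

        printf("GPU %u (%s): corrected=%llu uncorrected=%llu\n",
               i, name, corrected, uncorrected);
    }

    nvmlShutdown();
    return 0;
}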

What does all this mean? Do I have a hardware defect? Is there some important configuration for dual-GPU nodes that I might have missed? Or could this be a driver bug?