Memory errors on Tesla K20c, GTX Titan (but not on GTX 680)

So I ran into problems with several Tesla K20c GPUs in Linux machines like this one:

$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

This is my GPU (there are two GPUs per computer):

$ nvidia-smi -a
Driver Version                      : 331.67

GPU 0000:03:00.0
    Product Name                    : Tesla K20c
    ...
    FB Memory Usage
        Total                       : 4799 MiB
        Used                        : 12 MiB
        Free                        : 4787 MiB
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled

When I try to execute one of the sample applications, I run into a huge number of memory errors:

$ /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul -wA=100 -hA=100 -wB=100 -hB=100 | head
[Matrix Multiply Using CUDA] - Starting...
GPU Device 1: "Tesla K20c" with compute capability 3.5
MatrixA(100,100), MatrixB(100,100)
Computing result using CUDA Kernel...
done
Performance= 91.03 GFlop/s, Time= 0.022 msec, Size= 2000000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Error! Matrix[00000]=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
Error! Matrix[00001]=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
Error! Matrix[00002]=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
...
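
To be clear about what that check means: the sample compares every element of the result matrix against a fixed reference value with a relative-error tolerance of 1e-6, and here a huge number of elements are wildly off. A rough illustration of that kind of per-element check (my own sketch; the names, reference value and tolerance are assumptions, not the sample's source code):

#include <math.h>
#include <stdio.h>

/* Rough illustration only: a per-element comparison against a known
 * reference value with a relative-error tolerance, similar in spirit to
 * the check whose output is shown above. Not the sample's actual code. */
static int check_result(const float *c, int n, float ref, double eps)
{
    for (int i = 0; i < n; ++i) {
        double rel_err = fabs((double)c[i] - ref) / fabs(ref);
        if (rel_err > eps) {
            printf("Error! Matrix[%05d]=%.8f, ref=%.8f error term is > %E\n",
                   i, c[i], ref, eps);
            return 0;
        }
    }
    return 1;
}

int main(void)
{
    float c[100 * 100];
    for (int i = 0; i < 100 * 100; ++i)
        c[i] = 1.0f;                    /* stand-in for a 100x100 result */
    return check_result(c, 100 * 100, 1.0f, 1.0e-6) ? 0 : 1;
}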

If I run the same binary under cuda-memcheck, I get errors like this:

$ /usr/local/cuda/bin/cuda-memcheck /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul -wA=100 -hA=100 -wB=100 -hB=100 | head
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000158 in void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
=========     by thread (11,5,0) in block (0,0,0)
=========     Address 0xb00213bfc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:/usr/local/cuda/samples/0_Simple/matrixMul/matrixMul [0x1b5b8]

Please note: the same binary, the same Linux, the same NVIDIA driver and the same CUDA installation work flawlessly with a different GPU (GTX 680), even with larger matrices. Only the Tesla K20c and the GTX Titan show this problem in my system.
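
Since there are two GPUs per node, I also wonder whether device selection plays a role. To test one card in isolation, the process can be pinned to a specific device before any allocation; a minimal sketch of how that could be done (the device index passed on the command line is purely illustrative, not my actual mapping):

// Minimal sketch: enumerate the GPUs in the node and pin the process to one
// of them before doing any other CUDA work. The index is illustrative only.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }

    int dev = (argc > 1) ? atoi(argv[1]) : 0;
    err = cudaSetDevice(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d): %s\n", dev, cudaGetErrorString(err));
        return 1;
    }
    printf("Using device %d\n", dev);
    return 0;
}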

Also, the log file in /var/log/messages has a huge number of lines like this:

Jun 11 21:43:40 myhost kernel: [16942.564565] init: Handling drivers-device-added event
Jun 11 21:43:41 myhost kernel: [16942.641190] init: Handling drivers-device-removed event

And when I run some self-written CUDA code (a kind of hello-world example), cudaMemcpy returns this error when copying from device to host:

77: an illegal memory access was encountered 
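
For reference, a stripped-down version of such a hello-world test looks roughly like this (a sketch with an assumed trivial kernel, not my exact code); in my program it is the device-to-host cudaMemcpy at the end that returns error 77:

// Minimal sketch of the kind of "hello world" test that fails for me
// (assumed trivial kernel, not my exact code).
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main(void)
{
    const int n = 256;
    int h[n];
    for (int i = 0; i < n; ++i)
        h[i] = i;

    int *d = NULL;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<(n + 127) / 128, 128>>>(d, n);

    // In my actual program, the equivalent of this device-to-host copy is
    // the call that returns 77 ("an illegal memory access was encountered").
    cudaError_t err = cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        printf("%d: %s\n", (int)err, cudaGetErrorString(err));
    else
        printf("OK, h[1]=%d\n", h[1]);

    cudaFree(d);
    return 0;
}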

There are no ECC errors in nvidia-smi’s output.
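
Besides nvidia-smi, the ECC counters can also be read programmatically through NVML. A minimal sketch of such a check (assuming nvml.h from the driver package / GPU Deployment Kit is available and linking with -lnvidia-ml):

// Minimal sketch: read the volatile ECC error counters of every GPU via NVML.
// Assumes nvml.h is available; build with -lnvidia-ml.
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[64];
        unsigned long long corrected = 0, uncorrected = 0;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                                    NVML_VOLATILE_ECC, &corrected);
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                    NVML_VOLATILE_ECC, &uncorrected);

        printf("GPU %u (%s): corrected=%llu uncorrected=%llu\n",
               i, name, corrected, uncorrected);
    }

    nvmlShutdown();
    return 0;
}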

What does all this mean? Do I have a hardware defect? Is there some important configuration for dual-GPU nodes that I might have missed? Or could this be a driver bug?