Curious to see how the different memory subsystems on the T10P worked, I took a memory benchmark that MisterAnderson42 posted over in the public CUDA forums and made several changes. Files are attached, and any bugs you find are probably mine, not his. :)
The benchmark tests three access patterns: random, broadcast, and linear. The source array is 256 ints. In “Random” mode, all threads in a block read the source array according to a random permutation. In “Broadcast” mode, all threads in a warp read the same array entry. In “Linear” mode, thread i reads element i. Bandwidth is computed for each pattern using const, shared, texture, and global memory as the source, with 10 kernel calls, 256 threads * 500 blocks, and 10000 reads per thread.
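For context, here is a minimal sketch of the global-memory case showing the three index patterns. This is not the attached read_test2.cu: the kernel and variable names, the toy permutation, and the fact that it only exercises global memory are all just for illustration.

// Minimal sketch of the three index patterns over a 256-int source array.
// (Not the attached read_test2.cu; names and the toy permutation are made up.)
#include <cstdio>
#include <cuda_runtime.h>

#define ARRAY_SIZE 256      // source array size, in ints
#define NUM_READS  10000    // reads per thread
#define THREADS    256
#define BLOCKS     500

enum Pattern { LINEAR = 0, BROADCAST = 1, RANDOM = 2 };

__global__ void read_global(const int *src, const int *perm, int *sink, int pattern)
{
    int tid   = threadIdx.x;
    int start = perm[tid];                      // one permutation lookup per thread
    int sum   = 0;
    for (int i = 0; i < NUM_READS; ++i) {
        int idx;
        if (pattern == LINEAR)
            idx = tid;                          // thread i reads element i
        else if (pattern == BROADCAST)
            idx = i & (ARRAY_SIZE - 1);         // whole warp reads the same element
        else
            idx = (start + i) & (ARRAY_SIZE - 1); // scattered according to the permutation
        sum += src[idx];
    }
    // Store something that depends on the reads so they are not optimized away.
    sink[blockIdx.x * blockDim.x + tid] = sum;
}

int main()
{
    int h_src[ARRAY_SIZE], h_perm[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; ++i) {
        h_src[i]  = i;
        h_perm[i] = (i * 97) & (ARRAY_SIZE - 1);   // 97 is coprime with 256, so this is a permutation
    }

    int *d_src, *d_perm, *d_sink;
    cudaMalloc(&d_src,  sizeof(h_src));
    cudaMalloc(&d_perm, sizeof(h_perm));
    cudaMalloc(&d_sink, THREADS * BLOCKS * sizeof(int));
    cudaMemcpy(d_src,  h_src,  sizeof(h_src),  cudaMemcpyHostToDevice);
    cudaMemcpy(d_perm, h_perm, sizeof(h_perm), cudaMemcpyHostToDevice);

    // Run each pattern once; time the launches with CUDA_PROFILE=1 and divide bytes read by gputime.
    for (int p = LINEAR; p <= RANDOM; ++p)
        read_global<<<BLOCKS, THREADS>>>(d_src, d_perm, d_sink, p);
    cudaDeviceSynchronize();

    printf("Bytes read per launch: %.2f GB\n",
           (double)BLOCKS * THREADS * NUM_READS * sizeof(int) / 1e9);

    cudaFree(d_src); cudaFree(d_perm); cudaFree(d_sink);
    return 0;
}

The real benchmark runs the same patterns with const, shared, and texture memory as well, which is where the interesting differences show up.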
(Being too lazy to do the data analysis inside the CUDA program, I wrote a Python script to parse the profiler output and make a table.)
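The parsing itself is nothing fancy. Here is a rough sketch of the idea (again, not the attached read_test2_summary.py); it assumes the old-style CUDA_PROFILE=1 log format, with method=[ ... ] gputime=[ ... ] fields and gputime in microseconds, and that the kernel name identifies the memory type and access pattern:

#!/usr/bin/env python
# Rough sketch of the parsing idea (not the attached read_test2_summary.py).
# Assumes log lines like:
#   method=[ _Z11read_linearPKiPi ] gputime=[ 1234.567 ] cputime=[ ... ]
# with gputime in microseconds.
import re
import sys

BYTES_PER_LAUNCH = 500 * 256 * 10000 * 4   # blocks * threads * reads * sizeof(int)

line_re = re.compile(r'method=\[ (\S+) \] gputime=\[ ([0-9.]+) \]')

totals = {}   # kernel name -> [total gputime in usec, number of launches]
for line in open(sys.argv[1]):
    m = line_re.search(line)
    if m is None or m.group(1).startswith('memcopy'):
        continue
    rec = totals.setdefault(m.group(1), [0.0, 0])
    rec[0] += float(m.group(2))
    rec[1] += 1

print('(All values in GB/sec)')
for name, (usec, launches) in sorted(totals.items()):
    gbytes = launches * BYTES_PER_LAUNCH / 1e9
    print('%-40s %6.1f' % (name, gbytes / (usec * 1e-6)))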
Here’s an output session showing the results for our 8800 GTX and T10P in the same computer:
[volsung@grad08 t10p]$ nvcc -o read_test2 read_test2.cu
[volsung@grad08 t10p]$ CUDA_PROFILE=1 ./read_test2 0
Selecting Device 0: GeForce 8800 GTX
[volsung@grad08 t10p]$ python read_test2_summary.py cuda_profile.log
(All values in GB/sec)
-------------------------------------------------
|  memtype  |  random  |  broadcast  |  linear  |
-------------------------------------------------
|   const   |    38.2  |      295.0  |    38.6  |
|   global  |     6.3  |        5.0  |    55.5  |
|   shmem   |   113.5  |      246.9  |   247.7  |
|    tex    |    35.0  |       72.9  |    72.8  |
-------------------------------------------------
Total Read: 51 GB

[volsung@grad08 t10p]$ CUDA_PROFILE=1 ./read_test2 1
Selecting Device 1: GT200
[volsung@grad08 t10p]$ python read_test2_summary.py cuda_profile.log
(All values in GB/sec)
-------------------------------------------------
|  memtype  |  random  |  broadcast  |  linear  |
-------------------------------------------------
|   const   |    45.2  |      326.8  |    45.2  |
|   global  |     5.5  |       68.6  |    48.0  |
|   shmem   |   130.1  |      272.4  |   276.0  |
|    tex    |    25.9  |      123.7  |   123.9  |
-------------------------------------------------
Total Read: 51 GB

Nothing here is unexpected based on the Programming Guide, but it is impressive to see the new memory transaction hardware tear through the global broadcast case. The random case still does poorly since the permutation is over 256 elements and memory transactions are issued in groups of 32 elements (128 bytes). I expected it to do a little better than it did on the T10P, since on average each memory transaction should be able to service more than one thread, due to a birthday-paradox sort of argument. Perhaps I should read section 5.1.2.1 more closely…
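(The back-of-the-envelope version of that argument: assuming transactions are issued per half-warp and the 16 indices behave like uniform random picks over the 256-element array, which spans 8 segments of 32 ints, the expected number of distinct segments touched is 8 × (1 − (7/8)^16) ≈ 7, so each 128-byte transaction would service roughly 16/7 ≈ 2 threads on average instead of just one.)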
[BTW, NVIDIA, please, please pester your forum admins to permit .cu file attachments! .py would be nice too.]
read_test2_summary.py.txt (2.01 KB)
read_test2.cu.txt (5.7 KB)