The data read into j is never used anywhere, so without the “-G” flag the memory read is just optimized away.
Use it to calculate something you write out at the end of the kernel.
Also, the placement of the clock accesses in the compiled code is probably not what you expect (the compiler will happily reorder memory reads around them). Use "cuobjdump -sass test" to look at the compiled code, and experiment with barriers, dependencies, and asm volatile(""); and asm volatile("" ::: "memory"); statements until you get the code you want to time.
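For illustration, something along these lines might work (kernel and variable names here are made up, and whether the clock reads actually end up around the load still has to be verified in the SASS):

// Sketch only: the loaded value feeds a sum that is written out at the end,
// and empty asm statements with a "memory" clobber act as compiler barriers
// so the load cannot be moved out of the timed region.
__global__ void timed_loads(unsigned long long *timer, unsigned int *data,
                            unsigned int *out)
{
    unsigned int sum = 0;
    for (int i = 0; i < 512; ++i) {
        asm volatile("" ::: "memory");
        unsigned long long start = clock64();
        unsigned int v = data[i];              // the access being timed
        asm volatile("" ::: "memory");
        unsigned long long end = clock64();
        sum += v;                              // consume the value so the load survives
        timer[i] = end - start;
    }
    *out = sum;                                // write the result out at the end of the kernel
}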
Hi tera,
Thanks for the reply, I have tried what you proposed, and the kernel code now looks like this:
__global__ void clock_test(UINT64 *timer, UINT32 *data, UINT32 *data_copy)
{
    register volatile UINT32 i, j = 0;
    register volatile UINT64 start, end;

    for (i = 0; i < ELEMENTS; ++i) {
        asm volatile("membar.cta;");
        start = clock64();
        j = data[i];
        asm volatile("membar.cta;");
        end = clock64();
        data_copy[i] = j;
        timer[i] = end - start;
    }
}

When I run the code (compiled without -G), I get this:
0 1357
1 539
2 540
3 539
4 540
5 540
6 540
7 539
8 539
9 540
10 540
11 540
12 539
13 540
14 540
15 539
16 540
17 540
18 540
19 540
20 540
21 539
22 540
23 540
24 539
25 540
26 540
27 539
28 540
29 540
30 539

It shouldn't be like this. I tried barriers at different places in the code, but the values stay constant no matter how many barriers I add. I don't get what is wrong…
These results look sensible to me. What do you think is suspicious?
If you want to discuss this further, please also show the code for how you call the kernel and what it is you are printing out.
Hi tera,
The idea is the following: I want to expose the L1 cache line size. To do that, I read values from L1, do something with them, and measure the latency. It should be almost constant while reading within a cache line, then a miss should occur, and this should repeat every N elements; that N indicates the cache line size (for example, with 4-byte elements and STRIDE 1, a 32-byte line would show a miss every 8 elements). I need to verify the cache line size for optimizations, and I do not get a pattern or "real latencies".
The whole code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define STRIDE 1
#define ELEMENTS 512
#define UINT64 unsigned long long
#define UINT32 unsigned int

__global__ void clock_test(UINT64 *timer, UINT32 *data, UINT32 *data_copy)
{
    register volatile UINT32 i, j = 0;
    register volatile UINT64 start, end;

    for (i = 0; i < ELEMENTS; ++i) {
        asm volatile("membar.cta;");
        start = clock64();
        j = data[i];
        asm volatile("membar.cta;");
        end = clock64();
        data_copy[i] = j;
        timer[i] = end - start;
    }
}

int main(int argc, char **argv)
{
    UINT64 h_duration[ELEMENTS];
    UINT32 h_data[ELEMENTS];
    UINT32 h_data_copy[ELEMENTS];

    printf("%d\n", ELEMENTS);

    UINT64 *d_duration;
    cudaError_t error_id;
    error_id = cudaMalloc(&d_duration, ELEMENTS * sizeof(UINT64));
    if (error_id != cudaSuccess) {
        printf("Could not allocate d_duration %s\n", cudaGetErrorString(error_id));
    }

    UINT32 *d_data;
    error_id = cudaMalloc(&d_data, ELEMENTS * sizeof(UINT32));
    if (error_id != cudaSuccess) {
        printf("Could not allocate d_data %s\n", cudaGetErrorString(error_id));
    }

    UINT32 *d_data_copy;
    error_id = cudaMalloc(&d_data_copy, ELEMENTS * sizeof(UINT32));
    if (error_id != cudaSuccess) {
        printf("Could not allocate d_data_copy %s\n", cudaGetErrorString(error_id));
    }

    for (int i = 0; i < ELEMENTS; ++i) {
        h_data[i] = (i + STRIDE) % ELEMENTS;
    }

    cudaMemcpy((void *) d_data, (void *) h_data, ELEMENTS * sizeof(UINT32), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    dim3 Db = dim3(1);
    dim3 Dg = dim3(1, 1, 1);
    clock_test<<<Dg, Db>>>(d_duration, d_data, d_data_copy);
    cudaDeviceSynchronize();

    cudaMemcpy((void *) h_duration, (void *) d_duration, ELEMENTS * sizeof(UINT64), cudaMemcpyDeviceToHost);
    cudaMemcpy((void *) h_data_copy, (void *) d_data_copy, ELEMENTS * sizeof(UINT32), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    for (int i = 0; i < ELEMENTS; ++i) {
        printf("%d %llu\n", i, h_duration[i]);
        if (h_data[i] != h_data_copy[i]) {
            printf("TEST FAILED !");
            exit(-1);
        }
    }
}

Output:
0 1338
1 533
2 534
3 534
4 532
5 532
6 532
7 531
8 532
9 532
10 532
11 532
12 534
13 532
14 531
15 531
16 531
17 531
18 534
19 534
20 532
21 532
22 532
23 532
24 532
25 534
26 534
27 534
28 532
29 532
30 532
31 534
32 534
33 532
34 531
35 534
36 534
37 532
38 531
39 531
40 531
41 532
42 532
43 532
44 531
45 532
46 534

How the output should look theoretically:
0 670
1 115
2 90
3 91
4 90
5 91
6 92
7 90
8 170
9 90
10 90
11 92
12 91
13 90
14 92
15 90
16 90
17 170
18 90

But it does not happen: there are too many clock cycles in the output, and they are constant. They shouldn't be, since I am iterating over data that should be in L1. The output is the clock difference before and after j = data[i];, and it shouldn't be that big. Am I missing something?
What you are seeing is global memory latency on the first access, then L2 latency for the following ones.
You need to explicitly opt-in to using the L1 cache by adding “-Xptxas -dlcm=ca” to the nvcc command line.
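For example, a compile line along these lines (the source and output file names are just placeholders):

nvcc -Xptxas -dlcm=ca -o clock_test clock_test.cu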
Are you sure that you really need to measure that yourself instead of reading about it? GPUs are different from CPUs, so if you just use your CPU experience to explore the GPU, this may fail.
My own list of low-level benchmarks:
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Demystifying GPU Microarchitecture through Microbenchmarking | stuffedcow
http://asg.ict.ac.cn/dgemm/microbenchs.tar.gz
Understanding the ISA impact on GPU Architecture.
https://hal.inria.fr/file/index/docid/789958/filename/112_Lai.pdf
Dissecting GPU Memory Hierarchy through Microbenchmarking | hgpu.org
Understanding Latency Hiding on GPUs by Vasily Volkov | hgpu.org
full list here:
Hi guys,
So I have tried with "-Xptxas -dlcm=ca", and indeed with this flag I do get the pattern I was expecting; it seems that by default nvcc is using "-Xptxas -dlcm=cg". But the latency is still high… I don't get it… it's 1 block, one thread, so that I can expose the latency, not hide it.
Thx for all the links BulatZiganshin. I think that although GPUs are different, the cache memory should be similar in structure to the CPU's; access time and speed may differ. If I am wrong please correct me, but with "-Xptxas -dlcm=ca" the pattern confirms the initial theory.
This confirms that the L1 cache line is 32 B, which is exactly 8 vector elements (8 × 4-byte UINT32 values). The only issue that remains is the latency: should it be that high?
The output with “-Xptxas -dlcm=ca”:
0 1171
1 369
2 370
3 369
4 369
5 370
6 369
7 369
8 515   ---> miss
9 369
10 369
11 369
12 369
13 369
14 369
15 369
16 515   ---> miss
17 369
18 370
19 369
20 370
21 369
22 369
23 369
24 515   ---> miss
25 370

I will go over the documentation and see if I can better understand this.
Thx
Yes, it can be high, since GPUs prioritize throughput over latency and can run 16 threads per core. That said, Maxwell L1 cache latency AFAIR is 30-50 cycles, as was measured in one of these papers.
Hello,
I've been doing some modifications to the kernel code and observed the following behavior. When I declare the "j" variable as a plain register (removing volatile), the clock changes significantly: the latency drops from ~240 to 75 (and stays constant). From what I have read in other topics, if volatile is not used, the compiler is free to optimize the reads/writes to the variable by caching the data in a local register. I'm still confused about whether to use volatile or not.
register volatile UINT32 i;
register volatile UINT32 j = 0;
clock_t start, end;

for (int k = 0; k < ITERATIONS; k++) {
    for (i = 0; i < ELEMENTS; ++i) {
        start = clock64();
        asm volatile("ld.cg.u32 %0, [%1];" : "=r"(j) : "l"(&data[j]));
        end = clock64();
        data_copy[i] = j;
        timer[i] = end - start;
    }
}

As the "register" keyword already indicates, you want j to be in a register rather than in memory, so you don't produce additional memory accesses beyond the one you want to measure. So the "volatile" modifier on it should go.
On a side note, the “register” keyword is just ignored by the compiler. It will place all variables in registers anyway (unless it’s not able to, in which case even the register keyword will not change that).
You should really look at the generated code as displayed by "cuobjdump -sass". Without checking what you are actually timing, microbenchmarking really is meaningless.
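A rough sketch of what the measurement loop could look like with volatile dropped from j (illustrative only, reusing the ELEMENTS/data/timer names from the code above; the actual instruction ordering still needs to be confirmed with cuobjdump -sass):

// Sketch: same measurement loop, but with j as a plain register variable,
// and 64-bit types for start/end since clock64() returns a 64-bit value.
__global__ void clock_test_plain(unsigned long long *timer,
                                 unsigned int *data,
                                 unsigned int *data_copy)
{
    unsigned int j = 0;                       // plain register variable, no volatile
    for (unsigned int i = 0; i < ELEMENTS; ++i) {
        unsigned long long start = clock64();
        asm volatile("ld.cg.u32 %0, [%1];" : "=r"(j) : "l"(&data[j]));
        unsigned long long end = clock64();
        data_copy[i] = j;                     // keep consuming the loaded value
        timer[i] = end - start;
    }
}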
Hi tera,
I see, thx a lot for the info, will do that!