I wanted to optimize my application, so I used the profiler to get some information, especially about memory access.
I saw that there were many uncoalesced global loads (gld), so I read the Programming Guide and many other tutorials.
They all said that a memory access is coalesced if:
- each thread accesses a 4-byte, 8-byte, or 16-byte word
- the memory is laid out so that the next thread accesses the next memory block (so there is no space between the blocks)
- the first thread accesses an address that is a multiple of 64 bytes (N*64B)
So could anybody tell me what is wrong with my code (or my mind)?
    struct testStruct
    {
        int x;
        int y;
        ON_CUDA testStruct(){}
        ON_CUDA testStruct(int x, int y = 0, int z = 0) : x(x), y(y){}
    };

    __global__ static void testScale(testStruct *ret0)
    {
        const int bid = blockIdx.x;
        const int tid = threadIdx.x;
        testStruct ts(ret0[bid * blockDim.x + tid]);
        int ret = tid * bid + tid - bid + ts.x;
        for (int i = 1; i < 1; i++)
        {
            for (int j = i; j < 1; j++)
            {
                ret += i;
                ret *= j;
                ret %= 99991;
                ret++;
                ret <<= 1;
            }
        }
        testStruct retS(ret, 0);
        ret0[bid * blockDim.x + tid] = retS;
    }

The profiler says:
gld uncoalesced: 26880
gst uncoalesced: 107520
with testScale<<<105, 256>>>
I think the alignment is not the problem, and the data size seems to be okay at 2 x 4 bytes = 8 bytes per struct.
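To double-check my reasoning, here is a rough host-side sketch of the three conditions as I listed them above. It is only my reading of the rules, not the hardware's exact behaviour. The names `TestStructLayout`, `meetsListedRules` and `countRuleViolations` are made up for this post, the group size of 16 assumes coalescing is decided per half-warp on my GPU, and `base` stands for the byte address of `ret0`, which I assume is at least 64-byte aligned.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same data layout as testStruct in my kernel: two 4-byte ints.
struct TestStructLayout { int x; int y; };

// Hypothetical helper (not a CUDA API): checks the three listed conditions
// for one group of threads, given the byte address each thread accesses.
bool meetsListedRules(const std::vector<std::uintptr_t> &addr, std::size_t wordSize)
{
    // Condition 1: each thread accesses a 4-, 8- or 16-byte word.
    if (wordSize != 4 && wordSize != 8 && wordSize != 16) return false;
    // Condition 3: the first thread's address is a multiple of 64 bytes.
    if (addr.empty() || addr[0] % 64 != 0) return false;
    // Condition 2: thread k accesses the k-th word after the first, no gaps.
    for (std::size_t k = 1; k < addr.size(); ++k)
        if (addr[k] != addr[0] + k * wordSize) return false;
    return true;
}

// Counts the thread groups that break the listed rules when thread (bid, tid)
// accesses element bid * threadsPerBlock + tid, as in my kernel.
// groupSize = 16 is an assumption (half-warp).
std::size_t countRuleViolations(std::uintptr_t base, std::size_t numBlocks,
                                std::size_t threadsPerBlock, std::size_t wordSize,
                                std::size_t groupSize = 16)
{
    std::size_t violations = 0;
    std::vector<std::uintptr_t> addr(groupSize);
    for (std::size_t bid = 0; bid < numBlocks; ++bid) {
        for (std::size_t tid0 = 0; tid0 + groupSize <= threadsPerBlock; tid0 += groupSize) {
            for (std::size_t k = 0; k < groupSize; ++k)
                addr[k] = base + (bid * threadsPerBlock + tid0 + k) * wordSize;
            if (!meetsListedRules(addr, wordSize)) ++violations;
        }
    }
    return violations;
}
```

Calling `countRuleViolations(0, 105, 256, sizeof(TestStructLayout))` returns 0 for my launch configuration, so by these three rules alone every group should be fine. That is why I suspect I am missing some other condition.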
I am using Windows XP (32-bit) on a GeForce 8600M GS with CUDA 2.3.