When compiling my code I noticed that, for some reason, it uses local memory.
My register usage is well below the limit, so it is not a register-spilling problem. I am also not indexing registers dynamically, nor am I using any local arrays.
While searching for the cause I narrowed it down to the following construct:
int4 data = context.global_array[threadIdx.x];

Here context is a struct of pointers to global memory, which is passed as a parameter to the kernel (and hence resides in shared memory).
If I replace the above with the following:
int4 data;
data.x = context.global_array[threadIdx.x].x;
data.y = context.global_array[threadIdx.x].y;
data.z = context.global_array[threadIdx.x].z;
data.w = context.global_array[threadIdx.x].w;

then suddenly no local memory is used at all. What can be the reason, and how can I avoid it?
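For context, here is a minimal sketch of the setup; the struct name, kernel signature, and extra members are made up for illustration, but the access pattern matches what I described:

```cuda
#include <cuda_runtime.h>

// Hypothetical context struct: a bundle of device pointers
// passed to the kernel by value.
struct Context {
    int4 *global_array;   // allocated with cudaMalloc
    // ... other device pointers ...
};

__global__ void kernel(Context context)
{
    // Variant 1: single vector load -- this is the form that
    // ends up going through local memory in my build.
    int4 data = context.global_array[threadIdx.x];

    // Variant 2: component-wise loads -- with this form,
    // no local memory is used:
    // int4 data;
    // data.x = context.global_array[threadIdx.x].x;
    // data.y = context.global_array[threadIdx.x].y;
    // data.z = context.global_array[threadIdx.x].z;
    // data.w = context.global_array[threadIdx.x].w;

    // ... use data ...
}
```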
My suspicion is that the compiler does not know whether the pointer context.global_array is aligned, so it does not know whether it can use a 16-byte-wide load instruction, and somehow this forces the load to go through local memory instead of registers.
However, the pointer I am using here is a value returned by cudaMalloc and should be well aligned; the compiler simply does not know that when the kernel is compiled.
If that is the case, how can I inform the compiler that a 16-byte-wide load instruction is safe at this point?
If that is not the case, what am I doing wrong and how can it be avoided?
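For example, is there a supported way to express something like the following? (This is only a guess at a mechanism; I am not sure the builtin is honored in device code, or that alignment is even the issue.)

```cuda
__global__ void kernel(Context context)
{
    // Hypothetical: promise the compiler that the pointer is
    // 16-byte aligned, so the int4 access can be a single
    // 128-bit load straight into registers.
    int4 *p = (int4 *)__builtin_assume_aligned(context.global_array, 16);
    int4 data = p[threadIdx.x];

    // ... use data ...
}
```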