Why does the L1 cache always hit on the first access?

I ran the following code on an Orin-X with JetPack 6.2.1.

In the kernel, time_record should hold the latency of the GPU accessing global memory for indices 0 to iterations/2, and the latency of accessing the L1 cache for indices iterations/2 to the end.

#include <stdio.h>
#include <stdlib.h>

__global__ void l1_cache_size_test_kernel(unsigned int * my_array, unsigned int * time_record,
                                          unsigned int * index_record, int iterations, int * sum)
{
    int i;
    volatile int j = 0;
    int tmp = 0;
    *sum = 0;
    j = 0;
    for (i = 0; i < iterations/2; i++) {
        unsigned int start_time = clock();
        j = my_array[j];
        // s_index[i % 100] = j;
        tmp += j;
        unsigned int end_time = clock();
        // __threadfence();
        time_record[i] = end_time - start_time;
        // index_record[i] = j;
        // __threadfence();
    }
    __threadfence();
    j = 0;
    for (i = iterations/2; i < iterations; i++) {
        unsigned int start_time = clock();
        j = my_array[j];
        // s_index[i % 100] = j;
        tmp += j;
        unsigned int end_time = clock();
        // __threadfence();
        time_record[i] = end_time - start_time;
        // index_record[i] = j;
        // __threadfence();
    }
    *sum = tmp;
}

int main()
{
    int i;
    int stride = 1;
    int array_size = 36870;

    size_t Value;
    cudaDeviceGetLimit(&Value, cudaLimitMaxL2FetchGranularity);
    printf("L2 fetch granularity before set: %lu\n", Value);
    cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, 32);
    cudaDeviceGetLimit(&Value, cudaLimitMaxL2FetchGranularity);
    printf("L2 fetch granularity after set: %lu\n", Value);

    int * sum_d;
    int * sum_h = (int *)malloc(1 * sizeof(unsigned int));
    unsigned int * A_h = (unsigned int *)malloc(array_size * sizeof(unsigned int));
    unsigned int * A_d;
    cudaMalloc(&A_d, array_size * sizeof(unsigned int));
    cudaMalloc(&sum_d, 1 * sizeof(unsigned int));
    for (i = 0; i < array_size; i++) {
        A_h[i] = (i + stride) % array_size;
    }
    cudaMemcpy(A_d, A_h, array_size * sizeof(unsigned int), cudaMemcpyHostToDevice);

    unsigned int * time_record_h = (unsigned int *)malloc(array_size * sizeof(unsigned int));
    unsigned int * index_record_h = (unsigned int *)malloc(array_size * sizeof(unsigned int));
    unsigned int * time_record_d;
    unsigned int * index_record_d;
    cudaMalloc(&time_record_d, array_size * sizeof(unsigned int));
    cudaMalloc(&index_record_d, array_size * sizeof(unsigned int));

    // execute l1 cache size test
    l1_cache_size_test_kernel<<<1, 1>>>(A_d, time_record_d, index_record_d, array_size, sum_d);
    cudaDeviceSynchronize();
    cudaMemcpy(time_record_h, time_record_d, array_size * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(index_record_h, index_record_d, array_size * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(sum_h, sum_d, 1 * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    int max_time = 0, max_time_index = 0;
    for (i = 0; i < array_size; i++) {
        if (time_record_h[i] > max_time) {
            max_time = time_record_h[i];
            max_time_index = i;
        }
        if (time_record_h[i] > 0) {
            printf("%d: index %u, time %u\n", i, index_record_h[i], time_record_h[i]);
        }
    }
    printf("max_time: %d, index %d\n", max_time, max_time_index);
    printf("sum_h: %d\n", *sum_h);

    free(sum_h);
    free(A_h);
    free(time_record_h);
    cudaFree(A_d);
    cudaFree(time_record_d);
    cudaFree(sum_d);
    return 0;
}
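For reference, the program can be built and run with something like the following (l1_test.cu is just a placeholder file name; sm_87 is the architecture of the Orin GPU):

nvcc -arch=sm_87 -o l1_test l1_test.cu
./l1_test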

But the result is shown below (the index column can be ignored). The first value of time_record is 30, which suggests the SM hit the L1 cache. So I have a few questions:

  1. What happened at the first access?
  2. Why does the time to access the L1 cache differ? It is always either 19 or 42.
  3. As far as I know, the L1 cache size on Orin-X is 144 KB. From the log I can conclude that the SM accesses the L2 cache in the second for loop of the kernel. Why doesn't it access the L1 cache?

Furthermore, when I uncomment the index_record writes (kernel code shown below),

__global__ void l1_cache_size_test_kernel(unsigned int * my_array, unsigned int * time_record,
                                          unsigned int * index_record, int iterations, int * sum)
{
    int i;
    volatile int j = 0;
    int tmp = 0;
    *sum = 0;
    j = 0;
    for (i = 0; i < iterations/2; i++) {
        unsigned int start_time = clock();
        j = my_array[j];
        tmp += j;
        unsigned int end_time = clock();
        // __threadfence();
        time_record[i] = end_time - start_time;
        index_record[i] = j;
        // __threadfence();
    }
    __threadfence();
    j = 0;
    for (i = iterations/2; i < iterations; i++) {
        unsigned int start_time = clock();
        j = my_array[j];
        tmp += j;
        unsigned int end_time = clock();
        // __threadfence();
        time_record[i] = end_time - start_time;
        index_record[i] = j;
        // __threadfence();
    }
    *sum = tmp;
}

the result becomes different. From the log I can conclude that the SM hits the L1 cache in the first for loop and the L2 cache in the second loop. Why does the L1 cache hit in the first loop when there is no warm-up (preheat) operation?

There is no guarantee that the compiler has not moved instructions around, into or out of the timing region, so it is necessary to study the SASS to have a confident understanding of what is actually happening.

The generated SASS is not measuring what you expect and there is no method in the compiler today to force it to do what you want.
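For reference, one way to dump the SASS for inspection (l1_test.cu is a placeholder file name):

nvcc -arch=sm_80 -cubin -o l1_test.cubin l1_test.cu
cuobjdump -sass l1_test.cubin      # or: nvdisasm l1_test.cubin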

I simplified the kernel to just the first loop and dumped the SASS for sm_80.

__global__ void l1_cache_size_test_kernel(unsigned int * my_array, unsigned int * time_record,
                                          unsigned int * index_record, int iterations, int * sum)
{
    int i;
    volatile int j = 0;
    int tmp = 0;
    *sum = 0;
    j = 0;
    for (i = 0; i < 2; i++) {
        unsigned int start_time = clock();
        j = my_array[j];
        // s_index[i % 100] = j;
        tmp += j;
        unsigned int end_time = clock();
        // __threadfence();
        time_record[i] = end_time - start_time;
        // index_record[i] = j;
        // __threadfence();
    }
    *sum = tmp;
}

The two timing ranges have numerous issues.

For some reason the start_time = clock() is being converted to an S2UR, which is a variable-latency instruction. The DEPBAR.LE verifies that the S2UR has returned. Additional instructions for the address calculation and the LDG are included in the timing region. The CS2R after the LDG reads the timestamp after the LDG has been issued, but with no dependency on the LDG having completed. See the R0 read-to-use bar on the right of the listing below.

        MOV R1, c[0x0][0x28]
        MOV R2, c[0x0][0x180]
        ULDC.64 UR4, c[0x0][0x118]
        MOV R3, c[0x0][0x184]
        STG.E desc[UR4][R2.64], RZ
->      S2UR UR6, SR_CLOCKLO
|       DEPBAR.LE SB1, 0x0
|       HFMA2.MMA R9, -RZ, RZ, 0, 2.384185791015625e-07
|       IMAD.WIDE R4, RZ, R9, c[0x0][0x160]
|       LDG.E R0, desc[UR4][R4.64]              R0 read
->      CS2R.32 R7, SR_CLOCKLO
|       IADD3 R7, R7, -UR6, RZ
|       IMAD.MOV.U32 R4, RZ, RZ, c[0x0][0x168]
|       MOV R5, c[0x0][0x16c]
|       STG.E desc[UR4][R4.64], R7
| ->    S2UR UR6, SR_CLOCKLO
| |     DEPBAR.LE SB2, 0x0
| |     IMAD.WIDE R6, R0, R9, c[0x0][0x160]     R0 used
|       LDG.E R7, desc[UR4][R6.64]
|       IADD3 R9, R0, R7, RZ
->      CS2R.32 R7, SR_CLOCKLO
        IADD3 R7, R7, -UR6, RZ
        STG.E desc[UR4][R4.64+0x4], R7
        STG.E desc[UR4][R2.64], R9
        EXIT
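As a hedged aside (not part of the original reply): one way to at least keep the compiler from hoisting the stop timestamp above the load is to pass the loaded value into the clock read as a dummy operand. This only constrains ordering at the compiler level; it does not make the hardware wait for the LDG to complete, so the resulting SASS still has to be checked.

static __device__ __forceinline__ unsigned int rd_clock_after(unsigned int dep)
{
    unsigned int c;
    // The otherwise-unused "r"(dep) operand creates a data dependency on the
    // loaded value in the compiler's view, so this clock read cannot be
    // reordered above the load that produced 'dep'.
    asm volatile("mov.u32 %0, %%clock;" : "=r"(c) : "r"(dep));
    return c;
}

// usage inside the timing loop (hypothetical): unsigned int end_time = rd_clock_after(j);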

If #pragma unroll 1 is added before the loop, the timing makes more sense; however, S2UR is still in use.

l1_cache_size_test_kernel(unsigned int*, unsigned int*, unsigned int*, int, int*):
        MOV R1, c[0x0][0x28]
        MOV R2, c[0x0][0x180]
        ULDC.64 UR4, c[0x0][0x118]
        MOV R3, c[0x0][0x184]
        HFMA2.MMA R0, -RZ, RZ, 0, 0
        CS2R R6, SRZ
        STG.E desc[UR4][R2.64], RZ
.L_x_0:
->      S2UR UR6, SR_CLOCKLO
|       DEPBAR.LE SB1, 0x0
|       MOV R11, 0x4
|       IMAD.WIDE R4, R6, R11, c[0x0][0x160]
|       LDG.E R6, desc[UR4][R4.64]              R6 read
|       IADD3 R7, R6, R7, RZ                    R6 use
->      CS2R.32 R9, SR_CLOCKLO
        IMAD.WIDE R4, R0.reuse, R11, c[0x0][0x168]
        IADD3 R0, R0, 0x1, RZ
        IADD3 R9, R9, -UR6, RZ
        ISETP.GE.U32.AND P0, PT, R0, 0x2, PT
        IMAD.MOV.U32 R0, RZ, RZ, 0x1
        STG.E desc[UR4][R4.64], R9
        @!P0 BRA `(.L_x_0)
        STG.E desc[UR4][R2.64], R7
        EXIT
.L_x_1:
        BRA `(.L_x_1)
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
.L_x_2:

The ideal code generation would look like

->      CS2R R10, SR_CLOCKLO
|       LDG.E R6, desc[UR4][R4.64]              R6 read
|       IADD3 R7, R6, R7, RZ                    R6 use
->      CS2R.32 R9, SR_CLOCKLO

The next version moves the address calculation into a separate statement before the clock read, and it tries to use inline PTX in place of the S2UR.

static __device__ inline uint32_t __rd_clock()
{
    uint32_t clock;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(clock));
    return clock;
}

__global__ void l1_cache_size_test_kernel(unsigned int * my_array, unsigned int * time_record,
                                          unsigned int * index_record, int iterations, int * sum)
{
    int i;
    volatile int j = 0;
    int tmp = 0;
    *sum = 0;
    j = 0;
    #pragma unroll 1
    for (i = 0; i < 2; i++) {
        unsigned int* pj = &my_array[j];
        unsigned int start_time = __rd_clock();
        j = *pj;
        tmp += j;
        unsigned int end_time = __rd_clock();
        time_record[i] = end_time - start_time;
    }
    *sum = tmp;
}

Moving the address calculation out did remove a few instructions; however, even using inline PTX does not stop the compiler from using the S2UR.

l1_cache_size_test_kernel(unsigned int*, unsigned int*, unsigned int*, int, int*):
        MOV R1, c[0x0][0x28]
        MOV R2, c[0x0][0x180]
        ULDC.64 UR4, c[0x0][0x118]
        MOV R3, c[0x0][0x184]
        HFMA2.MMA R0, -RZ, RZ, 0, 0
        CS2R R6, SRZ
        STG.E desc[UR4][R2.64], RZ
.L_x_0:
        MOV R11, 0x4
        IMAD.WIDE R4, R6, R11, c[0x0][0x160]
        S2UR UR6, SR_CLOCKLO
        DEPBAR.LE SB1, 0x0
        LDG.E R6, desc[UR4][R4.64]
        IADD3 R7, R6, R7, RZ
        CS2R.32 R9, SR_CLOCKLO
        IMAD.WIDE R4, R0.reuse, R11, c[0x0][0x168]
        IADD3 R0, R0, 0x1, RZ
        IADD3 R9, R9, -UR6, RZ
        ISETP.GE.U32.AND P0, PT, R0, 0x2, PT
        IMAD.MOV.U32 R0, RZ, RZ, 0x1
        STG.E desc[UR4][R4.64], R9
        @!P0 BRA `(.L_x_0)
        STG.E desc[UR4][R2.64], R7
        EXIT

The S2UR instruction can only read 32-bit special registers. CS2R can read either 32-bit or 64-bit special registers.
Please note that I changed several locations to use fixed-width types.

#include <stdint.h>

static __device__ inline uint64_t __rd_clock64()
{
    uint64_t clock;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(clock));
    return clock;
}

__global__ void l1_cache_size_test_kernel(unsigned int * my_array, uint64_t * time_record,
                                          unsigned int * index_record, int iterations, int * sum)
{
    int i;
    volatile int j = 0;
    int tmp = 0;
    *sum = 0;
    j = 0;
    #pragma unroll 1
    for (i = 0; i < 2; i++) {
        unsigned int* pj = &my_array[j];
        uint64_t start_time = __rd_clock64();
        j = *pj;
        tmp += j;
        uint64_t end_time = __rd_clock64();
        time_record[i] = end_time - start_time;
    }
    *sum = tmp;
}

Using a 64-bit clock read fixes the code generation. I would recommend filing a compiler bug.

l1_cache_size_test_kernel(unsigned int*, unsigned long*, unsigned int*, int, int*):
        MOV R1, c[0x0][0x28]
        MOV R2, c[0x0][0x180]
        ULDC.64 UR4, c[0x0][0x118]
        MOV R3, c[0x0][0x184]
        HFMA2.MMA R0, -RZ, RZ, 0, 0
        CS2R R8, SRZ
        STG.E desc[UR4][R2.64], RZ
.L_x_0:
        MOV R5, 0x4
        IMAD.WIDE R4, R8, R5, c[0x0][0x160]
->      CS2R R6, SR_CLOCKLO
|       LDG.E R8, desc[UR4][R4.64]
|       IADD3 R9, R8, R9, RZ
->      CS2R R4, SR_CLOCKLO
        IADD3 R6, P0, -R6, R4, RZ
        IMAD.MOV.U32 R11, RZ, RZ, 0x8
        IADD3.X R7, ~R7, R5, RZ, P0, !PT
        IMAD.WIDE R4, R0.reuse, R11, c[0x0][0x168]
        IADD3 R0, R0, 0x1, RZ
        ISETP.GE.U32.AND P0, PT, R0, 0x2, PT
        STG.E.64 desc[UR4][R4.64], R6
        IMAD.MOV.U32 R0, RZ, RZ, 0x1
        @!P0 BRA `(.L_x_0)
        STG.E desc[UR4][R2.64], R9
        EXIT
.L_x_1:
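As a side note (a sketch, not verified on Orin here): CUDA also exposes a built-in clock64() intrinsic, so the same 64-bit timing could likely be written without hand-rolled inline PTX. Whether it produces the same CS2R pairing should still be confirmed in the SASS.

    #pragma unroll 1
    for (i = 0; i < 2; i++) {
        unsigned int* pj = &my_array[j];
        long long start_time = clock64();   // built-in 64-bit cycle counter
        j = *pj;
        tmp += j;
        long long end_time = clock64();
        time_record[i] = (uint64_t)(end_time - start_time);
    }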

Thank you for replying.

I used nvcc -ptx to get the PTX code of l1_cache_size_test_kernel.

This is the result with index_record[i] = j uncommented:

// // Generated by NVIDIA NVVM Compiler // // Compiler Build ID: CL-34714021 // Cuda compilation tools, release 12.6, V12.6.68 // Based on NVVM 7.0.1 // .version 8.5 .target sm_52 .address_size 64	// .globl	_Z25l1_cache_size_test_kernelPjS_S_iPi .visible .entry _Z25l1_cache_size_test_kernelPjS_S_iPi(	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_0,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_1,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_2,	.param .u32 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_3,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_4 ) {	.reg .pred	%p<11>;	.reg .b32	%r<143>;	.reg .b64	%rd<85>;	ld.param.u64	%rd28, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_0];	ld.param.u64	%rd29, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_1];	ld.param.u64	%rd30, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_2];	ld.param.u64	%rd31, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_4];	ld.param.u32	%r33, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_3];	cvta.to.global.u64	%rd1, %rd30;	cvta.to.global.u64	%rd2, %rd29;	cvta.to.global.u64	%rd3, %rd28;	mov.u32	%r34, 0;	mov.u32	%r142, %r34;	cvta.to.global.u64	%rd5, %rd31;	st.global.u32	[%rd5], %r34;	mov.u32	%r142, %r34;	shr.u32	%r35, %r33, 31;	add.s32	%r36, %r33, %r35;	shr.s32	%r1, %r36, 1;	setp.lt.s32	%p1, %r33, 2;	mov.u32	%r141, %r34;	@%p1 bra	$L__BB0_7;	add.s32	%r40, %r1, -1;	and.b32	%r131, %r1, 3;	setp.lt.u32	%p2, %r40, 3;	mov.u32	%r141, 0;	mov.u32	%r129, %r141;	@%p2 bra	$L__BB0_4;	sub.s32	%r126, %r1, %r131;	mov.u32	%r141, 0;	mov.u64	%rd77, %rd1;	mov.u64	%rd78, %rd2; $L__BB0_3:	// begin inline asm	mov.u64	%rd33, %clock64;	// end inline asm	cvt.u32.u64	%r43, %rd33;	mov.u32	%r44, %r142;	mul.wide.s32	%rd41, %r44, 4;	add.s64	%rd42, %rd3, %rd41;	ld.global.u32	%r45, [%rd42];	mov.u32	%r142, %r45;	mov.u32	%r46, %r142;	add.s32	%r47, %r46, %r141;	// begin inline asm	mov.u64	%rd34, %clock64;	// end inline asm	cvt.u32.u64	%r48, %rd34;	sub.s32	%r49, %r48, %r43;	st.global.u32	[%rd78], %r49;	mov.u32	%r50, %r142;	st.global.u32	[%rd77], %r50;	// begin inline asm	mov.u64	%rd35, %clock64;	// end inline asm	cvt.u32.u64	%r51, %rd35;	mov.u32	%r52, %r142;	mul.wide.s32	%rd43, %r52, 4;	add.s64	%rd44, %rd3, %rd43;	ld.global.u32	%r53, [%rd44];	mov.u32	%r142, %r53;	mov.u32	%r54, %r142;	add.s32	%r55, %r54, %r47;	// begin inline asm	mov.u64	%rd36, %clock64;	// end inline asm	cvt.u32.u64	%r56, %rd36;	sub.s32	%r57, %r56, %r51;	st.global.u32	[%rd78+4], %r57;	mov.u32	%r58, %r142;	st.global.u32	[%rd77+4], %r58;	// begin inline asm	mov.u64	%rd37, %clock64;	// end inline asm	cvt.u32.u64	%r59, %rd37;	mov.u32	%r60, %r142;	mul.wide.s32	%rd45, %r60, 4;	add.s64	%rd46, %rd3, %rd45;	ld.global.u32	%r61, [%rd46];	mov.u32	%r142, %r61;	mov.u32	%r62, %r142;	add.s32	%r63, %r62, %r55;	// begin inline asm	mov.u64	%rd38, %clock64;	// end inline asm	cvt.u32.u64	%r64, %rd38;	sub.s32	%r65, %r64, %r59;	st.global.u32	[%rd78+8], %r65;	mov.u32	%r66, %r142;	st.global.u32	[%rd77+8], %r66;	// begin inline asm	mov.u64	%rd39, %clock64;	// end inline asm	cvt.u32.u64	%r67, %rd39;	mov.u32	%r68, %r142;	mul.wide.s32	%rd47, %r68, 4;	add.s64	%rd48, %rd3, %rd47;	ld.global.u32	%r69, [%rd48];	mov.u32	%r142, %r69;	mov.u32	%r70, %r142;	add.s32	%r141, %r70, %r63;	// begin inline asm	mov.u64	%rd40, %clock64;	// end inline asm	cvt.u32.u64	%r71, %rd40;	sub.s32	%r72, %r71, %r67;	st.global.u32	[%rd78+12], %r72;	mov.u32	%r73, %r142;	st.global.u32	[%rd77+12], %r73;	add.s32	%r129, %r129, 4;	add.s64	%rd78, %rd78, 16;	add.s64	%rd77, %rd77, 16;	add.s32	%r126, %r126, -4;	setp.ne.s32	%p3, 
%r126, 0;	@%p3 bra	$L__BB0_3; $L__BB0_4:	setp.eq.s32	%p4, %r131, 0;	@%p4 bra	$L__BB0_7;	mul.wide.s32	%rd49, %r129, 4;	add.s64	%rd80, %rd1, %rd49;	add.s64	%rd79, %rd2, %rd49; $L__BB0_6:	.pragma "nounroll";	// begin inline asm	mov.u64	%rd50, %clock64;	// end inline asm	cvt.u32.u64	%r74, %rd50;	mov.u32	%r75, %r142;	mul.wide.s32	%rd52, %r75, 4;	add.s64	%rd53, %rd3, %rd52;	ld.global.u32	%r76, [%rd53];	mov.u32	%r142, %r76;	mov.u32	%r77, %r142;	add.s32	%r141, %r77, %r141;	// begin inline asm	mov.u64	%rd51, %clock64;	// end inline asm	cvt.u32.u64	%r78, %rd51;	sub.s32	%r79, %r78, %r74;	st.global.u32	[%rd79], %r79;	mov.u32	%r80, %r142;	st.global.u32	[%rd80], %r80;	add.s64	%rd80, %rd80, 4;	add.s64	%rd79, %rd79, 4;	add.s32	%r131, %r131, -1;	setp.ne.s32	%p5, %r131, 0;	@%p5 bra	$L__BB0_6; $L__BB0_7:	membar.gl;	mov.u32	%r142, %r34;	setp.ge.s32	%p6, %r1, %r33;	@%p6 bra	$L__BB0_14;	sub.s32	%r83, %r33, %r1;	and.b32	%r135, %r83, 3;	setp.eq.s32	%p7, %r135, 0;	mov.u32	%r137, %r1;	@%p7 bra	$L__BB0_11;	mul.wide.s32	%rd54, %r1, 4;	add.s64	%rd82, %rd1, %rd54;	add.s64	%rd81, %rd2, %rd54;	mov.u32	%r137, %r1; $L__BB0_10:	.pragma "nounroll";	// begin inline asm	mov.u64	%rd55, %clock64;	// end inline asm	cvt.u32.u64	%r84, %rd55;	mov.u32	%r85, %r142;	mul.wide.s32	%rd57, %r85, 4;	add.s64	%rd58, %rd3, %rd57;	ld.global.u32	%r86, [%rd58];	mov.u32	%r142, %r86;	mov.u32	%r87, %r142;	add.s32	%r141, %r87, %r141;	// begin inline asm	mov.u64	%rd56, %clock64;	// end inline asm	cvt.u32.u64	%r88, %rd56;	sub.s32	%r89, %r88, %r84;	st.global.u32	[%rd81], %r89;	mov.u32	%r90, %r142;	st.global.u32	[%rd82], %r90;	add.s32	%r137, %r137, 1;	add.s64	%rd82, %rd82, 4;	add.s64	%rd81, %rd81, 4;	add.s32	%r135, %r135, -1;	setp.ne.s32	%p8, %r135, 0;	@%p8 bra	$L__BB0_10; $L__BB0_11:	not.b32	%r91, %r1;	add.s32	%r92, %r91, %r33;	setp.lt.u32	%p9, %r92, 3;	@%p9 bra	$L__BB0_14;	mul.wide.s32	%rd59, %r137, 4;	add.s64	%rd60, %rd59, 8;	add.s64	%rd84, %rd2, %rd60;	add.s64	%rd83, %rd1, %rd60; $L__BB0_13:	// begin inline asm	mov.u64	%rd61, %clock64;	// end inline asm	cvt.u32.u64	%r93, %rd61;	mov.u32	%r94, %r142;	mul.wide.s32	%rd69, %r94, 4;	add.s64	%rd70, %rd3, %rd69;	ld.global.u32	%r95, [%rd70];	mov.u32	%r142, %r95;	mov.u32	%r96, %r142;	add.s32	%r97, %r96, %r141;	// begin inline asm	mov.u64	%rd62, %clock64;	// end inline asm	cvt.u32.u64	%r98, %rd62;	sub.s32	%r99, %r98, %r93;	st.global.u32	[%rd84+-8], %r99;	mov.u32	%r100, %r142;	st.global.u32	[%rd83+-8], %r100;	// begin inline asm	mov.u64	%rd63, %clock64;	// end inline asm	cvt.u32.u64	%r101, %rd63;	mov.u32	%r102, %r142;	mul.wide.s32	%rd71, %r102, 4;	add.s64	%rd72, %rd3, %rd71;	ld.global.u32	%r103, [%rd72];	mov.u32	%r142, %r103;	mov.u32	%r104, %r142;	add.s32	%r105, %r104, %r97;	// begin inline asm	mov.u64	%rd64, %clock64;	// end inline asm	cvt.u32.u64	%r106, %rd64;	sub.s32	%r107, %r106, %r101;	st.global.u32	[%rd84+-4], %r107;	mov.u32	%r108, %r142;	st.global.u32	[%rd83+-4], %r108;	// begin inline asm	mov.u64	%rd65, %clock64;	// end inline asm	cvt.u32.u64	%r109, %rd65;	mov.u32	%r110, %r142;	mul.wide.s32	%rd73, %r110, 4;	add.s64	%rd74, %rd3, %rd73;	ld.global.u32	%r111, [%rd74];	mov.u32	%r142, %r111;	mov.u32	%r112, %r142;	add.s32	%r113, %r112, %r105;	// begin inline asm	mov.u64	%rd66, %clock64;	// end inline asm	cvt.u32.u64	%r114, %rd66;	sub.s32	%r115, %r114, %r109;	st.global.u32	[%rd84], %r115;	mov.u32	%r116, %r142;	st.global.u32	[%rd83], %r116;	// begin inline asm	mov.u64	%rd67, %clock64;	// end inline asm	cvt.u32.u64	%r117, %rd67;	mov.u32	%r118, %r142;	mul.wide.s32	%rd75, %r118, 4;	add.s64	%rd76, %rd3, %rd75;	
ld.global.u32	%r119, [%rd76];	mov.u32	%r142, %r119;	mov.u32	%r120, %r142;	add.s32	%r141, %r120, %r113;	// begin inline asm	mov.u64	%rd68, %clock64;	// end inline asm	cvt.u32.u64	%r121, %rd68;	sub.s32	%r122, %r121, %r117;	st.global.u32	[%rd84+4], %r122;	mov.u32	%r123, %r142;	st.global.u32	[%rd83+4], %r123;	add.s64	%rd84, %rd84, 16;	add.s64	%rd83, %rd83, 16;	add.s32	%r137, %r137, 4;	setp.lt.s32	%p10, %r137, %r33;	@%p10 bra	$L__BB0_13; $L__BB0_14:	st.global.u32	[%rd5], %r141;	ret; } 

And this is the result with index_record[i] = j commented out:

// // Generated by NVIDIA NVVM Compiler // // Compiler Build ID: CL-34714021 // Cuda compilation tools, release 12.6, V12.6.68 // Based on NVVM 7.0.1 // .version 8.5 .target sm_52 .address_size 64	// .globl	_Z25l1_cache_size_test_kernelPjS_S_iPi .visible .entry _Z25l1_cache_size_test_kernelPjS_S_iPi(	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_0,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_1,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_2,	.param .u32 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_3,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_4 ) {	.reg .pred	%p<11>;	.reg .b32	%r<133>;	.reg .b64	%rd<68>;	ld.param.u64	%rd16, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_0];	ld.param.u64	%rd17, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_1];	ld.param.u64	%rd18, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_4];	ld.param.u32	%r33, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_3];	cvta.to.global.u64	%rd1, %rd17;	cvta.to.global.u64	%rd2, %rd16;	mov.u32	%r34, 0;	mov.u32	%r132, %r34;	cvta.to.global.u64	%rd4, %rd18;	st.global.u32	[%rd4], %r34;	mov.u32	%r132, %r34;	shr.u32	%r35, %r33, 31;	add.s32	%r36, %r33, %r35;	shr.s32	%r1, %r36, 1;	setp.lt.s32	%p1, %r33, 2;	mov.u32	%r131, %r34;	@%p1 bra	$L__BB0_7;	add.s32	%r40, %r1, -1;	and.b32	%r121, %r1, 3;	setp.lt.u32	%p2, %r40, 3;	mov.u32	%r131, 0;	mov.u32	%r119, %r131;	@%p2 bra	$L__BB0_4;	sub.s32	%r116, %r1, %r121;	mov.u32	%r131, 0;	mov.u64	%rd64, %rd1; $L__BB0_3:	// begin inline asm	mov.u64	%rd20, %clock64;	// end inline asm	cvt.u32.u64	%r43, %rd20;	mov.u32	%r44, %r132;	mul.wide.s32	%rd28, %r44, 4;	add.s64	%rd29, %rd2, %rd28;	ld.global.u32	%r45, [%rd29];	mov.u32	%r132, %r45;	mov.u32	%r46, %r132;	add.s32	%r47, %r46, %r131;	// begin inline asm	mov.u64	%rd21, %clock64;	// end inline asm	cvt.u32.u64	%r48, %rd21;	sub.s32	%r49, %r48, %r43;	st.global.u32	[%rd64], %r49;	// begin inline asm	mov.u64	%rd22, %clock64;	// end inline asm	cvt.u32.u64	%r50, %rd22;	mov.u32	%r51, %r132;	mul.wide.s32	%rd30, %r51, 4;	add.s64	%rd31, %rd2, %rd30;	ld.global.u32	%r52, [%rd31];	mov.u32	%r132, %r52;	mov.u32	%r53, %r132;	add.s32	%r54, %r53, %r47;	// begin inline asm	mov.u64	%rd23, %clock64;	// end inline asm	cvt.u32.u64	%r55, %rd23;	sub.s32	%r56, %r55, %r50;	st.global.u32	[%rd64+4], %r56;	// begin inline asm	mov.u64	%rd24, %clock64;	// end inline asm	cvt.u32.u64	%r57, %rd24;	mov.u32	%r58, %r132;	mul.wide.s32	%rd32, %r58, 4;	add.s64	%rd33, %rd2, %rd32;	ld.global.u32	%r59, [%rd33];	mov.u32	%r132, %r59;	mov.u32	%r60, %r132;	add.s32	%r61, %r60, %r54;	// begin inline asm	mov.u64	%rd25, %clock64;	// end inline asm	cvt.u32.u64	%r62, %rd25;	sub.s32	%r63, %r62, %r57;	st.global.u32	[%rd64+8], %r63;	// begin inline asm	mov.u64	%rd26, %clock64;	// end inline asm	cvt.u32.u64	%r64, %rd26;	mov.u32	%r65, %r132;	mul.wide.s32	%rd34, %r65, 4;	add.s64	%rd35, %rd2, %rd34;	ld.global.u32	%r66, [%rd35];	mov.u32	%r132, %r66;	mov.u32	%r67, %r132;	add.s32	%r131, %r67, %r61;	// begin inline asm	mov.u64	%rd27, %clock64;	// end inline asm	cvt.u32.u64	%r68, %rd27;	sub.s32	%r69, %r68, %r64;	st.global.u32	[%rd64+12], %r69;	add.s32	%r119, %r119, 4;	add.s64	%rd64, %rd64, 16;	add.s32	%r116, %r116, -4;	setp.ne.s32	%p3, %r116, 0;	@%p3 bra	$L__BB0_3; $L__BB0_4:	setp.eq.s32	%p4, %r121, 0;	@%p4 bra	$L__BB0_7;	mul.wide.s32	%rd36, %r119, 4;	add.s64	%rd65, %rd1, %rd36; $L__BB0_6:	.pragma "nounroll";	// begin inline asm	mov.u64	%rd37, %clock64;	// end inline asm	cvt.u32.u64	%r70, %rd37;	mov.u32	%r71, %r132;	mul.wide.s32	%rd39, %r71, 4;	add.s64	%rd40, %rd2, %rd39;	
ld.global.u32	%r72, [%rd40];	mov.u32	%r132, %r72;	mov.u32	%r73, %r132;	add.s32	%r131, %r73, %r131;	// begin inline asm	mov.u64	%rd38, %clock64;	// end inline asm	cvt.u32.u64	%r74, %rd38;	sub.s32	%r75, %r74, %r70;	st.global.u32	[%rd65], %r75;	add.s64	%rd65, %rd65, 4;	add.s32	%r121, %r121, -1;	setp.ne.s32	%p5, %r121, 0;	@%p5 bra	$L__BB0_6; $L__BB0_7:	membar.gl;	mov.u32	%r132, %r34;	setp.ge.s32	%p6, %r1, %r33;	@%p6 bra	$L__BB0_14;	sub.s32	%r78, %r33, %r1;	and.b32	%r125, %r78, 3;	setp.eq.s32	%p7, %r125, 0;	mov.u32	%r127, %r1;	@%p7 bra	$L__BB0_11;	mul.wide.s32	%rd41, %r1, 4;	add.s64	%rd66, %rd1, %rd41;	mov.u32	%r127, %r1; $L__BB0_10:	.pragma "nounroll";	// begin inline asm	mov.u64	%rd42, %clock64;	// end inline asm	cvt.u32.u64	%r79, %rd42;	mov.u32	%r80, %r132;	mul.wide.s32	%rd44, %r80, 4;	add.s64	%rd45, %rd2, %rd44;	ld.global.u32	%r81, [%rd45];	mov.u32	%r132, %r81;	mov.u32	%r82, %r132;	add.s32	%r131, %r82, %r131;	// begin inline asm	mov.u64	%rd43, %clock64;	// end inline asm	cvt.u32.u64	%r83, %rd43;	sub.s32	%r84, %r83, %r79;	st.global.u32	[%rd66], %r84;	add.s32	%r127, %r127, 1;	add.s64	%rd66, %rd66, 4;	add.s32	%r125, %r125, -1;	setp.ne.s32	%p8, %r125, 0;	@%p8 bra	$L__BB0_10; $L__BB0_11:	not.b32	%r85, %r1;	add.s32	%r86, %r85, %r33;	setp.lt.u32	%p9, %r86, 3;	@%p9 bra	$L__BB0_14;	mul.wide.s32	%rd46, %r127, 4;	add.s64	%rd47, %rd1, %rd46;	add.s64	%rd67, %rd47, 8; $L__BB0_13:	// begin inline asm	mov.u64	%rd48, %clock64;	// end inline asm	cvt.u32.u64	%r87, %rd48;	mov.u32	%r88, %r132;	mul.wide.s32	%rd56, %r88, 4;	add.s64	%rd57, %rd2, %rd56;	ld.global.u32	%r89, [%rd57];	mov.u32	%r132, %r89;	mov.u32	%r90, %r132;	add.s32	%r91, %r90, %r131;	// begin inline asm	mov.u64	%rd49, %clock64;	// end inline asm	cvt.u32.u64	%r92, %rd49;	sub.s32	%r93, %r92, %r87;	st.global.u32	[%rd67+-8], %r93;	// begin inline asm	mov.u64	%rd50, %clock64;	// end inline asm	cvt.u32.u64	%r94, %rd50;	mov.u32	%r95, %r132;	mul.wide.s32	%rd58, %r95, 4;	add.s64	%rd59, %rd2, %rd58;	ld.global.u32	%r96, [%rd59];	mov.u32	%r132, %r96;	mov.u32	%r97, %r132;	add.s32	%r98, %r97, %r91;	// begin inline asm	mov.u64	%rd51, %clock64;	// end inline asm	cvt.u32.u64	%r99, %rd51;	sub.s32	%r100, %r99, %r94;	st.global.u32	[%rd67+-4], %r100;	// begin inline asm	mov.u64	%rd52, %clock64;	// end inline asm	cvt.u32.u64	%r101, %rd52;	mov.u32	%r102, %r132;	mul.wide.s32	%rd60, %r102, 4;	add.s64	%rd61, %rd2, %rd60;	ld.global.u32	%r103, [%rd61];	mov.u32	%r132, %r103;	mov.u32	%r104, %r132;	add.s32	%r105, %r104, %r98;	// begin inline asm	mov.u64	%rd53, %clock64;	// end inline asm	cvt.u32.u64	%r106, %rd53;	sub.s32	%r107, %r106, %r101;	st.global.u32	[%rd67], %r107;	// begin inline asm	mov.u64	%rd54, %clock64;	// end inline asm	cvt.u32.u64	%r108, %rd54;	mov.u32	%r109, %r132;	mul.wide.s32	%rd62, %r109, 4;	add.s64	%rd63, %rd2, %rd62;	ld.global.u32	%r110, [%rd63];	mov.u32	%r132, %r110;	mov.u32	%r111, %r132;	add.s32	%r131, %r111, %r105;	// begin inline asm	mov.u64	%rd55, %clock64;	// end inline asm	cvt.u32.u64	%r112, %rd55;	sub.s32	%r113, %r112, %r108;	st.global.u32	[%rd67+4], %r113;	add.s64	%rd67, %rd67, 16;	add.s32	%r127, %r127, 4;	setp.lt.s32	%p10, %r127, %r33;	@%p10 bra	$L__BB0_13; $L__BB0_14:	st.global.u32	[%rd4], %r131;	ret; } 

There is no S2UR in either of those two PTX listings.

Moreover, the PTX with index_record uncommented just adds a mov and a st.global.u32 in each loop iteration. Why do these two operations make the result different? From your explanation, is the reason that the timestamp read after the LDG is likewise not ordered against the data the LDG returns?

Things go well when I put #pragma unroll 1 before the for loop.

I used your kernel, and the result is not affected by index_record:

It looks like the latency of the SM accessing global memory is 400+ cycles. Is that expected?

There is another question. I changed the kernel as below:

__global__ void l1_cache_size_test_kernel(unsigned int * my_array, unsigned int * time_record,
                                          unsigned int * index_record, int iterations, int * sum)
{
    int i;
    volatile int j = 0;
    int tmp = 0;
    *sum = 0;
    #pragma unroll 1
    for (i = 0; i < iterations/2; i++) {
        unsigned int* pj = &my_array[j];
        uint64_t start_time = __rd_clock64();
        j = *pj;
        tmp += j;
        uint64_t end_time = __rd_clock64();
        time_record[i] = (uint32_t)(end_time - start_time);
        // index_record[i] = j;
    }
    j = 0;
    __threadfence();
    for (i = iterations/2; i < iterations; i++) {
        unsigned int* pj = &my_array[j];
        uint64_t start_time = __rd_clock64();
        j = *pj;
        tmp += j;
        uint64_t end_time = __rd_clock64();
        time_record[i] = (uint32_t)(end_time - start_time);
        // index_record[i] = j;
    }
    *sum = tmp;
}

And the PTX code generated by nvcc -arch=sm_87 -ptx is:

// // Generated by NVIDIA NVVM Compiler // // Compiler Build ID: CL-34714021 // Cuda compilation tools, release 12.6, V12.6.68 // Based on NVVM 7.0.1 // .version 8.5 .target sm_87 .address_size 64	// .globl	_Z25l1_cache_size_test_kernelPjS_S_iPi .visible .entry _Z25l1_cache_size_test_kernelPjS_S_iPi(	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_0,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_1,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_2,	.param .u32 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_3,	.param .u64 _Z25l1_cache_size_test_kernelPjS_S_iPi_param_4 ) {	.reg .pred	%p<8>;	.reg .b32	%r<73>;	.reg .b64	%rd<66>;	ld.param.u64	%rd22, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_0];	ld.param.u64	%rd23, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_1];	ld.param.u64	%rd24, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_2];	ld.param.u64	%rd25, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_4];	ld.param.u32	%r22, [_Z25l1_cache_size_test_kernelPjS_S_iPi_param_3];	cvta.to.global.u64	%rd1, %rd24;	cvta.to.global.u64	%rd2, %rd23;	cvta.to.global.u64	%rd3, %rd22;	mov.u32	%r23, 0;	mov.u32	%r72, %r23;	cvta.to.global.u64	%rd5, %rd25;	st.global.u32	[%rd5], %r23;	mov.u32	%r72, %r23;	shr.u32	%r24, %r22, 31;	add.s32	%r25, %r22, %r24;	shr.s32	%r1, %r25, 1;	setp.lt.s32	%p1, %r22, 2;	mov.u32	%r71, %r23;	@%p1 bra	$L__BB0_3;	mov.u32	%r71, 0;	mov.u64	%rd60, %rd2;	mov.u64	%rd61, %rd1;	mov.u32	%r61, %r71; $L__BB0_2:	.pragma "nounroll";	mov.u32	%r28, %r72;	mul.wide.s32	%rd29, %r28, 4;	add.s64	%rd30, %rd3, %rd29;	// begin inline asm	mov.u64 %rd27, %clock64;	// end inline asm	ld.global.u32	%r29, [%rd30];	mov.u32	%r72, %r29;	mov.u32	%r30, %r72;	add.s32	%r71, %r30, %r71;	// begin inline asm	mov.u64 %rd28, %clock64;	// end inline asm	sub.s64	%rd31, %rd28, %rd27;	st.global.u32	[%rd60], %rd31;	mov.u32	%r31, %r72;	st.global.u32	[%rd61], %r31;	add.s64	%rd61, %rd61, 4;	add.s64	%rd60, %rd60, 4;	add.s32	%r61, %r61, 1;	setp.lt.s32	%p2, %r61, %r1;	@%p2 bra	$L__BB0_2; $L__BB0_3:	mov.u32	%r72, %r23;	membar.gl;	setp.ge.s32	%p3, %r1, %r22;	@%p3 bra	$L__BB0_10;	sub.s32	%r34, %r22, %r1;	and.b32	%r65, %r34, 3;	setp.eq.s32	%p4, %r65, 0;	mov.u32	%r67, %r1;	@%p4 bra	$L__BB0_7;	mul.wide.s32	%rd32, %r1, 4;	add.s64	%rd63, %rd1, %rd32;	add.s64	%rd62, %rd2, %rd32;	mov.u32	%r67, %r1; $L__BB0_6:	.pragma "nounroll";	mov.u32	%r35, %r72;	mul.wide.s32	%rd35, %r35, 4;	add.s64	%rd36, %rd3, %rd35;	// begin inline asm	mov.u64 %rd33, %clock64;	// end inline asm	ld.global.u32	%r36, [%rd36];	mov.u32	%r72, %r36;	mov.u32	%r37, %r72;	add.s32	%r71, %r37, %r71;	// begin inline asm	mov.u64 %rd34, %clock64;	// end inline asm	sub.s64	%rd37, %rd34, %rd33;	st.global.u32	[%rd62], %rd37;	mov.u32	%r38, %r72;	st.global.u32	[%rd63], %r38;	add.s32	%r67, %r67, 1;	add.s64	%rd63, %rd63, 4;	add.s64	%rd62, %rd62, 4;	add.s32	%r65, %r65, -1;	setp.ne.s32	%p5, %r65, 0;	@%p5 bra	$L__BB0_6; $L__BB0_7:	not.b32	%r39, %r1;	add.s32	%r40, %r39, %r22;	setp.lt.u32	%p6, %r40, 3;	@%p6 bra	$L__BB0_10;	mul.wide.s32	%rd38, %r67, 4;	add.s64	%rd39, %rd38, 8;	add.s64	%rd65, %rd2, %rd39;	add.s64	%rd64, %rd1, %rd39; $L__BB0_9:	mov.u32	%r41, %r72;	mul.wide.s32	%rd48, %r41, 4;	add.s64	%rd49, %rd3, %rd48;	// begin inline asm	mov.u64 %rd40, %clock64;	// end inline asm	ld.global.u32	%r42, [%rd49];	mov.u32	%r72, %r42;	mov.u32	%r43, %r72;	add.s32	%r44, %r43, %r71;	// begin inline asm	mov.u64 %rd41, %clock64;	// end inline asm	sub.s64	%rd50, %rd41, %rd40;	st.global.u32	[%rd65+-8], %rd50;	mov.u32	%r45, %r72;	st.global.u32	[%rd64+-8], %r45;	mov.u32	%r46, %r72;	mul.wide.s32	
%rd51, %r46, 4;	add.s64	%rd52, %rd3, %rd51;	// begin inline asm	mov.u64 %rd42, %clock64;	// end inline asm	ld.global.u32	%r47, [%rd52];	mov.u32	%r72, %r47;	mov.u32	%r48, %r72;	add.s32	%r49, %r48, %r44;	// begin inline asm	mov.u64 %rd43, %clock64;	// end inline asm	sub.s64	%rd53, %rd43, %rd42;	st.global.u32	[%rd65+-4], %rd53;	mov.u32	%r50, %r72;	st.global.u32	[%rd64+-4], %r50;	mov.u32	%r51, %r72;	mul.wide.s32	%rd54, %r51, 4;	add.s64	%rd55, %rd3, %rd54;	// begin inline asm	mov.u64 %rd44, %clock64;	// end inline asm	ld.global.u32	%r52, [%rd55];	mov.u32	%r72, %r52;	mov.u32	%r53, %r72;	add.s32	%r54, %r53, %r49;	// begin inline asm	mov.u64 %rd45, %clock64;	// end inline asm	sub.s64	%rd56, %rd45, %rd44;	st.global.u32	[%rd65], %rd56;	mov.u32	%r55, %r72;	st.global.u32	[%rd64], %r55;	mov.u32	%r56, %r72;	mul.wide.s32	%rd57, %r56, 4;	add.s64	%rd58, %rd3, %rd57;	// begin inline asm	mov.u64 %rd46, %clock64;	// end inline asm	ld.global.u32	%r57, [%rd58];	mov.u32	%r72, %r57;	mov.u32	%r58, %r72;	add.s32	%r71, %r58, %r54;	// begin inline asm	mov.u64 %rd47, %clock64;	// end inline asm	sub.s64	%rd59, %rd47, %rd46;	st.global.u32	[%rd65+4], %rd59;	mov.u32	%r59, %r72;	st.global.u32	[%rd64+4], %r59;	add.s64	%rd65, %rd65, 16;	add.s64	%rd64, %rd64, 16;	add.s32	%r67, %r67, 4;	setp.lt.s32	%p7, %r67, %r22;	@%p7 bra	$L__BB0_9; $L__BB0_10:	st.global.u32	[%rd5], %r71;	ret; } 

The result of the first loop is as expected. But for the second loop, shown below, why does the latency vary rather than being a stable 36?
