How to do an in-warp transpose of the mma C&D matrix?

Hi, I’m currently doing a computation on Ampere and Hopper GPUs where

```
C = A @ B   # comp1
E = D @ C   # comp2
```

Supposing I’m using mma.m16n8k16.

The required layouts of C are different in comp1 and comp2.

Is it possible to perform the transpose without storing and loading matrix C into shared memory? Can I use movmatrix.m8n8 in this scenario?

Thanks.

You could transpose and exchange A and B instead.

C^T = B^T @ A^T

Then you would need to load the inputs in a transposed fashion. Alternatively, you could apply it to comp2 and get the output as a transposed matrix (and keep the inputs as they currently are).
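Spelled out in the notation of the opening post (this is pure algebra, with no claims about fragment layouts), the two options are:

```
C^T = B^T @ A^T   # apply it to comp1: compute C transposed
E^T = C^T @ D^T   # or apply it to comp2: E comes out transposed
```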

@Curefab Thanks for the quick answer!

I had considered that, but unfortunately I'm using the sparse tensor cores, which require the operand order of the matmuls to stay fixed (i.e., I cannot transpose either computation).

So an in-warp transpose is still required here.

So A has to be sparse, and B may not be? And C (as it is computed) cannot be sparse at all? And D is sparse?

Otherwise two movmatrix calls should work (as one does 8x8 at a time).

That is 4 operations (1 mma + 2 movmatrix + 1 mma).

I am not sure about the performance impact. Perhaps in the end it is faster to change either comp1 or comp2 to two dense operations (as those have half the matrix size) instead.

→ 3 operations (1 mma + 2 mma).
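As a rough sketch of the 1 mma + 2 movmatrix + 1 mma idea (my illustration, untested; `transpose_c_fragment` is a hypothetical helper, and it assumes the usual m16n8 f16 accumulator layout where one 32-bit register per thread covers rows 0-7 and the other rows 8-15):

```
// Hedged sketch: transpose the two 8x8 halves of a 16x8 f16 accumulator
// fragment in place, one movmatrix per half. Whether the result matches the
// operand layout expected by the second mma still needs to be verified.
__device__ __forceinline__ void transpose_c_fragment(unsigned int c[2])
{
    asm volatile("movmatrix.sync.aligned.m8n8.trans.b16 %0, %0;\n\t"
                 : "+r"(c[0]));  // rows 0-7 of C
    asm volatile("movmatrix.sync.aligned.m8n8.trans.b16 %0, %0;\n\t"
                 : "+r"(c[1]));  // rows 8-15 of C
}
```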

It should be easy to set up a simple experiment to figure out if `movmatrix` produces the desired layout.

Judging from the documentation https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-movmatrix, I believe it does work.
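A minimal probe along those lines (a sketch of my own; `movmatrix` needs sm_75 or newer, so compile with e.g. `-arch=sm_80`): each lane packs two identifiable values, runs `movmatrix`, and prints what it ends up with, so the output fragment layout can be read off directly.

```
#include <cstdio>

__global__ void movmatrix_layout_probe()
{
    const unsigned lane = threadIdx.x;                 // one warp assumed
    // Encode element (row, col) as row*8+col, matching the documented
    // m8n8 source layout: lane/4 selects the row, lane%4 the column pair.
    const unsigned row = lane / 4, col = (lane % 4) * 2;
    unsigned frag = (row * 8 + col) | ((row * 8 + col + 1) << 16);

    asm volatile("movmatrix.sync.aligned.m8n8.trans.b16 %0, %0;\n\t"
                 : "+r"(frag));

    printf("lane %2u now holds elements %2u and %2u\n",
           lane, frag & 0xffffu, frag >> 16);
}

int main()
{
    movmatrix_layout_probe<<<1, 32>>>();
    cudaDeviceSynchronize();
}
```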

@Curefab @striker159

Thanks for your kind help!

Yeah, I think multiple movmatrix calls might be the best way to address the issue in this scenario, or otherwise to use shuffle instructions to emulate one, as in llama.cpp/ggml.
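For reference, a shuffle-based emulation of one m8n8 `.trans` step could look roughly like the sketch below (my own illustration, not the ggml code; it assumes the fragment layout from the movmatrix documentation, i.e. lane/4 selects the row and lane%4 the column pair, with two b16 values per 32-bit register).

```
// Hedged sketch: emulate movmatrix.m8n8.trans.b16 with two warp shuffles
// plus half-word selects. 'frag' holds elements (row, col) and (row, col+1)
// with row = lane/4 and col = 2*(lane%4), packed low/high.
__device__ __forceinline__ unsigned int transpose8x8_b16(unsigned int frag)
{
    const unsigned lane = threadIdx.x % 32;
    const unsigned row  = lane / 4;   // row this lane owns in the output
    const unsigned cp   = lane % 4;   // column pair this lane owns

    // After the transpose this lane needs the original elements
    // (2*cp, row) and (2*cp+1, row); fetch the registers holding them.
    const unsigned lo = __shfl_sync(0xffffffffu, frag, (2 * cp    ) * 4 + row / 2);
    const unsigned hi = __shfl_sync(0xffffffffu, frag, (2 * cp + 1) * 4 + row / 2);

    // Pick the low or high 16-bit half according to the original column parity.
    const unsigned a = (row & 1) ? (lo >> 16) : (lo & 0xffffu);
    const unsigned b = (row & 1) ? (hi >> 16) : (hi & 0xffffu);
    return a | (b << 16);
}
```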

Does anybody know whether movmatrix is a tensor core instruction, a shared memory instruction, or something else?

The shuffle instructions and ldmatrix use shared memory silicon.

Profiling this code with ncu

```
__global__ void kernel(unsigned int* data){
    unsigned int fragment = data[threadIdx.x];
    for(int i = 0; i < 256; i++){
        asm("movmatrix.sync.aligned.m8n8.trans.b16 %0, %0;\n\t"
            : "+r"(fragment)
            :);
    }
    data[threadIdx.x] = fragment;
}

int main(){
    unsigned int* d_data;
    cudaMalloc(&d_data, sizeof(unsigned int) * 512);
    kernel<<<512, 512>>>(d_data);
    cudaDeviceSynchronize();
}
```

shows LSU as the most utilized pipe. The memory section shows > 40% shared memory utilization, and the top stall reasons are short scoreboard and MIO throttle.

Looks like it uses shared memory hardware.
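For anyone wanting to reproduce this, a build-and-profile invocation along these lines should work (the architecture flag and file name are placeholders):

```
nvcc -arch=sm_86 -o movmatrix_bench movmatrix_bench.cu
ncu --set full ./movmatrix_bench
```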

For FP16 x FP16 → FP16, the consumer-class GPUs (assuming you use those) can do 128 actual (dense) FMAs per cycle per partition, or 512 per SM.

Each sparse m16n8k16 mma needs 1024 FMAs, i.e. 2048 for comp1 + comp2, which takes 4 cycles.

The two movmatrix probably together need 2 cycles from the MIO pipeline.

So depending on whether you also load input or store output from/in shared memory (e.g. 2 cycles to load B, 2 cycles to store E), either tensor cores or shared memory will be the limiting factor.

If it is shared memory I would reconsider making one computation dense to save those 2 cycles. If it is tensor cores, what you envisioned is the fastest way.

If you can choose between 6/4 and 4/6 for tc/mio cycles (one comp dense vs. both sparse plus movmatrix), interleave both variants and you end up at about 5 cycles.

@striker159
Thanks for the profiling.

This is the first time I've learned that shuffle and movmatrix also use shared memory resources, which is interesting to me.

@Curefab
Thanks for the insight.

I’m currently working on a consumer-class GPU and eventually aiming for industrial-class GPUs. In that case, I think the computation might take even fewer cycles (for an A100, I assume 2 cycles for each comp, since it does 256 FMAs/cycle), which would make the movmatrix overhead more noticeable.

You are absolutely right that I also need to load/store to/from smem, which might be the bottleneck. Making one of them dense saves the movmatrix cycles, but it also increases gmem and smem traffic (given a fixed global matrix size). So I guess I'll have to try the different combinations and compare the results.