
Conversation

@hanhanW
Contributor

@hanhanW hanhanW commented Jun 2, 2025

With ffb9bbf, the memref.assume_alignment op now returns a result value. This revision updates the tests to reflect the change (a minimal sketch of the pattern follows the list):

  • Update all the lit tests to use the result of memref.assume_alignment, if it is present.
  • Capture the result of the op in lit tests.
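A minimal sketch of the update pattern, assuming a hypothetical test function (the function and value names below are illustrative, not taken from the actual test files):

func.func @aligned_load(%src: memref<4x4xf16>, %i: index, %j: index) -> f16 {
  // Before: the op's result was dropped and the original memref was loaded directly:
  //   memref.assume_alignment %src, 16 : memref<4x4xf16>
  //   %v = memref.load %src[%i, %j] : memref<4x4xf16>
  // After: the result is captured and the consumer reads through it.
  %aligned = memref.assume_alignment %src, 16 : memref<4x4xf16>
  %v = memref.load %aligned[%i, %j] : memref<4x4xf16>
  return %v : f16
}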
Signed-off-by: hanhanW <hanhan0912@gmail.com>
@llvmbot
Member

llvmbot commented Jun 2, 2025

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir-memref

Author: Han-Chung Wang (hanhanW)

Changes

With ffb9bbf, the memref.assume_alignment op now returns a result value. This revision updates the tests to reflect the change:

  • Update all the lit tests to use the result of memref.assume_alignment, if it is present.
  • Capture the result of the op in lit tests.

Full diff: https://github.com/llvm/llvm-project/pull/142358.diff

5 Files Affected:

  • (modified) mlir/test/Conversion/MemRefToLLVM/memref-to-llvm.mlir (+2-2)
  • (modified) mlir/test/Dialect/MemRef/emulate-narrow-type.mlir (+1-1)
  • (modified) mlir/test/Dialect/MemRef/ops.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/sm90/gemm_f32_f16_f16_128x128x128.mlir (+3-3)
  • (modified) mlir/test/Integration/GPU/CUDA/sm90/gemm_pred_f32_f16_f16_128x128x128.mlir (+3-3)
diff --git a/mlir/test/Conversion/MemRefToLLVM/memref-to-llvm.mlir b/mlir/test/Conversion/MemRefToLLVM/memref-to-llvm.mlir
index 8c863bb2d3d65..acfc188574255 100644
--- a/mlir/test/Conversion/MemRefToLLVM/memref-to-llvm.mlir
+++ b/mlir/test/Conversion/MemRefToLLVM/memref-to-llvm.mlir
@@ -189,7 +189,7 @@ func.func @assume_alignment(%0 : memref<4x4xf16>) {
   // CHECK-NEXT: %[[ALIGN:.*]] = llvm.mlir.constant(16 : index) : i64
   // CHECK-NEXT: llvm.intr.assume %[[TRUE]] ["align"(%[[PTR]], %[[ALIGN]] : !llvm.ptr, i64)] : i1
   // CHECK-INTERFACE: llvm.intr.assume
-  memref.assume_alignment %0, 16 : memref<4x4xf16>
+  %1 = memref.assume_alignment %0, 16 : memref<4x4xf16>
   return
 }
@@ -205,7 +205,7 @@ func.func @assume_alignment_w_offset(%0 : memref<4x4xf16, strided<[?, ?], offset
   // CHECK-DAG: %[[ALIGN:.*]] = llvm.mlir.constant(16 : index) : i64
   // CHECK-NEXT: llvm.intr.assume %[[TRUE]] ["align"(%[[BUFF_ADDR]], %[[ALIGN]] : !llvm.ptr, i64)] : i1
   // CHECK-INTERFACE: llvm.intr.assume
-  memref.assume_alignment %0, 16 : memref<4x4xf16, strided<[?, ?], offset: ?>>
+  %1 = memref.assume_alignment %0, 16 : memref<4x4xf16, strided<[?, ?], offset: ?>>
   return
 }
 // -----
diff --git a/mlir/test/Dialect/MemRef/emulate-narrow-type.mlir b/mlir/test/Dialect/MemRef/emulate-narrow-type.mlir
index 111a02abcc74c..3378d329e8205 100644
--- a/mlir/test/Dialect/MemRef/emulate-narrow-type.mlir
+++ b/mlir/test/Dialect/MemRef/emulate-narrow-type.mlir
@@ -63,7 +63,7 @@ func.func @memref_load_i4(%arg0: index) -> i4 {
 func.func @memref_load_i4_rank2(%arg0: index, %arg1: index) -> i4 {
   %0 = memref.alloc() : memref<3x125xi4>
-  %align0 =memref.assume_alignment %0, 64 : memref<3x125xi4>
+  %align0 = memref.assume_alignment %0, 64 : memref<3x125xi4>
   %1 = memref.load %align0[%arg0,%arg1] : memref<3x125xi4>
   return %1 : i4
 }
diff --git a/mlir/test/Dialect/MemRef/ops.mlir b/mlir/test/Dialect/MemRef/ops.mlir
index 38ee363a7d424..13fdf3cf13510 100644
--- a/mlir/test/Dialect/MemRef/ops.mlir
+++ b/mlir/test/Dialect/MemRef/ops.mlir
@@ -283,7 +283,7 @@ func.func @memref_view(%arg0 : index, %arg1 : index, %arg2 : index) {
 // CHECK-LABEL: func @assume_alignment
 // CHECK-SAME: %[[MEMREF:.*]]: memref<4x4xf16>
 func.func @assume_alignment(%0: memref<4x4xf16>) {
-  // CHECK: memref.assume_alignment %[[MEMREF]], 16 : memref<4x4xf16>
+  // CHECK: %{{.*}} = memref.assume_alignment %[[MEMREF]], 16 : memref<4x4xf16>
   %1 = memref.assume_alignment %0, 16 : memref<4x4xf16>
   return
 }
diff --git a/mlir/test/Integration/GPU/CUDA/sm90/gemm_f32_f16_f16_128x128x128.mlir b/mlir/test/Integration/GPU/CUDA/sm90/gemm_f32_f16_f16_128x128x128.mlir
index aaa3aff5350ad..6153a11622a4f 100644
--- a/mlir/test/Integration/GPU/CUDA/sm90/gemm_f32_f16_f16_128x128x128.mlir
+++ b/mlir/test/Integration/GPU/CUDA/sm90/gemm_f32_f16_f16_128x128x128.mlir
@@ -120,7 +120,7 @@ func.func @main() {
            threads(%arg3, %arg4, %arg5) in (%arg9 = %hc128, %arg10 = %hc1, %arg11 = %hc1)
            dynamic_shared_memory_size %shmemSize
   {
-    memref.assume_alignment %matrixD, 16 : memref<128x128xf32>
+    %align_matrixD = memref.assume_alignment %matrixD, 16 : memref<128x128xf32>

     %c256 = arith.constant 256 : index
     %c10000000 = arith.constant 10000000 : index
@@ -226,13 +226,13 @@ func.func @main() {
       scf.for %arg12 = %17 to %c128 step %c4 {
        %19 = arith.muli %18, %c4 : index
        %20 = vector.load %accShmemPtr[%arg12, %19] : memref<128x128xf32, 3>, vector<4xf32>
-       vector.store %20, %matrixD[%arg12, %19] : memref<128x128xf32>, vector<4xf32>
+       vector.store %20, %align_matrixD[%arg12, %19] : memref<128x128xf32>, vector<4xf32>
      }
     gpu.terminator
   }
   // Step 5. Copy D2H
-  %5 = gpu.memcpy async [%token] %matrixDHost, %matrixD : memref<128x128xf32>, memref<128x128xf32>
+  %5 = gpu.memcpy async [%token] %matrixDHost, %align_matrixD : memref<128x128xf32>, memref<128x128xf32>
   gpu.wait [%token]

   // Step 6. Compute on host
diff --git a/mlir/test/Integration/GPU/CUDA/sm90/gemm_pred_f32_f16_f16_128x128x128.mlir b/mlir/test/Integration/GPU/CUDA/sm90/gemm_pred_f32_f16_f16_128x128x128.mlir
index b257d2b0f1e34..b8e355712d37f 100644
--- a/mlir/test/Integration/GPU/CUDA/sm90/gemm_pred_f32_f16_f16_128x128x128.mlir
+++ b/mlir/test/Integration/GPU/CUDA/sm90/gemm_pred_f32_f16_f16_128x128x128.mlir
@@ -120,7 +120,7 @@ func.func @main() {
           threads(%arg3, %arg4, %arg5) in (%arg9 = %hc128, %arg10 = %hc1, %arg11 = %hc1)
           dynamic_shared_memory_size %shmemSize
   {
-    memref.assume_alignment %matrixD, 16 : memref<128x128xf32>
+    %align_matrixD = memref.assume_alignment %matrixD, 16 : memref<128x128xf32>

     %c256 = arith.constant 256 : index
     %c10000000 = arith.constant 10000000 : index
@@ -234,13 +234,13 @@ func.func @main() {
       scf.for %arg12 = %17 to %c128 step %c4 {
        %19 = arith.muli %18, %c4 : index
        %20 = vector.load %accShmemPtr[%arg12, %19] : memref<128x128xf32, 3>, vector<4xf32>
-       vector.store %20, %matrixD[%arg12, %19] : memref<128x128xf32>, vector<4xf32>
+       vector.store %20, %align_matrixD[%arg12, %19] : memref<128x128xf32>, vector<4xf32>
      }
     gpu.terminator
   }
   // Step 5. Copy D2H
-  %5 = gpu.memcpy async [%token] %matrixDHost, %matrixD : memref<128x128xf32>, memref<128x128xf32>
+  %5 = gpu.memcpy async [%token] %matrixDHost, %align_matrixD : memref<128x128xf32>, memref<128x128xf32>
   gpu.wait [%token]

   // Step 6. Compute on host
@hanhanW
Contributor Author

hanhanW commented Jun 2, 2025

I don't know why AnyMemRef does not force the IR to capture the result, but I noticed this while doing an integrate for a downstream project. A few features are still missing now that the op has the new semantics, and I'll help fix some of them.

let arguments = (ins AnyMemRef:$memref,
                     ConfinedAttr<I32Attr, [IntPositive]>:$alignment);
let results = (outs AnyMemRef:$result);
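Given that ODS, a lit test can capture and check the result explicitly. A minimal sketch, mirroring the updated ops.mlir test (the exact CHECK wording here is illustrative):

// CHECK-LABEL: func @assume_alignment
// CHECK: %{{.*}} = memref.assume_alignment %{{.*}}, 16 : memref<4x4xf16>
func.func @assume_alignment(%0: memref<4x4xf16>) {
  %1 = memref.assume_alignment %0, 16 : memref<4x4xf16>
  return
}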

@hanhanW hanhanW marked this pull request as draft June 2, 2025 10:46
Signed-off-by: hanhanW <hanhan0912@gmail.com>

  // Step 5. Copy D2H
- %5 = gpu.memcpy async [%token] %matrixDHost, %matrixD : memref<128x128xf32>, memref<128x128xf32>
+ %5 = gpu.memcpy async [%token] %matrixDHost, %align_matrixD : memref<128x128xf32>, memref<128x128xf32>
Contributor Author


I'm not sure whether we want to use align_matrixD here or not, so I moved the declaration outside the GPU kernel block. I'm happy to move it back if my change is not correct.

Member


No need for the alignment here. I mentioned the line above where we do need it.

@hanhanW hanhanW marked this pull request as ready for review June 2, 2025 10:50
@hanhanW
Contributor Author

hanhanW commented Jun 2, 2025

(I don't have a local setup for nvgpu, so I can only test it here.)

@hanhanW
Contributor Author

hanhanW commented Jun 2, 2025

I can't add @shay-kl to the reviewers, so I'm tagging you here instead.

@shay-kl
Contributor

shay-kl commented Jun 2, 2025

Don't have enough expertise to comment on the specifics of the GPU interaction, but otherwise LGTM.

  %19 = arith.muli %18, %c4 : index
  %20 = vector.load %accShmemPtr[%arg12, %19] : memref<128x128xf32, 3>, vector<4xf32>
- vector.store %20, %matrixD[%arg12, %19] : memref<128x128xf32>, vector<4xf32>
+ vector.store %20, %align_matrixD[%arg12, %19] : memref<128x128xf32>, vector<4xf32>
Member


We only need the aligned matrix here, so that we can generate a vectorized store.

Contributor Author


thanks!


  // Step 5. Copy D2H
- %5 = gpu.memcpy async [%token] %matrixDHost, %matrixD : memref<128x128xf32>, memref<128x128xf32>
+ %5 = gpu.memcpy async [%token] %matrixDHost, %align_matrixD : memref<128x128xf32>, memref<128x128xf32>
Member


No need for the alignment here. I mentioned the line above where we do need it.

Signed-off-by: hanhanW <hanhan0912@gmail.com>
@hanhanW hanhanW merged commit 77e2e3f into llvm:main Jun 2, 2025
11 checks passed
@hanhanW hanhanW deleted the assume_align_tests branch June 2, 2025 14:57