
Conversation

@joker-eph
Collaborator

@joker-eph joker-eph commented Sep 24, 2025

When executed in the context of canonicalization, folders are invoked in a fixed-point iterative process. In the context of an API like createOrFold(), or in DialectConversion, however, we expect a "one-shot" call to fold to produce a result that is as folded as possible. Yet even when the folders themselves are idempotent, the folders on a given operation interact with each other. For example:

// X = 0 + Y
%X = arith.addi %c_0, %Y : i32

should fold to %Y, but getting there requires the folder provided by the IsCommutative trait to first move the constant to the right-hand side. That trait folder only runs after the operation's own folder has been attempted, and the operation folder isn't attempted again after the trait folder applies.

This commit makes sure we iterate folder applications until a fixed point is reached.

Fixes #159844
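
For reference, here is a minimal sketch of the fixed-point idea (simplified from the diff below, which additionally emits a debug log per iteration): keep re-invoking fold() as long as it succeeds in place, i.e. produces no replacement values, so that an in-place trait folder can enable the operation's own folder on the next pass.

#include "mlir/IR/OpDefinition.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Fold `op` repeatedly until a fixed point. An in-place fold leaves
// `foldResults` empty, so retry; stop once fold() fails (nothing left
// to do) or produces replacement values for the op's results.
static LogicalResult foldToFixedPoint(Operation *op,
                                      SmallVectorImpl<OpFoldResult> &foldResults) {
  if (failed(op->fold(foldResults)))
    return failure();
  while (foldResults.empty() && succeeded(op->fold(foldResults))) {
    // Each successful in-place fold may enable the next one, e.g. the
    // IsCommutative trait folder moving the constant to the right.
  }
  return success();
}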

@llvmbot
Member

llvmbot commented Sep 24, 2025

@llvm/pr-subscribers-llvm-adt
@llvm/pr-subscribers-mlir-core
@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-gpu

Author: Mehdi Amini (joker-eph)

Changes

When executed in the context of canonicalization, folders are invoked in a fixed-point iterative process. In the context of an API like createOrFold(), or in DialectConversion, however, we expect a "one-shot" call to fold to produce a result that is as folded as possible. Yet even when the folders themselves are idempotent, the folders on a given operation interact with each other. For example:

// X = 0 + Y
%X = arith.addi %c_0, %Y : i32

should fold to %Y, but getting there requires the folder provided by the IsCommutative trait to first move the constant to the right-hand side. That trait folder only runs after the operation's own folder has been attempted, and the operation folder isn't attempted again after the trait folder applies.

This commit makes sure we iterate folder applications until a fixed point is reached.


Full diff: https://github.com/llvm/llvm-project/pull/160615.diff

6 Files Affected:

  • (modified) mlir/lib/IR/Builders.cpp (+7)
  • (modified) mlir/lib/Transforms/Utils/FoldUtils.cpp (+10-2)
  • (modified) mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir (+2-6)
  • (modified) mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir (+2-4)
  • (modified) mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir (+7-13)
  • (modified) mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir (+5-11)
diff --git a/mlir/lib/IR/Builders.cpp b/mlir/lib/IR/Builders.cpp
index 3d366276b4375..6373b1ead94d1 100644
--- a/mlir/lib/IR/Builders.cpp
+++ b/mlir/lib/IR/Builders.cpp
@@ -14,6 +14,7 @@
 #include "mlir/IR/IRMapping.h"
 #include "mlir/IR/Matchers.h"
 #include "llvm/ADT/SmallVectorExtras.h"
+#include "llvm/Support/DebugLog.h"

 using namespace mlir;

@@ -489,6 +490,12 @@ OpBuilder::tryFold(Operation *op, SmallVectorImpl<Value> &results,
   if (failed(op->fold(foldResults)))
     return cleanupFailure();

+  int count = 0;
+  do {
+    LDBG() << "Folded in place #" << count++
+           << " times: " << OpWithFlags(op, OpPrintingFlags().skipRegions());
+  } while (foldResults.empty() && succeeded(op->fold(foldResults)));
+
   // An in-place fold does not require generation of any constants.
   if (foldResults.empty())
     return success();
diff --git a/mlir/lib/Transforms/Utils/FoldUtils.cpp b/mlir/lib/Transforms/Utils/FoldUtils.cpp
index 5e07509871ea2..70310685315c4 100644
--- a/mlir/lib/Transforms/Utils/FoldUtils.cpp
+++ b/mlir/lib/Transforms/Utils/FoldUtils.cpp
@@ -16,6 +16,7 @@
 #include "mlir/IR/Builders.h"
 #include "mlir/IR/Matchers.h"
 #include "mlir/IR/Operation.h"
+#include "llvm/Support/DebugLog.h"

 using namespace mlir;

@@ -226,8 +227,15 @@ bool OperationFolder::isFolderOwnedConstant(Operation *op) const {
 LogicalResult OperationFolder::tryToFold(Operation *op,
                                          SmallVectorImpl<Value> &results) {
   SmallVector<OpFoldResult, 8> foldResults;
-  if (failed(op->fold(foldResults)) ||
-      failed(processFoldResults(op, results, foldResults)))
+  if (failed(op->fold(foldResults)))
+    return failure();
+  int count = 1;
+  do {
+    LDBG() << "Folded in place #" << count++
+           << " times: " << OpWithFlags(op, OpPrintingFlags().skipRegions());
+  } while (foldResults.empty() && succeeded(op->fold(foldResults)));
+
+  if (failed(processFoldResults(op, results, foldResults)))
     return failure();
   return success();
 }
diff --git a/mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir b/mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir
index 547c7355e00c6..b73bc69393dab 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir
@@ -7,10 +7,8 @@ gpu.module @test {
     //CHECK: [[IDY:%.+]] = affine.apply #map()[[[sgId]]]
     //CHECK: [[c32:%.+]] = arith.constant 32 : index
     //CHECK: [[LOCALY:%.+]] = index.mul [[IDY]], [[c32]]
-    //CHECK: [[c0:%.+]] = arith.constant 0 : index
-    //CHECK: [[Y:%.+]] = arith.addi [[LOCALY]], [[c0]] : index
     //CHECK: [[c128:%.+]] = arith.constant 128 : index
-    //CHECK: [[MODY:%.+]] = index.remu [[Y]], [[c128]]
+    //CHECK: [[MODY:%.+]] = index.remu [[LOCALY]], [[c128]]
     //CHECK: [[BASE:%.+]] = vector.step : vector<32xindex>
     //CHECK: [[CAST:%.+]] = vector.broadcast [[MODY]] : index to vector<32xindex>
     //CHECK: [[ADD:%.+]] = arith.addi [[BASE]], [[CAST]] : vector<32xindex>
@@ -23,10 +21,8 @@ gpu.module @test {
     //CHECK: [[IDY:%.+]] = affine.apply #map()[[[sgId]]]
     //CHECK: [[c32:%.+]] = arith.constant 32 : index
     //CHECK: [[LOCALY:%.+]] = index.mul [[IDY]], [[c32]]
-    //CHECK: [[c0:%.+]] = arith.constant 0 : index
-    //CHECK: [[Y:%.+]] = arith.addi [[LOCALY]], [[c0]] : index
     //CHECK: [[c128:%.+]] = arith.constant 128 : index
-    //CHECK: [[MODY:%.+]] = index.remu [[Y]], [[c128]]
+    //CHECK: [[MODY:%.+]] = index.remu [[LOCALY]], [[c128]]
     //CHECK: [[BASE:%.+]] = vector.step : vector<32xindex>
     //CHECK: [[CAST:%.+]] = vector.broadcast [[MODY]] : index to vector<32xindex>
     //CHECK: [[ADD:%.+]] = arith.addi [[BASE]], [[CAST]] : vector<32xindex>
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
index e5cc65e6bd3d7..d2d250cbe0f66 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
@@ -27,12 +27,10 @@ gpu.module @test_round_robin_assignment {
    //CHECK: [[LX:%.+]] = index.mul [[IdX]], [[C64]]
    //CHECK: [[C0:%.+]] = arith.constant 0 : index
    //CHECK: [[C0_1:%.+]] = arith.constant 0 : index
-   //CHECK: [[ADDY:%.+]] = arith.addi [[LY]], [[C0]] : index
-   //CHECK: [[ADDX:%.+]] = arith.addi [[LX]], [[C0_1]] : index
    //CHECK: [[C128:%.+]] = arith.constant 128 : index
-   //CHECK: [[offY:%.+]] = index.remu [[ADDY]], [[C128]]
+   //CHECK: [[offY:%.+]] = index.remu [[LY]], [[C128]]
    //CHECK: [[C64_2:%.+]] = arith.constant 64 : index
-   //CHECK: [[offX:%.+]] = index.remu [[ADDX]], [[C64_2]]
+   //CHECK: [[offX:%.+]] = index.remu [[LX]], [[C64_2]]
    //CHECK: xegpu.create_nd_tdesc [[ARG_0]][[[offY]], [[offX]]] : memref<256x128xf32> -> !xegpu.tensor_desc<16x64xf32>
    %tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<256x128xf32> -> !xegpu.tensor_desc<128x64xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 64]>>
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
index 3478a9b91da5f..31d015d89a2dc 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
@@ -325,12 +325,10 @@ gpu.module @test_distribution {
    //CHECK: [[l_off_x:%.+]] = index.mul [[id_x]], [[c32_1]]
    //CHECK: [[c0:%.+]] = arith.constant 0 : index
    //CHECK: [[c0_1:%.+]] = arith.constant 0 : index
-   //CHECK: [[l_off_y_0:%.+]] = arith.addi [[l_off_y]], [[c0]] : index
-   //CHECK: [[l_off_x_0:%.+]] = arith.addi [[l_off_x]], [[c0_1]] : index
    //CHECK: [[c64:%.+]] = arith.constant 64 : index
-   //CHECK: [[off_y:%.+]] = index.remu [[l_off_y_0]], [[c64]]
+   //CHECK: [[off_y:%.+]] = index.remu [[l_off_y]], [[c64]]
    //CHECK: [[c128:%.+]] = arith.constant 128 : index
-   //CHECK: [[off_x:%.+]] = index.remu [[l_off_x_0]], [[c128]]
+   //CHECK: [[off_x:%.+]] = index.remu [[l_off_x]], [[c128]]
    //CHECK: xegpu.load_matrix [[mdesc]][[[off_y]], [[off_x]]] <{layout = #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1]>}>: !xegpu.mem_desc<64x128xf32>, index, index -> vector<32x32xf32>
    %0 = xegpu.create_mem_desc %arg0 : memref<32768xi8, 3> -> !xegpu.mem_desc<64x128xf32>
    %1 = xegpu.load_matrix %0[0, 0] <{layout = #xegpu.layout<sg_layout = [2, 4], sg_data = [32, 32], lane_layout = [2, 8], lane_data = [1, 1]>}>: !xegpu.mem_desc<64x128xf32> -> vector<64x128xf32>
@@ -349,13 +347,11 @@ gpu.module @test_distribution {
    //CHECK: [[id_y:%.+]] = affine.apply #map()[[[sgid]]]
    //CHECK: [[id_x:%.+]] = affine.apply #map1()[[[sgid]]]
    //CHECK: [[c32:%.+]] = arith.constant 32 : index
-   //CHECK: [[l_off_y_0:%.+]] = index.mul [[id_y]], [[c32]]
+   //CHECK: [[l_off_y:%.+]] = index.mul [[id_y]], [[c32]]
    //CHECK: [[c32_1:%.+]] = arith.constant 32 : index
-   //CHECK: [[l_off_x_0:%.+]] = index.mul [[id_x]], [[c32_1]]
+   //CHECK: [[l_off_x:%.+]] = index.mul [[id_x]], [[c32_1]]
    //CHECK: [[c0:%.+]] = arith.constant 0 : index
    //CHECK: [[c0_2:%.+]] = arith.constant 0 : index
-   //CHECK: [[l_off_y:%.+]] = arith.addi [[l_off_y_0]], [[c0]] : index
-   //CHECK: [[l_off_x:%.+]] = arith.addi [[l_off_x_0]], [[c0_2]] : index
    //CHECK: [[c64:%.+]] = arith.constant 64 : index
    //CHECK: [[off_y:%.+]] = index.remu [[l_off_y]], [[c64]]
    //CHECK: [[c128:%.+]] = arith.constant 128 : index
@@ -372,11 +368,10 @@ gpu.module @test_distribution {
    //CHECK: [[sgId:%.+]] = gpu.subgroup_id : index
    //CHECK-DAG: [[IDY:%.+]] = affine.apply #map2()[[[sgId]]]
    //CHECK-DAG: [[c32:%.+]] = arith.constant 32 : index
-   //CHECK-DAG: [[LOCALY:%.+]] = index.mul [[IDY]], [[c32]]
+   //CHECK-DAG: [[LY:%.+]] = index.mul [[IDY]], [[c32]]
    //CHECK-DAG: [[c0:%.+]] = arith.constant 0 : index
-   //CHECK-DAG: [[Y:%.+]] = arith.addi [[LOCALY]], [[c0]] : index
    //CHECK-DAG: [[c128:%.+]] = arith.constant 128 : index
-   //CHECK-DAG: [[MODY:%.+]] = index.remu [[Y]], [[c128]]
+   //CHECK-DAG: [[MODY:%.+]] = index.remu [[LY]], [[c128]]
    //CHECK-DAG: [[BASE:%.+]] = vector.step : vector<32xindex>
    //CHECK-DAG: [[CAST:%.+]] = vector.broadcast [[MODY]] : index to vector<32xindex>
    //CHECK: [[ADD:%.+]] = arith.addi [[BASE]], [[CAST]] : vector<32xindex>
@@ -390,9 +385,8 @@ gpu.module @test_distribution {
    //CHECK-DAG: [[c8:%.+]] = arith.constant 8 : index
    //CHECK-DAG: [[LOCALY:%.+]] = index.mul [[sgId]], [[c8]]
    //CHECK-DAG: [[c0:%.+]] = arith.constant 0 : index
-   //CHECK-DAG: [[Y:%.+]] = arith.addi [[LOCALY]], [[c0]] : index
    //CHECK-DAG: [[c128:%.+]] = arith.constant 128 : index
-   //CHECK-DAG: [[MODY:%.+]] = index.remu [[Y]], [[c128]]
+   //CHECK-DAG: [[MODY:%.+]] = index.remu [[LOCALY]], [[c128]]
    //CHECK-DAG: [[BASE:%.+]] = vector.step : vector<8xindex>
    //CHECK-DAG: [[CAST:%.+]] = vector.broadcast [[MODY]] : index to vector<8xindex>
    //CHECK: [[ADD:%.+]] = arith.addi [[BASE]], [[CAST]] : vector<8xindex>
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
index c0fb373835e3d..e83229e3a3995 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
@@ -14,12 +14,10 @@ gpu.module @test_1_1_assignment {
    //CHECK: [[LX:%.+]] = index.mul [[SGIDX]], [[C32]]
    //CHECK: [[C0:%.+]] = arith.constant 0 : index
    //CHECK: [[C0_1:%.+]] = arith.constant 0 : index
-   //CHECK: [[UY:%.+]] = arith.addi [[LY]], [[C0]] : index
-   //CHECK: [[UX:%.+]] = arith.addi [[LX]], [[C0_1]] : index
    //CHECK: [[C256:%.+]] = arith.constant 256 : index
-   //CHECK: [[Y:%.+]] = index.remu [[UY]], [[C256]]
+   //CHECK: [[Y:%.+]] = index.remu [[LY]], [[C256]]
    //CHECK: [[C128:%.+]] = arith.constant 128 : index
-   //CHECK: [[X:%.+]] = index.remu [[UX]], [[C128]]
+   //CHECK: [[X:%.+]] = index.remu [[LX]], [[C128]]
    //CHECK: [[TDESC:%.+]] = xegpu.create_nd_tdesc [[ARG_0]][[[Y]], [[X]]] : memref<256x128xf32> -> !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
    %tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<256x128xf32> -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
@@ -37,17 +35,13 @@ gpu.module @test_1_1_assignment {
    //CHECK: [[LX:%.+]] = index.mul [[SGIDX]], [[C32]]
    //CHECK: [[C0:%.+]] = arith.constant 0 : index
    //CHECK: [[C0_2:%.+]] = arith.constant 0 : index
-   //CHECK: [[UY:%.+]] = arith.addi [[LY]], [[C0]] : index
-   //CHECK: [[UX:%.+]] = arith.addi [[LX]], [[C0_2]] : index
    //CHECK: [[C256:%.+]] = arith.constant 256 : index
-   //CHECK: [[MODY:%.+]] = index.remu [[UY]], [[C256]]
+   //CHECK: [[MODY:%.+]] = index.remu [[LY]], [[C256]]
    //CHECK: [[C128:%.+]] = arith.constant 128 : index
-   //CHECK: [[MODX:%.+]] = index.remu [[UX]], [[C128]]
+   //CHECK: [[MODX:%.+]] = index.remu [[LX]], [[C128]]
    //CHECK: [[C0_3:%.+]] = arith.constant 0 : index
-   //CHECK: [[Y:%.+]] = index.add [[MODY]], [[C0_3]]
    //CHECK: [[C0_4:%.+]] = arith.constant 0 : index
-   //CHECK: [[X:%.+]] = index.add [[MODX]], [[C0_4]]
-   //CHECK: [[TDESC:%.+]] = xegpu.create_nd_tdesc [[ARG_0]][1, [[Y]], [[X]]] : memref<3x256x128xf32> -> !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+   //CHECK: [[TDESC:%.+]] = xegpu.create_nd_tdesc [[ARG_0]][1, [[MODY]], [[MODX]]] : memref<3x256x128xf32> -> !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
    %tdesc = xegpu.create_nd_tdesc %src[1, 0, 0] : memref<3x256x128xf32> -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
    gpu.return
Member

@kuhar kuhar left a comment


This makes sense to me. Do you have some statistics on the usual number of iterations and if this adds measurable overhead?

@joker-eph
Collaborator Author

joker-eph commented Sep 24, 2025

This makes sense to me. Do you have some statistics on the usual number of iterations and if this adds measurable overhead?

This necessarily adds some overhead; I don't have stats though (it's going to be heavily dependent on the setup: how the folders are structured and what the input IR looks like).

For the number of iterations: it should be 1 or 2 in general, but that assumes the folders are well written (which we'd like them to be anyway, since folding is already iterative in canonicalization).

@kuhar
Member

kuhar commented Sep 24, 2025

This necessarily adds some overhead; I don't have stats though (it's going to be heavily dependent on the setup: how the folders are structured and what the input IR looks like).

I can cherry-pick this and check by compiling something like llama/SDXL in IREE. I should be able to find time for this in the next couple of days.

@joker-eph
Collaborator Author

joker-eph commented Sep 24, 2025

FYI, I added an assert(count < 3); in the iteration, and it fires only once upstream, apparently from a folding pattern on spirv.func.

@joker-eph
Collaborator Author

Actually it was vector.extract, but in the context of legalizing to spirv:

[dialect-conversion:1] Legalizing operation : 'vector.extract' (0x560f502f5100) {
[dialect-conversion:1]   %14 = "vector.extract"(%13, %1) <{static_position = array<i64: -9223372036854775808>}> : (vector<4xi32>, index) -> i32
[dialect-conversion:1]   * Fold {
[Builders.cpp:496 1] Folded in place #1 times: %14 = "vector.extract"(%11) <{static_position = array<i64: 1>}> : (vector<2xi32>) -> i32
[Builders.cpp:496 1] Folded in place #2 times: %14 = "vector.extract"(%3) <{static_position = array<i64: 1>}> : (vector<2xi32>) -> i32
[Builders.cpp:496 1] Folded in place #3 times: %14 = "vector.extract"(%3) <{static_position = array<i64: 1>}> : (vector<2xi32>) -> i32

Seems like it first folds the dynamic position to a static one, then folds again by looking through the operand somehow?

Member

@matthias-springer matthias-springer left a comment


This change makes sense to me.

The same strategy could potentially be used in the greedy pattern rewrite driver to improve efficiency: we could try to fold the op in a loop until no more folding is possible. This may be more efficient than pushing/popping the op onto the worklist. (We still have to put the op on the worklist after the last successful folding, so it's unclear if this actually improves compilation time.)
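
For illustration, here is a hedged sketch of that idea (not code from this PR; the worklist and its handling are hypothetical, and replacement of results when fold() produces values is elided): drain successive in-place folds when an op is popped, instead of one fold per worklist round.

#include <deque>

#include "mlir/IR/OpDefinition.h"

using namespace mlir;

// Hypothetical greedy-driver fragment: fold each popped op to a fixed
// point before deciding what to re-enqueue.
static void processWorklist(std::deque<Operation *> &worklist) {
  while (!worklist.empty()) {
    Operation *op = worklist.front();
    worklist.pop_front();
    SmallVector<OpFoldResult> foldResults;
    if (failed(op->fold(foldResults)))
      continue; // nothing folded
    // Drain in-place folds in one go rather than one per round.
    while (foldResults.empty() && succeeded(op->fold(foldResults))) {
    }
    if (foldResults.empty())
      worklist.push_back(op); // folded in place; visit once more
    // else: the op's results would be replaced here (elided).
  }
}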

Member

@Groverkss Groverkss left a comment


Nice! I always just assumed this happened for some reason.

@dcaballe
Contributor

dcaballe commented Sep 26, 2025

Nice!

we expect a "one-shot" call to fold to be as "folded" as possible.

I'm curious, how could we verify now that a particular folder fully applies in one iteration? As you know, we previously had ordering issues in the vector.extract folder that prevented it from folding completely in one iteration, even though it should have. Would it make sense to pass the maximum number of iterations as a parameter somehow so that test passes could explicitly set it to one, or something like that?

@joker-eph joker-eph force-pushed the fold_iterative branch 2 times, most recently from 63ded52 to 4232f1c on September 26, 2025 at 17:27
@joker-eph
Collaborator Author

@dcaballe : I added a "maxIterations" to the API, and an option to the test pass.
That way the pass has the same default behavior as before (it checks a single iteration of the folder), as shown in the newly added test. PTAL!
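
For reference, a hedged sketch of what such a capped variant could look like (the exact API signature and option name are the ones in the PR, not reproduced here):

#include "mlir/IR/OpDefinition.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Same fixed-point loop, but bounded: with maxIterations = 1 a test
// pass can verify that a folder fully applies in a single iteration.
static LogicalResult foldUpTo(Operation *op,
                              SmallVectorImpl<OpFoldResult> &foldResults,
                              int maxIterations) {
  if (failed(op->fold(foldResults)))
    return failure();
  int iterations = 1;
  while (foldResults.empty() && iterations < maxIterations &&
         succeeded(op->fold(foldResults)))
    ++iterations;
  return success();
}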

@joker-eph
Collaborator Author

This necessarily adds some overhead; I don't have stats though (it's going to be heavily dependent on the setup: how the folders are structured and what the input IR looks like).

I can cherry-pick this and check by compiling something like llama/SDXL in IREE. I should be able to find time for this in the next couple of days.

@kuhar just a reminder on this; it isn't urgent, but I wouldn't want it to get forgotten either. Thanks!

@efric
Contributor

efric commented Sep 26, 2025

@joker-eph

Cherry-picked this and tested on SDXL in IREE.

Some stats:

124054 mlir-builder-try-to-fold - Number of OpBuilder::tryFold loop total calls
     3 mlir-builder-try-to-fold - Number of OpBuilder::tryFold loop max iterations
124247 mlir-builder-try-to-fold - Number of OpBuilder::tryFold loop total iterations
    86 mlir-builder-try-to-fold - OpBuilder::tryFold total loop time (us)

So the average number of iterations is roughly 1 (124247 total iterations over 124054 calls ≈ 1.002).

No calls to the one in FoldUtils.cpp.

No meaningful increase in total compilation time 😄.
