[mlir][xegpu] Add OptimizeTranspose pass. #165483
Conversation
Some preliminary comments
/// Helper to get the size range of a 2D block that can be transposed by HW.
/// TODO: Use uArch to get supported block ranges.
static Allowed2DShapeRange getTransposableBlockRange(int bitWidth) {
So now that uArch is upstreamed, shouldn't this already be incorporated there somehow?
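For context on the helper under discussion, a minimal standalone sketch of what such a range lookup might look like. The struct layout and the numeric bounds here are illustrative assumptions only; as the reviewer notes, the real values should come from uArch.

```cpp
#include <cassert>

// Hypothetical stand-in for the helper under review. The bounds below are
// illustrative placeholders, not actual uArch values.
struct Allowed2DShapeRange {
  int minHeight, maxHeight;
  int minWidth, maxWidth;
};

static Allowed2DShapeRange getTransposableBlockRange(int bitWidth) {
  // Illustrative assumption: HW transposes 32-bit blocks of up to 32x8, and
  // narrower element types are packed into 32-bit elements first, so the
  // same post-packing range applies regardless of the input bit width.
  (void)bitWidth;
  return {/*minHeight=*/1, /*maxHeight=*/32, /*minWidth=*/1, /*maxWidth=*/8};
}
```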
}

/// A layout can be optimized if its lane layout is transposed (lane[0] != 1 &&
/// lane[1] == 1), but inner lane data is not equal to [1, 1].
An illustrative example with shapes and layouts, and explanations of the benefit at the top of the cpp file would be helpful.
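The documented condition can be sketched as a standalone predicate. Plain `std::array` stands in here for the MLIR layout attributes; this is a reading of the doc comment, not the patch's actual implementation.

```cpp
#include <array>
#include <cassert>

// Sketch of the condition from the doc comment: a layout is a candidate when
// its lane layout is transposed (lane[0] != 1 && lane[1] == 1) but its inner
// lane data is not already [1, 1].
static bool canBeOptimized(std::array<int, 2> laneLayout,
                           std::array<int, 2> laneData) {
  bool transposedLayout = laneLayout[0] != 1 && laneLayout[1] == 1;
  bool unitLaneData = laneData[0] == 1 && laneData[1] == 1;
  return transposedLayout && !unitLaneData;
}
```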
static xegpu::TensorDescType tryOptimize(xegpu::TensorDescType tdescType) {
  if (!canBeOptimized(tdescType))
    return tdescType;
  auto laneData = getMaybeLaneData(tdescType).value();
What happens to laneData[1] if it does not have a value?
  Type newElemTy = IntegerType::get(tdescType.getContext(), newBitWidth);
  // Supported shape is the max transpose shape that can be supported by
  // hardware that is less than or equal to required shape.
  auto supportedHeight = std::min(
What if the minimum is the user-supplied requiredShape[0], but it is not supported by HW?
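The clamping step being questioned can be sketched as follows; the bail-out for a requirement below the HW minimum is exactly the case a bare `std::min` would miss. All names here are hypothetical, for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hedged sketch: pick the largest HW-supported height that does not exceed
// the required height. Return nullopt when the requirement falls below the
// HW minimum, which std::min alone would not catch.
static std::optional<int> chooseSupportedHeight(int requiredHeight,
                                                int hwMinHeight,
                                                int hwMaxHeight) {
  if (requiredHeight < hwMinHeight)
    return std::nullopt; // unsupported: cannot round down into the HW range
  return std::min(requiredHeight, hwMaxHeight);
}
```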
  xegpu::LayoutAttr newLayout = xegpu::LayoutAttr::get(
      tdescType.getContext(),
      tdescType.getLayoutAttr().getLaneLayout().asArrayRef(), {1, 1});
  // Array length can not be larger than 1 for transpose case.
Is this 1 a uArch-specific parameter?
}

/// Helper to create a constant index value.
static Value createConstantIndex(ConversionPatternRewriter &rewriter,
The helper returns a `Value`, but then the code uses it as `...Op =`:
`auto constantOp = createConstantIndex(`
What was the motivation for such a short helper? Isn't there a `create` that returns a `Value` already?
    xegpu::LoadNdOp origLoadOp) {
  Location loc = data.getLoc();
  assert(offsets.size() >= 2 && "Expecting at least 2 offsets for 2D LoadNdOp");
  Value offsetX = convertToValue(rewriter, loc, offsets[offsets.size() - 2]);
The innermost dimension (the last one, linear) is usually along the X axis, and I do not see that _nd ops suggest otherwise. Is the "transpose" nature of the pass the reason to do it otherwise?
  return data;
}

/// Checks is a CreateNdDescOp can be optimized for transpose, if so creates a
nit
- /// Checks is a CreateNdDescOp can be optimized for transpose, if so creates a
+ /// Checks if a CreateNdDescOp can be optimized for transpose, if so creates a
  auto maybeConstInnerStride = getConstantIntValue(strides.back());
  // Only row-major memrefs are expected for now.
  if (!maybeConstInnerStride || *maybeConstInnerStride != 1)
    return failure();
The above comment seems good enough to be a failure message.
      rewriter, createNdOp.getLoc(), source);
  source = arith::IndexCastOp::create(
      rewriter, createNdOp.getLoc(),
      IntegerType::get(rewriter.getContext(), 64),
nit: rewriter.getI64Type()?
This pass rewrites certain xegpu `CreateNd` and `LoadNd` operations that feed into `vector.transpose` into a more optimal form to improve performance. Specifically, low-precision (bitwidth < 32) `LoadNd` ops that feed into transpose ops are rewritten to i32 loads with a valid transpose layout, so that later passes can use the hardware's load-with-transpose feature to accelerate such loads.
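The bit-width folding this description implies can be sketched with simple shape arithmetic: packing a low-precision 2D load into i32 elements shrinks the inner dimension by a factor of 32 / bitWidth. The helper name and struct are illustrative, not from the patch.

```cpp
#include <cassert>

// Hedged sketch of the i32 packing implied by the pass description: a
// low-precision load of shape [rows, cols] becomes an i32 load whose inner
// dimension is divided by the packing factor.
struct Shape2D {
  int rows, cols;
};

static Shape2D packedI32LoadShape(Shape2D shape, int bitWidth) {
  int factor = 32 / bitWidth; // e.g. 2 for 16-bit types, 4 for 8-bit types
  return {shape.rows, shape.cols / factor};
}
```

For example, a 32x16 f16 load would, under this sketch, be re-expressed as a 32x8 i32 load before the transpose.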