[mlir][xegpu] Add OptimizeTranspose pass. #165483
Conversation
Some preliminary comments
/// Helper to get the size range of a 2D block that can be transposed by HW.
/// TODO: Use uArch to get supported block ranges.
static Allowed2DShapeRange getTransposableBlockRange(int bitWidth) {
So now that uArch is upstreamed, shouldn't this already be incorporated there somehow?
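For context on the helper under discussion, a minimal standalone sketch of what such a range lookup might look like. The struct layout and the numeric bounds here are illustrative assumptions only; as the reviewer notes, the real values should come from uArch.

```cpp
#include <cassert>

// Hypothetical stand-in for the helper under review. The bounds below are
// illustrative placeholders, not actual uArch values.
struct Allowed2DShapeRange {
  int minHeight, maxHeight;
  int minWidth, maxWidth;
};

static Allowed2DShapeRange getTransposableBlockRange(int bitWidth) {
  // Illustrative assumption: HW transposes 32-bit blocks of up to 32x8, and
  // narrower element types are packed into 32-bit elements first, so the
  // same post-packing range applies regardless of the input bit width.
  (void)bitWidth;
  return {/*minHeight=*/1, /*maxHeight=*/32, /*minWidth=*/1, /*maxWidth=*/8};
}
```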
}

/// A layout can be optimized if its lane layout is transposed (lane[0] != 1 &&
/// lane[1] == 1), but inner lane data is not equal to [1, 1].
An illustrative example with shapes and layouts, and explanations of the benefit at the top of the cpp file would be helpful.
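The documented condition can be sketched as a standalone predicate. Plain `std::array` stands in here for the MLIR layout attributes; this is a reading of the doc comment, not the patch's actual implementation.

```cpp
#include <array>
#include <cassert>

// Sketch of the condition from the doc comment: a layout is a candidate when
// its lane layout is transposed (lane[0] != 1 && lane[1] == 1) but its inner
// lane data is not already [1, 1].
static bool canBeOptimized(std::array<int, 2> laneLayout,
                           std::array<int, 2> laneData) {
  bool transposedLayout = laneLayout[0] != 1 && laneLayout[1] == 1;
  bool unitLaneData = laneData[0] == 1 && laneData[1] == 1;
  return transposedLayout && !unitLaneData;
}
```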
static xegpu::TensorDescType tryOptimize(xegpu::TensorDescType tdescType) {
  if (!canBeOptimized(tdescType))
    return tdescType;
  auto laneData = getMaybeLaneData(tdescType).value();
What happens to laneData[1] if it does not have a value?
  Type newElemTy = IntegerType::get(tdescType.getContext(), newBitWidth);
  // Supported shape is the max transpose shape that can be supported by
  // hardware that is less than or equal to required shape.
  auto supportedHeight = std::min(
What if the minimum is the user-supplied requiredShape[0], but it is not supported by HW?
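The clamping step being questioned can be sketched as follows; the bail-out for a requirement below the HW minimum is exactly the case a bare `std::min` would miss. All names here are hypothetical, for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hedged sketch: pick the largest HW-supported height that does not exceed
// the required height. Return nullopt when the requirement falls below the
// HW minimum, which std::min alone would not catch.
static std::optional<int> chooseSupportedHeight(int requiredHeight,
                                                int hwMinHeight,
                                                int hwMaxHeight) {
  if (requiredHeight < hwMinHeight)
    return std::nullopt; // unsupported: cannot round down into the HW range
  return std::min(requiredHeight, hwMaxHeight);
}
```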
  xegpu::LayoutAttr newLayout = xegpu::LayoutAttr::get(
      tdescType.getContext(),
      tdescType.getLayoutAttr().getLaneLayout().asArrayRef(), {1, 1});
  // Array length can not be larger than 1 for transpose case.
Is this 1 a uArch-specific parameter?
}

/// Helper to create a constant index value.
static Value createConstantIndex(ConversionPatternRewriter &rewriter,
The helper returns a `Value`, but then the code uses it as `...Op =`:
`auto constantOp = createConstantIndex(`
What was the motivation for such a short helper? Isn't there a `create` that returns a `Value` already?
    xegpu::LoadNdOp origLoadOp) {
  Location loc = data.getLoc();
  assert(offsets.size() >= 2 && "Expecting at least 2 offsets for 2D LoadNdOp");
  Value offsetX = convertToValue(rewriter, loc, offsets[offsets.size() - 2]);
The innermost dimension (the last one, linear) is usually along the X axis, and I do not see that _nd ops suggest otherwise. Is the "transpose" nature of the pass the reason to do it otherwise?
  return data;
}

/// Checks is a CreateNdDescOp can be optimized for transpose, if so creates a
nit
- /// Checks is a CreateNdDescOp can be optimized for transpose, if so creates a
+ /// Checks if a CreateNdDescOp can be optimized for transpose, if so creates a
  auto maybeConstInnerStride = getConstantIntValue(strides.back());
  // Only row-major memrefs are expected for now.
  if (!maybeConstInnerStride || *maybeConstInnerStride != 1)
    return failure();
The above comment seems good enough to be a failure message.
      rewriter, createNdOp.getLoc(), source);
  source = arith::IndexCastOp::create(
      rewriter, createNdOp.getLoc(),
      IntegerType::get(rewriter.getContext(), 64),
nit: rewriter.getI64Type()?
This pass rewrites certain xegpu `CreateNd` and `LoadNd` operations that feed into `vector.transpose` into a more optimal form to improve performance. Specifically, low-precision (bitwidth < 32) `LoadNd` ops that feed into transpose ops are rewritten to i32 loads with a valid transpose layout, so that later passes can use the hardware's load-with-transpose feature to accelerate such loads.
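The bit-width folding this description implies can be sketched with simple shape arithmetic: packing a low-precision 2D load into i32 elements shrinks the inner dimension by a factor of 32 / bitWidth. The helper name and struct are illustrative, not from the patch.

```cpp
#include <cassert>

// Hedged sketch of the i32 packing implied by the pass description: a
// low-precision load of shape [rows, cols] becomes an i32 load whose inner
// dimension is divided by the packing factor.
struct Shape2D {
  int rows, cols;
};

static Shape2D packedI32LoadShape(Shape2D shape, int bitWidth) {
  int factor = 32 / bitWidth; // e.g. 2 for 16-bit types, 4 for 8-bit types
  return {shape.rows, shape.cols / factor};
}
```

For example, a 32x16 f16 load would, under this sketch, be re-expressed as a 32x8 i32 load before the transpose.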