Skip to content

Conversation

@charithaintc
Copy link
Contributor

This pass rewrites certain xegpu CreateNd and LoadNd operations that feeds into vector.transpose to more optimal form to improve performance. Specifically, low precision (bitwidth < 32) LoadNd ops that feeds into transpose ops are rewritten to i32 loads with a valid transpose layout such that later passes can use the load with transpose HW feature to accelerate such load ops.

Copy link
Contributor

@akroviakov akroviakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some preliminary comments


/// Helper to get the size range of a 2D block that can be transposed by HW.
/// TODO: Use uArch to get supported block ranges.
static Allowed2DShapeRange getTransposableBlockRange(int bitWidth) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now that uArch is upstreamed, shouldn't this already be incorporated there somehow?

}

/// A layout can be optimized if its lane layout is transposed (lane[0] != 1 &&
/// lane[1] == 1), but inner lane data is not equal to [1, 1].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An illustrative example with shapes and layouts, and explanations of the benefit at the top of the cpp file would be helpful.

static xegpu::TensorDescType tryOptimize(xegpu::TensorDescType tdescType) {
if (!canBeOptimized(tdescType))
return tdescType;
auto laneData = getMaybeLaneData(tdescType).value();
Copy link
Contributor

@akroviakov akroviakov Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to laneData[1] if it does not have a value?

Type newElemTy = IntegerType::get(tdescType.getContext(), newBitWidth);
// Supported shape is the max transpose shape that can be supported by
// hardware that is less than or equal to required shape.
auto supportedHeight = std::min(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the minimum is the user-supplied requiredShape[0] , but it is not supported by HW?

xegpu::LayoutAttr newLayout = xegpu::LayoutAttr::get(
tdescType.getContext(),
tdescType.getLayoutAttr().getLaneLayout().asArrayRef(), {1, 1});
// Array length can not be larger than 1 for transpose case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this 1 a uArch-specific parameter?

}

/// Helper to create a constant index value.
static Value createConstantIndex(ConversionPatternRewriter &rewriter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper returns a value, but then the code uses it for ...Op = :

auto constantOp = createConstantIndex( 

What was the motivation for such a short helper? Isn't there a create that returns Value already?

xegpu::LoadNdOp origLoadOp) {
Location loc = data.getLoc();
assert(offsets.size() >= 2 && "Expecting at least 2 offsets for 2D LoadNdOp");
Value offsetX = convertToValue(rewriter, loc, offsets[offsets.size() - 2]);
Copy link
Contributor

@akroviakov akroviakov Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The innermost dimension (the last one, linear) is usually along the X axis, and I do not see that _nd ops suggest otherwise. Is the "transpose" nature of the pass the reason to do it otherwise?

return data;
}

/// Checks is a CreateNdDescOp can be optimized for transpose, if so creates a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
/// Checks is a CreateNdDescOp can be optimized for transpose, if so creates a
/// Checks if a CreateNdDescOp can be optimized for transpose, if so creates a
auto maybeConstInnerStride = getConstantIntValue(strides.back());
// Only row-major memrefs are expected for now.
if (!maybeConstInnerStride || *maybeConstInnerStride != 1)
return failure();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above comment seems good enough to be a failure message.

rewriter, createNdOp.getLoc(), source);
source = arith::IndexCastOp::create(
rewriter, createNdOp.getLoc(),
IntegerType::get(rewriter.getContext(), 64),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rewriter.getI64Type()?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants