
Conversation

fhahn (Contributor) commented Oct 12, 2025

When narrowing stores of a single-scalar, we currently use ExtractLastElement, which extracts the last element across all parts. This is not correct if the store's address is not uniform across all parts. If it is only uniform-per-part, the last lane per part must be extracted. Add a new ExtractLastLanePerPart opcode to handle this correctly. Most transforms apply to both ExtractLastElement and ExtractLastLanePerPart, with the only difference being their treatment during unrolling.

Fixes #162498.
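To make the per-part distinction concrete, here is a minimal standalone model of what the two opcodes compute (plain C++, not VPlan code; the VF/UF values and helper names are invented for illustration):

#include <cassert>
#include <vector>

// Model a vectorized value as UF unrolled "parts", each a vector of VF lanes.
using Part = std::vector<int>;

// ExtractLastElement: a single scalar, the last lane of the *last* part.
int extractLastElement(const std::vector<Part> &Parts) {
  return Parts.back().back();
}

// ExtractLastLanePerPart: one scalar per part, the last lane of *each* part.
std::vector<int> extractLastLanePerPart(const std::vector<Part> &Parts) {
  std::vector<int> Res;
  for (const Part &P : Parts)
    Res.push_back(P.back());
  return Res;
}

int main() {
  // VF=2, UF=2: lanes {0, 1} in part 0 and lanes {2, 3} in part 1.
  std::vector<Part> Parts = {{0, 1}, {2, 3}};
  assert(extractLastElement(Parts) == 3);
  // A store whose address is only uniform-per-part must store 1 for part 0
  // and 3 for part 1; storing 3 through both addresses is the miscompile.
  assert(extractLastLanePerPart(Parts) == (std::vector<int>{1, 3}));
  return 0;
}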

llvmbot (Member) commented Oct 12, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/163056.diff

7 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h (+6)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+9-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+21-7)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/replicating-load-store-costs.ll (+5-2)
  • (modified) llvm/test/Transforms/LoopVectorize/narrow-to-single-scalar.ll (+5-4)
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index fb696bea671af..fc2e7f81baa31 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1021,6 +1021,8 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
   // part if scalar. In the latter case, the recipe will be removed during
   // unrolling.
   ExtractLastElement,
+  // Extracts the last lane for each part from its operand.
+  ExtractLastLanePerPart,
   // Extracts the second-to-last lane from its operand or the second-to-last
   // part if it is scalar. In the latter case, the recipe will be removed
   // during unrolling.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 07bfe7a896d86..f413c63c6d14c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -116,6 +116,7 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
   case VPInstruction::FirstActiveLane:
     return Type::getIntNTy(Ctx, 64);
   case VPInstruction::ExtractLastElement:
+  case VPInstruction::ExtractLastLanePerPart:
   case VPInstruction::ExtractPenultimateElement: {
     Type *BaseTy = inferScalarType(R->getOperand(0));
     if (auto *VecTy = dyn_cast<VectorType>(BaseTy))
diff --git a/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h b/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
index 555efea1ea840..ef51d46c65c5c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
@@ -368,6 +368,12 @@
 m_ExtractLastElement(const Op0_t &Op0) {
   return m_VPInstruction<VPInstruction::ExtractLastElement>(Op0);
 }
 
+template <typename Op0_t>
+inline VPInstruction_match<VPInstruction::ExtractLastLanePerPart, Op0_t>
+m_ExtractLastLanePerPart(const Op0_t &Op0) {
+  return m_VPInstruction<VPInstruction::ExtractLastLanePerPart>(Op0);
+}
+
 template <typename Op0_t, typename Op1_t, typename Op2_t>
 inline VPInstruction_match<VPInstruction::ActiveLaneMask, Op0_t, Op1_t, Op2_t>
 m_ActiveLaneMask(const Op0_t &Op0, const Op1_t &Op1, const Op2_t &Op2) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 8e91677292788..585c84b938a71 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -511,6 +511,7 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
   case VPInstruction::CanonicalIVIncrementForPart:
   case VPInstruction::ExplicitVectorLength:
   case VPInstruction::ExtractLastElement:
+  case VPInstruction::ExtractLastLanePerPart:
   case VPInstruction::ExtractPenultimateElement:
   case VPInstruction::FirstActiveLane:
   case VPInstruction::Not:
@@ -878,9 +879,11 @@ Value *VPInstruction::generate(VPTransformState &State) {
     return ReducedPartRdx;
   }
 
+  case VPInstruction::ExtractLastLanePerPart:
   case VPInstruction::ExtractLastElement:
   case VPInstruction::ExtractPenultimateElement: {
-    unsigned Offset = getOpcode() == VPInstruction::ExtractLastElement ? 1 : 2;
+    unsigned Offset =
+        getOpcode() == VPInstruction::ExtractPenultimateElement ? 2 : 1;
     Value *Res;
     if (State.VF.isVector()) {
       assert(Offset <= State.VF.getKnownMinValue() &&
@@ -1166,6 +1169,7 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
 bool VPInstruction::isVectorToScalar() const {
   return getOpcode() == VPInstruction::ExtractLastElement ||
+         getOpcode() == VPInstruction::ExtractLastLanePerPart ||
          getOpcode() == VPInstruction::ExtractPenultimateElement ||
          getOpcode() == Instruction::ExtractElement ||
          getOpcode() == VPInstruction::ExtractLane ||
@@ -1229,6 +1233,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
   case VPInstruction::CanonicalIVIncrementForPart:
   case VPInstruction::ExtractLane:
   case VPInstruction::ExtractLastElement:
+  case VPInstruction::ExtractLastLanePerPart:
   case VPInstruction::ExtractPenultimateElement:
   case VPInstruction::ActiveLaneMask:
   case VPInstruction::FirstActiveLane:
@@ -1376,6 +1381,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
   case VPInstruction::ExtractLastElement:
     O << "extract-last-element";
     break;
+  case VPInstruction::ExtractLastLanePerPart:
+    O << "extract-last-lane-per-part";
+    break;
   case VPInstruction::ExtractPenultimateElement:
     O << "extract-penultimate-element";
     break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 7563cd719b19c..3bc7f58449e29 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1207,7 +1207,8 @@ static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) {
   }
 
   // Look through ExtractLastElement (BuildVector ....).
-  if (match(&R, m_ExtractLastElement(m_BuildVector()))) {
+  if (match(&R, m_CombineOr(m_ExtractLastElement(m_BuildVector()),
+                            m_ExtractLastLanePerPart(m_BuildVector())))) {
     auto *BuildVector = cast<VPInstruction>(R.getOperand(0));
     Def->replaceAllUsesWith(
         BuildVector->getOperand(BuildVector->getNumOperands() - 1));
@@ -1273,13 +1274,16 @@ static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) {
     return;
   }
 
-  if (match(Def, m_ExtractLastElement(m_Broadcast(m_VPValue(A))))) {
+  if (match(Def,
+            m_CombineOr(m_ExtractLastElement(m_Broadcast(m_VPValue(A))),
+                        m_ExtractLastLanePerPart(m_Broadcast(m_VPValue(A)))))) {
     Def->replaceAllUsesWith(A);
     return;
   }
 
-  if (match(Def,
-            m_VPInstruction<VPInstruction::ExtractLastElement>(m_VPValue(A))) &&
+  if (match(Def, m_CombineOr(m_VPInstruction<VPInstruction::ExtractLastElement>(
+                                 m_VPValue(A)),
+                             m_ExtractLastLanePerPart(m_VPValue(A)))) &&
       ((isa<VPInstruction>(A) && vputils::isSingleScalar(A)) ||
        (isa<VPReplicateRecipe>(A) &&
        cast<VPReplicateRecipe>(A)->isSingleScalar())) &&
@@ -1287,6 +1291,12 @@ static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) {
       [Def, A](VPUser *U) { return U->usesScalars(A) || Def == U; })) {
     return Def->replaceAllUsesWith(A);
   }
+
+  if (Plan->getUF() == 1 &&
+      match(Def, m_ExtractLastLanePerPart(m_VPValue(A)))) {
+    return Def->replaceAllUsesWith(
+        Builder.createNaryOp(VPInstruction::ExtractLastElement, {A}));
+  }
 }
 
 void VPlanTransforms::simplifyRecipes(VPlan &Plan) {
@@ -1324,8 +1334,11 @@ static void narrowToSingleScalarRecipes(VPlan &Plan) {
           RepOrWidenR->getUnderlyingInstr(), RepOrWidenR->operands(),
           true /*IsSingleScalar*/, nullptr /*Mask*/, *RepR /*Metadata*/);
       Clone->insertBefore(RepOrWidenR);
-      auto *Ext = new VPInstruction(VPInstruction::ExtractLastElement,
-                                    {Clone->getOperand(0)});
+      unsigned ExtractOpc =
+          vputils::isUniformAcrossVFsAndUFs(RepR->getOperand(1))
+              ? VPInstruction::ExtractLastElement
+              : VPInstruction::ExtractLastLanePerPart;
+      auto *Ext = new VPInstruction(ExtractOpc, {Clone->getOperand(0)});
       Ext->insertBefore(Clone);
       Clone->setOperand(0, Ext);
       RepR->eraseFromParent();
@@ -1339,7 +1352,8 @@ static void narrowToSingleScalarRecipes(VPlan &Plan) {
         !all_of(RepOrWidenR->users(), [RepOrWidenR](const VPUser *U) {
           return U->usesScalars(RepOrWidenR) ||
                  match(cast<VPRecipeBase>(U),
-                       m_ExtractLastElement(m_VPValue()));
+                       m_CombineOr(m_ExtractLastElement(m_VPValue()),
+                                   m_ExtractLastLanePerPart(m_VPValue())));
         }))
       continue;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/replicating-load-store-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/replicating-load-store-costs.ll
index ab9b48fb68f6b..aff2c4cb644eb 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/replicating-load-store-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/replicating-load-store-costs.ll
@@ -153,17 +153,20 @@ define void @uniform_gep_for_replicating_gep(ptr %dst) {
 ; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <2 x i32> [ <i32 0, i32 1>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <2 x i32> [[VEC_IND]], splat (i32 2)
 ; CHECK-NEXT:    [[TMP2:%.*]] = add i32 [[INDEX]], 2
-; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq <2 x i32> [[STEP_ADD]], zeroinitializer
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq <2 x i32> [[VEC_IND]], zeroinitializer
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq <2 x i32> [[STEP_ADD]], zeroinitializer
 ; CHECK-NEXT:    [[TMP8:%.*]] = lshr i32 [[INDEX]], 1
 ; CHECK-NEXT:    [[TMP9:%.*]] = lshr i32 [[TMP2]], 1
 ; CHECK-NEXT:    [[TMP11:%.*]] = zext <2 x i1> [[TMP5]] to <2 x i8>
+; CHECK-NEXT:    [[TMP6:%.*]] = zext <2 x i1> [[TMP3]] to <2 x i8>
 ; CHECK-NEXT:    [[TMP14:%.*]] = zext i32 [[TMP8]] to i64
 ; CHECK-NEXT:    [[TMP15:%.*]] = zext i32 [[TMP9]] to i64
 ; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr i64, ptr [[DST]], i64 [[TMP14]]
 ; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr i64, ptr [[DST]], i64 [[TMP15]]
 ; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <2 x i8> [[TMP11]], i32 1
+; CHECK-NEXT:    [[TMP12:%.*]] = extractelement <2 x i8> [[TMP6]], i32 1
 ; CHECK-NEXT:    store i8 [[TMP22]], ptr [[TMP18]], align 1
-; CHECK-NEXT:    store i8 [[TMP22]], ptr [[TMP19]], align 1
+; CHECK-NEXT:    store i8 [[TMP12]], ptr [[TMP19]], align 1
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
 ; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <2 x i32> [[STEP_ADD]], splat (i32 2)
 ; CHECK-NEXT:    [[TMP24:%.*]] = icmp eq i32 [[INDEX_NEXT]], 128
diff --git a/llvm/test/Transforms/LoopVectorize/narrow-to-single-scalar.ll b/llvm/test/Transforms/LoopVectorize/narrow-to-single-scalar.ll
index 1533906247739..53dad3a74482c 100644
--- a/llvm/test/Transforms/LoopVectorize/narrow-to-single-scalar.ll
+++ b/llvm/test/Transforms/LoopVectorize/narrow-to-single-scalar.ll
@@ -74,8 +74,7 @@ exit:
   ret void
 }
 
-; FIXME: Currently this mis-compiled when interleaving; all stores store the
-; last lane of the last part, instead of the last lane per part.
+; Check each unrolled store stores the last lane of the corresponding part.
 ; Test case for https://github.com/llvm/llvm-project/issues/162498.
 define void @narrow_to_single_scalar_store_address_not_uniform_across_all_parts(ptr %dst) {
 ; VF4IC1-LABEL: define void @narrow_to_single_scalar_store_address_not_uniform_across_all_parts(
@@ -121,13 +120,15 @@ define void @narrow_to_single_scalar_store_address_not_uniform_across_all_parts(
 ; VF2IC2-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; VF2IC2:       [[VECTOR_BODY]]:
 ; VF2IC2-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; VF2IC2-NEXT:    [[TMP7:%.*]] = add i32 [[INDEX]], 0
+; VF2IC2-NEXT:    [[TMP8:%.*]] = add i32 [[INDEX]], 1
 ; VF2IC2-NEXT:    [[TMP0:%.*]] = add i32 [[INDEX]], 2
 ; VF2IC2-NEXT:    [[TMP1:%.*]] = add i32 [[INDEX]], 3
-; VF2IC2-NEXT:    [[TMP2:%.*]] = lshr i32 [[INDEX]], 1
+; VF2IC2-NEXT:    [[TMP2:%.*]] = lshr i32 [[TMP7]], 1
 ; VF2IC2-NEXT:    [[TMP3:%.*]] = lshr i32 [[TMP0]], 1
 ; VF2IC2-NEXT:    [[TMP4:%.*]] = getelementptr i32, ptr [[DST]], i32 [[TMP2]]
 ; VF2IC2-NEXT:    [[TMP5:%.*]] = getelementptr i32, ptr [[DST]], i32 [[TMP3]]
-; VF2IC2-NEXT:    store i32 [[TMP1]], ptr [[TMP4]], align 4
+; VF2IC2-NEXT:    store i32 [[TMP8]], ptr [[TMP4]], align 4
 ; VF2IC2-NEXT:    store i32 [[TMP1]], ptr [[TMP5]], align 4
 ; VF2IC2-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
 ; VF2IC2-NEXT:    [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
david-arm (Contributor) left a comment

LGTM!


llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:

-  if (match(Def,
-            m_VPInstruction<VPInstruction::ExtractLastElement>(m_VPValue(A))) &&
+  if (match(Def, m_CombineOr(m_VPInstruction<VPInstruction::ExtractLastElement>(
+                                 m_VPValue(A)),
+                             m_ExtractLastLanePerPart(m_VPValue(A)))) &&
david-arm (Contributor):

nit: Since you're here, perhaps worth changing m_VPInstruction<VPInstruction::ExtractLastElement> to m_ExtractLastElement as well?
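Applied to this hunk, the suggested form would read roughly as follows (a sketch assembled from the PR's diff; the all_of tail is reconstructed from the surrounding context, so see the merged change for the exact code):

// Sketch of the simplifyRecipe condition using the m_ExtractLastElement
// helper from VPlanPatternMatch.h instead of the spelled-out form.
if (match(Def, m_CombineOr(m_ExtractLastElement(m_VPValue(A)),
                           m_ExtractLastLanePerPart(m_VPValue(A)))) &&
    ((isa<VPInstruction>(A) && vputils::isSingleScalar(A)) ||
     (isa<VPReplicateRecipe>(A) &&
      cast<VPReplicateRecipe>(A)->isSingleScalar())) &&
    all_of(A->users(),
           [Def, A](VPUser *U) { return U->usesScalars(A) || Def == U; })) {
  return Def->replaceAllUsesWith(A);
}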

fhahn (Contributor, Author):

Yep updated thanks

ayalz (Collaborator) left a comment

Post-approval follow-up/alternative thoughts.

Comment on lines 1021 to +1025 of llvm/lib/Transforms/Vectorize/VPlan.h:

   // part if scalar. In the latter case, the recipe will be removed during
   // unrolling.
   ExtractLastElement,
+  // Extracts the last lane for each part from its operand.
+  ExtractLastLanePerPart,
ayalz (Collaborator):

Given ExtractLastLanePerPart, which produces a scalar per part, should an ExtractLastPart opcode also be added? Together, the two could then replace ExtractLastElement. Perhaps it would suffice to call the former ExtractLastLane(*), producing a scalar from each vector it is given, orthogonal to how many parts/vectors there are. I.e., ExtractLastElement could then be replaced by ExtractLastPart if its operand is scalar, or else by ExtractLastPart(ExtractLastLane) or ExtractLastLane(ExtractLastPart) if it is a vector.

(*) ExtractLane seems to perform a similar double extraction, which could also be simplified by combining it with some "de-splat" reduction of a uniform-across-UF value that provides a(ny) single part.
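A rough standalone model of the proposed decomposition, using the same parts-by-lanes representation as the sketch in the description above (ExtractLastPart and ExtractLastLane are the names proposed here, not existing VPlan opcodes):

#include <cassert>
#include <vector>

using Part = std::vector<int>;

// Proposed ExtractLastPart: reduces UF parts to the single last part.
Part extractLastPart(const std::vector<Part> &Parts) { return Parts.back(); }

// Proposed ExtractLastLane: the last lane of one vector (part).
int extractLastLane(const Part &P) { return P.back(); }

int main() {
  std::vector<Part> Parts = {{0, 1}, {2, 3}};
  // For vector operands, ExtractLastElement would decompose as
  // ExtractLastLane(ExtractLastPart(v)).
  assert(extractLastLane(extractLastPart(Parts)) == 3);
  return 0;
}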

fhahn (Contributor, Author):

Ah yes, I will check that as follow-up, thanks!

fhahn merged commit 7f54fcc into llvm:main Oct 15, 2025
10 checks passed
fhahn deleted the vplan-extract-last-element-per-part branch October 15, 2025 12:46
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Oct 15, 2025
…calar. (#163056) (PR: llvm/llvm-project#163056)