[LV] Convert gather loads with invariant stride into strided loads #147297
Conversation
This is a port of the approach used in RISCVGatherScatterLowering, implemented entirely in VPlan. If you'd prefer to switch to LLVM IR-based analysis using SCEV, please let me know. That said, the current patch faces an awkward situation: none of the existing strided access patterns in our lit tests are converted to strided accesses, because the cost returned by TTI.getStridedMemoryOpCost is higher than that of a gather. As a result, I'm marking this patch as a draft for now, until we have test cases that can demonstrate the intended functionality.
✅ With the latest revision this PR passed the C/C++ code formatter.
+1 on this approach of doing it as a VPlan transformation. That way we don't need to work with the legacy cost model. I presume we will need to handle VPWidenStridedLoadRecipes in planContainsAdditionalSimplifications and skip the assertion?
Yeah, I think the current RISCVTTIImpl::getStridedMemoryOpCost is too expensive. It should definitely be at least cheaper than RISCVTTIImpl::getGatherScatterOpCost, and at the moment the cost model seems to return the same for both. I know the spacemit-x60 doesn't have fast vlse, but I don't think they're as slow as vluxei. Do we have benchmark results for these? cc @mikhailramalho
I have some data on C908 for strided load (which is weird): camel-cdr/rvv-bench#12. The result may hold for spacemit-x60. |
Another approach is to use SCEV to get the base and stride; both could then be handled within VPlanTransforms without relying on the legacy cost model (see the sketch below).
I've previously forced the cost model to always return profitable in order to trigger the transformation, and didn't encounter any assertions. I plan to use the same method to test against the llvm-test-suite and see if any issues come up.
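For context, this is roughly what an LLVM IR-level SCEV analysis would look like. It is a minimal sketch under the assumption that ScalarEvolution and the target loop are available; it is not code from this patch, which matches the pattern on VPlan recipes instead.

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/IR/Value.h"
#include <optional>
#include <utility>

using namespace llvm;

// Recover {base, stride} for a load's pointer operand via SCEV.
static std::optional<std::pair<const SCEV *, const SCEV *>>
getBaseAndStride(ScalarEvolution &SE, Value *Ptr, const Loop *L) {
  const SCEV *PtrSCEV = SE.getSCEV(Ptr);
  // A strided access shows up as an affine add-recurrence {Base,+,Step}<L>.
  const auto *AR = dyn_cast<SCEVAddRecExpr>(PtrSCEV);
  if (!AR || AR->getLoop() != L || !AR->isAffine())
    return std::nullopt;
  const SCEV *Base = AR->getStart();
  const SCEV *Step = AR->getStepRecurrence(SE);
  // Only a loop-invariant (not necessarily constant) stride is useful here.
  if (!SE.isLoopInvariant(Step, L))
    return std::nullopt;
  return std::make_pair(Base, Step);
}
```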
@lukel97 @wangpc-pp
@lukel97 I think you're right: planContainsAdditionalSimplifications will need to be changed. It's just that right now the costs for gather and strided load are exactly the same, so it works fine no matter how I run it. Once the costs are corrected, this issue will become apparent.
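For reference, the shape of the cost comparison under discussion, as a minimal sketch. It mirrors the TTI calls this patch makes in VPWidenMemoryRecipe::computeCost, but the helper name and the simplified pointer-type handling are assumptions of this sketch (the real check vectorizes the types by VF first); it is not code from the patch.

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/InstructionCost.h"

using namespace llvm;

// Would a strided load be cheaper than the gather we would otherwise emit?
static bool preferStridedLoad(const TargetTransformInfo &TTI, Type *DataTy,
                              Value *Ptr, bool IsMasked, Align Alignment,
                              TargetTransformInfo::TargetCostKind CostKind,
                              const Instruction *I) {
  InstructionCost GatherCost =
      TTI.getAddressComputationCost(Ptr->getType(), /*SE=*/nullptr,
                                    /*Ptr=*/nullptr, CostKind) +
      TTI.getGatherScatterOpCost(Instruction::Load, DataTy, Ptr, IsMasked,
                                 Alignment, CostKind, I);
  InstructionCost StridedCost = TTI.getStridedMemoryOpCost(
      Instruction::Load, DataTy, Ptr, IsMasked, Alignment, CostKind, I);
  // Per the discussion above, on RISC-V both hooks currently return the same
  // value, so this comparison never favors the strided load in the lit tests.
  return StridedCost < GatherCost;
}
```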
@llvm/pr-subscribers-llvm-transforms

Author: Mel Chen (Mel-Chen)

Changes

This patch detects strided memory accesses, i.e. loads whose addresses advance by a loop-invariant stride on each iteration (a minimal example is sketched below), and converts widened non-consecutive loads (i.e. gathers) into strided loads when profitable and legal. This enables more efficient code generation for targets like RISC-V that support strided loads natively.

Patch is 73.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147297.diff

14 Files Affected:
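A loop of the following shape is the kind of access being targeted. This is an illustrative sketch, not the concrete example from the original description:

```cpp
// B[i * S] advances by a loop-invariant stride of S elements per iteration.
// Without this patch the vectorizer emits a gather for it; with it, the
// access can become a strided load (e.g. vlse on RISC-V).
void strided_copy(int *A, const int *B, int N, int S) {
  for (int i = 0; i < N; ++i)
    A[i] = B[i * S];
}
```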
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index c4110582da1ef..528e6abc78aee 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -4073,7 +4073,7 @@ void LoopVectorizationPlanner::emitInvalidCostRemarks( [](const auto *R) { return Instruction::Select; }) .Case<VPWidenStoreRecipe>( [](const auto *R) { return Instruction::Store; }) - .Case<VPWidenLoadRecipe>( + .Case<VPWidenLoadRecipe, VPWidenStridedLoadRecipe>( [](const auto *R) { return Instruction::Load; }) .Case<VPWidenCallRecipe, VPWidenIntrinsicRecipe>( [](const auto *R) { return Instruction::Call; }) @@ -4172,6 +4172,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF, case VPDef::VPWidenPointerInductionSC: case VPDef::VPReductionPHISC: case VPDef::VPInterleaveSC: + case VPDef::VPWidenStridedLoadSC: case VPDef::VPWidenLoadEVLSC: case VPDef::VPWidenLoadSC: case VPDef::VPWidenStoreEVLSC: @@ -7732,7 +7733,10 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands, new VPVectorEndPointerRecipe(Ptr, &Plan.getVF(), getLoadStoreType(I), /*Stride*/ -1, Flags, I->getDebugLoc()); } else { - VectorPtr = new VPVectorPointerRecipe(Ptr, getLoadStoreType(I), + const DataLayout &DL = I->getDataLayout(); + auto *StrideTy = DL.getIndexType(Ptr->getUnderlyingValue()->getType()); + VPValue *StrideOne = Plan.getOrAddLiveIn(ConstantInt::get(StrideTy, 1)); + VectorPtr = new VPVectorPointerRecipe(Ptr, getLoadStoreType(I), StrideOne, GEP ? GEP->getNoWrapFlags() : GEPNoWrapFlags::none(), I->getDebugLoc()); @@ -8832,19 +8836,14 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes( *Plan)) return nullptr; + VPCostContext CostCtx(CM.TTI, *CM.TLI, *Plan, CM, CM.CostKind); // Transform recipes to abstract recipes if it is legal and beneficial and // clamp the range for better cost estimation. // TODO: Enable following transform when the EVL-version of extended-reduction // and mulacc-reduction are implemented. - if (!CM.foldTailWithEVL()) { - VPCostContext CostCtx(CM.TTI, *CM.TLI, *Plan, CM, CM.CostKind); + if (!CM.foldTailWithEVL()) VPlanTransforms::runPass(VPlanTransforms::convertToAbstractRecipes, *Plan, CostCtx, Range); - } - - for (ElementCount VF : Range) - Plan->addVF(VF); - Plan->setName("Initial VPlan"); // Interleave memory: for each Interleave Group we marked earlier as relevant // for this VPlan, replace the Recipes widening its memory instructions with a @@ -8853,6 +8852,15 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes( InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed()); + // Convert memory recipes to strided access recipes if the strided access is + // legal and profitable. + VPlanTransforms::runPass(VPlanTransforms::convertToStridedAccesses, *Plan, + CostCtx, Range); + + for (ElementCount VF : Range) + Plan->addVF(VF); + Plan->setName("Initial VPlan"); + // Replace VPValues for known constant strides guaranteed by predicate scalar // evolution. 
auto CanUseVersionedStride = [&Plan](VPUser &U, unsigned) { diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h index 6814dc5de6716..3c3af87cc0f6e 100644 --- a/llvm/lib/Transforms/Vectorize/VPlan.h +++ b/llvm/lib/Transforms/Vectorize/VPlan.h @@ -559,6 +559,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue { case VPRecipeBase::VPBranchOnMaskSC: case VPRecipeBase::VPInterleaveSC: case VPRecipeBase::VPIRInstructionSC: + case VPRecipeBase::VPWidenStridedLoadSC: case VPRecipeBase::VPWidenLoadEVLSC: case VPRecipeBase::VPWidenLoadSC: case VPRecipeBase::VPWidenStoreEVLSC: @@ -1746,10 +1747,6 @@ struct LLVM_ABI_FOR_TEST VPWidenSelectRecipe : public VPRecipeWithIRFlags, /// A recipe for handling GEP instructions. class LLVM_ABI_FOR_TEST VPWidenGEPRecipe : public VPRecipeWithIRFlags { - bool isPointerLoopInvariant() const { - return getOperand(0)->isDefinedOutsideLoopRegions(); - } - bool isIndexLoopInvariant(unsigned I) const { return getOperand(I + 1)->isDefinedOutsideLoopRegions(); } @@ -1778,6 +1775,30 @@ class LLVM_ABI_FOR_TEST VPWidenGEPRecipe : public VPRecipeWithIRFlags { VP_CLASSOF_IMPL(VPDef::VPWidenGEPSC) + bool isPointerLoopInvariant() const { + return getOperand(0)->isDefinedOutsideLoopRegions(); + } + + std::optional<unsigned> getUniqueVariantIndex() const { + std::optional<unsigned> VarIdx; + for (unsigned I = 0, E = getNumOperands() - 1; I < E; ++I) { + if (isIndexLoopInvariant(I)) + continue; + + if (VarIdx) + return std::nullopt; + VarIdx = I; + } + return VarIdx; + } + + Type *getIndexedType(unsigned I) const { + auto *GEP = cast<GetElementPtrInst>(getUnderlyingInstr()); + Type *SourceElementType = GEP->getSourceElementType(); + SmallVector<Value *, 4> Ops(GEP->idx_begin(), GEP->idx_begin() + I); + return GetElementPtrInst::getIndexedType(SourceElementType, Ops); + } + /// Generate the gep nodes. void execute(VPTransformState &State) override; @@ -1866,20 +1887,23 @@ class VPVectorEndPointerRecipe : public VPRecipeWithIRFlags, #endif }; -/// A recipe to compute the pointers for widened memory accesses of IndexTy. +/// A recipe to compute the pointers for widened memory accesses of IndexedTy, +/// with the Stride expressed in units of IndexedTy. 
class VPVectorPointerRecipe : public VPRecipeWithIRFlags, - public VPUnrollPartAccessor<1> { + public VPUnrollPartAccessor<2> { Type *IndexedTy; public: - VPVectorPointerRecipe(VPValue *Ptr, Type *IndexedTy, GEPNoWrapFlags GEPFlags, - DebugLoc DL) - : VPRecipeWithIRFlags(VPDef::VPVectorPointerSC, ArrayRef<VPValue *>(Ptr), - GEPFlags, DL), + VPVectorPointerRecipe(VPValue *Ptr, Type *IndexedTy, VPValue *Stride, + GEPNoWrapFlags GEPFlags, DebugLoc DL) + : VPRecipeWithIRFlags(VPDef::VPVectorPointerSC, + ArrayRef<VPValue *>({Ptr, Stride}), GEPFlags, DL), IndexedTy(IndexedTy) {} VP_CLASSOF_IMPL(VPDef::VPVectorPointerSC) + VPValue *getStride() const { return getOperand(1); } + void execute(VPTransformState &State) override; bool onlyFirstLaneUsed(const VPValue *Op) const override { @@ -1897,7 +1921,7 @@ class VPVectorPointerRecipe : public VPRecipeWithIRFlags, } VPVectorPointerRecipe *clone() override { - return new VPVectorPointerRecipe(getOperand(0), IndexedTy, + return new VPVectorPointerRecipe(getOperand(0), IndexedTy, getStride(), getGEPNoWrapFlags(), getDebugLoc()); } @@ -3040,7 +3064,8 @@ class LLVM_ABI_FOR_TEST VPWidenMemoryRecipe : public VPRecipeBase, return R->getVPDefID() == VPRecipeBase::VPWidenLoadSC || R->getVPDefID() == VPRecipeBase::VPWidenStoreSC || R->getVPDefID() == VPRecipeBase::VPWidenLoadEVLSC || - R->getVPDefID() == VPRecipeBase::VPWidenStoreEVLSC; + R->getVPDefID() == VPRecipeBase::VPWidenStoreEVLSC || + R->getVPDefID() == VPRecipeBase::VPWidenStridedLoadSC; } static inline bool classof(const VPUser *U) { @@ -3161,6 +3186,52 @@ struct VPWidenLoadEVLRecipe final : public VPWidenMemoryRecipe, public VPValue { } }; +/// A recipe for strided load operations, using the base address, stride, and an +/// optional mask. This recipe will generate an vp.strided.load intrinsic call +/// to represent memory accesses with a fixed stride. +struct VPWidenStridedLoadRecipe final : public VPWidenMemoryRecipe, + public VPValue { + VPWidenStridedLoadRecipe(LoadInst &Load, VPValue *Addr, VPValue *Stride, + VPValue *VF, VPValue *Mask, + const VPIRMetadata &Metadata, DebugLoc DL) + : VPWidenMemoryRecipe( + VPDef::VPWidenStridedLoadSC, Load, {Addr, Stride, VF}, + /*Consecutive=*/false, /*Reverse=*/false, Metadata, DL), + VPValue(this, &Load) { + setMask(Mask); + } + + VPWidenStridedLoadRecipe *clone() override { + return new VPWidenStridedLoadRecipe(cast<LoadInst>(Ingredient), getAddr(), + getStride(), getVF(), getMask(), *this, + getDebugLoc()); + } + + VP_CLASSOF_IMPL(VPDef::VPWidenStridedLoadSC); + + /// Return the stride operand. + VPValue *getStride() const { return getOperand(1); } + + /// Return the VF operand. + VPValue *getVF() const { return getOperand(2); } + + /// Generate a strided load. + void execute(VPTransformState &State) override; + +#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent, + VPSlotTracker &SlotTracker) const override; +#endif + + /// Returns true if the recipe only uses the first lane of operand \p Op. + bool onlyFirstLaneUsed(const VPValue *Op) const override { + assert(is_contained(operands(), Op) && + "Op must be an operand of the recipe"); + return Op == getAddr() || Op == getStride() || Op == getVF(); + } +}; + /// A recipe for widening store operations, using the stored value, the address /// to store to and an optional mask. 
struct LLVM_ABI_FOR_TEST VPWidenStoreRecipe final : public VPWidenMemoryRecipe { diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp index 6c9a78add6822..36760feb1cb14 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp @@ -187,8 +187,10 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenCallRecipe *R) { } Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenMemoryRecipe *R) { - assert((isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe>(R)) && - "Store recipes should not define any values"); + assert( + (isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe, VPWidenStridedLoadRecipe>( + R)) && + "Store recipes should not define any values"); return cast<LoadInst>(&R->getIngredient())->getType(); } diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp index a92540f457c73..90c05b7b8076c 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp @@ -81,6 +81,7 @@ bool VPRecipeBase::mayWriteToMemory() const { case VPWidenCastSC: case VPWidenGEPSC: case VPWidenIntOrFpInductionSC: + case VPWidenStridedLoadSC: case VPWidenLoadEVLSC: case VPWidenLoadSC: case VPWidenPHISC: @@ -104,6 +105,7 @@ bool VPRecipeBase::mayReadFromMemory() const { return cast<VPExpressionRecipe>(this)->mayReadOrWriteMemory(); case VPInstructionSC: return cast<VPInstruction>(this)->opcodeMayReadOrWriteFromMemory(); + case VPWidenStridedLoadSC: case VPWidenLoadEVLSC: case VPWidenLoadSC: return true; @@ -185,6 +187,7 @@ bool VPRecipeBase::mayHaveSideEffects() const { } case VPInterleaveSC: return mayWriteToMemory(); + case VPWidenStridedLoadSC: case VPWidenLoadEVLSC: case VPWidenLoadSC: case VPWidenStoreEVLSC: @@ -2512,13 +2515,22 @@ void VPVectorEndPointerRecipe::print(raw_ostream &O, const Twine &Indent, void VPVectorPointerRecipe::execute(VPTransformState &State) { auto &Builder = State.Builder; unsigned CurrentPart = getUnrollPart(*this); - Type *IndexTy = getGEPIndexTy(State.VF.isScalable(), /*IsReverse*/ false, - /*IsUnitStride*/ true, CurrentPart, Builder); + Value *Stride = State.get(getStride(), /*IsScalar*/ true); + + auto *StrideC = dyn_cast<ConstantInt>(Stride); + bool IsStrideOne = StrideC && StrideC->isOne(); + bool IsUnitStride = IsStrideOne || (StrideC && StrideC->isMinusOne()); + Type *IndexTy = + getGEPIndexTy(State.VF.isScalable(), + /*IsReverse*/ false, IsUnitStride, CurrentPart, Builder); Value *Ptr = State.get(getOperand(0), VPLane(0)); + Stride = Builder.CreateSExtOrTrunc(Stride, IndexTy); Value *Increment = createStepForVF(Builder, IndexTy, State.VF, CurrentPart); + Value *Index = IsStrideOne ? Increment : Builder.CreateMul(Increment, Stride); + Value *ResultPtr = - Builder.CreateGEP(IndexedTy, Ptr, Increment, "", getGEPNoWrapFlags()); + Builder.CreateGEP(IndexedTy, Ptr, Index, "", getGEPNoWrapFlags()); State.set(this, ResultPtr, /*IsScalar*/ true); } @@ -3170,9 +3182,11 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF, const Align Alignment = getLoadStoreAlignment(&Ingredient); unsigned AS = cast<PointerType>(Ctx.Types.inferScalarType(getAddr())) ->getAddressSpace(); - unsigned Opcode = isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe>(this) - ? Instruction::Load - : Instruction::Store; + unsigned Opcode = + isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe, VPWidenStridedLoadRecipe>( + this) + ? 
Instruction::Load + : Instruction::Store; if (!Consecutive) { // TODO: Using the original IR may not be accurate. @@ -3182,6 +3196,11 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF, Type *PtrTy = toVectorTy(Ptr->getType(), VF); assert(!Reverse && "Inconsecutive memory access should not have the order."); + + if (isa<VPWidenStridedLoadRecipe>(this)) + return Ctx.TTI.getStridedMemoryOpCost( + Opcode, Ty, Ptr, IsMasked, Alignment, Ctx.CostKind, &Ingredient); + return Ctx.TTI.getAddressComputationCost(PtrTy, nullptr, nullptr, Ctx.CostKind) + Ctx.TTI.getGatherScatterOpCost(Opcode, Ty, Ptr, IsMasked, Alignment, @@ -3331,6 +3350,47 @@ void VPWidenLoadEVLRecipe::print(raw_ostream &O, const Twine &Indent, } #endif +void VPWidenStridedLoadRecipe::execute(VPTransformState &State) { + Type *ScalarDataTy = getLoadStoreType(&Ingredient); + auto *DataTy = VectorType::get(ScalarDataTy, State.VF); + const Align Alignment = getLoadStoreAlignment(&Ingredient); + + auto &Builder = State.Builder; + Value *Addr = State.get(getAddr(), /*IsScalar*/ true); + Value *StrideInBytes = State.get(getStride(), /*IsScalar*/ true); + Value *Mask = nullptr; + if (VPValue *VPMask = getMask()) + Mask = State.get(VPMask); + else + Mask = Builder.CreateVectorSplat(State.VF, Builder.getTrue()); + Value *RunTimeVF = Builder.CreateZExtOrTrunc(State.get(getVF(), VPLane(0)), + Builder.getInt32Ty()); + + auto *PtrTy = Addr->getType(); + auto *StrideTy = StrideInBytes->getType(); + CallInst *NewLI = Builder.CreateIntrinsic( + Intrinsic::experimental_vp_strided_load, {DataTy, PtrTy, StrideTy}, + {Addr, StrideInBytes, Mask, RunTimeVF}, nullptr, "wide.strided.load"); + NewLI->addParamAttr( + 0, Attribute::getWithAlignment(NewLI->getContext(), Alignment)); + applyMetadata(*NewLI); + State.set(this, NewLI); +} + +#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) +void VPWidenStridedLoadRecipe::print(raw_ostream &O, const Twine &Indent, + VPSlotTracker &SlotTracker) const { + O << Indent << "WIDEN "; + printAsOperand(O, SlotTracker); + O << " = load "; + getAddr()->printAsOperand(O, SlotTracker); + O << ", stride = "; + getStride()->printAsOperand(O, SlotTracker); + O << ", runtimeVF = "; + getVF()->printAsOperand(O, SlotTracker); +} +#endif + void VPWidenStoreRecipe::execute(VPTransformState &State) { VPValue *StoredVPValue = getStoredValue(); bool CreateScatter = !isConsecutive(); diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp index 677d97e54d556..cf9b1c4199449 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp @@ -2203,6 +2203,12 @@ static VPRecipeBase *optimizeMaskToEVL(VPValue *HeaderMask, VPValue *NewAddr = GetNewAddr(L->getAddr()); return new VPWidenLoadEVLRecipe(*L, NewAddr, EVL, NewMask); }) + .Case<VPWidenStridedLoadRecipe>([&](VPWidenStridedLoadRecipe *L) { + VPValue *NewMask = GetNewMask(L->getMask()); + return new VPWidenStridedLoadRecipe( + *cast<LoadInst>(&L->getIngredient()), L->getAddr(), L->getStride(), + &EVL, NewMask, *L, L->getDebugLoc()); + }) .Case<VPWidenStoreRecipe>([&](VPWidenStoreRecipe *S) { VPValue *NewMask = GetNewMask(S->getMask()); VPValue *NewAddr = GetNewAddr(S->getAddr()); @@ -2237,10 +2243,12 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) { VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion(); VPBasicBlock *Header = LoopRegion->getEntryBasicBlock(); - assert(all_of(Plan.getVF().users(), - IsaPred<VPVectorEndPointerRecipe, 
VPScalarIVStepsRecipe, - VPWidenIntOrFpInductionRecipe>) && - "User of VF that we can't transform to EVL."); + assert( + all_of( + Plan.getVF().users(), + IsaPred<VPVectorEndPointerRecipe, VPScalarIVStepsRecipe, + VPWidenIntOrFpInductionRecipe, VPWidenStridedLoadRecipe>) && + "User of VF that we can't transform to EVL."); Plan.getVF().replaceUsesWithIf(&EVL, [](VPUser &U, unsigned Idx) { return isa<VPWidenIntOrFpInductionRecipe, VPScalarIVStepsRecipe>(U); }); @@ -2337,7 +2345,8 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) { assert(NumDefVal <= 1 && "Only supports recipes with a single definition or without users."); EVLRecipe->insertBefore(CurRecipe); - if (isa<VPSingleDefRecipe, VPWidenLoadEVLRecipe>(EVLRecipe)) { + if (isa<VPSingleDefRecipe, VPWidenLoadEVLRecipe, VPWidenStridedLoadRecipe>( + EVLRecipe)) { VPValue *CurVPV = CurRecipe->getVPSingleValue(); CurVPV->replaceAllUsesWith(EVLRecipe->getVPSingleValue()); } @@ -3767,3 +3776,182 @@ void VPlanTransforms::addBranchWeightToMiddleTerminator( MDB.createBranchWeights({1, VectorStep - 1}, /*IsExpected=*/false); MiddleTerm->addMetadata(LLVMContext::MD_prof, BranchWeights); } + +static std::pair<VPValue *, VPValue *> matchStridedStart(VPValue *CurIndex) { + if (auto *WidenIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(CurIndex)) + return {WidenIV, WidenIV->getStepValue()}; + + auto *WidenR = dyn_cast<VPWidenRecipe>(CurIndex); + if (!WidenR || !CurIndex->getUnderlyingValue()) + return {nullptr, nullptr}; + + unsigned Opcode = WidenR->getOpcode(); + // TODO: Support Instruction::Add and Instruction::Or. + if (Opcode != Instruction::Shl && Opcode != Instruction::Mul) + return {nullptr, nullptr}; + + // Match the pattern binop(variant, invariant), or binop(invariant, variant) + // if the binary operator is commutative. + bool IsLHSUniform = vputils::isSingleScalar(WidenR->getOperand(0)); + if (IsLHSUniform == vputils::isSingleScalar(WidenR->getOperand(1)) || + (IsLHSUniform && !Instruction::isCommutative(Opcode))) + return {nullptr, nullptr}; + unsigned VarIdx = IsLHSUniform ? 1 : 0; + + auto [Start, Stride] = matchStridedStart(WidenR->getOperand(VarIdx)); + if (!Start) + return {nullptr, nullptr}; + + SmallVector<VPValue *> StartOps(WidenR->operands()); + StartOps[VarIdx] = Start; + auto *StartR = new VPReplicateRecipe(WidenR->getUnderlyingInstr(), StartOps, + /*IsUniform*/ true); + StartR->insertBefore(WidenR); + + unsigned InvIdx = VarIdx == 0 ? 1 : 0; + auto *StrideR = + new VPInstruction(Opcode, {Stride,... [truncated] |
| Ping. Passed the llvm-testsuite with "-O3 -mcpu=sifive-x280". Please take a look if possible. |
| Rebased, and ping |
| return VarIdx; | ||
| } | ||
| | ||
| Type *getIndexedType(unsigned I) const { |
would be good to document
| } | ||
| | ||
| static std::tuple<VPValue *, VPValue *, Type *> | ||
| determineBaseAndStride(VPWidenGEPRecipe *WidenGEP) { |
would help to document
| /// to represent memory accesses with a fixed stride. | ||
| struct VPWidenStridedLoadRecipe final : public VPWidenMemoryRecipe, | ||
| public VPValue { | ||
| VPWidenStridedLoadRecipe(LoadInst &Load, VPValue *Addr, VPValue *Stride, |
Can we store the required info independent of the LLVM IR instruction?
It would also help to clarify what information is actually used and why this cannot be simply a VPWidenIntrinsicsRecipe, as it would map 1-1 to an intrinsic
Can we store the required info independent of the LLVM IR instruction?
Maybe, if we also store the type information in the recipe. But I want to know, in what scenario would we create a VPWidenStridedLoadRecipe without a LoadInst?
It would also help to clarify what information is actually used and why this cannot be simply a VPWidenIntrinsicsRecipe, as it would map 1-1 to an intrinsic
I tried to do that: Mel-Chen@d19fe61
This approach does help eliminate one recipe VPWidenStridedLoadRecipe, but there are still a few details we need to be careful about:
First, the current VPWidenIntrinsicRecipe cannot set attributes, so alignment information will be lost. This could be resolved by passing an AttributeList to VPWidenIntrinsicRecipe, allowing it to add the attributes during ::execute.
```cpp
// ???: How to set alignment?
auto *StridedLoad = new VPWidenIntrinsicRecipe(
    Ingredient, Intrinsic::experimental_vp_strided_load,
    {NewPtr, StrideInBytes, Mask, I32VF}, TypeInfo.inferScalarType(LoadR),
    LoadR->getDebugLoc());
```
Next, it might become difficult to ensure profitability before generating the strided access (i.e., we may not be able to achieve the change suggested in this comment). For more accurate profitability analysis, it would be better to call getCostForIntrinsics directly during the profitability check, which requires that all operands are already prepared.
```cpp
// Better to make getCostForIntrinsics a utils function, and directly
// call getCostForIntrinsics to get the cost.
SmallVector<Type *, 4> ParamTys = {
    TypeInfo.inferScalarType(BasePtr),
    TypeInfo.inferScalarType(StrideInElement),
    Type::getInt1Ty(Plan.getContext()), Type::getInt32Ty(Plan.getContext())};
// FIXME: Copied from getCostForIntrinsics, but I think this is a bug. We
// don't have to vectorize every operand. Should be fixed in
// getCostForIntrinsics.
for (auto &Ty : ParamTys)
  Ty = toVectorTy(Ty, VF);
IntrinsicCostAttributes CostAttrs(
    Intrinsic::experimental_vp_strided_load, DataTy, {}, ParamTys,
    FastMathFlags(), nullptr, InstructionCost::getInvalid(), &Ctx.TLI);
const InstructionCost StridedLoadStoreCost =
    Ctx.TTI.getIntrinsicInstrCost(CostAttrs, Ctx.CostKind);
return StridedLoadStoreCost < CurrentCost;
```
Finally, using VPWidenIntrinsicRecipe reduces readability during transformations. Currently, VPWidenStridedLoadRecipe provides members like getAddr() and getStride(), which improves readability. This issue is not limited to EVL lowering; it should also have an impact on #164205.
```cpp
.Case<VPWidenIntrinsicRecipe>(
    [&](VPWidenIntrinsicRecipe *WI) -> VPRecipeBase * {
      if (WI->getVectorIntrinsicID() ==
          Intrinsic::experimental_vp_strided_load) {
        VPWidenIntrinsicRecipe *NewWI = WI->clone();
        if (VPValue *NewMask = GetNewMask(WI->getOperand(2)))
          NewWI->setOperand(2, NewMask);
        else
          NewWI->setOperand(2, &AllOneMask);
        NewWI->setOperand(3, &EVL);
        return NewWI;
      }
      return nullptr;
    })
```
That’s my take on this. What’s your opinion?
First, the current VPWidenIntrinsicRecipe cannot set attributes, so alignment information will be lost. This could be resolved by passing an AttributeList to VPWidenIntrinsicRecipe, allowing it to add the attributes during ::execute.
Hm yes, this is a general modeling issue, but could be improved separately (and does not only apply to this intrinsic I think). Is setting the alignment critical for the initial implementation?
Next, it might become difficult to ensure profitability before generating the strided access (i.e., we may not be able to achieve the change suggested in this comment
I am not sure I understand this completely; We should be able to check the profitability up-front regardless of what recipe we create later on, I think?
Finally, using VPWidenIntrinsicRecipe reduces readability during transformations. Currently, VPWidenStridedLoadRecipe provides members like getAddr() and getStride(), which improves readability. This issue is not limited to EVL lowering—it should also have an impact #164205.
I think this could be mostly taken care of by providing a m_StridedLoad() pattern?
Chiming in for codegen. The VP strided load/store intrinsics currently assume element alignment in SelectionDAGBuilder if not specified. This is not spelled out in the LangRef. It is spelled out for the plain vp.load/store intrinsics. @nikic recently added a FIXME suggesting that we shouldn't be falling back to ABI alignment for VP intrinsics
| // FIXME: Falling back to ABI alignment is incorrect. |
Some RISC-V hardware requires strided load/store to have at least element size alignment. More advanced hardware implementations may not require this.
I think we're missing an alignment check for all VP memory intrinsics in the backend. This would probably need to be legalized in IR like we do for non-VP masked.load/store/gather/scatter.
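For reference, the recipe in this patch does attach the alignment explicitly when it materializes the call; a trimmed excerpt from VPWidenStridedLoadRecipe::execute in the diff above (variable names as in the patch):
```cpp
// From the patch: the alignment of the original load is re-attached as a
// parameter attribute on the pointer operand of the vp.strided.load call,
// so codegen does not have to fall back to an assumed alignment.
CallInst *NewLI = Builder.CreateIntrinsic(
    Intrinsic::experimental_vp_strided_load, {DataTy, PtrTy, StrideTy},
    {Addr, StrideInBytes, Mask, RunTimeVF}, nullptr, "wide.strided.load");
NewLI->addParamAttr(
    0, Attribute::getWithAlignment(NewLI->getContext(), Alignment));
```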
Hm yes, this is a general modeling issue, but could be improved separately (and does not only apply to this intrinsic I think). Is setting the alignment critical for the initial implementation?
As Craig mentioned above, I think we should still preserve the alignment information for now. Moreover, since #164205 converts strided accesses into gather/scatter, if we don’t preserve it now, wouldn’t that cause the masked.gather/scatter to lose their alignment information?
I am not sure I understand this completely; We should be able to check the profitability up-front regardless of what recipe we create later on, I think?
Yes, that works. I’m currently using TTI.getIntrinsicInstrCost to achieve this. But I think the best approach would be to directly reuse `static InstructionCost getCostForIntrinsics(Intrinsic::ID ID, ArrayRef<const VPValue *> Operands, const VPRecipeWithIRFlags &R, ElementCount VF, VPCostContext &Ctx)` that is called by VPWidenIntrinsicRecipe::computeCost.
This would prevent potential discrepancies between the pre-conversion cost estimation and the actual recipe cost. I believe refactoring getCostForIntrinsics would be needed to make that work.
I think this could be mostly taken care of by providing a m_StridedLoad() pattern?
Yes, that’s the least important issue. Using m_Intrinsic<> directly is also fine.
Anyway, the main concern is still the alignment. @fhahn Could you tell me what your major concern is about adding VPWidenStridedLoadRecipe in details? That would help me think of a way to address it.
| bool onlyFirstLaneUsed(const VPValue *Op) const override { | ||
| assert(is_contained(operands(), Op) && | ||
| "Op must be an operand of the recipe"); | ||
| return Op == getAddr() || Op == getStride() || Op == getVF(); |
Might be good to document that this includes everything except the mask
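One possible wording (my suggestion only, not committed text), restating the quoted code with the requested comment:
```cpp
/// Suggested documentation sketch: every operand except the mask (the
/// address, the stride and the VF) is only consumed as a scalar, so only its
/// first lane is used; the mask, when present, is a per-lane operand.
bool onlyFirstLaneUsed(const VPValue *Op) const override {
  assert(is_contained(operands(), Op) && "Op must be an operand of the recipe");
  return Op == getAddr() || Op == getStride() || Op == getVF();
}
```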
| /// A recipe to compute the pointers for widened memory accesses of IndexedTy, | ||
| /// with the Stride expressed in units of IndexedTy. |
IndexedTy->SourceElementTy?
| return {WidenIV, WidenIV->getStepValue()}; | ||
| | ||
| auto *WidenR = dyn_cast<VPWidenRecipe>(CurIndex); | ||
| if (!WidenR || !CurIndex->getUnderlyingValue()) |
Why do we require an underlying value?
I’ve changed it like this for now.
2064369
We need UnderlyingInstr here to narrow all base address recipes into scalar recipes. As discussed earlier, we don’t want to rely on the legacy cost model, so we also can’t depend on its scalarize analysis. The functionality needed here is actually similar to narrowToSingleScalarRecipes; in both cases, we require UnderlyingInstr in order to create a VPReplicateRecipe.
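For context, the relevant part of matchStridedStart in this patch (lightly abridged from the diff earlier in the PR) shows why the underlying instruction is needed:
```cpp
// From the patch: the uniform start value is rebuilt as a single-scalar
// VPReplicateRecipe, which requires the underlying IR instruction of the
// widened recipe being narrowed.
SmallVector<VPValue *> StartOps(WidenR->operands());
StartOps[VarIdx] = Start;
auto *StartR = new VPReplicateRecipe(WidenR->getUnderlyingInstr(), StartOps,
                                     /*IsUniform*/ true);
StartR->insertBefore(WidenR);
```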
| | ||
| // Memory cost model requires the pointer operand of memory access | ||
| // instruction. | ||
| Value *PtrUV = Ptr->getUnderlyingValue(); |
Passing nullptr does not work?
12d526a
Since isLegalStridedLoadStore has already been called to check the legality before using Ptr->getUnderlyingValue(), I think passing nullptr should be fine for now. Any potential issues were discussed previously: #128718 (comment).
| PossiblyDead.insert(Ptr); | ||
| | ||
| // Create a new vector pointer for strided access. | ||
| auto *GEP = dyn_cast<GetElementPtrInst>(PtrUV->stripPointerCasts()); |
At this point, we should not look at the underlying value to determine flags, only use the ones from the recipes.
eb70f43
This approach is fine for now. However, in the VecEndPtr → VecPtr case, some flags may be lost. We’ve discussed this before in #128718 (comment).
Currently, LLVM vectorizes strided accesses with SVE as follows.
```c
void func(double* restrict a, double* b, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = b[i * 10] + 1;
  }
}
```
=>
```
  ...
  index z1.d, #0, #1
loop:
  add z2.d, z1.d, z0.d
  mul z1.d, z1.d, #80
  ld1d { z1.d }, p0/z, [x1, z1.d]
  ...
  mov z1.d, z2.d
  ...
```
This generated code is inefficient because it performs address calculation inside the loop using vector instructions. This can lead to performance degradation (llvm#129474). Ideally, we want to generate efficient instructions that keep the offset vector `z1` constant and update the base register `x1` with a scalar instruction.
```
  ...
  index z1.d, #0, #10
loop:
  ld1d z2.d, p7/z, [x1, z1.d, lsl 3]
  ...
  add x1, x1, x2
  ...
```
This patch enables strided accesses to be vectorized efficiently as shown above. This patch is based on llvm#147297. llvm#147297 detects strided accesses and converts them into stride recipes. This patch then changes it to a legal and efficient sequence of recipes for AArch64.
I am making this patch as a draft for the following reasons:
- I have not yet been able to create sufficient test cases for this patch.
- I have not yet been able to confirm that there are no performance degradations.
| | ||
| /// Checks if the given VPWidenGEPRecipe \p WidenGEP represents a strided | ||
| /// access. If so, it creates recipes representing the base pointer and stride | ||
| /// in element type, and returns a tuple of {base pointer, stride, element |
| /// in element type, and returns a tuple of {base pointer, stride, element | |
| /// scaled by the indexed element type's size and returns a tuple of {base pointer, stride, element |
Should this say stride scaled by the indexed type?
determineBaseAndStride returns the stride in element units, and the scaling by the element size is applied later by convertToStridedAccesses. So at this point the returned stride is not yet scaled.
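A rough sketch of that later scaling step, assuming names from the snippets quoted in this review (TypeInfo is a VPTypeAnalysis; Plan, DL and MemR come from the surrounding transform); the byte stride is the element stride times the allocation size of the accessed type:
```cpp
// Sketch only: scale the matched element stride to a byte stride before it is
// fed to the strided load.
Type *StrideTy = TypeInfo.inferScalarType(StrideInElement);
VPValue *EltSize = Plan.getOrAddLiveIn(
    ConstantInt::get(StrideTy, DL.getTypeAllocSize(ElementTy)));
auto *StrideInBytes = new VPInstruction(
    Instruction::Mul, {StrideInElement, EltSize}, MemR->getDebugLoc());
StrideInBytes->insertBefore(MemR);
```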
| // TODO: Transform gather/scatter with uniform address into strided access | ||
| // with 0 stride. | ||
| // TODO: Transform interleave access into multiple strided accesses. | ||
| if (!MemR || !isa<VPWidenLoadRecipe>(MemR) || MemR->isConsecutive()) |
| if (!MemR || !isa<VPWidenLoadRecipe>(MemR) || MemR->isConsecutive()) | |
| if (!MemR || MemR->isConsecutive()) |
Can dyn_cast to VPWidenLoadRecipe directly?
Yes, that works for now. 2c97e42
Although if we later support strided stores, we’ll probably go back to the current code.
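For clarity, the check after that simplification roughly looks like this (a sketch; the loop variable name R is an assumption):
```cpp
// Bail out unless this is a non-consecutive widened load, i.e. a gather that
// is a candidate for conversion to a strided access.
auto *LoadR = dyn_cast<VPWidenLoadRecipe>(&R);
if (!LoadR || LoadR->isConsecutive())
  continue;
```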
| | ||
| // Skip if the memory access is not a strided access. | ||
| if (!BasePtr) { | ||
| assert(!StrideInElement && !ElementTy); |
| assert(!StrideInElement && !ElementTy); | |
| assert(!StrideInElement && !ElementTy && ".."); |
would be good to add message
Sure.
2be3029
| /// A recipe for strided load operations, using the base address, stride, VF, | ||
| /// and an optional mask. This recipe will generate a vp.strided.load intrinsic | ||
| /// call to represent memory accesses with a fixed stride. | ||
| struct VPWidenStridedLoadRecipe final : public VPWidenMemoryRecipe, |
I think the name VPWidenStridedLoadRecipe is a bit misleading because it only emits VP strided loads. In #164205 this has to be replaced with a regular widened gather anyway for AArch64, since RISC-V is the only target that supports vp.strided.load.
I think we should first create a recipe or (canonical sequence of existing recipes) that executes something that all targets incl. AArch64 can lower.
And then separately optimize it to a VP intrinsic in optimizeMaskToEVL, possibly in a separate PR
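As a rough illustration of the target-independent form being suggested here (an IRBuilder-style sketch, not code from this patch; Base, StrideInBytes, StrideTy, EltTy, VF, Mask and Alignment are all assumed names), a strided access can be expressed as a masked gather over base + i * stride, which targets without vp.strided.load can already select:
```cpp
// Sketch only: lower a byte stride to a masked gather that any target with
// gather support can handle.
Value *Step = Builder.CreateStepVector(VectorType::get(StrideTy, VF));
Value *Offsets =
    Builder.CreateMul(Step, Builder.CreateVectorSplat(VF, StrideInBytes));
Value *Ptrs = Builder.CreateGEP(Builder.getInt8Ty(), Base, Offsets);
Value *Gather = Builder.CreateMaskedGather(VectorType::get(EltTy, VF), Ptrs,
                                           Alignment, Mask);
```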
I think the name VPWidenStridedLoadRecipe is a bit misleading because it only emits VP strided loads.
The reason is that there is no non-VP strided access intrinsic available.
I think we should first create a recipe or (canonical sequence of existing recipes) that executes something that all targets incl. AArch64 can lower.
My thinking here is different. I believe it is better to start with a high-level semantic recipe and gradually lower it into more concrete recipes, similar to what convertToConcreteRecipes does, rather than basing it solely on the number of targets that can support it.
And then separately optimize it to a VP intrinsic in optimizeMaskToEVL, possibly in a separate PR
No, strided accesses also work without tail folding, and this still cannot solve the alignment information issue we discussed above.
My thinking here is different. I believe it is better to start with a high-level semantic recipe and gradually lower it into more concrete recipes, similar to what convertToConcreteRecipes does, rather than basing it solely on the number of targets that can support it.
I think it's fine to have VPWidenStridedLoadRecipe as a high-level recipe; I'm not suggesting we remove it necessarily. I just think it shouldn't emit vp.strided.load, since every other target that's not RISC-V will have to "legalize" it.
We should convert it to a VP intrinsic as a separate step that's only run on RISC-V
I have a proposal:
In VPlanTransforms::convertToStridedAccesses, we still generate VPWidenStridedLoadRecipe. Whether the target wants to convert the gather into a strided load should be determined uniformly via TTI.getStridedMemoryOpCost.
Then, in VPlanTransforms::optimize (perhaps inside SimplifyRecipe), we can decide whether to convert it into vp.strided.load by calling TTI.isLegalStridedLoadStore (though this would require passing TTI into ::optimize).
Finally, during VPlanTransforms::convertToConcreteRecipes, we expand all remaining VPWidenStridedLoadRecipes into a widen GEP + gather. Alternatively, that last step could transform VPWidenStridedLoadRecipe into either vp.strided.load or a widen GEP + gather, depending on TTI.isLegalStridedLoadStore; or we could emit vp.strided.load or widen GEP + gather directly in VPlanTransforms::convertToStridedAccesses, likewise depending on TTI.isLegalStridedLoadStore.
However, both approaches still require VPWidenIntrinsicRecipe to support alignment information.
If this proposal sounds acceptable, maybe as an initial patch we can just leave a TODO in VPlanTransforms::convertToStridedAccesses to document this plan.
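To make the dispatch in that proposal concrete, here is a minimal sketch; only TTI.isLegalStridedLoadStore is a real hook, the two helpers are hypothetical placeholders for the lowering paths discussed above:
```cpp
// Hypothetical split: keep the abstract strided-load recipe target
// independent, then pick the concrete form late based on target legality.
if (TTI.isLegalStridedLoadStore(DataTy, Alignment))
  lowerToVPStridedLoad(StridedLoadR);        // RISC-V: emit vp.strided.load
else
  expandToWidenGEPAndGather(StridedLoadR);   // other targets: widen GEP + gather
```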
This patch detects stride memory accesses such as:
and converts widen non-consecutive loads (i.e. gathers) into strided loads when profitable and legal.
The transformation is implemented as part of VPlan and is inspired by existing logic in RISCVGatherScatterLowering. Some of the legality and analysis logic has been moved into VPlan to enable this conversion earlier during vectorization planning.
This enables more efficient code generation for targets like RISC-V that support strided loads natively.