[VPlan] Enable vectorization of early-exit loops with unit-stride fault-only-first loads #151300
base: main
Conversation
Regarding EVL propagation, I just noticed that header mask handling is fixed in #150202. This fix will need to be incorporated accordingly.
✅ With the latest revision this PR passed the C/C++ code formatter.
Check this first?
Updated. Thanks!
I think it would be good for reviewers if this patch was split up into several parts, probably roughly in this order:
Thanks for outlining the steps. That's very helpful! I'll follow this order for splitting up the patch.
@@ -7749,6 +7758,12 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands,
    Builder.insert(VectorPtr);
    Ptr = VectorPtr;
  }
  if (Legal->getSpeculativeLoads().contains(I)) {
    auto *Load = dyn_cast<LoadInst>(I);
    return new VPWidenFFLoadRecipe(*Load, Ptr, Mask, VPIRMetadata(*Load, LVer),
Just making a note for later. I think it would be good to avoid having a dead VPWidenFFLoadRecipe when there's no non-EVL version of vp.load.ff.
Can we instead just have one VPWidenFFLoadRecipe and here pass VF as the EVL argument? optimizeMaskToEVL can then set the EVL from the header mask later.
Updated! After rebasing, this patch supports WidenFFLoad in non-tail-folded mode only. Tail-folding early-exit loops and FFLoad support in the EVL transform will have to be addressed in a separate patch.
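For reference, the updated construction (it also appears verbatim in the patch further down) passes the plan's VF as the recipe's second operand, so a later transform can tighten it to a capped AVL or an EVL:

// Copied from the patch below, shown here only to make the discussion concrete.
if (Legal->getPotentiallyFaultingLoads().contains(I))
  return new VPWidenFFLoadRecipe(*cast<LoadInst>(I), Ptr, &Plan.getVF(), Mask,
                                 VPIRMetadata(*I, LVer), I->getDebugLoc());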
This patch splits out the legality checks from PR #151300, following the landing of PR #128593. It is a step toward supporting vectorization of early-exit loops that contain potentially faulting loads. In this commit, an early-exit loop is considered legal for vectorization if it satisfies the following criteria:
1. It is a read-only loop.
2. All potentially faulting loads are unit-stride, which is the only kind currently supported by vp.load.ff.
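For illustration, a hedged example (assumed, not taken from the patch or its tests) of a loop shape that meets both criteria: the body only reads memory, and the potentially faulting access p[i] is unit-stride. A strided access such as p[2*i] would still be rejected, since vp.load.ff only handles unit-stride loads.

// Hypothetical C/C++ example of a loop the new legality check targets:
// read-only body, one uncountable early exit, unit-stride faulting load.
int first_match(const int *p, int n, int key) {
  for (int i = 0; i < n; i++)
    if (p[i] == key) // early exit; a vector load here may read lanes past
      return i;      // the exit point, hence "potentially faulting"
  return -1;
}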
1bb79f8 to f8d7616 Compare
f8d7616 to 1ab67ae Compare
@llvm/pr-subscribers-backend-risc-v @llvm/pr-subscribers-llvm-transforms
Author: Shih-Po Hung (arcbbb)
Changes
Following #152422, this patch enables auto-vectorization of early-exit loops containing a single potentially faulting, unit-stride load by using the vp.load.ff intrinsic introduced in #128593.
Key changes:
Limitations:
Patch is 34.01 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151300.diff 9 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index 1d3cffa2b61bf..e28d4c45d4ab8 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -393,6 +393,12 @@ static cl::opt<bool> EnableEarlyExitVectorization( cl::desc( "Enable vectorization of early exit loops with uncountable exits.")); +static cl::opt<bool> + EnableEarlyExitWithFFLoads("enable-early-exit-with-ffload", cl::init(false), + cl::Hidden, + cl::desc("Enable vectorization of early-exit " + "loops with fault-only-first loads.")); + static cl::opt<bool> ConsiderRegPressure( "vectorizer-consider-reg-pressure", cl::init(false), cl::Hidden, cl::desc("Discard VFs if their register pressure is too high.")); @@ -3507,6 +3513,15 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) { return FixedScalableVFPair::getNone(); } + if (!Legal->getPotentiallyFaultingLoads().empty() && UserIC > 1) { + reportVectorizationFailure("Auto-vectorization of loops with potentially " + "faulting loads is not supported when the " + "interleave count is more than 1", + "CantInterleaveLoopWithPotentiallyFaultingLoads", + ORE, TheLoop); + return FixedScalableVFPair::getNone(); + } + ScalarEvolution *SE = PSE.getSE(); ElementCount TC = getSmallConstantTripCount(SE, TheLoop); unsigned MaxTC = PSE.getSmallConstantMaxTripCount(); @@ -4076,6 +4091,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF, case VPDef::VPReductionPHISC: case VPDef::VPInterleaveEVLSC: case VPDef::VPInterleaveSC: + case VPDef::VPWidenFFLoadSC: case VPDef::VPWidenLoadEVLSC: case VPDef::VPWidenLoadSC: case VPDef::VPWidenStoreEVLSC: @@ -4550,6 +4566,10 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF, if (!Legal->isSafeForAnyVectorWidth()) return 1; + // No interleaving for potentially faulting loads. + if (!Legal->getPotentiallyFaultingLoads().empty()) + return 1; + // We don't attempt to perform interleaving for loops with uncountable early // exits because the VPInstruction::AnyOf code cannot currently handle // multiple parts. @@ -7216,6 +7236,9 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan( // Regions are dissolved after optimizing for VF and UF, which completely // removes unneeded loop regions first. VPlanTransforms::dissolveLoopRegions(BestVPlan); + + VPlanTransforms::convertFFLoadEarlyExitToVLStepping(BestVPlan); + // Canonicalize EVL loops after regions are dissolved. 
VPlanTransforms::canonicalizeEVLLoops(BestVPlan); VPlanTransforms::materializeBackedgeTakenCount(BestVPlan, VectorPH); @@ -7598,6 +7621,10 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands, Builder.insert(VectorPtr); Ptr = VectorPtr; } + if (Legal->getPotentiallyFaultingLoads().contains(I)) + return new VPWidenFFLoadRecipe(*cast<LoadInst>(I), Ptr, &Plan.getVF(), Mask, + VPIRMetadata(*I, LVer), I->getDebugLoc()); + if (LoadInst *Load = dyn_cast<LoadInst>(I)) return new VPWidenLoadRecipe(*Load, Ptr, Mask, Consecutive, Reverse, VPIRMetadata(*Load, LVer), I->getDebugLoc()); @@ -8632,6 +8659,10 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes( if (Recipe->getNumDefinedValues() == 1) { SingleDef->replaceAllUsesWith(Recipe->getVPSingleValue()); Old2New[SingleDef] = Recipe->getVPSingleValue(); + } else if (isa<VPWidenFFLoadRecipe>(Recipe)) { + VPValue *Data = Recipe->getVPValue(0); + SingleDef->replaceAllUsesWith(Data); + Old2New[SingleDef] = Data; } else { assert(Recipe->getNumDefinedValues() == 0 && "Unexpected multidef recipe"); @@ -8679,6 +8710,8 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes( // Adjust the recipes for any inloop reductions. adjustRecipesForReductions(Plan, RecipeBuilder, Range.Start); + VPlanTransforms::adjustFFLoadEarlyExitForPoisonSafety(*Plan); + // Apply mandatory transformation to handle FP maxnum/minnum reduction with // NaNs if possible, bail out otherwise. if (!VPlanTransforms::runPass(VPlanTransforms::handleMaxMinNumReductions, @@ -9869,7 +9902,14 @@ bool LoopVectorizePass::processLoop(Loop *L) { return false; } - if (!LVL.getPotentiallyFaultingLoads().empty()) { + if (EnableEarlyExitWithFFLoads) { + if (LVL.getPotentiallyFaultingLoads().size() > 1) { + reportVectorizationFailure("Auto-vectorization of loops with more than 1 " + "potentially faulting load is not enabled", + "MoreThanOnePotentiallyFaultingLoad", ORE, L); + return false; + } + } else if (!LVL.getPotentiallyFaultingLoads().empty()) { reportVectorizationFailure("Auto-vectorization of loops with potentially " "faulting load is not supported", "PotentiallyFaultingLoadsNotSupported", ORE, L); diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h index f79855f7e2c5f..6e28c95ca601a 100644 --- a/llvm/lib/Transforms/Vectorize/VPlan.h +++ b/llvm/lib/Transforms/Vectorize/VPlan.h @@ -563,6 +563,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue { case VPRecipeBase::VPInterleaveEVLSC: case VPRecipeBase::VPInterleaveSC: case VPRecipeBase::VPIRInstructionSC: + case VPRecipeBase::VPWidenFFLoadSC: case VPRecipeBase::VPWidenLoadEVLSC: case VPRecipeBase::VPWidenLoadSC: case VPRecipeBase::VPWidenStoreEVLSC: @@ -2811,6 +2812,13 @@ class LLVM_ABI_FOR_TEST VPReductionEVLRecipe : public VPReductionRecipe { ArrayRef<VPValue *>({R.getChainOp(), R.getVecOp(), &EVL}), CondOp, R.isOrdered(), DL) {} + VPReductionEVLRecipe(RecurKind RdxKind, FastMathFlags FMFs, VPValue *ChainOp, + VPValue *VecOp, VPValue &EVL, VPValue *CondOp, + bool IsOrdered, DebugLoc DL = DebugLoc::getUnknown()) + : VPReductionRecipe(VPDef::VPReductionEVLSC, RdxKind, FMFs, nullptr, + ArrayRef<VPValue *>({ChainOp, VecOp, &EVL}), CondOp, + IsOrdered, DL) {} + ~VPReductionEVLRecipe() override = default; VPReductionEVLRecipe *clone() override { @@ -3159,6 +3167,7 @@ class LLVM_ABI_FOR_TEST VPWidenMemoryRecipe : public VPRecipeBase, static inline bool classof(const VPRecipeBase *R) { return R->getVPDefID() == VPRecipeBase::VPWidenLoadSC || 
R->getVPDefID() == VPRecipeBase::VPWidenStoreSC || + R->getVPDefID() == VPRecipeBase::VPWidenFFLoadSC || R->getVPDefID() == VPRecipeBase::VPWidenLoadEVLSC || R->getVPDefID() == VPRecipeBase::VPWidenStoreEVLSC; } @@ -3240,6 +3249,42 @@ struct LLVM_ABI_FOR_TEST VPWidenLoadRecipe final : public VPWidenMemoryRecipe, } }; +/// A recipe for widening loads using fault-only-first intrinsics. +/// Produces two results: (1) the loaded data, and (2) the index of the first +/// non-dereferenceable lane, or VF if all lanes are successfully read. +struct VPWidenFFLoadRecipe final : public VPWidenMemoryRecipe, public VPValue { + VPWidenFFLoadRecipe(LoadInst &Load, VPValue *Addr, VPValue *VF, VPValue *Mask, + const VPIRMetadata &Metadata, DebugLoc DL) + : VPWidenMemoryRecipe(VPDef::VPWidenFFLoadSC, Load, {Addr, VF}, + /*Consecutive*/ true, /*Reverse*/ false, Metadata, + DL), + VPValue(this, &Load) { + new VPValue(nullptr, this); // Index of the first lane that faults. + setMask(Mask); + } + + VP_CLASSOF_IMPL(VPDef::VPWidenFFLoadSC); + + /// Return the VF operand. + VPValue *getVF() const { return getOperand(1); } + void setVF(VPValue *V) { setOperand(1, V); } + + void execute(VPTransformState &State) override; + +#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent, + VPSlotTracker &SlotTracker) const override; +#endif + + /// Returns true if the recipe only uses the first lane of operand \p Op. + bool onlyFirstLaneUsed(const VPValue *Op) const override { + assert(is_contained(operands(), Op) && + "Op must be an operand of the recipe"); + return Op == getVF() || Op == getAddr(); + } +}; + /// A recipe for widening load operations with vector-predication intrinsics, /// using the address to load from, the explicit vector length and an optional /// mask. 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp index 46ab7712e2671..684dbd25597e3 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp @@ -188,8 +188,9 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenCallRecipe *R) { } Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenMemoryRecipe *R) { - assert((isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe>(R)) && - "Store recipes should not define any values"); + assert( + (isa<VPWidenLoadRecipe, VPWidenFFLoadRecipe, VPWidenLoadEVLRecipe>(R)) && + "Store recipes should not define any values"); return cast<LoadInst>(&R->getIngredient())->getType(); } diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp index 8e9c3db50319f..3da8613a1d3cc 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp @@ -73,6 +73,7 @@ bool VPRecipeBase::mayWriteToMemory() const { case VPReductionPHISC: case VPScalarIVStepsSC: case VPPredInstPHISC: + case VPWidenFFLoadSC: return false; case VPBlendSC: case VPReductionEVLSC: @@ -107,6 +108,7 @@ bool VPRecipeBase::mayReadFromMemory() const { return cast<VPInstruction>(this)->opcodeMayReadOrWriteFromMemory(); case VPWidenLoadEVLSC: case VPWidenLoadSC: + case VPWidenFFLoadSC: return true; case VPReplicateSC: return cast<Instruction>(getVPSingleValue()->getUnderlyingValue()) @@ -3409,6 +3411,47 @@ void VPWidenLoadRecipe::print(raw_ostream &O, const Twine &Indent, } #endif +void VPWidenFFLoadRecipe::execute(VPTransformState &State) { + Type *ScalarDataTy = getLoadStoreType(&Ingredient); + auto *DataTy = VectorType::get(ScalarDataTy, State.VF); + const Align Alignment = getLoadStoreAlignment(&Ingredient); + + auto &Builder = State.Builder; + State.setDebugLocFrom(getDebugLoc()); + + Value *VL = State.get(getVF(), VPLane(0)); + Type *I32Ty = Builder.getInt32Ty(); + VL = Builder.CreateZExtOrTrunc(VL, I32Ty); + Value *Addr = State.get(getAddr(), true); + Value *Mask = nullptr; + if (VPValue *VPMask = getMask()) + Mask = State.get(VPMask); + else + Mask = Builder.CreateVectorSplat(State.VF, Builder.getTrue()); + CallInst *NewLI = + Builder.CreateIntrinsic(Intrinsic::vp_load_ff, {DataTy, Addr->getType()}, + {Addr, Mask, VL}, nullptr, "vp.op.load.ff"); + NewLI->addParamAttr( + 0, Attribute::getWithAlignment(NewLI->getContext(), Alignment)); + applyMetadata(*NewLI); + Value *V = cast<Instruction>(Builder.CreateExtractValue(NewLI, 0)); + Value *NewVL = Builder.CreateExtractValue(NewLI, 1); + State.set(getVPValue(0), V); + State.set(getVPValue(1), NewVL, /*NeedsScalar=*/true); +} + +#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) +void VPWidenFFLoadRecipe::print(raw_ostream &O, const Twine &Indent, + VPSlotTracker &SlotTracker) const { + O << Indent << "WIDEN "; + printAsOperand(O, SlotTracker); + O << ", "; + getVPValue(1)->printAsOperand(O, SlotTracker); + O << " = vp.load.ff "; + printOperands(O, SlotTracker); +} +#endif + /// Use all-true mask for reverse rather than actual mask, as it avoids a /// dependence w/o affecting the result. 
static Instruction *createReverseEVL(IRBuilderBase &Builder, Value *Operand, diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp index 1f6b85270607e..7e78cb6ed02ac 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp @@ -2760,6 +2760,102 @@ void VPlanTransforms::addExplicitVectorLength( Plan.setUF(1); } +void VPlanTransforms::adjustFFLoadEarlyExitForPoisonSafety(VPlan &Plan) { + VPBasicBlock *Header = Plan.getVectorLoopRegion()->getEntryBasicBlock(); + VPWidenFFLoadRecipe *LastFFLoad = nullptr; + for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>( + vp_depth_first_deep(Plan.getVectorLoopRegion()))) + for (VPRecipeBase &R : *VPBB) + if (auto *Load = dyn_cast<VPWidenFFLoadRecipe>(&R)) { + assert(!LastFFLoad && "Only one FFLoad is supported"); + LastFFLoad = Load; + } + + // Skip if no FFLoad. + if (!LastFFLoad) + return; + + // Ensure FFLoad does not read past the remainder in the last iteration. + // Set AVL to min(VF, remainder). + VPBuilder Builder(Header, Header->getFirstNonPhi()); + VPValue *Remainder = Builder.createNaryOp( + Instruction::Sub, {&Plan.getVectorTripCount(), Plan.getCanonicalIV()}); + VPValue *Cmp = + Builder.createICmp(CmpInst::ICMP_ULE, &Plan.getVF(), Remainder); + VPValue *AVL = Builder.createSelect(Cmp, &Plan.getVF(), Remainder); + LastFFLoad->setVF(AVL); + + // To prevent branch-on-poison, rewrite the early-exit condition to + // VPReductionEVLRecipe. Expected pattern here is: + // EMIT vp<%alt.exit.cond> = AnyOf + // EMIT vp<%exit.cond> = or vp<%alt.exit.cond>, vp<%main.exit.cond> + // EMIT branch-on-cond vp<%exit.cond> + auto *ExitingLatch = cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getExiting()); + auto *LatchExitingBr = cast<VPInstruction>(ExitingLatch->getTerminator()); + + VPValue *VPAnyOf = nullptr; + VPValue *VecOp = nullptr; + assert( + match(LatchExitingBr, + m_BranchOnCond(m_BinaryOr(m_VPValue(VPAnyOf), m_VPValue()))) && + match(VPAnyOf, m_VPInstruction<VPInstruction::AnyOf>(m_VPValue(VecOp))) && + "unexpected exiting sequence in early exit loop"); + + VPValue *OpVPEVLI32 = LastFFLoad->getVPValue(1); + VPValue *Mask = LastFFLoad->getMask(); + FastMathFlags FMF; + auto *I1Ty = Type::getInt1Ty(Plan.getContext()); + VPValue *VPZero = Plan.getOrAddLiveIn(ConstantInt::get(I1Ty, 0)); + DebugLoc DL = VPAnyOf->getDefiningRecipe()->getDebugLoc(); + auto *NewAnyOf = + new VPReductionEVLRecipe(RecurKind::Or, FMF, VPZero, VecOp, *OpVPEVLI32, + Mask, /*IsOrdered*/ false, DL); + NewAnyOf->insertBefore(VPAnyOf->getDefiningRecipe()); + VPAnyOf->replaceAllUsesWith(NewAnyOf); + + // Using FirstActiveLane in the early-exit block is safe, + // exiting conditions guarantees at least one valid lane precedes + // any poisoned lanes. +} + +void VPlanTransforms::convertFFLoadEarlyExitToVLStepping(VPlan &Plan) { + // Find loop header by locating VPWidenFFLoadRecipe. + VPWidenFFLoadRecipe *LastFFLoad = nullptr; + + for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>( + vp_depth_first_shallow(Plan.getEntry()))) + for (VPRecipeBase &R : *VPBB) + if (auto *Load = dyn_cast<VPWidenFFLoadRecipe>(&R)) { + assert(!LastFFLoad && "Only one FFLoad is supported"); + LastFFLoad = Load; + } + + // Skip if no FFLoad. + if (!LastFFLoad) + return; + + VPBasicBlock *HeaderVPBB = LastFFLoad->getParent(); + // Replace IVStep (VFxUF) with returned VL from FFLoad. 
+ auto *CanonicalIV = cast<VPPhi>(&*HeaderVPBB->begin()); + VPValue *Backedge = CanonicalIV->getIncomingValue(1); + assert(match(Backedge, m_c_Add(m_Specific(CanonicalIV), + m_Specific(&Plan.getVFxUF()))) && + "Unexpected canonical iv"); + VPRecipeBase *CanonicalIVIncrement = Backedge->getDefiningRecipe(); + VPValue *OpVPEVLI32 = LastFFLoad->getVPValue(1); + VPBuilder Builder(HeaderVPBB, HeaderVPBB->getFirstNonPhi()); + Builder.setInsertPoint(CanonicalIVIncrement); + auto *TC = Plan.getTripCount(); + Type *CanIVTy = TC->isLiveIn() + ? TC->getLiveInIRValue()->getType() + : cast<VPExpandSCEVRecipe>(TC)->getSCEV()->getType(); + auto *I32Ty = Type::getInt32Ty(Plan.getContext()); + VPValue *OpVPEVL = Builder.createScalarZExtOrTrunc( + OpVPEVLI32, CanIVTy, I32Ty, CanonicalIVIncrement->getDebugLoc()); + + CanonicalIVIncrement->setOperand(1, OpVPEVL); +} + void VPlanTransforms::canonicalizeEVLLoops(VPlan &Plan) { // Find EVL loop entries by locating VPEVLBasedIVPHIRecipe. // There should be only one EVL PHI in the entire plan. diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h index 69452a7e37572..bc5ce3bc43e76 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h +++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h @@ -269,6 +269,17 @@ struct VPlanTransforms { /// (branch-on-cond eq AVLNext, 0) static void canonicalizeEVLLoops(VPlan &Plan); + /// Applies to early-exit loops that use FFLoad. FFLoad may yield fewer active + /// lanes than VF. To prevent branch-on-poison and over-reads past the vector + /// trip count, use the returned VL for both stepping and exit computation. + /// Implemented by: + /// - adjustFFLoadEarlyExitForPoisonSafety: replace AnyOf with vp.reduce.or over + /// the first VL lanes; set AVL = min(VF, remainder). + /// - convertFFLoadEarlyExitToVLStepping: after region dissolution, convert + /// early-exit loops to variable-length stepping. + static void adjustFFLoadEarlyExitForPoisonSafety(VPlan &Plan); + static void convertFFLoadEarlyExitToVLStepping(VPlan &Plan); + /// Lower abstract recipes to concrete ones, that can be codegen'd. static void convertToConcreteRecipes(VPlan &Plan); diff --git a/llvm/lib/Transforms/Vectorize/VPlanValue.h b/llvm/lib/Transforms/Vectorize/VPlanValue.h index 0678bc90ef4b5..b2bc430a09686 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanValue.h +++ b/llvm/lib/Transforms/Vectorize/VPlanValue.h @@ -40,6 +40,7 @@ class VPUser; class VPRecipeBase; class VPInterleaveBase; class VPPhiAccessors; +class VPWidenFFLoadRecipe; // This is the base class of the VPlan Def/Use graph, used for modeling the data // flow into, within and out of the VPlan. VPValues can stand for live-ins @@ -51,6 +52,7 @@ class LLVM_ABI_FOR_TEST VPValue { friend class VPInterleaveBase; friend class VPlan; friend class VPExpressionRecipe; + friend class VPWidenFFLoadRecipe; const unsigned char SubclassID; ///< Subclass identifier (for isa/dyn_cast). 
@@ -351,6 +353,7 @@ class VPDef { VPWidenCastSC, VPWidenGEPSC, VPWidenIntrinsicSC, + VPWidenFFLoadSC, VPWidenLoadEVLSC, VPWidenLoadSC, VPWidenStoreEVLSC, diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp index 92caa0b4e51d5..70e6e0d006eb6 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp @@ -166,8 +166,8 @@ bool VPlanVerifier::verifyEVLRecipe(const VPInstruction &EVL) const { } return VerifyEVLUse(*R, 2); }) - .Case<VPWidenLoadEVLRecipe, VPVectorEndPointerRecipe, - VPInterleaveEVLRecipe>( + .Case<VPWidenLoadEVLRecipe, VPWidenFFLoadRecipe, + VPVectorEndPointerRecipe, VPInterleaveEVLRecipe>( [&](const VPRecipeBase *R) { return VerifyEVLUse(*R, 1); }) .Case<VPInstructionWithType>( [&](const VPInstructionWithType *S) { return VerifyEVLUse(*S, 0); }) diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/find.ll b/llvm/test/Transforms/LoopVectorize/RISCV/find.ll new file mode 100644 index 0000000000000..f734bd5f53c82 --- /dev/null +++ b/llvm/test/Transforms/LoopVectorize/RISCV/find.ll @@ -0... [truncated] |
Rebased after #152422 landed, and updated the PR description as well as the title.
b6add27 to 755ad37 Compare
Split out from #151300 to isolate TargetTransformInfo cost modelling for fault-only-first loads from VPlan implementation details. This change adds costing support for vp.load.ff independently of the VPlan work. For now, model a vp.load.ff as cost-equivalent to a vp.load.
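A minimal sketch of that cost-model idea follows; it is not the code from the split-out patch, and the helper name, parameters, and hook placement are assumptions for illustration only. The point is simply to charge a vp.load.ff like a plain vector load of the data field of its {data, vl} return struct.

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Intrinsics.h"
#include <cassert>

using namespace llvm;

// Hypothetical helper: model vp.load.ff as cost-equivalent to a vp.load by
// costing only the vector data element of its returned struct.
static InstructionCost
costVPLoadFFLikeVPLoad(const IntrinsicCostAttributes &ICA, Align Alignment,
                       unsigned AddrSpace,
                       TargetTransformInfo::TargetCostKind CostKind,
                       const TargetTransformInfo &TTI) {
  assert(ICA.getID() == Intrinsic::vp_load_ff && "expected vp.load.ff");
  auto *RetTy = cast<StructType>(ICA.getReturnType());
  Type *DataTy = RetTy->getElementType(0); // the loaded vector
  return TTI.getMemoryOpCost(Instruction::Load, DataTy, Alignment, AddrSpace,
                             CostKind);
}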
e4051e2 to 02f8262 Compare
Gentle nudge. Any thoughts?
Hi, I wanted to try this patch but the new test is failing for a Release build.
I'm building with:
It works with
Thanks for the report! I missed this when testing with a Release build with LLVM_ENABLE_ASSERTIONS=ON.
Latest update addressed review comments:
4e8d2e9 to deb51fd Compare
…C). llvm#165218
Splitting from llvm#151300: vp_load_ff returns a struct type that cannot be widened by toVectorizedTy. This patch adds isVectorIntrinsicWithStructReturnScalarAtField and widens each struct element type independently.
f2b9253 to 6a3dd4d Compare
Rebased to include #169890.
lukel97 left a comment
I just took a brief look at this; I'll hopefully have more time to give it a closer review next week.
From what I understand, we currently only support early-exit loops without tail folding. Do we have a plan for how to support tail folding eventually, and how that would interact with vp.load.ff?
| "Enable vectorization of early exit loops with uncountable exits.")); | ||
| | ||
| static cl::opt<bool> | ||
| EnableEarlyExitWithFFLoads("enable-early-exit-with-ffload", cl::init(false), |
In the spirit of incremental development, can we remove this option and just have it on by default?
Does it work for other targets besides RISCV? If so, the PR should have some tests for other backends too, or even better in the top level Transforms/LoopVectorize directory.
    return;

  // Ensure FFLoad does not read past the remainder in the last iteration.
  // Set AVL to min(VF, remainder).
I'm confused as to why we need to cap the AVL for vp.load.ff. The first lane should always be dereferenceable, and then it shouldn't be an issue if it tries to read the remainder lanes because it doesn't trap?
Is this more of an optimization to reduce VL so it's not overly large?
This is primarily a correctness issue. Since the header mask is derived from the second returned value of vp.load.ff, we can use it directly only if the AVL is capped before the call.
For example:
%9 = call { <vscale x 16 x i8>, i32 } @llvm.vp.load.ff.nxv16i8.p0(ptr %8, <vscale x 16 x i1> splat (i1 true), i32 %7)
%10 = extractvalue { <vscale x 16 x i8>, i32 } %9, 1
%11 = zext i32 %10 to i64
%14 = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %11)
get.active.lane.mask generates the predicate mask and uses %11 directly because AVL was capped earlier. If not, we would cap it before calling get.active.lane.mask.
lukel97 left a comment
I don't think this is going to play well with EVL tail folding because we'll now have two different transforms trying to convert the plan to variable stepping, convertFFLoadEarlyExitToVLStepping and transformRecipestoEVLRecipes.
At a high level I wonder if we even want to support vp.load.ff without EVL tail folding to begin with. This PR from what I understand is kind of reimplementing a weaker version of EVL tail folding, since the variable stepping is a hard requirement of the vp.load.ff intrinsic that we can't avoid. It can reduce the number of lanes read for any reason.
Trying this PR out on some llvm-test-suite benchmarks shows that the generated code always seems to generate LMUL 8 step vectors, which is probably not great for performance:
+.LBB1197_155: # Parent Loop BB1197_123 Depth=1
+ # => This Inner Loop Header: Depth=2
+ sub a2, a0, a1
+ addi a3, sp, 1280
+ add a3, a3, a1
+ minu a2, s11, a2
+ vsetvli zero, a2, e8, m2, ta, ma
+ vle8ff.v v16, (a3)
+ csrr a2, vl
+ csrr a3, vlenb
+ vsetvli a4, zero, e64, m8, ta, ma
+ vid.v v8
+ vmv.v.v v24, v8
+ vadd.vx v8, v8, a3
+ zext.w a4, a2
+ vmsltu.vx v18, v8, a4
+ vmsltu.vx v8, v24, a4
+ srli a4, a3, 2
+ vsetvli a5, zero, e8, m2, ta, ma
+ vmseq.vi v9, v16, 0
+ srli a3, a3, 3
+ vsetvli zero, a4, e8, mf4, ta, ma
+ vslideup.vx v8, v18, a3
+ vsetvli a3, zero, e8, m2, ta, ma
+ vmand.mm v8, v8, v9
+ vcpop.m a3, v8
+ bnez a3, .LBB1197_157
+# %bb.156: # in Loop: Header=BB1197_155 Depth=2
+ add.uw a1, a2, a1
+ bne a1, a0, .LBB1197_155
+.LBB1197_157: # in Loop: Header=BB1197_123 Depth=1
+ snez a1, a3
+ beqz a1, .LBB1197_159
+.LBB1197_158: # in Loop: Header=BB1197_123 Depth=1

So I don't think there's really much reason why we would want to emit non-tail folded early-exit loops if we can tail fold them eventually.
I understand that this is supposed to be an incremental PR, but I think maybe a better ordering might be to start by supporting early exit loops with tail folding. I think this means we need to address the "variable header mask" TODO here:
bool LoopVectorizationLegality::canFoldTailByMasking() const {
  // The only loops we can vectorize without a scalar epilogue, are loops with
  // a bottom-test and a single exiting block. We'd have to handle the fact
  // that not every instruction executes on the last iteration. This will
  // require a lane mask which varies through the vector loop body. (TODO)
  if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
    LLVM_DEBUG(
        dbgs() << "LV: Cannot fold tail by masking. Requires a singe latch exit\n");
    return false;
  }

I think we can do this if we replace the notion of a header mask with the notion of a per-block header mask. I'll see if I can create an issue to discuss some of the design of this more.
  // EMIT vp<%alt.exit.cond> = any-of vp<%and>
  // EMIT vp<%exit.cond> = or vp<%alt.exit.cond>, vp<%main.exit.cond>
  // EMIT branch-on-cond vp<%exit.cond>
  auto *ExitingLatch =
I think this works on the assumption that the poison lanes from the vp.load.ff never get shuffled around, e.g. they aren't reversed. But I think this always holds, just making a note.
If performance is a concern, I'm considering replacing the active-lane-mask + reduce_or sequence with a vfirst intrinsic + icmp in CodeGenPrepare, i.e. rewriting the former sequence into the latter. Would this be preferable?
Thanks for flagging early-exit loops with tail folding. That is also a nice-to-have and I am keen to see it.
If we have an early exit loop with non-dereferenceable loads after the exit, we currently bail:

int z;
for (int i = 0; i < N; i++) {
  if (x[i]) break;
  z = y[i];
}

If the early exit block dominates the block containing these loads, we can predicate them with a mask, like:

for (int i = 0; i < N/VF; i++) {
  c[0..VF] = x[i..i+VF]
  z[0..VF] = y[i..i+VF], mask=c
  if (anyof(c)) break;
}

In VPlan terms, this is `icmp ult step-vector, (first-active-lane exit-cond)`.

VPlanPredicator can handle predicating these blocks, but in tryToBuildVPlanWithVPRecipes we first disconnect all early exits before the masks are introduced:

// entry -> exiting -> ... -> latch
//             |
//             +-----> earlyexit
VPlanTransforms::handleEarlyExits(*Plan);

// entry -> exiting -> ... -> latch
VPlanTransforms::introduceMasksAndLinearize(*Plan);

This is needed to keep the region single entry/single exit, but it also means that there isn't any control flow by the time we want to add the masks:

exiting:
  %earlyexitcond = ...
  // one successor (latch)
latch:
  %exitcond = or (anyof %earlyexitcond), %origexitcond
  br %exitcond, entry, exit

This patch propagates the information to VPlanPredicator that the successors should be predicated, even though there isn't actually control flow, with a new EarlyExit VPInstruction:

exiting:
  %earlyexitcond = ...
  earlyexit (icmp ult step-vector, (first-active-lane %earlyexitcond))
  // one successor (latch)
latch:
  %exitcond = or (anyof %earlyexitcond), origexitcond
  br %exitcond, entry, exit

It's just a placeholder and gets immediately removed whenever VPlanPredicator sees it, but it allows it to use the correct mask. This makes way for supporting more types of loops, as we could also support stores/divs etc. as long as the exiting block dominates them. See the note in canUncountableExitConditionLoadBeMoved as to why we can't predicate stores when they're before the exiting block.

However, the main motivation for this is to allow us to support tail folding with early exits, which I believe will be needed to make supporting fault-only-first loads simpler: llvm#151300

In order to actually test the changes from this, this PR allows non-dereferenceable loads that are properly dominated by the exiting block in LoopVectorizationLegality. But in practice, something else usually transforms these loops to be multiple-entry, which prevents them from being vectorized.
I think we need tests in the top level LoopVectorize directory too. For example, I think it's worth adding an extra RUN line with the -enable-early-exit-with-ffload flag to Transforms/LoopVectorize/single_early_exit.ll and Transforms/LoopVectorize/single_early_exit_live_outs.ll.
Those tests have very extensive coverage of different CFGs and combinations of live-outs in different scenarios.
Following #152422, this patch enables auto-vectorization of early-exit loops containing a single potentially faulting, unit-stride load by using the vp.load.ff intrinsic introduced in #128593.
Key changes:
Limitations:
stacks on top of #165218.