[VPlan] Enable vectorization of early-exit loops with unit-stride fault-only-first loads #151300
base: main
Conversation
Regarding EVL propagation, I just noticed that header mask handling is fixed in #150202. This fix will need to be incorporated accordingly.
✅ With the latest revision this PR passed the C/C++ code formatter.
Check this first?
Updated. Thanks!
I think it would be good for reviewers if this patch was split up into several parts, probably roughly in this order:
Thanks for outlining the steps. That's very helpful! I'll follow this order for splitting up the patch.
@@ -7749,6 +7758,12 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands,
    Builder.insert(VectorPtr);
    Ptr = VectorPtr;
  }
  if (Legal->getSpeculativeLoads().contains(I)) {
    auto *Load = dyn_cast<LoadInst>(I);
    return new VPWidenFFLoadRecipe(*Load, Ptr, Mask, VPIRMetadata(*Load, LVer),
Just making a note for later. I think it would be good to avoid having a dead VPWidenFFLoadRecipe when there's no non-EVL version of vp.load.ff.
Can we instead just have one VPWidenFFLoadRecipe and here pass VF as the EVL argument? optimizeMaskToEVL can then set the EVL from the header mask later.
Updated! After rebasing, this patch supports WidenFFLoad in non-tail-folded mode only. Tail-folding early-exit loops and FFLoad support in the EVL transform will have to be addressed in a separate patch.
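For reference, the updated construction (it also appears verbatim in the patch further down) passes the plan's VF as the recipe's second operand, so a later transform can tighten it to a capped AVL or an EVL:

// Copied from the patch below, shown here only to make the discussion concrete.
if (Legal->getPotentiallyFaultingLoads().contains(I))
  return new VPWidenFFLoadRecipe(*cast<LoadInst>(I), Ptr, &Plan.getVF(), Mask,
                                 VPIRMetadata(*I, LVer), I->getDebugLoc());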
This patch splits out the legality checks from PR #151300, following the landing of PR #128593. It is a step toward supporting vectorization of early-exit loops that contain potentially faulting loads. In this commit, an early-exit loop is considered legal for vectorization if it satisfies the following criteria:
1. It is a read-only loop.
2. All potentially faulting loads are unit-stride, which is the only kind currently supported by vp.load.ff.
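For illustration, a hedged example (assumed, not taken from the patch or its tests) of a loop shape that meets both criteria: the body only reads memory, and the potentially faulting access p[i] is unit-stride. A strided access such as p[2*i] would still be rejected, since vp.load.ff only handles unit-stride loads.

// Hypothetical C/C++ example of a loop the new legality check targets:
// read-only body, one uncountable early exit, unit-stride faulting load.
int first_match(const int *p, int n, int key) {
  for (int i = 0; i < n; i++)
    if (p[i] == key) // early exit; a vector load here may read lanes past
      return i;      // the exit point, hence "potentially faulting"
  return -1;
}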
1bb79f8 to f8d7616 Compare
f8d7616 to 1ab67ae Compare
@llvm/pr-subscribers-backend-risc-v @llvm/pr-subscribers-llvm-transforms
Author: Shih-Po Hung (arcbbb)
Changes
Following #152422, this patch enables auto-vectorization of early-exit loops containing a single potentially faulting, unit-stride load by using the vp.load.ff intrinsic introduced in #128593.
Key changes:
Limitations:
Patch is 34.01 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151300.diff 9 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index 1d3cffa2b61bf..e28d4c45d4ab8 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -393,6 +393,12 @@ static cl::opt<bool> EnableEarlyExitVectorization( cl::desc( "Enable vectorization of early exit loops with uncountable exits.")); +static cl::opt<bool> + EnableEarlyExitWithFFLoads("enable-early-exit-with-ffload", cl::init(false), + cl::Hidden, + cl::desc("Enable vectorization of early-exit " + "loops with fault-only-first loads.")); + static cl::opt<bool> ConsiderRegPressure( "vectorizer-consider-reg-pressure", cl::init(false), cl::Hidden, cl::desc("Discard VFs if their register pressure is too high.")); @@ -3507,6 +3513,15 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) { return FixedScalableVFPair::getNone(); } + if (!Legal->getPotentiallyFaultingLoads().empty() && UserIC > 1) { + reportVectorizationFailure("Auto-vectorization of loops with potentially " + "faulting loads is not supported when the " + "interleave count is more than 1", + "CantInterleaveLoopWithPotentiallyFaultingLoads", + ORE, TheLoop); + return FixedScalableVFPair::getNone(); + } + ScalarEvolution *SE = PSE.getSE(); ElementCount TC = getSmallConstantTripCount(SE, TheLoop); unsigned MaxTC = PSE.getSmallConstantMaxTripCount(); @@ -4076,6 +4091,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF, case VPDef::VPReductionPHISC: case VPDef::VPInterleaveEVLSC: case VPDef::VPInterleaveSC: + case VPDef::VPWidenFFLoadSC: case VPDef::VPWidenLoadEVLSC: case VPDef::VPWidenLoadSC: case VPDef::VPWidenStoreEVLSC: @@ -4550,6 +4566,10 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF, if (!Legal->isSafeForAnyVectorWidth()) return 1; + // No interleaving for potentially faulting loads. + if (!Legal->getPotentiallyFaultingLoads().empty()) + return 1; + // We don't attempt to perform interleaving for loops with uncountable early // exits because the VPInstruction::AnyOf code cannot currently handle // multiple parts. @@ -7216,6 +7236,9 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan( // Regions are dissolved after optimizing for VF and UF, which completely // removes unneeded loop regions first. VPlanTransforms::dissolveLoopRegions(BestVPlan); + + VPlanTransforms::convertFFLoadEarlyExitToVLStepping(BestVPlan); + // Canonicalize EVL loops after regions are dissolved. 
VPlanTransforms::canonicalizeEVLLoops(BestVPlan); VPlanTransforms::materializeBackedgeTakenCount(BestVPlan, VectorPH); @@ -7598,6 +7621,10 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands, Builder.insert(VectorPtr); Ptr = VectorPtr; } + if (Legal->getPotentiallyFaultingLoads().contains(I)) + return new VPWidenFFLoadRecipe(*cast<LoadInst>(I), Ptr, &Plan.getVF(), Mask, + VPIRMetadata(*I, LVer), I->getDebugLoc()); + if (LoadInst *Load = dyn_cast<LoadInst>(I)) return new VPWidenLoadRecipe(*Load, Ptr, Mask, Consecutive, Reverse, VPIRMetadata(*Load, LVer), I->getDebugLoc()); @@ -8632,6 +8659,10 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes( if (Recipe->getNumDefinedValues() == 1) { SingleDef->replaceAllUsesWith(Recipe->getVPSingleValue()); Old2New[SingleDef] = Recipe->getVPSingleValue(); + } else if (isa<VPWidenFFLoadRecipe>(Recipe)) { + VPValue *Data = Recipe->getVPValue(0); + SingleDef->replaceAllUsesWith(Data); + Old2New[SingleDef] = Data; } else { assert(Recipe->getNumDefinedValues() == 0 && "Unexpected multidef recipe"); @@ -8679,6 +8710,8 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes( // Adjust the recipes for any inloop reductions. adjustRecipesForReductions(Plan, RecipeBuilder, Range.Start); + VPlanTransforms::adjustFFLoadEarlyExitForPoisonSafety(*Plan); + // Apply mandatory transformation to handle FP maxnum/minnum reduction with // NaNs if possible, bail out otherwise. if (!VPlanTransforms::runPass(VPlanTransforms::handleMaxMinNumReductions, @@ -9869,7 +9902,14 @@ bool LoopVectorizePass::processLoop(Loop *L) { return false; } - if (!LVL.getPotentiallyFaultingLoads().empty()) { + if (EnableEarlyExitWithFFLoads) { + if (LVL.getPotentiallyFaultingLoads().size() > 1) { + reportVectorizationFailure("Auto-vectorization of loops with more than 1 " + "potentially faulting load is not enabled", + "MoreThanOnePotentiallyFaultingLoad", ORE, L); + return false; + } + } else if (!LVL.getPotentiallyFaultingLoads().empty()) { reportVectorizationFailure("Auto-vectorization of loops with potentially " "faulting load is not supported", "PotentiallyFaultingLoadsNotSupported", ORE, L); diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h index f79855f7e2c5f..6e28c95ca601a 100644 --- a/llvm/lib/Transforms/Vectorize/VPlan.h +++ b/llvm/lib/Transforms/Vectorize/VPlan.h @@ -563,6 +563,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue { case VPRecipeBase::VPInterleaveEVLSC: case VPRecipeBase::VPInterleaveSC: case VPRecipeBase::VPIRInstructionSC: + case VPRecipeBase::VPWidenFFLoadSC: case VPRecipeBase::VPWidenLoadEVLSC: case VPRecipeBase::VPWidenLoadSC: case VPRecipeBase::VPWidenStoreEVLSC: @@ -2811,6 +2812,13 @@ class LLVM_ABI_FOR_TEST VPReductionEVLRecipe : public VPReductionRecipe { ArrayRef<VPValue *>({R.getChainOp(), R.getVecOp(), &EVL}), CondOp, R.isOrdered(), DL) {} + VPReductionEVLRecipe(RecurKind RdxKind, FastMathFlags FMFs, VPValue *ChainOp, + VPValue *VecOp, VPValue &EVL, VPValue *CondOp, + bool IsOrdered, DebugLoc DL = DebugLoc::getUnknown()) + : VPReductionRecipe(VPDef::VPReductionEVLSC, RdxKind, FMFs, nullptr, + ArrayRef<VPValue *>({ChainOp, VecOp, &EVL}), CondOp, + IsOrdered, DL) {} + ~VPReductionEVLRecipe() override = default; VPReductionEVLRecipe *clone() override { @@ -3159,6 +3167,7 @@ class LLVM_ABI_FOR_TEST VPWidenMemoryRecipe : public VPRecipeBase, static inline bool classof(const VPRecipeBase *R) { return R->getVPDefID() == VPRecipeBase::VPWidenLoadSC || 
R->getVPDefID() == VPRecipeBase::VPWidenStoreSC || + R->getVPDefID() == VPRecipeBase::VPWidenFFLoadSC || R->getVPDefID() == VPRecipeBase::VPWidenLoadEVLSC || R->getVPDefID() == VPRecipeBase::VPWidenStoreEVLSC; } @@ -3240,6 +3249,42 @@ struct LLVM_ABI_FOR_TEST VPWidenLoadRecipe final : public VPWidenMemoryRecipe, } }; +/// A recipe for widening loads using fault-only-first intrinsics. +/// Produces two results: (1) the loaded data, and (2) the index of the first +/// non-dereferenceable lane, or VF if all lanes are successfully read. +struct VPWidenFFLoadRecipe final : public VPWidenMemoryRecipe, public VPValue { + VPWidenFFLoadRecipe(LoadInst &Load, VPValue *Addr, VPValue *VF, VPValue *Mask, + const VPIRMetadata &Metadata, DebugLoc DL) + : VPWidenMemoryRecipe(VPDef::VPWidenFFLoadSC, Load, {Addr, VF}, + /*Consecutive*/ true, /*Reverse*/ false, Metadata, + DL), + VPValue(this, &Load) { + new VPValue(nullptr, this); // Index of the first lane that faults. + setMask(Mask); + } + + VP_CLASSOF_IMPL(VPDef::VPWidenFFLoadSC); + + /// Return the VF operand. + VPValue *getVF() const { return getOperand(1); } + void setVF(VPValue *V) { setOperand(1, V); } + + void execute(VPTransformState &State) override; + +#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent, + VPSlotTracker &SlotTracker) const override; +#endif + + /// Returns true if the recipe only uses the first lane of operand \p Op. + bool onlyFirstLaneUsed(const VPValue *Op) const override { + assert(is_contained(operands(), Op) && + "Op must be an operand of the recipe"); + return Op == getVF() || Op == getAddr(); + } +}; + /// A recipe for widening load operations with vector-predication intrinsics, /// using the address to load from, the explicit vector length and an optional /// mask. 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp index 46ab7712e2671..684dbd25597e3 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp @@ -188,8 +188,9 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenCallRecipe *R) { } Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenMemoryRecipe *R) { - assert((isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe>(R)) && - "Store recipes should not define any values"); + assert( + (isa<VPWidenLoadRecipe, VPWidenFFLoadRecipe, VPWidenLoadEVLRecipe>(R)) && + "Store recipes should not define any values"); return cast<LoadInst>(&R->getIngredient())->getType(); } diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp index 8e9c3db50319f..3da8613a1d3cc 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp @@ -73,6 +73,7 @@ bool VPRecipeBase::mayWriteToMemory() const { case VPReductionPHISC: case VPScalarIVStepsSC: case VPPredInstPHISC: + case VPWidenFFLoadSC: return false; case VPBlendSC: case VPReductionEVLSC: @@ -107,6 +108,7 @@ bool VPRecipeBase::mayReadFromMemory() const { return cast<VPInstruction>(this)->opcodeMayReadOrWriteFromMemory(); case VPWidenLoadEVLSC: case VPWidenLoadSC: + case VPWidenFFLoadSC: return true; case VPReplicateSC: return cast<Instruction>(getVPSingleValue()->getUnderlyingValue()) @@ -3409,6 +3411,47 @@ void VPWidenLoadRecipe::print(raw_ostream &O, const Twine &Indent, } #endif +void VPWidenFFLoadRecipe::execute(VPTransformState &State) { + Type *ScalarDataTy = getLoadStoreType(&Ingredient); + auto *DataTy = VectorType::get(ScalarDataTy, State.VF); + const Align Alignment = getLoadStoreAlignment(&Ingredient); + + auto &Builder = State.Builder; + State.setDebugLocFrom(getDebugLoc()); + + Value *VL = State.get(getVF(), VPLane(0)); + Type *I32Ty = Builder.getInt32Ty(); + VL = Builder.CreateZExtOrTrunc(VL, I32Ty); + Value *Addr = State.get(getAddr(), true); + Value *Mask = nullptr; + if (VPValue *VPMask = getMask()) + Mask = State.get(VPMask); + else + Mask = Builder.CreateVectorSplat(State.VF, Builder.getTrue()); + CallInst *NewLI = + Builder.CreateIntrinsic(Intrinsic::vp_load_ff, {DataTy, Addr->getType()}, + {Addr, Mask, VL}, nullptr, "vp.op.load.ff"); + NewLI->addParamAttr( + 0, Attribute::getWithAlignment(NewLI->getContext(), Alignment)); + applyMetadata(*NewLI); + Value *V = cast<Instruction>(Builder.CreateExtractValue(NewLI, 0)); + Value *NewVL = Builder.CreateExtractValue(NewLI, 1); + State.set(getVPValue(0), V); + State.set(getVPValue(1), NewVL, /*NeedsScalar=*/true); +} + +#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) +void VPWidenFFLoadRecipe::print(raw_ostream &O, const Twine &Indent, + VPSlotTracker &SlotTracker) const { + O << Indent << "WIDEN "; + printAsOperand(O, SlotTracker); + O << ", "; + getVPValue(1)->printAsOperand(O, SlotTracker); + O << " = vp.load.ff "; + printOperands(O, SlotTracker); +} +#endif + /// Use all-true mask for reverse rather than actual mask, as it avoids a /// dependence w/o affecting the result. 
static Instruction *createReverseEVL(IRBuilderBase &Builder, Value *Operand, diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp index 1f6b85270607e..7e78cb6ed02ac 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp @@ -2760,6 +2760,102 @@ void VPlanTransforms::addExplicitVectorLength( Plan.setUF(1); } +void VPlanTransforms::adjustFFLoadEarlyExitForPoisonSafety(VPlan &Plan) { + VPBasicBlock *Header = Plan.getVectorLoopRegion()->getEntryBasicBlock(); + VPWidenFFLoadRecipe *LastFFLoad = nullptr; + for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>( + vp_depth_first_deep(Plan.getVectorLoopRegion()))) + for (VPRecipeBase &R : *VPBB) + if (auto *Load = dyn_cast<VPWidenFFLoadRecipe>(&R)) { + assert(!LastFFLoad && "Only one FFLoad is supported"); + LastFFLoad = Load; + } + + // Skip if no FFLoad. + if (!LastFFLoad) + return; + + // Ensure FFLoad does not read past the remainder in the last iteration. + // Set AVL to min(VF, remainder). + VPBuilder Builder(Header, Header->getFirstNonPhi()); + VPValue *Remainder = Builder.createNaryOp( + Instruction::Sub, {&Plan.getVectorTripCount(), Plan.getCanonicalIV()}); + VPValue *Cmp = + Builder.createICmp(CmpInst::ICMP_ULE, &Plan.getVF(), Remainder); + VPValue *AVL = Builder.createSelect(Cmp, &Plan.getVF(), Remainder); + LastFFLoad->setVF(AVL); + + // To prevent branch-on-poison, rewrite the early-exit condition to + // VPReductionEVLRecipe. Expected pattern here is: + // EMIT vp<%alt.exit.cond> = AnyOf + // EMIT vp<%exit.cond> = or vp<%alt.exit.cond>, vp<%main.exit.cond> + // EMIT branch-on-cond vp<%exit.cond> + auto *ExitingLatch = cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getExiting()); + auto *LatchExitingBr = cast<VPInstruction>(ExitingLatch->getTerminator()); + + VPValue *VPAnyOf = nullptr; + VPValue *VecOp = nullptr; + assert( + match(LatchExitingBr, + m_BranchOnCond(m_BinaryOr(m_VPValue(VPAnyOf), m_VPValue()))) && + match(VPAnyOf, m_VPInstruction<VPInstruction::AnyOf>(m_VPValue(VecOp))) && + "unexpected exiting sequence in early exit loop"); + + VPValue *OpVPEVLI32 = LastFFLoad->getVPValue(1); + VPValue *Mask = LastFFLoad->getMask(); + FastMathFlags FMF; + auto *I1Ty = Type::getInt1Ty(Plan.getContext()); + VPValue *VPZero = Plan.getOrAddLiveIn(ConstantInt::get(I1Ty, 0)); + DebugLoc DL = VPAnyOf->getDefiningRecipe()->getDebugLoc(); + auto *NewAnyOf = + new VPReductionEVLRecipe(RecurKind::Or, FMF, VPZero, VecOp, *OpVPEVLI32, + Mask, /*IsOrdered*/ false, DL); + NewAnyOf->insertBefore(VPAnyOf->getDefiningRecipe()); + VPAnyOf->replaceAllUsesWith(NewAnyOf); + + // Using FirstActiveLane in the early-exit block is safe, + // exiting conditions guarantees at least one valid lane precedes + // any poisoned lanes. +} + +void VPlanTransforms::convertFFLoadEarlyExitToVLStepping(VPlan &Plan) { + // Find loop header by locating VPWidenFFLoadRecipe. + VPWidenFFLoadRecipe *LastFFLoad = nullptr; + + for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>( + vp_depth_first_shallow(Plan.getEntry()))) + for (VPRecipeBase &R : *VPBB) + if (auto *Load = dyn_cast<VPWidenFFLoadRecipe>(&R)) { + assert(!LastFFLoad && "Only one FFLoad is supported"); + LastFFLoad = Load; + } + + // Skip if no FFLoad. + if (!LastFFLoad) + return; + + VPBasicBlock *HeaderVPBB = LastFFLoad->getParent(); + // Replace IVStep (VFxUF) with returned VL from FFLoad. 
+ auto *CanonicalIV = cast<VPPhi>(&*HeaderVPBB->begin()); + VPValue *Backedge = CanonicalIV->getIncomingValue(1); + assert(match(Backedge, m_c_Add(m_Specific(CanonicalIV), + m_Specific(&Plan.getVFxUF()))) && + "Unexpected canonical iv"); + VPRecipeBase *CanonicalIVIncrement = Backedge->getDefiningRecipe(); + VPValue *OpVPEVLI32 = LastFFLoad->getVPValue(1); + VPBuilder Builder(HeaderVPBB, HeaderVPBB->getFirstNonPhi()); + Builder.setInsertPoint(CanonicalIVIncrement); + auto *TC = Plan.getTripCount(); + Type *CanIVTy = TC->isLiveIn() + ? TC->getLiveInIRValue()->getType() + : cast<VPExpandSCEVRecipe>(TC)->getSCEV()->getType(); + auto *I32Ty = Type::getInt32Ty(Plan.getContext()); + VPValue *OpVPEVL = Builder.createScalarZExtOrTrunc( + OpVPEVLI32, CanIVTy, I32Ty, CanonicalIVIncrement->getDebugLoc()); + + CanonicalIVIncrement->setOperand(1, OpVPEVL); +} + void VPlanTransforms::canonicalizeEVLLoops(VPlan &Plan) { // Find EVL loop entries by locating VPEVLBasedIVPHIRecipe. // There should be only one EVL PHI in the entire plan. diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h index 69452a7e37572..bc5ce3bc43e76 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h +++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h @@ -269,6 +269,17 @@ struct VPlanTransforms { /// (branch-on-cond eq AVLNext, 0) static void canonicalizeEVLLoops(VPlan &Plan); + /// Applies to early-exit loops that use FFLoad. FFLoad may yield fewer active + /// lanes than VF. To prevent branch-on-poison and over-reads past the vector + /// trip count, use the returned VL for both stepping and exit computation. + /// Implemented by: + /// - adjustFFLoadEarlyExitForPoisonSafety: replace AnyOf with vp.reduce.or over + /// the first VL lanes; set AVL = min(VF, remainder). + /// - convertFFLoadEarlyExitToVLStepping: after region dissolution, convert + /// early-exit loops to variable-length stepping. + static void adjustFFLoadEarlyExitForPoisonSafety(VPlan &Plan); + static void convertFFLoadEarlyExitToVLStepping(VPlan &Plan); + /// Lower abstract recipes to concrete ones, that can be codegen'd. static void convertToConcreteRecipes(VPlan &Plan); diff --git a/llvm/lib/Transforms/Vectorize/VPlanValue.h b/llvm/lib/Transforms/Vectorize/VPlanValue.h index 0678bc90ef4b5..b2bc430a09686 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanValue.h +++ b/llvm/lib/Transforms/Vectorize/VPlanValue.h @@ -40,6 +40,7 @@ class VPUser; class VPRecipeBase; class VPInterleaveBase; class VPPhiAccessors; +class VPWidenFFLoadRecipe; // This is the base class of the VPlan Def/Use graph, used for modeling the data // flow into, within and out of the VPlan. VPValues can stand for live-ins @@ -51,6 +52,7 @@ class LLVM_ABI_FOR_TEST VPValue { friend class VPInterleaveBase; friend class VPlan; friend class VPExpressionRecipe; + friend class VPWidenFFLoadRecipe; const unsigned char SubclassID; ///< Subclass identifier (for isa/dyn_cast). 
@@ -351,6 +353,7 @@ class VPDef { VPWidenCastSC, VPWidenGEPSC, VPWidenIntrinsicSC, + VPWidenFFLoadSC, VPWidenLoadEVLSC, VPWidenLoadSC, VPWidenStoreEVLSC, diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp index 92caa0b4e51d5..70e6e0d006eb6 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp @@ -166,8 +166,8 @@ bool VPlanVerifier::verifyEVLRecipe(const VPInstruction &EVL) const { } return VerifyEVLUse(*R, 2); }) - .Case<VPWidenLoadEVLRecipe, VPVectorEndPointerRecipe, - VPInterleaveEVLRecipe>( + .Case<VPWidenLoadEVLRecipe, VPWidenFFLoadRecipe, + VPVectorEndPointerRecipe, VPInterleaveEVLRecipe>( [&](const VPRecipeBase *R) { return VerifyEVLUse(*R, 1); }) .Case<VPInstructionWithType>( [&](const VPInstructionWithType *S) { return VerifyEVLUse(*S, 0); }) diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/find.ll b/llvm/test/Transforms/LoopVectorize/RISCV/find.ll new file mode 100644 index 0000000000000..f734bd5f53c82 --- /dev/null +++ b/llvm/test/Transforms/LoopVectorize/RISCV/find.ll @@ -0... [truncated] |
Rebased after #152422 landed, and updated the PR description as well as the title.
b6add27 to 755ad37 Compare
Split out from #151300 to isolate TargetTransformInfo cost modelling for fault-only-first loads from VPlan implementation details. This change adds costing support for vp.load.ff independently of the VPlan work. For now, model a vp.load.ff as cost-equivalent to a vp.load.
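A minimal sketch of that cost-model idea follows; it is not the code from the split-out patch, and the helper name, parameters, and hook placement are assumptions for illustration only. The point is simply to charge a vp.load.ff like a plain vector load of the data field of its {data, vl} return struct.

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Intrinsics.h"
#include <cassert>

using namespace llvm;

// Hypothetical helper: model vp.load.ff as cost-equivalent to a vp.load by
// costing only the vector data element of its returned struct.
static InstructionCost
costVPLoadFFLikeVPLoad(const IntrinsicCostAttributes &ICA, Align Alignment,
                       unsigned AddrSpace,
                       TargetTransformInfo::TargetCostKind CostKind,
                       const TargetTransformInfo &TTI) {
  assert(ICA.getID() == Intrinsic::vp_load_ff && "expected vp.load.ff");
  auto *RetTy = cast<StructType>(ICA.getReturnType());
  Type *DataTy = RetTy->getElementType(0); // the loaded vector
  return TTI.getMemoryOpCost(Instruction::Load, DataTy, Alignment, AddrSpace,
                             CostKind);
}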
e4051e2 to 02f8262 Compare
Gentle nudge. Any thoughts?
Hi, I wanted to try this patch but the new test is failing for a Release build.
I'm building with:
It works with
Thanks for the report! I missed this when testing with a Release build with LLVM_ENABLE_ASSERTIONS=ON.
Latest update addressed review comments:
4e8d2e9 to deb51fd Compare
…C). llvm#165218
Splitting from llvm#151300: vp_load_ff returns a struct type that cannot be widened by toVectorizedTy. This patch adds isVectorIntrinsicWithStructReturnScalarAtField and widens each struct element type independently.
f2b9253 to 6a3dd4d Compare
Rebased to include #169890.
lukel97 left a comment
I just took a brief look at this; I'll hopefully have more time to give it a closer review next week.
From what I understand, we currently only support early-exit loops without tail folding. Do we have a plan for how to support tail folding eventually, and how that would interact with vp.load.ff?
| "Enable vectorization of early exit loops with uncountable exits.")); | ||
| | ||
| static cl::opt<bool> | ||
| EnableEarlyExitWithFFLoads("enable-early-exit-with-ffload", cl::init(false), |
In the spirit of incremental development, can we remove this option and just have it on by default?
Does it work for other targets besides RISCV? If so, the PR should have some tests for other backends too, or even better in the top level Transforms/LoopVectorize directory.
    return;

  // Ensure FFLoad does not read past the remainder in the last iteration.
  // Set AVL to min(VF, remainder).
I'm confused as to why we need to cap the AVL for vp.load.ff. The first lane should always be dereferenceable, and then it shouldn't be an issue if it tries to read the remainder lanes because it doesn't trap?
Is this more of an optimization to reduce VL so it's not overly large?
This is primarily a correctness issue. Since the header mask is derived from the second returned value of vp.load.ff, we can use it directly only if the AVL is capped before the call.
For example:
%9 = call { <vscale x 16 x i8>, i32 } @llvm.vp.load.ff.nxv16i8.p0(ptr %8, <vscale x 16 x i1> splat (i1 true), i32 %7)
%10 = extractvalue { <vscale x 16 x i8>, i32 } %9, 1
%11 = zext i32 %10 to i64
%14 = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %11)
get.active.lane.mask generates the predicate mask and uses %11 directly because AVL was capped earlier. If not, we would cap it before calling get.active.lane.mask.
lukel97 left a comment
I don't think this is going to play well with EVL tail folding because we'll now have two different transforms trying to convert the plan to variable stepping, convertFFLoadEarlyExitToVLStepping and transformRecipestoEVLRecipes.
At a high level I wonder if we even want to support vp.load.ff without EVL tail folding to begin with. This PR from what I understand is kind of reimplementing a weaker version of EVL tail folding, since the variable stepping is a hard requirement of the vp.load.ff intrinsic that we can't avoid. It can reduce the number of lanes read for any reason.
Trying this PR out on some llvm-test-suite benchmarks shows that the generated code always seems to generate LMUL 8 step vectors, which is probably not great for performance:
+.LBB1197_155: # Parent Loop BB1197_123 Depth=1
+ # => This Inner Loop Header: Depth=2
+ sub a2, a0, a1
+ addi a3, sp, 1280
+ add a3, a3, a1
+ minu a2, s11, a2
+ vsetvli zero, a2, e8, m2, ta, ma
+ vle8ff.v v16, (a3)
+ csrr a2, vl
+ csrr a3, vlenb
+ vsetvli a4, zero, e64, m8, ta, ma
+ vid.v v8
+ vmv.v.v v24, v8
+ vadd.vx v8, v8, a3
+ zext.w a4, a2
+ vmsltu.vx v18, v8, a4
+ vmsltu.vx v8, v24, a4
+ srli a4, a3, 2
+ vsetvli a5, zero, e8, m2, ta, ma
+ vmseq.vi v9, v16, 0
+ srli a3, a3, 3
+ vsetvli zero, a4, e8, mf4, ta, ma
+ vslideup.vx v8, v18, a3
+ vsetvli a3, zero, e8, m2, ta, ma
+ vmand.mm v8, v8, v9
+ vcpop.m a3, v8
+ bnez a3, .LBB1197_157
+# %bb.156: # in Loop: Header=BB1197_155 Depth=2
+ add.uw a1, a2, a1
+ bne a1, a0, .LBB1197_155
+.LBB1197_157: # in Loop: Header=BB1197_123 Depth=1
+ snez a1, a3
+ beqz a1, .LBB1197_159
+.LBB1197_158: # in Loop: Header=BB1197_123 Depth=1

So I don't think there's really much reason why we would want to emit non-tail folded early-exit loops if we can tail fold them eventually.
I understand that this is supposed to be an incremental PR, but I think maybe a better ordering might be to start by supporting early exit loops with tail folding. I think this means we need to address the "variable header mask" TODO here:
bool LoopVectorizationLegality::canFoldTailByMasking() const {
  // The only loops we can vectorize without a scalar epilogue, are loops with
  // a bottom-test and a single exiting block. We'd have to handle the fact
  // that not every instruction executes on the last iteration. This will
  // require a lane mask which varies through the vector loop body. (TODO)
  if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
    LLVM_DEBUG(
        dbgs() << "LV: Cannot fold tail by masking. Requires a singe latch exit\n");
    return false;
  }

I think we can do this if we replace the notion of a header mask with the notion of a per-block header mask. I'll see if I can create an issue to discuss some of the design of this more.
  // EMIT vp<%alt.exit.cond> = any-of vp<%and>
  // EMIT vp<%exit.cond> = or vp<%alt.exit.cond>, vp<%main.exit.cond>
  // EMIT branch-on-cond vp<%exit.cond>
  auto *ExitingLatch =
I think this works on the assumption that the poison lanes from the vp.load.ff never get shuffled around, e.g. they aren't reversed. But I think this always holds, just making a note.
If performance is a concern, I'm considering replacing the active-lane-mask + reduce_or sequence with a vfirst intrinsic + icmp in CodeGenPrepare, i.e. rewriting the former sequence into the latter. Would this be preferable?
Thanks for flagging early-exit loops with tail folding. That is also a nice-to-have and I am keen to see it.
If we have an early exit loop with non-dereferenceable loads after the exit, we currently bail:

int z;
for (int i = 0; i < N; i++) {
  if (x[i]) break;
  z = y[i];
}

If the early exit block dominates the block containing these loads, we can predicate them with a mask, like:

for (int i = 0; i < N/VF; i++) {
  c[0..VF] = x[i..i+VF]
  z[0..VF] = y[i..i+VF], mask=c
  if (anyof(c)) break;
}

In VPlan terms, this is `icmp ult step-vector, (first-active-lane exit-cond)`.

VPlanPredicator can handle predicating these blocks, but in tryToBuildVPlanWithVPRecipes we first disconnect all early exits before the masks are introduced:

// entry -> exiting -> ... -> latch
//             |
//             +-----> earlyexit
VPlanTransforms::handleEarlyExits(*Plan);

// entry -> exiting -> ... -> latch
VPlanTransforms::introduceMasksAndLinearize(*Plan);

This is needed to keep the region single entry/single exit, but it also means that there isn't any control flow by the time we want to add the masks:

exiting:
  %earlyexitcond = ...
  // one successor (latch)
latch:
  %exitcond = or (anyof %earlyexitcond), %origexitcond
  br %exitcond, entry, exit

This patch propagates the information to VPlanPredicator that the successors should be predicated, even though there isn't actually control flow, with a new EarlyExit VPInstruction:

exiting:
  %earlyexitcond = ...
  earlyexit (icmp ult step-vector, (first-active-lane %earlyexitcond))
  // one successor (latch)
latch:
  %exitcond = or (anyof %earlyexitcond), origexitcond
  br %exitcond, entry, exit

It's just a placeholder and gets immediately removed whenever VPlanPredicator sees it, but it allows it to use the correct mask. This makes way for supporting more types of loops, as we could also support stores/divs etc. as long as the exiting block dominates them. See the note in canUncountableExitConditionLoadBeMoved as to why we can't predicate stores when they're before the exiting block.

However, the main motivation for this is to allow us to support tail folding with early exits, which I believe will be needed to make supporting fault-only-first loads simpler: llvm#151300

In order to actually test the changes from this, this PR allows non-dereferenceable loads that are properly dominated by the exiting block in LoopVectorizationLegality. But in practice, something else usually transforms these loops to be multiple-entry, which prevents them from being vectorized.
I think we need tests in the top level LoopVectorize directory too. For example, I think it's worth adding an extra RUN line with the -enable-early-exit-with-ffload flag to Transforms/LoopVectorize/single_early_exit.ll and Transforms/LoopVectorize/single_early_exit_live_outs.ll.
Those tests have very extensive coverage of different CFGs and combinations of live-outs in different scenarios.
Following #152422, this patch enables auto-vectorization of early-exit loops containing a single potentially faulting, unit-stride load by using the vp.load.ff intrinsic introduced in #128593.
Key changes:
Limitations:
stacks on top of #165218.