[LV] Convert gather loads with invariant stride into strided loads #147297
Conversation
This is a port of the approach used in RISCVGatherScatterLowering, implemented entirely in VPlan. If you'd prefer to switch to LLVM IR-based analysis using SCEV, please let me know. That said, the current patch faces an awkward situation: none of the existing strided access patterns in our lit tests are converted to strided accesses, because the cost returned by TTI.getStridedMemoryOpCost is higher than that of a gather. As a result, I'm marking this patch as a draft for now, until we have test cases that can demonstrate the intended functionality.
✅ With the latest revision this PR passed the C/C++ code formatter.
+1 on this approach of doing it as a VPlan transformation. That way we don't need to work with the legacy cost model. I presume we will need to handle VPWidenStridedLoadRecipes in planContainsAdditionalSimplifications and skip the assertion?
Yeah, I think the current RISCVTTIImpl::getStridedMemoryOpCost is too expensive. It should definitely be at least cheaper than RISCVTTIImpl::getGatherScatterOpCost, and at the moment the cost model seems to return the same for both. I know the spacemit-x60 doesn't have fast vlse, but I don't think they're as slow as vluxei. Do we have benchmark results for these? cc @mikhailramalho
I have some data on C908 for strided load (which is weird): camel-cdr/rvv-bench#12. The result may hold for spacemit-x60. |
Another approach is to use SCEV to get the base and stride; both could then be handled within VPlanTransforms without relying on the legacy cost model (see the sketch below).
I've previously forced the cost model to always return profitable in order to trigger the transformation, and didn't encounter any assertions. I plan to use the same method to test against the llvm-test-suite and see if any issues come up.
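For context, this is roughly what an LLVM IR-level SCEV analysis would look like. It is a minimal sketch under the assumption that ScalarEvolution and the target loop are available; it is not code from this patch, which matches the pattern on VPlan recipes instead.

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/IR/Value.h"
#include <optional>
#include <utility>

using namespace llvm;

// Recover {base, stride} for a load's pointer operand via SCEV.
static std::optional<std::pair<const SCEV *, const SCEV *>>
getBaseAndStride(ScalarEvolution &SE, Value *Ptr, const Loop *L) {
  const SCEV *PtrSCEV = SE.getSCEV(Ptr);
  // A strided access shows up as an affine add-recurrence {Base,+,Step}<L>.
  const auto *AR = dyn_cast<SCEVAddRecExpr>(PtrSCEV);
  if (!AR || AR->getLoop() != L || !AR->isAffine())
    return std::nullopt;
  const SCEV *Base = AR->getStart();
  const SCEV *Step = AR->getStepRecurrence(SE);
  // Only a loop-invariant (not necessarily constant) stride is useful here.
  if (!SE.isLoopInvariant(Step, L))
    return std::nullopt;
  return std::make_pair(Base, Step);
}
```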
@lukel97 @wangpc-pp
@lukel97 I think you're right: planContainsAdditionalSimplifications will need to be changed. It's just that right now the costs for gather and strided load are exactly the same, so it works fine no matter how I run it. Once the costs are corrected, this issue will become apparent.
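For reference, the shape of the cost comparison under discussion, as a minimal sketch. It mirrors the TTI calls this patch makes in VPWidenMemoryRecipe::computeCost, but the helper name and the simplified pointer-type handling are assumptions of this sketch (the real check vectorizes the types by VF first); it is not code from the patch.

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/InstructionCost.h"

using namespace llvm;

// Would a strided load be cheaper than the gather we would otherwise emit?
static bool preferStridedLoad(const TargetTransformInfo &TTI, Type *DataTy,
                              Value *Ptr, bool IsMasked, Align Alignment,
                              TargetTransformInfo::TargetCostKind CostKind,
                              const Instruction *I) {
  InstructionCost GatherCost =
      TTI.getAddressComputationCost(Ptr->getType(), /*SE=*/nullptr,
                                    /*Ptr=*/nullptr, CostKind) +
      TTI.getGatherScatterOpCost(Instruction::Load, DataTy, Ptr, IsMasked,
                                 Alignment, CostKind, I);
  InstructionCost StridedCost = TTI.getStridedMemoryOpCost(
      Instruction::Load, DataTy, Ptr, IsMasked, Alignment, CostKind, I);
  // Per the discussion above, on RISC-V both hooks currently return the same
  // value, so this comparison never favors the strided load in the lit tests.
  return StridedCost < GatherCost;
}
```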
@llvm/pr-subscribers-llvm-transforms

Author: Mel Chen (Mel-Chen)

Changes

This patch detects strided memory accesses, i.e. loads whose addresses advance by a loop-invariant stride on each iteration (a minimal example is sketched below), and converts widened non-consecutive loads (i.e. gathers) into strided loads when profitable and legal. This enables more efficient code generation for targets like RISC-V that support strided loads natively.

Patch is 73.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147297.diff

14 Files Affected:
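A loop of the following shape is the kind of access being targeted. This is an illustrative sketch, not the concrete example from the original description:

```cpp
// B[i * S] advances by a loop-invariant stride of S elements per iteration.
// Without this patch the vectorizer emits a gather for it; with it, the
// access can become a strided load (e.g. vlse on RISC-V).
void strided_copy(int *A, const int *B, int N, int S) {
  for (int i = 0; i < N; ++i)
    A[i] = B[i * S];
}
```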
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index c4110582da1ef..528e6abc78aee 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -4073,7 +4073,7 @@ void LoopVectorizationPlanner::emitInvalidCostRemarks( [](const auto *R) { return Instruction::Select; }) .Case<VPWidenStoreRecipe>( [](const auto *R) { return Instruction::Store; }) - .Case<VPWidenLoadRecipe>( + .Case<VPWidenLoadRecipe, VPWidenStridedLoadRecipe>( [](const auto *R) { return Instruction::Load; }) .Case<VPWidenCallRecipe, VPWidenIntrinsicRecipe>( [](const auto *R) { return Instruction::Call; }) @@ -4172,6 +4172,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF, case VPDef::VPWidenPointerInductionSC: case VPDef::VPReductionPHISC: case VPDef::VPInterleaveSC: + case VPDef::VPWidenStridedLoadSC: case VPDef::VPWidenLoadEVLSC: case VPDef::VPWidenLoadSC: case VPDef::VPWidenStoreEVLSC: @@ -7732,7 +7733,10 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands, new VPVectorEndPointerRecipe(Ptr, &Plan.getVF(), getLoadStoreType(I), /*Stride*/ -1, Flags, I->getDebugLoc()); } else { - VectorPtr = new VPVectorPointerRecipe(Ptr, getLoadStoreType(I), + const DataLayout &DL = I->getDataLayout(); + auto *StrideTy = DL.getIndexType(Ptr->getUnderlyingValue()->getType()); + VPValue *StrideOne = Plan.getOrAddLiveIn(ConstantInt::get(StrideTy, 1)); + VectorPtr = new VPVectorPointerRecipe(Ptr, getLoadStoreType(I), StrideOne, GEP ? GEP->getNoWrapFlags() : GEPNoWrapFlags::none(), I->getDebugLoc()); @@ -8832,19 +8836,14 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes( *Plan)) return nullptr; + VPCostContext CostCtx(CM.TTI, *CM.TLI, *Plan, CM, CM.CostKind); // Transform recipes to abstract recipes if it is legal and beneficial and // clamp the range for better cost estimation. // TODO: Enable following transform when the EVL-version of extended-reduction // and mulacc-reduction are implemented. - if (!CM.foldTailWithEVL()) { - VPCostContext CostCtx(CM.TTI, *CM.TLI, *Plan, CM, CM.CostKind); + if (!CM.foldTailWithEVL()) VPlanTransforms::runPass(VPlanTransforms::convertToAbstractRecipes, *Plan, CostCtx, Range); - } - - for (ElementCount VF : Range) - Plan->addVF(VF); - Plan->setName("Initial VPlan"); // Interleave memory: for each Interleave Group we marked earlier as relevant // for this VPlan, replace the Recipes widening its memory instructions with a @@ -8853,6 +8852,15 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes( InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed()); + // Convert memory recipes to strided access recipes if the strided access is + // legal and profitable. + VPlanTransforms::runPass(VPlanTransforms::convertToStridedAccesses, *Plan, + CostCtx, Range); + + for (ElementCount VF : Range) + Plan->addVF(VF); + Plan->setName("Initial VPlan"); + // Replace VPValues for known constant strides guaranteed by predicate scalar // evolution. 
auto CanUseVersionedStride = [&Plan](VPUser &U, unsigned) { diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h index 6814dc5de6716..3c3af87cc0f6e 100644 --- a/llvm/lib/Transforms/Vectorize/VPlan.h +++ b/llvm/lib/Transforms/Vectorize/VPlan.h @@ -559,6 +559,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue { case VPRecipeBase::VPBranchOnMaskSC: case VPRecipeBase::VPInterleaveSC: case VPRecipeBase::VPIRInstructionSC: + case VPRecipeBase::VPWidenStridedLoadSC: case VPRecipeBase::VPWidenLoadEVLSC: case VPRecipeBase::VPWidenLoadSC: case VPRecipeBase::VPWidenStoreEVLSC: @@ -1746,10 +1747,6 @@ struct LLVM_ABI_FOR_TEST VPWidenSelectRecipe : public VPRecipeWithIRFlags, /// A recipe for handling GEP instructions. class LLVM_ABI_FOR_TEST VPWidenGEPRecipe : public VPRecipeWithIRFlags { - bool isPointerLoopInvariant() const { - return getOperand(0)->isDefinedOutsideLoopRegions(); - } - bool isIndexLoopInvariant(unsigned I) const { return getOperand(I + 1)->isDefinedOutsideLoopRegions(); } @@ -1778,6 +1775,30 @@ class LLVM_ABI_FOR_TEST VPWidenGEPRecipe : public VPRecipeWithIRFlags { VP_CLASSOF_IMPL(VPDef::VPWidenGEPSC) + bool isPointerLoopInvariant() const { + return getOperand(0)->isDefinedOutsideLoopRegions(); + } + + std::optional<unsigned> getUniqueVariantIndex() const { + std::optional<unsigned> VarIdx; + for (unsigned I = 0, E = getNumOperands() - 1; I < E; ++I) { + if (isIndexLoopInvariant(I)) + continue; + + if (VarIdx) + return std::nullopt; + VarIdx = I; + } + return VarIdx; + } + + Type *getIndexedType(unsigned I) const { + auto *GEP = cast<GetElementPtrInst>(getUnderlyingInstr()); + Type *SourceElementType = GEP->getSourceElementType(); + SmallVector<Value *, 4> Ops(GEP->idx_begin(), GEP->idx_begin() + I); + return GetElementPtrInst::getIndexedType(SourceElementType, Ops); + } + /// Generate the gep nodes. void execute(VPTransformState &State) override; @@ -1866,20 +1887,23 @@ class VPVectorEndPointerRecipe : public VPRecipeWithIRFlags, #endif }; -/// A recipe to compute the pointers for widened memory accesses of IndexTy. +/// A recipe to compute the pointers for widened memory accesses of IndexedTy, +/// with the Stride expressed in units of IndexedTy. 
class VPVectorPointerRecipe : public VPRecipeWithIRFlags, - public VPUnrollPartAccessor<1> { + public VPUnrollPartAccessor<2> { Type *IndexedTy; public: - VPVectorPointerRecipe(VPValue *Ptr, Type *IndexedTy, GEPNoWrapFlags GEPFlags, - DebugLoc DL) - : VPRecipeWithIRFlags(VPDef::VPVectorPointerSC, ArrayRef<VPValue *>(Ptr), - GEPFlags, DL), + VPVectorPointerRecipe(VPValue *Ptr, Type *IndexedTy, VPValue *Stride, + GEPNoWrapFlags GEPFlags, DebugLoc DL) + : VPRecipeWithIRFlags(VPDef::VPVectorPointerSC, + ArrayRef<VPValue *>({Ptr, Stride}), GEPFlags, DL), IndexedTy(IndexedTy) {} VP_CLASSOF_IMPL(VPDef::VPVectorPointerSC) + VPValue *getStride() const { return getOperand(1); } + void execute(VPTransformState &State) override; bool onlyFirstLaneUsed(const VPValue *Op) const override { @@ -1897,7 +1921,7 @@ class VPVectorPointerRecipe : public VPRecipeWithIRFlags, } VPVectorPointerRecipe *clone() override { - return new VPVectorPointerRecipe(getOperand(0), IndexedTy, + return new VPVectorPointerRecipe(getOperand(0), IndexedTy, getStride(), getGEPNoWrapFlags(), getDebugLoc()); } @@ -3040,7 +3064,8 @@ class LLVM_ABI_FOR_TEST VPWidenMemoryRecipe : public VPRecipeBase, return R->getVPDefID() == VPRecipeBase::VPWidenLoadSC || R->getVPDefID() == VPRecipeBase::VPWidenStoreSC || R->getVPDefID() == VPRecipeBase::VPWidenLoadEVLSC || - R->getVPDefID() == VPRecipeBase::VPWidenStoreEVLSC; + R->getVPDefID() == VPRecipeBase::VPWidenStoreEVLSC || + R->getVPDefID() == VPRecipeBase::VPWidenStridedLoadSC; } static inline bool classof(const VPUser *U) { @@ -3161,6 +3186,52 @@ struct VPWidenLoadEVLRecipe final : public VPWidenMemoryRecipe, public VPValue { } }; +/// A recipe for strided load operations, using the base address, stride, and an +/// optional mask. This recipe will generate an vp.strided.load intrinsic call +/// to represent memory accesses with a fixed stride. +struct VPWidenStridedLoadRecipe final : public VPWidenMemoryRecipe, + public VPValue { + VPWidenStridedLoadRecipe(LoadInst &Load, VPValue *Addr, VPValue *Stride, + VPValue *VF, VPValue *Mask, + const VPIRMetadata &Metadata, DebugLoc DL) + : VPWidenMemoryRecipe( + VPDef::VPWidenStridedLoadSC, Load, {Addr, Stride, VF}, + /*Consecutive=*/false, /*Reverse=*/false, Metadata, DL), + VPValue(this, &Load) { + setMask(Mask); + } + + VPWidenStridedLoadRecipe *clone() override { + return new VPWidenStridedLoadRecipe(cast<LoadInst>(Ingredient), getAddr(), + getStride(), getVF(), getMask(), *this, + getDebugLoc()); + } + + VP_CLASSOF_IMPL(VPDef::VPWidenStridedLoadSC); + + /// Return the stride operand. + VPValue *getStride() const { return getOperand(1); } + + /// Return the VF operand. + VPValue *getVF() const { return getOperand(2); } + + /// Generate a strided load. + void execute(VPTransformState &State) override; + +#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent, + VPSlotTracker &SlotTracker) const override; +#endif + + /// Returns true if the recipe only uses the first lane of operand \p Op. + bool onlyFirstLaneUsed(const VPValue *Op) const override { + assert(is_contained(operands(), Op) && + "Op must be an operand of the recipe"); + return Op == getAddr() || Op == getStride() || Op == getVF(); + } +}; + /// A recipe for widening store operations, using the stored value, the address /// to store to and an optional mask. 
struct LLVM_ABI_FOR_TEST VPWidenStoreRecipe final : public VPWidenMemoryRecipe { diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp index 6c9a78add6822..36760feb1cb14 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp @@ -187,8 +187,10 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenCallRecipe *R) { } Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenMemoryRecipe *R) { - assert((isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe>(R)) && - "Store recipes should not define any values"); + assert( + (isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe, VPWidenStridedLoadRecipe>( + R)) && + "Store recipes should not define any values"); return cast<LoadInst>(&R->getIngredient())->getType(); } diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp index a92540f457c73..90c05b7b8076c 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp @@ -81,6 +81,7 @@ bool VPRecipeBase::mayWriteToMemory() const { case VPWidenCastSC: case VPWidenGEPSC: case VPWidenIntOrFpInductionSC: + case VPWidenStridedLoadSC: case VPWidenLoadEVLSC: case VPWidenLoadSC: case VPWidenPHISC: @@ -104,6 +105,7 @@ bool VPRecipeBase::mayReadFromMemory() const { return cast<VPExpressionRecipe>(this)->mayReadOrWriteMemory(); case VPInstructionSC: return cast<VPInstruction>(this)->opcodeMayReadOrWriteFromMemory(); + case VPWidenStridedLoadSC: case VPWidenLoadEVLSC: case VPWidenLoadSC: return true; @@ -185,6 +187,7 @@ bool VPRecipeBase::mayHaveSideEffects() const { } case VPInterleaveSC: return mayWriteToMemory(); + case VPWidenStridedLoadSC: case VPWidenLoadEVLSC: case VPWidenLoadSC: case VPWidenStoreEVLSC: @@ -2512,13 +2515,22 @@ void VPVectorEndPointerRecipe::print(raw_ostream &O, const Twine &Indent, void VPVectorPointerRecipe::execute(VPTransformState &State) { auto &Builder = State.Builder; unsigned CurrentPart = getUnrollPart(*this); - Type *IndexTy = getGEPIndexTy(State.VF.isScalable(), /*IsReverse*/ false, - /*IsUnitStride*/ true, CurrentPart, Builder); + Value *Stride = State.get(getStride(), /*IsScalar*/ true); + + auto *StrideC = dyn_cast<ConstantInt>(Stride); + bool IsStrideOne = StrideC && StrideC->isOne(); + bool IsUnitStride = IsStrideOne || (StrideC && StrideC->isMinusOne()); + Type *IndexTy = + getGEPIndexTy(State.VF.isScalable(), + /*IsReverse*/ false, IsUnitStride, CurrentPart, Builder); Value *Ptr = State.get(getOperand(0), VPLane(0)); + Stride = Builder.CreateSExtOrTrunc(Stride, IndexTy); Value *Increment = createStepForVF(Builder, IndexTy, State.VF, CurrentPart); + Value *Index = IsStrideOne ? Increment : Builder.CreateMul(Increment, Stride); + Value *ResultPtr = - Builder.CreateGEP(IndexedTy, Ptr, Increment, "", getGEPNoWrapFlags()); + Builder.CreateGEP(IndexedTy, Ptr, Index, "", getGEPNoWrapFlags()); State.set(this, ResultPtr, /*IsScalar*/ true); } @@ -3170,9 +3182,11 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF, const Align Alignment = getLoadStoreAlignment(&Ingredient); unsigned AS = cast<PointerType>(Ctx.Types.inferScalarType(getAddr())) ->getAddressSpace(); - unsigned Opcode = isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe>(this) - ? Instruction::Load - : Instruction::Store; + unsigned Opcode = + isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe, VPWidenStridedLoadRecipe>( + this) + ? 
Instruction::Load + : Instruction::Store; if (!Consecutive) { // TODO: Using the original IR may not be accurate. @@ -3182,6 +3196,11 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF, Type *PtrTy = toVectorTy(Ptr->getType(), VF); assert(!Reverse && "Inconsecutive memory access should not have the order."); + + if (isa<VPWidenStridedLoadRecipe>(this)) + return Ctx.TTI.getStridedMemoryOpCost( + Opcode, Ty, Ptr, IsMasked, Alignment, Ctx.CostKind, &Ingredient); + return Ctx.TTI.getAddressComputationCost(PtrTy, nullptr, nullptr, Ctx.CostKind) + Ctx.TTI.getGatherScatterOpCost(Opcode, Ty, Ptr, IsMasked, Alignment, @@ -3331,6 +3350,47 @@ void VPWidenLoadEVLRecipe::print(raw_ostream &O, const Twine &Indent, } #endif +void VPWidenStridedLoadRecipe::execute(VPTransformState &State) { + Type *ScalarDataTy = getLoadStoreType(&Ingredient); + auto *DataTy = VectorType::get(ScalarDataTy, State.VF); + const Align Alignment = getLoadStoreAlignment(&Ingredient); + + auto &Builder = State.Builder; + Value *Addr = State.get(getAddr(), /*IsScalar*/ true); + Value *StrideInBytes = State.get(getStride(), /*IsScalar*/ true); + Value *Mask = nullptr; + if (VPValue *VPMask = getMask()) + Mask = State.get(VPMask); + else + Mask = Builder.CreateVectorSplat(State.VF, Builder.getTrue()); + Value *RunTimeVF = Builder.CreateZExtOrTrunc(State.get(getVF(), VPLane(0)), + Builder.getInt32Ty()); + + auto *PtrTy = Addr->getType(); + auto *StrideTy = StrideInBytes->getType(); + CallInst *NewLI = Builder.CreateIntrinsic( + Intrinsic::experimental_vp_strided_load, {DataTy, PtrTy, StrideTy}, + {Addr, StrideInBytes, Mask, RunTimeVF}, nullptr, "wide.strided.load"); + NewLI->addParamAttr( + 0, Attribute::getWithAlignment(NewLI->getContext(), Alignment)); + applyMetadata(*NewLI); + State.set(this, NewLI); +} + +#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) +void VPWidenStridedLoadRecipe::print(raw_ostream &O, const Twine &Indent, + VPSlotTracker &SlotTracker) const { + O << Indent << "WIDEN "; + printAsOperand(O, SlotTracker); + O << " = load "; + getAddr()->printAsOperand(O, SlotTracker); + O << ", stride = "; + getStride()->printAsOperand(O, SlotTracker); + O << ", runtimeVF = "; + getVF()->printAsOperand(O, SlotTracker); +} +#endif + void VPWidenStoreRecipe::execute(VPTransformState &State) { VPValue *StoredVPValue = getStoredValue(); bool CreateScatter = !isConsecutive(); diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp index 677d97e54d556..cf9b1c4199449 100644 --- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp +++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp @@ -2203,6 +2203,12 @@ static VPRecipeBase *optimizeMaskToEVL(VPValue *HeaderMask, VPValue *NewAddr = GetNewAddr(L->getAddr()); return new VPWidenLoadEVLRecipe(*L, NewAddr, EVL, NewMask); }) + .Case<VPWidenStridedLoadRecipe>([&](VPWidenStridedLoadRecipe *L) { + VPValue *NewMask = GetNewMask(L->getMask()); + return new VPWidenStridedLoadRecipe( + *cast<LoadInst>(&L->getIngredient()), L->getAddr(), L->getStride(), + &EVL, NewMask, *L, L->getDebugLoc()); + }) .Case<VPWidenStoreRecipe>([&](VPWidenStoreRecipe *S) { VPValue *NewMask = GetNewMask(S->getMask()); VPValue *NewAddr = GetNewAddr(S->getAddr()); @@ -2237,10 +2243,12 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) { VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion(); VPBasicBlock *Header = LoopRegion->getEntryBasicBlock(); - assert(all_of(Plan.getVF().users(), - IsaPred<VPVectorEndPointerRecipe, 
VPScalarIVStepsRecipe, - VPWidenIntOrFpInductionRecipe>) && - "User of VF that we can't transform to EVL."); + assert( + all_of( + Plan.getVF().users(), + IsaPred<VPVectorEndPointerRecipe, VPScalarIVStepsRecipe, + VPWidenIntOrFpInductionRecipe, VPWidenStridedLoadRecipe>) && + "User of VF that we can't transform to EVL."); Plan.getVF().replaceUsesWithIf(&EVL, [](VPUser &U, unsigned Idx) { return isa<VPWidenIntOrFpInductionRecipe, VPScalarIVStepsRecipe>(U); }); @@ -2337,7 +2345,8 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) { assert(NumDefVal <= 1 && "Only supports recipes with a single definition or without users."); EVLRecipe->insertBefore(CurRecipe); - if (isa<VPSingleDefRecipe, VPWidenLoadEVLRecipe>(EVLRecipe)) { + if (isa<VPSingleDefRecipe, VPWidenLoadEVLRecipe, VPWidenStridedLoadRecipe>( + EVLRecipe)) { VPValue *CurVPV = CurRecipe->getVPSingleValue(); CurVPV->replaceAllUsesWith(EVLRecipe->getVPSingleValue()); } @@ -3767,3 +3776,182 @@ void VPlanTransforms::addBranchWeightToMiddleTerminator( MDB.createBranchWeights({1, VectorStep - 1}, /*IsExpected=*/false); MiddleTerm->addMetadata(LLVMContext::MD_prof, BranchWeights); } + +static std::pair<VPValue *, VPValue *> matchStridedStart(VPValue *CurIndex) { + if (auto *WidenIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(CurIndex)) + return {WidenIV, WidenIV->getStepValue()}; + + auto *WidenR = dyn_cast<VPWidenRecipe>(CurIndex); + if (!WidenR || !CurIndex->getUnderlyingValue()) + return {nullptr, nullptr}; + + unsigned Opcode = WidenR->getOpcode(); + // TODO: Support Instruction::Add and Instruction::Or. + if (Opcode != Instruction::Shl && Opcode != Instruction::Mul) + return {nullptr, nullptr}; + + // Match the pattern binop(variant, invariant), or binop(invariant, variant) + // if the binary operator is commutative. + bool IsLHSUniform = vputils::isSingleScalar(WidenR->getOperand(0)); + if (IsLHSUniform == vputils::isSingleScalar(WidenR->getOperand(1)) || + (IsLHSUniform && !Instruction::isCommutative(Opcode))) + return {nullptr, nullptr}; + unsigned VarIdx = IsLHSUniform ? 1 : 0; + + auto [Start, Stride] = matchStridedStart(WidenR->getOperand(VarIdx)); + if (!Start) + return {nullptr, nullptr}; + + SmallVector<VPValue *> StartOps(WidenR->operands()); + StartOps[VarIdx] = Start; + auto *StartR = new VPReplicateRecipe(WidenR->getUnderlyingInstr(), StartOps, + /*IsUniform*/ true); + StartR->insertBefore(WidenR); + + unsigned InvIdx = VarIdx == 0 ? 1 : 0; + auto *StrideR = + new VPInstruction(Opcode, {Stride,... [truncated] |
| Ping. Passed the llvm-testsuite with "-O3 -mcpu=sifive-x280". Please take a look if possible. |
| Rebased, and ping |
| return VarIdx; | ||
| } | ||
| | ||
| Type *getIndexedType(unsigned I) const { |
would be good to document
| } | ||
| | ||
| static std::tuple<VPValue *, VPValue *, Type *> | ||
| determineBaseAndStride(VPWidenGEPRecipe *WidenGEP) { |
would help to document
| /// to represent memory accesses with a fixed stride. | ||
| struct VPWidenStridedLoadRecipe final : public VPWidenMemoryRecipe, | ||
| public VPValue { | ||
| VPWidenStridedLoadRecipe(LoadInst &Load, VPValue *Addr, VPValue *Stride, |
Can we store the required info independent of the LLVM IR instruction?
It would also help to clarify what information is actually used and why this cannot be simply a VPWidenIntrinsicsRecipe, as it would map 1-1 to an intrinsic
Can we store the required info independent of the LLVM IR instruction?
Maybe, if we also store the type information in the recipe. But I want to know, in what scenario would we create a VPWidenStridedLoadRecipe without a LoadInst?
It would also help to clarify what information is actually used and why this cannot be simply a VPWidenIntrinsicsRecipe, as it would map 1-1 to an intrinsic
I tried to do that: Mel-Chen@d19fe61
This approach does help eliminate one recipe VPWidenStridedLoadRecipe, but there are still a few details we need to be careful about:
First, the current VPWidenIntrinsicRecipe cannot set attributes, so alignment information will be lost. This could be resolved by passing an AttributeList to VPWidenIntrinsicRecipe, allowing it to add the attributes during ::execute.
```cpp
// ???: How to set alignment?
auto *StridedLoad = new VPWidenIntrinsicRecipe(
    Ingredient, Intrinsic::experimental_vp_strided_load,
    {NewPtr, StrideInBytes, Mask, I32VF}, TypeInfo.inferScalarType(LoadR),
    LoadR->getDebugLoc());
```
Next, it might become difficult to ensure profitability before generating the strided access (i.e., we may not be able to achieve the change suggested in this comment). For more accurate profitability analysis, it would be better to call getCostForIntrinsics directly during the profitability check, which requires that all operands are already prepared.
```cpp
// Better to make getCostForIntrinsics a utils function, and directly
// call getCostForIntrinsics to get the cost.
SmallVector<Type *, 4> ParamTys = {
    TypeInfo.inferScalarType(BasePtr),
    TypeInfo.inferScalarType(StrideInElement),
    Type::getInt1Ty(Plan.getContext()), Type::getInt32Ty(Plan.getContext())};
// FIXME: Copied from getCostForIntrinsics, but I think this is a bug. We
// don't have to vectorize every operand. Should be fixed in
// getCostForIntrinsics.
for (auto &Ty : ParamTys)
  Ty = toVectorTy(Ty, VF);
IntrinsicCostAttributes CostAttrs(
    Intrinsic::experimental_vp_strided_load, DataTy, {}, ParamTys,
    FastMathFlags(), nullptr, InstructionCost::getInvalid(), &Ctx.TLI);
const InstructionCost StridedLoadStoreCost =
    Ctx.TTI.getIntrinsicInstrCost(CostAttrs, Ctx.CostKind);
return StridedLoadStoreCost < CurrentCost;
```
Finally, using VPWidenIntrinsicRecipe reduces readability during transformations. Currently, VPWidenStridedLoadRecipe provides members like getAddr() and getStride(), which improves readability. This issue is not limited to EVL lowering; it should also have an impact on #164205.
```cpp
.Case<VPWidenIntrinsicRecipe>(
    [&](VPWidenIntrinsicRecipe *WI) -> VPRecipeBase * {
      if (WI->getVectorIntrinsicID() ==
          Intrinsic::experimental_vp_strided_load) {
        VPWidenIntrinsicRecipe *NewWI = WI->clone();
        if (VPValue *NewMask = GetNewMask(WI->getOperand(2)))
          NewWI->setOperand(2, NewMask);
        else
          NewWI->setOperand(2, &AllOneMask);
        NewWI->setOperand(3, &EVL);
        return NewWI;
      }
      return nullptr;
    })
```
That’s my take on this. What’s your opinion?
First, the current VPWidenIntrinsicRecipe cannot set attributes, so alignment information will be lost. This could be resolved by passing an AttributeList to VPWidenIntrinsicRecipe, allowing it to add the attributes during ::execute.
Hm yes, this is a general modeling issue, but could be improved separately (and does not only apply to this intrinsic I think). Is setting the alignment critical for the initial implementation?
Next, it might become difficult to ensure profitability before generating the strided access (i.e., we may not be able to achieve the change suggested in this comment
I am not sure I understand this completely; We should be able to check the profitability up-front regardless of what recipe we create later on, I think?
Finally, using VPWidenIntrinsicRecipe reduces readability during transformations. Currently, VPWidenStridedLoadRecipe provides members like getAddr() and getStride(), which improves readability. This issue is not limited to EVL lowering—it should also have an impact #164205.
I think this could be mostly taken care of by providing a m_StridedLoad() pattern?
Chiming in for codegen. The VP strided load/store intrinsics currently assume element alignment in SelectionDAGBuilder if not specified. This is not spelled out in the LangRef. It is spelled out for the plain vp.load/store intrinsics. @nikic recently added a FIXME suggesting that we shouldn't be falling back to ABI alignment for VP intrinsics
| // FIXME: Falling back to ABI alignment is incorrect. |
Some RISC-V hardware requires strided load/store to have at least element size alignment. More advanced hardware implementations may not require this.
I think we're missing an alignment check for all VP memory intrinsics in the backend. This would probably need to be legalized in IR like we do for non-VP masked.load/store/gather/scatter.
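For reference, the recipe in this patch does attach the alignment explicitly when it materializes the call; a trimmed excerpt from VPWidenStridedLoadRecipe::execute in the diff above (variable names as in the patch):
```cpp
// From the patch: the alignment of the original load is re-attached as a
// parameter attribute on the pointer operand of the vp.strided.load call,
// so codegen does not have to fall back to an assumed alignment.
CallInst *NewLI = Builder.CreateIntrinsic(
    Intrinsic::experimental_vp_strided_load, {DataTy, PtrTy, StrideTy},
    {Addr, StrideInBytes, Mask, RunTimeVF}, nullptr, "wide.strided.load");
NewLI->addParamAttr(
    0, Attribute::getWithAlignment(NewLI->getContext(), Alignment));
```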
Hm yes, this is a general modeling issue, but could be improved separately (and does not only apply to this intrinsic I think). Is setting the alignment critical for the initial implementation?
As Craig mentioned above, I think we should still preserve the alignment information for now. Moreover, since #164205 converts strided accesses into gather/scatter, if we don’t preserve it now, wouldn’t that cause the masked.gather/scatter to lose their alignment information?
I am not sure I understand this completely; We should be able to check the profitability up-front regardless of what recipe we create later on, I think?
Yes, that works. I’m currently using TTI.getIntrinsicInstrCost to achieve this. But I think the best approach would be to directly reuse `static InstructionCost getCostForIntrinsics(Intrinsic::ID ID, ArrayRef<const VPValue *> Operands, const VPRecipeWithIRFlags &R, ElementCount VF, VPCostContext &Ctx)` that is called by VPWidenIntrinsicRecipe::computeCost.
This would prevent potential discrepancies between the pre-conversion cost estimation and the actual recipe cost. I believe refactoring getCostForIntrinsics would be needed to make that work.
I think this could be mostly taken care of by providing a m_StridedLoad() pattern?
Yes, that’s the least important issue. Using m_Intrinsic<> directly is also fine.
Anyway, the main concern is still the alignment. @fhahn Could you tell me what your major concern is about adding VPWidenStridedLoadRecipe in details? That would help me think of a way to address it.
| bool onlyFirstLaneUsed(const VPValue *Op) const override { | ||
| assert(is_contained(operands(), Op) && | ||
| "Op must be an operand of the recipe"); | ||
| return Op == getAddr() || Op == getStride() || Op == getVF(); |
Might be good to document that this includes everything except the mask
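One possible wording (my suggestion only, not committed text), restating the quoted code with the requested comment:
```cpp
/// Suggested documentation sketch: every operand except the mask (the
/// address, the stride and the VF) is only consumed as a scalar, so only its
/// first lane is used; the mask, when present, is a per-lane operand.
bool onlyFirstLaneUsed(const VPValue *Op) const override {
  assert(is_contained(operands(), Op) && "Op must be an operand of the recipe");
  return Op == getAddr() || Op == getStride() || Op == getVF();
}
```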
| /// A recipe to compute the pointers for widened memory accesses of IndexedTy, | ||
| /// with the Stride expressed in units of IndexedTy. |
IndexedTy->SourceElementTy?
| return {WidenIV, WidenIV->getStepValue()}; | ||
| | ||
| auto *WidenR = dyn_cast<VPWidenRecipe>(CurIndex); | ||
| if (!WidenR || !CurIndex->getUnderlyingValue()) |
Why do we require an underlying value?
I’ve changed it like this for now.
2064369
We need UnderlyingInstr here to narrow all base address recipes into scalar recipes. As discussed earlier, we don’t want to rely on the legacy cost model, so we also can’t depend on its scalarize analysis. The functionality needed here is actually similar to narrowToSingleScalarRecipes; in both cases, we require UnderlyingInstr in order to create a VPReplicateRecipe.
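For context, the relevant part of matchStridedStart in this patch (lightly abridged from the diff earlier in the PR) shows why the underlying instruction is needed:
```cpp
// From the patch: the uniform start value is rebuilt as a single-scalar
// VPReplicateRecipe, which requires the underlying IR instruction of the
// widened recipe being narrowed.
SmallVector<VPValue *> StartOps(WidenR->operands());
StartOps[VarIdx] = Start;
auto *StartR = new VPReplicateRecipe(WidenR->getUnderlyingInstr(), StartOps,
                                     /*IsUniform*/ true);
StartR->insertBefore(WidenR);
```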
| | ||
| // Memory cost model requires the pointer operand of memory access | ||
| // instruction. | ||
| Value *PtrUV = Ptr->getUnderlyingValue(); |
Passing nullptr does not work?
12d526a
Since isLegalStridedLoadStore has already been called to check the legality before using Ptr->getUnderlyingValue(), I think passing nullptr should be fine for now. Any potential issues were discussed previously: #128718 (comment).
| PossiblyDead.insert(Ptr); | ||
| | ||
| // Create a new vector pointer for strided access. | ||
| auto *GEP = dyn_cast<GetElementPtrInst>(PtrUV->stripPointerCasts()); |
At this point, we should not look at the underlying value to determine flags, only use the ones from the recipes.
eb70f43
This approach is fine for now. However, in the VecEndPtr → VecPtr case, some flags may be lost. We’ve discussed this before in #128718 (comment).
Currently, LLVM vectorizes strided accesses with SVE as follows.
```c
void func(double* restrict a, double* b, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = b[i * 10] + 1;
  }
}
```
=>
```
  ...
  index z1.d, #0, #1
loop:
  add z2.d, z1.d, z0.d
  mul z1.d, z1.d, #80
  ld1d { z1.d }, p0/z, [x1, z1.d]
  ...
  mov z1.d, z2.d
  ...
```
This generated code is inefficient because it performs address calculation inside the loop using vector instructions. This can lead to performance degradation (llvm#129474). Ideally, we want to generate efficient instructions that keep the offset vector `z1` constant and update the base register `x1` with a scalar instruction.
```
  ...
  index z1.d, #0, #10
loop:
  ld1d z2.d, p7/z, [x1, z1.d, lsl 3]
  ...
  add x1, x1, x2
  ...
```
This patch enables strided accesses to be vectorized efficiently as shown above. This patch is based on llvm#147297. llvm#147297 detects strided accesses and converts them into stride recipes. This patch then changes it to a legal and efficient sequence of recipes for AArch64.
I am making this patch as a draft for the following reasons:
- I have not yet been able to create sufficient test cases for this patch.
- I have not yet been able to confirm that there are no performance degradations.
| | ||
| /// Checks if the given VPWidenGEPRecipe \p WidenGEP represents a strided | ||
| /// access. If so, it creates recipes representing the base pointer and stride | ||
| /// in element type, and returns a tuple of {base pointer, stride, element |
| /// in element type, and returns a tuple of {base pointer, stride, element | |
| /// scaled by the indexed element type's size and returns a tuple of {base pointer, stride, element |
Should this say stride scaled by the indexed type?
determineBaseAndStride returns the stride in element units, and the scaling by the element size is applied later by convertToStridedAccesses. So at this point the returned stride is not yet scaled.
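A rough sketch of that later scaling step, assuming names from the snippets quoted in this review (TypeInfo is a VPTypeAnalysis; Plan, DL and MemR come from the surrounding transform); the byte stride is the element stride times the allocation size of the accessed type:
```cpp
// Sketch only: scale the matched element stride to a byte stride before it is
// fed to the strided load.
Type *StrideTy = TypeInfo.inferScalarType(StrideInElement);
VPValue *EltSize = Plan.getOrAddLiveIn(
    ConstantInt::get(StrideTy, DL.getTypeAllocSize(ElementTy)));
auto *StrideInBytes = new VPInstruction(
    Instruction::Mul, {StrideInElement, EltSize}, MemR->getDebugLoc());
StrideInBytes->insertBefore(MemR);
```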
| // TODO: Transform gather/scatter with uniform address into strided access | ||
| // with 0 stride. | ||
| // TODO: Transform interleave access into multiple strided accesses. | ||
| if (!MemR || !isa<VPWidenLoadRecipe>(MemR) || MemR->isConsecutive()) |
| if (!MemR || !isa<VPWidenLoadRecipe>(MemR) || MemR->isConsecutive()) | |
| if (!MemR || MemR->isConsecutive()) |
Can dyn_cast to VPWidenLoadRecipe directly?
Yes, that works for now. 2c97e42
Although if we later support strided stores, we’ll probably go back to the current code.
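For clarity, the check after that simplification roughly looks like this (a sketch; the loop variable name R is an assumption):
```cpp
// Bail out unless this is a non-consecutive widened load, i.e. a gather that
// is a candidate for conversion to a strided access.
auto *LoadR = dyn_cast<VPWidenLoadRecipe>(&R);
if (!LoadR || LoadR->isConsecutive())
  continue;
```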
| | ||
| // Skip if the memory access is not a strided access. | ||
| if (!BasePtr) { | ||
| assert(!StrideInElement && !ElementTy); |
| assert(!StrideInElement && !ElementTy); | |
| assert(!StrideInElement && !ElementTy && ".."); |
would be good to add message
Sure.
2be3029
| /// A recipe for strided load operations, using the base address, stride, VF, | ||
| /// and an optional mask. This recipe will generate a vp.strided.load intrinsic | ||
| /// call to represent memory accesses with a fixed stride. | ||
| struct VPWidenStridedLoadRecipe final : public VPWidenMemoryRecipe, |
I think the name VPWidenStridedLoadRecipe is a bit misleading because it only emits VP strided loads. In #164205 this has to be replaced with a regular widened gather anyway for AArch64, since RISC-V is the only target that supports vp.strided.load.
I think we should first create a recipe or (canonical sequence of existing recipes) that executes something that all targets incl. AArch64 can lower.
And then separately optimize it to a VP intrinsic in optimizeMaskToEVL, possibly in a separate PR
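As a rough illustration of the target-independent form being suggested here (an IRBuilder-style sketch, not code from this patch; Base, StrideInBytes, StrideTy, EltTy, VF, Mask and Alignment are all assumed names), a strided access can be expressed as a masked gather over base + i * stride, which targets without vp.strided.load can already select:
```cpp
// Sketch only: lower a byte stride to a masked gather that any target with
// gather support can handle.
Value *Step = Builder.CreateStepVector(VectorType::get(StrideTy, VF));
Value *Offsets =
    Builder.CreateMul(Step, Builder.CreateVectorSplat(VF, StrideInBytes));
Value *Ptrs = Builder.CreateGEP(Builder.getInt8Ty(), Base, Offsets);
Value *Gather = Builder.CreateMaskedGather(VectorType::get(EltTy, VF), Ptrs,
                                           Alignment, Mask);
```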
I think the name VPWidenStridedLoadRecipe is a bit misleading because it only emits VP strided loads.
The reason is that there is no non-VP strided access intrinsic available.
I think we should first create a recipe or (canonical sequence of existing recipes) that executes something that all targets incl. AArch64 can lower.
My thinking here is different. I believe it is better to start with a high-level semantic recipe and gradually lower it into more concrete recipes, similar to what convertToConcreteRecipes does, rather than basing it solely on the number of targets that can support it.
And then separately optimize it to a VP intrinsic in optimizeMaskToEVL, possibly in a separate PR
No, strided accesses also work without tail folding, and this still cannot solve the alignment information issue we discussed above.
My thinking here is different. I believe it is better to start with a high-level semantic recipe and gradually lower it into more concrete recipes, similar to what convertToConcreteRecipes does, rather than basing it solely on the number of targets that can support it.
I think it's fine to have VPWidenStridedLoadRecipe as a high-level recipe; I'm not suggesting we remove it necessarily. I just think it shouldn't emit vp.strided.load, since every other target that's not RISC-V will have to "legalize" it.
We should convert it to a VP intrinsic as a separate step that's only run on RISC-V
I have a proposal:
In VPlanTransforms::convertToStridedAccesses, we still generate VPWidenStridedLoadRecipe. Whether the target wants to convert the gather into a strided load should be determined uniformly via TTI.getStridedMemoryOpCost.
Then, in VPlanTransforms::optimize (perhaps inside SimplifyRecipe), we can decide whether to convert it into vp.strided.load by calling TTI.isLegalStridedLoadStore (though this would require passing TTI into ::optimize).
Finally, during VPlanTransforms::convertToConcreteRecipes, we expand all remaining VPWidenStridedLoadRecipes into a widen GEP + gather. Alternatively, that last step could transform VPWidenStridedLoadRecipe into either vp.strided.load or a widen GEP + gather, depending on TTI.isLegalStridedLoadStore; or we could emit vp.strided.load or widen GEP + gather directly in VPlanTransforms::convertToStridedAccesses, likewise depending on TTI.isLegalStridedLoadStore.
However, both approaches still require VPWidenIntrinsicRecipe to support alignment information.
If this proposal sounds acceptable, maybe as an initial patch we can just leave a TODO in VPlanTransforms::convertToStridedAccesses to document this plan.
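To make the dispatch in that proposal concrete, here is a minimal sketch; only TTI.isLegalStridedLoadStore is a real hook, the two helpers are hypothetical placeholders for the lowering paths discussed above:
```cpp
// Hypothetical split: keep the abstract strided-load recipe target
// independent, then pick the concrete form late based on target legality.
if (TTI.isLegalStridedLoadStore(DataTy, Alignment))
  lowerToVPStridedLoad(StridedLoadR);        // RISC-V: emit vp.strided.load
else
  expandToWidenGEPAndGather(StridedLoadR);   // other targets: widen GEP + gather
```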
This patch detects stride memory accesses such as:
and converts widen non-consecutive loads (i.e. gathers) into strided loads when profitable and legal.
The transformation is implemented as part of VPlan and is inspired by existing logic in RISCVGatherScatterLowering. Some of the legality and analysis logic has been moved into VPlan to enable this conversion earlier during vectorization planning.
This enables more efficient code generation for targets like RISC-V that support strided loads natively.