I would appreciate some expert opinion on the function `vectorizeLoadInsert` in the VectorCombine pass. For my test case it only makes the IR more complicated for downstream passes to process.
If I disable VectorCombine, the IR snippet emitted by clang looks like the following:
```llvm
%5 = load float, ptr addrspace(3) @in, align 4, !tbaa !6
%vecins = insertelement <8 x float> poison, float %5, i64 0
%6 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 4), align 4, !tbaa !6
%vecins.1 = insertelement <8 x float> %vecins, float %6, i64 1
%7 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 8), align 4, !tbaa !6
%vecins.2 = insertelement <8 x float> %vecins.1, float %7, i64 2
%8 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 12), align 4, !tbaa !6
%vecins.3 = insertelement <8 x float> %vecins.2, float %8, i64 3
%9 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 16), align 4, !tbaa !6
%vecins.4 = insertelement <8 x float> %vecins.3, float %9, i64 4
%10 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 20), align 4, !tbaa !6
%vecins.5 = insertelement <8 x float> %vecins.4, float %10, i64 5
%11 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 24), align 4, !tbaa !6
%vecins.6 = insertelement <8 x float> %vecins.5, float %11, i64 6
%12 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 28), align 4, !tbaa !6
%vecins.7 = insertelement <8 x float> %vecins.6, float %12, i64 7
%13 = load <16 x half>, ptr addrspace(3) @a, align 32, !tbaa !13
%14 = load <16 x half>, ptr addrspace(3) @b, align 32, !tbaa !13
%15 = tail call contract <8 x float> @llvm.amdgcn.wmma.f32.16x16x16.f16.v8f32.v16f16(<16 x half> %13, <16 x half> %14, <8 x float> %vecins.7)
```
If I enable VectorCombine, the IR looks like the following:
```llvm
%vecins = load <8 x float>, ptr addrspace(3) @in, align 4
%5 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 4), align 4, !tbaa !6
%vecins.1 = insertelement <8 x float> %vecins, float %5, i64 1
%6 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 8), align 4, !tbaa !6
%vecins.2 = insertelement <8 x float> %vecins.1, float %6, i64 2
%7 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 12), align 4, !tbaa !6
%vecins.3 = insertelement <8 x float> %vecins.2, float %7, i64 3
%8 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 16), align 4, !tbaa !6
%vecins.4 = insertelement <8 x float> %vecins.3, float %8, i64 4
%9 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 20), align 4, !tbaa !6
%vecins.5 = insertelement <8 x float> %vecins.4, float %9, i64 5
%10 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 24), align 4, !tbaa !6
%vecins.6 = insertelement <8 x float> %vecins.5, float %10, i64 6
%11 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 28), align 4, !tbaa !6
%vecins.7 = insertelement <8 x float> %vecins.6, float %11, i64 7
%12 = load <16 x half>, ptr addrspace(3) @a, align 32, !tbaa !13
%13 = load <16 x half>, ptr addrspace(3) @b, align 32, !tbaa !13
%14 = tail call contract <8 x float> @llvm.amdgcn.wmma.f32.16x16x16.f16.v8f32.v16f16(<16 x half> %12, <16 x half> %13, <8 x float> %vecins.7)
```
So `vectorizeLoadInsert` turns the first load in the sequence into the vector load I want (note that the new `<8 x float>` load keeps only `align 4`); however, it does nothing with the remaining redundant scalar loads and insertelements. I don't see any follow-up processing that cleans this up, which makes the IR harder to deal with in our backend.
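For reference, what I would like to end up with, whether from VectorCombine finishing the job or from the scalar IR being left alone for a later pass, is roughly the following. This is a hand-written sketch, with made-up value names, and the alignment is simply carried over from the scalar form above:

```llvm
; sketch: the eight scalar loads plus insertelement chain collapse into one vector load
%in.vec = load <8 x float>, ptr addrspace(3) @in, align 4
%a.vec  = load <16 x half>, ptr addrspace(3) @a, align 32, !tbaa !13
%b.vec  = load <16 x half>, ptr addrspace(3) @b, align 32, !tbaa !13
%res    = tail call contract <8 x float> @llvm.amdgcn.wmma.f32.16x16x16.f16.v8f32.v16f16(<16 x half> %a.vec, <16 x half> %b.vec, <8 x float> %in.vec)
```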
I do not see a good reason for `vectorizeLoadInsert` to exist in VectorCombine. We could rely on another pass such as LoadStoreVectorizer to form the vector loads and stores. Right now, the IR is mangled by VectorCombine before it ever reaches LoadStoreVectorizer.
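One way to compare the two passes in isolation (a sketch, assuming the pre-VectorCombine IR above is saved as a hypothetical `input.ll`, and that the target's alignment checks let LoadStoreVectorizer merge these LDS accesses) is to run each pass by itself with opt:

```sh
# Run VectorCombine alone: this is where vectorizeLoadInsert fires on the first load.
opt -mtriple=amdgcn-amd-amdhsa -passes=vector-combine -S input.ll -o after-vc.ll

# Run LoadStoreVectorizer alone on the untouched scalar IR: it should be able to
# merge the eight contiguous float loads into a single <8 x float> load, if the
# target accepts the alignment.
opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -S input.ll -o after-lsv.ll
```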