I would appreciate some expert opinion on the function `vectorizeLoadInsert` in the VectorCombine pass. For my test case it only makes the IR more complicated for downstream passes to process.
If I disable VectorCombine, the IR snippet emitted by clang looks like the following:
```llvm
%5 = load float, ptr addrspace(3) @in, align 4, !tbaa !6
%vecins = insertelement <8 x float> poison, float %5, i64 0
%6 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 4), align 4, !tbaa !6
%vecins.1 = insertelement <8 x float> %vecins, float %6, i64 1
%7 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 8), align 4, !tbaa !6
%vecins.2 = insertelement <8 x float> %vecins.1, float %7, i64 2
%8 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 12), align 4, !tbaa !6
%vecins.3 = insertelement <8 x float> %vecins.2, float %8, i64 3
%9 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 16), align 4, !tbaa !6
%vecins.4 = insertelement <8 x float> %vecins.3, float %9, i64 4
%10 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 20), align 4, !tbaa !6
%vecins.5 = insertelement <8 x float> %vecins.4, float %10, i64 5
%11 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 24), align 4, !tbaa !6
%vecins.6 = insertelement <8 x float> %vecins.5, float %11, i64 6
%12 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 28), align 4, !tbaa !6
%vecins.7 = insertelement <8 x float> %vecins.6, float %12, i64 7
%13 = load <16 x half>, ptr addrspace(3) @a, align 32, !tbaa !13
%14 = load <16 x half>, ptr addrspace(3) @b, align 32, !tbaa !13
%15 = tail call contract <8 x float> @llvm.amdgcn.wmma.f32.16x16x16.f16.v8f32.v16f16(<16 x half> %13, <16 x half> %14, <8 x float> %vecins.7)
```
If I enable VectorCombine, the IR looks like the following:
```llvm
%vecins = load <8 x float>, ptr addrspace(3) @in, align 4
%5 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 4), align 4, !tbaa !6
%vecins.1 = insertelement <8 x float> %vecins, float %5, i64 1
%6 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 8), align 4, !tbaa !6
%vecins.2 = insertelement <8 x float> %vecins.1, float %6, i64 2
%7 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 12), align 4, !tbaa !6
%vecins.3 = insertelement <8 x float> %vecins.2, float %7, i64 3
%8 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 16), align 4, !tbaa !6
%vecins.4 = insertelement <8 x float> %vecins.3, float %8, i64 4
%9 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 20), align 4, !tbaa !6
%vecins.5 = insertelement <8 x float> %vecins.4, float %9, i64 5
%10 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 24), align 4, !tbaa !6
%vecins.6 = insertelement <8 x float> %vecins.5, float %10, i64 6
%11 = load float, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @in, i32 28), align 4, !tbaa !6
%vecins.7 = insertelement <8 x float> %vecins.6, float %11, i64 7
%12 = load <16 x half>, ptr addrspace(3) @a, align 32, !tbaa !13
%13 = load <16 x half>, ptr addrspace(3) @b, align 32, !tbaa !13
%14 = tail call contract <8 x float> @llvm.amdgcn.wmma.f32.16x16x16.f16.v8f32.v16f16(<16 x half> %12, <16 x half> %13, <8 x float> %vecins.7)
```
So `vectorizeLoadInsert` turns the first load in the sequence into the vector load I want (note that the new `<8 x float>` load keeps only `align 4`); however, it does nothing with the remaining redundant scalar loads and insertelements. I don't see any follow-up processing that cleans this up, which makes the IR harder to deal with in our backend.
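For reference, what I would like to end up with, whether from VectorCombine finishing the job or from the scalar IR being left alone for a later pass, is roughly the following. This is a hand-written sketch, with made-up value names, and the alignment is simply carried over from the scalar form above:

```llvm
; sketch: the eight scalar loads plus insertelement chain collapse into one vector load
%in.vec = load <8 x float>, ptr addrspace(3) @in, align 4
%a.vec  = load <16 x half>, ptr addrspace(3) @a, align 32, !tbaa !13
%b.vec  = load <16 x half>, ptr addrspace(3) @b, align 32, !tbaa !13
%res    = tail call contract <8 x float> @llvm.amdgcn.wmma.f32.16x16x16.f16.v8f32.v16f16(<16 x half> %a.vec, <16 x half> %b.vec, <8 x float> %in.vec)
```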
I do not see a good reason for `vectorizeLoadInsert` to exist in VectorCombine. We could rely on another pass such as LoadStoreVectorizer to form the vector loads and stores. Right now, the IR is mangled by VectorCombine before it ever reaches LoadStoreVectorizer.
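One way to compare the two passes in isolation (a sketch, assuming the pre-VectorCombine IR above is saved as a hypothetical `input.ll`, and that the target's alignment checks let LoadStoreVectorizer merge these LDS accesses) is to run each pass by itself with opt:

```sh
# Run VectorCombine alone: this is where vectorizeLoadInsert fires on the first load.
opt -mtriple=amdgcn-amd-amdhsa -passes=vector-combine -S input.ll -o after-vc.ll

# Run LoadStoreVectorizer alone on the untouched scalar IR: it should be able to
# merge the eight contiguous float loads into a single <8 x float> load, if the
# target accepts the alignment.
opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -S input.ll -o after-lsv.ll
```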