
Conversation

@JonChesterfield
Collaborator

This is a mostly-target-independent variadic function optimisation and lowering pass. It is only enabled for AMDGPU in this initial commit.

The purpose is to make C-style variadic functions a zero-cost abstraction. They are lowered to equivalent IR which is then amenable to other optimisations. This is inherently slightly target-specific, but much less so than one might expect: the C varargs interface heavily constrains how far the ABI designs can diverge.
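
To make the intended transform concrete, here is a hand-written sketch of the before/after shape (illustrative only, not the pass's literal output; the helper names and the byte-array va_list layout are assumptions for a simple 32-bit target):

#include <stdarg.h>
#include <string.h>

/* Before: an ordinary C variadic function. */
int sum(int n, ...) {
  va_list ap;
  va_start(ap, n);
  int total = 0;
  for (int i = 0; i < n; i++)
    total += va_arg(ap, int);
  va_end(ap);
  return total;
}

/* After (conceptually): the caller packs the trailing arguments into a
   frame-local struct and passes its address; the callee walks that buffer
   directly, so no variadic machinery survives into the IR. */
typedef struct { int a0; int a1; } sum_args_2;

static int sum_lowered(int n, const void *buffer) {
  const char *p = (const char *)buffer;
  int total = 0;
  for (int i = 0; i < n; i++) {
    int v;
    memcpy(&v, p, sizeof v); /* one 4-byte slot per promoted int */
    p += sizeof v;
    total += v;
  }
  return total;
}

int caller(void) {
  sum_args_2 args = {10, 20};   /* materialised at the call site */
  return sum_lowered(2, &args); /* equivalent to sum(2, 10, 20) */
}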

The pass is primarily tested on WebAssembly. This is because wasm has a straightforward variadic lowering strategy which coincides exactly with what this pass transforms code into, and a struct-passing convention with few cases to check. Adding conventions for further targets is straightforward; it is elided from this patch primarily to simplify the review. Linux X86, AMD64, AArch64 and NVPTX are implemented in other branches.

For targets that already have va_arg lowering in clang, testing is most efficiently done by checking that clang | opt completely elides the variadic syntax from the test cases. The lowering produces a struct for each call site, which can be inspected to check that the various alignments and indirections are correct.
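
A sketch of the kind of clang | opt check described here, written in the style of the new clang tests (the RUN lines, flags and the expand-variadics pass name are illustrative assumptions, not copied from the patch):

// RUN: %clang_cc1 -triple wasm32-unknown-unknown -O1 -emit-llvm -o - %s \
// RUN:   | opt -S -passes=expand-variadics | FileCheck %s
//
// After inlining plus the expansion pass, no variadic machinery should
// remain in the module.
// CHECK-NOT: @llvm.va_start
// CHECK-NOT: @llvm.va_end
// CHECK-NOT: va_arg

static int first_i32(int ignored, ...) {
  __builtin_va_list ap;
  __builtin_va_start(ap, ignored);
  int r = __builtin_va_arg(ap, int);
  __builtin_va_end(ap);
  return r;
}

int get_first(void) { return first_i32(0, 42); }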

AMDGPU presently has no variadic support other than some ad hoc printf handling. Combined with the pass being inactive on all other targets, landing this represents a strict increase in capability with zero risk. Testing and refining will continue post-commit.

In addition to the compiler tests included here, a self-contained x64 clang/musl toolchain was constructed using this lowering in place of the SysV ABI and used to build various C programs such as Lua and libxml2.

llvmbot added the cmake, clang, backend:AMDGPU, backend:WebAssembly, clang:codegen and llvm:transforms labels on May 25, 2024
@llvmbot
Member

llvmbot commented May 25, 2024

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-libc
@llvm/pr-subscribers-backend-webassembly

@llvm/pr-subscribers-clang

Author: Jon Chesterfield (JonChesterfield)

Changes


Patch is 206.77 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/93362.diff

26 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/AMDGPU.cpp (+21-4)
  • (added) clang/test/CodeGen/voidptr-vaarg.c (+478)
  • (added) clang/test/CodeGenCXX/inline-then-fold-variadics.cpp (+180)
  • (modified) llvm/cmake/modules/HandleLLVMOptions.cmake (+1-1)
  • (modified) llvm/include/llvm/InitializePasses.h (+1)
  • (added) llvm/include/llvm/Transforms/IPO/ExpandVariadics.h (+43)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+3)
  • (modified) llvm/lib/Transforms/IPO/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/IPO/ExpandVariadics.cpp (+1031)
  • (added) llvm/test/CodeGen/AMDGPU/expand-variadic-call.ll (+524)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/unsupported-calls.ll (-19)
  • (added) llvm/test/CodeGen/WebAssembly/expand-variadic-call.ll (+483)
  • (added) llvm/test/CodeGen/WebAssembly/vararg-frame.ll (+525)
  • (added) llvm/test/Transforms/ExpandVariadics/expand-va-intrinsic-split-linkage.ll (+230)
  • (added) llvm/test/Transforms/ExpandVariadics/expand-va-intrinsic-split-simple.ll (+212)
  • (added) llvm/test/Transforms/ExpandVariadics/indirect-calls.ll (+58)
  • (added) llvm/test/Transforms/ExpandVariadics/intrinsics.ll (+117)
  • (added) llvm/test/Transforms/ExpandVariadics/invoke.ll (+88)
  • (added) llvm/test/Transforms/ExpandVariadics/pass-byval-byref.ll (+148)
  • (added) llvm/test/Transforms/ExpandVariadics/pass-indirect.ll (+58)
  • (added) llvm/test/Transforms/ExpandVariadics/pass-integers.ll (+344)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Transforms/IPO/BUILD.gn (+1)
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index 44e86c0b40f68..47e18535f8fe0 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -45,7 +45,7 @@ class AMDGPUABIInfo final : public DefaultABIInfo {
   ABIArgInfo classifyReturnType(QualType RetTy) const;
   ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
 
-  ABIArgInfo classifyArgumentType(QualType Ty, unsigned &NumRegsLeft) const;
+  ABIArgInfo classifyArgumentType(QualType Ty, bool Variadic, unsigned &NumRegsLeft) const;
 
   void computeInfo(CGFunctionInfo &FI) const override;
   Address EmitVAArg(CodeGenFunction &CGF, Address VAListAddr,
@@ -103,19 +103,27 @@ void AMDGPUABIInfo::computeInfo(CGFunctionInfo &FI) const {
   if (!getCXXABI().classifyReturnType(FI))
     FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
 
+  unsigned ArgumentIndex = 0;
+  const unsigned numFixedArguments = FI.getNumRequiredArgs();
+
   unsigned NumRegsLeft = MaxNumRegsForArgsRet;
   for (auto &Arg : FI.arguments()) {
     if (CC == llvm::CallingConv::AMDGPU_KERNEL) {
       Arg.info = classifyKernelArgumentType(Arg.type);
     } else {
-      Arg.info = classifyArgumentType(Arg.type, NumRegsLeft);
+      bool FixedArgument = ArgumentIndex++ < numFixedArguments;
+      Arg.info = classifyArgumentType(Arg.type, !FixedArgument, NumRegsLeft);
     }
   }
 }
 
 Address AMDGPUABIInfo::EmitVAArg(CodeGenFunction &CGF, Address VAListAddr,
-                                 QualType Ty) const {
-  llvm_unreachable("AMDGPU does not support varargs");
+                                 QualType Ty) const {
+  const bool IsIndirect = false;
+  const bool AllowHigherAlign = false;
+  return emitVoidPtrVAArg(CGF, VAListAddr, Ty, IsIndirect,
+                          getContext().getTypeInfoInChars(Ty),
+                          CharUnits::fromQuantity(4), AllowHigherAlign);
 }
 
 ABIArgInfo AMDGPUABIInfo::classifyReturnType(QualType RetTy) const {
@@ -198,11 +206,20 @@ ABIArgInfo AMDGPUABIInfo::classifyKernelArgumentType(QualType Ty) const {
 }
 
 ABIArgInfo AMDGPUABIInfo::classifyArgumentType(QualType Ty,
+                                               bool Variadic,
                                                unsigned &NumRegsLeft) const {
   assert(NumRegsLeft <= MaxNumRegsForArgsRet && "register estimate underflow");
 
   Ty = useFirstFieldIfTransparentUnion(Ty);
 
+  if (Variadic) {
+    return ABIArgInfo::getDirect(/*T=*/nullptr,
+                                 /*Offset=*/0,
+                                 /*Padding=*/nullptr,
+                                 /*CanBeFlattened=*/false,
+                                 /*Align=*/0);
+  }
+
   if (isAggregateTypeForABI(Ty)) {
     // Records with non-trivial destructors/copy-constructors should not be
     // passed by value.
diff --git a/clang/test/CodeGen/voidptr-vaarg.c b/clang/test/CodeGen/voidptr-vaarg.c
new file mode 100644
index 0000000000000..d023ddf0fb5d2
--- /dev/null
+++ b/clang/test/CodeGen/voidptr-vaarg.c
@@ -0,0 +1,478 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
+// REQUIRES: webassembly-registered-target
+// RUN: %clang_cc1 -triple wasm32-unknown-unknown -emit-llvm -o - %s | FileCheck %s
+
+// Multiple targets use emitVoidPtrVAArg to lower va_arg instructions in clang
+// PPC is complicated, excluding from this case analysis
+// ForceRightAdjust is false for all non-PPC targets
+// AllowHigherAlign is only false for two Microsoft targets, both of which
+// pass most things by reference.
+// +// Address emitVoidPtrVAArg(CodeGenFunction &CGF, Address VAListAddr, +// QualType ValueTy, bool IsIndirect, +// TypeInfoChars ValueInfo, CharUnits SlotSizeAndAlign, +// bool AllowHigherAlign, bool ForceRightAdjust = +// false); +// +// Target IsIndirect SlotSize AllowHigher ForceRightAdjust +// ARC false four true false +// ARM varies four true false +// Mips false 4 or 8 true false +// RISCV varies register true false +// PPC elided +// LoongArch varies register true false +// NVPTX WIP +// AMDGPU WIP +// X86_32 false four true false +// X86_64 MS varies eight false false +// CSKY false four true false +// Webassembly varies four true false +// AArch64 false eight true false +// AArch64 MS false eight false false +// +// Webassembly passes indirectly iff it's an aggregate of multiple values +// Choosing this as a representative architecture to check IR generation +// partly because it has a relatively simple variadic calling convention. + +// Int, by itself and packed in structs +// CHECK-LABEL: @raw_int( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARGP_CUR]], align 4 +// CHECK-NEXT: ret i32 [[TMP0]] +// +int raw_int(__builtin_va_list list) { return __builtin_va_arg(list, int); } + +typedef struct { + int x; +} one_int_t; + +// CHECK-LABEL: @one_int( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_ONE_INT_T:%.*]], align 4 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false) +// CHECK-NEXT: [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_INT_T]], ptr [[RETVAL]], i32 0, i32 0 +// CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[COERCE_DIVE]], align 4 +// CHECK-NEXT: ret i32 [[TMP0]] +// +one_int_t one_int(__builtin_va_list list) { + return __builtin_va_arg(list, one_int_t); +} + +typedef struct { + int x; + int y; +} two_int_t; + +// CHECK-LABEL: @two_int( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[AGG_RESULT:%.*]], ptr align 4 [[TMP0]], i32 8, i1 false) +// CHECK-NEXT: ret void +// +two_int_t two_int(__builtin_va_list list) { + return __builtin_va_arg(list, two_int_t); +} + +// Double, by itself and packed in structs +// CHECK-LABEL: @raw_double( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load 
ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 7 +// CHECK-NEXT: [[ARGP_CUR_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP0]], i32 -8) +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR_ALIGNED]], i32 8 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP1:%.*]] = load double, ptr [[ARGP_CUR_ALIGNED]], align 8 +// CHECK-NEXT: ret double [[TMP1]] +// +double raw_double(__builtin_va_list list) { + return __builtin_va_arg(list, double); +} + +typedef struct { + double x; +} one_double_t; + +// CHECK-LABEL: @one_double( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_ONE_DOUBLE_T:%.*]], align 8 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 7 +// CHECK-NEXT: [[ARGP_CUR_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP0]], i32 -8) +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR_ALIGNED]], i32 8 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 8 [[RETVAL]], ptr align 8 [[ARGP_CUR_ALIGNED]], i32 8, i1 false) +// CHECK-NEXT: [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_DOUBLE_T]], ptr [[RETVAL]], i32 0, i32 0 +// CHECK-NEXT: [[TMP1:%.*]] = load double, ptr [[COERCE_DIVE]], align 8 +// CHECK-NEXT: ret double [[TMP1]] +// +one_double_t one_double(__builtin_va_list list) { + return __builtin_va_arg(list, one_double_t); +} + +typedef struct { + double x; + double y; +} two_double_t; + +// CHECK-LABEL: @two_double( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 8 [[AGG_RESULT:%.*]], ptr align 8 [[TMP0]], i32 16, i1 false) +// CHECK-NEXT: ret void +// +two_double_t two_double(__builtin_va_list list) { + return __builtin_va_arg(list, two_double_t); +} + +// Scalar smaller than the slot size (C would promote a short to int) +typedef struct { + char x; +} one_char_t; + +// CHECK-LABEL: @one_char( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_ONE_CHAR_T:%.*]], align 1 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 1, i1 false) +// CHECK-NEXT: [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_CHAR_T]], ptr [[RETVAL]], i32 0, i32 0 +// CHECK-NEXT: [[TMP0:%.*]] = load i8, ptr [[COERCE_DIVE]], align 1 +// CHECK-NEXT: ret i8 [[TMP0]] +// +one_char_t one_char(__builtin_va_list list) { + return __builtin_va_arg(list, one_char_t); +} 
+ +typedef struct { + short x; +} one_short_t; + +// CHECK-LABEL: @one_short( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_ONE_SHORT_T:%.*]], align 2 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 2, i1 false) +// CHECK-NEXT: [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_SHORT_T]], ptr [[RETVAL]], i32 0, i32 0 +// CHECK-NEXT: [[TMP0:%.*]] = load i16, ptr [[COERCE_DIVE]], align 2 +// CHECK-NEXT: ret i16 [[TMP0]] +// +one_short_t one_short(__builtin_va_list list) { + return __builtin_va_arg(list, one_short_t); +} + +// Composite smaller than the slot size +typedef struct { + _Alignas(2) char x; + char y; +} char_pair_t; + +// CHECK-LABEL: @char_pair( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[AGG_RESULT:%.*]], ptr align 2 [[TMP0]], i32 2, i1 false) +// CHECK-NEXT: ret void +// +char_pair_t char_pair(__builtin_va_list list) { + return __builtin_va_arg(list, char_pair_t); +} + +// Empty struct +typedef struct { +} empty_t; + +// CHECK-LABEL: @empty( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_EMPTY_T:%.*]], align 1 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 0 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 0, i1 false) +// CHECK-NEXT: ret void +// +empty_t empty(__builtin_va_list list) { + return __builtin_va_arg(list, empty_t); +} + +typedef struct { + empty_t x; + int y; +} empty_int_t; + +// CHECK-LABEL: @empty_int( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_EMPTY_INT_T:%.*]], align 4 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false) +// CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[RETVAL]], align 4 +// CHECK-NEXT: ret i32 [[TMP0]] +// +empty_int_t empty_int(__builtin_va_list list) { + return __builtin_va_arg(list, empty_int_t); +} + +typedef struct { + int x; + empty_t y; +} int_empty_t; + +// CHECK-LABEL: @int_empty( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca 
[[STRUCT_INT_EMPTY_T:%.*]], align 4 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false) +// CHECK-NEXT: [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_INT_EMPTY_T]], ptr [[RETVAL]], i32 0, i32 0 +// CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[COERCE_DIVE]], align 4 +// CHECK-NEXT: ret i32 [[TMP0]] +// +int_empty_t int_empty(__builtin_va_list list) { + return __builtin_va_arg(list, int_empty_t); +} + +// Need multiple va_arg instructions to check the postincrement +// Using types that are passed directly as the indirect handling +// is independent of the alignment handling in emitVoidPtrDirectVAArg. + +// CHECK-LABEL: @multiple_int( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT0_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT1_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT2_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT0:%.*]], ptr [[OUT0_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT1:%.*]], ptr [[OUT1_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT2:%.*]], ptr [[OUT2_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARGP_CUR]], align 4 +// CHECK-NEXT: [[TMP1:%.*]] = load ptr, ptr [[OUT0_ADDR]], align 4 +// CHECK-NEXT: store i32 [[TMP0]], ptr [[TMP1]], align 4 +// CHECK-NEXT: [[ARGP_CUR1:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT2:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR1]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT2]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[ARGP_CUR1]], align 4 +// CHECK-NEXT: [[TMP3:%.*]] = load ptr, ptr [[OUT1_ADDR]], align 4 +// CHECK-NEXT: store i32 [[TMP2]], ptr [[TMP3]], align 4 +// CHECK-NEXT: [[ARGP_CUR3:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR3]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT4]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP4:%.*]] = load i32, ptr [[ARGP_CUR3]], align 4 +// CHECK-NEXT: [[TMP5:%.*]] = load ptr, ptr [[OUT2_ADDR]], align 4 +// CHECK-NEXT: store i32 [[TMP4]], ptr [[TMP5]], align 4 +// CHECK-NEXT: ret void +// +void multiple_int(__builtin_va_list list, int *out0, int *out1, int *out2) { + *out0 = __builtin_va_arg(list, int); + *out1 = __builtin_va_arg(list, int); + *out2 = __builtin_va_arg(list, int); +} + +// Scalars in structs are an easy way of specifying alignment from C +// CHECK-LABEL: @increasing_alignment( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT0_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT1_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT2_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT3_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 
+// CHECK-NEXT: store ptr [[OUT0:%.*]], ptr [[OUT0_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT1:%.*]], ptr [[OUT1_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT2:%.*]], ptr [[OUT2_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT3:%.*]], ptr [[OUT3_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[OUT0_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[TMP0]], ptr align 4 [[ARGP_CUR]], i32 1, i1 false) +// CHECK-NEXT: [[TMP1:%.*]] = load ptr, ptr [[OUT1_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR1:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT2:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR1]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT2]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[TMP1]], ptr align 4 [[ARGP_CUR1]], i32 2, i1 false) +// CHECK-NEXT: [[ARGP_CUR3:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR3]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT4]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[ARGP_CUR3]], align 4 +// CHECK-NEXT: [[TMP3:%.*]] = load ptr, ptr [[OUT2_ADDR]], align 4 +// CHECK-NEXT: store i32 [[TMP2]], ptr [[TMP3]], align 4 +// CHECK-NEXT: [[ARGP_CUR5:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR5]], i32 7 +// CHECK-NEXT: [[ARGP_CUR5_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP4... [truncated] 
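
To make the CHECK lines above easier to follow, here is a hand-written C model of the byte-pointer va_arg scheme that emitVoidPtrVAArg implements for the direct case (a sketch assuming a 4-byte slot, as in the AMDGPU EmitVAArg change above; it is not code from the patch):

#include <stdint.h>
#include <string.h>

/* va_list is a single char* cursor into the packed argument area. */
typedef char *voidptr_valist;

/* Model of the direct (non-indirect) case: optionally align the cursor up
   to the value's natural alignment, return the current slot, then advance
   by the size rounded up to the 4-byte slot. AllowHigherAlign=false (as in
   the AMDGPU change above) skips the alignment step so everything sits on
   4-byte slots. */
static void *voidptr_va_arg(voidptr_valist *list, size_t size,
                            size_t align, int allow_higher_align) {
  uintptr_t p = (uintptr_t)*list;
  if (allow_higher_align && align > 4)
    p = (p + align - 1) & ~(uintptr_t)(align - 1); /* the gep+ptrmask step */
  size_t slot = (size + 3) & ~(size_t)3;           /* round to 4-byte slots */
  *list = (char *)(p + slot);
  return (void *)p;
}

/* Usage: reading an int then a double from a wasm32-style list. */
double int_then_double(voidptr_valist list) {
  int i;
  double d;
  memcpy(&i, voidptr_va_arg(&list, sizeof i, _Alignof(int), 1), sizeof i);
  memcpy(&d, voidptr_va_arg(&list, sizeof d, _Alignof(double), 1), sizeof d);
  return i + d;
}
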
@llvmbot
Member

llvmbot commented May 25, 2024

@llvm/pr-subscribers-llvm-transforms

@llvmbot
Member

llvmbot commented May 25, 2024

@llvm/pr-subscribers-backend-amdgpu

+ +typedef struct { + short x; +} one_short_t; + +// CHECK-LABEL: @one_short( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_ONE_SHORT_T:%.*]], align 2 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 2, i1 false) +// CHECK-NEXT: [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_SHORT_T]], ptr [[RETVAL]], i32 0, i32 0 +// CHECK-NEXT: [[TMP0:%.*]] = load i16, ptr [[COERCE_DIVE]], align 2 +// CHECK-NEXT: ret i16 [[TMP0]] +// +one_short_t one_short(__builtin_va_list list) { + return __builtin_va_arg(list, one_short_t); +} + +// Composite smaller than the slot size +typedef struct { + _Alignas(2) char x; + char y; +} char_pair_t; + +// CHECK-LABEL: @char_pair( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[AGG_RESULT:%.*]], ptr align 2 [[TMP0]], i32 2, i1 false) +// CHECK-NEXT: ret void +// +char_pair_t char_pair(__builtin_va_list list) { + return __builtin_va_arg(list, char_pair_t); +} + +// Empty struct +typedef struct { +} empty_t; + +// CHECK-LABEL: @empty( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_EMPTY_T:%.*]], align 1 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 0 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 0, i1 false) +// CHECK-NEXT: ret void +// +empty_t empty(__builtin_va_list list) { + return __builtin_va_arg(list, empty_t); +} + +typedef struct { + empty_t x; + int y; +} empty_int_t; + +// CHECK-LABEL: @empty_int( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca [[STRUCT_EMPTY_INT_T:%.*]], align 4 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false) +// CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[RETVAL]], align 4 +// CHECK-NEXT: ret i32 [[TMP0]] +// +empty_int_t empty_int(__builtin_va_list list) { + return __builtin_va_arg(list, empty_int_t); +} + +typedef struct { + int x; + empty_t y; +} int_empty_t; + +// CHECK-LABEL: @int_empty( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[RETVAL:%.*]] = alloca 
[[STRUCT_INT_EMPTY_T:%.*]], align 4 +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false) +// CHECK-NEXT: [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_INT_EMPTY_T]], ptr [[RETVAL]], i32 0, i32 0 +// CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[COERCE_DIVE]], align 4 +// CHECK-NEXT: ret i32 [[TMP0]] +// +int_empty_t int_empty(__builtin_va_list list) { + return __builtin_va_arg(list, int_empty_t); +} + +// Need multiple va_arg instructions to check the postincrement +// Using types that are passed directly as the indirect handling +// is independent of the alignment handling in emitVoidPtrDirectVAArg. + +// CHECK-LABEL: @multiple_int( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT0_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT1_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT2_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT0:%.*]], ptr [[OUT0_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT1:%.*]], ptr [[OUT1_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT2:%.*]], ptr [[OUT2_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARGP_CUR]], align 4 +// CHECK-NEXT: [[TMP1:%.*]] = load ptr, ptr [[OUT0_ADDR]], align 4 +// CHECK-NEXT: store i32 [[TMP0]], ptr [[TMP1]], align 4 +// CHECK-NEXT: [[ARGP_CUR1:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT2:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR1]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT2]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[ARGP_CUR1]], align 4 +// CHECK-NEXT: [[TMP3:%.*]] = load ptr, ptr [[OUT1_ADDR]], align 4 +// CHECK-NEXT: store i32 [[TMP2]], ptr [[TMP3]], align 4 +// CHECK-NEXT: [[ARGP_CUR3:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR3]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT4]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP4:%.*]] = load i32, ptr [[ARGP_CUR3]], align 4 +// CHECK-NEXT: [[TMP5:%.*]] = load ptr, ptr [[OUT2_ADDR]], align 4 +// CHECK-NEXT: store i32 [[TMP4]], ptr [[TMP5]], align 4 +// CHECK-NEXT: ret void +// +void multiple_int(__builtin_va_list list, int *out0, int *out1, int *out2) { + *out0 = __builtin_va_arg(list, int); + *out1 = __builtin_va_arg(list, int); + *out2 = __builtin_va_arg(list, int); +} + +// Scalars in structs are an easy way of specifying alignment from C +// CHECK-LABEL: @increasing_alignment( +// CHECK-NEXT: entry: +// CHECK-NEXT: [[LIST_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT0_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT1_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT2_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: [[OUT3_ADDR:%.*]] = alloca ptr, align 4 +// CHECK-NEXT: store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4 
+// CHECK-NEXT: store ptr [[OUT0:%.*]], ptr [[OUT0_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT1:%.*]], ptr [[OUT1_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT2:%.*]], ptr [[OUT2_ADDR]], align 4 +// CHECK-NEXT: store ptr [[OUT3:%.*]], ptr [[OUT3_ADDR]], align 4 +// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[OUT0_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[TMP0]], ptr align 4 [[ARGP_CUR]], i32 1, i1 false) +// CHECK-NEXT: [[TMP1:%.*]] = load ptr, ptr [[OUT1_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_CUR1:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT2:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR1]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT2]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[TMP1]], ptr align 4 [[ARGP_CUR1]], i32 2, i1 false) +// CHECK-NEXT: [[ARGP_CUR3:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[ARGP_NEXT4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR3]], i32 4 +// CHECK-NEXT: store ptr [[ARGP_NEXT4]], ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[ARGP_CUR3]], align 4 +// CHECK-NEXT: [[TMP3:%.*]] = load ptr, ptr [[OUT2_ADDR]], align 4 +// CHECK-NEXT: store i32 [[TMP2]], ptr [[TMP3]], align 4 +// CHECK-NEXT: [[ARGP_CUR5:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4 +// CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR5]], i32 7 +// CHECK-NEXT: [[ARGP_CUR5_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP4... [truncated] 
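The truncated CHECK lines above all exercise the same pointer-bump scheme that emitVoidPtrVAArg produces for wasm32. A rough C sketch of the direct (by-value) case, with invented names and assuming the 4-byte slot size from the table in the test header, looks like this; aggregates passed indirectly instead load a pointer out of a 4-byte slot and copy through it.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: mimics the wasm32 va_arg lowering checked above.
   4-byte slots, align the cursor up only when the type is more aligned than
   the slot, then bump the list past the rounded-up size. */
static void *va_next_direct(char **list, size_t size, size_t align) {
  uintptr_t p = (uintptr_t)*list;
  if (align > 4)
    p = (p + align - 1) & ~(uintptr_t)(align - 1); /* matches the ptrmask */
  *list = (char *)p + ((size + 3) & ~(size_t)3);   /* round size up to slots */
  return (void *)p;
}

static double va_take_double(char **list) {
  double d;
  memcpy(&d, va_next_direct(list, sizeof d, _Alignof(double)), sizeof d);
  return d;
}
```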
@github-actions

github-actions bot commented May 25, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.


-ABIArgInfo AMDGPUABIInfo::classifyArgumentType(QualType Ty,
+ABIArgInfo AMDGPUABIInfo::classifyArgumentType(QualType Ty, bool Variadic,
                                                unsigned &NumRegsLeft) const {
Collaborator Author

This was subtle. Structs that aren't packed into integers and passed in registers fall through to default handling, which sets CanBeFlattened, saying that it's OK to spread the struct across multiple arguments. That is then very difficult to reassemble robustly through the va_arg(x, type) interface - one needs to compute how type is likely to have been spread out across part of the call frame.

Noting that these values aren't being usefully passed in registers anyway, the if (Variadic) block sets up call instructions that pass values by value (not byval) and declares that every value shall be exactly four-byte aligned (including doubles, as that's something Matt suggested for amdgpu some time ago). This means the frame setup implementation and the case analysis for testing are very straightforward.
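A small self-contained C illustration of the trade-off described above (pair_t and sum_pairs are invented for the example, not taken from the patch): the callee recovers each struct with a single va_arg, which is only straightforward when the call site passed the struct as one value rather than flattened into separate scalars.

```c
#include <stdarg.h>

typedef struct {
  int x;
  double y;
} pair_t; /* hypothetical example type */

/* Sums n pair_t arguments. va_arg(ap, pair_t) assumes the caller passed each
   struct as a single value; if the ABI had flattened it into (int, double),
   the callee would have to reverse-engineer that layout instead. */
double sum_pairs(int n, ...) {
  va_list ap;
  va_start(ap, n);
  double acc = 0;
  for (int i = 0; i < n; ++i) {
    pair_t p = va_arg(ap, pair_t);
    acc += p.x + p.y;
  }
  va_end(ap);
  return acc;
}
```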

@JonChesterfield JonChesterfield force-pushed the jc_varargs_amdgpu branch 4 times, most recently from 1efc9f6 to add8686 Compare May 25, 2024 14:00
@JonChesterfield
Collaborator Author

Joseph reports a "memory error" from a libc test when running with this patch. This is unfortunate. I haven't reproduced that yet (to be clear, I don't mean libc passes for me; I mean libc fails for me with or without this patch). The blast radius for "memory error" on amdgpu is wide, but there is very little amdgpu-specific code in this patch, so it's either something handling addrspacecast incorrectly or an unlucky interaction with something outside of this patch.

My plan is to spin up a separate patch which is the non-amdgpu part of this and hope someone signs off on it - the development overhead of juggling lots of branches is significantly compromising time to solution here. Bringing up x64 / aarch64 / nvptx or similar will, if I'm lucky, uncover a bug in this pass which is causing the libc test failure.

For debugging amdgpu, I'll add more tests around addrspace cast and hope to see a bug in the IR, try to get libc to pass and, in extremis, try to build rocm from source in case the debugger helps.

@JonChesterfield JonChesterfield force-pushed the jc_varargs_amdgpu branch 2 times, most recently from 682ba92 to db14ca7 Compare May 28, 2024 14:41
// suffice here -Wno-varargs avoids warning second argument to 'va_start' is not
// the last named parameter

// RUN: %clang_cc1 %s -triple wasm32-unknown-unknown -Wno-varargs -O1 -emit-llvm -o - | opt - -S --passes='module(expand-variadics,default<O1>)' --expand-variadics-override=optimize -o - | FileCheck %s
Contributor

Does this need REQUIRES: wasm-registered-target?

Comment on lines 107 to 108
unsigned ArgumentIndex = 0;
const unsigned numFixedArguments = FI.getNumRequiredArgs();
Contributor

Can you split the clang AMDGPU ABI changes into a separate PR? The tests for this are also missing.

Collaborator Author

Yes, I think I can do that - the ABI change only affects variadic functions, which currently hit a fatal_error anyway - but I think the C-to-IR tests will succeed as long as nothing calls va_arg and it stops before codegen.

Collaborator Author

Added the test to this PR and also split out #94083. Can land that subpatch first and rebase this for a reduction in complexity.

JonChesterfield added a commit to JonChesterfield/llvm-project that referenced this pull request Jun 6, 2024
Pass variadic arguments without changing their type, unlike the fixed ones. Fixed arguments are modified to better fit into registers; this patch leaves that handling unchanged. Splitting struct types into individual fields and packing small structs into integers works well for passing via registers, but variadic arguments are currently unimplemented in the backend and are likely to be implemented as a pointer to stack memory, in which case register-themed optimisations are inapplicable.

Splitting a struct into fields also makes it difficult to implement va_arg robustly. Rules around padding and alignment to invert the struct splitting could be constructed, but at high complexity and for no particular advantage. Passing types as-is means there is a 1:1 correspondence between the type information va_arg has to work with and the parameter type at the call site.

This is an ABI change, but since the only functions affected are variadic ones, which are presently a compilation error, it is not a functional break. Factored out of the larger llvm#93362 and can land independently.
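A caller-side C sketch of what that rule means in practice (log_pairs and pair_t are hypothetical names, not from the patch): the struct in the variadic position reaches the callee with exactly the type written at the call site, so the callee's va_arg can name the same type.

```c
typedef struct {
  int x;
  double y;
} pair_t; /* hypothetical example type */

/* Variadic callee declared elsewhere; it would read the struct back with
   va_arg(ap, pair_t), relying on the call below passing it unmodified. */
void log_pairs(int n, ...);

void caller(void) {
  pair_t p = {1, 2.5};
  /* 'n' is a fixed argument and may still be register-packed as before; 'p'
     is in the variadic portion and is passed with its own type, not split. */
  log_pairs(1, p);
}
```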
JonChesterfield added a commit to JonChesterfield/llvm-project that referenced this pull request Jun 6, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jun 14, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jun 14, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 19, 2024
Summary: This patch implements support for variadic functions for NVPTX targets. The implementation here mainly follows what was done to implement it for AMDGPU in llvm#93362. We change the NVPTX codegen to lower all variadic arguments to functions by-value. This creates a flattened set of arguments that the IR lowering pass converts into a struct with the proper alignment.

The behavior was determined by iteratively checking what the NVCC compiler generates for its output. See examples like https://godbolt.org/z/KavfTGY93. I have noted the main methods that NVIDIA uses to lower variadic functions:

1. All arguments are passed in a pointer to aggregate.
2. The minimum alignment for a plain argument is 4 bytes.
3. Alignment is dictated by the underlying type.
4. Structs are flattened and do not have their alignment changed.
5. NVPTX never passes any arguments indirectly, even very large ones.

This patch passes the tests in the `libc` project currently, including support for `sprintf`.
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 19, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 19, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 19, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 21, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 25, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jul 1, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jul 12, 2024
jhuber6 added a commit that referenced this pull request Jul 12, 2024
Summary: This patch implements support for variadic functions for NVPTX targets. The implementation here mainly follows what was done to implement it for AMDGPU in #93362. We change the NVPTX codegen to lower all variadic arguments to functions by-value. This creates a flattened set of arguments that the IR lowering pass converts into a struct with the proper alignment.

The behavior was determined by iteratively checking what the NVCC compiler generates for its output. See examples like https://godbolt.org/z/KavfTGY93. I have noted the main methods that NVIDIA uses to lower variadic functions:

1. All arguments are passed in a pointer to aggregate.
2. The minimum alignment for a plain argument is 4 bytes.
3. Alignment is dictated by the underlying type.
4. Structs are flattened and do not have their alignment changed.
5. NVPTX never passes any arguments indirectly, even very large ones.

This patch passes the tests in the `libc` project currently, including support for `sprintf`.
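A rough illustration of the aggregate those rules imply, assuming a hypothetical call v('a', 2.0, 3) to a variadic v() and a typical 64-bit layout: the char promotes to int, every member gets at least 4-byte alignment, and wider types keep their natural alignment, so the callee's va_arg walks a buffer shaped like the struct below.

```c
#include <stddef.h>

struct v_call_frame {
  int promoted_char; /* offset 0: 'a' after default argument promotion */
  double d;          /* offset 8: natural 8-byte alignment */
  int i;             /* offset 16 */
};

/* On a typical 64-bit host these offsets hold, mirroring the stated rules. */
_Static_assert(offsetof(struct v_call_frame, d) == 8, "double at offset 8");
_Static_assert(offsetof(struct v_call_frame, i) == 16, "int at offset 16");
```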
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request Jul 17, 2024
jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request Jul 17, 2024

Labels

backend:AMDGPU backend:WebAssembly clang:codegen IR generation bugs: mangling, exceptions, etc. clang Clang issues not falling into any other category cmake Build system in general and CMake in particular libc llvm:ir llvm:transforms

7 participants