What you’re doing wrong is looking at the PTX. Look at the SASS. It’s rarely a good idea to judge code based on the PTX, because the PTX is itself compiled (by ptxas) before it becomes the machine code that actually runs.
Another suggestion I have is to include a simple, complete code that others can copy, paste, compile, and inspect. If you want help, the smart move is to make it easy for others to help you. This also precludes the possibility of a difference between what others guess you are trying to do and what you actually intend.
Here’s a simple, complete code built around the snippets you have shown:
#include <vector_types.h>

inline __host__ __device__ float4 my_sqrt(float4 a)
{
  float4 b;
  b.x = sqrt(a.x);
  b.y = sqrt(a.y);
  b.z = sqrt(a.z);
  b.w = sqrt(a.w);
  return b;
}

__global__ void kernel(float4* p)
{
  float4 a = p[blockIdx.x*blockDim.x + threadIdx.x];
  float4 b = my_sqrt(a);
  p[blockIdx.x*blockDim.x + threadIdx.x] = b;

}

int main(){
  float4 *d_data = NULL;
  kernel<<<1,1>>>(d_data);
  cudaDeviceSynchronize();
  return 0;
}
note that I have chosen to give my function a name other than sqrt, to make disambiguation easier later.
I chose to compile with:
nvcc -O3 -arch=sm_50 -lineinfo -cubin -o t608.cubin t608.cu
I then disassembled the resultant SASS with:
nvdisasm --print-line-info t608.cubin
and the relevant part of the output (the kernel code) is:
//--------------------- .text._Z6kernelP6float4 --------------------------
        .section        .text._Z6kernelP6float4,"ax",@progbits
        .sectioninfo    @"SHI_REGISTERS=16"
        .align          32
        .global         _Z6kernelP6float4
        .type           _Z6kernelP6float4,@function
        .size           _Z6kernelP6float4,(.L_36 - _Z6kernelP6float4)
        .other          _Z6kernelP6float4,@"STO_CUDA_ENTRY STV_DEFAULT"
_Z6kernelP6float4:
.text._Z6kernelP6float4:
        /*0008*/                   MOV R1, c[0x0][0x20];
                //## File "/home-2/robertc/misc/t608.cu", line 14
        /*0010*/                   S2R R0, SR_CTAID.X;
        /*0018*/                   S2R R2, SR_TID.X;
        /*0028*/                   MOV R3, c[0x0][0x8];
        /*0030*/                   XMAD R2, R3.reuse, R0.reuse, R2;
        /*0038*/                   XMAD.MRG R0, R3, R0.H1, RZ;
        /*0048*/                   XMAD.PSL.CBCC R0, R3.H1, R0.H1, R2;
        /*0050*/                   SHL.W R2, R0.reuse, 0x4;
        /*0058*/                   SHR.U32 R0, R0, 0x1c;
        /*0068*/                   IADD R2.CC, R2, c[0x0][0x140];
        /*0070*/                   IADD.X R3, R0, c[0x0][0x144];
        /*0078*/                   LDG.E.128 R8, [R2];
                //## File "/shared/apps/cuda/CUDA-v6.5.14/include/device_functions.h", line 3964
        /*0088*/                   MOV R4, R11;
        /*0090*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32);
        /*0098*/                   MOV R7, R4;
        /*00a8*/                   MOV R4, R10;
        /*00b0*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32);
        /*00b8*/                   MOV R6, R4;
        /*00c8*/                   MOV R4, R9;
        /*00d0*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32);
        /*00d8*/                   MOV R5, R4;
        /*00e8*/                   MOV R4, R8;
        /*00f0*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32);
                //## File "/home-2/robertc/misc/t608.cu", line 16
        /*00f8*/                   STG.E.128 [R2], R4;
        /*0108*/                   DEPBAR {0};
                //## File "/home-2/robertc/misc/t608.cu", line 18
        /*0110*/                   EXIT;
        .weak           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32
        .type           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32,@function
        .size           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32,($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath - $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32)
$_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32:
        /*0118*/                   IADD32I R0, R4, -0xd000000;
        /*0128*/                   ISETP.LE.U32.AND P0, PT, R0, c[0x2][0x0], PT;
        /*0130*/              @P0  BRA `(.L_1);
        /*0138*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath);
        /*0148*/                   BRA `(.L_2);
.L_1:
        /*0150*/                   MUFU.RSQ R13, R4;
        /*0158*/                   FMUL.FTZ R14, R13, R4;
        /*0168*/                   F2F.FTZ.F32.F32 R15, -R14;
        /*0170*/                   FMUL.FTZ R13, R13, 0.5;
        /*0178*/                   FFMA R0, R15, R14, R4;
        /*0188*/                   FFMA R4, R0, R13, R14;
.L_2:
        /*0190*/                   RET;
        .weak           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath
        .type           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath,@function
        .size           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath,(.L_36 - $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath)
$_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath:
        /*0198*/                   LOP.AND.NZ P0, RZ, R4, c[0x2][0x4];
        /*01a8*/             @!P0  BRA `(.L_3);
        /*01b0*/                   FSETP.LT.FTZ.AND P0, PT, R4, RZ, PT;
        /*01b8*/             @!P0  BRA `(.L_4);
        /*01c8*/                   MOV32I R4, 0x7fffffff;
        /*01d0*/                   BRA `(.L_3);
.L_4:
        /*01d8*/                   FSETP.LE.FTZ.AND P0, PT, |R4|, +INF , PT;
        /*01e8*/              @P0  BRA `(.L_5);
        /*01f0*/                   FADD.FTZ R4, R4, 1;
        /*01f8*/                   BRA `(.L_3);
.L_5:
        /*0208*/                   FSETP.EQ.FTZ.AND P0, PT, |R4|, +INF , PT;
        /*0210*/              @P0  BRA `(.L_3);
        /*0218*/                   FFMA R0, R4, 1.84467440737095516160e+19, RZ;
        /*0228*/                   MUFU.RSQ R11, R0;
        /*0230*/                   FMUL.FTZ R13, R11, R0;
        /*0238*/                   F2F.FTZ.F32.F32 R14, -R13;
        /*0248*/                   FMUL.FTZ R11, R11, 0.5;
        /*0250*/                   FFMA R0, R14, R13, R0;
        /*0258*/                   FFMA R0, R0, R11, R13;
        /*0268*/                   FMUL.FTZ R4, R0, 2.3283064365386962891e-10;
.L_3:
        /*0270*/                   RET;
.L_6:
        /*0278*/                   BRA `(.L_6);
.L_36:
(the "line 14" referenced near the top of the SASS output above refers to the source code: it is the first non-whitespace line in the kernel body)
So the function is inlined. There is no CAL to my_sqrt or anything like that; the only CAL instructions in the kernel code are to the CUDA math library routine that performs the basic sqrt operation on plain float values, once for each of the four components.
Regarding your statement about modifying the PTX and seeing a 25% speedup, I have no idea what you did. If you want to post enough information so that someone could reproduce your results, you may get more useful help.
If you want to learn how to use the binary utilities like nvdisasm, start here:
http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#abstract
note that compiling with -prec-sqrt=false will change the above output, but not in any way that pertains to the inlining of my_sqrt.
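If you want to verify that yourself, the recompile would look something like this (same file and output names as in my example above; the only change is the added -prec-sqrt=false flag, which permits the compiler to use an approximate square root):

```
nvcc -O3 -arch=sm_50 -prec-sqrt=false -lineinfo -cubin -o t608.cubin t608.cu
nvdisasm --print-line-info t608.cubin
```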
and if you want to see what it looks like when the my_sqrt function is actually called from kernel code, change the inline qualifier to __noinline__ in the source code I have posted, and repeat the steps.
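For reference, the changed function would look something like this sketch (I have dropped the __host__ qualifier here for simplicity, since __noinline__ pertains to the device-side compilation):

```
// __noinline__ asks the compiler not to inline this device function;
// with this declaration the SASS should show a CAL to my_sqrt itself
// instead of the function body appearing inline in the kernel.
__noinline__ __device__ float4 my_sqrt(float4 a)
{
  float4 b;
  b.x = sqrt(a.x);
  b.y = sqrt(a.y);
  b.z = sqrt(a.z);
  b.w = sqrt(a.w);
  return b;
}
```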