Macro expansion bug? NVCC dying with internal error

spencer · April 11, 2007, 7:45pm

I am trying to port the following code constructs to run on the GPU, and nvcc is dying with internal errors.

Here is the relevant code fragment from my hacked-up version of the x264 codec's pixel.c:

#define PIXEL_SAD_C( name, lx, ly ) \
DEVICE int name( uint8_t *pix1, int i_stride_pix1,  \
                 uint8_t *pix2, int i_stride_pix2 ) \
{                                                   \
    int i_sum = 0;                                  \
    int x, y;                                       \
    for( y = 0; y < ly; y++ )                       \
    {                                               \
        for( x = 0; x < lx; x++ )                   \
        {                                           \
            i_sum += abs( pix1[x] - pix2[x] );      \
        }                                           \
        pix1 += i_stride_pix1;                      \
        pix2 += i_stride_pix2;                      \
    }                                               \
    return i_sum;                                   \
}

PIXEL_SAD_C( pixel_sad_16x16, 16, 16 )
PIXEL_SAD_C( pixel_sad_16x8,  16,  8 )
PIXEL_SAD_C( pixel_sad_8x16,   8, 16 )
PIXEL_SAD_C( pixel_sad_8x8,    8,  8 )
PIXEL_SAD_C( pixel_sad_8x4,    8,  4 )
PIXEL_SAD_C( pixel_sad_4x8,    4,  8 )
PIXEL_SAD_C( pixel_sad_4x4,    4,  4 )
PIXEL_SAD_C( pixel_sad_4x2,    4,  2 )
PIXEL_SAD_C( pixel_sad_2x4,    2,  4 )
PIXEL_SAD_C( pixel_sad_2x2,    2,  2 )

typedef int  (*x264_pixel_cmp_t) ( uint8_t *, int, uint8_t *, int );

DEVICE x264_pixel_cmp_t pixel_sad[10] = {
    pixel_sad_16x16,
    pixel_sad_16x8,
    pixel_sad_8x8,
    pixel_sad_8x8,
    pixel_sad_8x4,
    pixel_sad_4x8,
    pixel_sad_4x4,
    pixel_sad_4x2,
    pixel_sad_2x4,
    pixel_sad_2x2
};

And later on I call it like this:

results[ tid ] = pixel_sad[i_pixel]( x_pixels,
                                     FENC_STRIDE,
                                     y_pixels + __umul24( mb_y, i_stride ) + mb_x,
                                     i_stride )
                 + p_cost_mvx[mb_x<<2] + p_cost_mvy[mb_y<<2];

So is this an illegal code construct? I tried to declare the functions expanded from the macro as __device__ (#define DEVICE __device__), since they are only called on the GPU in this context. None of the device function restrictions obviously apply here, except maybe the restriction that __device__ functions cannot have their address taken.

I know I can expand the macro by hand and declare the functions individually, but this way is more maintainable, and expanding the macro won't get around issues with pointers to device functions (if that is the problem).

Suggestions?

Spencer

JaredHoberock · April 11, 2007, 8:03pm

Pointers to __device__ functions are illegal. See section 4.2.1.4 of the CUDA Programming Guide.

pyrtsa · April 11, 2007, 8:23pm

Hi,

I bet the problem is here, in trying to use a pointer to a device function. Device functions are, to my knowledge, inlined by default, so there is no way to split your problem like that. To me it looks like your code could be rewritten without preprocessor defines, so that the constant parameters lx and ly become actual parameters to a single function declared as

int pixel_sad(uint8_t * pix1, int i_stride_pix1, uint8_t * pix2, int i_stride_pix2, int lx, int ly);

with no performance penalty compared to your code.

Calling that function sequentially with different values of lx and ly, however, probably has to be done manually, because compiler v. 0.8 does not unroll for loops.

/Pyry
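
A minimal sketch of the single-function rewrite suggested above, for illustration only. It keeps the loop body of the PIXEL_SAD_C macro and turns lx and ly into ordinary parameters. The DEVICE guard is an assumption (it expands to __device__ under nvcc and to nothing in a host build), added so the same source can be checked on the CPU.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: DEVICE is __device__ under nvcc and empty for a host build. */
#ifdef __CUDACC__
#define DEVICE __device__
#else
#define DEVICE
#endif

/* Sum of absolute differences over an lx-by-ly block.
 * Same loop body as the PIXEL_SAD_C macro; the block size is now
 * passed in as parameters rather than baked in by the preprocessor. */
DEVICE int pixel_sad( uint8_t *pix1, int i_stride_pix1,
                      uint8_t *pix2, int i_stride_pix2,
                      int lx, int ly )
{
    int i_sum = 0;
    int x, y;
    for( y = 0; y < ly; y++ )
    {
        for( x = 0; x < lx; x++ )
        {
            i_sum += abs( pix1[x] - pix2[x] );
        }
        pix1 += i_stride_pix1;   /* advance both blocks by one row */
        pix2 += i_stride_pix2;
    }
    return i_sum;
}
```

A call such as pixel_sad( p1, s1, p2, s2, 16, 16 ) then takes the place of pixel_sad_16x16( p1, s1, p2, s2 ), and no function pointer is involved.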

spencer · April 11, 2007, 8:59pm

OK but that is a very poor way for the compiler to tell me that. :)

spencer · April 11, 2007, 9:04pm

If there is no support for loop unrolling, you are quite correct. A more generalized implementation like the one you suggested would work just as well, though there might be a slight performance penalty because two more parameters have to be pushed onto the stack, and possibly more registers used, compared with a table lookup.

Regards,

Spencer

pyrtsa · April 12, 2007, 7:25am

Again, because of implicit device function inlining, there should be no extra stack operations for literal constants passed as function parameters, if the compiler does things right. And even if there were, global memory accesses would most probably remain the bottleneck.

/Pyry
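
One way the function pointer table could be replaced while still passing literal constants at each call site is a switch on the block index. This is only a sketch under stated assumptions: pixel_sad_wh and pixel_sad_select are hypothetical names, the -1 return for an unknown index is an arbitrary choice, and index 2 is assumed to mean 8x16 (the posted table lists pixel_sad_8x8 twice, while the macro list defines pixel_sad_8x16). Whether a given compiler actually folds the constants after inlining is not something this sketch can guarantee.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: DEVICE is __device__ under nvcc and empty for a host build. */
#ifdef __CUDACC__
#define DEVICE __device__
#else
#define DEVICE
#endif

/* Generic SAD with the block size as parameters (hypothetical name). */
DEVICE int pixel_sad_wh( uint8_t *pix1, int i_stride_pix1,
                         uint8_t *pix2, int i_stride_pix2,
                         int lx, int ly )
{
    int i_sum = 0;
    int x, y;
    for( y = 0; y < ly; y++ )
    {
        for( x = 0; x < lx; x++ )
            i_sum += abs( pix1[x] - pix2[x] );
        pix1 += i_stride_pix1;
        pix2 += i_stride_pix2;
    }
    return i_sum;
}

/* Hypothetical stand-in for pixel_sad[i_pixel]( ... ): every case calls
 * the generic function with literal block dimensions. */
DEVICE int pixel_sad_select( int i_pixel,
                             uint8_t *pix1, int i_stride_pix1,
                             uint8_t *pix2, int i_stride_pix2 )
{
    switch( i_pixel )
    {
        case 0:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2, 16, 16 );
        case 1:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2, 16,  8 );
        case 2:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2,  8, 16 ); /* assumed 8x16 */
        case 3:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2,  8,  8 );
        case 4:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2,  8,  4 );
        case 5:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  8 );
        case 6:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  4 );
        case 7:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  2 );
        case 8:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2,  2,  4 );
        case 9:  return pixel_sad_wh( pix1, i_stride_pix1, pix2, i_stride_pix2,  2,  2 );
        default: return -1; /* hypothetical error value for an unknown index */
    }
}
```

The original call would then read pixel_sad_select( i_pixel, x_pixels, FENC_STRIDE, ... ) in place of pixel_sad[i_pixel]( x_pixels, FENC_STRIDE, ... ).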
