Problem about cudaMemcpy3D()

Quoc_Vinh · September 22, 2008, 4:15am

Hi everybody.

i have a problem about cudaMemcpy3D().

when I use this function to copy data from host to global memory. it is OK

int numOfElements = dimx *dimy *dimz; float* hostInput = 0;	hostInput = new float[numOfElements];	for(int i = 0; i < numOfElements; ++i)	{  hostInput[i] = (float)i;	} float* deviceData =	0;	cudaPitchedPtr pitchPtr = make_cudaPitchedPtr( (void*)deviceData, dimx*sizeof(float), dimx, dimy );	printf("cudaPitchedPtr: %s\n", cudaGetErrorString(cudaGetLastError()));	cudaExtent ca_extent;	ca_extent.width  = dimx;	ca_extent.height = dimy;	ca_extent.depth  = dimz;	cudaMalloc3D( &pitchPtr, ca_extent);	cudaMemset3D( pitchPtr, 0, ca_extent);	cudaMemcpy3DParms cpy_params = {0};	cpy_params.srcPtr   = make_cudaPitchedPtr( (void*)hostInput, dimx * sizeof(float), dimx, dimy );	cpy_params.dstPtr   = pitchPtr;	cpy_params.extent   = ca_extent;	cpy_params.kind     = cudaMemcpyHostToDevice;	cudaMemcpy3D( &cpy_params );	printf("cudaMemcpy3D: %s\n", cudaGetErrorString(cudaGetLastError()));

but when I use this function to copy data from host to cudaArray, i get a problem.

“unspecified launch failure”

int numOfElements = dimx *dimy *dimz; float* hostInput = 0;	hostInput = new float[numOfElements ];	for(int i = 0; i < numOfElements; ++i)	{  hostInput[i] = (float)i;	}       cudaChannelFormatDesc ca_descriptor = cudaCreateChannelDesc<float>();	cudaExtent ca_extent;	ca_extent.width  = dimx;	ca_extent.height = dimy;	ca_extent.depth  = dimz;	cudaArray *cudaArrayData = 0;	cudaMalloc3DArray( &cudaArrayData, &ca_descriptor, ca_extent );	cudaBindTextureToArray( textureRef, cudaArrayData, ca_descriptor );	cudaMemcpy3DParms cpy_params = {0};	cpy_params.srcPtr   = make_cudaPitchedPtr( (void*)hostInput, dimx * sizeof(float), dimx, dimy );	cpy_params.extent   = ca_extent;	cpy_params.kind     = cudaMemcpyHostToDevice;	cpy_params.dstArray = cudaArrayData;	CUDA_SAFE_CALL( cudaMemcpy3D(&cpy_params) );	printf("cudaMemcpy3D: %s\n", cudaGetErrorString(cudaGetLastError()));

ojaswa · October 9, 2008, 1:45pm

Hi,

The function documentation for cudaMalloc3D and cudaMemcpy3D seem quite confusing.

Here is my code:

Allocation:

cudaExtent pitchedVolSize = make_cudaExtent(subvolSize.width*sizeof(float), subvolSize.height, subvolSize.depth);  CUDA_SAFE_CALL(cudaMalloc3D(&d_volPPtr, pitchedVolSize));

memcpy:

 cudaMemcpy3DParms copyParams = {0};  copyParams.srcPtr = d_volPPtr;  copyParams.dstPtr = make_cudaPitchedPtr((void*)h_subvol, (SUBVOL_DIM-2)*sizeof(float), SUBVOL_DIM-2, SUBVOL_DIM-2);  copyParams.kind = cudaMemcpyDeviceToHost;  copyParams.extent = make_cudaExtent((SUBVOL_DIM-2)*sizeof(float), SUBVOL_DIM-2, SUBVOL_DIM-2);  copyParams.srcPos = make_cudaPos(1*sizeof(float), 1, 1); // We want to copy (SUBVOL_DIM-2)^3 startint at (1, 1, 1).  CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));

The documentation says that if the transfer doesn’t include cuda arrays, then the width parameter is always the width in bytes!

I’m not very sure if my code works, but would definitely like to have some good examples of mem allocation in 3D and 3D memcpy, specially from device to host.

-Ojas

ojaswa · October 10, 2008, 7:39am

Hi,

I just figured out how to perform 3D memory transfers (and subvolume copies). If someone wants the code, I can provide.

Cheers,
Oj

vandhu_confused · October 20, 2008, 9:25pm

Hey,

Could you please mail me the code (or paste it here) maybe with a small description…I’m running 700700700 arrays and really need this. Also if you have some code wherein blocking is done on them, that’ll be great!

Cheers,

Vandhan!

Ailleur · October 20, 2008, 10:49pm

This is how i do it.

cudaChannelFormatDesc tfloatTex = cudaCreateChannelDesc<short>();; const cudaExtent tvolumeSize = make_cudaExtent(pdose->terma->x_dim, pdose->terma->y_dim, pdose->terma->z_dim);	CUDA_SAFE_CALL( cudaMalloc3DArray(&tArray, &tfloatTex, tvolumeSize) );	cudaMemcpy3DParms tcopyParams = {0};	tcopyParams.srcPtr ï¿½ Â = make_cudaPitchedPtr((void*)pdose->terma->data, tvolumeSize.width*sizeof(short), tvolumeSize.width, tvolumeSize.height);	tcopyParams.dstArray = tArray; ï¿½ Â Â tcopyParams.extent ï¿½ Â = tvolumeSize; ï¿½ Â Â tcopyParams.kind ï¿½ Â ï¿½ Â = cudaMemcpyHostToDevice; ï¿½ Â Â CUDA_SAFE_CALL( cudaMemcpy3D(&tcopyParams) );

Hope this helps

callingearthling · October 21, 2008, 12:15am

Hey thanks a ton!

Another doubt…Something very similiar…Say I have a n^3 array (cube) which I need to break up into smaller arrays (cubes) of p^3 size (i.e. I have to do blocking!). The small p^3 block is worked on and another p^3 block generated in the device kernel. I realised copying the full n^3 array into global space using solution above would be very inefficient and not scalable. ( Sizes start at 700700700 elements)

The other solution is passing p^3 elements to the block and getting back p^3 element. How do I do this in CUDA? [ Small code with function parameters, etc would be helpful ]When I pass these as function parameters to kernel, don’t they get copied into shared memory ? (i.e. only about 16 KB).

Also finally once I have got these p^3 cubes (from n^3/p^3 blocks), how do I join them together to get one massive n^3 array?

Thanks,

Vandhan

This is how i do it.

cudaChannelFormatDesc tfloatTex = cudaCreateChannelDesc<short>();; const cudaExtent tvolumeSize = make_cudaExtent(pdose->terma->x_dim, pdose->terma->y_dim, pdose->terma->z_dim);	CUDA_SAFE_CALL( cudaMalloc3DArray(&tArray, &tfloatTex, tvolumeSize) );	cudaMemcpy3DParms tcopyParams = {0};	tcopyParams.srcPtr ï¿½ Â = make_cudaPitchedPtr((void*)pdose->terma->data, tvolumeSize.width*sizeof(short), tvolumeSize.width, tvolumeSize.height);	tcopyParams.dstArray = tArray; ï¿½ Â Â tcopyParams.extent ï¿½ Â = tvolumeSize; ï¿½ Â Â tcopyParams.kind ï¿½ Â ï¿½ Â = cudaMemcpyHostToDevice; ï¿½ Â Â CUDA_SAFE_CALL( cudaMemcpy3D(&tcopyParams) );

Hope this helps

[snapback]454450[/snapback]

Ailleur · October 21, 2008, 1:38am

Im not sure i get what youre asking…

There is nothing different, that i can see, in the way to pass n^3 instead of passing n^3.

You memcpy it to the device, work on it, and get it back.

You cant pass data directly to shared memory. Threads will fetch the data from global memory to shared memory.

If youre talking about using a struct and passing it as a parameter, then im not sure how that would work out. Id be interested in that answer as well!

As far as building them back into a big array, youll be keeping track of which sub array youre working on, its just gonna be a matter a setting the right offsets when copying back.

Maybe theres some clever way to still use a big memcpy that im not thinking off, but as far as i can tell in this 5 minutes analysis, youll have to memcpy it to a temporary array then loop with the right offsets in your “big array” to copy the data item by item.

I mean, you could structure your big array to be [block1x][block1y][block1z][block2x][…]… which would allow you to do memcpy’s from the device to the host, but i guess youd have to structure that array back into shape if you have some other processing to do on it that assumes a “normal” configuration.

Anyway, ive rambled a bit more than i thought i would, with no real substance!

Hey thanks a ton!

Another doubt…Something very similiar…Say I have a n^3 array (cube) which I need to break up into smaller arrays (cubes) of p^3 size (i.e. I have to do blocking!). The small p^3 block is worked on and another p^3 block generated in the device kernel. I realised copying the full n^3 array into global space using solution above would be very inefficient and not scalable. ( Sizes start at 700700700 elements)

The other solution is passing p^3 elements to the block and getting back p^3 element. How do I do this in CUDA? [ Small code with function parameters, etc would be helpful ]When I pass these as function parameters to kernel, don’t they get copied into shared memory ? (i.e. only about 16 KB).

Also finally once I have got these p^3 cubes (from n^3/p^3 blocks), how do I join them together to get one massive n^3 array?

Thanks,

Vandhan

[snapback]454476[/snapback]

ojaswa · October 21, 2008, 6:01am

Hi folks,

I’ll post of my memcopy (3D host → array, host → device, device->host) and blocking code later this afternoon. I used arrays to bind them to 3D textures so that I get the speedup while accessing 3D elements (using shared mem is another possibility).

One the host side, as Ailleur has said, you must do some memory management to keep correct subvolume offsets. My application required shared subvolumes (i’.e, subvolumes with 1 or 2 voxel shared border). Otherwise, if you just need to slice the volume without subvolume sharing, the code will look cleaner.

Will post it soon,

Cheers,
Oj

ojaswa · October 21, 2008, 1:04pm

#define SUBVOL_DIM 128 cudaExtent volSize = make_cudaExtent(...); cudaArray *d_volArray = 0; // subvolume bound to texture cudaPitchedPtr d_volPPtr;// subvolume in device memory float* h_vol = NULL; // The full volume on host int iDivUp(int a, int b) {  return ((a % b) != 0)? (a / b + 1): (a / b); } //Initialization and mem allocations void initCuda() { ... cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>(); cudaExtent subvolSize = make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM); CUDA_SAFE_CALL(cudaMalloc3DArray(&d_volArray, &channelDesc, subvolSize)); ... cudaExtent pitchedVolSize = make_cudaExtent(SUBVOL_DIM*sizeof(float), SUBVOL_DIM, SUBVOL_DIM); CUDA_SAFE_CALL(cudaMalloc3D(&d_volPPtr, pitchedVolSize)); ... }

Host to array copy:

void copy3DHostToArray(float *_src, cudaArray *_dst, cudaExtent copy_extent, cudaPos src_offset) {  cudaMemcpy3DParms copyParams = {0};  float *h_source = _src + src_offset.x + src_offset.y*volSize.width + src_offset.z*volSize.width*volSize.height;  copyParams.srcPtr = make_cudaPitchedPtr((void*)h_source, volSize.width*sizeof(float), volSize.width, volSize.height);  copyParams.dstArray = _dst;  copyParams.kind = cudaMemcpyHostToDevice;  copyParams.extent = copy_extent;  CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));  CUT_CHECK_ERROR("Host -> Array Memcpy failed\n"); }

Device mem to array copy:

void copy3DMemToArray(cudaPitchedPtr _src, cudaArray *_dst) {  cudaMemcpy3DParms copyParams = {0};  copyParams.srcPtr =  _src;  copyParams.dstArray = _dst;  copyParams.kind = cudaMemcpyDeviceToDevice;  copyParams.extent = make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM);  CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));  CUT_CHECK_ERROR("Mem -> Array Memcpy failed\n"); }

Device mem to host mem copy:

void copy3DMemToHost(cudaPitchedPtr _src, float *_dst, cudaExtent copy_extent, cudaExtent dst_extent, cudaPos src_offset, cudaPos dst_offset) {  cudaMemcpy3DParms copyParams = {0};  copyParams.srcPtr = _src;  float *h_target = _dst + dst_offset.x + dst_offset.y*dst_extent.width + dst_offset.z*dst_extent.width*dst_extent.height;//For some reason, using copyParams.dstPos doesn't give correct results, so we set the offset here.  copyParams.dstPtr = make_cudaPitchedPtr((void*)h_target, dst_extent.width*sizeof(float), dst_extent.width, dst_extent.height);  copyParams.kind = cudaMemcpyDeviceToHost;  copyParams.extent = make_cudaExtent(copy_extent.width*sizeof(float), copy_extent.height, copy_extent.depth);  copyParams.srcPos = make_cudaPos(src_offset.x*sizeof(float), src_offset.y, src_offset.z); // We want to copy copy_extent sized volume starting at (x_off, y_off, z_off).  CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));  CUT_CHECK_ERROR("Mem -> Host Memcpy failed\n"); }

Memory management (note that there is a one voxel border around every subvolume which is shared with other subvolumes):

  cudaExtent subvolIndicesExtents = make_cudaExtent(iDivUp(volSize.width-2, SUBVOL_DIM-2), iDivUp(volSize.height-2, SUBVOL_DIM-2), iDivUp(volSize.depth-2, SUBVOL_DIM-2)); for(int _z = 0; _z< subvolIndicesExtents.depth; _z++)      for(int _y = 0; _y< subvolIndicesExtents.height; _y++)        for(int _x = 0; _x< subvolIndicesExtents.width; _x++)        {   //copy the subvolume to texture   copy3DHostToArray(h_vol, d_volArray, make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM), make_cudaPos(_x*(SUBVOL_DIM-2), _y*(SUBVOL_DIM-2), _z*(SUBVOL_DIM-2)));     //fprintf(stderr, "->%s", cudaGetErrorString(cudaGetLastError()));   //run a kernel on subvolume. reads from texture (via d_volArray)and writes to d_volPPtr   d_kernel<<<gridSize, blockSize>>>(d_volPPtr, ...);     CUT_CHECK_ERROR("Kernel failed");     //fprintf(stderr, "---%d-%d-%d %s---", _x, _y, _z, cudaGetErrorString(cudaGetLastError()));     cudaThreadSynchronize();   //Copy results back to host mem from device mem   dst_off.x = 1 + _x*(SUBVOL_DIM-2); dst_off.y = 1 + _y*(SUBVOL_DIM-2); dst_off.z = 1 + _z*(SUBVOL_DIM-2);    copy3DMemToHost(d_volPPtr, h_phi, copyvol, volSize, src_off, dst_off);     //fprintf(stderr, "%s<-\n", cudaGetErrorString(cudaGetLastError()));        }

Notes:

copy3DMemToHost() is the most generic of the three functions, but t shouldn’t be difficult to do the same with other two. I have hard-coded some values in the other two.
In my experience, setting a copy position (offset) on the host pitched ptr (host memory) never worked. So, I set the correct offset using pointer arithmetic myself and it works. Does anyone know why? Setting offsets on device memory and arrays always works.
In case of normal 3D memory on device and host, always set the first argument of cudaPos, cudaExtent to the offset along X in “bytes”. While, with arrays, this has to be in number of elements in X direction. Anyway, this is documented in the API guide.

Most of the code is trivial and taken from cuda examples.

Hope it helps.

Cheers,

Ojaswa

vandhu_confused · October 26, 2008, 11:02pm

Hi,

Another small doubt. How do I go about accessing the elements of d_volArray and d_volPPtr in the kernel? I want to read from d_volArray, run from modifications and store in d_volPPtr? A small code for the kernel function to do this would be awesome.

Thanks,

Vandhan!

#define SUBVOL_DIM 128 cudaExtent volSize = make_cudaExtent(...); cudaArray *d_volArray = 0; // subvolume bound to texture cudaPitchedPtr d_volPPtr;// subvolume in device memory float* h_vol = NULL; // The full volume on host int iDivUp(int a, int b) { Â return ((a % b) != 0)? (a / b + 1): (a / b); } //Initialization and mem allocations void initCuda() { ... cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>(); cudaExtent subvolSize = make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM); CUDA_SAFE_CALL(cudaMalloc3DArray(&d_volArray, &channelDesc, subvolSize)); ... cudaExtent pitchedVolSize = make_cudaExtent(SUBVOL_DIM*sizeof(float), SUBVOL_DIM, SUBVOL_DIM); CUDA_SAFE_CALL(cudaMalloc3D(&d_volPPtr, pitchedVolSize)); ... }

Host to array copy:

void copy3DHostToArray(float *_src, cudaArray *_dst, cudaExtent copy_extent, cudaPos src_offset) { Â cudaMemcpy3DParms copyParams = {0}; Â float *h_source = _src + src_offset.x + src_offset.y*volSize.width + src_offset.z*volSize.width*volSize.height; Â copyParams.srcPtr = make_cudaPitchedPtr((void*)h_source, volSize.width*sizeof(float), volSize.width, volSize.height); Â copyParams.dstArray = _dst; Â copyParams.kind = cudaMemcpyHostToDevice; Â copyParams.extent = copy_extent; Â CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams)); Â CUT_CHECK_ERROR("Host -> Array Memcpy failed\n"); }

Device mem to array copy:

void copy3DMemToArray(cudaPitchedPtr _src, cudaArray *_dst) { Â cudaMemcpy3DParms copyParams = {0}; Â copyParams.srcPtr = Â _src; Â copyParams.dstArray = _dst; Â copyParams.kind = cudaMemcpyDeviceToDevice; Â copyParams.extent = make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM); Â CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams)); Â CUT_CHECK_ERROR("Mem -> Array Memcpy failed\n"); }

Device mem to host mem copy:

void copy3DMemToHost(cudaPitchedPtr _src, float *_dst, cudaExtent copy_extent, cudaExtent dst_extent, cudaPos src_offset, cudaPos dst_offset) { Â cudaMemcpy3DParms copyParams = {0}; Â copyParams.srcPtr = _src; Â float *h_target = _dst + dst_offset.x + dst_offset.y*dst_extent.width + dst_offset.z*dst_extent.width*dst_extent.height;//For some reason, using copyParams.dstPos doesn't give correct results, so we set the offset here. Â copyParams.dstPtr = make_cudaPitchedPtr((void*)h_target, dst_extent.width*sizeof(float), dst_extent.width, dst_extent.height); Â copyParams.kind = cudaMemcpyDeviceToHost; Â copyParams.extent = make_cudaExtent(copy_extent.width*sizeof(float), copy_extent.height, copy_extent.depth); Â copyParams.srcPos = make_cudaPos(src_offset.x*sizeof(float), src_offset.y, src_offset.z); // We want to copy copy_extent sized volume starting at (x_off, y_off, z_off). Â CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams)); Â CUT_CHECK_ERROR("Mem -> Host Memcpy failed\n"); }

Memory management (note that there is a one voxel border around every subvolume which is shared with other subvolumes):

Â   cudaExtent subvolIndicesExtents = make_cudaExtent(iDivUp(volSize.width-2, SUBVOL_DIM-2), iDivUp(volSize.height-2, SUBVOL_DIM-2), iDivUp(volSize.depth-2, SUBVOL_DIM-2)); for(int _z = 0; _z< subvolIndicesExtents.depth; _z++) Â  Â  Â for(int _y = 0; _y< subvolIndicesExtents.height; _y++) Â  Â  Â  Â for(int _x = 0; _x< subvolIndicesExtents.width; _x++) Â  Â  Â  Â { Â  //copy the subvolume to texture Â  copy3DHostToArray(h_vol, d_volArray, make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM), make_cudaPos(_x*(SUBVOL_DIM-2), _y*(SUBVOL_DIM-2), _z*(SUBVOL_DIM-2))); Â  Â  //fprintf(stderr, "->%s", cudaGetErrorString(cudaGetLastError())); Â  //run a kernel on subvolume. reads from texture (via d_volArray)and writes to d_volPPtr Â  d_kernel<<<gridSize, blockSize>>>(d_volPPtr, ...); Â  Â  CUT_CHECK_ERROR("Kernel failed"); Â  Â  //fprintf(stderr, "---%d-%d-%d %s---", _x, _y, _z, cudaGetErrorString(cudaGetLastError())); Â  Â  cudaThreadSynchronize(); Â  //Copy results back to host mem from device mem Â  dst_off.x = 1 + _x*(SUBVOL_DIM-2); dst_off.y = 1 + _y*(SUBVOL_DIM-2); dst_off.z = 1 + _z*(SUBVOL_DIM-2); Â  Â copy3DMemToHost(d_volPPtr, h_phi, copyvol, volSize, src_off, dst_off); Â  Â  //fprintf(stderr, "%s<-\n", cudaGetErrorString(cudaGetLastError())); Â  Â  Â  Â }

Notes:

copy3DMemToHost() is the most generic of the three functions, but t shouldn’t be difficult to do the same with other two. I have hard-coded some values in the other two.
In my experience, setting a copy position (offset) on the host pitched ptr (host memory) never worked. So, I set the correct offset using pointer arithmetic myself and it works. Does anyone know why? Setting offsets on device memory and arrays always works.
In case of normal 3D memory on device and host, always set the first argument of cudaPos, cudaExtent to the offset along X in “bytes”. While, with arrays, this has to be in number of elements in X direction. Anyway, this is documented in the API guide.

Most of the code is trivial and taken from cuda examples.

Hope it helps.

Cheers,

Ojaswa

[snapback]454672[/snapback]

Topic		Replies	Views
Setting up 3d arryas I have some questions about how to use 3d arrays and cudaArrays CUDA Programming and Performance	10	27957	April 5, 2010
cudaMemcpy3D and cudaMalloc cudaDeviceToDevice copies CUDA Programming and Performance	4	10618	June 29, 2010
copy 3D data from host to device CUDA Programming and Performance	3	9883	December 1, 2010
How to use cudaMalloc3DArray to copy dynamic 3d array from host to device? CUDA Programming and Performance	2	3106	September 2, 2011
cudaMemcpy3D for allocating arrays for textures only? CUDA Programming and Performance	3	1185	March 2, 2011
Issues with cuda Memcpy3D and pitched memory CUDA Programming and Performance	0	985	August 1, 2013
simple cudaMemcpy3D question CUDA Programming and Performance	0	783	April 6, 2011
cudaMemcpy3D problem CUDA Programming and Performance	4	2961	March 22, 2009
How to allocate a 3d array such that you can use the indecies to access its elements CUDA Programming and Performance	20	5469	October 24, 2009
Writing to 3D texture CUDA Programming and Performance	10	20016	December 28, 2010

Problem about cudaMemcpy3D()

Related topics