memory fragmentation (perhaps?)

vargant51 · February 9, 2009, 6:38am

Hi,

edit: my /previous/ issue, not my current one was memory corruption. I actually think I got rid of that…that was a different error message entirely. Something along the lines of glibc detected such-and-such* !prev, and then a map of allocated addresses and a stacktrace. Sorry about the confusing title.

I am researching quantum chemistry and plasma physics using a robust, perturbative, renormalized QFT formulated using MaxEnt, Spacetime Algebra/Bivector General Relativity, and active stochastic processes (see e.g. doi:10.1016/j.physletb.2004.03.036, doi:10.1016/j.physletb.2007.10.060, also check out publications by D. Hestenes and E.T. Jaynes). The result will be open-sourced as soon as I can debug it, so feel free to ask for code snippets, etc.

Right now, I’ve got a nasty problem. Using the runtime API, I need to allocate 2 float arrays, each 1286464, asynchronously on a 512mb version of the 8600GT (yes, I know, I’m cheap). Then I need to call my initialization kernel on an input file stored as a uchar*, which has only about 26 registers—I launch 2 streams, each 2 blocks of size (2,2,2). These sizes are subject to as much change as they need given the hardware. I do, however, want to test support for asynchronous transfers and launches if I can, because I made an object-oriented wrapper and am going to write template to abstract the initialization kernel (right now it’s just a dummy that acts as “cudaMemset”). Ideally, more powerful hardware would allow applications to perform multiple simulations without interference.

I don’t know exactly what needs to be fixed. Here is what seems to be the problem code, lines 36-76:

__host__ cystGPUMsim::cystGPUMsim(struct simParams *params, const char* buff){	dp=(cudaPitchedPtr*) malloc(sizeof(cudaPitchedPtr)*(nstreams));	width=params->width;	height=params->height;	depth=params->depth;	dT=params->dT;	dx=params->dx;	n=params->n;	grid=params->grid;	block=params->block;	max=params->max;	nstreams=params->nstreams;	ca_extent.width = width/nstreams;	ca_extent.height = height;	ca_extent.depth = depth;	unsigned char* dbuff;	cudaMalloc((void**) &dbuff, sizeof(buff));	cudaMemcpy(dbuff, buff, sizeof(buff), cudaMemcpyHostToDevice);	stream=(cudaStream_t*) malloc(sizeof(cudaStream_t)*nstreams);	for (int s = 0; s < nstreams; ++s){	dp[s]=make_cudaPitchedPtr( NULL, width/nstreams*sizeof(float), height, depth*4 );	cudaMalloc3D(&(dp[s]), ca_extent);	}	for(int i=0; i<int(width/nstreams); i+=grid.x*block.x){	for(int j=0; j<int(height); i+=grid.y*block.y){	for(int k=0; k<int(depth*4); i+=grid.z*block.z){	for (int s = 0; s < nstreams; ++s){	initialize<<<grid, block, 0, stream[s]>>>((float***) dp[s].ptr, dbuff, i,j,k);	}	cudaThreadSynchronize();	checkCUDAError(" ");	}	}	}	cudaFree(dbuff);	}

Compile-time yields no relevant warnings. What I see on runtime is the following error:

and the gdb stack says:

#0 0xb7f73b13 in ?? () from /cuda/lib/libcudart.so.2

#1 0xb7f431de in ?? () from /cuda/lib/libcudart.so.2

#2 0xb7f3f550 in ?? () from /cuda/lib/libcudart.so.2

#3 0xb7f73bfc in ?? () from /cuda/lib/libcudart.so.2

#4 0xb7fa1fcf in ?? () from /lib/ld-linux.so.2

#5 0xb7ceb084 in exit () from /lib/tls/i686/cmov/libc.so.6

#6 0x0805ce1a in checkCUDAError (msg=0x80604ab " ") at cyst.cu:30

#7 0x0805d34b in cystGPUMsim (this=0x8167a58, params=0x8167a08,
buff=0x80604ad "k") at cyst.cu:70 
#8 0x0805d554 in demo () at demo.cu:40

#9 0x0805db92 in main () at main.cpp:22.

I would use deviceemu to debug. But right now, if I compile it with device emulation, I receive the following warnings:

Then, when I run it, this odd segmentation fault occurs:

I think this has something to do with my host system’s being single-threaded; I am pretty sure the CUDA Programming Guide said you need at least enough threads on the host to run the device code.

I am running Ubuntu 8.04 32-bit Desktop Ed. with the Cuda 2.1 Beta Toolkit and SDK (my installation doesn’t seem to cause any problems), drivers 180.06 beta, and all of the most recent package updates (=> up-to-date GCC, etc.). I have a Pentium D 805, 1GB of RAM (DDR2, don’t know what speed), and a BFG Tech GeForce 8600 GT OC w/ 512 GB GDDR3 (quite a mouthful). Only nvcc settings are -G, -g, and -arch SM_11, which is the correct target (I checked. I had an atomicAdd that I got rid of, if you must know).

Any thoughts? Appreciate your comments.

tmurray · February 9, 2009, 7:10am

Can you post full source?

vargant51 · February 9, 2009, 7:44am

sure, sure. here goes:

cyst.h:

/* Copyright (c) 2009 Kevin Daley	The following code is part of cYst.	cYst is free software: you can redistribute it and/or modify	it under the terms of the GNU General Public License as published by	the Free Software Foundation, either version 3 of the License, or	(at your option) any later version.	cYst is distributed in the hope that it will be useful,	but WITHOUT ANY WARRANTY; without even the implied warranty of	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	GNU General Public License for more details.	You should have received a copy of the GNU General Public License	along with cYst. If not, see <http://www.gnu.org/licenses/>. */ #include <cuda_gl_interop.h> #include <cstdio> #include <cstdlib> #include <cmath> __global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k); extern int resx, resy; namespace cyst{ struct simParams{	float dT;	float n;	float4 dx;	dim3 grid;	dim3 block;	float max;	unsigned int width, height, depth;	int nstreams; }; class cystGPUMsim{ protected:	cudaPitchedPtr *dp;	float dT;	float4 dx;	float n;	dim3 grid;	dim3 block;	cudaStream_t *stream;	int nstreams;	float max;	unsigned int width, height, depth;	cudaExtent ca_extent; public:	__host__ explicit cystGPUMsim(struct simParams *params, const char* buff);	__host__ virtual ~cystGPUMsim();	__host__ void callGPUsim(int frames);	__host__ float* retrieveGPUsim(); }; }

cyst.cu

/* Copyright (c) 2009 Kevin Daley	The following code is part of cYst.	cYst is free software: you can redistribute it and/or modify	it under the terms of the GNU General Public License as published by	the Free Software Foundation, either version 3 of the License, or	(at your option) any later version.	cYst is distributed in the hope that it will be useful,	but WITHOUT ANY WARRANTY; without even the implied warranty of	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	GNU General Public License for more details.	You should have received a copy of the GNU General Public License	along with cYst. If not, see <http://www.gnu.org/licenses/>. */ #include "cyst.cu" __global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k){	uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx .y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);	hd[pg.x][pg.y][pg.z]=3; } void demo(){ cudaSetDevice(0); struct simParams *parms= new simParams; parms->dT=10; parms->n=.5; parms->dx=make_float4(0.1,0.1,0.1,0.1); dim3 grid (2,1); dim3 block (2,2,2); parms->width=128; parms->height=128; parms->depth=64; parms->max=50.0f; parms->nstreams=2; cudaSetDevice(0);	parms->grid=grid;	parms->block=block;	cyst::cystGPUMsim* sim=new cyst::cystGPUMsim(parms, "k");	sim->callGPUsim(1);	float* H=sim->retrieveGPUsim();	delete sim;	FILE* f=fopen("out.dat","rw+");	for(int i=0; i<256; i++){	for(int j=0; j<256; j++){	for(int k=0; k<256; k++){	fprintf(f, "%f %f %f %f %f %f %f %f\n", i*parms->dx.x, j*parms->dx.y,k*parms->dx.z, k*parms->dx.w, H[(i+j*256+k*256*1024)+3], H[(i+j*256+k*256*1024)], H[(i+j*256+k*256*1024)+1], H[(i+j*256+k*256*1024)+2]);	}}}	fclose(f);	exit(0); }

and demo.cu

/* Copyright (c) 2009 Kevin Daley	The following code is part of cYst.	cYst is free software: you can redistribute it and/or modify	it under the terms of the GNU General Public License as published by	the Free Software Foundation, either version 3 of the License, or	(at your option) any later version.	cYst is distributed in the hope that it will be useful,	but WITHOUT ANY WARRANTY; without even the implied warranty of	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	GNU General Public License for more details.	You should have received a copy of the GNU General Public License	along with cYst. If not, see <http://www.gnu.org/licenses/>. */ #include "cyst.cu" __global__ void initialize(float*** hd, unsigned char* file, int i, int j, int k){	uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx .y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);	hd[pg.x][pg.y][pg.z]=3; } void demo(){ cudaSetDevice(0); struct simParams *parms= new simParams; parms->dT=10; parms->n=.5; parms->dx=make_float4(0.1,0.1,0.1,0.1); dim3 grid (2,1); dim3 block (2,2,2); parms->width=128; parms->height=128; parms->depth=64; parms->max=50.0f; parms->nstreams=2; cudaSetDevice(0);	parms->grid=grid;	parms->block=block;	cyst::cystGPUMsim* sim=new cyst::cystGPUMsim(parms, "k");	sim->callGPUsim(1);	float* H=sim->retrieveGPUsim();	delete sim;	FILE* f=fopen("out.dat","rw+");	for(int i=0; i<256; i++){	for(int j=0; j<256; j++){	for(int k=0; k<256; k++){	fprintf(f, "%f %f %f %f %f %f %f %f\n", i*parms->dx.x, j*parms->dx.y,k*parms->dx.z, k*parms->dx.w, H[(i+j*256+k*256*1024)+3], H[(i+j*256+k*256*1024)], H[(i+j*256+k*256*1024)+1], H[(i+j*256+k*256*1024)+2]);	}}}	fclose(f);	exit(0); }

and kernel.cu:

/* Copyright (c) 2009 Kevin Daley	The following code is part of cYst.	cYst is free software: you can redistribute it and/or modify	it under the terms of the GNU General Public License as published by	the Free Software Foundation, either version 3 of the License, or	(at your option) any later version.	cYst is distributed in the hope that it will be useful,	but WITHOUT ANY WARRANTY; without even the implied warranty of	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	GNU General Public License for more details.	You should have received a copy of the GNU General Public License	along with cYst. If not, see <http://www.gnu.org/licenses/>. */ //planck units #define const_pi 3.1415927f #define kC 1.0f #define kB 1.0f #define lightspeed 1.0f #define finestructure 1.0f #include <sm_11_atomic_functions.h> __global__ void kernelGPU(float ***H, float dT, float4 dx, float n, float max, int i, int j, int k){	float mg=0;	uint3 pl=make_uint3(threadIdx.x, threadIdx.y, threadIdx.z);	uint3 pg=make_uint3(threadIdx.x+blockIdx.x*blockDim.x+i,threadIdx.x+blockIdx .y*blockDim.y+j,threadIdx.z+blockIdx.z*blockDim.z+k);	float x=0.0f, y=0.0f, z=0.0f;	__shared__ float shared[2][2][2], E[2][2][2], E2[2][2][2];	E[pl.x][pl.y][pl.z]=H[pg.x][pg.y][pg.z];	E2[pl.x][pl.y][pl.z]=0.0f;	for(float i=0; i<max && i<E[pl.x][pl.y][pl.z]; i+=n){	if((threadIdx.z+1)%4==0){	shared[pl.x][pl.y][pl.z]=i;	mg=sqrtf(pow(shared[pl.x][pl.y][pl.z-3],2.0f)+pow(shared[pl.x][pl.y][pl.z-2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));	x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));	}	else if((threadIdx.z+1)%3==0){	shared[pl.x][pl.y][pl.z]=i;	mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z-2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));	x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));	}	else if((threadIdx.z+1)%2==0){	shared[pl.x][pl.y][pl.z]=i;	mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z+2],2.0f)+pow(shared[pl.x][pl.y][pl.z-1],2.0f));	x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));	}	else{	shared[pl.x][pl.y][pl.z]=i;	mg=sqrtf(pow(shared[pl.x][pl.y][pl.z+1],2.0f)+pow(shared[pl.x][pl.y][pl.z+2],2.0f)+pow(shared[pl.x][pl.y][pl.z+3],2.0f));	x+=shared[pl.x][pl.y][pl.z]*(cosf(mg)-1/mg*shared[pl.x][pl.y][pl.z]*sinf(mg));	}	shared[pl.x][pl.y][pl.z]=x;	__syncthreads();	__syncthreads();	if((threadIdx.z+1)%4==0){	z=shared[pl.x][pl.y][pl.z-1];	y=shared[pl.x][pl.y][pl.z-2];	x=shared[pl.x][pl.y][pl.z-3];	}	__syncthreads();	if((threadIdx.z+1)%4==0 && int(fdividef(x,dx.x))>=0 && int(fdividef(x,dx.x))){	E2[pl.x+int(fdividef(x,dx.x))][pl.y+int(fdividef(y,dx.y))][pl.z+4*int(fdividef(z,dx.z))]+=shared[pl.x][pl.y][pl.z];	}	else if((threadIdx.z+1)%4==0){	H[int(fdividef(x,dx.x))+pg.x][int(fdividef(y,dx.y))+pg.y][(int(fdividef(z,dx.z))*4+pg.z)]+=i;	}	E2[pl.x][pl.y][pl.z]-=i;	__syncthreads();	}	H[pg.x][pg.y][pg.z]=E[pl.x][pl.y][pl.z]+E2[pl.x][pl.y][pl.z];	}

Finally, main.cpp:

/* Copyright (c) 2009 Kevin Daley	The following code is part of cYst.	cYst is free software: you can redistribute it and/or modify	it under the terms of the GNU General Public License as published by	the Free Software Foundation, either version 3 of the License, or	(at your option) any later version.	cYst is distributed in the hope that it will be useful,	but WITHOUT ANY WARRANTY; without even the implied warranty of	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	GNU General Public License for more details.	You should have received a copy of the GNU General Public License	along with cYst. If not, see <http://www.gnu.org/licenses/>. */ extern void demo(); int main(int argc, char** argv){	demo(); }

There you go.

vargant51 · March 19, 2009, 5:22pm

bump

Topic		Replies	Views
using cudaMalloc and cudaFree within a loop unspecified launch failure! CUDA Programming and Performance	21	37825	April 23, 2009
Problem in Emulation CUDA Programming and Performance	1	857	September 4, 2009
[SOLVED]memory leak issues CUDA Programming and Performance	12	3720	October 12, 2015
memory fragmentation? CUDA Programming and Performance	2	4304	April 15, 2009
Memory allocation reliablity CUDA Programming and Performance	8	3278	August 18, 2008
random kernel execution failure with unknown error CUDA programming on Linux CUDA Programming and Performance	9	8704	June 11, 2008
Heap Corruption help needed! CUDA Programming and Performance	1	8214	April 16, 2009
Segmentation error in cudaMalloc - concurrent kernel execution CUDA Programming and Performance	8	1257	January 30, 2012
nvcc Segfault CUDA Programming and Performance	6	11474	October 14, 2010
Memory Corruption on a Fermi-Class GPU Error only on Fermis, program works on non-Fermis. CUDA Programming and Performance	18	7233	July 22, 2011

memory fragmentation (perhaps?)

Related topics