GPU computing provides a way to harness the power of massively parallel graphics processing units (GPUs) for general-purpose computing. GPUs contain over 100 processing cores and can achieve over 500 gigaflops of performance. The CUDA programming model lets programmers exploit this parallelism by launching compute kernels on the GPU from their existing C/C++ applications. This approach democratizes parallel computing by making highly parallel systems accessible through inexpensive GPUs in personal computers and workstations. Researchers can now explore manycore architectures and parallel algorithms using GPUs as a platform.
An overview of GPU computing from NVIDIA, with the tutorial speakers and schedule, focused on democratizing parallel computing.
Discusses the golden age of parallel computing and its significant architectures, followed by a dark age marked by limited commercial impact and a shift toward commodity technology.
Explains GPU capabilities as multithreaded manycore chips, highlighting NVIDIA Tesla performance and the advantages of GPUs across various fields.
Presents CUDA as a parallel programming model that democratizes parallel computing, citing sales of CUDA-capable GPUs and affordable developer kits.
Motivates GPU computing with performance metrics and speedup data from various applications.
Presents peak and sustained GPU performance, including theoretical benchmarks and measured application figures.
Describes the architecture and functionality of manycore GPUs, focusing on the CUDA programming model and heterogeneous programming strategies.
Details CUDA programming, including kernel functions, shared memory, and thread identification, aimed at harnessing GPU power for general computation.
Introduces NVIDIA's Tesla product line as a high-performance computing solution, detailing the specifications of various Tesla models.
Summarizes GPUs as powerful parallel processors, with CUDA offering an accessible programming model and broad research opportunities, followed by a Q&A.
Parallel Computing's Dark Age
But the impact of data-parallel computing was limited: Thinking Machines sold seven CM-1s (hundreds of systems in total), and MasPar sold roughly 200 systems. Commercial and research activity subsided, and massively parallel machines were replaced by clusters of ever-more-powerful commodity microprocessors (Beowulf, Legion, grid computing, and so on). Massively parallel computing lost momentum to the inexorable advance of commodity technology.
Heterogeneous Programming
CUDA is a serial program with parallel kernels, all in C. Serial C code executes in a CPU thread, while parallel kernel C code executes in thread blocks across multiple processing elements. Execution alternates between the two:

    Serial Code
    KernelA<<< nBlk, nTid >>>(args);    // Parallel Kernel
    ...
    Serial Code
    KernelB<<< nBlk, nTid >>>(args);    // Parallel Kernel
    ...
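This alternation between serial host code and parallel kernels can be sketched as a complete host program; the kernel names, element operations, and launch sizes below are illustrative, not taken from the tutorial:

```cuda
#include <cuda_runtime.h>

// Illustrative kernels (hypothetical names and operations).
__global__ void KernelA(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;                          // each thread scales one element
}

__global__ void KernelB(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;                          // each thread offsets one element
}

int main() {
    const int nBlk = 500, nTid = 128;
    const int n = nBlk * nTid;
    float *d_data;

    cudaMalloc(&d_data, n * sizeof(float));   // serial code: runs in a CPU thread

    KernelA<<<nBlk, nTid>>>(d_data);          // parallel kernel: runs on the GPU
    // ... more serial CPU code may run here ...
    KernelB<<<nBlk, nTid>>>(d_data);          // second parallel kernel

    cudaDeviceSynchronize();                  // wait for GPU work to finish
    cudaFree(d_data);
    return 0;
}
```

The host thread proceeds serially; each `<<<...>>>` launch hands a data-parallel phase to the GPU, mirroring the Serial Code / Parallel Kernel alternation above.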
CUDA: Programming the GPU in C
Philosophy: provide the minimal set of extensions necessary to expose the hardware's power.

Declaration specifiers indicate where things live:
    __global__ void KernelFunc(...);  // kernel function, runs on device
    __device__ int GlobalVar;         // variable in device memory
    __shared__ int SharedVar;         // variable in per-block shared memory

Function invocation syntax is extended for parallel kernel launch:
    KernelFunc<<<500, 128>>>(...);    // launch 500 blocks w/ 128 threads each

Special variables identify threads within kernels:
    dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

Intrinsics expose specific operations in kernel code:
    __syncthreads();                  // barrier synchronization within kernel
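As a sketch of how these extensions combine, the following hypothetical kernel sums the elements handled by each block, using per-block shared memory, the thread-identification variables, and barrier synchronization (the function name and sizes are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

// Hypothetical example: each block of 128 threads sums its 128 input elements.
__global__ void BlockSum(const float *in, float *blockSums) {
    __shared__ float partial[128];            // per-block shared memory

    int tid = threadIdx.x;                    // thread index within the block
    int i   = blockIdx.x * blockDim.x + tid;  // global element index

    partial[tid] = in[i];
    __syncthreads();                          // barrier: all loads complete

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                      // barrier after each step
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];   // one result per block
}

// Launch with 500 blocks of 128 threads, matching the syntax above:
// BlockSum<<<500, 128>>>(d_in, d_blockSums);
```

Each block reduces its slice independently; the `__syncthreads()` barriers ensure every thread's shared-memory write is visible before any thread reads it in the next step.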