Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)

Language: English

Pages: 280

ISBN: 0123814723

Format: PDF / Kindle (mobi) / ePub

Programming Massively Parallel Processors discusses basic concepts about parallel programming and GPU architecture. "Massively parallel" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance level, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide where parallel programming is the main topic of the course. It builds on the basics of C programming for CUDA, a parallel programming environment that is supported on NVIDIA GPUs.
Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computer source. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency using CUDA.
The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who need information about computational thinking and parallel programming.

  • Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing.
  • Utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments.
  • Shows you how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL.

run your code faster than today’s machines. We want to help you master parallel programming so that your programs can scale up to the level of performance of new generations of machines. The key to such scalability is to regularize and localize memory data accesses to minimize consumption of critical resources and conflicts in accessing and updating data structures. Much technical knowledge will be required to achieve these goals, so we will cover quite a few principles and patterns of parallel

In CUDA FORTRAN, automatic arrays are used:

    attributes(global) subroutine dynamicReverse2(d, nSize)
      real :: d(nSize)
      integer, value :: nSize
      integer :: t, tr
      real, shared :: s(nSize)

      t = threadIdx%x
      tr = nSize-t+1

      s(t) = d(t)
      call syncthreads()
      d(t) = s(tr)
    end subroutine dynamicReverse2

Here nSize is not known at compile time, hence s is not a static shared memory array. Any in-scope variable, such as a variable declared in the module that contains this kernel, can be used to

difference between registers, shared memory, and global memory, we need to go into a little more detail of how these different types of memories are realized and used in modern processors. The global memory in the CUDA programming model maps to the memory of the von Neumann model (see “The von Neumann Model” sidebar). The processor box in Figure 5.3 corresponds to the processor chip boundary that we typically see today. The global memory is off the processor chip and is implemented with DRAM

the second row of addition operators. As a result, XY[i] now contains $x_{i-3}+x_{i-2}+x_{i-1}+x_i$. For example, after the second iteration, XY[3] contains $x_0+x_1+x_2+x_3$, shown as ∑x0..x3. Note that after the second iteration, XY[2] and XY[3] contain their final answers and will not need to be changed in subsequent iterations. Readers are encouraged to work through the rest of the iterations. We now work on the implementation of the algorithm illustrated in Figure 9.1. We assign each thread to evolve the
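
The iterations described in this excerpt can be traced with a small sequential sketch. This is not the book's CUDA kernel: the function name `kogge_stone_scan` and the sample input are assumptions made for illustration, and copying the array at the start of each iteration stands in for the barrier synchronization a real kernel would need between reading and writing XY.

```python
def kogge_stone_scan(x):
    """Inclusive prefix sum with a doubling stride, simulated sequentially.

    Hypothetical helper for illustration; returns the final array and a
    snapshot of the array after every iteration.
    """
    xy = list(x)
    snapshots = []
    stride = 1
    while stride < len(xy):
        # Copy first so every addition reads the previous iteration's values,
        # which is what a barrier between read and write guarantees on a GPU.
        prev = list(xy)
        for i in range(stride, len(xy)):
            xy[i] = prev[i] + prev[i - stride]
        snapshots.append(list(xy))
        stride *= 2
    return xy, snapshots


result, steps = kogge_stone_scan([3, 1, 7, 0, 4, 1, 6, 3])
print(steps[0])  # after iteration 1: [3, 4, 8, 7, 4, 5, 7, 9]
print(steps[1])  # after iteration 2: [3, 4, 11, 11, 12, 12, 11, 14]
print(result)    # final scan:        [3, 4, 11, 11, 15, 16, 22, 25]
```

After the second iteration, positions 2 and 3 already hold their final values (11 and 11 here), which matches the observation in the excerpt.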

the size of the k-space samples can be much larger, on the order of hundreds of thousands or even millions. A typical way of working around the limitation of constant memory capacity is to break down a large data set into chunks of 64 KB or smaller. The developer must reorganize the kernel so that the kernel will be invoked multiple times, with each invocation of the kernel consuming only a chunk of the large data set. This turns out to be quite easy for the cmpFHD kernel. A careful examination
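
The chunking idea in this excerpt can be sketched on the host side as follows. This is only an illustration under stated assumptions: the 12-byte sample size, the names `chunk_ranges` and `process_in_chunks`, and the stand-in `kernel` callable are all hypothetical and are not taken from the book's cmpFHD code.

```python
CHUNK_BYTES = 64 * 1024  # constant memory capacity mentioned in the text
SAMPLE_BYTES = 12        # assumption: three 4-byte floats per k-space sample


def chunk_ranges(num_samples, sample_bytes=SAMPLE_BYTES, chunk_bytes=CHUNK_BYTES):
    """Return (start, end) index pairs so that each chunk fits in chunk_bytes."""
    per_chunk = chunk_bytes // sample_bytes
    return [(start, min(start + per_chunk, num_samples))
            for start in range(0, num_samples, per_chunk)]


def process_in_chunks(samples, kernel):
    """Invoke the hypothetical kernel once per chunk and accumulate its results."""
    total = 0
    for start, end in chunk_ranges(len(samples)):
        # In a CUDA program this is where the chunk would be copied into
        # constant memory before the kernel launch.
        total += kernel(samples[start:end])
    return total


ranges = chunk_ranges(12000)
total = process_in_chunks(list(range(12000)), sum)
print(ranges)
print(total)
```

Because each invocation only adds its chunk's contribution to a running total, the result does not depend on how the data set is split, which is the property that makes this reorganization straightforward.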
