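David B. Kirk is a member of the U.S. National Academy of Engineering and an NVIDIA Fellow, and formerly served as NVIDIA's chief scientist. He led NVIDIA's graphics technology development, making it today's most popular mass-entertainment platform, and is one of the founders of CUDA technology. In 2002 he received the ACM SIGGRAPH Computer Graphics Achievement Award in recognition of his outstanding contributions to bringing high-performance computer graphics systems to the mass market. He holds B.S. and M.S. degrees in mechanical engineering from the Massachusetts Institute of Technology and a Ph.D. in computer science from the California Institute of Technology. Dr. Kirk is the inventor of 50 patents and patent applications related to graphics chip design, has published more than 50 papers on graphics processing technology, and is an authority on visual computing.

Wen-mei W. Hwu holds a Ph.D. in computer science from the University of California, Berkeley. He is the Jerry Sanders (AMD founder) Chair Professor of Electrical and Computer Engineering in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign (UIUC), co-director of the Universal Parallel Computing Research Center jointly funded by Microsoft and Intel, and principal investigator of the world's first NVIDIA CUDA Center of Excellence. Professor Hwu is a world-leading expert in parallel processor architecture and compilers, and serves as a principal investigator for Blue Waters, the next-generation U.S. petascale computer system. He is an IEEE Fellow and an ACM Fellow.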
table of contents
preface
acknowledgements
chapter 1 introduction
1.1 heterogeneous parallel computing
1.2 architecture of a modern gpu
1.3 why more speed or parallelism?
1.4 speeding up real applications
1.5 parallel programming languages and models
1.6 overarching goals
1.7 organization of the book
references
chapter 2 history of gpu computing
2.1 evolution of graphics pipelines
the era of fixed-function graphics pipelines
evolution of programmable real-time graphics
unified graphics and computing processors
2.2 gpgpu: an intermediate step
2.3 gpu computing
scalable gpus
recent developments
future trends
references and further reading
chapter 3 introduction to data parallelism and cuda c
3.1 data parallelism
3.2 cuda program structure
3.3 a vector addition kernel
3.4 device global memory and data transfer
3.5 kernel functions and threading
3.6 summary
function declarations
kernel launch
predefined variables
runtime api
3.7 exercises
references
chapter 4 data-parallel execution model
4.1 cuda thread organization
4.2 mapping threads to multidimensional data
4.3 matrix-matrix multiplication--a more complex kernel
4.4 synchronization and transparent scalability
4.5 assigning resources to blocks
4.6 querying device properties
4.7 thread scheduling and latency tolerance
4.8 summary
4.9 exercises
chapter 5 cuda memories
5.1 importance of memory access efficiency
5.2 cuda device memory types
5.3 a strategy for reducing global memory traffic
5.4 a tiled matrix-matrix multiplication kernel
5.5 memory as a limiting factor to parallelism
5.6 summary
5.7 exercises
chapter 6 performance considerations
6.1 warps and thread execution
6.2 global memory bandwidth
6.3 dynamic partitioning of execution resources
6.4 instruction mix and thread granularity
6.5 summary
6.6 exercises
references
chapter 7 floating-point considerations
7.1 floating-point format
normalized representation of m
excess encoding of e
7.2 representable numbers
7.3 special bit patterns and precision in ieee format
7.4 arithmetic accuracy and rounding
7.5 algorithm considerations
7.6 numerical stability
7.7 summary
7.8 exercises
references
chapter 8 parallel patterns: convolution
8.1 background
8.2 1d parallel convolution--a basic algorithm
8.3 constant memory and caching
8.4 tiled 1d convolution with halo elements
8.5 a simpler tiled 1d convolution--general caching
8.6 summary
8.7 exercises
chapter 9 parallel patterns: prefix sum
9.1 background
9.2 a simple parallel scan
9.3 work efficiency considerations
9.4 a work-efficient parallel scan
9.5 parallel scan for arbitrary-length inputs
9.6 summary
9.7 exercises
reference
chapter 10 parallel patterns: sparse matrix-vector multiplication
10.1 background
10.2 parallel spmv using csr
10.3 padding and transposition
10.4 using hybrid to control padding
10.5 sorting and partitioning for regularization
10.6 summary
10.7 exercises
references
chapter 11 application case study: advanced mri reconstruction
11.1 application background
11.2 iterative reconstruction
11.3 computing fhd
step 1: determine the kernel parallelism structure
step 2: getting around the memory bandwidth limitation
step 3: using hardware trigonometry functions
step 4: experimental performance tuning
11.4 final evaluation
11.5 exercises
references
chapter 12 application case study: molecular visualization and analysis
12.1 application background
12.2 a simple kernel implementation
12.3 thread granularity adjustment
12.4 memory coalescing
12.5 summary
12.6 exercises
references
chapter 13 parallel programming and computational thinking
13.1 goals of parallel computing
13.2 problem decomposition
13.3 algorithm selection
13.4 computational thinking
13.5 summary
13.6 exercises
references
chapter 14 an introduction to opencl
14.1 background
14.2 data parallelism model
14.3 device architecture
14.4 kernel functions
14.5 device management and kernel launch
14.6 electrostatic potential map in opencl
14.7 summary
14.8 exercises
references
chapter 15 parallel programming with openacc
15.1 openacc versus cuda c
15.2 execution model
15.3 memory model
15.4 basic openacc programs
parallel construct
loop construct
kernels construct
data management
asynchronous computation and data transfer
15.5 future directions of openacc
15.6 exercises
chapter 16 thrust: a productivity-oriented library for cuda
16.1 background
16.2 motivation
16.3 basic thrust features
iterators and memory space
interoperability
16.4 generic programming
16.5 benefits of abstraction
16.6 programmer productivity
robustness
real world performance
16.7 best practices
fusion
structure of arrays
implicit ranges
16.8 exercises
references
chapter 17 cuda fortran
17.1 cuda fortran and cuda c differences
17.2 a first cuda fortran program
17.3 multidimensional array in cuda fortran
17.4 overloading host/device routines with generic interfaces
17.5 calling cuda c via iso_c_binding
17.6 kernel loop directives and reduction operations
17.7 dynamic shared memory
17.8 asynchronous data transfers
17.9 compilation and profiling
17.10 calling thrust from cuda fortran
17.11 exercises
chapter 18 an introduction to c++ amp
18.1 core c++ amp features
18.2 details of the c++ amp execution model
explicit and implicit data copies
asynchronous operation
section summary
18.3 managing accelerators
18.4 tiled execution
18.5 c++ amp graphics features
18.6 summary
18.7 exercises
chapter 19 programming a heterogeneous computing cluster
19.1 background
19.2 a running example
19.3 mpi basics
19.4 mpi point-to-point communication types
19.5 overlapping computation and communication
19.6 mpi collective communication
19.7 summary
19.8 exercises
reference
chapter 20 cuda dynamic parallelism
20.1 background
20.2 dynamic parallelism overview
20.3 important details
launch environment configuration
api errors and launch failures
events
streams
synchronization scope
20.4 memory visibility
global memory
zero-copy memory
constant memory
texture memory
20.5 a simple example
20.6 runtime limitations
memory footprint
nesting depth
memory allocation and lifetime
ecc errors
streams
events
launch pool
20.7 a more complex example
linear bezier curves
quadratic bezier curves
bezier curve calculation (predynamic parallelism)
bezier curve calculation (with dynamic parallelism)
20.8 summary
reference
chapter 21 conclusion and future outlook
21.1 goals revisited
21.2 memory model evolution
21.3 kernel execution control evolution
21.4 core performance
21.5 programming environment
21.6 future outlook
references
appendix A: matrix multiplication host-only version source code
appendix B: gpu compute capabilities
index