CUDA Fortran for scientists and engineers : best practices for efficient CUDA Fortran programming
CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presu...
| Main Authors: | , |
|---|---|
| Corporate Author: | |
| Format: | eBook |
| Language: | English |
| Published: | Cambridge, MA : Morgan Kaufmann, [2024] |
| Edition: | Second edition. |
| Subjects: | |
| Online Access: | Connect to the full text of this electronic book |
Table of Contents:
- Front Cover
- CUDA Fortran for Scientists and Engineers
- Copyright
- Contents
- Preface to the Second Edition
- Preface to the First Edition
- Acknowledgments
- Part 1: CUDA Fortran programming
- 1 Introduction
- 1.1 A brief history of GPU computing
- 1.2 Parallel computation
- 1.3 Basic concepts
- 1.3.1 A first CUDA Fortran program
- 1.3.1.1 CUDA Fortran compilation
- 1.3.2 Extending to larger arrays
- 1.3.3 Multidimensional arrays
- 1.3.4 Interfaces for device code
- 1.3.5 Managed data
- 1.3.6 Kernel loop directives and CUF kernels
- 1.4 Determining CUDA hardware features and limits
- 1.4.1 Choosing a device to run on
- 1.4.2 Floating point precision
- 1.4.2.1 Accommodating variable precision
- 1.5 Error handling
- 1.6 Compiling CUDA Fortran code
- 1.7 CUDA Driver, Toolkit, and compatibility
- 2 Correctness, accuracy, and debugging
- 2.1 Assessing correctness of results
- 2.1.1 Non-associativity of floating point arithmetic
- 2.1.2 Fused-multiply add
- 2.1.3 Flags affecting floating-point accuracy
- 2.2 Debugging
- 2.2.1 Printing from device code
- 2.2.2 Debugging with cuda-gdb
- 2.2.2.1 System requirements
- 2.2.2.2 Compilation
- 2.2.2.3 Setting breakpoints
- 2.2.2.4 Focus: software and hardware coordinates
- 2.2.2.5 CUDA activity status
- 2.2.2.6 Single-stepping in device code
- 2.2.2.7 Examining program state
- 2.2.3 compute-sanitizer
- 3 Performance measurement and metrics
- 3.1 Measuring execution time
- 3.1.1 Host-device synchronization and CPU timers
- 3.1.2 Timing via CUDA events
- 3.1.3 Nsight Systems command-line interface nsys
- 3.1.3.1 Nsight Systems graphical user interface nsys-ui
- 3.1.4 Customizing profiling with nvtx
- 3.1.4.1 Basic NVTX tooling interfaces
- 3.1.4.2 Advanced NVTX tooling interfaces
- 3.1.4.3 Automated NVTX instrumentation
- 3.2 Instruction, bandwidth, and latency bound kernels
- 3.3 Memory bandwidth
- 3.3.1 Theoretical peak bandwidth
- 3.3.2 Effective bandwidth
- 3.3.3 Actual data throughput vs. effective bandwidth
- 4 Synchronization
- 4.1 Synchronization of kernel execution and data transfers
- 4.1.1 Pageable versus pinned host memory
- 4.1.2 Streams
- 4.1.2.1 Creating streams
- 4.1.3 Asynchronous transfers via cudaMemcpyAsync()
- 4.1.4 Synchronization barriers
- 4.1.4.1 cudaDeviceSynchronize()
- 4.1.4.2 cudaStreamSynchronize()
- 4.1.4.3 cudaEventSynchronize()
- 4.1.4.4 Querying streams and events
- 4.1.5 Advanced stream topics
- 4.1.5.1 The default stream
- 4.1.5.2 Non-blocking streams
- 4.1.5.3 Stream priorities
- 4.2 Synchronization of kernel threads on the device
- 4.2.1 Shared memory
- 4.2.2 Synchronizing threads within a block
- 4.2.3 Warps, warp synchronization, and warp-level primitives
- 4.2.3.1 SHFL functions
- 4.2.4 Atomics
- 4.2.5 Memory fences
- 4.2.6 Cooperative groups
- 4.2.6.1 Grid synchronization
- 4.2.6.2 Thread block clusters
- 5 Optimization
- 5.1 Transfers between host and device
- 5.1.1 Pinned memory
- 5.2 Device memory
- 5.2.1 ECC (Error Correcting Code)
- 5.2.2 Global memory
- 5.2.2.1 Declaring global array arguments in kernels
- 5.2.2.2 Coalesced global memory access
- 5.2.3 Local memory
- 5.2.4 Constant memory
- 5.2.5 L1 and L2 caches
- 5.2.6 Shared memory
- 5.2.6.1 Configuring shared memory
- 5.2.6.2 Global memory coalescing through shared memory
- 5.2.6.3 Shared memory bank conflicts
- 5.2.7 Registers
- 5.2.7.1 Exchanging register data between threads in a warp
- 5.3 Execution configuration
- 5.3.1 Thread-level parallelism
- 5.3.2 Instruction-level parallelism
- 5.3.2.1 Asynchronous data transfers between global and shared memory
- 5.3.2.2 Instruction-level parallelism in CUF kernels
- 11 Applications of the fast Fourier transform
- 11.1 CUFFT
- 11.2 Spectral derivatives
- 11.3 Convolution
- 11.4 Poisson solver
- 11.4.1 Vortex dynamics
- 12 Ray tracing
- 12.1 Generating an image file
- PPM file format
- Implementation
- 12.2 Vectors in CUDA Fortran
- Implementation of the RGB module
- 12.3 Rays, a simple camera, and background
- First ray-tracing code
- 12.4 Adding a sphere
- Theory
- Implementation
- 12.5 Surface normals and multiple objects
- Surface normals
- Multiple objects
- 12.6 Antialiasing
- Implementation
- CUDA implementation
- 12.7 Material types
- 12.7.1 Diffuse materials
- Implementation
- Random number generation
- Modifications to the color() routine
- 12.7.2 Metal
- Theory
- Implementation
- Material derived types
- Initialization of spheres and material types
- The scatter() function and Cray pointers
- 12.7.3 Dielectrics
- Theory
- Implementation
- 12.8 Positionable camera
- 12.9 Defocus blur
- Implementation
- 12.10 Where next?
- 12.11 Triangles
- Theory
- Implementation
- 12.12 Lights
- Implementation
- 12.13 Textures
- Theory
- Implementation
- Part 3: Appendices
- A System and environment management
- A.1 Environment variables
- A.1.1 General
- A.1.2 Just-in-time compilation
- A.2 nvidia-smi: System Management Interface
- A.2.1 Enabling and disabling ECC
- A.2.2 Compute mode
- A.2.3 Persistence mode
- A.2.4 Topology
- References
- Index
- Back Cover