CUDA Fortran for scientists and engineers : best practices for efficient CUDA Fortran programming

CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presu...

Bibliographic Details
Main Authors: Ruetsch, Gregory (Author), Fatica, Massimiliano (Author)
Corporate Author: ScienceDirect (Online service)
Format: eBook
Language: English
Published: Cambridge, MA : Morgan Kaufmann, [2024]
Edition: Second edition.
Table of Contents:
  • Front Cover
  • CUDA Fortran for Scientists and Engineers
  • Copyright
  • Contents
  • Preface to the Second Edition
  • Preface to the First Edition
  • Acknowledgments
  • Part 1: CUDA Fortran programming
  • 1 Introduction
  • 1.1 A brief history of GPU computing
  • 1.2 Parallel computation
  • 1.3 Basic concepts
  • 1.3.1 A first CUDA Fortran program
  • 1.3.1.1 CUDA Fortran compilation
  • 1.3.2 Extending to larger arrays
  • 1.3.3 Multidimensional arrays
  • 1.3.4 Interfaces for device code
  • 1.3.5 Managed data
  • 1.3.6 Kernel loop directives and CUF kernels
  • 1.4 Determining CUDA hardware features and limits
  • 1.4.1 Choosing a device to run on
  • 1.4.2 Floating point precision
  • 1.4.2.1 Accommodating variable precision
  • 1.5 Error handling
  • 1.6 Compiling CUDA Fortran code
  • 1.7 CUDA Driver, Toolkit, and compatibility
  • 2 Correctness, accuracy, and debugging
  • 2.1 Assessing correctness of results
  • 2.1.1 Non-associativity of floating point arithmetic
  • 2.1.2 Fused multiply-add
  • 2.1.3 Flags affecting floating-point accuracy
  • 2.2 Debugging
  • 2.2.1 Printing from device code
  • 2.2.2 Debugging with cuda-gdb
  • 2.2.2.1 System requirements
  • 2.2.2.2 Compilation
  • 2.2.2.3 Setting breakpoints
  • 2.2.2.4 Focus: software and hardware coordinates
  • 2.2.2.5 CUDA activity status
  • 2.2.2.6 Single-stepping in device code
  • 2.2.2.7 Examining program state
  • 2.2.3 compute-sanitizer
  • 3 Performance measurement and metrics
  • 3.1 Measuring execution time
  • 3.1.1 Host-device synchronization and CPU timers
  • 3.1.2 Timing via CUDA events
  • 3.1.3 Nsight Systems command-line interface nsys
  • 3.1.3.1 Nsight Systems graphical user interface nsys-ui
  • 3.1.4 Customizing profiling with nvtx
  • 3.1.4.1 Basic NVTX tooling interfaces
  • 3.1.4.2 Advanced NVTX tooling interfaces
  • 3.1.4.3 Automated NVTX instrumentation
  • 3.2 Instruction, bandwidth, and latency bound kernels
  • 3.3 Memory bandwidth
  • 3.3.1 Theoretical peak bandwidth
  • 3.3.2 Effective bandwidth
  • 3.3.3 Actual data throughput vs. effective bandwidth
  • 4 Synchronization
  • 4.1 Synchronization of kernel execution and data transfers
  • 4.1.1 Pageable versus pinned host memory
  • 4.1.2 Streams
  • 4.1.2.1 Creating streams
  • 4.1.3 Asynchronous transfers via cudaMemcpyAsync()
  • 4.1.4 Synchronization barriers
  • 4.1.4.1 cudaDeviceSynchronize()
  • 4.1.4.2 cudaStreamSynchronize()
  • 4.1.4.3 cudaEventSynchronize()
  • 4.1.4.4 Querying streams and events
  • 4.1.5 Advanced stream topics
  • 4.1.5.1 The default stream
  • 4.1.5.2 Non-blocking streams
  • 4.1.5.3 Stream priorities
  • 4.2 Synchronization of kernel threads on the device
  • 4.2.1 Shared memory
  • 4.2.2 Synchronizing threads within a block
  • 4.2.3 Warps, warp synchronization, and warp-level primitives
  • 4.2.3.1 SHFL functions
  • 4.2.4 Atomics
  • 4.2.5 Memory fences
  • 4.2.6 Cooperative groups
  • 4.2.6.1 Grid synchronization
  • 4.2.6.2 Thread block clusters
  • 5 Optimization
  • 5.1 Transfers between host and device
  • 5.1.1 Pinned memory
  • 5.2 Device memory
  • 5.2.1 ECC (Error Correcting Code)
  • 5.2.2 Global memory
  • 5.2.2.1 Declaring global array arguments in kernels
  • 5.2.2.2 Coalesced global memory access
  • 5.2.3 Local memory
  • 5.2.4 Constant memory
  • 5.2.5 L1 and L2 caches
  • 5.2.6 Shared memory
  • 5.2.6.1 Configuring shared memory
  • 5.2.6.2 Global memory coalescing through shared memory
  • 5.2.6.3 Shared memory bank conflicts
  • 5.2.7 Registers
  • 5.2.7.1 Exchanging register data between threads in a warp
  • 5.3 Execution configuration
  • 5.3.1 Thread-level parallelism
  • 5.3.2 Instruction-level parallelism
  • 5.3.2.1 Asynchronous data transfers between global and shared memory
  • 5.3.2.2 Instruction-level parallelism in CUF kernels
  • 11 Applications of the fast Fourier transform
  • 11.1 CUFFT
  • 11.2 Spectral derivatives
  • 11.3 Convolution
  • 11.4 Poisson solver
  • 11.4.1 Vortex dynamics
  • 12 Ray tracing
  • 12.1 Generating an image file
  • PPM file format
  • Implementation
  • 12.2 Vectors in CUDA Fortran
  • Implementation of the RGB module
  • 12.3 Rays, a simple camera, and background
  • First ray-tracing code
  • 12.4 Adding a sphere
  • Theory
  • Implementation
  • 12.5 Surface normals and multiple objects
  • Surface normals
  • Multiple objects
  • 12.6 Antialiasing
  • Implementation
  • CUDA implementation
  • 12.7 Material types
  • 12.7.1 Diffuse materials
  • Implementation
  • Random number generation
  • Modifications to the color() routine
  • 12.7.2 Metal
  • Theory
  • Implementation
  • Material derived types
  • Initialization of spheres and material types
  • The scatter() function and Cray pointers
  • 12.7.3 Dielectrics
  • Theory
  • Implementation
  • 12.8 Positionable camera
  • 12.9 Defocus blur
  • Implementation
  • 12.10 Where next?
  • 12.11 Triangles
  • Theory
  • Implementation
  • 12.12 Lights
  • Implementation
  • 12.13 Textures
  • Theory
  • Implementation
  • Part 3: Appendices
  • A System and environment management
  • A.1 Environment variables
  • A.1.1 General
  • A.1.2 Just-in-time compilation
  • A.2 nvidia-smi: System Management Interface
  • A.2.1 Enabling and disabling ECC
  • A.2.2 Compute mode
  • A.2.3 Persistence mode
  • A.2.4 Topology
  • References
  • Index
  • Back Cover