CUDA Fortran for scientists and engineers : best practices for efficient CUDA Fortran programming /

CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presu...

Bibliographic Details
Main Authors:	Ruetsch, Gregory (Author), Fatica, Massimiliano (Author)
Corporate Author:	ScienceDirect (Online service)
Format:	eBook
Language:	English
Published:	Cambridge, MA : Morgan Kaufmann, [2024]
Edition:	Second edition.
Subjects:	FORTRAN (Computer program language); Fortran (Langage de programmation); Electronic books.
Online Access:	Connect to the full text of this electronic book

Table of Contents:

Front Cover
CUDA Fortran for Scientists and Engineers
Copyright
Contents
Preface to the Second Edition
Preface to the First Edition
Acknowledgments
1 CUDA Fortran programming
1 Introduction
1.1 A brief history of GPU computing
1.2 Parallel computation
1.3 Basic concepts
1.3.1 A first CUDA Fortran program
1.3.1.1 CUDA Fortran compilation
1.3.2 Extending to larger arrays
1.3.3 Multidimensional arrays
1.3.4 Interfaces for device code
1.3.5 Managed data
1.3.6 Kernel loop directives and CUF kernels
1.4 Determining CUDA hardware features and limits
1.4.1 Choosing a device to run on
1.4.2 Floating point precision
1.4.2.1 Accommodating variable precision
1.5 Error handling
1.6 Compiling CUDA Fortran code
1.7 CUDA Driver, Toolkit, and compatibility
2 Correctness, accuracy, and debugging
2.1 Assessing correctness of results
2.1.1 Non-associativity of floating point arithmetic
2.1.2 Fused multiply-add
2.1.3 Flags affecting floating-point accuracy
2.2 Debugging
2.2.1 Printing from device code
2.2.2 Debugging with cuda-gdb
2.2.2.1 System requirements
2.2.2.2 Compilation
2.2.2.3 Setting breakpoints
2.2.2.4 Focus: software and hardware coordinates
2.2.2.5 CUDA activity status
2.2.2.6 Single-stepping in device code
2.2.2.7 Examining program state
2.2.3 compute-sanitizer
3 Performance measurement and metrics
3.1 Measuring execution time
3.1.1 Host-device synchronization and CPU timers
3.1.2 Timing via CUDA events
3.1.3 Nsight Systems command-line interface nsys
3.1.3.1 Nsight Systems graphical user interface nsys-ui
3.1.4 Customizing profiling with nvtx
3.1.4.1 Basic NVTX tooling interfaces
3.1.4.2 Advanced NVTX tooling interfaces
3.1.4.3 Automated NVTX instrumentation
3.2 Instruction, bandwidth, and latency bound kernels
3.3 Memory bandwidth
3.3.1 Theoretical peak bandwidth
3.3.2 Effective bandwidth
3.3.3 Actual data throughput vs. effective bandwidth
4 Synchronization
4.1 Synchronization of kernel execution and data transfers
4.1.1 Pageable versus pinned host memory
4.1.2 Streams
4.1.2.1 Creating streams
4.1.3 Asynchronous transfers via cudaMemcpyAsync()
4.1.4 Synchronization barriers
4.1.4.1 cudaDeviceSynchronize()
4.1.4.2 cudaStreamSynchronize()
4.1.4.3 cudaEventSynchronize()
4.1.4.4 Querying streams and events
4.1.5 Advanced stream topics
4.1.5.1 The default stream
4.1.5.2 Non-blocking streams
4.1.5.3 Stream priorities
4.2 Synchronization of kernel threads on the device
4.2.1 Shared memory
4.2.2 Synchronizing threads within a block
4.2.3 Warps, warp synchronization, and warp-level primitives
4.2.3.1 SHFL functions
4.2.4 Atomics
4.2.5 Memory fences
4.2.6 Cooperative groups
4.2.6.1 Grid synchronization
4.2.6.2 Thread block clusters
5 Optimization
5.1 Transfers between host and device
5.1.1 Pinned memory
5.2 Device memory
5.2.1 ECC (Error Correcting Code)
5.2.2 Global memory
5.2.2.1 Declaring global array arguments in kernels
5.2.2.2 Coalesced global memory access
5.2.3 Local memory
5.2.4 Constant memory
5.2.5 L1 and L2 caches
5.2.6 Shared memory
5.2.6.1 Configuring shared memory
5.2.6.2 Global memory coalescing through shared memory
5.2.6.3 Shared memory bank conflicts
5.2.7 Registers
5.2.7.1 Exchanging register data between threads in a warp
5.3 Execution configuration
5.3.1 Thread-level parallelism
5.3.2 Instruction-level parallelism
5.3.2.1 Asynchronous data transfers between global and shared memory
5.3.2.2 Instruction-level parallelism in CUF kernels
11 Applications of the fast Fourier transform
11.1 CUFFT
11.2 Spectral derivatives
11.3 Convolution
11.4 Poisson solver
11.4.1 Vortex dynamics
12 Ray tracing
12.1 Generating an image file
PPM file format
Implementation
12.2 Vectors in CUDA Fortran
Implementation of the RGB module
12.3 Rays, a simple camera, and background
First ray-tracing code
12.4 Adding a sphere
Theory
Implementation
12.5 Surface normals and multiple objects
Surface normals
Multiple objects
12.6 Antialiasing
Implementation
CUDA implementation
12.7 Material types
12.7.1 Diffuse materials
Implementation
Random number generation
Modifications to the color() routine
12.7.2 Metal
Theory
Implementation
Material derived types
Initialization of spheres and material types
The scatter() function and Cray pointers
12.7.3 Dielectrics
Theory
Implementation
12.8 Positionable camera
12.9 Defocus blur
Implementation
12.10 Where next?
12.11 Triangles
Theory
Implementation
12.12 Lights
Implementation
12.13 Textures
Theory
Implementation
3 Appendices
A System and environment management
A.1 Environment variables
A.1.1 General
A.1.2 Just-in-time compilation
A.2 nvidia-smi: System Management Interface
A.2.1 Enabling and disabling ECC
A.2.2 Compute mode
A.2.3 Persistence mode
A.2.4 Topology
References
Index
Back Cover