General-purpose graphics processor architectures

Bibliographic Details
Main Authors: Aamodt, Tor M. (Author), Fung, Wilson Wai Lun (Author), Rogers, Timothy G. (Author)
Corporate Author: Morgan & Claypool Publishers
Format: eBook
Language: English
Published: [San Rafael, California] : Morgan & Claypool, 2018.
Series: Synthesis digital library of engineering and computer science.
Synthesis lectures in computer architecture ; # 44.
Table of Contents:
  • 1. Introduction
  • 1.1 The landscape of computation accelerators
  • 1.2 GPU hardware basics
  • 1.3 A brief history of GPUs
  • 1.4 Book outline
  • 2. Programming model
  • 2.1 Execution model
  • 2.2 GPU instruction set architectures
  • 2.2.1 NVIDIA GPU instruction set architectures
  • 2.2.2 AMD graphics core next instruction set architecture
  • 3. The SIMT core: instruction and register data flow
  • 3.1 One-loop approximation
  • 3.1.1 SIMT execution masking
  • 3.1.2 SIMT deadlock and stackless SIMT architectures
  • 3.1.3 Warp scheduling
  • 3.2 Two-loop approximation
  • 3.3 Three-loop approximation
  • 3.3.1 Operand collector
  • 3.3.2 Instruction replay: handling structural hazards
  • 3.4 Research directions on branch divergence
  • 3.4.1 Warp compaction
  • 3.4.2 Intra-warp divergent path management
  • 3.4.3 Adding MIMD capability
  • 3.4.4 Complexity-effective divergence management
  • 3.5 Research directions on scalarization and affine execution
  • 3.5.1 Detection of uniform or affine variables
  • 3.5.2 Exploiting uniform or affine variables in GPU
  • 3.6 Research directions on register file architecture
  • 3.6.1 Hierarchical register file
  • 3.6.2 Drowsy state register file
  • 3.6.3 Register file virtualization
  • 3.6.4 Partitioned register file
  • 3.6.5 RegLess
  • 4. Memory system
  • 4.1 First-level memory structures
  • 4.1.1 Scratchpad memory and L1 data cache
  • 4.1.2 L1 texture cache
  • 4.1.3 Unified texture and data cache
  • 4.2 On-chip interconnection network
  • 4.3 Memory partition unit
  • 4.3.1 L2 cache
  • 4.3.2 Atomic operations
  • 4.3.3 Memory access scheduler
  • 4.4 Research directions for GPU memory systems
  • 4.4.1 Memory access scheduling and interconnection network design
  • 4.4.2 Caching effectiveness
  • 4.4.3 Memory request prioritization and cache bypassing
  • 4.4.4 Exploiting inter-warp heterogeneity
  • 4.4.5 Coordinated cache bypassing
  • 4.4.6 Adaptive cache management
  • 4.4.7 Cache prioritization
  • 4.4.8 Virtual memory page placement
  • 4.4.9 Data placement
  • 4.4.10 Multi-chip-module GPUs
  • 5. Crosscutting research on GPU computing architectures
  • 5.1 Thread scheduling
  • 5.1.1 Research on assignment of threadblocks to cores
  • 5.1.2 Research on cycle-by-cycle scheduling decisions
  • 5.1.3 Research on scheduling multiple kernels
  • 5.1.4 Fine-grain synchronization aware scheduling
  • 5.2 Alternative ways of expressing parallelism
  • 5.3 Support for transactional memory
  • 5.3.1 Kilo TM
  • 5.3.2 Warp TM and temporal conflict detection
  • 5.4 Heterogeneous systems
  • Bibliography
  • Authors' biographies