GPU Memory Hierarchy & Optimization
Master the GPU memory hierarchy from registers to global memory, and understand coalescing patterns, bank conflicts, and the optimization strategies that deliver maximum performance
Interactive GPU Architecture Explorer
Explore modern GPU architecture with interactive 3D visualization, memory access patterns, and kernel configuration:
GPU Architecture Key Concepts
Streaming Multiprocessor (SM)
The fundamental processing unit containing CUDA cores, Tensor cores, RT cores, schedulers, and memory.
Warp
A group of 32 threads that execute in lockstep (SIMT fashion). The basic scheduling and execution unit on the GPU.
Memory Coalescing
Combining memory accesses from multiple threads into fewer transactions for optimal bandwidth.
Occupancy
Ratio of active warps to maximum warps per SM. Higher occupancy helps hide memory latency.
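Occupancy does not have to be estimated by hand; it can be queried at runtime. The sketch below is a minimal example (the kernel `my_kernel` and the block size of 256 are placeholder assumptions) using `cudaOccupancyMaxActiveBlocksPerMultiprocessor`:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float* data) { data[threadIdx.x] = 0.0f; }  // placeholder kernel

int main() {
    int blockSize = 256;
    int numBlocks = 0;
    // Ask how many blocks of this kernel can be resident on one SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, my_kernel, blockSize, 0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```

The related `cudaOccupancyMaxPotentialBlockSize` can also suggest a block size that maximizes occupancy for a given kernel.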
GPU Memory Hierarchy Overview
Modern GPUs feature a complex memory hierarchy designed to maximize throughput for parallel workloads. Understanding this hierarchy is crucial for achieving peak performance in GPU applications.
Memory Types and Characteristics
1. Registers (Fastest, Private)
- Location: On-chip, per thread
- Size: 256 KB per SM (65,536 × 32-bit registers)
- Latency: 0 cycles (immediate)
- Bandwidth: ~8 TB/s
- Scope: Private to each thread
Key Points:
- Fastest storage in GPU
- Compiler-managed allocation
- Spilling to local memory impacts performance
- Limited to 255 registers per thread
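Register pressure can be managed explicitly rather than discovered after the fact. A minimal sketch (the kernel body is purely illustrative): `__launch_bounds__` tells the compiler the maximum block size and a desired number of resident blocks, letting it cap register use instead of spilling unpredictably.

```cuda
// Promise the compiler at most 256 threads per block and request at least
// 4 resident blocks per SM; the compiler then limits registers per thread
// so that this occupancy target remains achievable.
__global__ void __launch_bounds__(256, 4) bounded_kernel(float* data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] = data[tid] * 2.0f + 1.0f;   // illustrative work
}

// Compile with: nvcc -Xptxas -v kernel.cu   (reports registers and spill loads/stores)
// A blunter cap is also possible: nvcc --maxrregcount=64 (applies to the whole compilation unit)
```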
2. Shared Memory (Fast, Shared within Block)
- Location: On-chip, per SM
- Size: 48-228 KB per SM (configurable)
- Latency: ~20-30 cycles
- Bandwidth: ~4 TB/s
- Scope: Shared within a thread block
Optimization Strategies:
- Use for data reuse within blocks
- Avoid bank conflicts (32 banks, 4-byte width)
- Implement tiling algorithms
- Coordinate thread access patterns
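As a small concrete instance of these strategies (the transpose kernel later in this section is a fuller one), here is a minimal tiling sketch, assuming a 1D three-point stencil launched with `BLOCK` threads per block: each block stages its tile plus one halo element on each side in shared memory, so every global value is loaded once and reused by up to three threads.

```cuda
#define BLOCK 256

__global__ void stencil_1d(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2];                // +2 for halo elements
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                       // offset past the left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;          // coalesced main load
    if (threadIdx.x == 0)                            // left halo element
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)               // right halo element
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                                 // make all loads visible to the block

    if (gid < n)
        out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];
}
```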
3. L1 Cache (Fast, Automatic)
- Location: On-chip, per SM
- Size: 128-256 KB (unified with shared memory; architecture-dependent)
- Latency: ~30-40 cycles
- Bandwidth: ~4 TB/s
- Scope: Per SM; caches global and local memory
Configuration Options:
```cuda
// Configure the L1/shared-memory split for a kernel
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared); // More shared memory
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);     // More L1 cache
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferEqual);  // Balanced
```
4. L2 Cache (Medium, GPU-wide)
- Location: On-chip, shared across the GPU
- Size: 40-50 MB on current data-center GPUs (A100: 40 MB, H100: 50 MB)
- Latency: ~200 cycles
- Bandwidth: ~2-3 TB/s
- Scope: All SMs, coherent
Features:
- Caches all memory accesses
- Persistent across kernel launches
- Supports atomic operations
- Hardware-managed coherency
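The persistence property can also be steered explicitly. Below is a hedged sketch (CUDA 11+ and compute capability 8.0+ assumed; `lut` and `lut_bytes` stand for a hypothetical lookup table reused across kernels) that pins a buffer into the persisting region of L2 through a stream access-policy window:

```cuda
#include <cuda_runtime.h>

// Pin a frequently reused buffer into the persisting portion of L2.
// 'lut' and 'lut_bytes' are hypothetical: a device-resident lookup table reused by many kernels.
void enable_l2_persistence(cudaStream_t stream, void* lut, size_t lut_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Carve out the maximum persisting portion of L2 (device-wide setting)
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = lut;
    attr.accessPolicyWindow.num_bytes = lut_bytes;   // must not exceed prop.accessPolicyMaxWindowSize
    attr.accessPolicyWindow.hitRatio  = 1.0f;        // treat the whole window as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

Kernels launched into `stream` will then preferentially keep accesses to that window resident in L2.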
5. Global Memory (Slow, Large)
- Location: Off-chip DRAM/HBM
- Size: 24-80 GB (HBM2e/HBM3)
- Latency: ~400-600 cycles
- Bandwidth: 1-2 TB/s
- Scope: Entire GPU and host
Memory Coalescing Patterns
Coalesced Access (Optimal)
```cuda
// Perfect coalescing - consecutive threads access consecutive addresses
__global__ void coalesced_access(float* data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[tid];  // Thread 0→data[0], Thread 1→data[1], etc.
}
```
Strided Access (Poor)
```cuda
// Strided pattern - wastes bandwidth
__global__ void strided_access(float* data, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[tid * stride];  // Non-consecutive access
}
```
Random Access (Worst)
```cuda
// Random pattern - serialized transactions
__global__ void random_access(float* data, int* indices) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[indices[tid]];  // Unpredictable pattern
}
```
Shared Memory Optimization
Bank Conflict Resolution
```cuda
// Bank conflict example - a warp reading down a column of a 32x32 tile
// hits the same bank 32 times (32-way conflict, accesses are serialized)
__shared__ float tile[32][32];
float value = tile[threadIdx.x][0];          // Column access: every thread → bank 0

// Conflict-free access - pad each row by one element so consecutive
// rows start in different banks
__shared__ float tile_padded[32][33];
float padded = tile_padded[threadIdx.x][0];  // Each thread → a different bank
```
Matrix Transpose with Shared Memory
```cuda
#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void transpose_coalesced(float *odata, float *idata, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read from global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < width && (y + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];
        }
    }
    __syncthreads();

    // Transpose indices for the write
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced write to global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < height && (y + j) < width) {
            odata[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
        }
    }
}
```
Memory Access Optimization Techniques
1. Texture Memory (Spatial Locality)
```cuda
// Texture memory for 2D spatial locality
// Note: the texture *reference* API shown here is deprecated; new code should
// create a cudaTextureObject_t and pass it to the kernel as an argument.
texture<float, cudaTextureType2D> texRef;

__global__ void texture_kernel(float* output, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Hardware-optimized 2D cache
        output[y * width + x] = tex2D(texRef, x, y);
    }
}
```
2. Constant Memory (Broadcast)
```cuda
__constant__ float const_data[1024];  // 4 KB of the 64 KB constant memory space

__global__ void const_kernel(float* output) {
    int tid = threadIdx.x;
    // All threads read the same value - broadcast optimization
    output[tid] = const_data[0] * tid;
}
```
3. Unified Memory (Simplified Programming)
```cuda
// Automatic migration between host and device
float *data;
cudaMallocManaged(&data, size);

// Access from both CPU and GPU
data[0] = 1.0f;                     // CPU write
kernel<<<grid, block>>>(data);      // GPU access
cudaDeviceSynchronize();
float result = data[0];             // CPU read
```
Performance Analysis Tools
Memory Bandwidth Calculation
Effective Bandwidth = (Bytes Read + Bytes Written) / Kernel Time
Bandwidth Efficiency = (Effective Bandwidth / Theoretical Peak Bandwidth) × 100%
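In practice the byte count comes from the kernel's known read/write volume and the time from CUDA events. A minimal sketch, assuming a simple copy kernel that reads and writes `n` floats (all names are placeholders):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void copy_kernel(float* dst, const float* src, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) dst[tid] = src[tid];
}

void measure_bandwidth(float* dst, const float* src, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(dst, src, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = 2.0 * n * sizeof(float);          // n floats read + n floats written
    printf("Effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```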
Profiling Metrics
```bash
# Nsight Compute memory metrics
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum,\
smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,\
smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct \
./application
```
Memory Optimization Checklist
Design Phase
- Identify memory access patterns
- Calculate memory bandwidth requirements
- Plan data layout for coalescing
- Determine shared memory usage
Implementation Phase
- Implement coalesced access patterns
- Use shared memory for data reuse
- Avoid bank conflicts with padding
- Minimize register pressure
- Use appropriate memory types (texture, constant)
Optimization Phase
- Profile memory transactions
- Analyze bandwidth utilization
- Identify and fix inefficient patterns
- Tune L1/shared memory configuration
- Consider memory prefetching (a `cudaMemPrefetchAsync` sketch follows this list)
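For managed memory, prefetching is a one-call request. A hedged sketch using `cudaMemPrefetchAsync`; `N`, `kernel`, and the launch configuration are placeholders carried over from the unified-memory example above:

```cuda
float *data;
size_t bytes = N * sizeof(float);                    // N: placeholder problem size
cudaMallocManaged(&data, bytes);

int device = 0;
cudaStream_t stream;
cudaStreamCreate(&stream);

// Migrate pages to the GPU before the kernel touches them (avoids on-demand page faults)
cudaMemPrefetchAsync(data, bytes, device, stream);
kernel<<<grid, block, 0, stream>>>(data);            // placeholder kernel and launch config

// Bring results back to the host ahead of the CPU read
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);
cudaStreamSynchronize(stream);
```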
Advanced Memory Patterns
Warp-level Memory Operations
```cuda
// Warp shuffle for register-level data exchange
__global__ void warp_reduce(const float* in, float* out) {
    float value = in[threadIdx.x];

    // Warp-level reduction without shared memory
    for (int mask = 16; mask > 0; mask >>= 1) {
        value += __shfl_xor_sync(0xffffffff, value, mask);
    }

    // Lane 0 of each warp writes that warp's sum to a separate output array
    // (writing back into 'in' would race with other warps' reads)
    if (threadIdx.x % 32 == 0) {
        out[threadIdx.x / 32] = value;
    }
}
```
Memory-level Parallelism
```cuda
// Increase memory-level parallelism with unrolling
__global__ void mlp_kernel(float* __restrict__ out, const float* __restrict__ in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Process multiple elements per thread; assumes n is a multiple of 4
    // and 'in' is 16-byte aligned (required for float4 loads)
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = tid; i < n / 4; i += stride) {
        // Issue multiple independent loads per iteration
        float4 val = reinterpret_cast<const float4*>(in)[i];
        sum0 += val.x; sum1 += val.y; sum2 += val.z; sum3 += val.w;
    }
    out[tid] = sum0 + sum1 + sum2 + sum3;
}
```
Best Practices Summary
Do's ✅
- Coalesce global memory accesses
- Use shared memory for data reuse
- Pad shared memory arrays to avoid bank conflicts
- Batch small transfers into larger ones (see the pinned-memory sketch after this list)
- Use vector loads (float2, float4) when possible
- Profile and measure actual bandwidth
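To illustrate the batching item above, here is a minimal sketch (function name, buffer layout, and sizes are illustrative assumptions): many small host buffers are packed into one pinned staging buffer and sent in a single asynchronous copy instead of thousands of tiny transfers.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Pack many small host buffers into one pinned staging buffer,
// then issue a single large async copy instead of many tiny ones.
void batched_upload(float** src_chunks, int num_chunks, int chunk_elems,
                    float* d_dst, cudaStream_t stream) {
    size_t total = (size_t)num_chunks * chunk_elems * sizeof(float);

    float* h_staging;
    cudaMallocHost(&h_staging, total);               // pinned (page-locked) host memory

    for (int i = 0; i < num_chunks; ++i)             // CPU-side packing
        memcpy(h_staging + (size_t)i * chunk_elems, src_chunks[i],
               chunk_elems * sizeof(float));

    // One large transfer keeps the bus near peak; pinned memory enables a true async copy
    cudaMemcpyAsync(d_dst, h_staging, total, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_staging);
}
```

Pinned memory is also what allows `cudaMemcpyAsync` to overlap with kernel execution; pageable memory forces an extra staging copy inside the driver.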
Don'ts ❌
- Random memory access patterns
- Excessive register usage causing spills
- Unaligned memory accesses
- Frequent host-device transfers
- Ignoring memory hierarchy
- Assuming CPU optimization strategies work on GPU
Performance Impact
Bandwidth Utilization by Pattern
| Access Pattern | Bandwidth Efficiency | Transactions per Warp | Performance |
|---|---|---|---|
| Coalesced | 100% | 1 | Optimal |
| Sequential, misaligned | 80-100% | 1-2 | Good |
| Strided (stride = 2) | 50% | 2 | Poor |
| Strided (stride = 32) | ~3% | 32 | Very poor |
| Random | 3-10% | Up to 32 | Worst |
Conclusion
GPU memory hierarchy optimization is essential for achieving peak performance. Focus on:
- Coalesced access patterns for global memory
- Shared memory for data reuse and communication
- Avoiding bank conflicts through padding
- Proper occupancy to hide memory latency
- Profiling and measurement to validate optimizations
Master these concepts to unlock the full potential of GPU computing and achieve order-of-magnitude performance improvements in your applications.