GPU Memory Hierarchy & Optimization


Master GPU memory hierarchy from registers to global memory, understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance


Interactive GPU Architecture Explorer

Explore modern GPU architecture with interactive 3D visualization, memory access patterns, and kernel configuration:

[Interactive 3D view of the GPU architecture, memory access patterns, and execution model. The explorer's per-SM detail panel lists, for the selected SM: 64 CUDA cores, 4 Tensor cores, 2048 maximum resident threads, and 164 KB of shared memory.]

GPU Architecture Key Concepts

Streaming Multiprocessor (SM)

The fundamental processing unit containing CUDA cores, Tensor cores, RT cores, schedulers, and memory.

Warp

A group of 32 threads that execute in lockstep (SIMT). The warp is the basic scheduling unit in GPU execution.

Memory Coalescing

Combining memory accesses from multiple threads into fewer transactions for optimal bandwidth.

Occupancy

Ratio of active warps to maximum warps per SM. Higher occupancy helps hide memory latency.
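
Occupancy can be estimated directly from the runtime: ask how many blocks of a kernel fit on one SM and divide the resulting warps by the SM's maximum. The sketch below assumes a hypothetical my_kernel and a 256-thread block; cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaGetDeviceProperties are standard runtime API.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* ... */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;      // threads per block
    int activeBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocks, my_kernel, blockSize, 0 /* dynamic shared memory */);

    float activeWarps = activeBlocks * blockSize / 32.0f;
    float maxWarps    = prop.maxThreadsPerMultiProcessor / 32.0f;
    printf("Theoretical occupancy: %.0f%%\n", 100.0f * activeWarps / maxWarps);
    return 0;
}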

GPU Memory Hierarchy Overview

Modern GPUs feature a complex memory hierarchy designed to maximize throughput for parallel workloads. Understanding this hierarchy is crucial for achieving peak performance in GPU applications.

Memory Types and Characteristics

1. Registers (Fastest, Private)

  • Location: On-chip, per thread
  • Size: 256 KB per SM (65,536 × 32-bit registers)
  • Latency: ~0 cycles (effectively immediate)
  • Bandwidth: ~8 TB/s
  • Scope: Private to each thread

Key Points:

  • Fastest storage in GPU
  • Compiler-managed allocation
  • Spilling to local memory hurts performance (see the compiler check sketched below)
  • Limited to 255 registers per thread
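
One practical way to keep register pressure under control is to ask the compiler how many registers a kernel uses and whether it spills, and to cap registers per thread when occupancy matters more. The sketch below is illustrative: the kernel name and launch-bounds values are placeholders, while the -Xptxas -v flag and the __launch_bounds__ qualifier are standard nvcc/CUDA features.

// Compile with verbose PTXAS output to see register counts and spills;
// "spill stores/loads" in the output indicate traffic to local memory:
//   nvcc -O3 -Xptxas -v kernel.cu
//
// Optionally cap registers per thread so more warps fit on each SM.
// 256 threads per block and 4 minimum blocks per SM are illustrative values.
__global__ void __launch_bounds__(256, 4)
bounded_kernel(float* out, const float* in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = in[tid] * 2.0f;
}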

2. Shared Memory (Fast, Shared within Block)

  • Location: On-chip, per SM
  • Size: 48-228 KB per SM (configurable)
  • Latency: ~20-30 cycles
  • Bandwidth: ~4 TB/s
  • Scope: Shared within a thread block

Optimization Strategies:

  • Use for data reuse within blocks
  • Avoid bank conflicts (32 banks, 4-byte width)
  • Implement tiling algorithms (see the stencil sketch after this list)
  • Coordinate thread access patterns
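
To make the data-reuse idea concrete, here is a minimal 1D stencil sketch: each block stages a tile of the input plus a halo in shared memory so neighboring threads reuse staged elements instead of re-reading global memory. RADIUS, BLOCK_SIZE, and the kernel name are illustrative, and the kernel assumes it is launched with blockDim.x == BLOCK_SIZE.

#define RADIUS 3
#define BLOCK_SIZE 256

__global__ void stencil_1d(float* out, const float* in, int n) {
    __shared__ float tile[BLOCK_SIZE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lid = threadIdx.x + RADIUS;                    // index into the shared tile

    // Stage this block's elements plus a halo of RADIUS on each side
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + BLOCK_SIZE;
        tile[lid - RADIUS]     = (left >= 0 && left < n) ? in[left]  : 0.0f;
        tile[lid + BLOCK_SIZE] = (right < n)             ? in[right] : 0.0f;
    }
    __syncthreads();   // all staging complete before any thread reads the tile

    if (gid >= n) return;

    // Each staged element is reused by up to 2 * RADIUS + 1 threads
    float sum = 0.0f;
    for (int offset = -RADIUS; offset <= RADIUS; ++offset) {
        sum += tile[lid + offset];
    }
    out[gid] = sum;
}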

3. L1 Cache (Fast, Automatic)

  • Location: On-chip, per SM
  • Size: 128 KB (combined with shared memory)
  • Latency: ~30-40 cycles
  • Bandwidth: ~4 TB/s
  • Scope: Per SM; caches global and local memory

Configuration Options:

// Configure the L1 / shared memory split for a kernel
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);  // favor shared memory
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);      // favor L1 cache
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferEqual);   // balanced split
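
On recent architectures the same intent is usually expressed per kernel as a shared-memory carveout hint. The line below is a sketch: the kernel name and the 50% value are illustrative, while cudaFuncSetAttribute and cudaFuncAttributePreferredSharedMemoryCarveout are standard runtime API.

// Request roughly half of the unified L1/shared storage as shared memory;
// the carveout is a hint expressed as a percentage of the maximum.
cudaFuncSetAttribute(kernel,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     50 /* percent */);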

4. L2 Cache (Medium, GPU-wide)

  • Location: On-chip, shared across the GPU
  • Size: 40-60 MB (A100: 40 MB, H100: 50 MB)
  • Latency: ~200 cycles
  • Bandwidth: ~2-3 TB/s
  • Scope: All SMs, coherent

Features:

  • Caches all memory accesses
  • Persistent across kernel launches
  • Supports atomic operations
  • Hardware-managed coherency
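
Recent CUDA releases also let you set aside part of L2 for data that should persist across kernel launches. The sketch below is illustrative: stream and d_lookup_table are placeholder names, while cudaDeviceSetLimit, cudaStreamSetAttribute, and the access-policy-window attribute are standard runtime API.

// Reserve a persisting region of L2 and mark a buffer as a persisting window
size_t persistBytes = 4 * 1024 * 1024;                        // illustrative size
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, persistBytes);

cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = d_lookup_table;           // hypothetical device buffer
attr.accessPolicyWindow.num_bytes = persistBytes;
attr.accessPolicyWindow.hitRatio  = 1.0f;                     // fraction of accesses treated as persisting
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);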

5. Global Memory (Slow, Large)

  • Location: Off-chip DRAM / HBM
  • Size: 24-80 GB (HBM2e / HBM3)
  • Latency: ~400-600 cycles
  • Bandwidth: 1-2 TB/s
  • Scope: Entire GPU, visible to the host

Memory Coalescing Patterns

Coalesced Access (Optimal)

// Perfect coalescing - consecutive threads access consecutive addresses
__global__ void coalesced_access(float* data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[tid];   // Thread 0 → data[0], Thread 1 → data[1], etc.
}

Strided Access (Poor)

// Strided pattern - wastes bandwidth
__global__ void strided_access(float* data, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[tid * stride];   // Non-consecutive accesses
}

Random Access (Worst)

// Random pattern - serialized transactions
__global__ void random_access(float* data, int* indices) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[indices[tid]];   // Unpredictable pattern
}
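
When an irregular gather is unavoidable, routing the loads through the read-only data cache can soften the penalty. The kernel below is a sketch with an illustrative name; const __restrict__ and __ldg() are standard CUDA.

// Irregular gather served through the read-only (texture) cache path
__global__ void gather_ldg(float* out, const float* __restrict__ data,
                           const int* __restrict__ indices, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int idx  = __ldg(&indices[tid]);   // cached read of the index
        out[tid] = __ldg(&data[idx]);      // cached read of the gathered value
    }
}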

Shared Memory Optimization

Bank Conflict Resolution

// Bank conflict: a warp's 32 threads map onto only 4 distinct banks
__shared__ float shared[32];
float value = shared[threadIdx.x % 4];    // 8-way bank conflict!

// Conflict-free: consecutive threads hit consecutive banks
float value2 = shared[threadIdx.x];       // each thread → a different bank

// For 2D tiles read by column, pad the row so that column elements
// fall into different banks (see the transpose example below)
__shared__ float tile[32][33];            // 33 instead of 32: one column of padding
float col = tile[threadIdx.x][0];         // stride of 33 words → no conflicts

Matrix Transpose with Shared Memory

#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void transpose_coalesced(float *odata, float *idata,
                                    int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read from global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < width && (y + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];
        }
    }
    __syncthreads();

    // Transpose indices for the write
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced write to global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < height && (y + j) < width) {
            odata[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
        }
    }
}
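
For completeness, a typical launch configuration for this kernel uses a 32×8 thread block and one 32×32 tile per block. The grid arithmetic below is a sketch and assumes device buffers d_out and d_in have already been allocated.

// Illustrative launch: one 32x32 tile per block, 32x8 threads per block
dim3 block(TILE_DIM, BLOCK_ROWS);
dim3 grid((width  + TILE_DIM - 1) / TILE_DIM,
          (height + TILE_DIM - 1) / TILE_DIM);
transpose_coalesced<<<grid, block>>>(d_out, d_in, width, height);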

Memory Access Optimization Techniques

1. Texture Memory (Spatial Locality)

// Texture reads are served through a cache optimized for 2D spatial locality.
// The legacy texture<...> reference API is deprecated, so the kernel takes a
// texture object instead.
__global__ void texture_kernel(cudaTextureObject_t tex, float* output,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Hardware-optimized 2D cache; +0.5f addresses the texel center
        output[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}
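
A texture object has to be created on the host before launch. The setup below is a hedged sketch: d_in, d_out, width, and height are placeholders, and in real code d_in would be allocated with cudaMallocPitch so the pitch meets the device's alignment requirement; the descriptor fields themselves are standard runtime API.

// Bind a pitch-linear 2D buffer to a texture object (sketch)
cudaResourceDesc resDesc = {};
resDesc.resType                  = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr       = d_in;
resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
resDesc.res.pitch2D.width        = width;
resDesc.res.pitch2D.height       = height;
resDesc.res.pitch2D.pitchInBytes = width * sizeof(float);   // use the pitch from cudaMallocPitch in real code

cudaTextureDesc texDesc = {};
texDesc.readMode       = cudaReadModeElementType;
texDesc.filterMode     = cudaFilterModePoint;
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.addressMode[1] = cudaAddressModeClamp;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
texture_kernel<<<grid, block>>>(tex, d_out, width, height);
cudaDestroyTextureObject(tex);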

2. Constant Memory (Broadcast)

__constant__ float const_data[1024];   // lives in the 64 KB constant memory space

__global__ void const_kernel(float* output) {
    int tid = threadIdx.x;
    // All threads read the same address - served as a single broadcast
    output[tid] = const_data[0] * tid;
}
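
Constant memory is populated from the host before launch. A minimal sketch using cudaMemcpyToSymbol (a standard runtime call) follows; h_data and d_output are hypothetical host and device buffers.

// Copy host data into the __constant__ array, then launch
float h_data[1024];
// ... fill h_data ...
cudaMemcpyToSymbol(const_data, h_data, sizeof(h_data));
const_kernel<<<1, 256>>>(d_output);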

3. Unified Memory (Simplified Programming)

// Automatic migration between host and device
float *data;
cudaMallocManaged(&data, size);

// Accessible from both CPU and GPU
data[0] = 1.0f;                    // CPU write
kernel<<<grid, block>>>(data);     // GPU access
cudaDeviceSynchronize();
float result = data[0];            // CPU read
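
Migration faults can be hidden by prefetching managed memory before it is needed. The sketch below uses cudaMemPrefetchAsync and cudaMemAdvise, both standard runtime calls; the device id and stream are placeholders.

// Prefetch the managed buffer to the GPU before the kernel runs,
// and hint that it will mostly be read on the device
int device = 0;
cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, device);
cudaMemPrefetchAsync(data, size, device, stream);            // migrate to the GPU
kernel<<<grid, block, 0, stream>>>(data);
cudaMemPrefetchAsync(data, size, cudaCpuDeviceId, stream);   // migrate back to the CPU
cudaStreamSynchronize(stream);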

Performance Analysis Tools

Memory Bandwidth Calculation

Effective Bandwidth = (Bytes Read + Bytes Written) / Elapsed Time
Bandwidth Efficiency = (Effective Bandwidth / Theoretical Peak Bandwidth) × 100%
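
In practice the numerator and denominator come from the kernel itself: time it with CUDA events and count the bytes it moves. The sketch below assumes a copy-style kernel that reads and writes n floats, so copy_kernel, d_out, d_in, and n are illustrative.

// Measure effective bandwidth of a kernel that reads and writes n floats
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
copy_kernel<<<grid, block>>>(d_out, d_in, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

double bytes = 2.0 * n * sizeof(float);          // one read + one write per element
double gbps  = bytes / (ms * 1e-3) / 1e9;
printf("Effective bandwidth: %.1f GB/s\n", gbps);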

Profiling Metrics

# Nsight Compute memory metrics
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum,smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct ./application

Memory Optimization Checklist

Design Phase

  • Identify memory access patterns
  • Calculate memory bandwidth requirements
  • Plan data layout for coalescing
  • Determine shared memory usage

Implementation Phase

  • Implement coalesced access patterns
  • Use shared memory for data reuse
  • Avoid bank conflicts with padding
  • Minimize register pressure
  • Use appropriate memory types (texture, constant)

Optimization Phase

  • Profile memory transactions
  • Analyze bandwidth utilization
  • Identify and fix inefficient patterns
  • Tune L1/shared memory configuration
  • Consider memory prefetching

Advanced Memory Patterns

Warp-level Memory Operations

// Warp shuffle for register-level data exchange
__global__ void warp_reduce(float* data) {
    float value = data[threadIdx.x];

    // Warp-level reduction without shared memory
    for (int mask = 16; mask > 0; mask >>= 1) {
        value += __shfl_xor_sync(0xffffffff, value, mask);
    }

    if (threadIdx.x % 32 == 0) {
        data[threadIdx.x / 32] = value;   // lane 0 of each warp writes the warp's sum
    }
}

Memory-level Parallelism

// Increase memory-level parallelism with vectorized loads and unrolled accumulators
__global__ void mlp_kernel(float* __restrict__ out,
                           const float* __restrict__ in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Process multiple elements per thread; assumes n is a multiple of 4
    // and `in` is 16-byte aligned (required for float4 loads)
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = tid; i < n / 4; i += stride) {
        // One 128-bit load issues four elements at once
        float4 val = reinterpret_cast<const float4*>(in)[i];
        sum0 += val.x; sum1 += val.y; sum2 += val.z; sum3 += val.w;
    }
    out[tid] = sum0 + sum1 + sum2 + sum3;
}

Best Practices Summary

Do's ✅

  • Coalesce global memory accesses
  • Use shared memory for data reuse
  • Pad shared memory arrays to avoid bank conflicts
  • Batch small transfers into larger ones
  • Use vector loads (float2, float4) when possible
  • Profile and measure actual bandwidth

Don'ts ❌

  • Random memory access patterns
  • Excessive register usage causing spills
  • Unaligned memory accesses
  • Frequent host-device transfers
  • Ignoring memory hierarchy
  • Assuming CPU optimization strategies work on GPU

Performance Impact

Bandwidth Utilization by Pattern

Access Pattern            Efficiency   Transactions   Performance
Coalesced                 100%         1              Optimal
Sequential, misaligned    80-100%      1-2            Good
Strided (stride = 2)      50%          2              Poor
Strided (stride = 32)     ~3%          32             Very poor
Random                    3-10%        32             Worst

Conclusion

GPU memory hierarchy optimization is essential for achieving peak performance. Focus on:

  1. Coalesced access patterns for global memory
  2. Shared memory for data reuse and communication
  3. Avoiding bank conflicts through padding
  4. Proper occupancy to hide memory latency
  5. Profiling and measurement to validate optimizations

Master these concepts to unlock the full potential of GPU computing and achieve order-of-magnitude performance improvements in your applications.
