GPU Memory Hierarchy & Optimization


Master GPU memory hierarchy from registers to global memory, understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance


Interactive GPU Architecture Explorer

Explore modern GPU architecture with interactive 3D visualization, memory access patterns, and kernel configuration:

[Interactive 3D view of the GPU architecture, memory access patterns, and execution model. The explorer's per-SM detail panel lists, for the selected SM: 64 CUDA cores, 4 Tensor cores, 2048 maximum resident threads, and 164 KB of shared memory.]

GPU Architecture Key Concepts

Streaming Multiprocessor (SM)

The fundamental processing unit containing CUDA cores, Tensor cores, RT cores, schedulers, and memory.

Warp

A group of 32 threads that execute in lockstep (SIMT). The warp is the basic scheduling unit in GPU execution.

Memory Coalescing

Combining memory accesses from multiple threads into fewer transactions for optimal bandwidth.

Occupancy

Ratio of active warps to maximum warps per SM. Higher occupancy helps hide memory latency.
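
Occupancy can be estimated directly from the runtime: ask how many blocks of a kernel fit on one SM and divide the resulting warps by the SM's maximum. The sketch below assumes a hypothetical my_kernel and a 256-thread block; cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaGetDeviceProperties are standard runtime API.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* ... */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;      // threads per block
    int activeBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocks, my_kernel, blockSize, 0 /* dynamic shared memory */);

    float activeWarps = activeBlocks * blockSize / 32.0f;
    float maxWarps    = prop.maxThreadsPerMultiProcessor / 32.0f;
    printf("Theoretical occupancy: %.0f%%\n", 100.0f * activeWarps / maxWarps);
    return 0;
}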

GPU Memory Hierarchy Overview

Modern GPUs feature a complex memory hierarchy designed to maximize throughput for parallel workloads. Understanding this hierarchy is crucial for achieving peak performance in GPU applications.

Memory Types and Characteristics

1. Registers (Fastest, Private)

  • Location: On-chip, per thread
  • Size: 256 KB per SM (65,536 × 32-bit registers)
  • Latency: ~0 cycles (effectively immediate)
  • Bandwidth: ~8 TB/s
  • Scope: Private to each thread

Key Points:

  • Fastest storage in GPU
  • Compiler-managed allocation
  • Spilling to local memory hurts performance (see the compiler check sketched below)
  • Limited to 255 registers per thread
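
One practical way to keep register pressure under control is to ask the compiler how many registers a kernel uses and whether it spills, and to cap registers per thread when occupancy matters more. The sketch below is illustrative: the kernel name and launch-bounds values are placeholders, while the -Xptxas -v flag and the __launch_bounds__ qualifier are standard nvcc/CUDA features.

// Compile with verbose PTXAS output to see register counts and spills;
// "spill stores/loads" in the output indicate traffic to local memory:
//   nvcc -O3 -Xptxas -v kernel.cu
//
// Optionally cap registers per thread so more warps fit on each SM.
// 256 threads per block and 4 minimum blocks per SM are illustrative values.
__global__ void __launch_bounds__(256, 4)
bounded_kernel(float* out, const float* in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = in[tid] * 2.0f;
}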

2. Shared Memory (Fast, Shared within Block)

  • Location: On-chip, per SM
  • Size: 48-228 KB per SM (configurable)
  • Latency: ~20-30 cycles
  • Bandwidth: ~4 TB/s
  • Scope: Shared within a thread block

Optimization Strategies:

  • Use for data reuse within blocks
  • Avoid bank conflicts (32 banks, 4-byte width)
  • Implement tiling algorithms (see the stencil sketch after this list)
  • Coordinate thread access patterns
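
To make the data-reuse idea concrete, here is a minimal 1D stencil sketch: each block stages a tile of the input plus a halo in shared memory so neighboring threads reuse staged elements instead of re-reading global memory. RADIUS, BLOCK_SIZE, and the kernel name are illustrative, and the kernel assumes it is launched with blockDim.x == BLOCK_SIZE.

#define RADIUS 3
#define BLOCK_SIZE 256

__global__ void stencil_1d(float* out, const float* in, int n) {
    __shared__ float tile[BLOCK_SIZE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lid = threadIdx.x + RADIUS;                    // index into the shared tile

    // Stage this block's elements plus a halo of RADIUS on each side
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + BLOCK_SIZE;
        tile[lid - RADIUS]     = (left >= 0 && left < n) ? in[left]  : 0.0f;
        tile[lid + BLOCK_SIZE] = (right < n)             ? in[right] : 0.0f;
    }
    __syncthreads();   // all staging complete before any thread reads the tile

    if (gid >= n) return;

    // Each staged element is reused by up to 2 * RADIUS + 1 threads
    float sum = 0.0f;
    for (int offset = -RADIUS; offset <= RADIUS; ++offset) {
        sum += tile[lid + offset];
    }
    out[gid] = sum;
}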

3. L1 Cache (Fast, Automatic)

  • Location: On-chip, per SM
  • Size: 128 KB (combined with shared memory)
  • Latency: ~30-40 cycles
  • Bandwidth: ~4 TB/s
  • Scope: Per SM; caches global and local memory

Configuration Options:

// Configure the L1 / shared memory split for a kernel
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);  // favor shared memory
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);      // favor L1 cache
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferEqual);   // balanced split
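
On recent architectures the same intent is usually expressed per kernel as a shared-memory carveout hint. The line below is a sketch: the kernel name and the 50% value are illustrative, while cudaFuncSetAttribute and cudaFuncAttributePreferredSharedMemoryCarveout are standard runtime API.

// Request roughly half of the unified L1/shared storage as shared memory;
// the carveout is a hint expressed as a percentage of the maximum.
cudaFuncSetAttribute(kernel,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     50 /* percent */);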

4. L2 Cache (Medium, GPU-wide)

  • Location: On-chip, shared across the GPU
  • Size: 40-60 MB (A100: 40 MB, H100: 50 MB)
  • Latency: ~200 cycles
  • Bandwidth: ~2-3 TB/s
  • Scope: All SMs, coherent

Features:

  • Caches all memory accesses
  • Persistent across kernel launches
  • Supports atomic operations
  • Hardware-managed coherency
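
Recent CUDA releases also let you set aside part of L2 for data that should persist across kernel launches. The sketch below is illustrative: stream and d_lookup_table are placeholder names, while cudaDeviceSetLimit, cudaStreamSetAttribute, and the access-policy-window attribute are standard runtime API.

// Reserve a persisting region of L2 and mark a buffer as a persisting window
size_t persistBytes = 4 * 1024 * 1024;                        // illustrative size
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, persistBytes);

cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = d_lookup_table;           // hypothetical device buffer
attr.accessPolicyWindow.num_bytes = persistBytes;
attr.accessPolicyWindow.hitRatio  = 1.0f;                     // fraction of accesses treated as persisting
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);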

5. Global Memory (Slow, Large)

  • Location: Off-chip DRAM / HBM
  • Size: 24-80 GB (HBM2e / HBM3)
  • Latency: ~400-600 cycles
  • Bandwidth: 1-2 TB/s
  • Scope: Entire GPU, visible to the host

Memory Coalescing Patterns

Coalesced Access (Optimal)

// Perfect coalescing - consecutive threads access consecutive addresses
__global__ void coalesced_access(float* data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[tid];   // Thread 0 → data[0], Thread 1 → data[1], etc.
}

Strided Access (Poor)

// Strided pattern - wastes bandwidth
__global__ void strided_access(float* data, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[tid * stride];   // Non-consecutive accesses
}

Random Access (Worst)

// Random pattern - serialized transactions
__global__ void random_access(float* data, int* indices) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[indices[tid]];   // Unpredictable pattern
}
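
When an irregular gather is unavoidable, routing the loads through the read-only data cache can soften the penalty. The kernel below is a sketch with an illustrative name; const __restrict__ and __ldg() are standard CUDA.

// Irregular gather served through the read-only (texture) cache path
__global__ void gather_ldg(float* out, const float* __restrict__ data,
                           const int* __restrict__ indices, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int idx  = __ldg(&indices[tid]);   // cached read of the index
        out[tid] = __ldg(&data[idx]);      // cached read of the gathered value
    }
}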

Shared Memory Optimization

Bank Conflict Resolution

// Bank conflict: a warp's 32 threads map onto only 4 distinct banks
__shared__ float shared[32];
float value = shared[threadIdx.x % 4];    // 8-way bank conflict!

// Conflict-free: consecutive threads hit consecutive banks
float value2 = shared[threadIdx.x];       // each thread → a different bank

// For 2D tiles read by column, pad the row so that column elements
// fall into different banks (see the transpose example below)
__shared__ float tile[32][33];            // 33 instead of 32: one column of padding
float col = tile[threadIdx.x][0];         // stride of 33 words → no conflicts

Matrix Transpose with Shared Memory

#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void transpose_coalesced(float *odata, float *idata,
                                    int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read from global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < width && (y + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];
        }
    }
    __syncthreads();

    // Transpose indices for the write
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced write to global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < height && (y + j) < width) {
            odata[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
        }
    }
}
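
For completeness, a typical launch configuration for this kernel uses a 32×8 thread block and one 32×32 tile per block. The grid arithmetic below is a sketch and assumes device buffers d_out and d_in have already been allocated.

// Illustrative launch: one 32x32 tile per block, 32x8 threads per block
dim3 block(TILE_DIM, BLOCK_ROWS);
dim3 grid((width  + TILE_DIM - 1) / TILE_DIM,
          (height + TILE_DIM - 1) / TILE_DIM);
transpose_coalesced<<<grid, block>>>(d_out, d_in, width, height);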

Memory Access Optimization Techniques

1. Texture Memory (Spatial Locality)

// Texture reads are served through a cache optimized for 2D spatial locality.
// The legacy texture<...> reference API is deprecated, so the kernel takes a
// texture object instead.
__global__ void texture_kernel(cudaTextureObject_t tex, float* output,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Hardware-optimized 2D cache; +0.5f addresses the texel center
        output[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}
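
A texture object has to be created on the host before launch. The setup below is a hedged sketch: d_in, d_out, width, and height are placeholders, and in real code d_in would be allocated with cudaMallocPitch so the pitch meets the device's alignment requirement; the descriptor fields themselves are standard runtime API.

// Bind a pitch-linear 2D buffer to a texture object (sketch)
cudaResourceDesc resDesc = {};
resDesc.resType                  = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr       = d_in;
resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
resDesc.res.pitch2D.width        = width;
resDesc.res.pitch2D.height       = height;
resDesc.res.pitch2D.pitchInBytes = width * sizeof(float);   // use the pitch from cudaMallocPitch in real code

cudaTextureDesc texDesc = {};
texDesc.readMode       = cudaReadModeElementType;
texDesc.filterMode     = cudaFilterModePoint;
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.addressMode[1] = cudaAddressModeClamp;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
texture_kernel<<<grid, block>>>(tex, d_out, width, height);
cudaDestroyTextureObject(tex);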

2. Constant Memory (Broadcast)

__constant__ float const_data[1024];   // lives in the 64 KB constant memory space

__global__ void const_kernel(float* output) {
    int tid = threadIdx.x;
    // All threads read the same address - served as a single broadcast
    output[tid] = const_data[0] * tid;
}
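
Constant memory is populated from the host before launch. A minimal sketch using cudaMemcpyToSymbol (a standard runtime call) follows; h_data and d_output are hypothetical host and device buffers.

// Copy host data into the __constant__ array, then launch
float h_data[1024];
// ... fill h_data ...
cudaMemcpyToSymbol(const_data, h_data, sizeof(h_data));
const_kernel<<<1, 256>>>(d_output);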

3. Unified Memory (Simplified Programming)

// Automatic migration between host and device
float *data;
cudaMallocManaged(&data, size);

// Accessible from both CPU and GPU
data[0] = 1.0f;                    // CPU write
kernel<<<grid, block>>>(data);     // GPU access
cudaDeviceSynchronize();
float result = data[0];            // CPU read
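
Migration faults can be hidden by prefetching managed memory before it is needed. The sketch below uses cudaMemPrefetchAsync and cudaMemAdvise, both standard runtime calls; the device id and stream are placeholders.

// Prefetch the managed buffer to the GPU before the kernel runs,
// and hint that it will mostly be read on the device
int device = 0;
cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, device);
cudaMemPrefetchAsync(data, size, device, stream);            // migrate to the GPU
kernel<<<grid, block, 0, stream>>>(data);
cudaMemPrefetchAsync(data, size, cudaCpuDeviceId, stream);   // migrate back to the CPU
cudaStreamSynchronize(stream);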

Performance Analysis Tools

Memory Bandwidth Calculation

Effective Bandwidth = (Bytes Read + Bytes Written) / Elapsed Time
Bandwidth Efficiency = (Effective Bandwidth / Theoretical Peak Bandwidth) × 100%
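
In practice the numerator and denominator come from the kernel itself: time it with CUDA events and count the bytes it moves. The sketch below assumes a copy-style kernel that reads and writes n floats, so copy_kernel, d_out, d_in, and n are illustrative.

// Measure effective bandwidth of a kernel that reads and writes n floats
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
copy_kernel<<<grid, block>>>(d_out, d_in, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

double bytes = 2.0 * n * sizeof(float);          // one read + one write per element
double gbps  = bytes / (ms * 1e-3) / 1e9;
printf("Effective bandwidth: %.1f GB/s\n", gbps);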

Profiling Metrics

# Nsight Compute memory metrics
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum,smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct ./application

Memory Optimization Checklist

Design Phase

  • Identify memory access patterns
  • Calculate memory bandwidth requirements
  • Plan data layout for coalescing
  • Determine shared memory usage

Implementation Phase

  • Implement coalesced access patterns
  • Use shared memory for data reuse
  • Avoid bank conflicts with padding
  • Minimize register pressure
  • Use appropriate memory types (texture, constant)

Optimization Phase

  • Profile memory transactions
  • Analyze bandwidth utilization
  • Identify and fix inefficient patterns
  • Tune L1/shared memory configuration
  • Consider memory prefetching

Advanced Memory Patterns

Warp-level Memory Operations

// Warp shuffle for register-level data exchange
__global__ void warp_reduce(float* data) {
    float value = data[threadIdx.x];

    // Warp-level reduction without shared memory
    for (int mask = 16; mask > 0; mask >>= 1) {
        value += __shfl_xor_sync(0xffffffff, value, mask);
    }

    if (threadIdx.x % 32 == 0) {
        data[threadIdx.x / 32] = value;   // lane 0 of each warp writes the warp's sum
    }
}

Memory-level Parallelism

// Increase memory-level parallelism with vectorized loads and unrolled accumulators
__global__ void mlp_kernel(float* __restrict__ out,
                           const float* __restrict__ in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Process multiple elements per thread; assumes n is a multiple of 4
    // and `in` is 16-byte aligned (required for float4 loads)
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = tid; i < n / 4; i += stride) {
        // One 128-bit load issues four elements at once
        float4 val = reinterpret_cast<const float4*>(in)[i];
        sum0 += val.x; sum1 += val.y; sum2 += val.z; sum3 += val.w;
    }
    out[tid] = sum0 + sum1 + sum2 + sum3;
}

Best Practices Summary

Do's ✅

  • Coalesce global memory accesses
  • Use shared memory for data reuse
  • Pad shared memory arrays to avoid bank conflicts
  • Batch small transfers into larger ones
  • Use vector loads (float2, float4) when possible
  • Profile and measure actual bandwidth

Don'ts ❌

  • Random memory access patterns
  • Excessive register usage causing spills
  • Unaligned memory accesses
  • Frequent host-device transfers
  • Ignoring memory hierarchy
  • Assuming CPU optimization strategies work on GPU

Performance Impact

Bandwidth Utilization by Pattern

Access Pattern            Efficiency   Transactions   Performance
Coalesced                 100%         1              Optimal
Sequential, misaligned    80-100%      1-2            Good
Strided (stride = 2)      50%          2              Poor
Strided (stride = 32)     ~3%          32             Very poor
Random                    3-10%        32             Worst

Conclusion

GPU memory hierarchy optimization is essential for achieving peak performance. Focus on:

  1. Coalesced access patterns for global memory
  2. Shared memory for data reuse and communication
  3. Avoiding bank conflicts through padding
  4. Proper occupancy to hide memory latency
  5. Profiling and measurement to validate optimizations

Master these concepts to unlock the full potential of GPU computing and achieve order-of-magnitude performance improvements in your applications.
