GPU Streaming Multiprocessor (SM)

The fundamental processing unit in modern GPUs - understanding SM architecture

SM Block Diagram


What is a Streaming Multiprocessor (SM)?

The Streaming Multiprocessor (SM) is the fundamental processing unit in modern GPUs. Each GPU contains multiple SMs, and each SM is a highly parallel processor capable of executing hundreds of threads concurrently. Understanding SM architecture is crucial for optimizing GPU applications.

Core Components

1. CUDA Cores (Streaming Processors)

CUDA cores are the basic arithmetic units that execute instructions:

Features:
• Single-precision floating-point (FP32) operations
• Integer arithmetic (INT32)
• Fully pipelined execution
• 128 cores per SM (Ampere architecture)
• Execute one operation per clock cycle

2. Tensor Cores

Specialized units for matrix multiplication and AI workloads:

Capabilities:
• 4×4 matrix multiply-accumulate in a single operation
• Mixed-precision support (FP16, BF16, TF32, INT8, INT4)
• D = A×B + C operation
• 8× throughput vs CUDA cores for matrix operations
• Essential for deep learning training and inference

3. RT Cores

Hardware-accelerated ray tracing units (Turing and later):

Functions:
• Bounding Volume Hierarchy (BVH) traversal
• Ray-triangle intersection testing
• Motion blur acceleration
• Instance transformation
• 10× faster than software ray tracing

4. Warp Schedulers

Control thread execution and instruction dispatch:

Characteristics:
• Manage up to 48 warps (1536 threads) per SM
• Dual-issue capability (2 instructions per cycle)
• Zero-overhead context switching
• Hide memory latency by switching between warps
• 4 schedulers per SM in modern architectures

Memory Hierarchy

Register File

• Size: 256 KB per SM (65,536 × 32-bit registers)
• Access: effectively immediate (no added load latency)
• Scope: Private per thread
• Usage: Local variables, intermediate results
• Spills to local memory if the per-thread register budget is exceeded
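
Register pressure can be bounded at compile time, which helps avoid the spills mentioned above. A minimal sketch with an illustrative kernel, using __launch_bounds__ (the -maxrregcount compiler flag has a similar, program-wide effect):

// Hypothetical kernel. __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM)
// asks the compiler to cap register usage so that 4 blocks of 256 threads can
// be resident on one SM, spilling to local memory only if unavoidable.
__global__ void __launch_bounds__(256, 4)
scaleKernel(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0f;   // few live values -> low register pressure
    }
}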

Shared Memory

• Size: 128 KB per SM (configurable with L1)
• Access: ~20 cycles
• Scope: Shared within thread block
• Usage: Inter-thread communication, data reuse
• Bank conflicts can reduce performance
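
To sidestep the bank conflicts mentioned above, a common trick is to pad shared-memory tiles by one element. A minimal sketch of a tile transpose (kernel name illustrative; it assumes a square matrix whose side is a multiple of 32 and a 32×32 thread block):

#define TILE 32

// Without the extra column, a warp reading a column of `tile` would hit the
// same shared-memory bank 32 times and serialize.
__global__ void transposeTile(float* out, const float* in, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // swap block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}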

L1 Cache

• Size: 128 KB (combined with shared memory)
• Access: ~30 cycles
• Scope: Per SM
• Usage: Caches global/local memory accesses
• Configurable split (shared/L1): 0/128, 8/120, 16/112, ... 128/0 KB
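
The split can be requested per kernel through the runtime API. A minimal sketch, with an illustrative kernel, that asks for roughly half of the unified storage as shared memory; the value is a hint and the driver rounds it to the nearest supported carveout:

#include <cuda_runtime.h>

// Hypothetical kernel that benefits from a larger shared-memory slice.
__global__ void tileKernel(float* data)
{
    extern __shared__ float tile[];          // dynamically sized shared memory
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

void configureCarveout()
{
    // Prefer ~50% of the unified L1/shared storage as shared memory.
    cudaFuncSetAttribute(tileKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);
}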

L2 Cache

• Size: 40-60 MB (entire GPU)
• Access: ~200 cycles
• Scope: All SMs
• Usage: Shared across the GPU
• Coherent across all SMs

Global Memory

• Size: 24-80 GB (HBM2e/HBM3)
• Access: ~500 cycles
• Scope: Entire GPU
• Bandwidth: 1-2 TB/s
• Coalesced access patterns critical
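
As an illustration of the last point, the sketch below (kernel name and stride parameter are illustrative) reads with a configurable stride: with stride 1, a warp's 32 loads coalesce into a few wide memory transactions; with a large stride, each thread triggers its own transaction and effective bandwidth collapses.

// Illustrative kernel: `out` must hold at least ceil(n / stride) elements.
__global__ void copyStrided(float* out, const float* in, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) {
        out[i] = in[i * stride];   // stride == 1 -> coalesced, else scattered
    }
}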

Execution Model

Warp Execution

The SM executes threads in groups of 32 called warps:

Warp properties:
• 32 threads execute in SIMD fashion
• Same instruction, different data
• Divergence causes serialization
• Context stored in the register file
• Fast switching between warps
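
A minimal sketch of divergence, with an illustrative kernel: branching on an odd/even lane index splits every warp, so both paths execute serially with half the lanes masked off each time.

__global__ void divergentKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {      // splits every warp: both paths run serially
        data[i] = data[i] * 2.0f;    // even lanes active here
    } else {
        data[i] = data[i] + 1.0f;    // odd lanes active here
    }
    // Branching on a per-warp quantity instead, e.g. (threadIdx.x / 32) % 2,
    // keeps whole warps on one path and avoids the serialization.
}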

Thread Hierarchy

Thread ─► Warp (32 threads) ─► Block (≤1024 threads) ─► Grid
  │            │                      │                  │
  │            │                      │                  └─ Application
  │            │                      └─ Executes on one SM
  │            └─ SIMD execution unit
  └─ Individual execution context
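
In code, each level of the hierarchy is visible through built-in variables. A minimal sketch (kernel and launch parameters are illustrative) that derives a global thread index and launches a grid large enough to cover n elements:

#include <cuda_runtime.h>

__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // position within the grid
    if (i < n) data[i] += 1.0f;
}

void launch(float* d_data, int n)
{
    int block = 256;                       // threads per block (8 warps)
    int grid  = (n + block - 1) / block;   // blocks per grid
    addOne<<<grid, block>>>(d_data, n);
}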

Occupancy

SM occupancy is the ratio of active warps to maximum warps:

Factors affecting occupancy:
• Registers per thread (max 255)
• Shared memory per block
• Threads per block
• Blocks per SM

Formula: Occupancy = Active Warps / Max Warps (48)

Targets:
• 50-75% occupancy for memory-bound kernels
• 25-50% occupancy for compute-bound kernels
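
Occupancy can also be queried at runtime for a concrete kernel and block size. A minimal sketch, assuming a hypothetical myKernel and device 0:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data) { /* ... */ }   // hypothetical kernel

void reportOccupancy(int blockSize)
{
    // How many blocks of this kernel fit on one SM at the given block size?
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel,
                                                  blockSize, 0 /* dyn. smem */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float activeWarps = maxBlocksPerSM * blockSize / 32.0f;
    float maxWarps    = prop.maxThreadsPerMultiProcessor / 32.0f;
    printf("Occupancy: %.0f%%\n", 100.0f * activeWarps / maxWarps);
}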

Performance Optimization

1. Maximize Parallelism

• Launch enough threads to saturate all SMs
• Use an appropriate block size (typically 128-512 threads)
• Balance grid and block dimensions
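
Block size selection can also be delegated to the runtime. A minimal sketch, with an illustrative scale kernel, using cudaOccupancyMaxPotentialBlockSize to pick a block size that favors occupancy:

#include <cuda_runtime.h>

__global__ void scale(float* data, int n)             // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchScale(float* d_data, int n)
{
    // Let the runtime propose an occupancy-friendly block size,
    // then size the grid so every element is covered.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    scale<<<gridSize, blockSize>>>(d_data, n);
}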

2. Memory Access Patterns

• Coalesce global memory accesses
• Use shared memory for data reuse
• Minimize bank conflicts in shared memory
• Leverage the texture cache for spatial locality

3. Instruction Mix

• Balance compute and memory operations
• Use intrinsics for special functions
• Leverage Tensor Cores for matrix operations
• Minimize divergent branches
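
For the intrinsics point, a minimal sketch with illustrative names: __expf and __fdividef run on the SM's special function units and trade a small amount of accuracy for higher throughput than their standard-library counterparts.

// Illustrative kernel: divides each exponentiated input by a precomputed denominator.
__global__ void expNormalize(float* out, const float* in, float denom, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __fdividef(__expf(in[i]), denom);   // fast exp and divide
        // out[i] = expf(in[i]) / denom;             // precise alternative
    }
}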

4. Resource Usage

• Minimize register pressure
• Use shared memory judiciously
• Consider dynamic shared memory allocation
• Profile with Nsight Compute

Evolution Across Generations

Kepler (2012) - SM 3.x

• 192 CUDA cores per SM
• 64 KB shared memory
• Dynamic parallelism introduced

Maxwell (2014) - SM 5.x

• 128 CUDA cores per SM
• Improved power efficiency
• Larger L2 cache

Pascal (2016) - SM 6.x

• 64 CUDA cores per SM
• HBM2 memory support
• Unified memory improvements

Volta (2017) - SM 7.0

• 64 CUDA cores per SM
• First Tensor Cores (8 per SM)
• Independent thread scheduling
• 120 TFLOPS for deep learning

Turing (2018) - SM 7.5

• 64 CUDA cores per SM
• RT Cores for ray tracing
• Concurrent FP32 + INT32 execution
• Mesh shading support

Ampere (2020) - SM 8.x

• 128 FP32 CUDA cores per SM
• 3rd-gen Tensor Cores (sparsity support)
• 2nd-gen RT Cores
• 2× FP32 throughput

Ada Lovelace (2022) - SM 8.9

• 128 CUDA cores per SM
• 4th-gen Tensor Cores (FP8 support)
• 3rd-gen RT Cores (2.5× performance)
• Shader Execution Reordering

Hopper (2022) - SM 9.0

• 128 CUDA cores per SM
• 4th-gen Tensor Cores with Transformer Engine
• 228 KB shared memory per SM
• Thread Block Clusters
• DPX instructions for dynamic programming

Programming Considerations

CUDA Code Example

A tiled matrix multiplication that stages data through shared memory; it assumes N is a multiple of TILE_SIZE and a TILE_SIZE × TILE_SIZE thread block.

#define TILE_SIZE 16

__global__ void matrixMul(float* C, float* A, float* B, int N) {
    // Shared memory for tile-based computation
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;

    float sum = 0.0f;

    // Walk over the tiles of A and B that contribute to this output element
    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Coalesced loads from global memory into shared memory
        As[ty][tx] = A[row * N + t * TILE_SIZE + tx];
        Bs[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];
        __syncthreads();   // Block-level synchronization

        // Compute using shared memory
        #pragma unroll
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();   // Wait before the tiles are overwritten
    }

    // Write result (coalesced)
    C[row * N + col] = sum;
}

Tensor Core Usage

// Tensor Core matrix multiplication using the WMMA API
#include <mma.h>
using namespace nvcuda;

__global__ void tensor_core_gemm(float* C, half* A, half* B) {
    // Declare fragments for 16x16x16 matrix tiles
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Load matrices into fragments (leading dimension 16) and zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::fill_fragment(c_frag, 0.0f);

    // Perform the matrix multiplication on Tensor Cores (one warp per tile)
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Store the FP32 result
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

Best Practices

1. Kernel Design

  • Keep kernels simple and focused
  • Minimize register usage per thread
  • Use templates for compile-time optimization
  • Profile before optimizing

2. Memory Management

  • Prefer shared memory over global memory
  • Use constant memory for read-only data (see the sketch after this list)
  • Implement tiling for large data sets
  • Consider texture memory for 2D/3D data
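
For the constant-memory item above, a minimal sketch (array size, kernel, and helper names are illustrative): every thread reads the same coefficient in a given loop iteration, so the constant cache serves the whole warp with a single broadcast.

#include <cuda_runtime.h>

__constant__ float d_coeffs[4];   // small read-only data in constant memory

__global__ void evalPoly(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i], y = 0.0f;
        for (int k = 0; k < 4; k++)    // all threads read the same coefficient
            y = y * x + d_coeffs[k];   // -> constant-cache broadcast
        out[i] = y;
    }
}

void uploadCoeffs(const float* h_coeffs)
{
    // Copy host data into the constant-memory symbol before launching.
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, 4 * sizeof(float));
}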

3. Occupancy Tuning

  • Use CUDA Occupancy Calculator
  • Experiment with block sizes
  • Balance resource usage
  • Don't always maximize occupancy

4. Advanced Features

  • Leverage cooperative groups for flexible synchronization
  • Use shuffle instructions for warp-level communication (see the sketch after this list)
  • Implement dynamic parallelism when appropriate
  • Utilize unified memory for simplified programming
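
For the shuffle item above, a minimal sketch of a warp-level sum (function name illustrative): values move between lanes through registers, with no shared memory or __syncthreads() required. Cooperative groups expose an equivalent, more structured interface.

// Sums a value across the 32 lanes of a warp; lane 0 ends up with the total.
__device__ float warpReduceSum(float val)
{
    // Each step halves the number of lanes still holding partial sums.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}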

Debugging and Profiling

Tools

• NVIDIA Nsight Compute - Kernel profiling
• NVIDIA Nsight Systems - System-wide analysis
• cuda-memcheck - Memory error detection (deprecated in favor of compute-sanitizer)
• compute-sanitizer - Memory error and race condition detection

Key Metrics

• SM Efficiency - Utilization of SM resources
• Occupancy - Active warps ratio
• Memory Throughput - Bandwidth utilization
• Instruction Throughput - IPC (instructions per cycle)
• Warp Stall Reasons - Bottleneck identification

Conclusion

The Streaming Multiprocessor is the heart of GPU computing, combining massive parallelism with specialized hardware units. Understanding its architecture, execution model, and memory hierarchy is essential for writing efficient GPU code. As GPUs evolve, SMs continue to gain new capabilities while maintaining the fundamental SIMT execution model that makes them powerful accelerators for parallel workloads.
