NUMA Architecture: Non-Uniform Memory Access
Explore NUMA (Non-Uniform Memory Access) architecture: how modern multi-socket systems manage memory locality, and the performance implications of local versus remote memory access.
What is NUMA Architecture?
NUMA (Non-Uniform Memory Access) is a computer memory design used in multiprocessing systems where memory access time depends on the memory location relative to the processor. In NUMA systems, processors can access their own local memory faster than non-local memory (memory local to another processor or memory shared between processors).
This architecture is crucial for modern multi-socket servers and high-performance computing systems where memory bandwidth and latency are critical performance factors.
Interactive NUMA Demonstration
Experience how memory access patterns affect performance in NUMA systems:
[Interactive demo: NUMA access pattern demonstration. Two NUMA nodes, each pairing an 8-core CPU socket (Socket 0, Socket 1) with its own local DRAM. Thread memory access requests either take the fast local path (CPU → Local Memory Controller → Local DRAM) or the remote path (CPU → QPI/UPI → Remote MC → Remote DRAM), about 3.3x slower. Counters track local accesses, remote accesses, and average latency.]
NUMA Architecture: In NUMA systems, each CPU socket has its own local memory controller and DRAM. Accessing local memory is fast (~90ns), but accessing memory attached to another socket requires going through the inter-socket interconnect (QPI/UPI), resulting in higher latency (~300ns). NUMA-aware applications place data close to the threads that use it for optimal performance.
Understanding NUMA Components
1. NUMA Nodes
Each NUMA node typically consists of:
- CPU Socket: One or more processors with multiple cores
- Local Memory: RAM directly attached to the socket's memory controller
- Memory Controller: Manages access to local memory banks
- Local I/O: PCIe lanes and devices attached to this socket
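As a concrete starting point, the sketch below (assuming a Linux system with libnuma installed and the program linked with -lnuma) queries how many nodes exist, how much memory each one owns, and which node the calling CPU belongs to:

// Minimal libnuma topology query (Linux, compile with -lnuma)
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {              // -1 means the kernel has no NUMA support
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    printf("configured NUMA nodes: %d\n", nodes);

    for (int n = 0; n < nodes; n++) {
        long long free_bytes;
        long long total = numa_node_size64(n, &free_bytes);  // DRAM attached to node n
        printf("node %d: %lld MiB total, %lld MiB free\n",
               n, total >> 20, free_bytes >> 20);
    }

    int cpu = sched_getcpu();                 // CPU the calling thread is running on
    printf("running on CPU %d (node %d)\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}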
2. Inter-Socket Interconnect
Modern systems use high-speed interconnects:
- Intel QPI (QuickPath Interconnect): Up to 25.6 GB/s
- Intel UPI (Ultra Path Interconnect): Up to 41.6 GB/s
- AMD Infinity Fabric: Up to 50 GB/s per link
These links enable processors to access remote memory but at higher latency.
3. Memory Access Types
Local Memory Access (~90ns)
CPU Core → L3 Cache → Memory Controller → Local DRAM
Remote Memory Access (~300ns)
CPU Core → L3 Cache → QPI/UPI → Remote Socket → Memory Controller → Remote DRAM
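To observe placement from software, one option on Linux (a sketch, assuming libnuma's numaif.h wrapper and linking with -lnuma) is to ask the kernel which node currently backs a page:

// Query which NUMA node backs an address (Linux, compile with -lnuma)
#include <numaif.h>   // get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    size_t len = 1 << 20;
    char *buf = malloc(len);
    memset(buf, 0, len);                      // first touch: pages are faulted in here

    int node = -1;
    // With MPOL_F_NODE | MPOL_F_ADDR, get_mempolicy() reports the node
    // on which the page containing 'buf' is allocated.
    if (get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) == 0)
        printf("buffer page resides on NUMA node %d\n", node);

    free(buf);
    return 0;
}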
NUMA Performance Characteristics
Latency Comparison
| Access Type | Typical Latency | Relative Performance |
|---|---|---|
| L1 Cache | 1 ns | Baseline |
| L2 Cache | 3 ns | 3x slower than L1 |
| L3 Cache | 10-15 ns | 10-15x slower than L1 |
| Local DRAM | 60-100 ns | 60-100x slower than L1 |
| Remote DRAM | 200-300 ns | 200-300x slower than L1 |
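As a rough, illustrative calculation using the figures above: if 30% of an application's DRAM accesses land on the remote node, the effective memory latency is about 0.7 × 90 ns + 0.3 × 300 ns ≈ 153 ns, already roughly 1.7x worse than the all-local case, before any interconnect contention is counted.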
Bandwidth Implications
While local memory provides full bandwidth (e.g., 100+ GB/s per socket with multiple DDR4-3200 channels), remote memory access is limited by:
- Interconnect bandwidth (25-50 GB/s)
- Contention from other cross-socket traffic
- Protocol overhead
NUMA-Aware Programming
Best Practices
- Thread Affinity: Pin threads to cores near their data
// Linux example using sched_setaffinity (requires _GNU_SOURCE)
#define _GNU_SOURCE
#include <sched.h>

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);                      // core_id: a core on the node that holds the thread's data
sched_setaffinity(0, sizeof(cpuset), &cpuset);  // pid 0 = calling thread
- Memory Allocation: Allocate memory on the correct NUMA node
// Using libnuma (link with -lnuma; release with numa_free(buf, size))
#include <numa.h>
void *buf = numa_alloc_onnode(size, node_id);   // pages backed by node_id's local DRAM
- First-Touch Policy: Memory pages are physically allocated on the node whose CPU first touches them
// Each thread initializes its own slice, so first-touch places the pages on that thread's node
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    initialize_local_data(data[thread_id]);
}
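For programs that are not NUMA-aware internally, placement can also be imposed from the outside; for example, numactl --cpunodebind=0 --membind=0 ./app restricts both the threads and the memory allocations of a process to node 0.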
Common NUMA Pitfalls
- False Sharing: Threads on different nodes writing to the same cache line (see the sketch after this list)
- Remote Memory Allocation: Allocating all memory from one thread
- Thread Migration: OS moving threads between nodes
- Imbalanced Memory Usage: One node's memory fills while others are empty
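The false-sharing pitfall in particular has a purely structural fix: pad or align per-thread data so that writers on different sockets never share a cache line. A minimal sketch, assuming a 64-byte cache line (typical on x86) and an illustrative per_thread_counter type:

// Give each thread's counter its own cache line so writes never contend
#include <stdalign.h>

#define CACHE_LINE 64   // assumed line size; query the hardware in production code

struct per_thread_counter {
    alignas(CACHE_LINE) unsigned long value;          // starts on its own cache line
    char pad[CACHE_LINE - sizeof(unsigned long)];     // fills out the rest of the line
};

struct per_thread_counter counters[64];   // one slot per thread, one cache line per slot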
NUMA in Different Systems
Server Configurations
- 2-Socket Servers: Most common, good balance of cost and performance
- 4-Socket Servers: High-end systems with complex NUMA topology
- 8+ Socket Systems: Specialized HPC systems with multi-hop NUMA
Cloud and Virtualization
- VM NUMA Topology: VMs should respect physical NUMA boundaries
- Container Affinity: Kubernetes and Docker NUMA-aware scheduling
- NUMA Balancing: Automatic kernel optimization (can be disabled)
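On Linux, the automatic balancing mentioned above is exposed through the kernel.numa_balancing sysctl; checking it with sysctl kernel.numa_balancing and setting it to 0 turns off the kernel's page and task migration when an application manages placement itself.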
Tools for NUMA Analysis
System Information
# Show NUMA topology
numactl --hardware

# Display NUMA statistics
numastat

# Show per-node memory usage
numastat -m
Performance Monitoring
# Monitor NUMA performance counters
perf stat -e node-loads,node-load-misses ./application

# Intel PCM for detailed NUMA metrics
pcm-numa
Relationship to Memory Interleaving
While memory interleaving distributes addresses across banks within a node for bandwidth, NUMA distributes memory across nodes for scalability. Modern systems use both:
- Intra-node: Memory interleaving across channels
- Inter-node: NUMA for multi-socket scalability
- Hybrid: Some systems support interleaving across NUMA nodes (with latency trade-offs)
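When cross-node interleaving is actually wanted, for example for one large shared buffer read evenly by all sockets, it can be requested per process with numactl --interleave=all ./app, trading higher average latency for bandwidth spread across every node's memory controllers.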
Best Use Cases
NUMA architecture excels in:
- Database Servers: Partitioning data by NUMA node
- Scientific Computing: Domain decomposition aligned with NUMA
- Big Data Analytics: NUMA-aware data partitioning
- Virtualization Hosts: VM placement based on NUMA topology
Conclusion
NUMA architecture is essential for scaling beyond single-socket systems. Understanding NUMA characteristics—local vs remote memory access, interconnect bandwidth, and proper thread/memory affinity—is crucial for achieving optimal performance in modern multi-socket systems. The 3-4x latency difference between local and remote memory access can significantly impact application performance, making NUMA-aware design a critical consideration for high-performance applications.