NUMA Architecture: Non-Uniform Memory Access

Explore NUMA (Non-Uniform Memory Access) architecture, understanding how modern multi-socket systems manage memory locality and the performance implications of local vs remote memory access.


What is NUMA Architecture?

NUMA (Non-Uniform Memory Access) is a computer memory design used in multiprocessing systems where memory access time depends on the memory location relative to the processor. In NUMA systems, processors can access their own local memory faster than non-local memory (memory local to another processor or memory shared between processors).

This architecture is crucial for modern multi-socket servers and high-performance computing systems where memory bandwidth and latency are critical performance factors.

Interactive NUMA Demonstration

Experience how memory access patterns affect performance in NUMA systems:

[Interactive demo: a two-node NUMA system in which each node pairs an 8-core CPU socket (cores 0-7 and 8-15) with its own memory controller and two 16 GB DDR4-3200 DIMMs, connected by a 25.6 GB/s QPI/UPI link. Threads T0-T7 issue memory requests to the two nodes, and counters track local accesses (fast path, ~90 ns), remote accesses over QPI/UPI (~300 ns, roughly 3.3x slower), and the running average latency.]
NUMA Architecture: In NUMA systems, each CPU socket has its own local memory controller and DRAM. Accessing local memory is fast (~90ns), but accessing memory attached to another socket requires going through the inter-socket interconnect (QPI/UPI), resulting in higher latency (~300ns). NUMA-aware applications place data close to the threads that use it for optimal performance.

Understanding NUMA Components

1. NUMA Nodes

Each NUMA node typically consists of:

  • CPU Socket: One or more processors with multiple cores
  • Local Memory: RAM directly attached to the socket's memory controller
  • Memory Controller: Manages access to local memory banks
  • Local I/O: PCIe lanes and devices attached to this socket
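
As a concrete starting point, the sketch below uses libnuma (link with -lnuma) to enumerate the configured nodes and report which node the calling CPU belongs to. It is a minimal query sketch; the node count and memory sizes it prints are machine-dependent.

    // Minimal NUMA topology query using libnuma (gcc topo.c -lnuma)
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int nodes = numa_num_configured_nodes();
        for (int n = 0; n < nodes; n++) {
            long long free_b;
            long long size = numa_node_size64(n, &free_b);   // bytes on this node
            printf("node %d: %lld MB total, %lld MB free\n",
                   n, size >> 20, free_b >> 20);
        }
        int cpu = sched_getcpu();                            // core this thread runs on
        printf("CPU %d belongs to node %d\n", cpu, numa_node_of_cpu(cpu));
        return 0;
    }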

2. Inter-Socket Interconnect

Modern systems use high-speed interconnects:

  • Intel QPI (QuickPath Interconnect): Up to 25.6 GB/s
  • Intel UPI (Ultra Path Interconnect): Up to 41.6 GB/s
  • AMD Infinity Fabric: Up to 50 GB/s per link

These links enable processors to access remote memory but at higher latency.

3. Memory Access Types

Local Memory Access (~90ns)

CPU Core → L3 Cache → Memory Controller → Local DRAM

Remote Memory Access (~300ns)

CPU Core → L3 Cache → QPI/UPI → Remote Socket → Memory Controller → Remote DRAM
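
The gap between these two paths can be observed with a small pointer-chasing benchmark: pin the current thread to node 0, then time dependent loads from a buffer bound to node 0 versus node 1. This is only a rough sketch (the buffer size, stride, and node numbers are arbitrary assumptions, and TLB misses and prefetching will blur the numbers), but the local/remote ratio it reports should be in the ballpark of the figures above.

    // Rough local-vs-remote latency sketch using libnuma (gcc chase.c -lnuma)
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SLOTS (64UL * 1024 * 1024 / sizeof(size_t))   /* 64 MB buffer */

    static volatile size_t sink;          /* keeps the chase from being optimized away */

    static double ns_per_load(int node) {
        size_t *buf = numa_alloc_onnode(SLOTS * sizeof(size_t), node);
        for (size_t i = 0; i < SLOTS; i++)
            buf[i] = (i + 8193) % SLOTS;  /* odd stride: the chase covers the whole buffer */
        struct timespec t0, t1;
        size_t idx = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < SLOTS; i++)
            idx = buf[idx];               /* dependent loads expose raw latency */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = idx;
        numa_free(buf, SLOTS * sizeof(size_t));
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / SLOTS;
    }

    int main(void) {
        if (numa_available() < 0) return 1;
        numa_run_on_node(0);              /* run the measurement from node 0 */
        printf("node 0 (local):  ~%.0f ns/load\n", ns_per_load(0));
        printf("node 1 (remote): ~%.0f ns/load\n", ns_per_load(1));
        return 0;
    }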

NUMA Performance Characteristics

Latency Comparison

Access Type     Typical Latency    Relative Performance
L1 Cache        1 ns               Baseline
L2 Cache        3 ns               3x slower than L1
L3 Cache        10-15 ns           10-15x slower than L1
Local DRAM      60-100 ns          60-100x slower than L1
Remote DRAM     200-300 ns         200-300x slower than L1

Bandwidth Implications

While local memory provides full bandwidth (e.g., 100+ GB/s with DDR4-3200), remote memory access is limited by:

  1. Interconnect bandwidth (25-50 GB/s)
  2. Contention from other cross-socket traffic
  3. Protocol overhead
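
For a rough sense of scale, assuming a socket with four DDR4-3200 channels and a single QPI link as in the figure above (channel counts vary by platform):

    Per channel:        3200 MT/s x 8 bytes   ~  25.6 GB/s
    Local (4 channels): 4 x 25.6 GB/s         ~ 102.4 GB/s
    Remote ceiling:     one QPI link          ~  25.6 GB/s   (about 4x less, before contention)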

NUMA-Aware Programming

Best Practices

  1. Thread Affinity: Pin threads to cores near the data they use

     // Linux example using sched_setaffinity
     cpu_set_t cpuset;
     CPU_ZERO(&cpuset);
     CPU_SET(core_id, &cpuset);
     sched_setaffinity(0, sizeof(cpuset), &cpuset);

  2. Memory Allocation: Allocate memory on the correct NUMA node

     // Using libnuma
     void *buf = numa_alloc_onnode(size, node_id);

  3. First-Touch Policy: By default, a page is placed on the node whose thread first touches it

     // Each thread initializes (first-touches) its own data
     #pragma omp parallel
     {
         int thread_id = omp_get_thread_num();
         initialize_local_data(data[thread_id]);
     }

Common NUMA Pitfalls

  1. False Sharing: Multiple threads on different nodes writing to the same cache line (see the sketch after this list)
  2. Remote Memory Allocation: Allocating all memory from one thread
  3. Thread Migration: OS moving threads between nodes
  4. Imbalanced Memory Usage: One node's memory fills while others are empty
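
For the first pitfall, a common fix is to pad or align per-thread data so that no two threads ever share a cache line. A minimal sketch, assuming a 64-byte cache line and OpenMP threads (compile with -fopenmp):

    // Per-thread counters padded to one cache line each to avoid false sharing
    #include <omp.h>
    #include <stdalign.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define NTHREADS   8

    struct padded_counter {
        alignas(CACHE_LINE) long value;   /* each counter gets its own cache line */
        char pad[CACHE_LINE - sizeof(long)];
    };

    int main(void) {
        static struct padded_counter counters[NTHREADS];   /* zero-initialized */

        #pragma omp parallel num_threads(NTHREADS)
        {
            int tid = omp_get_thread_num();
            for (long i = 0; i < 10000000; i++)
                counters[tid].value++;    /* no cache-line ping-pong across sockets */
        }

        long total = 0;
        for (int t = 0; t < NTHREADS; t++)
            total += counters[t].value;
        printf("total = %ld\n", total);
        return 0;
    }

Without the padding, adjacent counters would sit on the same cache line and every increment would bounce that line between sockets over the interconnect.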

NUMA in Different Systems

Server Configurations

  • 2-Socket Servers: Most common, good balance of cost and performance
  • 4-Socket Servers: High-end systems with complex NUMA topology
  • 8+ Socket Systems: Specialized HPC systems with multi-hop NUMA

Cloud and Virtualization

  • VM NUMA Topology: VMs should respect physical NUMA boundaries
  • Container Affinity: Kubernetes and Docker NUMA-aware scheduling
  • NUMA Balancing: Automatic kernel optimization (can be disabled)

Tools for NUMA Analysis

System Information

# Show NUMA topology
numactl --hardware

# Display NUMA statistics
numastat

# Show per-node memory usage
numastat -m

Performance Monitoring

# Monitor NUMA performance counters
perf stat -e node-loads,node-load-misses ./application

# Intel PCM for detailed NUMA metrics
pcm-numa

Relationship to Memory Interleaving

While memory interleaving distributes addresses across banks within a node for bandwidth, NUMA distributes memory across nodes for scalability. Modern systems use both:

  1. Intra-node: Memory interleaving across channels
  2. Inter-node: NUMA for multi-socket scalability
  3. Hybrid: Some systems support interleaving across NUMA nodes (with latency trade-offs)
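
As a sketch of how these placement policies look from user space with libnuma (the buffer size and node number below are placeholders):

    // Three placement policies for the same buffer (gcc alloc.c -lnuma)
    #include <numa.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) return 1;
        size_t bytes = 1UL << 30;                      /* 1 GB, arbitrary */

        void *local  = numa_alloc_onnode(bytes, 0);    /* all pages bound to node 0 */
        void *spread = numa_alloc_interleaved(bytes);  /* pages round-robin across nodes:
                                                          more aggregate bandwidth, but
                                                          roughly half the accesses remote */
        void *deflt  = malloc(bytes);                  /* default policy: first touch decides */

        /* ... use the buffers ... */

        numa_free(local, bytes);
        numa_free(spread, bytes);
        free(deflt);
        return 0;
    }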

Best Use Cases

NUMA architecture excels in:

  • Database Servers: Partitioning data by NUMA node
  • Scientific Computing: Domain decomposition aligned with NUMA
  • Big Data Analytics: NUMA-aware data partitioning
  • Virtualization Hosts: VM placement based on NUMA topology

Conclusion

NUMA architecture is essential for scaling beyond single-socket systems. Understanding NUMA characteristics—local vs remote memory access, interconnect bandwidth, and proper thread/memory affinity—is crucial for achieving optimal performance in modern multi-socket systems. The 3-4x latency difference between local and remote memory access can significantly impact application performance, making NUMA-aware design a critical consideration for high-performance applications.
