NUMA Architecture: Non-Uniform Memory Access
Explore NUMA (Non-Uniform Memory Access) architecture: how modern multi-socket systems manage memory locality, and the performance implications of local versus remote memory access.
What is NUMA Architecture?
NUMA (Non-Uniform Memory Access) is a computer memory design used in multiprocessing systems where memory access time depends on the memory location relative to the processor. In NUMA systems, processors can access their own local memory faster than non-local memory (memory local to another processor or memory shared between processors).
This architecture is crucial for modern multi-socket servers and high-performance computing systems where memory bandwidth and latency are critical performance factors.
Interactive NUMA Demonstration
Experience how memory access patterns affect performance in NUMA systems:
[Interactive demo: NUMA access pattern demonstration. Two NUMA nodes, each pairing an 8-core CPU socket (Socket 0, Socket 1) with its own local DRAM. Thread memory access requests either take the fast local path (CPU → Local Memory Controller → Local DRAM) or the remote path (CPU → QPI/UPI → Remote MC → Remote DRAM), about 3.3x slower. Counters track local accesses, remote accesses, and average latency.]
NUMA Architecture: In NUMA systems, each CPU socket has its own local memory controller and DRAM. Accessing local memory is fast (~90ns), but accessing memory attached to another socket requires going through the inter-socket interconnect (QPI/UPI), resulting in higher latency (~300ns). NUMA-aware applications place data close to the threads that use it for optimal performance.
Understanding NUMA Components
1. NUMA Nodes
Each NUMA node typically consists of:
- CPU Socket: One or more processors with multiple cores
- Local Memory: RAM directly attached to the socket's memory controller
- Memory Controller: Manages access to local memory banks
- Local I/O: PCIe lanes and devices attached to this socket
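As a concrete starting point, the sketch below (assuming a Linux system with libnuma installed and the program linked with -lnuma) queries how many nodes exist, how much memory each one owns, and which node the calling CPU belongs to:

// Minimal libnuma topology query (Linux, compile with -lnuma)
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {              // -1 means the kernel has no NUMA support
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    printf("configured NUMA nodes: %d\n", nodes);

    for (int n = 0; n < nodes; n++) {
        long long free_bytes;
        long long total = numa_node_size64(n, &free_bytes);  // DRAM attached to node n
        printf("node %d: %lld MiB total, %lld MiB free\n",
               n, total >> 20, free_bytes >> 20);
    }

    int cpu = sched_getcpu();                 // CPU the calling thread is running on
    printf("running on CPU %d (node %d)\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}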
2. Inter-Socket Interconnect
Modern systems use high-speed interconnects:
- Intel QPI (QuickPath Interconnect): Up to 25.6 GB/s
- Intel UPI (Ultra Path Interconnect): Up to 41.6 GB/s
- AMD Infinity Fabric: Up to 50 GB/s per link
These links enable processors to access remote memory but at higher latency.
3. Memory Access Types
Local Memory Access (~90ns)
CPU Core → L3 Cache → Memory Controller → Local DRAM
Remote Memory Access (~300ns)
CPU Core → L3 Cache → QPI/UPI → Remote Socket → Memory Controller → Remote DRAM
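To observe placement from software, one option on Linux (a sketch, assuming libnuma's numaif.h wrapper and linking with -lnuma) is to ask the kernel which node currently backs a page:

// Query which NUMA node backs an address (Linux, compile with -lnuma)
#include <numaif.h>   // get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    size_t len = 1 << 20;
    char *buf = malloc(len);
    memset(buf, 0, len);                      // first touch: pages are faulted in here

    int node = -1;
    // With MPOL_F_NODE | MPOL_F_ADDR, get_mempolicy() reports the node
    // on which the page containing 'buf' is allocated.
    if (get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) == 0)
        printf("buffer page resides on NUMA node %d\n", node);

    free(buf);
    return 0;
}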
NUMA Performance Characteristics
Latency Comparison
| Access Type | Typical Latency | Relative Performance |
|---|---|---|
| L1 Cache | 1 ns | Baseline |
| L2 Cache | 3 ns | 3x slower than L1 |
| L3 Cache | 10-15 ns | 10-15x slower than L1 |
| Local DRAM | 60-100 ns | 60-100x slower than L1 |
| Remote DRAM | 200-300 ns | 200-300x slower than L1 |
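As a rough, illustrative calculation using the figures above: if 30% of an application's DRAM accesses land on the remote node, the effective memory latency is about 0.7 × 90 ns + 0.3 × 300 ns ≈ 153 ns, already roughly 1.7x worse than the all-local case, before any interconnect contention is counted.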
Bandwidth Implications
While local memory provides full bandwidth (e.g., 100+ GB/s per socket with multiple DDR4-3200 channels), remote memory access is limited by:
- Interconnect bandwidth (25-50 GB/s)
- Contention from other cross-socket traffic
- Protocol overhead
NUMA-Aware Programming
Best Practices
- Thread Affinity: Pin threads to cores near their data
// Linux example using sched_setaffinity (requires _GNU_SOURCE)
#define _GNU_SOURCE
#include <sched.h>

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);                      // core_id: a core on the node that holds the thread's data
sched_setaffinity(0, sizeof(cpuset), &cpuset);  // pid 0 = calling thread
- Memory Allocation: Allocate memory on the correct NUMA node
// Using libnuma (link with -lnuma; release with numa_free(buf, size))
#include <numa.h>
void *buf = numa_alloc_onnode(size, node_id);   // pages backed by node_id's local DRAM
- First-Touch Policy: Memory pages are physically allocated on the node whose CPU first touches them
// Each thread initializes its own slice, so first-touch places the pages on that thread's node
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    initialize_local_data(data[thread_id]);
}
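For programs that are not NUMA-aware internally, placement can also be imposed from the outside; for example, numactl --cpunodebind=0 --membind=0 ./app restricts both the threads and the memory allocations of a process to node 0.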
Common NUMA Pitfalls
- False Sharing: Threads on different nodes writing to the same cache line (see the sketch after this list)
- Remote Memory Allocation: Allocating all memory from one thread
- Thread Migration: OS moving threads between nodes
- Imbalanced Memory Usage: One node's memory fills while others are empty
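The false-sharing pitfall in particular has a purely structural fix: pad or align per-thread data so that writers on different sockets never share a cache line. A minimal sketch, assuming a 64-byte cache line (typical on x86) and an illustrative per_thread_counter type:

// Give each thread's counter its own cache line so writes never contend
#include <stdalign.h>

#define CACHE_LINE 64   // assumed line size; query the hardware in production code

struct per_thread_counter {
    alignas(CACHE_LINE) unsigned long value;          // starts on its own cache line
    char pad[CACHE_LINE - sizeof(unsigned long)];     // fills out the rest of the line
};

struct per_thread_counter counters[64];   // one slot per thread, one cache line per slot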
NUMA in Different Systems
Server Configurations
- 2-Socket Servers: Most common, good balance of cost and performance
- 4-Socket Servers: High-end systems with complex NUMA topology
- 8+ Socket Systems: Specialized HPC systems with multi-hop NUMA
Cloud and Virtualization
- VM NUMA Topology: VMs should respect physical NUMA boundaries
- Container Affinity: Kubernetes and Docker NUMA-aware scheduling
- NUMA Balancing: Automatic kernel optimization (can be disabled)
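On Linux, the automatic balancing mentioned above is exposed through the kernel.numa_balancing sysctl; checking it with sysctl kernel.numa_balancing and setting it to 0 turns off the kernel's page and task migration when an application manages placement itself.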
Tools for NUMA Analysis
System Information
# Show NUMA topology
numactl --hardware

# Display NUMA statistics
numastat

# Show per-node memory usage
numastat -m
Performance Monitoring
# Monitor NUMA performance counters
perf stat -e node-loads,node-load-misses ./application

# Intel PCM for detailed NUMA metrics
pcm-numa
Relationship to Memory Interleaving
While memory interleaving distributes addresses across banks within a node for bandwidth, NUMA distributes memory across nodes for scalability. Modern systems use both:
- Intra-node: Memory interleaving across channels
- Inter-node: NUMA for multi-socket scalability
- Hybrid: Some systems support interleaving across NUMA nodes (with latency trade-offs)
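When cross-node interleaving is actually wanted, for example for one large shared buffer read evenly by all sockets, it can be requested per process with numactl --interleave=all ./app, trading higher average latency for bandwidth spread across every node's memory controllers.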
Best Use Cases
NUMA architecture excels in:
- Database Servers: Partitioning data by NUMA node
- Scientific Computing: Domain decomposition aligned with NUMA
- Big Data Analytics: NUMA-aware data partitioning
- Virtualization Hosts: VM placement based on NUMA topology
Conclusion
NUMA architecture is essential for scaling beyond single-socket systems. Understanding NUMA characteristics—local vs remote memory access, interconnect bandwidth, and proper thread/memory affinity—is crucial for achieving optimal performance in modern multi-socket systems. The 3-4x latency difference between local and remote memory access can significantly impact application performance, making NUMA-aware design a critical consideration for high-performance applications.