Transparent Huge Pages (THP): Reducing TLB Pressure


Deep dive into Transparent Huge Pages (THP), a Linux kernel feature that automatically promotes 4KB pages to 2MB huge pages. Learn how THP reduces TLB misses, page table overhead, and improves performance—plus the hidden costs of memory bloat and latency spikes.


Transparent Huge Pages (THP)

Transparent Huge Pages (THP) is a Linux kernel feature that automatically promotes standard 4KB memory pages to larger 2MB huge pages when beneficial, without application changes. By reducing page table depth and increasing TLB coverage, THP can significantly improve performance for memory-intensive workloads—but it also introduces complexity around memory fragmentation, latency spikes, and memory bloat that require careful consideration.

The Page Table Overhead Problem

Modern x86-64 systems use 4-level page tables to translate virtual addresses to physical addresses. Every memory access potentially requires walking through all 4 levels, which creates significant overhead.

[Diagram: 4KB page translation with a 4-level page walk. The 48-bit virtual address splits into four 9-bit indexes (PGD, PUD, PMD, PTE) plus a 12-bit offset; the top 16 bits are unused. Each level adds one memory access before the physical 4KB page is reached.]

The overhead problem:

  • 4 memory accesses per translation (PGD → PUD → PMD → PTE)
  • 512 entries per table level, so 512⁴ ≈ 68 billion possible pages
  • 256 TB addressable with 4KB pages (48-bit address space)
  • Each 1GB of RAM needs ~2MB of page tables (0.2% overhead)

The core problem: With 4KB pages, a 10GB process working set requires 2.5 million page table entries and generates massive TLB pressure. Even with a 1536-entry TLB covering 6MB, 99.94% of memory requires expensive page table walks.
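
As a sanity check on these figures, the shell arithmetic below (a back-of-the-envelope sketch, not a measurement) works out the page count, page-table memory, and TLB coverage for a 10GB working set.

# Rough figures for a 10GB working set with 4KB pages
WS=$((10 * 1024 * 1024 * 1024))                                   # working set in bytes
echo "4KB pages / PTEs:  $(( WS / 4096 ))"                        # ~2.6 million
echo "Page-table memory: $(( WS / 4096 * 8 / 1024 / 1024 )) MB"   # 8 bytes per PTE, ~20MB
echo "STLB coverage:     $(( 1536 * 4096 / 1024 / 1024 )) MB"     # 1536 entries × 4KB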

The TLB: Hardware's Caching Solution

The Translation Lookaside Buffer (TLB) is a specialized cache that stores recent virtual-to-physical address translations. Modern CPUs have:

  • L1 DTLB: 64 entries for 4KB pages (256KB coverage)
  • L2 STLB: 1536 entries for 4KB pages (6MB coverage)

TLB miss cost: 10-20 cycles for page walk vs 1-2 cycles for hit — a 10× performance penalty.

Measuring the Problem

For a typical database workload with 10GB working set:

  • TLB coverage: 6MB (0.06% of working set)
  • TLB miss rate: ~60-80%
  • Performance impact: 30-50% slowdown from TLB misses alone
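
To confirm TLB pressure on a real system, perf can sample the data TLB counters directly; the sketch below is one way to do it. The event names dTLB-loads and dTLB-load-misses are generic aliases that vary by CPU and kernel, so check perf list on your machine; <PID> is a placeholder for the process under test.

# Sample dTLB behaviour of a running process for 30 seconds
perf stat -e dTLB-loads,dTLB-load-misses -p <PID> -- sleep 30
# A high miss ratio on a large working set suggests THP could help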

How Transparent Huge Pages Work

THP addresses this by using 2MB huge pages instead of 4KB pages, providing:

  • 512× more coverage per TLB entry (2MB vs 4KB)
  • One fewer page table level (3 levels vs 4)
  • Reduced page table memory (512 PTEs collapsed into 1 PMD entry)

[Diagram: TLB coverage comparison.]

  • Standard 4KB pages: L1 DTLB (64 entries × 4KB = 256KB) plus L2 STLB (1536 entries × 4KB = 6MB), for ~6MB of total coverage. On a 10GB process, only 6MB is cached and ~9.994GB requires page walks.
  • 2MB huge pages: L1 DTLB (32 entries × 2MB = 64MB) plus L2 STLB (1536 entries × 2MB = 3GB), for ~3GB of total coverage. On the same 10GB process, 3GB is cached and 7GB requires page walks.
  • Net effect: 512× more coverage per TLB entry and 30-70% fewer TLB misses.

Huge Page Basics

Standard 4KB page:

  • Virtual address: 48 bits (PGD → PUD → PMD → PTE → Offset)
  • TLB coverage: 6MB with 1536 entries
  • Page table walk: 4 memory accesses

2MB huge page:

  • Virtual address: 48 bits (PGD → PUD → PMD → Offset)
  • TLB coverage: 3GB with same 1536 entries
  • Page table walk: 3 memory accesses (25% faster)
  • PSE bit set in PMD entry to skip PTE level

[Diagram: page walk comparison. A 4KB page needs the full 4-level walk (PGD → PUD → PMD → PTE → 4KB page, 4 memory accesses). A 2MB huge page sets the PSE bit in the PMD, so the PTE level is skipped entirely (PGD → PUD → PMD → 2MB page, 3 memory accesses, one level eliminated). One huge page replaces 512 contiguous 4KB pages.]

THP Allocation Mechanisms

Linux provides two mechanisms for creating huge pages transparently:

1. Synchronous Allocation (Page Fault Path)

When a process faults on a virtual address:

  1. Check alignment: Virtual address must be 2MB-aligned
  2. Check memory: Look for 2MB contiguous physical region
  3. Allocate directly: If available, allocate 2MB huge page immediately
  4. Fallback: If not available, allocate 4KB page and retry later

Fast path: direct huge page allocation (0.1-0.5ms)
Slow path: defragmentation required (10-100ms+)
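
A simple way to see which path your allocations take is to diff the THP counters in /proc/vmstat around a run of the workload. The sketch below uses a placeholder ./your_workload command; rising compact_stall values indicate the slow path was hit.

# Snapshot THP fault and compaction counters before and after a run
grep -E '^(thp_fault_alloc|thp_fault_fallback|compact_stall) ' /proc/vmstat > /tmp/thp_before
./your_workload                        # placeholder for the workload under test
grep -E '^(thp_fault_alloc|thp_fault_fallback|compact_stall) ' /proc/vmstat > /tmp/thp_after
diff /tmp/thp_before /tmp/thp_after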

2. Asynchronous Promotion (khugepaged)

The khugepaged kernel daemon scans memory periodically looking for collapse opportunities:

khugepaged Page Collapse Operation

Before: 512 × 4KB pages

PMD points to Page Table with 512 PTEs, each pointing to scattered 4KB physical pages

khugepaged scans and collapses →

After: 1 × 2MB huge page

PMD points directly to 2MB contiguous physical page (PSE=1)

Benefits:

  • ✓ Page table memory freed: 4KB saved (512 PTEs × 8 bytes)
  • ✓ TLB coverage: 512× increase (1 TLB entry instead of 512)
  • ✓ Page walk cost: 25% reduction (3 levels instead of 4)
  • ⚠ Migration harder: must move 2MB contiguous block

khugepaged scans for:

  • 512 contiguous 4KB pages within same VMA
  • All pages present in memory
  • All pages belong to same process
  • Memory region is 2MB-aligned

Collapse operation:

  1. Allocate 2MB huge page
  2. Copy 512 pages to new location
  3. Update PMD to point to huge page (set PSE bit)
  4. Free original page table and 512 pages
  5. Flush TLB entries
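
To watch khugepaged at work, you can poll its collapse counters in /proc/vmstat; a minimal sketch (counter names as exposed by recent kernels):

# thp_collapse_alloc climbs as khugepaged successfully collapses regions;
# thp_collapse_alloc_failed counts collapses that could not get a huge page
watch -n 10 'grep -E "^thp_collapse_alloc" /proc/vmstat'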

khugepaged Configuration

# How long khugepaged sleeps between scans (milliseconds)
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

# Pages to scan per interval
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

# Maximum number of empty (not-present) PTEs allowed when collapsing a 2MB range
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

# How long khugepaged sleeps after a failed huge page allocation (milliseconds)
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

Typical settings:

  • Scan interval: 10000ms (10 seconds)
  • Pages per scan: 4096 pages
  • max_ptes_none: 511 (up to 511 of a range's 512 PTEs may be empty when collapsing)
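
To see what a particular machine is currently using, you can simply dump every tunable in the khugepaged sysfs directory (a quick sketch):

# Print each khugepaged tunable and its current value
for f in /sys/kernel/mm/transparent_hugepage/khugepaged/*; do
    printf '%-28s %s\n' "$(basename "$f"):" "$(cat "$f")"
done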

Defragmentation

Physical memory fragmentation is THP's biggest challenge. When memory becomes fragmented, allocating 2MB contiguous regions fails:

Memory Fragmentation and THP Allocation

[Diagram summary:]

  • Fragmented memory: no 2MB contiguous region is available, so the THP allocation fails and falls back to 4KB pages.
  • After defragmentation: kcompactd/kswapd migrate pages to create a 2MB contiguous region and the THP allocation succeeds.
  • Defragmentation may require reclaiming page cache or moving active pages.

Defragmentation Modes

cat /sys/kernel/mm/transparent_hugepage/defrag

Available modes:

  1. always: Synchronously defragment on every THP allocation

    • Pro: Highest THP success rate (80-95%)
    • Con: Severe latency spikes (10-100ms stalls)
    • Use: Never for production workloads
  2. defer: Defer to kcompactd daemon for background defrag

    • Pro: No direct application stalls
    • Con: Lower THP coverage initially
    • Use: General-purpose servers
  3. defer+madvise: Defer + synchronous for madvise() regions

    • Pro: App control over critical regions
    • Con: Requires application changes
    • Use: Databases with madvise(MADV_HUGEPAGE)
  4. madvise: Only for madvise() regions, no background defrag

    • Pro: Explicit control, no surprises
    • Con: Requires application opt-in
    • Use: Recommended for production
  5. never: No defragmentation

    • Pro: Zero latency from defrag
    • Con: THP only works with already-free huge pages
    • Use: Real-time or latency-sensitive systems
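
Switching modes is a single (root) write to sysfs, and reading the file back shows the active mode in brackets. A minimal sketch:

# Select deferred defragmentation and confirm the active mode
echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/defrag
# e.g. always [defer] defer+madvise madvise never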

Memory Fragmentation Visualization

After system runs for hours/days:

  • Page cache allocations scatter throughout memory
  • Process allocations interleaved
  • 2MB contiguous regions become rare
  • THP success rate drops from 80% to <20%

Defragmentation mechanisms:

  • kcompactd: Background compaction daemon
  • kswapd: Evicts/moves pages during reclaim
  • Direct compaction: Synchronous during allocation (slow!)
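
Two standard interfaces let you gauge how fragmented physical memory is and nudge compaction by hand; a sketch is below. With 4KB base pages, order-9 blocks in /proc/buddyinfo correspond to 2MB chunks.

# Free blocks per order in each zone; low order-9/order-10 counts mean 2MB allocations will struggle
cat /proc/buddyinfo

# Manually trigger full memory compaction (can take seconds and burn CPU on large machines)
echo 1 | sudo tee /proc/sys/vm/compact_memory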

Benefits and Use Cases

Performance Impact: Real Measurements

THP Performance Impact by Workload Type

[Chart summary: measured speedup (+) or slowdown (-) with THP enabled]

  Database             +30%
  ML Training          +12%
  Analytics            +25%
  Video Encode         +18%
  Memcpy               +50%
  Web server (fork)    -28%
  Sparse access        -22%

Maximum benefit workloads:

1. Databases (PostgreSQL, MongoDB, Redis):

  • Large buffer pools (10GB+)
  • Sequential scans over huge tables
  • Measured improvement: 15-35% throughput increase
  • TLB miss reduction: 60-70%

Example - Redis benchmark:

# Without THP
redis-benchmark -t set,get -n 1000000 -q
SET: 89285.71 requests per second
GET: 91743.12 requests per second

# With THP (madvise mode)
SET: 117096.02 requests per second (+31%)
GET: 120192.31 requests per second (+31%)
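
To verify that Redis actually ended up on huge pages during such a test, check its anonymous huge page usage in smaps_rollup. This assumes a single redis-server process and a kernel new enough (4.14+) to provide smaps_rollup.

# Total anonymous huge page usage for the Redis process
grep AnonHugePages /proc/$(pidof redis-server)/smaps_rollup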

2. ML Training (PyTorch, TensorFlow):

  • Large model weights in memory
  • Contiguous tensor allocations
  • Measured improvement: 5-15% training speedup
  • Memory bandwidth: Better cache line utilization

Example - PyTorch ResNet50:

# Training throughput (images/sec)
Without THP: 147 img/s
With THP:    160 img/s (+9%)

3. In-Memory Analytics (Spark, ClickHouse):

  • Large columnar data structures
  • Aggregations over billions of rows
  • Measured improvement: 20-40% query speedup
  • Page walk reduction: 25%

4. Video Processing Pipelines:

  • Large frame buffers
  • Sequential processing
  • Measured improvement: 15-25% throughput
  • Reduced stalls: Fewer TLB-related pipeline flushes

When THP Provides Maximum Benefit

THP works best with:

  ✅ Large, contiguous allocations (multi-MB buffers)
  ✅ Dense memory access patterns (sequential reads/writes)
  ✅ Long-lived processes (time for khugepaged to optimize)
  ✅ Memory-bound workloads (not CPU/IO bound)
  ✅ Predictable allocation patterns (not rapid alloc/free)

Performance benchmark table:

Workload Type    THP Improvement    TLB Miss Reduction    Best Mode
PostgreSQL       +30%               70%                   madvise
MongoDB          +25%               65%                   madvise
Redis            +31%               68%                   madvise
ML Training      +9%                45%                   defer
Spark            +35%               72%                   madvise
Memcached        +20%               58%                   madvise
Video Encode     +18%               50%                   defer

Best practice for production:

# Enable THP in madvise-only mode
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

# Configure khugepaged conservatively
echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 4096 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

Why madvise mode?

  • Applications explicitly opt-in for critical regions
  • No unexpected memory bloat
  • No latency spikes from defragmentation
  • Predictable performance characteristics

Configuration and Management

THP Modes

cat /sys/kernel/mm/transparent_hugepage/enabled

Modes:

  1. always: Kernel tries to use huge pages whenever possible

    • Pro: Maximum THP coverage
    • Con: Memory bloat, latency spikes
    • Use: Benchmarking only
  2. madvise: Only for regions marked with madvise(MADV_HUGEPAGE)

    • Pro: Explicit control, no surprises
    • Con: Requires application changes
    • Use: Recommended for production
  3. never: Disable THP completely

    • Pro: Predictable memory usage
    • Con: No THP benefits
    • Use: Incompatible workloads

Application-Level THP Control

Applications can control THP per-region using madvise():

#include <sys/mman.h>

// Allocate a large buffer
void* buffer = mmap(NULL, 100 * 1024 * 1024,          // 100MB
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

// Request THP for this region
madvise(buffer, 100 * 1024 * 1024, MADV_HUGEPAGE);

// Or disable THP for this region
madvise(buffer, 100 * 1024 * 1024, MADV_NOHUGEPAGE);

Use cases:

  • Database buffer pools: MADV_HUGEPAGE
  • Small object allocators: MADV_NOHUGEPAGE
  • JVM heap: MADV_HUGEPAGE
  • Redis dataset: MADV_HUGEPAGE
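
Whether an madvise() call took effect is visible per mapping in smaps: the VmFlags line gains hg for MADV_HUGEPAGE and nh for MADV_NOHUGEPAGE. A rough sketch (GNU grep assumed, <PID> is a placeholder):

# Show size, huge page usage, and flags; mappings with "hg" have THP requested
grep -E '^(Size|AnonHugePages|VmFlags)' /proc/<PID>/smaps | grep -B 2 'VmFlags:.* hg'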

Monitoring THP Usage

Check THP statistics:

cat /proc/meminfo | grep -i huge
AnonHugePages:   4096000 kB    # Anonymous huge pages
ShmemHugePages:        0 kB    # Shared memory huge pages
FileHugePages:         0 kB    # File-backed huge pages
HugePages_Total:       0       # Reserved huge pages (not THP)

Per-process THP usage:

cat /proc/<PID>/smaps | grep -i huge
AnonHugePages:     10240 kB

# Or list only the mappings that actually use huge pages
awk '/AnonHugePages/ && $2 > 0' /proc/<PID>/smaps

THP allocation success rate:

grep thp /proc/vmstat
thp_fault_alloc     45821    # Successful allocations
thp_fault_fallback  20739    # Failed, fell back to 4KB
thp_collapse_alloc  12048    # khugepaged collapses
thp_split_page       3421    # Huge pages split back to 4KB

Calculation:

Success rate = thp_fault_alloc / (thp_fault_alloc + thp_fault_fallback) = 45821 / (45821 + 20739) = 68.8%
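
The same calculation can be scripted so it can feed a dashboard or an alert; a one-liner sketch over /proc/vmstat:

# THP fault success rate straight from the counters
awk '/^thp_fault_alloc /{a=$2} /^thp_fault_fallback /{f=$2} END{if (a+f) printf "THP fault success: %.1f%%\n", 100*a/(a+f)}' /proc/vmstat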

Tuning khugepaged

Aggressive settings (high THP coverage):

cd /sys/kernel/mm/transparent_hugepage/khugepaged
echo 1000 > scan_sleep_millisecs    # scan every second
echo 8192 > pages_to_scan           # scan 32MB (8192 × 4KB) per interval

Conservative settings (low overhead):

cd /sys/kernel/mm/transparent_hugepage/khugepaged
echo 30000 > scan_sleep_millisecs   # scan every 30 seconds
echo 2048 > pages_to_scan           # scan 8MB (2048 × 4KB) per interval

Disable khugepaged entirely:

echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

Systemwide THP Configuration Script

#!/bin/bash
# Production THP configuration

# Enable madvise-only mode
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

# Conservative khugepaged settings
echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 4096 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
echo 511 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

# Do not collapse ranges that would require swapping pages back in
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

# Make persistent (add to /etc/rc.local or a systemd unit, see below)
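
One way to make the settings persistent (a sketch; the unit name and script path are hypothetical, adjust for your distribution) is to install the script and run it from a oneshot systemd unit:

# Install the script and a oneshot unit that applies it at boot
sudo install -m 0755 configure-thp.sh /usr/local/sbin/configure-thp.sh

sudo tee /etc/systemd/system/thp-config.service > /dev/null <<'EOF'
[Unit]
Description=Configure Transparent Huge Pages
After=sysinit.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/configure-thp.sh

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now thp-config.service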

Trade-offs and Issues

The Dark Side of THP: Memory Bloat

THP's biggest hidden cost is internal fragmentation from small allocations:

Internal Fragmentation: Memory Bloat with THP

[Diagram summary:]

Without THP (4KB pages): a process allocating 100 × 5KB objects uses 500KB of data. Each 5KB object occupies 2 × 4KB pages (8KB), so total RSS is 100 × 8KB = 800KB. Waste: 300KB (37.5%).

With THP enabled (2MB pages): the same 100 × 5KB objects land in a 2MB huge page that the kernel allocates in full. Only 500KB is used, but RSS is 2048KB. Waste: 1548KB (75.6%), roughly 2.5× memory bloat compared to 4KB pages.

Memory Bloat Risk

Workloads with many small allocations can see 2-4× memory usage increase with THP!

Real-world examples:

Redis with small keys:

# Without THP
used_memory:     1048576000 bytes (1000 MB)

# With THP (always mode)
used_memory_rss: 2684354560 bytes (2560 MB)    # 2.5× memory bloat!

Cause: Redis allocates many small objects (50-100 bytes each). In always mode, sparsely used regions of the allocator's heap get backed by full 2MB huge pages, so memory that would otherwise stay unmapped or be returned to the kernel in 4KB granules remains resident, inflating RSS far beyond the logical dataset size.

Node.js applications:

  • Many small JavaScript objects
  • THP can cause 2-4× memory usage increase
  • OOM kills on memory-constrained systems

Java with G1GC:

  • G1 heap regions can be as small as 1MB
  • A 2MB huge page backing a partially used 1MB region wastes up to half of it
  • Recommendation: MADV_NOHUGEPAGE for Java heaps

Latency Spikes from Defragmentation

Synchronous defragmentation (defrag=always) causes severe latency spikes:

# Normal page fault
Page fault latency: 0.1-0.5 microseconds

# THP fault with synchronous defragmentation
Page fault latency: 10-100 milliseconds (100,000× slower!)

Root cause: Direct compaction:

  1. Kernel scans for 2MB contiguous region
  2. Not found → trigger compaction
  3. Migrate pages to create contiguous space
  4. May need to evict page cache or swap
  5. Update hundreds of page table entries
  6. Flush TLBs across all CPUs

Real-world impact:

  • Redis P99 latency: 1ms → 150ms
  • PostgreSQL query stalls: 50-200ms
  • Web server request timeouts

Solution: Use defrag=madvise or defrag=defer
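
If you suspect direct compaction is behind the spikes, the compaction tracepoints and vmstat counters can confirm it. A sketch using perf; tracepoint names can differ slightly between kernel versions, so verify them with perf list.

# Count direct-compaction activity system-wide for one minute
sudo perf stat -a -e 'compaction:mm_compaction_begin,compaction:mm_compaction_end' -- sleep 60

# Or watch the stall counter climb over time
watch -n 5 'grep compact_stall /proc/vmstat'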

When THP Hurts Performance

THP can degrade performance in these scenarios:

1. Fork-heavy workloads (web servers with prefork):

  • Each fork() must copy 2MB huge pages
  • Copy-on-write splits huge pages back to 4KB
  • Slowdown: 2-3× slower forks

Example:

// Apache prefork model
Without THP: fork() = 0.5ms
With THP:    fork() = 1.4ms (2.8× slower)

2. Sparse memory access:

  • Application touches only 4KB of each 2MB huge page
  • Nearly 2MB per huge page stays resident but unused
  • Each fault populates a full 2MB region instead of 4KB

3. Frequent allocation/deallocation:

  • Small object allocators (malloc/free loops)
  • Memory churn prevents huge page formation
  • khugepaged overhead with no benefit

4. Memory-constrained systems:

  • Internal fragmentation causes OOM
  • Kernel unable to reclaim partially-used huge pages
  • Swapping 2MB chunks instead of 4KB

Monitoring THP Problems

Symptoms of THP issues:

# Check for excessive huge page splits
grep thp_split /proc/vmstat
thp_split_page 892847       # If high, THP may be counterproductive

# Check for allocation failures
grep thp_fault_fallback /proc/vmstat
thp_fault_fallback 45218    # If >50% of faults, fragmentation problem

# Monitor defrag stalls
grep compact_stall /proc/vmstat
compact_stall 12847         # Direct compaction stalls (latency spikes!)

Per-process memory bloat detection:

# Compare resident memory against the data segment size
cat /proc/<PID>/status | grep -E "VmRSS|VmData"
VmRSS:   2560000 kB    # Resident memory
VmData:  1048576 kB    # Size of the process's data segments

# If RSS is far larger than VmData, THP bloat is a likely cause

Best Practices Summary

✅ DO:

  • Use madvise mode in production
  • Let applications opt-in with madvise(MADV_HUGEPAGE)
  • Monitor THP statistics regularly
  • Use defer or madvise defrag mode
  • Tune khugepaged conservatively

❌ DON'T:

  • Use always mode in production
  • Use defrag=always (latency spikes!)
  • Enable THP for all workloads blindly
  • Ignore memory bloat on small-object workloads
  • Forget to benchmark before/after THP enablement

Recommended per-workload settings:

Workload            enabled    defrag     khugepaged
Database            madvise    madvise    aggressive
Web server (fork)   never      never      disabled
ML training         always     defer      moderate
Redis               madvise    madvise    moderate
Java (large heap)   madvise    defer      conservative
Real-time systems   never      never      disabled

Conclusion

Transparent Huge Pages is a powerful optimization that can deliver 10-50% performance improvements for memory-intensive workloads by reducing TLB pressure and page table overhead. However, it's not a silver bullet:

When THP shines:

  • Large, long-lived allocations (databases, caches, ML models)
  • Dense, sequential access patterns
  • Memory-bound workloads with high TLB miss rates

When THP hurts:

  • Small object allocations (memory bloat)
  • Fork-heavy workloads (copy-on-write overhead)
  • Memory-constrained systems (OOM risk)
  • Latency-sensitive applications (defrag stalls)

Production recommendation:

  • Default: enabled=madvise, defrag=madvise
  • Let applications decide which regions benefit from THP
  • Monitor continuously for memory bloat and latency spikes
  • Benchmark rigorously before enabling in production

Understanding THP's mechanics and trade-offs allows you to make informed decisions about when to leverage this powerful feature and when to avoid it.

If you found this explanation helpful, consider sharing it with others.
