Transparent Huge Pages (THP): Reducing TLB Pressure


Deep dive into Transparent Huge Pages (THP), a Linux kernel feature that automatically promotes 4KB pages to 2MB huge pages. Learn how THP reduces TLB misses, page table overhead, and improves performance—plus the hidden costs of memory bloat and latency spikes.


Transparent Huge Pages (THP)

Transparent Huge Pages (THP) is a Linux kernel feature that automatically promotes standard 4KB memory pages to larger 2MB huge pages when beneficial, without application changes. By reducing page table depth and increasing TLB coverage, THP can significantly improve performance for memory-intensive workloads—but it also introduces complexity around memory fragmentation, latency spikes, and memory bloat that require careful consideration.

The Page Table Overhead Problem

Modern x86-64 systems use 4-level page tables to translate virtual addresses to physical addresses. Every memory access potentially requires walking through all 4 levels, which creates significant overhead.

[Diagram: 4KB page translation with a 4-level page walk. The 48-bit virtual address splits into four 9-bit indexes (PGD, PUD, PMD, PTE) plus a 12-bit offset; the top 16 bits are unused. Each level adds one memory access before the physical 4KB page is reached.]

The overhead problem:

  • 4 memory accesses per translation (PGD → PUD → PMD → PTE)
  • 512 entries per table level, so 512⁴ ≈ 68 billion possible pages
  • 256 TB addressable with 4KB pages (48-bit address space)
  • Each 1GB of RAM needs ~2MB of page tables (0.2% overhead)

The core problem: With 4KB pages, a 10GB process working set requires 2.5 million page table entries and generates massive TLB pressure. Even with a 1536-entry TLB covering 6MB, 99.94% of memory requires expensive page table walks.
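
As a sanity check on these figures, the shell arithmetic below (a back-of-the-envelope sketch, not a measurement) works out the page count, page-table memory, and TLB coverage for a 10GB working set.

# Rough figures for a 10GB working set with 4KB pages
WS=$((10 * 1024 * 1024 * 1024))                                   # working set in bytes
echo "4KB pages / PTEs:  $(( WS / 4096 ))"                        # ~2.6 million
echo "Page-table memory: $(( WS / 4096 * 8 / 1024 / 1024 )) MB"   # 8 bytes per PTE, ~20MB
echo "STLB coverage:     $(( 1536 * 4096 / 1024 / 1024 )) MB"     # 1536 entries × 4KB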

The TLB: Hardware's Caching Solution

The Translation Lookaside Buffer (TLB) is a specialized cache that stores recent virtual-to-physical address translations. Modern CPUs have:

  • L1 DTLB: 64 entries for 4KB pages (256KB coverage)
  • L2 STLB: 1536 entries for 4KB pages (6MB coverage)

TLB miss cost: 10-20 cycles for page walk vs 1-2 cycles for hit — a 10× performance penalty.

Measuring the Problem

For a typical database workload with 10GB working set:

  • TLB coverage: 6MB (0.06% of working set)
  • TLB miss rate: ~60-80%
  • Performance impact: 30-50% slowdown from TLB misses alone
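
To confirm TLB pressure on a real system, perf can sample the data TLB counters directly; the sketch below is one way to do it. The event names dTLB-loads and dTLB-load-misses are generic aliases that vary by CPU and kernel, so check perf list on your machine; <PID> is a placeholder for the process under test.

# Sample dTLB behaviour of a running process for 30 seconds
perf stat -e dTLB-loads,dTLB-load-misses -p <PID> -- sleep 30
# A high miss ratio on a large working set suggests THP could help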

How Transparent Huge Pages Work

THP addresses this by using 2MB huge pages instead of 4KB pages, providing:

  • 512× more coverage per TLB entry (2MB vs 4KB)
  • One fewer page table level (3 levels vs 4)
  • Reduced page table memory (512 PTEs collapsed into 1 PMD entry)

[Diagram: TLB coverage comparison.]

  • Standard 4KB pages: L1 DTLB (64 entries × 4KB = 256KB) plus L2 STLB (1536 entries × 4KB = 6MB), for ~6MB of total coverage. On a 10GB process, only 6MB is cached and ~9.994GB requires page walks.
  • 2MB huge pages: L1 DTLB (32 entries × 2MB = 64MB) plus L2 STLB (1536 entries × 2MB = 3GB), for ~3GB of total coverage. On the same 10GB process, 3GB is cached and 7GB requires page walks.
  • Net effect: 512× more coverage per TLB entry and 30-70% fewer TLB misses.

Huge Page Basics

Standard 4KB page:

  • Virtual address: 48 bits (PGD → PUD → PMD → PTE → Offset)
  • TLB coverage: 6MB with 1536 entries
  • Page table walk: 4 memory accesses

2MB huge page:

  • Virtual address: 48 bits (PGD → PUD → PMD → Offset)
  • TLB coverage: 3GB with same 1536 entries
  • Page table walk: 3 memory accesses (25% faster)
  • PSE bit set in PMD entry to skip PTE level

[Diagram: page walk comparison. A 4KB page needs the full 4-level walk (PGD → PUD → PMD → PTE → 4KB page, 4 memory accesses). A 2MB huge page sets the PSE bit in the PMD, so the PTE level is skipped entirely (PGD → PUD → PMD → 2MB page, 3 memory accesses, one level eliminated). One huge page replaces 512 contiguous 4KB pages.]

THP Allocation Mechanisms

Linux provides two mechanisms for creating huge pages transparently:

1. Synchronous Allocation (Page Fault Path)

When a process faults on a virtual address:

  1. Check alignment: Virtual address must be 2MB-aligned
  2. Check memory: Look for 2MB contiguous physical region
  3. Allocate directly: If available, allocate 2MB huge page immediately
  4. Fallback: If not available, allocate 4KB page and retry later

Fast path: direct huge page allocation (0.1-0.5ms)
Slow path: defragmentation required (10-100ms+)
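
A simple way to see which path your allocations take is to diff the THP counters in /proc/vmstat around a run of the workload. The sketch below uses a placeholder ./your_workload command; rising compact_stall values indicate the slow path was hit.

# Snapshot THP fault and compaction counters before and after a run
grep -E '^(thp_fault_alloc|thp_fault_fallback|compact_stall) ' /proc/vmstat > /tmp/thp_before
./your_workload                        # placeholder for the workload under test
grep -E '^(thp_fault_alloc|thp_fault_fallback|compact_stall) ' /proc/vmstat > /tmp/thp_after
diff /tmp/thp_before /tmp/thp_after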

2. Asynchronous Promotion (khugepaged)

The khugepaged kernel daemon scans memory periodically looking for collapse opportunities:

khugepaged Page Collapse Operation

Before: 512 × 4KB pages

PMD points to Page Table with 512 PTEs, each pointing to scattered 4KB physical pages

khugepaged scans and collapses →

After: 1 × 2MB huge page

PMD points directly to 2MB contiguous physical page (PSE=1)

Benefits:

  • ✓ Page table memory freed: 4KB saved (512 PTEs × 8 bytes)
  • ✓ TLB coverage: 512× increase (1 TLB entry instead of 512)
  • ✓ Page walk cost: 25% reduction (3 levels instead of 4)
  • ⚠ Migration harder: must move 2MB contiguous block

khugepaged scans for:

  • 512 contiguous 4KB pages within same VMA
  • All pages present in memory
  • All pages belong to same process
  • Memory region is 2MB-aligned

Collapse operation:

  1. Allocate 2MB huge page
  2. Copy 512 pages to new location
  3. Update PMD to point to huge page (set PSE bit)
  4. Free original page table and 512 pages
  5. Flush TLB entries
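
To watch khugepaged at work, you can poll its collapse counters in /proc/vmstat; a minimal sketch (counter names as exposed by recent kernels):

# thp_collapse_alloc climbs as khugepaged successfully collapses regions;
# thp_collapse_alloc_failed counts collapses that could not get a huge page
watch -n 10 'grep -E "^thp_collapse_alloc" /proc/vmstat'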

khugepaged Configuration

# How long khugepaged sleeps between scans (milliseconds)
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

# Pages to scan per interval
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

# Maximum number of empty (not-present) PTEs allowed when collapsing a 2MB range
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

# How long khugepaged sleeps after a failed huge page allocation (milliseconds)
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

Typical settings:

  • Scan interval: 10000ms (10 seconds)
  • Pages per scan: 4096 pages
  • max_ptes_none: 511 (up to 511 of a range's 512 PTEs may be empty when collapsing)
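
To see what a particular machine is currently using, you can simply dump every tunable in the khugepaged sysfs directory (a quick sketch):

# Print each khugepaged tunable and its current value
for f in /sys/kernel/mm/transparent_hugepage/khugepaged/*; do
    printf '%-28s %s\n' "$(basename "$f"):" "$(cat "$f")"
done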

Defragmentation

Physical memory fragmentation is THP's biggest challenge. When memory becomes fragmented, allocating 2MB contiguous regions fails:

Memory Fragmentation and THP Allocation

[Diagram summary:]

  • Fragmented memory: no 2MB contiguous region is available, so the THP allocation fails and falls back to 4KB pages.
  • After defragmentation: kcompactd/kswapd migrate pages to create a 2MB contiguous region and the THP allocation succeeds.
  • Defragmentation may require reclaiming page cache or moving active pages.

Defragmentation Modes

cat /sys/kernel/mm/transparent_hugepage/defrag

Available modes:

  1. always: Synchronously defragment on every THP allocation

    • Pro: Highest THP success rate (80-95%)
    • Con: Severe latency spikes (10-100ms stalls)
    • Use: Never for production workloads
  2. defer: Defer to kcompactd daemon for background defrag

    • Pro: No direct application stalls
    • Con: Lower THP coverage initially
    • Use: General-purpose servers
  3. defer+madvise: Defer + synchronous for madvise() regions

    • Pro: App control over critical regions
    • Con: Requires application changes
    • Use: Databases with madvise(MADV_HUGEPAGE)
  4. madvise: Only for madvise() regions, no background defrag

    • Pro: Explicit control, no surprises
    • Con: Requires application opt-in
    • Use: Recommended for production
  5. never: No defragmentation

    • Pro: Zero latency from defrag
    • Con: THP only works with already-free huge pages
    • Use: Real-time or latency-sensitive systems
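
Switching modes is a single (root) write to sysfs, and reading the file back shows the active mode in brackets. A minimal sketch:

# Select deferred defragmentation and confirm the active mode
echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/defrag
# e.g. always [defer] defer+madvise madvise never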

Memory Fragmentation Visualization

After system runs for hours/days:

  • Page cache allocations scatter throughout memory
  • Process allocations interleaved
  • 2MB contiguous regions become rare
  • THP success rate drops from 80% to <20%

Defragmentation mechanisms:

  • kcompactd: Background compaction daemon
  • kswapd: Evicts/moves pages during reclaim
  • Direct compaction: Synchronous during allocation (slow!)
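
Two standard interfaces let you gauge how fragmented physical memory is and nudge compaction by hand; a sketch is below. With 4KB base pages, order-9 blocks in /proc/buddyinfo correspond to 2MB chunks.

# Free blocks per order in each zone; low order-9/order-10 counts mean 2MB allocations will struggle
cat /proc/buddyinfo

# Manually trigger full memory compaction (can take seconds and burn CPU on large machines)
echo 1 | sudo tee /proc/sys/vm/compact_memory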

Benefits and Use Cases

Performance Impact: Real Measurements

THP Performance Impact by Workload Type

[Chart summary: measured speedup (+) or slowdown (-) with THP enabled]

  Database             +30%
  ML Training          +12%
  Analytics            +25%
  Video Encode         +18%
  Memcpy               +50%
  Web server (fork)    -28%
  Sparse access        -22%

Maximum benefit workloads:

1. Databases (PostgreSQL, MongoDB, Redis):

  • Large buffer pools (10GB+)
  • Sequential scans over huge tables
  • Measured improvement: 15-35% throughput increase
  • TLB miss reduction: 60-70%

Example - Redis benchmark:

# Without THP
redis-benchmark -t set,get -n 1000000 -q
SET: 89285.71 requests per second
GET: 91743.12 requests per second

# With THP (madvise mode)
SET: 117096.02 requests per second (+31%)
GET: 120192.31 requests per second (+31%)
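
To verify that Redis actually ended up on huge pages during such a test, check its anonymous huge page usage in smaps_rollup. This assumes a single redis-server process and a kernel new enough (4.14+) to provide smaps_rollup.

# Total anonymous huge page usage for the Redis process
grep AnonHugePages /proc/$(pidof redis-server)/smaps_rollup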

2. ML Training (PyTorch, TensorFlow):

  • Large model weights in memory
  • Contiguous tensor allocations
  • Measured improvement: 5-15% training speedup
  • Memory bandwidth: Better cache line utilization

Example - PyTorch ResNet50:

# Training throughput (images/sec)
Without THP: 147 img/s
With THP:    160 img/s (+9%)

3. In-Memory Analytics (Spark, ClickHouse):

  • Large columnar data structures
  • Aggregations over billions of rows
  • Measured improvement: 20-40% query speedup
  • Page walk reduction: 25%

4. Video Processing Pipelines:

  • Large frame buffers
  • Sequential processing
  • Measured improvement: 15-25% throughput
  • Reduced stalls: Fewer TLB-related pipeline flushes

When THP Provides Maximum Benefit

THP works best with:

  ✅ Large, contiguous allocations (multi-MB buffers)
  ✅ Dense memory access patterns (sequential reads/writes)
  ✅ Long-lived processes (time for khugepaged to optimize)
  ✅ Memory-bound workloads (not CPU/IO bound)
  ✅ Predictable allocation patterns (not rapid alloc/free)

Performance benchmark table:

Workload Type    THP Improvement    TLB Miss Reduction    Best Mode
PostgreSQL       +30%               70%                   madvise
MongoDB          +25%               65%                   madvise
Redis            +31%               68%                   madvise
ML Training      +9%                45%                   defer
Spark            +35%               72%                   madvise
Memcached        +20%               58%                   madvise
Video Encode     +18%               50%                   defer

Best practice for production:

# Enable THP in madvise-only mode
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

# Configure khugepaged conservatively
echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 4096 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

Why madvise mode?

  • Applications explicitly opt-in for critical regions
  • No unexpected memory bloat
  • No latency spikes from defragmentation
  • Predictable performance characteristics

Configuration and Management

THP Modes

cat /sys/kernel/mm/transparent_hugepage/enabled

Modes:

  1. always: Kernel tries to use huge pages whenever possible

    • Pro: Maximum THP coverage
    • Con: Memory bloat, latency spikes
    • Use: Benchmarking only
  2. madvise: Only for regions marked with madvise(MADV_HUGEPAGE)

    • Pro: Explicit control, no surprises
    • Con: Requires application changes
    • Use: Recommended for production
  3. never: Disable THP completely

    • Pro: Predictable memory usage
    • Con: No THP benefits
    • Use: Incompatible workloads

Application-Level THP Control

Applications can control THP per-region using madvise():

#include <sys/mman.h>

// Allocate a large buffer
void* buffer = mmap(NULL, 100 * 1024 * 1024,          // 100MB
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

// Request THP for this region
madvise(buffer, 100 * 1024 * 1024, MADV_HUGEPAGE);

// Or disable THP for this region
madvise(buffer, 100 * 1024 * 1024, MADV_NOHUGEPAGE);

Use cases:

  • Database buffer pools: MADV_HUGEPAGE
  • Small object allocators: MADV_NOHUGEPAGE
  • JVM heap: MADV_HUGEPAGE
  • Redis dataset: MADV_HUGEPAGE
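
Whether an madvise() call took effect is visible per mapping in smaps: the VmFlags line gains hg for MADV_HUGEPAGE and nh for MADV_NOHUGEPAGE. A rough sketch (GNU grep assumed, <PID> is a placeholder):

# Show size, huge page usage, and flags; mappings with "hg" have THP requested
grep -E '^(Size|AnonHugePages|VmFlags)' /proc/<PID>/smaps | grep -B 2 'VmFlags:.* hg'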

Monitoring THP Usage

Check THP statistics:

cat /proc/meminfo | grep -i huge
AnonHugePages:   4096000 kB    # Anonymous huge pages
ShmemHugePages:        0 kB    # Shared memory huge pages
FileHugePages:         0 kB    # File-backed huge pages
HugePages_Total:       0       # Reserved huge pages (not THP)

Per-process THP usage:

cat /proc/<PID>/smaps | grep -i huge
AnonHugePages:     10240 kB

# Or list only the mappings that actually use huge pages
awk '/AnonHugePages/ && $2 > 0' /proc/<PID>/smaps

THP allocation success rate:

grep thp /proc/vmstat
thp_fault_alloc     45821    # Successful allocations
thp_fault_fallback  20739    # Failed, fell back to 4KB
thp_collapse_alloc  12048    # khugepaged collapses
thp_split_page       3421    # Huge pages split back to 4KB

Calculation:

Success rate = thp_fault_alloc / (thp_fault_alloc + thp_fault_fallback) = 45821 / (45821 + 20739) = 68.8%
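
The same calculation can be scripted so it can feed a dashboard or an alert; a one-liner sketch over /proc/vmstat:

# THP fault success rate straight from the counters
awk '/^thp_fault_alloc /{a=$2} /^thp_fault_fallback /{f=$2} END{if (a+f) printf "THP fault success: %.1f%%\n", 100*a/(a+f)}' /proc/vmstat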

Tuning khugepaged

Aggressive settings (high THP coverage):

cd /sys/kernel/mm/transparent_hugepage/khugepaged
echo 1000 > scan_sleep_millisecs    # scan every second
echo 8192 > pages_to_scan           # scan 32MB (8192 × 4KB) per interval

Conservative settings (low overhead):

cd /sys/kernel/mm/transparent_hugepage/khugepaged
echo 30000 > scan_sleep_millisecs   # scan every 30 seconds
echo 2048 > pages_to_scan           # scan 8MB (2048 × 4KB) per interval

Disable khugepaged entirely:

echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

Systemwide THP Configuration Script

#!/bin/bash
# Production THP configuration

# Enable madvise-only mode
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

# Conservative khugepaged settings
echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 4096 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
echo 511 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

# Do not collapse ranges that would require swapping pages back in
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

# Make persistent (add to /etc/rc.local or a systemd unit, see below)
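
One way to make the settings persistent (a sketch; the unit name and script path are hypothetical, adjust for your distribution) is to install the script and run it from a oneshot systemd unit:

# Install the script and a oneshot unit that applies it at boot
sudo install -m 0755 configure-thp.sh /usr/local/sbin/configure-thp.sh

sudo tee /etc/systemd/system/thp-config.service > /dev/null <<'EOF'
[Unit]
Description=Configure Transparent Huge Pages
After=sysinit.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/configure-thp.sh

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now thp-config.service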

Trade-offs and Issues

The Dark Side of THP: Memory Bloat

THP's biggest hidden cost is internal fragmentation from small allocations:

Internal Fragmentation: Memory Bloat with THP

[Diagram summary:]

Without THP (4KB pages): a process allocating 100 × 5KB objects uses 500KB of data. Each 5KB object occupies 2 × 4KB pages (8KB), so total RSS is 100 × 8KB = 800KB. Waste: 300KB (37.5%).

With THP enabled (2MB pages): the same 100 × 5KB objects land in a 2MB huge page that the kernel allocates in full. Only 500KB is used, but RSS is 2048KB. Waste: 1548KB (75.6%), roughly 2.5× memory bloat compared to 4KB pages.

Memory Bloat Risk

Workloads with many small allocations can see 2-4× memory usage increase with THP!

Real-world examples:

Redis with small keys:

# Without THP
used_memory:     1048576000 bytes (1000 MB)

# With THP (always mode)
used_memory_rss: 2684354560 bytes (2560 MB)    # 2.5× memory bloat!

Cause: Redis allocates many small objects (50-100 bytes each). In always mode, sparsely used regions of the allocator's heap get backed by full 2MB huge pages, so memory that would otherwise stay unmapped or be returned to the kernel in 4KB granules remains resident, inflating RSS far beyond the logical dataset size.

Node.js applications:

  • Many small JavaScript objects
  • THP can cause 2-4× memory usage increase
  • OOM kills on memory-constrained systems

Java with G1GC:

  • G1 heap regions can be as small as 1MB
  • A 2MB huge page backing a partially used 1MB region wastes up to half of it
  • Recommendation: MADV_NOHUGEPAGE for Java heaps

Latency Spikes from Defragmentation

Synchronous defragmentation (defrag=always) causes severe latency spikes:

# Normal page fault
Page fault latency: 0.1-0.5 microseconds

# THP fault with synchronous defragmentation
Page fault latency: 10-100 milliseconds (100,000× slower!)

Root cause: Direct compaction:

  1. Kernel scans for 2MB contiguous region
  2. Not found → trigger compaction
  3. Migrate pages to create contiguous space
  4. May need to evict page cache or swap
  5. Update hundreds of page table entries
  6. Flush TLBs across all CPUs

Real-world impact:

  • Redis P99 latency: 1ms → 150ms
  • PostgreSQL query stalls: 50-200ms
  • Web server request timeouts

Solution: Use defrag=madvise or defrag=defer
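
If you suspect direct compaction is behind the spikes, the compaction tracepoints and vmstat counters can confirm it. A sketch using perf; tracepoint names can differ slightly between kernel versions, so verify them with perf list.

# Count direct-compaction activity system-wide for one minute
sudo perf stat -a -e 'compaction:mm_compaction_begin,compaction:mm_compaction_end' -- sleep 60

# Or watch the stall counter climb over time
watch -n 5 'grep compact_stall /proc/vmstat'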

When THP Hurts Performance

THP can degrade performance in these scenarios:

1. Fork-heavy workloads (web servers with prefork):

  • Each fork() must copy 2MB huge pages
  • Copy-on-write splits huge pages back to 4KB
  • Slowdown: 2-3× slower forks

Example:

// Apache prefork model
Without THP: fork() = 0.5ms
With THP:    fork() = 1.4ms (2.8× slower)

2. Sparse memory access:

  • Application touches only 4KB of each 2MB huge page
  • Nearly 2MB per huge page stays resident but unused
  • Each fault populates a full 2MB region instead of 4KB

3. Frequent allocation/deallocation:

  • Small object allocators (malloc/free loops)
  • Memory churn prevents huge page formation
  • khugepaged overhead with no benefit

4. Memory-constrained systems:

  • Internal fragmentation causes OOM
  • Kernel unable to reclaim partially-used huge pages
  • Swapping 2MB chunks instead of 4KB

Monitoring THP Problems

Symptoms of THP issues:

# Check for excessive huge page splits
grep thp_split /proc/vmstat
thp_split_page 892847       # If high, THP may be counterproductive

# Check for allocation failures
grep thp_fault_fallback /proc/vmstat
thp_fault_fallback 45218    # If >50% of faults, fragmentation problem

# Monitor defrag stalls
grep compact_stall /proc/vmstat
compact_stall 12847         # Direct compaction stalls (latency spikes!)

Per-process memory bloat detection:

# Compare resident memory against the data segment size
cat /proc/<PID>/status | grep -E "VmRSS|VmData"
VmRSS:   2560000 kB    # Resident memory
VmData:  1048576 kB    # Size of the process's data segments

# If RSS is far larger than VmData, THP bloat is a likely cause

Best Practices Summary

✅ DO:

  • Use madvise mode in production
  • Let applications opt-in with madvise(MADV_HUGEPAGE)
  • Monitor THP statistics regularly
  • Use defer or madvise defrag mode
  • Tune khugepaged conservatively

❌ DON'T:

  • Use always mode in production
  • Use defrag=always (latency spikes!)
  • Enable THP for all workloads blindly
  • Ignore memory bloat on small-object workloads
  • Forget to benchmark before/after THP enablement

Recommended per-workload settings:

Workload            enabled    defrag     khugepaged
Database            madvise    madvise    aggressive
Web server (fork)   never      never      disabled
ML training         always     defer      moderate
Redis               madvise    madvise    moderate
Java (large heap)   madvise    defer      conservative
Real-time systems   never      never      disabled

Conclusion

Transparent Huge Pages is a powerful optimization that can deliver 10-50% performance improvements for memory-intensive workloads by reducing TLB pressure and page table overhead. However, it's not a silver bullet:

When THP shines:

  • Large, long-lived allocations (databases, caches, ML models)
  • Dense, sequential access patterns
  • Memory-bound workloads with high TLB miss rates

When THP hurts:

  • Small object allocations (memory bloat)
  • Fork-heavy workloads (copy-on-write overhead)
  • Memory-constrained systems (OOM risk)
  • Latency-sensitive applications (defrag stalls)

Production recommendation:

  • Default: enabled=madvise, defrag=madvise
  • Let applications decide which regions benefit from THP
  • Monitor continuously for memory bloat and latency spikes
  • Benchmark rigorously before enabling in production

Understanding THP's mechanics and trade-offs allows you to make informed decisions about when to leverage this powerful feature and when to avoid it.

If you found this explanation helpful, consider sharing it with others.
