Attention Sinks: Stable Streaming LLMs

Understand attention sinks, the phenomenon where LLMs concentrate attention on initial tokens, and how preserving them enables infinite-length streaming inference.

Attention Sinks: The Key to Streaming LLMs

Attention sinks are a fascinating phenomenon where language models naturally concentrate significant attention on initial tokens (like BOS), regardless of their semantic importance. This discovery enables efficient streaming inference with stable performance.

Interactive Attention Sink Visualization

Explore how attention sinks stabilize streaming inference through this step-by-step demonstration:

The Discovery

LLMs naturally concentrate attention on initial tokens, even when they're semantically meaningless (like the BOS token). These become "attention sinks" - repositories for excess attention mass.

[Visualization: natural attention distribution across 12 tokens. Tokens 0-3 each draw about 6% of the attention, roughly 25% in total, even though they are just BOS + "The quick brown"; the remaining tokens draw between 1% and 8% each.]

💡 Why This Happens

Softmax requires all attention weights to sum to 1. When tokens don't need specific information, they still must allocate attention somewhere. Initial tokens become the natural "dump" because they're always visible and positionally distinct.

The Problem: Naive Sliding Window

The obvious approach: keep a sliding window of the 8 most recent tokens and evict the oldest as new ones arrive. But this causes catastrophic failure once the sink tokens themselves are evicted.

[Visualization: with a naive 8-token window, the cache at position 7 still holds tokens 0-7 and perplexity is 10.5; by position 12 it holds only tokens 5-12, the sinks are gone, and perplexity jumps to 23.3.]
Quality Collapse: 10.5 → 23.3

  1. Initial tokens (sinks) get evicted from the cache
  2. The model still wants to allocate ~25% of its attention to the sinks
  3. But there's nowhere to put it!
  4. Attention redistributes randomly → catastrophic failure

The Solution: StreamingLLM

Keep just 4 initial tokens (sinks) permanently, alongside the sliding window. The model stays happy because it still has somewhere to dump excess attention.

[Visualization: cache structure at position 12 with sinks preserved: tokens 0-3 (always kept), a gap of evicted tokens, then the recent window of tokens 9-12 (last 4).]

[Visualization: with sinks preserved, the attention distribution stays stable. Tokens 0-3 still absorb about 6% each, the window tokens receive 1-9% each, and perplexity holds at 10.4. ✓ Quality stays stable.]

The Key Insight

By preserving just 4 tokens, we maintain the model's learned attention pattern. The sinks absorb the 25% excess attention, preventing forced redistribution. The model doesn't care that middle tokens are missing—it only needs sinks for stability and recent window for content.

The Impact

Same memory usage. Dramatically different results.

❌ Without sinks: the 8-token cache holds tokens 5-12 and perplexity collapses to 23.3 (catastrophic).
✓ With sinks: the same 8-token budget holds tokens 0-3 plus 9-12 and perplexity stays at 10.4 (stable).

The magic of attention sinks: preserve just 4 sink tokens and there is effectively no limit on sequence length, with roughly 99% memory saved compared to full caching.

The Discovery

Researchers observed that LLMs consistently allocate high attention scores to initial tokens, even when these tokens carry no semantic meaning. These "sink" tokens serve as repositories for excess attention mass.

Why Attention Sinks Form

Softmax Constraint

The softmax operation requires attention weights to sum to 1:

\sum_{j=1}^{n} \text{softmax}(s_i)_j = 1

When a token doesn't strongly attend to any specific position, the model needs somewhere to "dump" the remaining attention mass.
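
As a tiny illustration (a toy NumPy example, not from the StreamingLLM code), the sketch below shows that softmax always hands out exactly 1.0 of attention mass, so even a query with no strongly relevant key must put that mass somewhere:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Toy attention scores for one query over 8 cached tokens.
# No key is strongly relevant, so all scores are small and similar.
scores = np.array([0.3, 0.2, 0.1, 0.2, 0.0, 0.1, 0.0, 0.1])

weights = softmax(scores)
print(weights.round(3))  # roughly uniform weights
print(weights.sum())     # always 1.0: the mass has to go somewhere
```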

Initial Token Bias

Initial tokens become natural sinks because:

  1. They're always visible (no causal masking)
  2. They're positionally distinct
  3. Models learn this pattern during training

The Streaming Problem

Without Attention Sinks: The Catastrophic Failure

The Naive Approach:

Imagine processing a long document (100K tokens) with a 1K token cache. The obvious solution seems simple:

  • Keep the most recent 1,000 tokens in memory
  • Evict the oldest token when a new one arrives
  • Slide this window forward as generation continues

What Happens:

Initial State (tokens 0-1000):

  • Model allocates ~20-30% of total attention to initial tokens (BOS, first words)
  • These tokens act as attention sinks
  • Remaining 70-80% distributed across content tokens
  • Perplexity: 10.5 (normal)

After Eviction (tokens 1001-2000):

  • Initial sink tokens are gone from cache
  • Model still wants to allocate 20-30% attention somewhere
  • But there's nowhere natural to put it!
  • Attention gets forcibly redistributed across random positions

Result:

  • Perplexity spike to 60+ (6× worse!)
  • Generation becomes incoherent
  • Model "loses its grounding"
  • Performance never recovers

Why This Fails:

The softmax attention mechanism requires all attention weights to sum to 1. When tokens don't need specific information, they still must allocate attention somewhere. The model learned during training to use initial tokens as "attention dumps." Removing these sinks violates the model's learned attention distribution, causing catastrophic failure.

With Attention Sinks: The StreamingLLM Solution

The Key Insight:

Instead of evicting everything old, preserve just the first 4 tokens permanently.

Cache Structure:

[Sink Tokens: 0-3] + [Sliding Window: most recent N-4 tokens] + [New Token]

Example with 1024-token cache:

  • Positions 0-3: Always preserved (BOS + first 3 tokens)
  • Positions 4-1023: Sliding window (most recent 1020 tokens)
  • Total cache size: Still 1024 tokens

What Happens:

Initial State (tokens 0-1000):

  • Same as before
  • Perplexity: 10.3

After Many Evictions (tokens 1001-8000):

  • Sinks remain at positions 0-3
  • Window contains tokens 6981-8000 (the most recent 1020)
  • Model still has somewhere to dump excess attention (the sinks!)
  • Attention distribution remains stable

Result:

  • Perplexity stays ~10.9 (only 6% degradation)
  • Generation remains coherent
  • Can process infinite-length streams
  • Memory stays constant at 1024 tokens

Why This Works:

By preserving just 4 tokens that serve as attention sinks, we maintain the model's learned attention patterns. The sinks absorb excess attention mass, preventing forced redistribution. The model doesn't care that middle tokens (1000-7000) are missing—it only needs sinks for stability and the recent window for content.
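
A minimal sketch of the two eviction policies compared above, in plain Python with illustrative helper names, showing which token indices remain cached at a given position:

```python
def naive_window(current_pos: int, cache_size: int) -> list[int]:
    """Keep only the most recent `cache_size` token indices."""
    start = max(0, current_pos - cache_size + 1)
    return list(range(start, current_pos + 1))

def sink_window(current_pos: int, cache_size: int, sink_size: int = 4) -> list[int]:
    """Keep the first `sink_size` tokens plus the most recent tokens."""
    if current_pos + 1 <= cache_size:
        return list(range(current_pos + 1))
    recent = cache_size - sink_size
    return list(range(sink_size)) + list(range(current_pos - recent + 1, current_pos + 1))

print(naive_window(8000, 1024)[:5])  # [6977, 6978, 6979, 6980, 6981]: sinks 0-3 are gone
print(sink_window(8000, 1024)[:6])   # [0, 1, 2, 3, 6981, 6982]: sinks kept beside the window
```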

Architecture: How StreamingLLM Works

Core Components

StreamingLLM extends standard transformer inference with three key modifications:

1. KV Cache Management

Standard Approach:

  • Cache grows unbounded with sequence length
  • Memory usage: O(n) where n = total tokens processed
  • For 100K tokens: prohibitively expensive

StreamingLLM Approach:

  • Fixed-size cache: sink_size + window_size (e.g., 4 + 1020 = 1024)
  • Memory usage: O(window_size) - constant!
  • For any length: fixed 1024 tokens in memory

Cache Structure:

Layer 1: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]
Layer 2: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]
...
Layer N: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]

Each transformer layer maintains its own cache with identical structure.
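
One way to hold this per-layer structure in memory is sketched below; the class and field names are illustrative, not taken from any particular library:

```python
from dataclasses import dataclass
import torch

@dataclass
class LayerKVCache:
    """Fixed-budget KV cache for a single transformer layer."""
    keys: torch.Tensor       # [num_heads, cached_tokens, head_dim]
    values: torch.Tensor     # [num_heads, cached_tokens, head_dim]
    positions: torch.Tensor  # [cached_tokens] true position of each cached token

# One cache per layer; every layer follows the same sink + window layout.
num_layers, num_heads, head_dim = 32, 32, 128
caches = [
    LayerKVCache(
        keys=torch.empty(num_heads, 0, head_dim),
        values=torch.empty(num_heads, 0, head_dim),
        positions=torch.empty(0, dtype=torch.long),
    )
    for _ in range(num_layers)
]
```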

2. Cache Update Strategy

Phase 1: Building Up (tokens 0 to window_size)

Simply append each new token to cache:

  • Token 0: Cache = [0]
  • Token 1: Cache = [0, 1]
  • ...
  • Token 1023: Cache = [0, 1, ..., 1023]

No eviction yet—cache hasn't filled up.

Phase 2: Steady State (tokens > window_size)

For each new token:

  1. Identify sink region: First 4 tokens (0-3)
  2. Identify window region: Last 1020 tokens
  3. Concatenate: [sinks] + [window] + [new_token]
  4. Evict middle: Oldest non-sink token disappears

Example at token 5000 (see the sketch after this list):

  • Sinks: tokens 0, 1, 2, 3
  • Window: tokens 3981-5000 (most recent 1020)
  • New: token 5001
  • Evicted: tokens 4-3980 (not in cache)
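
A PyTorch sketch of this steady-state update for one layer's key tensor (the same slicing applies to values and position IDs); the sizes follow the 4-sink, 1024-token example used here:

```python
import torch

def append_and_evict(cached: torch.Tensor, new: torch.Tensor,
                     sink_size: int = 4, cache_size: int = 1024) -> torch.Tensor:
    """Append the new token's keys (or values) and drop the oldest non-sink entry.

    cached: [num_heads, seq_len, head_dim], new: [num_heads, 1, head_dim]
    """
    cached = torch.cat([cached, new], dim=1)
    if cached.shape[1] <= cache_size:
        return cached                    # Phase 1: still building up, nothing to evict
    window = cache_size - sink_size      # e.g. 1020 recent tokens
    return torch.cat([cached[:, :sink_size], cached[:, -window:]], dim=1)
```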

3. Position Encoding Management

The Challenge:

Position encodings (RoPE, ALiBi, etc.) depend on token position. When we evict middle tokens, we create gaps in positions:

Actual positions in cache: [0, 1, 2, 3, 3981, 3982, ..., 5000]

But the model needs to know these are the correct positions, not sequential [0, 1, 2, 3, 4, 5, ...].

The Solution:

Maintain position IDs that reflect true positions:

For Sinks:

  • Always positions 0, 1, 2, 3 (never change)

For Window:

  • Use actual positions (3981, 3982, ...)
  • Not relative positions (4, 5, ...)

For RoPE: Apply rotary embeddings using true positions. RoPE naturally handles gaps—it only cares about relative distances between visible tokens.

For ALiBi: Compute biases using true position differences. A gap doesn't matter because ALiBi only biases based on distance within the cache.
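
A small sketch of building position IDs under the scheme described in this section (each cached token keeps its true position, and the evicted range is simply absent); the helper name is illustrative:

```python
import torch

def cache_position_ids(current_pos: int, sink_size: int = 4,
                       cache_size: int = 1024) -> torch.Tensor:
    """Position IDs for the cached tokens, keeping each token's true position."""
    if current_pos + 1 <= cache_size:
        return torch.arange(current_pos + 1)
    window = cache_size - sink_size
    sinks = torch.arange(sink_size)                                   # 0, 1, 2, 3
    recent = torch.arange(current_pos - window + 1, current_pos + 1)  # 3981 ... 5000
    return torch.cat([sinks, recent])

print(cache_position_ids(5000)[:6])  # tensor([   0,    1,    2,    3, 3981, 3982])
```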

Attention Computation with Sinks

Step 1: Compute Scores

For the current token (query), compute attention scores against all cached tokens (keys):

  • Score to sink 0, sink 1, sink 2, sink 3
  • Score to window token 1, window token 2, ..., window token 1020

Standard scaled dot-product: score = (Q · K^T) / √d_k

Step 2: Create Attention Mask

Build a boolean mask indicating which positions are valid:

[True, True, True, True, False, False, ..., False, True, True, ..., True]
 └── sinks (0-3) ──┘  └───── evicted (4-3980) ─────┘  └──── window ────┘

Step 3: Mask Invalid Positions

Set scores for evicted positions to -∞ (they'll get 0 attention after softmax):

Scores before mask: [2.3, 1.8, 1.5, 1.2,  ?,  ?, ...,  ?, 3.1, 2.8, ...]
Scores after mask:  [2.3, 1.8, 1.5, 1.2, -∞, -∞, ..., -∞, 3.1, 2.8, ...]

Step 4: Softmax

Apply softmax only over valid (non-masked) positions. The -∞ values become 0 probability:

Attention weights: [0.15, 0.08, 0.05, 0.04, 0, 0, ..., 0, 0.25, 0.18, ...]

Notice: Sinks get ~15% + 8% + 5% + 4% = 32% total attention (typical).

Step 5: Weighted Sum

Multiply attention weights by values and sum:

Output = 0.15·V₀ + 0.08·V₁ + 0.05·V₂ + 0.04·V₃ + 0.25·V₃₉₈₁ + ...

The sinks contribute significantly even though they're semantically irrelevant!
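
Steps 1-5 can be condensed into a short PyTorch sketch for a single head; shapes and names are illustrative, and the `valid` mask plays the role of the boolean mask from Step 2:

```python
import math
import torch

def sink_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   valid: torch.Tensor) -> torch.Tensor:
    """Single-head attention for one query over the cached keys/values.

    q: [head_dim], k and v: [cache_len, head_dim],
    valid: [cache_len] bool (True for sink and window slots, False otherwise).
    """
    d_k = q.shape[-1]
    scores = (k @ q) / math.sqrt(d_k)                    # Step 1: scaled dot-product
    scores = scores.masked_fill(~valid, float("-inf"))   # Steps 2-3: mask invalid slots
    weights = torch.softmax(scores, dim=-1)              # Step 4: masked slots get 0 weight
    return weights @ v                                   # Step 5: weighted sum of values

# Tiny example: a compact cache of 4 sinks + 4 recent tokens, all valid.
head_dim, cache_len = 64, 8
q, k, v = torch.randn(head_dim), torch.randn(cache_len, head_dim), torch.randn(cache_len, head_dim)
out = sink_attention(q, k, v, valid=torch.ones(cache_len, dtype=torch.bool))
```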

Generation Loop

Streaming Generation Process (a condensed code sketch follows these steps):

Step 1: Initialize

  • Load model
  • Create empty cache for all layers
  • Set sink_size=4, window_size=1024

Step 2: Process Prompt

  • Tokenize prompt (e.g., "Write a story about" → [245, 2834, 891, 546])
  • Forward pass through model with empty cache
  • Cache stores keys/values for all prompt tokens
  • Current position: 4

Step 3: Generate First Token

  • Current query: last prompt token's representation
  • Attend to: all 4 cached prompt tokens (at this point they are both the sinks and the entire window)
  • Sample from output distribution
  • Generate token (e.g., "a" → 320)
  • Update cache: append new key/value
  • Current position: 5

Steps 4-1023: Continue Building

  • Same process
  • Cache grows but hasn't hit limit
  • No eviction yet

Step 1024+: Steady State

  • Cache is full
  • For each new token:
    1. Extract sink keys/values (positions 0-3)
    2. Extract window keys/values (most recent 1020)
    3. Concatenate: sinks + window
    4. Generate next token
    5. Append new key/value to cache
    6. Oldest non-sink gets evicted automatically

Result:

  • Can generate 100K+ tokens
  • Memory stays constant
  • Quality remains high (perplexity ~11 vs ~10 without streaming)
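
A condensed sketch of this loop; `model_step` is a hypothetical stand-in for a forward pass that takes one token plus the current cache and returns next-token logits and the updated per-layer key/value tensors (it is not a real library API):

```python
import torch

SINK_SIZE, CACHE_SIZE = 4, 1024

def evict(t: torch.Tensor) -> torch.Tensor:
    """Keep sinks plus the most recent tokens (sequence dimension assumed to be dim 1)."""
    if t.shape[1] <= CACHE_SIZE:
        return t
    return torch.cat([t[:, :SINK_SIZE], t[:, -(CACHE_SIZE - SINK_SIZE):]], dim=1)

def stream_generate(model_step, prompt_ids, max_new_tokens):
    cache, logits = None, None
    for tok in prompt_ids:                                  # Step 2: process the prompt
        logits, cache = model_step(tok, cache)              # hypothetical forward interface
        cache = [(evict(k), evict(v)) for k, v in cache]
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):                         # Steps 3+: generate, then evict
        next_id = int(torch.argmax(logits, dim=-1))         # greedy sampling for simplicity
        ids.append(next_id)
        logits, cache = model_step(next_id, cache)
        cache = [(evict(k), evict(v)) for k, v in cache]
    return ids
```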

Performance Analysis

Perplexity Comparison

| Method            | PPL@2K | PPL@8K | PPL@16K | Memory |
|-------------------|--------|--------|---------|--------|
| Full Cache        | 10.2   | 10.3   | 10.4    | O(n)   |
| Window (no sinks) | 10.5   | 63.5   | 450+    | O(w)   |
| Window + 4 Sinks  | 10.3   | 10.9   | 11.2    | O(w)   |

Memory Savings

For a 100K token stream with 1K window:

  • Full cache: 100K × d_model memory
  • StreamingLLM: 1K × d_model memory
  • Savings: 99%

Advanced Techniques

Dynamic Sink Detection

Rather than hardcoding the number of attention sinks (typically 4), adaptive systems can automatically identify which tokens naturally act as sinks by analyzing attention patterns during inference.

Detection Methodology:

The system monitors attention weights across multiple generation steps, computing the average attention received by each token position. Tokens that consistently receive high attention (above a threshold like 10-15%) are identified as natural sinks.
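
A sketch of this detection idea; the 10% threshold and the shape of the recorded history are the illustrative values from above, not fixed constants:

```python
import torch

def detect_sinks(attn_history: torch.Tensor, threshold: float = 0.10) -> list[int]:
    """Identify candidate sink positions from recorded attention weights.

    attn_history: [num_steps, seq_len] attention each cached position received
    over the last num_steps generation steps (e.g. 100-200).
    """
    mean_attention = attn_history.mean(dim=0)   # average attention per position over steps
    candidates = (mean_attention > threshold).nonzero().flatten().tolist()
    sinks = []
    for pos in candidates:          # keep only a contiguous block starting at position 0,
        if pos == len(sinks):       # matching the characteristics listed below
            sinks.append(pos)
        else:
            break
    return sinks
```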

Key Characteristics of Detected Sinks:

  1. Contiguous Positioning: Sinks typically form a contiguous block at the sequence start (positions 0-3 or 0-7)
  2. Consistent Attention: They receive stable attention across different contexts and prompts
  3. Position Independence: Their sink behavior doesn't depend on semantic content

Adaptive Sink Management:

The system maintains a rolling history of attention patterns (typically 100-200 tokens). Once sufficient data is collected, it analyzes the attention distribution to identify natural sink positions. This enables:

  • Task-specific optimization: Different tasks may require different numbers of sinks
  • Model-specific tuning: Some models use more/fewer sinks based on training
  • Dynamic adjustment: Sink count can adapt during long generation sessions

Benefits:

  • Eliminates manual tuning of sink size
  • Adapts to model-specific attention patterns
  • Can discover non-obvious sink configurations

Hierarchical Sinks

For extremely long contexts (100K+ tokens), a single-level sink system may be insufficient. Hierarchical sinks use multi-scale preservation to maintain references at different temporal distances.

Multi-Level Preservation Strategy:

Instead of preserving only the first 4 tokens, hierarchical systems preserve tokens at multiple scales:

  • Level 1 (Immediate): First 4 tokens (always preserved)
  • Level 2 (Short-term): Every 16th token from recent 1,000 tokens
  • Level 3 (Long-term): Every 64th token from positions 1,000-10,000

Cache Composition Example:

For a 50,000 token sequence with a 1,024-token cache (see the sketch after this list):

  • 4 level-1 sinks (positions 0-3)
  • ~60 level-2 sinks (every 16th position from recent window)
  • ~150 level-3 sinks (every 64th position from middle range)
  • ~810 most recent tokens (sliding window)
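
A sketch of how the kept indices for such a hierarchical cache might be selected; the levels and strides follow the example above, and the exact counts will differ slightly from the rounded figures:

```python
def hierarchical_keep_indices(current_pos: int, budget: int = 1024) -> list[int]:
    """Token indices preserved by a three-level hierarchical sink cache."""
    level1 = list(range(4))                                            # immediate sinks
    level2 = list(range(max(4, current_pos - 1000), current_pos, 16))  # every 16th recent token
    level3 = list(range(1000, min(10_000, current_pos), 64))           # every 64th mid-range token
    keep = sorted(set(level1 + level2 + level3))
    window = budget - len(keep)                                        # spend the rest on recency
    recent = list(range(current_pos - window + 1, current_pos + 1))
    return sorted(set(keep + recent))

print(len(hierarchical_keep_indices(50_000)))  # close to the 1,024-token budget
```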

Attention Distribution:

Hierarchical sinks enable the model to:

  • Access immediate context (level 1) for stability
  • Reference recent patterns (level 2) for coherence
  • Recall distant events (level 3) for long-range dependencies

Use Cases:

  • Document summarization: Maintain references to different sections
  • Long-form generation: Track narrative threads across chapters
  • Multi-turn dialogue: Preserve conversation history at multiple scales

Sink Recycling

In scenarios where the context changes dramatically (e.g., switching topics, processing multiple documents), the original sinks may become less relevant. Sink recycling enables gradual transition to new attention anchors.

Transition Strategy:

When attention patterns indicate a context shift, the system gradually phases out old sinks while introducing new ones (a sketch follows these steps):

  1. Detection: Identify when attention to current sinks drops below threshold
  2. Decay: Gradually reduce the strength of old sink representations (e.g., multiply by 0.9 each step)
  3. Replacement: Introduce new candidate sinks from recent high-attention tokens
  4. Stabilization: Monitor new sinks for consistency before full commitment
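
A sketch of the decay-and-replacement bookkeeping; the 0.9 decay factor and the thresholds are illustrative values in the spirit of the steps above:

```python
def update_sink_scores(sink_scores: dict[int, float],
                       recent_attention: dict[int, float],
                       decay: float = 0.9,
                       retire_below: float = 0.02,
                       promote_above: float = 0.10) -> dict[int, float]:
    """One recycling step: decay stale sinks, retire weak ones, promote new candidates.

    sink_scores: position -> running attention score for the current sink set
    recent_attention: position -> attention received at the latest step
    """
    updated = {}
    for pos, score in sink_scores.items():
        # Blend the decayed running score with the attention just observed.
        new_score = decay * score + (1 - decay) * recent_attention.get(pos, 0.0)
        if new_score >= retire_below:           # keep sinks that still attract attention
            updated[pos] = new_score
    for pos, attn in recent_attention.items():  # promote strong new anchors, e.g. the
        if pos not in updated and attn >= promote_above:  # first tokens of a new document
            updated[pos] = attn
    return updated
```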

Practical Application - Multi-Document Processing:

When processing a sequence of articles:

  • Document 1 (tokens 0-5000): Sinks at positions 0-3
  • Document 2 (tokens 5001-10000): Begin decaying original sinks, introduce sinks at 5001-5004
  • Document 3 (tokens 10001-15000): Original sinks fully retired, new sinks established

Benefits:

  • Prevents stale sinks from consuming attention
  • Adapts to changing content domains
  • Maintains stability during transitions

Challenges:

  • Determining optimal decay rate (too fast = instability, too slow = wasted attention)
  • Identifying genuine context shifts vs. temporary topic changes
  • Managing multiple active sink sets during transition periods

Production Deployment

GPU Optimization Strategies

Efficient streaming attention requires careful GPU kernel design to handle the discontinuous memory access pattern created by attention sinks. The performance challenge lies in computing attention over non-contiguous cache regions (sinks + gap + window).

Memory Layout Optimization:

Standard attention kernels assume contiguous key/value sequences. StreamingLLM requires modified kernels that efficiently handle:

  • Sink region: Positions 0-3 (always accessed)
  • Gap region: Positions 4-3980 (masked out, never accessed)
  • Window region: Positions 3981-5000 (actively accessed)

Kernel Design Principles:

1. Shared Memory Utilization:

  • Load query vectors into shared memory (reused across all key positions)
  • Stream keys from global memory for dot product computation
  • Store attention scores in shared memory for softmax reduction

2. Efficient Masking:

  • Precompute boolean masks: is_sink = (position < sink_size), is_window = (position >= cache_size - window_size)
  • Apply mask during score computation (set invalid positions to -∞)
  • Avoid branching by using arithmetic operations: score = valid ? score : -INFINITY

3. Coalesced Memory Access:

  • Organize cache layout to enable coalesced reads: [batch, head, position, dim]
  • Each thread loads consecutive dimensions of the same position
  • Minimize warp divergence during conditional masking

4. Parallel Reduction for Softmax:

  • Compute attention scores in parallel (one thread per cache position)
  • Use warp-level reduction for softmax denominator: sum(exp(score_i))
  • Normalize scores in parallel to get attention weights
  • Compute weighted sum of values in parallel

Performance Characteristics:

| Optimization         | Speedup | Memory Bandwidth |
|----------------------|---------|------------------|
| Naive Implementation | 1.0×    | 100%             |
| Shared Memory        | 1.8×    | 65%              |
| Fused Masking        | 2.3×    | 60%              |
| Optimized Layout     | 3.2×    | 45%              |

Production Considerations:

  • Batch Processing: Process multiple queries simultaneously to amortize kernel launch overhead
  • Multi-Head Parallelism: Each head processes independently, ideal for GPU parallelization
  • Flash Attention Integration: Combine sink-aware masking with Flash Attention's memory-efficient algorithm
  • Mixed Precision: Use FP16 for storage, FP32 for accumulation to balance speed and accuracy

Configuration Guidelines

Selecting optimal sink and window sizes depends on your specific use case, model characteristics, and computational constraints. Here's a comprehensive decision framework:

Task-Based Configuration (a summary sketch follows these profiles):

Conversational AI (Chat, Assistants):

  • Sink size: 4 tokens
  • Window size: 1024-2048 tokens
  • Rationale: Conversations have natural turn boundaries. Context resets frequently, so smaller windows suffice. Minimal sinks provide stability without overhead.
  • Typical sequence length: 2K-8K tokens per conversation

Document Summarization:

  • Sink size: 8 tokens
  • Window size: 2048-4096 tokens
  • Rationale: Documents require broader context to capture main themes. More sinks help maintain document structure signals. Larger windows preserve more content.
  • Typical sequence length: 10K-50K tokens per document

Code Generation:

  • Sink size: 16 tokens
  • Window size: 4096-8192 tokens
  • Rationale: Code has hierarchical structure (functions, classes, modules). More sinks preserve structural context. Large windows maintain scope visibility.
  • Typical sequence length: 20K-100K tokens per codebase

Real-Time Transcription:

  • Sink size: 4 tokens
  • Window size: 512-1024 tokens
  • Rationale: Latency-critical application. Minimal cache reduces memory and computation. Recent context is most relevant for transcription accuracy.
  • Typical sequence length: Unbounded (streaming audio)
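
The task profiles above can be collected into a small configuration map; the values are taken from the upper end of each recommended range, and the structure itself is illustrative:

```python
# (sink_size, window_size) profiles, using the upper end of each recommended range.
STREAMING_PROFILES = {
    "chat":          (4, 2048),
    "summarization": (8, 4096),
    "code":          (16, 8192),
    "transcription": (4, 1024),
}

def cache_budget(task: str) -> int:
    """Total number of cached tokens for a task profile."""
    sink_size, window_size = STREAMING_PROFILES[task]
    return sink_size + window_size

print(cache_budget("chat"))  # 2052 cached tokens
```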

Model-Size Considerations:

| Model Size           | Recommended Sink Size | Recommended Window Size |
|----------------------|-----------------------|-------------------------|
| Small (< 1B params)  | 4                     | 512-1024                |
| Medium (1-7B params) | 4-8                   | 1024-2048               |
| Large (7-70B params) | 8-16                  | 2048-4096               |
| Extra Large (> 70B)  | 16-32                 | 4096-8192               |

Memory Constraint Calculation:

Total memory = (sink_size + window_size) × num_layers × 2 × hidden_dim × precision_bytes

Example for LLaMA-7B (32 layers, 4096 hidden dim, FP16); a quick check in code follows:

  • Config (4 sinks + 1024 window): 1028 × 32 × 2 × 4096 × 2 bytes ≈ 0.54 GB KV cache
  • Config (8 sinks + 2048 window): 2056 × 32 × 2 × 4096 × 2 bytes ≈ 1.1 GB KV cache
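
A quick check of these figures, using the formula above directly:

```python
def kv_cache_bytes(sink_size: int, window_size: int, num_layers: int,
                   hidden_dim: int, precision_bytes: int = 2) -> int:
    """Cached tokens x layers x (K and V) x hidden dim x bytes per value."""
    return (sink_size + window_size) * num_layers * 2 * hidden_dim * precision_bytes

# LLaMA-7B-style dimensions: 32 layers, 4096 hidden dim, FP16 (2 bytes).
print(kv_cache_bytes(4, 1024, 32, 4096) / 1e9)   # ~0.54 GB
print(kv_cache_bytes(8, 2048, 32, 4096) / 1e9)   # ~1.08 GB
```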

Best Practices

1. Sink Size Selection

Empirical Guidelines:

The optimal number of attention sinks depends on three factors that should be balanced:

Minimum Stability Threshold:

  • At least 4 sinks are required for stable generation across most models
  • Fewer than 4 causes attention distribution instability
  • This baseline provides sufficient "dump space" for excess attention

Window Size Scaling:

  • Larger windows benefit from more sinks; a useful rule of thumb is to double the sink count each time the window size quadruples
  • Window 512 → 4 sinks
  • Window 2048 → 8 sinks
  • Window 8192 → 16 sinks
  • Rationale: More positions need more attention stabilization points

Natural Token Boundaries:

  • Examine first N tokens of typical prompts (BOS + first few words)
  • If prompts consistently start with structured tokens (e.g., system prompts), preserve that entire prefix
  • Example: If system prompts are 12 tokens, use 12-16 sinks

Decision Process:

  1. Start with baseline: 4 sinks
  2. If window > 2048: scale the sink count up accordingly (e.g., 8 at 2048, 16 at 8192)
  3. If using structured prompts: align to prompt length
  4. Test and validate with perplexity measurements

2. Position Encoding

Maintaining Positional Accuracy:

Attention sinks create non-contiguous position sequences. Position encodings must reflect true token positions to maintain model accuracy:

Absolute Position Embeddings:

  • Each token keeps its original absolute position
  • Sink at position 0 always has position embedding PE(0)
  • Window token at position 5000 has position embedding PE(5000)
  • Never renumber positions to be sequential [0,1,2,3,4,5...]

RoPE (Rotary Position Embeddings):

  • Apply rotation based on true position indices
  • Sinks use positions [0,1,2,3]
  • Window uses positions [3981,3982,...,5000]
  • RoPE's relative nature handles gaps naturally—only distances between visible tokens matter
  • The gap (positions 4-3980) is simply absent; no special handling needed

ALiBi (Attention with Linear Biases):

  • Compute biases using true position differences
  • Bias between position 0 (sink) and position 5000 (new token) = -5000 × slope
  • Bias between position 3981 (window start) and 5000 (new token) = -1019 × slope
  • Gaps don't affect bias calculation—only distance between cached positions matters

Key Principle: Never "fix" positions to be contiguous. The model expects position gaps and handles them correctly through its learned position encoding mechanism.

3. Monitoring and Debugging

Attention Sink Health Metrics:

To ensure attention sinks are functioning correctly, monitor these key indicators during generation (a monitoring sketch follows these metrics):

Sink Attention Percentage:

  • Metric: Sum of attention weights assigned to sink positions, as a fraction of total attention
  • Healthy range: 20-40% total attention to sinks
  • Warning signs:
    • < 10%: Sinks not being used, may indicate misconfiguration
    • > 60%: Over-reliance on sinks, model may be struggling

Attention Entropy:

  • Metric: Shannon entropy of attention distribution: -Σ(p × log(p))
  • Healthy range: 3.5-5.5 (for typical sequence lengths)
  • Warning signs:
    • < 2.0: Attention too focused (possible degeneration)
    • > 6.0: Attention too diffuse (instability)

Maximum Attention Weight:

  • Metric: Highest attention weight assigned to any single position
  • Healthy range: 0.15-0.35
  • Warning signs:
    • > 0.5: Single position dominance (attention collapse)
    • < 0.05: No clear focus (may indicate confusion)

Perplexity Tracking:

  • Monitor perplexity every 1000 tokens during long generation
  • Gradual increase (10.3 → 10.9 over 10K tokens): Expected behavior
  • Sudden spike (10.5 → 30+): Critical failure, check sink preservation
  • Sustained > 25: Investigate cache corruption or position encoding issues
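
A sketch that computes the first three health metrics from a single step's attention distribution (a 1-D tensor over cached positions); the function name is illustrative and the thresholds in the comments mirror the guidance above:

```python
import torch

def attention_health(weights: torch.Tensor, sink_size: int = 4) -> dict[str, float]:
    """Sink share, entropy, and max weight for one attention distribution."""
    sink_share = float(weights[:sink_size].sum())                   # healthy: ~0.20-0.40
    entropy = float(-(weights * torch.log(weights + 1e-12)).sum())  # healthy: ~3.5-5.5
    max_weight = float(weights.max())                               # healthy: ~0.15-0.35
    return {"sink_share": sink_share, "entropy": entropy, "max_weight": max_weight}

# Example: a distribution over 4 sinks + 1020 window positions.
w = torch.softmax(torch.randn(1024), dim=-1)
print(attention_health(w))
```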

Debugging Common Issues:

| Symptom                            | Likely Cause                   | Solution                                       |
|------------------------------------|--------------------------------|------------------------------------------------|
| Perplexity spike after N tokens    | Window size = N, sinks evicted | Verify sink positions are preserved            |
| Low sink attention (< 10%)         | Position encoding mismatch     | Check position IDs match true positions        |
| Attention collapse to single token | Softmax numerical instability  | Increase numerical precision, check for NaN    |
| Gradual coherence loss             | Window too small for task      | Increase window size or use hierarchical sinks |
