Attention Sinks: Stable Streaming LLMs
Understand attention sinks, the phenomenon where LLMs concentrate attention on initial tokens, and how preserving them enables infinite-length streaming inference.
Attention Sinks: The Key to Streaming LLMs
Attention sinks are a fascinating phenomenon where language models naturally concentrate significant attention on initial tokens (like BOS), regardless of their semantic importance. This discovery enables efficient streaming inference with stable performance.
The Discovery
Researchers observed that LLMs consistently allocate high attention scores to initial tokens, even when these tokens carry no semantic meaning. These "sink" tokens serve as repositories for excess attention mass.
Why Attention Sinks Form
Softmax Constraint
The softmax operation requires the attention weights to sum to 1: for scores s_i, each weight is α_i = exp(s_i) / Σ_j exp(s_j), so Σ_i α_i = 1.
When a token doesn't strongly attend to any specific position, the model needs somewhere to "dump" the remaining attention mass.
Initial Token Bias
Initial tokens become natural sinks because:
- They're always visible (no causal masking)
- They're positionally distinct
- Models learn this pattern during training
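To make the constraint concrete, here is a minimal NumPy sketch (the scores are made up): even when a query matches nothing strongly, the softmax weights still sum to 1, so the attention mass has to land somewhere.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Hypothetical attention scores for one query with no strong match anywhere.
weak_scores = np.array([0.2, 0.1, 0.0, 0.1, 0.2, 0.1])
weights = softmax(weak_scores)

print(weights)        # roughly uniform -- the mass is spread thinly
print(weights.sum())  # always 1.0: the "excess" attention cannot simply disappear
```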
The Streaming Problem
Without Attention Sinks: The Catastrophic Failure
The Naive Approach:
Imagine processing a long document (100K tokens) with a 1K token cache. The obvious solution seems simple:
- Keep the most recent 1,000 tokens in memory
- Evict the oldest token when a new one arrives
- Slide this window forward as generation continues
What Happens:
Initial State (tokens 0-1000):
- Model allocates ~20-30% of total attention to initial tokens (BOS, first words)
- These tokens act as attention sinks
- Remaining 70-80% distributed across content tokens
- Perplexity: 10.5 (normal)
After Eviction (tokens 1001-2000):
- Initial sink tokens are gone from cache
- Model still wants to allocate 20-30% attention somewhere
- But there's nowhere natural to put it!
- Attention gets forcibly redistributed across random positions
Result:
- Perplexity spike to 60+ (6× worse!)
- Generation becomes incoherent
- Model "loses its grounding"
- Performance never recovers
Why This Fails:
The softmax attention mechanism requires all attention weights to sum to 1. When tokens don't need specific information, they still must allocate attention somewhere. The model learned during training to use initial tokens as "attention dumps." Removing these sinks violates the model's learned attention distribution, causing catastrophic failure.
With Attention Sinks: The StreamingLLM Solution
The Key Insight:
Instead of evicting everything old, preserve just the first 4 tokens permanently.
Cache Structure:
[Sink Tokens: 0-3] + [Sliding Window: most recent N-4 tokens] + [New Token]
Example with 1024-token cache:
- Positions 0-3: Always preserved (BOS + first 3 tokens)
- Positions 4-1023: Sliding window (most recent 1020 tokens)
- Total cache size: Still 1024 tokens
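As a sanity check on this layout, the set of token positions resident in the cache at any step can be written down directly. A small sketch assuming the 4-sink, 1024-entry configuration above (the helper name is illustrative):

```python
def cached_positions(current_token: int, sink_size: int = 4, cache_size: int = 1024) -> list[int]:
    """Return the token positions held in the cache after `current_token` has been processed."""
    if current_token < cache_size:
        # Phase 1: cache is still filling up, nothing evicted yet.
        return list(range(current_token + 1))
    window_size = cache_size - sink_size            # e.g. 1020
    window_start = current_token - window_size + 1  # most recent `window_size` tokens
    return list(range(sink_size)) + list(range(window_start, current_token + 1))

positions = cached_positions(8000)
print(positions[:6], "...", positions[-3:])  # [0, 1, 2, 3, 6981, 6982] ... [7998, 7999, 8000]
print(len(positions))                        # 1024
```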
What Happens:
Initial State (tokens 0-1000):
- Same as before
- Perplexity: 10.3
After Many Evictions (tokens 1001-8000):
- Sinks remain at positions 0-3
- Window contains tokens 6981-8000 (most recent 1020)
- Model still has somewhere to dump excess attention (the sinks!)
- Attention distribution remains stable
Result:
- Perplexity stays ~10.9 (only 6% degradation)
- Generation remains coherent
- Can process infinite-length streams
- Memory stays constant at 1024 tokens
Why This Works:
By preserving just 4 tokens that serve as attention sinks, we maintain the model's learned attention patterns. The sinks absorb excess attention mass, preventing forced redistribution. The model doesn't care that middle tokens (1000-7000) are missing—it only needs sinks for stability and the recent window for content.
Architecture: How StreamingLLM Works
Core Components
StreamingLLM extends standard transformer inference with three key modifications:
1. KV Cache Management
Standard Approach:
- Cache grows unbounded with sequence length
- Memory usage: O(n) where n = total tokens processed
- For 100K tokens: prohibitively expensive
StreamingLLM Approach:
- Fixed-size cache: sink_size + window_size (e.g., 4 + 1020 = 1024)
- Memory usage: O(window_size) - constant!
- For any length: fixed 1024 tokens in memory
Cache Structure:
Layer 1: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]
Layer 2: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]
...
Layer N: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]
Each transformer layer maintains its own cache with identical structure.
2. Cache Update Strategy
Phase 1: Building Up (tokens 0 to window_size)
Simply append each new token to cache:
- Token 0: Cache = [0]
- Token 1: Cache = [0, 1]
- ...
- Token 1023: Cache = [0, 1, ..., 1023]
No eviction yet—cache hasn't filled up.
Phase 2: Steady State (tokens > window_size)
For each new token:
- Identify sink region: First 4 tokens (0-3)
- Identify window region: Last 1020 tokens
- Concatenate: [sinks] + [window] + [new_token]
- Evict middle: Oldest non-sink token disappears
Example at token 5000:
- Sinks: tokens 0, 1, 2, 3
- Window: tokens 3981-5000 (most recent 1020)
- New: token 5001
- Evicted: tokens 4-3980 (not in cache)
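A minimal PyTorch sketch of this update rule, operating on per-layer key/value tensors of shape [batch, heads, seq_len, head_dim]. The class and method names are illustrative, not taken from any particular library:

```python
import torch

class SinkKVCache:
    """Fixed-size KV cache that always preserves the first `sink_size` entries."""

    def __init__(self, sink_size: int = 4, cache_size: int = 1024):
        self.sink_size = sink_size
        self.cache_size = cache_size
        self.k = None  # [batch, heads, seq, head_dim]
        self.v = None

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor):
        # Phase 1: just append until the cache is full.
        if self.k is None:
            self.k, self.v = new_k, new_v
        else:
            self.k = torch.cat([self.k, new_k], dim=2)
            self.v = torch.cat([self.v, new_v], dim=2)

        # Phase 2: once over budget, keep sinks + most recent window and drop the middle.
        if self.k.size(2) > self.cache_size:
            window = self.cache_size - self.sink_size
            self.k = torch.cat([self.k[:, :, : self.sink_size], self.k[:, :, -window:]], dim=2)
            self.v = torch.cat([self.v[:, :, : self.sink_size], self.v[:, :, -window:]], dim=2)
        return self.k, self.v

# Toy usage: 1 batch, 2 heads, 64-dim heads, one token at a time.
cache = SinkKVCache(sink_size=4, cache_size=1024)
for _ in range(5000):
    k = torch.randn(1, 2, 1, 64)
    v = torch.randn(1, 2, 1, 64)
    keys, values = cache.update(k, v)
print(keys.shape)  # torch.Size([1, 2, 1024, 64])
```

Note that the evicted middle is never actually stored: the cache only ever holds the sink entries plus the most recent window.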
3. Position Encoding Management
The Challenge:
Position encodings (RoPE, ALiBi, etc.) depend on token position. When we evict middle tokens, we create gaps in positions:
Actual positions in cache: [0, 1, 2, 3, 3981, 3982, ..., 5000]
But the model needs to know these are the correct positions, not sequential [0, 1, 2, 3, 4, 5, ...].
The Solution:
Maintain position IDs that reflect true positions:
For Sinks:
- Always positions 0, 1, 2, 3 (never change)
For Window:
- Use actual positions (3981, 3982, ...)
- Not relative positions (4, 5, ...)
For RoPE: Apply rotary embeddings using true positions. RoPE naturally handles gaps—it only cares about relative distances between visible tokens.
For ALiBi: Compute biases using true position differences. A gap doesn't matter because ALiBi only biases based on distance within the cache.
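Following the approach described above (keep true positions, never renumber), the sketch below builds the gapped position IDs for the example cache and applies a standard rotary embedding to them. The helper is a generic RoPE formulation for illustration, not a specific library's API:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings using explicit (possibly non-contiguous) positions.
    x: [seq, head_dim] with even head_dim; positions: [seq] integer token positions."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions.float()[:, None] * freqs[None, :]        # [seq, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Cache at the example step: sinks keep positions 0-3, the window keeps 3981-5000.
positions = torch.cat([torch.arange(0, 4), torch.arange(3981, 5001)])
keys = torch.randn(positions.numel(), 64)
rotated = rope_rotate(keys, positions)    # the gap 4-3980 simply never appears
print(positions.shape, rotated.shape)     # torch.Size([1024]) torch.Size([1024, 64])
```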
Attention Computation with Sinks
Step 1: Compute Scores
For the current token (query), compute attention scores against all cached tokens (keys):
- Score to sink 0, sink 1, sink 2, sink 3
- Score to window token 1, window token 2, ..., window token 1020
Standard scaled dot-product: score = (Q · K^T) / √d_k
Step 2: Create Attention Mask
Build a boolean mask indicating which positions are valid:
[True, True, True, True, False, False, ..., False, True, True, ..., True]
└─ sinks (0-3) ─┘   └────── evicted (4-3980) ──────┘   └─ window ─┘
Step 3: Mask Invalid Positions
Set scores for evicted positions to -∞ (they'll get 0 attention after softmax):
Scores before mask: [2.3, 1.8, 1.5, 1.2, ?, ?, ..., ?, 3.1, 2.8, ...]
Scores after mask:  [2.3, 1.8, 1.5, 1.2, -∞, -∞, ..., -∞, 3.1, 2.8, ...]
Step 4: Softmax
Apply softmax only over valid (non-masked) positions. The -∞ values become 0 probability:
Attention weights: [0.15, 0.08, 0.05, 0.04, 0, 0, ..., 0, 0.25, 0.18, ...]
Notice: Sinks get ~15% + 8% + 5% + 4% = 32% total attention (typical).
Step 5: Weighted Sum
Multiply attention weights by values and sum:
Output = 0.15·V₀ + 0.08·V₁ + 0.05·V₂ + 0.04·V₃ + 0.25·V₃₉₈₁ + ...
The sinks contribute significantly even though they're semantically irrelevant!
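Steps 1-5 in compact PyTorch form; the tensors are random stand-ins, and keeping the evicted keys around just to mask them mirrors the conceptual presentation above (a real cache would simply not store them):

```python
import math
import torch

def sink_attention(q, k, v, valid_mask):
    """q: [d], k/v: [seq, d], valid_mask: [seq] bool (True = sink or window, False = evicted)."""
    d = q.shape[-1]
    scores = (k @ q) / math.sqrt(d)                           # Step 1: scaled dot-product scores
    scores = scores.masked_fill(~valid_mask, float("-inf"))   # Steps 2-3: mask evicted positions
    weights = torch.softmax(scores, dim=-1)                   # Step 4: evicted slots get exactly 0
    return weights @ v, weights                               # Step 5: weighted sum of values

seq, d = 5001, 64                                             # positions 0..5000, as in the example
valid = torch.zeros(seq, dtype=torch.bool)
valid[:4] = True                                              # sinks 0-3
valid[3981:] = True                                           # window 3981-5000
q, k, v = torch.randn(d), torch.randn(seq, d), torch.randn(seq, d)

out, w = sink_attention(q, k, v, valid)
print(w[4:3981].sum().item())   # 0.0 -- evicted positions receive no attention
print(w[:4].sum().item())       # total attention absorbed by the sinks
```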
Generation Loop
Streaming Generation Process:
Step 1: Initialize
- Load model
- Create empty cache for all layers
- Set sink_size=4, cache_size=1024 (so the sliding window holds the most recent 1020 tokens)
Step 2: Process Prompt
- Tokenize prompt (e.g., "Write a story about" → [245, 2834, 891, 546])
- Forward pass through model with empty cache
- Cache stores keys/values for all prompt tokens
- Current position: 4
Step 3: Generate First Token
- Current query: last prompt token's representation
- Attend to: all 4 cached prompt tokens (at this stage they double as the sinks and the entire window)
- Sample from output distribution
- Generate token (e.g., "a" → 320)
- Update cache: append new key/value
- Current position: 5
Steps 4-1023: Continue Building
- Same process
- Cache grows but hasn't hit limit
- No eviction yet
Step 1024+: Steady State
- Cache is full
- For each new token:
- Extract sink keys/values (positions 0-3)
- Extract window keys/values (most recent 1020)
- Concatenate: sinks + window
- Generate next token
- Append new key/value to cache
- Oldest non-sink gets evicted automatically
Result:
- Can generate 100K+ tokens
- Memory stays constant
- Quality remains high (perplexity ~11 vs ~10 without streaming)
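The steady-state loop is short once the cache trimming is in place. The sketch below uses a stand-in forward function so it runs on its own; in a real deployment that call would be the transformer step returning logits plus the new key/value for each layer, and every name here is illustrative:

```python
import torch

SINK_SIZE, CACHE_SIZE, VOCAB, HEAD_DIM = 4, 1024, 32000, 64

def dummy_forward(token_id, cached_k, cached_v):
    """Stand-in for a transformer step (ignores its inputs): returns logits and this token's K/V."""
    logits = torch.randn(VOCAB)
    new_k = torch.randn(1, 1, 1, HEAD_DIM)
    new_v = torch.randn(1, 1, 1, HEAD_DIM)
    return logits, new_k, new_v

k = v = torch.empty(1, 1, 0, HEAD_DIM)           # empty cache (a single layer shown)
token = 245                                      # first prompt token id (illustrative)
generated = []

for step in range(5000):
    logits, new_k, new_v = dummy_forward(token, k, v)
    k = torch.cat([k, new_k], dim=2)
    v = torch.cat([v, new_v], dim=2)
    if k.size(2) > CACHE_SIZE:                   # steady state: keep sinks + recent window
        keep = CACHE_SIZE - SINK_SIZE
        k = torch.cat([k[:, :, :SINK_SIZE], k[:, :, -keep:]], dim=2)
        v = torch.cat([v[:, :, :SINK_SIZE], v[:, :, -keep:]], dim=2)
    token = int(torch.argmax(logits))            # greedy sampling for simplicity
    generated.append(token)

print(len(generated), k.size(2))                 # 5000 tokens generated, cache fixed at 1024
```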
Performance Analysis
Perplexity Comparison
Method | PPL@2K | PPL@8K | PPL@16K | Memory |
---|---|---|---|---|
Full Cache | 10.2 | 10.3 | 10.4 | O(n) |
Window (no sinks) | 10.5 | 63.5 | 450+ | O(w) |
Window + 4 Sinks | 10.3 | 10.9 | 11.2 | O(w) |
Memory Savings
For a 100K token stream with 1K window:
- Full cache: 100K × d_model memory
- StreamingLLM: 1K × d_model memory
- Savings: 99%
Advanced Techniques
Dynamic Sink Detection
Rather than hardcoding the number of attention sinks (typically 4), adaptive systems can automatically identify which tokens naturally act as sinks by analyzing attention patterns during inference.
Detection Methodology:
The system monitors attention weights across multiple generation steps, computing the average attention received by each token position. Tokens that consistently receive high attention (above a threshold like 10-15%) are identified as natural sinks.
Key Characteristics of Detected Sinks:
- Contiguous Positioning: Sinks typically form a contiguous block at the sequence start (positions 0-3 or 0-7)
- Consistent Attention: They receive stable attention across different contexts and prompts
- Position Independence: Their sink behavior doesn't depend on semantic content
Adaptive Sink Management:
The system maintains a rolling history of attention patterns (typically 100-200 tokens). Once sufficient data is collected, it analyzes the attention distribution to identify natural sink positions. This enables:
- Task-specific optimization: Different tasks may require different numbers of sinks
- Model-specific tuning: Some models use more/fewer sinks based on training
- Dynamic adjustment: Sink count can adapt during long generation sessions
Benefits:
- Eliminates manual tuning of sink size
- Adapts to model-specific attention patterns
- Can discover non-obvious sink configurations
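One possible implementation of this detection loop: keep a rolling history of attention snapshots, average them, and flag prefix positions whose mean share crosses a threshold. The history length, threshold, and candidate-prefix size are tunable assumptions:

```python
from collections import deque
import numpy as np

class SinkDetector:
    """Identify positions that consistently absorb a large share of attention."""

    def __init__(self, history: int = 200, threshold: float = 0.10, prefix: int = 32):
        self.records = deque(maxlen=history)  # rolling window of attention snapshots
        self.threshold = threshold            # average attention share that marks a sink
        self.prefix = prefix                  # only the first `prefix` positions are candidates

    def observe(self, attn_weights: np.ndarray):
        """attn_weights: attention from the current query to cached positions (sums to 1)."""
        self.records.append(attn_weights[: self.prefix].copy())

    def detected_sinks(self) -> list[int]:
        if len(self.records) < self.records.maxlen:
            return []                          # not enough data collected yet
        avg = np.mean(np.stack(list(self.records)), axis=0)
        return [i for i, a in enumerate(avg) if a >= self.threshold]

# Toy usage: position 0 absorbs roughly 25% of attention, everything else is diffuse.
det = SinkDetector(history=200, threshold=0.10)
for _ in range(200):
    w = np.random.dirichlet(np.ones(1024))
    w[0] += 0.3
    det.observe(w / w.sum())
print(det.detected_sinks())   # typically [0]
```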
Hierarchical Sinks
For extremely long contexts (100K+ tokens), a single-level sink system may be insufficient. Hierarchical sinks use multi-scale preservation to maintain references at different temporal distances.
Multi-Level Preservation Strategy:
Instead of preserving only the first 4 tokens, hierarchical systems preserve tokens at multiple scales:
- Level 1 (Immediate): First 4 tokens (always preserved)
- Level 2 (Short-term): Every 16th token from recent 1,000 tokens
- Level 3 (Long-term): Every 64th token from positions 1,000-10,000
Cache Composition Example:
For a 50,000 token sequence with 1,024 token cache:
- 4 level-1 sinks (positions 0-3)
- ~60 level-2 sinks (every 16th position from recent window)
- ~150 level-3 sinks (every 64th position from middle range)
- ~810 most recent tokens (sliding window)
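The kept-position set for such a scheme can be computed directly; the strides and ranges below mirror the example composition above and are otherwise arbitrary:

```python
def hierarchical_keep_set(current: int, cache_size: int = 1024) -> list[int]:
    """Token positions to retain: level-1 sinks, strided level-2/3 samples, recent window."""
    keep = set(range(4))                                      # level 1: first 4 tokens
    keep |= set(range(max(4, current - 1000), current, 16))   # level 2: every 16th of last 1,000
    keep |= set(range(1000, min(10000, current), 64))         # level 3: every 64th of 1,000-10,000
    budget = cache_size - len(keep)                           # remaining slots for the recent window
    window = [p for p in range(current - 1, -1, -1) if p not in keep][:budget]
    return sorted(keep | set(window))

positions = hierarchical_keep_set(50_000)
print(len(positions))          # 1024
print(positions[:6])           # [0, 1, 2, 3, 1000, 1064]  (sinks, then level-3 samples)
```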
Attention Distribution:
Hierarchical sinks enable the model to:
- Access immediate context (level 1) for stability
- Reference recent patterns (level 2) for coherence
- Recall distant events (level 3) for long-range dependencies
Use Cases:
- Document summarization: Maintain references to different sections
- Long-form generation: Track narrative threads across chapters
- Multi-turn dialogue: Preserve conversation history at multiple scales
Sink Recycling
In scenarios where the context changes dramatically (e.g., switching topics, processing multiple documents), the original sinks may become less relevant. Sink recycling enables gradual transition to new attention anchors.
Transition Strategy:
When attention patterns indicate a context shift, the system gradually phases out old sinks while introducing new ones:
- Detection: Identify when attention to current sinks drops below threshold
- Decay: Gradually reduce the strength of old sink representations (e.g., multiply by 0.9 each step)
- Replacement: Introduce new candidate sinks from recent high-attention tokens
- Stabilization: Monitor new sinks for consistency before full commitment
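A schematic version of this decay-and-replace bookkeeping; the decay factor and thresholds are illustrative assumptions, not a published recipe:

```python
class SinkRecycler:
    """Track smoothed attention per position; retire faded sinks, promote new anchors."""

    def __init__(self, decay: float = 0.9, retire_below: float = 0.005, promote_above: float = 0.10):
        self.decay = decay
        self.retire_below = retire_below      # drop bookkeeping once a position fades this far
        self.promote_above = promote_above    # smoothed share needed to count as an active sink
        self.strength = {}                    # position -> exponentially smoothed attention share

    def step(self, attn_share: dict[int, float]) -> list[int]:
        """attn_share: observed attention share per candidate position at this step."""
        for pos in list(self.strength):
            self.strength[pos] *= self.decay                   # everything decays each step
        for pos, share in attn_share.items():
            self.strength[pos] = self.strength.get(pos, 0.0) + (1 - self.decay) * share
        self.strength = {p: s for p, s in self.strength.items() if s >= self.retire_below}
        return sorted(p for p, s in self.strength.items() if s >= self.promote_above)

# Toy usage: original sinks fade after a document boundary, position 5001 takes over.
rec = SinkRecycler()
for t in range(100):
    obs = {0: 0.12, 1: 0.08, 5001: 0.01} if t < 50 else {0: 0.01, 1: 0.01, 5001: 0.15}
    active = rec.step(obs)
print(active)   # [5001] -- the new anchor was promoted gradually, the old sinks dropped out
```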
Practical Application - Multi-Document Processing:
When processing a sequence of articles:
- Document 1 (tokens 0-5000): Sinks at positions 0-3
- Document 2 (tokens 5001-10000): Begin decaying original sinks, introduce sinks at 5001-5004
- Document 3 (tokens 10001-15000): Original sinks fully retired, new sinks established
Benefits:
- Prevents stale sinks from consuming attention
- Adapts to changing content domains
- Maintains stability during transitions
Challenges:
- Determining optimal decay rate (too fast = instability, too slow = wasted attention)
- Identifying genuine context shifts vs. temporary topic changes
- Managing multiple active sink sets during transition periods
Production Deployment
GPU Optimization Strategies
Efficient streaming attention requires careful GPU kernel design to handle the discontinuous memory access pattern created by attention sinks. The performance challenge lies in computing attention over non-contiguous cache regions (sinks + gap + window).
Memory Layout Optimization:
Standard attention kernels assume contiguous key/value sequences. StreamingLLM requires modified kernels that efficiently handle:
- Sink region: Positions 0-3 (always accessed)
- Gap region: Positions 4-3980 (masked out, never accessed)
- Window region: Positions 3981-5000 (actively accessed)
Kernel Design Principles:
1. Shared Memory Utilization:
- Load query vectors into shared memory (reused across all key positions)
- Stream keys from global memory for dot product computation
- Store attention scores in shared memory for softmax reduction
2. Efficient Masking:
- Precompute boolean masks: is_sink = (position < sink_size), is_window = (position >= current_length - window_size)
- Apply mask during score computation (set invalid positions to -∞)
- Avoid divergent branching by expressing the mask as a predicated select:
score = valid ? score : -INFINITY (compiles to a select instruction rather than a divergent branch)
3. Coalesced Memory Access:
- Organize cache layout to enable coalesced reads: [batch, head, position, dim]
- Each thread loads consecutive dimensions of the same position
- Minimize warp divergence during conditional masking
4. Parallel Reduction for Softmax:
- Compute attention scores in parallel (one thread per cache position)
- Use warp-level reduction for the softmax denominator sum(exp(score_i))
- Normalize scores in parallel to get attention weights
- Compute weighted sum of values in parallel
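The masking logic from principle 2, written out in PyTorch for clarity; a production kernel would fuse the same predicate into its CUDA score loop, and torch.where lowers to a select rather than a divergent branch:

```python
import math
import torch

def sink_window_mask(seq_len: int, sink_size: int, window_size: int) -> torch.Tensor:
    """Boolean validity mask over absolute positions 0..seq_len-1 (sinks + recent window)."""
    pos = torch.arange(seq_len)
    is_sink = pos < sink_size
    is_window = pos >= (seq_len - window_size)
    return is_sink | is_window

def masked_scores(q: torch.Tensor, k: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    scores = (k @ q) / math.sqrt(q.shape[-1])
    # Predicated select instead of a data-dependent branch per position.
    return torch.where(valid, scores, torch.full_like(scores, float("-inf")))

valid = sink_window_mask(seq_len=5001, sink_size=4, window_size=1020)
q, k = torch.randn(64), torch.randn(5001, 64)
print(torch.softmax(masked_scores(q, k, valid), dim=-1)[4:3981].sum())  # tensor(0.)
```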
Performance Characteristics:
Optimization | Speedup | Memory Bandwidth |
---|---|---|
Naive Implementation | 1.0× | 100% |
Shared Memory | 1.8× | 65% |
Fused Masking | 2.3× | 60% |
Optimized Layout | 3.2× | 45% |
Production Considerations:
- Batch Processing: Process multiple queries simultaneously to amortize kernel launch overhead
- Multi-Head Parallelism: Each head processes independently, ideal for GPU parallelization
- Flash Attention Integration: Combine sink-aware masking with Flash Attention's memory-efficient algorithm
- Mixed Precision: Use FP16 for storage, FP32 for accumulation to balance speed and accuracy
Configuration Guidelines
Selecting optimal sink and window sizes depends on your specific use case, model characteristics, and computational constraints. Here's a comprehensive decision framework:
Task-Based Configuration:
Conversational AI (Chat, Assistants):
- Sink size: 4 tokens
- Window size: 1024-2048 tokens
- Rationale: Conversations have natural turn boundaries. Context resets frequently, so smaller windows suffice. Minimal sinks provide stability without overhead.
- Typical sequence length: 2K-8K tokens per conversation
Document Summarization:
- Sink size: 8 tokens
- Window size: 2048-4096 tokens
- Rationale: Documents require broader context to capture main themes. More sinks help maintain document structure signals. Larger windows preserve more content.
- Typical sequence length: 10K-50K tokens per document
Code Generation:
- Sink size: 16 tokens
- Window size: 4096-8192 tokens
- Rationale: Code has hierarchical structure (functions, classes, modules). More sinks preserve structural context. Large windows maintain scope visibility.
- Typical sequence length: 20K-100K tokens per codebase
Real-Time Transcription:
- Sink size: 4 tokens
- Window size: 512-1024 tokens
- Rationale: Latency-critical application. Minimal cache reduces memory and computation. Recent context is most relevant for transcription accuracy.
- Typical sequence length: Unbounded (streaming audio)
Model-Size Considerations:
Model Size | Recommended Sink Size | Recommended Window Size |
---|---|---|
Small (< 1B params) | 4 | 512-1024 |
Medium (1-7B params) | 4-8 | 1024-2048 |
Large (7-70B params) | 8-16 | 2048-4096 |
Extra Large (> 70B) | 16-32 | 4096-8192 |
Memory Constraint Calculation:
Total memory = (sink_size + window_size) × num_layers × 2 × hidden_dim × precision_bytes
Example for LLaMA-7B (32 layers, 4096 hidden dim, FP16):
- Config (4 sinks + 1024 window): 1028 × 32 × 2 × 4096 × 2 bytes ≈ 0.54 GB KV cache
- Config (8 sinks + 2048 window): 2056 × 32 × 2 × 4096 × 2 bytes ≈ 1.08 GB KV cache
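The same calculation as a small helper (LLaMA-7B-like shapes are used for the example; the formula counts one key and one value vector of hidden_dim per layer per token):

```python
def kv_cache_bytes(sink_size: int, window_size: int, num_layers: int,
                   hidden_dim: int, precision_bytes: int = 2) -> int:
    """Total KV-cache size in bytes for a fixed sink + window configuration."""
    tokens = sink_size + window_size
    return tokens * num_layers * 2 * hidden_dim * precision_bytes  # 2 = keys and values

# LLaMA-7B-like shapes: 32 layers, hidden_dim 4096, FP16 storage.
for sinks, window in [(4, 1024), (8, 2048)]:
    gb = kv_cache_bytes(sinks, window, num_layers=32, hidden_dim=4096) / 1e9
    print(f"{sinks} sinks + {window} window: {gb:.2f} GB")
# 4 sinks + 1024 window: 0.54 GB
# 8 sinks + 2048 window: 1.08 GB
```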
Best Practices
1. Sink Size Selection
Empirical Guidelines:
The optimal number of attention sinks depends on three factors that should be balanced:
Minimum Stability Threshold:
- At least 4 sinks are required for stable generation across most models
- Fewer than 4 causes attention distribution instability
- This baseline provides sufficient "dump space" for excess attention
Window Size Scaling:
- Larger windows benefit from more sinks; a useful rule of thumb is to double the sink count each time the window size quadruples:
- Window 512 → 4 sinks
- Window 2048 → 8 sinks
- Window 8192 → 16 sinks
- Rationale: More positions need more attention stabilization points
Natural Token Boundaries:
- Examine first N tokens of typical prompts (BOS + first few words)
- If prompts consistently start with structured tokens (e.g., system prompts), preserve that entire prefix
- Example: If system prompts are 12 tokens, use 12-16 sinks
Decision Process:
- Start with baseline: 4 sinks
- If window > 2048: scale the sink count up using the rule above (8 at 2048, 16 at 8192)
- If using structured prompts: align to prompt length
- Test and validate with perplexity measurements
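The decision process can be folded into a single helper. The scaling rule implements "double the sinks each time the window quadruples" from above; the prompt-alignment clamp is an assumption:

```python
import math

def suggested_sink_size(window_size: int, structured_prompt_len: int = 0) -> int:
    """Heuristic sink count: 4 at window 512, doubling each time the window quadruples."""
    base = 4 * math.sqrt(window_size / 512)          # 512 -> 4, 2048 -> 8, 8192 -> 16
    sinks = max(4, round(base))
    if structured_prompt_len:                        # preserve a fixed system-prompt prefix
        sinks = max(sinks, structured_prompt_len)
    return sinks

print(suggested_sink_size(512), suggested_sink_size(2048), suggested_sink_size(8192))  # 4 8 16
print(suggested_sink_size(1024, structured_prompt_len=12))                             # 12
```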
2. Position Encoding
Maintaining Positional Accuracy:
Attention sinks create non-contiguous position sequences. Position encodings must reflect true token positions to maintain model accuracy:
Absolute Position Embeddings:
- Each token keeps its original absolute position
- Sink at position 0 always has position embedding PE(0)
- Window token at position 5000 has position embedding PE(5000)
- Never renumber positions to be sequential [0,1,2,3,4,5...]
RoPE (Rotary Position Embeddings):
- Apply rotation based on true position indices
- Sinks use positions [0,1,2,3]
- Window uses positions [3981,3982,...,5000]
- RoPE's relative nature handles gaps naturally—only distances between visible tokens matter
- The gap (positions 4-3980) is simply absent; no special handling needed
ALiBi (Attention with Linear Biases):
- Compute biases using true position differences
- Bias between position 0 (sink) and position 5000 (new token) = -5000 × slope
- Bias between position 3981 (window start) and 5000 (new token) = -1019 × slope
- Gaps don't affect bias calculation—only distance between cached positions matters
Key Principle: Never "fix" positions to be contiguous. The model expects position gaps and handles them correctly through its learned position encoding mechanism.
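For ALiBi the principle reduces to a distance computation on true positions; this tiny sketch reproduces the numbers above for a single head with slope 1:

```python
import torch

def alibi_bias(query_pos: int, key_positions: torch.Tensor, slope: float) -> torch.Tensor:
    """ALiBi additive bias: -slope * distance, computed on true (gapped) positions."""
    return -slope * (query_pos - key_positions).float()

key_positions = torch.cat([torch.arange(0, 4), torch.arange(3981, 5001)])  # sinks + window
bias = alibi_bias(query_pos=5000, key_positions=key_positions, slope=1.0)

print(bias[0].item())    # -5000.0  (sink at position 0)
print(bias[4].item())    # -1019.0  (window start at position 3981)
```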
3. Monitoring and Debugging
Attention Sink Health Metrics:
To ensure attention sinks are functioning correctly, monitor these key indicators during generation:
Sink Attention Percentage:
- Metric: Sum of attention weights assigned to the sink positions, as a fraction of total attention
- Healthy range: 20-40% total attention to sinks
- Warning signs:
- < 10%: Sinks not being used, may indicate misconfiguration
- > 60%: Over-reliance on sinks, model may be struggling
Attention Entropy:
- Metric: Shannon entropy of attention distribution: -Σ(p × log(p))
- Healthy range: 3.5-5.5 (for typical sequence lengths)
- Warning signs:
- < 2.0: Attention too focused (possible degeneration)
- > 6.0: Attention too diffuse (instability)
Maximum Attention Weight:
- Metric: Highest attention weight assigned to any single position
- Healthy range: 0.15-0.35
- Warning signs:
- > 0.5: Single position dominance (attention collapse)
- < 0.05: No clear focus (may indicate confusion)
Perplexity Tracking:
- Monitor perplexity every 1000 tokens during long generation
- Gradual increase (10.3 → 10.9 over 10K tokens): Expected behavior
- Sudden spike (10.5 → 30+): Critical failure, check sink preservation
- Sustained > 25: Investigate cache corruption or position encoding issues
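These indicators are cheap to compute from a single attention row. A monitoring sketch using the thresholds quoted above (the thresholds are this document's guidance, not universal constants):

```python
import numpy as np

def attention_health(attn: np.ndarray, sink_size: int = 4) -> dict:
    """attn: attention weights from the current query to all cached positions (sums to 1)."""
    attn = np.asarray(attn, dtype=np.float64)
    sink_share = float(attn[:sink_size].sum())
    entropy = float(-(attn * np.log(attn + 1e-12)).sum())   # Shannon entropy in nats
    max_weight = float(attn.max())
    return {
        "sink_share": sink_share,                   # healthy: ~0.20-0.40
        "healthy_sink": 0.10 <= sink_share <= 0.60,
        "entropy": entropy,                         # healthy: ~3.5-5.5 for typical lengths
        "healthy_entropy": 2.0 <= entropy <= 6.0,
        "max_weight": max_weight,                   # healthy: ~0.15-0.35
        "healthy_focus": 0.05 <= max_weight <= 0.5,
    }

# Toy attention row: roughly 25-30% of the mass pushed onto the first 4 positions.
weights = np.random.dirichlet(np.ones(1024))
weights[:4] += 0.30 / 4
weights /= weights.sum()
print(attention_health(weights))
```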
Debugging Common Issues:
Symptom | Likely Cause | Solution |
---|---|---|
Perplexity spike after N tokens | Window size = N, sinks evicted | Verify sink positions preserved |
Low sink attention (< 10%) | Position encoding mismatch | Check position IDs match true positions |
Attention collapse to single token | Softmax numerical instability | Increase numerical precision, check for NaN |
Gradual coherence loss | Window too small for task | Increase window size or use hierarchical sinks |