Attention Sinks: Stable Streaming LLMs
Understand attention sinks, the phenomenon where LLMs concentrate attention on initial tokens, and how preserving them enables infinite-length streaming inference.
Attention Sinks: The Key to Streaming LLMs
Attention sinks are a fascinating phenomenon where language models naturally concentrate significant attention on initial tokens (like BOS), regardless of their semantic importance. This discovery enables efficient streaming inference with stable performance.
The Discovery
Researchers observed that LLMs consistently allocate high attention scores to initial tokens, even when these tokens carry no semantic meaning. These "sink" tokens serve as repositories for excess attention mass.
Why Attention Sinks Form
Softmax Constraint
The softmax operation requires the attention weights to sum to 1: for scores s_i, each weight is α_i = exp(s_i) / Σ_j exp(s_j), so Σ_i α_i = 1.
When a token doesn't strongly attend to any specific position, the model needs somewhere to "dump" the remaining attention mass.
Initial Token Bias
Initial tokens become natural sinks because:
- They're always visible (no causal masking)
- They're positionally distinct
- Models learn this pattern during training
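To make the constraint concrete, here is a minimal NumPy sketch (the scores are made up): even when a query matches nothing strongly, the softmax weights still sum to 1, so the attention mass has to land somewhere.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Hypothetical attention scores for one query with no strong match anywhere.
weak_scores = np.array([0.2, 0.1, 0.0, 0.1, 0.2, 0.1])
weights = softmax(weak_scores)

print(weights)        # roughly uniform -- the mass is spread thinly
print(weights.sum())  # always 1.0: the "excess" attention cannot simply disappear
```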
The Streaming Problem
Without Attention Sinks: The Catastrophic Failure
The Naive Approach:
Imagine processing a long document (100K tokens) with a 1K token cache. The obvious solution seems simple:
- Keep the most recent 1,000 tokens in memory
- Evict the oldest token when a new one arrives
- Slide this window forward as generation continues
What Happens:
Initial State (tokens 0-1000):
- Model allocates ~20-30% of total attention to initial tokens (BOS, first words)
- These tokens act as attention sinks
- Remaining 70-80% distributed across content tokens
- Perplexity: 10.5 (normal)
After Eviction (tokens 1001-2000):
- Initial sink tokens are gone from cache
- Model still wants to allocate 20-30% attention somewhere
- But there's nowhere natural to put it!
- Attention gets forcibly redistributed across random positions
Result:
- Perplexity spike to 60+ (6× worse!)
- Generation becomes incoherent
- Model "loses its grounding"
- Performance never recovers
Why This Fails:
The softmax attention mechanism requires all attention weights to sum to 1. When tokens don't need specific information, they still must allocate attention somewhere. The model learned during training to use initial tokens as "attention dumps." Removing these sinks violates the model's learned attention distribution, causing catastrophic failure.
With Attention Sinks: The StreamingLLM Solution
The Key Insight:
Instead of evicting everything old, preserve just the first 4 tokens permanently.
Cache Structure:
[Sink Tokens: 0-3] + [Sliding Window: most recent N-4 tokens] + [New Token]
Example with 1024-token cache:
- Positions 0-3: Always preserved (BOS + first 3 tokens)
- Positions 4-1023: Sliding window (most recent 1020 tokens)
- Total cache size: Still 1024 tokens
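As a sanity check on this layout, the set of token positions resident in the cache at any step can be written down directly. A small sketch assuming the 4-sink, 1024-entry configuration above (the helper name is illustrative):

```python
def cached_positions(current_token: int, sink_size: int = 4, cache_size: int = 1024) -> list[int]:
    """Return the token positions held in the cache after `current_token` has been processed."""
    if current_token < cache_size:
        # Phase 1: cache is still filling up, nothing evicted yet.
        return list(range(current_token + 1))
    window_size = cache_size - sink_size            # e.g. 1020
    window_start = current_token - window_size + 1  # most recent `window_size` tokens
    return list(range(sink_size)) + list(range(window_start, current_token + 1))

positions = cached_positions(8000)
print(positions[:6], "...", positions[-3:])  # [0, 1, 2, 3, 6981, 6982] ... [7998, 7999, 8000]
print(len(positions))                        # 1024
```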
What Happens:
Initial State (tokens 0-1000):
- Same as before
- Perplexity: 10.3
After Many Evictions (tokens 1001-8000):
- Sinks remain at positions 0-3
- Window contains tokens 6981-8000 (most recent 1020)
- Model still has somewhere to dump excess attention (the sinks!)
- Attention distribution remains stable
Result:
- Perplexity stays ~10.9 (only 6% degradation)
- Generation remains coherent
- Can process infinite-length streams
- Memory stays constant at 1024 tokens
Why This Works:
By preserving just 4 tokens that serve as attention sinks, we maintain the model's learned attention patterns. The sinks absorb excess attention mass, preventing forced redistribution. The model doesn't care that middle tokens (1000-7000) are missing—it only needs sinks for stability and the recent window for content.
Architecture: How StreamingLLM Works
Core Components
StreamingLLM extends standard transformer inference with three key modifications:
1. KV Cache Management
Standard Approach:
- Cache grows unbounded with sequence length
- Memory usage: O(n) where n = total tokens processed
- For 100K tokens: prohibitively expensive
StreamingLLM Approach:
- Fixed-size cache: sink_size + window_size (e.g., 4 + 1020 = 1024)
- Memory usage: O(window_size) - constant!
- For any length: fixed 1024 tokens in memory
Cache Structure:
Layer 1: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]
Layer 2: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]
...
Layer N: [Sink: 0-3] [Gap: evicted tokens] [Window: recent 1020] [New]
Each transformer layer maintains its own cache with identical structure.
2. Cache Update Strategy
Phase 1: Building Up (tokens 0 to window_size)
Simply append each new token to cache:
- Token 0: Cache = [0]
- Token 1: Cache = [0, 1]
- ...
- Token 1023: Cache = [0, 1, ..., 1023]
No eviction yet—cache hasn't filled up.
Phase 2: Steady State (tokens > window_size)
For each new token:
- Identify sink region: First 4 tokens (0-3)
- Identify window region: Last 1020 tokens
- Concatenate: [sinks] + [window] + [new_token]
- Evict middle: Oldest non-sink token disappears
Example at token 5000:
- Sinks: tokens 0, 1, 2, 3
- Window: tokens 3981-5000 (most recent 1020)
- New: token 5001
- Evicted: tokens 4-3980 (not in cache)
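A minimal PyTorch sketch of this update rule, operating on per-layer key/value tensors of shape [batch, heads, seq_len, head_dim]. The class and method names are illustrative, not taken from any particular library:

```python
import torch

class SinkKVCache:
    """Fixed-size KV cache that always preserves the first `sink_size` entries."""

    def __init__(self, sink_size: int = 4, cache_size: int = 1024):
        self.sink_size = sink_size
        self.cache_size = cache_size
        self.k = None  # [batch, heads, seq, head_dim]
        self.v = None

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor):
        # Phase 1: just append until the cache is full.
        if self.k is None:
            self.k, self.v = new_k, new_v
        else:
            self.k = torch.cat([self.k, new_k], dim=2)
            self.v = torch.cat([self.v, new_v], dim=2)

        # Phase 2: once over budget, keep sinks + most recent window and drop the middle.
        if self.k.size(2) > self.cache_size:
            window = self.cache_size - self.sink_size
            self.k = torch.cat([self.k[:, :, : self.sink_size], self.k[:, :, -window:]], dim=2)
            self.v = torch.cat([self.v[:, :, : self.sink_size], self.v[:, :, -window:]], dim=2)
        return self.k, self.v

# Toy usage: 1 batch, 2 heads, 64-dim heads, one token at a time.
cache = SinkKVCache(sink_size=4, cache_size=1024)
for _ in range(5000):
    k = torch.randn(1, 2, 1, 64)
    v = torch.randn(1, 2, 1, 64)
    keys, values = cache.update(k, v)
print(keys.shape)  # torch.Size([1, 2, 1024, 64])
```

Note that the evicted middle is never actually stored: the cache only ever holds the sink entries plus the most recent window.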
3. Position Encoding Management
The Challenge:
Position encodings (RoPE, ALiBi, etc.) depend on token position. When we evict middle tokens, we create gaps in positions:
Actual positions in cache: [0, 1, 2, 3, 3981, 3982, ..., 5000]
But the model needs to know these are the correct positions, not sequential [0, 1, 2, 3, 4, 5, ...].
The Solution:
Maintain position IDs that reflect true positions:
For Sinks:
- Always positions 0, 1, 2, 3 (never change)
For Window:
- Use actual positions (3981, 3982, ...)
- Not relative positions (4, 5, ...)
For RoPE: Apply rotary embeddings using true positions. RoPE naturally handles gaps—it only cares about relative distances between visible tokens.
For ALiBi: Compute biases using true position differences. A gap doesn't matter because ALiBi only biases based on distance within the cache.
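Following the approach described above (keep true positions, never renumber), the sketch below builds the gapped position IDs for the example cache and applies a standard rotary embedding to them. The helper is a generic RoPE formulation for illustration, not a specific library's API:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings using explicit (possibly non-contiguous) positions.
    x: [seq, head_dim] with even head_dim; positions: [seq] integer token positions."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions.float()[:, None] * freqs[None, :]        # [seq, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Cache at the example step: sinks keep positions 0-3, the window keeps 3981-5000.
positions = torch.cat([torch.arange(0, 4), torch.arange(3981, 5001)])
keys = torch.randn(positions.numel(), 64)
rotated = rope_rotate(keys, positions)    # the gap 4-3980 simply never appears
print(positions.shape, rotated.shape)     # torch.Size([1024]) torch.Size([1024, 64])
```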
Attention Computation with Sinks
Step 1: Compute Scores
For the current token (query), compute attention scores against all cached tokens (keys):
- Score to sink 0, sink 1, sink 2, sink 3
- Score to window token 1, window token 2, ..., window token 1020
Standard scaled dot-product: score = (Q · K^T) / √d_k
Step 2: Create Attention Mask
Build a boolean mask indicating which positions are valid:
[True, True, True, True, False, False, ..., False, True, True, ..., True]
└─ sinks (0-3) ─┘   └────── evicted (4-3980) ──────┘   └─ window ─┘
Step 3: Mask Invalid Positions
Set scores for evicted positions to -∞ (they'll get 0 attention after softmax):
Scores before mask: [2.3, 1.8, 1.5, 1.2, ?, ?, ..., ?, 3.1, 2.8, ...]
Scores after mask:  [2.3, 1.8, 1.5, 1.2, -∞, -∞, ..., -∞, 3.1, 2.8, ...]
Step 4: Softmax
Apply softmax only over valid (non-masked) positions. The -∞ values become 0 probability:
Attention weights: [0.15, 0.08, 0.05, 0.04, 0, 0, ..., 0, 0.25, 0.18, ...]
Notice: Sinks get ~15% + 8% + 5% + 4% = 32% total attention (typical).
Step 5: Weighted Sum
Multiply attention weights by values and sum:
Output = 0.15·V₀ + 0.08·V₁ + 0.05·V₂ + 0.04·V₃ + 0.25·V₃₉₈₁ + ...
The sinks contribute significantly even though they're semantically irrelevant!
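Steps 1-5 in compact PyTorch form; the tensors are random stand-ins, and keeping the evicted keys around just to mask them mirrors the conceptual presentation above (a real cache would simply not store them):

```python
import math
import torch

def sink_attention(q, k, v, valid_mask):
    """q: [d], k/v: [seq, d], valid_mask: [seq] bool (True = sink or window, False = evicted)."""
    d = q.shape[-1]
    scores = (k @ q) / math.sqrt(d)                           # Step 1: scaled dot-product scores
    scores = scores.masked_fill(~valid_mask, float("-inf"))   # Steps 2-3: mask evicted positions
    weights = torch.softmax(scores, dim=-1)                   # Step 4: evicted slots get exactly 0
    return weights @ v, weights                               # Step 5: weighted sum of values

seq, d = 5001, 64                                             # positions 0..5000, as in the example
valid = torch.zeros(seq, dtype=torch.bool)
valid[:4] = True                                              # sinks 0-3
valid[3981:] = True                                           # window 3981-5000
q, k, v = torch.randn(d), torch.randn(seq, d), torch.randn(seq, d)

out, w = sink_attention(q, k, v, valid)
print(w[4:3981].sum().item())   # 0.0 -- evicted positions receive no attention
print(w[:4].sum().item())       # total attention absorbed by the sinks
```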
Generation Loop
Streaming Generation Process:
Step 1: Initialize
- Load model
- Create empty cache for all layers
- Set sink_size=4, cache_size=1024 (so the sliding window holds the most recent 1020 tokens)
Step 2: Process Prompt
- Tokenize prompt (e.g., "Write a story about" → [245, 2834, 891, 546])
- Forward pass through model with empty cache
- Cache stores keys/values for all prompt tokens
- Current position: 4
Step 3: Generate First Token
- Current query: last prompt token's representation
- Attend to: all 4 cached prompt tokens (at this stage they double as the sinks and the entire window)
- Sample from output distribution
- Generate token (e.g., "a" → 320)
- Update cache: append new key/value
- Current position: 5
Steps 4-1023: Continue Building
- Same process
- Cache grows but hasn't hit limit
- No eviction yet
Step 1024+: Steady State
- Cache is full
- For each new token:
- Extract sink keys/values (positions 0-3)
- Extract window keys/values (most recent 1020)
- Concatenate: sinks + window
- Generate next token
- Append new key/value to cache
- Oldest non-sink gets evicted automatically
Result:
- Can generate 100K+ tokens
- Memory stays constant
- Quality remains high (perplexity ~11 vs ~10 without streaming)
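The steady-state loop is short once the cache trimming is in place. The sketch below uses a stand-in forward function so it runs on its own; in a real deployment that call would be the transformer step returning logits plus the new key/value for each layer, and every name here is illustrative:

```python
import torch

SINK_SIZE, CACHE_SIZE, VOCAB, HEAD_DIM = 4, 1024, 32000, 64

def dummy_forward(token_id, cached_k, cached_v):
    """Stand-in for a transformer step (ignores its inputs): returns logits and this token's K/V."""
    logits = torch.randn(VOCAB)
    new_k = torch.randn(1, 1, 1, HEAD_DIM)
    new_v = torch.randn(1, 1, 1, HEAD_DIM)
    return logits, new_k, new_v

k = v = torch.empty(1, 1, 0, HEAD_DIM)           # empty cache (a single layer shown)
token = 245                                      # first prompt token id (illustrative)
generated = []

for step in range(5000):
    logits, new_k, new_v = dummy_forward(token, k, v)
    k = torch.cat([k, new_k], dim=2)
    v = torch.cat([v, new_v], dim=2)
    if k.size(2) > CACHE_SIZE:                   # steady state: keep sinks + recent window
        keep = CACHE_SIZE - SINK_SIZE
        k = torch.cat([k[:, :, :SINK_SIZE], k[:, :, -keep:]], dim=2)
        v = torch.cat([v[:, :, :SINK_SIZE], v[:, :, -keep:]], dim=2)
    token = int(torch.argmax(logits))            # greedy sampling for simplicity
    generated.append(token)

print(len(generated), k.size(2))                 # 5000 tokens generated, cache fixed at 1024
```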
Performance Analysis
Perplexity Comparison
Method | PPL@2K | PPL@8K | PPL@16K | Memory |
---|---|---|---|---|
Full Cache | 10.2 | 10.3 | 10.4 | O(n) |
Window (no sinks) | 10.5 | 63.5 | 450+ | O(w) |
Window + 4 Sinks | 10.3 | 10.9 | 11.2 | O(w) |
Memory Savings
For a 100K token stream with 1K window:
- Full cache: 100K × d_model memory
- StreamingLLM: 1K × d_model memory
- Savings: 99%
Advanced Techniques
Dynamic Sink Detection
Rather than hardcoding the number of attention sinks (typically 4), adaptive systems can automatically identify which tokens naturally act as sinks by analyzing attention patterns during inference.
Detection Methodology:
The system monitors attention weights across multiple generation steps, computing the average attention received by each token position. Tokens that consistently receive high attention (above a threshold like 10-15%) are identified as natural sinks.
Key Characteristics of Detected Sinks:
- Contiguous Positioning: Sinks typically form a contiguous block at the sequence start (positions 0-3 or 0-7)
- Consistent Attention: They receive stable attention across different contexts and prompts
- Position Independence: Their sink behavior doesn't depend on semantic content
Adaptive Sink Management:
The system maintains a rolling history of attention patterns (typically 100-200 tokens). Once sufficient data is collected, it analyzes the attention distribution to identify natural sink positions. This enables:
- Task-specific optimization: Different tasks may require different numbers of sinks
- Model-specific tuning: Some models use more/fewer sinks based on training
- Dynamic adjustment: Sink count can adapt during long generation sessions
Benefits:
- Eliminates manual tuning of sink size
- Adapts to model-specific attention patterns
- Can discover non-obvious sink configurations
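One possible implementation of this detection loop: keep a rolling history of attention snapshots, average them, and flag prefix positions whose mean share crosses a threshold. The history length, threshold, and candidate-prefix size are tunable assumptions:

```python
from collections import deque
import numpy as np

class SinkDetector:
    """Identify positions that consistently absorb a large share of attention."""

    def __init__(self, history: int = 200, threshold: float = 0.10, prefix: int = 32):
        self.records = deque(maxlen=history)  # rolling window of attention snapshots
        self.threshold = threshold            # average attention share that marks a sink
        self.prefix = prefix                  # only the first `prefix` positions are candidates

    def observe(self, attn_weights: np.ndarray):
        """attn_weights: attention from the current query to cached positions (sums to 1)."""
        self.records.append(attn_weights[: self.prefix].copy())

    def detected_sinks(self) -> list[int]:
        if len(self.records) < self.records.maxlen:
            return []                          # not enough data collected yet
        avg = np.mean(np.stack(list(self.records)), axis=0)
        return [i for i, a in enumerate(avg) if a >= self.threshold]

# Toy usage: position 0 absorbs roughly 25% of attention, everything else is diffuse.
det = SinkDetector(history=200, threshold=0.10)
for _ in range(200):
    w = np.random.dirichlet(np.ones(1024))
    w[0] += 0.3
    det.observe(w / w.sum())
print(det.detected_sinks())   # typically [0]
```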
Hierarchical Sinks
For extremely long contexts (100K+ tokens), a single-level sink system may be insufficient. Hierarchical sinks use multi-scale preservation to maintain references at different temporal distances.
Multi-Level Preservation Strategy:
Instead of preserving only the first 4 tokens, hierarchical systems preserve tokens at multiple scales:
- Level 1 (Immediate): First 4 tokens (always preserved)
- Level 2 (Short-term): Every 16th token from recent 1,000 tokens
- Level 3 (Long-term): Every 64th token from positions 1,000-10,000
Cache Composition Example:
For a 50,000 token sequence with 1,024 token cache:
- 4 level-1 sinks (positions 0-3)
- ~60 level-2 sinks (every 16th position from recent window)
- ~150 level-3 sinks (every 64th position from middle range)
- ~810 most recent tokens (sliding window)
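The kept-position set for such a scheme can be computed directly; the strides and ranges below mirror the example composition above and are otherwise arbitrary:

```python
def hierarchical_keep_set(current: int, cache_size: int = 1024) -> list[int]:
    """Token positions to retain: level-1 sinks, strided level-2/3 samples, recent window."""
    keep = set(range(4))                                      # level 1: first 4 tokens
    keep |= set(range(max(4, current - 1000), current, 16))   # level 2: every 16th of last 1,000
    keep |= set(range(1000, min(10000, current), 64))         # level 3: every 64th of 1,000-10,000
    budget = cache_size - len(keep)                           # remaining slots for the recent window
    window = [p for p in range(current - 1, -1, -1) if p not in keep][:budget]
    return sorted(keep | set(window))

positions = hierarchical_keep_set(50_000)
print(len(positions))          # 1024
print(positions[:6])           # [0, 1, 2, 3, 1000, 1064]  (sinks, then level-3 samples)
```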
Attention Distribution:
Hierarchical sinks enable the model to:
- Access immediate context (level 1) for stability
- Reference recent patterns (level 2) for coherence
- Recall distant events (level 3) for long-range dependencies
Use Cases:
- Document summarization: Maintain references to different sections
- Long-form generation: Track narrative threads across chapters
- Multi-turn dialogue: Preserve conversation history at multiple scales
Sink Recycling
In scenarios where the context changes dramatically (e.g., switching topics, processing multiple documents), the original sinks may become less relevant. Sink recycling enables gradual transition to new attention anchors.
Transition Strategy:
When attention patterns indicate a context shift, the system gradually phases out old sinks while introducing new ones:
- Detection: Identify when attention to current sinks drops below threshold
- Decay: Gradually reduce the strength of old sink representations (e.g., multiply by 0.9 each step)
- Replacement: Introduce new candidate sinks from recent high-attention tokens
- Stabilization: Monitor new sinks for consistency before full commitment
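A schematic version of this decay-and-replace bookkeeping; the decay factor and thresholds are illustrative assumptions, not a published recipe:

```python
class SinkRecycler:
    """Track smoothed attention per position; retire faded sinks, promote new anchors."""

    def __init__(self, decay: float = 0.9, retire_below: float = 0.005, promote_above: float = 0.10):
        self.decay = decay
        self.retire_below = retire_below      # drop bookkeeping once a position fades this far
        self.promote_above = promote_above    # smoothed share needed to count as an active sink
        self.strength = {}                    # position -> exponentially smoothed attention share

    def step(self, attn_share: dict[int, float]) -> list[int]:
        """attn_share: observed attention share per candidate position at this step."""
        for pos in list(self.strength):
            self.strength[pos] *= self.decay                   # everything decays each step
        for pos, share in attn_share.items():
            self.strength[pos] = self.strength.get(pos, 0.0) + (1 - self.decay) * share
        self.strength = {p: s for p, s in self.strength.items() if s >= self.retire_below}
        return sorted(p for p, s in self.strength.items() if s >= self.promote_above)

# Toy usage: original sinks fade after a document boundary, position 5001 takes over.
rec = SinkRecycler()
for t in range(100):
    obs = {0: 0.12, 1: 0.08, 5001: 0.01} if t < 50 else {0: 0.01, 1: 0.01, 5001: 0.15}
    active = rec.step(obs)
print(active)   # [5001] -- the new anchor was promoted gradually, the old sinks dropped out
```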
Practical Application - Multi-Document Processing:
When processing a sequence of articles:
- Document 1 (tokens 0-5000): Sinks at positions 0-3
- Document 2 (tokens 5001-10000): Begin decaying original sinks, introduce sinks at 5001-5004
- Document 3 (tokens 10001-15000): Original sinks fully retired, new sinks established
Benefits:
- Prevents stale sinks from consuming attention
- Adapts to changing content domains
- Maintains stability during transitions
Challenges:
- Determining optimal decay rate (too fast = instability, too slow = wasted attention)
- Identifying genuine context shifts vs. temporary topic changes
- Managing multiple active sink sets during transition periods
Production Deployment
GPU Optimization Strategies
Efficient streaming attention requires careful GPU kernel design to handle the discontinuous memory access pattern created by attention sinks. The performance challenge lies in computing attention over non-contiguous cache regions (sinks + gap + window).
Memory Layout Optimization:
Standard attention kernels assume contiguous key/value sequences. StreamingLLM requires modified kernels that efficiently handle:
- Sink region: Positions 0-3 (always accessed)
- Gap region: Positions 4-3980 (masked out, never accessed)
- Window region: Positions 3981-5000 (actively accessed)
Kernel Design Principles:
1. Shared Memory Utilization:
- Load query vectors into shared memory (reused across all key positions)
- Stream keys from global memory for dot product computation
- Store attention scores in shared memory for softmax reduction
2. Efficient Masking:
- Precompute boolean masks: is_sink = (position < sink_size), is_window = (position >= current_length - window_size)
- Apply mask during score computation (set invalid positions to -∞)
- Avoid divergent branching by expressing the mask as a predicated select:
score = valid ? score : -INFINITY (compiles to a select instruction rather than a divergent branch)
3. Coalesced Memory Access:
- Organize cache layout to enable coalesced reads: [batch, head, position, dim]
- Each thread loads consecutive dimensions of the same position
- Minimize warp divergence during conditional masking
4. Parallel Reduction for Softmax:
- Compute attention scores in parallel (one thread per cache position)
- Use warp-level reduction for the softmax denominator sum(exp(score_i))
- Normalize scores in parallel to get attention weights
- Compute weighted sum of values in parallel
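The masking logic from principle 2, written out in PyTorch for clarity; a production kernel would fuse the same predicate into its CUDA score loop, and torch.where lowers to a select rather than a divergent branch:

```python
import math
import torch

def sink_window_mask(seq_len: int, sink_size: int, window_size: int) -> torch.Tensor:
    """Boolean validity mask over absolute positions 0..seq_len-1 (sinks + recent window)."""
    pos = torch.arange(seq_len)
    is_sink = pos < sink_size
    is_window = pos >= (seq_len - window_size)
    return is_sink | is_window

def masked_scores(q: torch.Tensor, k: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    scores = (k @ q) / math.sqrt(q.shape[-1])
    # Predicated select instead of a data-dependent branch per position.
    return torch.where(valid, scores, torch.full_like(scores, float("-inf")))

valid = sink_window_mask(seq_len=5001, sink_size=4, window_size=1020)
q, k = torch.randn(64), torch.randn(5001, 64)
print(torch.softmax(masked_scores(q, k, valid), dim=-1)[4:3981].sum())  # tensor(0.)
```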
Performance Characteristics:
Optimization | Speedup | Memory Bandwidth |
---|---|---|
Naive Implementation | 1.0× | 100% |
Shared Memory | 1.8× | 65% |
Fused Masking | 2.3× | 60% |
Optimized Layout | 3.2× | 45% |
Production Considerations:
- Batch Processing: Process multiple queries simultaneously to amortize kernel launch overhead
- Multi-Head Parallelism: Each head processes independently, ideal for GPU parallelization
- Flash Attention Integration: Combine sink-aware masking with Flash Attention's memory-efficient algorithm
- Mixed Precision: Use FP16 for storage, FP32 for accumulation to balance speed and accuracy
Configuration Guidelines
Selecting optimal sink and window sizes depends on your specific use case, model characteristics, and computational constraints. Here's a comprehensive decision framework:
Task-Based Configuration:
Conversational AI (Chat, Assistants):
- Sink size: 4 tokens
- Window size: 1024-2048 tokens
- Rationale: Conversations have natural turn boundaries. Context resets frequently, so smaller windows suffice. Minimal sinks provide stability without overhead.
- Typical sequence length: 2K-8K tokens per conversation
Document Summarization:
- Sink size: 8 tokens
- Window size: 2048-4096 tokens
- Rationale: Documents require broader context to capture main themes. More sinks help maintain document structure signals. Larger windows preserve more content.
- Typical sequence length: 10K-50K tokens per document
Code Generation:
- Sink size: 16 tokens
- Window size: 4096-8192 tokens
- Rationale: Code has hierarchical structure (functions, classes, modules). More sinks preserve structural context. Large windows maintain scope visibility.
- Typical sequence length: 20K-100K tokens per codebase
Real-Time Transcription:
- Sink size: 4 tokens
- Window size: 512-1024 tokens
- Rationale: Latency-critical application. Minimal cache reduces memory and computation. Recent context is most relevant for transcription accuracy.
- Typical sequence length: Unbounded (streaming audio)
Model-Size Considerations:
Model Size | Recommended Sink Size | Recommended Window Size |
---|---|---|
Small (< 1B params) | 4 | 512-1024 |
Medium (1-7B params) | 4-8 | 1024-2048 |
Large (7-70B params) | 8-16 | 2048-4096 |
Extra Large (> 70B) | 16-32 | 4096-8192 |
Memory Constraint Calculation:
Total memory = (sink_size + window_size) × num_layers × 2 × hidden_dim × precision_bytes
Example for LLaMA-7B (32 layers, 4096 hidden dim, FP16):
- Config (4 sinks + 1024 window): 1028 × 32 × 2 × 4096 × 2 bytes ≈ 0.54 GB KV cache
- Config (8 sinks + 2048 window): 2056 × 32 × 2 × 4096 × 2 bytes ≈ 1.08 GB KV cache
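The same calculation as a small helper (LLaMA-7B-like shapes are used for the example; the formula counts one key and one value vector of hidden_dim per layer per token):

```python
def kv_cache_bytes(sink_size: int, window_size: int, num_layers: int,
                   hidden_dim: int, precision_bytes: int = 2) -> int:
    """Total KV-cache size in bytes for a fixed sink + window configuration."""
    tokens = sink_size + window_size
    return tokens * num_layers * 2 * hidden_dim * precision_bytes  # 2 = keys and values

# LLaMA-7B-like shapes: 32 layers, hidden_dim 4096, FP16 storage.
for sinks, window in [(4, 1024), (8, 2048)]:
    gb = kv_cache_bytes(sinks, window, num_layers=32, hidden_dim=4096) / 1e9
    print(f"{sinks} sinks + {window} window: {gb:.2f} GB")
# 4 sinks + 1024 window: 0.54 GB
# 8 sinks + 2048 window: 1.08 GB
```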
Best Practices
1. Sink Size Selection
Empirical Guidelines:
The optimal number of attention sinks depends on three factors that should be balanced:
Minimum Stability Threshold:
- At least 4 sinks are required for stable generation across most models
- Fewer than 4 causes attention distribution instability
- This baseline provides sufficient "dump space" for excess attention
Window Size Scaling:
- Larger windows benefit from more sinks; a useful rule of thumb is to double the sink count each time the window size quadruples:
- Window 512 → 4 sinks
- Window 2048 → 8 sinks
- Window 8192 → 16 sinks
- Rationale: More positions need more attention stabilization points
Natural Token Boundaries:
- Examine first N tokens of typical prompts (BOS + first few words)
- If prompts consistently start with structured tokens (e.g., system prompts), preserve that entire prefix
- Example: If system prompts are 12 tokens, use 12-16 sinks
Decision Process:
- Start with baseline: 4 sinks
- If window > 2048: scale the sink count up using the rule above (8 at 2048, 16 at 8192)
- If using structured prompts: align to prompt length
- Test and validate with perplexity measurements
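The decision process can be folded into a single helper. The scaling rule implements "double the sinks each time the window quadruples" from above; the prompt-alignment clamp is an assumption:

```python
import math

def suggested_sink_size(window_size: int, structured_prompt_len: int = 0) -> int:
    """Heuristic sink count: 4 at window 512, doubling each time the window quadruples."""
    base = 4 * math.sqrt(window_size / 512)          # 512 -> 4, 2048 -> 8, 8192 -> 16
    sinks = max(4, round(base))
    if structured_prompt_len:                        # preserve a fixed system-prompt prefix
        sinks = max(sinks, structured_prompt_len)
    return sinks

print(suggested_sink_size(512), suggested_sink_size(2048), suggested_sink_size(8192))  # 4 8 16
print(suggested_sink_size(1024, structured_prompt_len=12))                             # 12
```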
2. Position Encoding
Maintaining Positional Accuracy:
Attention sinks create non-contiguous position sequences. Position encodings must reflect true token positions to maintain model accuracy:
Absolute Position Embeddings:
- Each token keeps its original absolute position
- Sink at position 0 always has position embedding PE(0)
- Window token at position 5000 has position embedding PE(5000)
- Never renumber positions to be sequential [0,1,2,3,4,5...]
RoPE (Rotary Position Embeddings):
- Apply rotation based on true position indices
- Sinks use positions [0,1,2,3]
- Window uses positions [3981,3982,...,5000]
- RoPE's relative nature handles gaps naturally—only distances between visible tokens matter
- The gap (positions 4-3980) is simply absent; no special handling needed
ALiBi (Attention with Linear Biases):
- Compute biases using true position differences
- Bias between position 0 (sink) and position 5000 (new token) = -5000 × slope
- Bias between position 3981 (window start) and 5000 (new token) = -1019 × slope
- Gaps don't affect bias calculation—only distance between cached positions matters
Key Principle: Never "fix" positions to be contiguous. The model expects position gaps and handles them correctly through its learned position encoding mechanism.
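For ALiBi the principle reduces to a distance computation on true positions; this tiny sketch reproduces the numbers above for a single head with slope 1:

```python
import torch

def alibi_bias(query_pos: int, key_positions: torch.Tensor, slope: float) -> torch.Tensor:
    """ALiBi additive bias: -slope * distance, computed on true (gapped) positions."""
    return -slope * (query_pos - key_positions).float()

key_positions = torch.cat([torch.arange(0, 4), torch.arange(3981, 5001)])  # sinks + window
bias = alibi_bias(query_pos=5000, key_positions=key_positions, slope=1.0)

print(bias[0].item())    # -5000.0  (sink at position 0)
print(bias[4].item())    # -1019.0  (window start at position 3981)
```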
3. Monitoring and Debugging
Attention Sink Health Metrics:
To ensure attention sinks are functioning correctly, monitor these key indicators during generation:
Sink Attention Percentage:
- Metric: Sum of attention weights assigned to the sink positions, as a fraction of total attention
- Healthy range: 20-40% total attention to sinks
- Warning signs:
- < 10%: Sinks not being used, may indicate misconfiguration
- > 60%: Over-reliance on sinks, model may be struggling
Attention Entropy:
- Metric: Shannon entropy of attention distribution: -Σ(p × log(p))
- Healthy range: 3.5-5.5 (for typical sequence lengths)
- Warning signs:
- < 2.0: Attention too focused (possible degeneration)
- > 6.0: Attention too diffuse (instability)
Maximum Attention Weight:
- Metric: Highest attention weight assigned to any single position
- Healthy range: 0.15-0.35
- Warning signs:
- > 0.5: Single position dominance (attention collapse)
- < 0.05: No clear focus (may indicate confusion)
Perplexity Tracking:
- Monitor perplexity every 1000 tokens during long generation
- Gradual increase (10.3 → 10.9 over 10K tokens): Expected behavior
- Sudden spike (10.5 → 30+): Critical failure, check sink preservation
- Sustained > 25: Investigate cache corruption or position encoding issues
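These indicators are cheap to compute from a single attention row. A monitoring sketch using the thresholds quoted above (the thresholds are this document's guidance, not universal constants):

```python
import numpy as np

def attention_health(attn: np.ndarray, sink_size: int = 4) -> dict:
    """attn: attention weights from the current query to all cached positions (sums to 1)."""
    attn = np.asarray(attn, dtype=np.float64)
    sink_share = float(attn[:sink_size].sum())
    entropy = float(-(attn * np.log(attn + 1e-12)).sum())   # Shannon entropy in nats
    max_weight = float(attn.max())
    return {
        "sink_share": sink_share,                   # healthy: ~0.20-0.40
        "healthy_sink": 0.10 <= sink_share <= 0.60,
        "entropy": entropy,                         # healthy: ~3.5-5.5 for typical lengths
        "healthy_entropy": 2.0 <= entropy <= 6.0,
        "max_weight": max_weight,                   # healthy: ~0.15-0.35
        "healthy_focus": 0.05 <= max_weight <= 0.5,
    }

# Toy attention row: roughly 25-30% of the mass pushed onto the first 4 positions.
weights = np.random.dirichlet(np.ones(1024))
weights[:4] += 0.30 / 4
weights /= weights.sum()
print(attention_health(weights))
```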
Debugging Common Issues:
Symptom | Likely Cause | Solution |
---|---|---|
Perplexity spike after N tokens | Window size = N, sinks evicted | Verify sink positions preserved |
Low sink attention (< 10%) | Position encoding mismatch | Check position IDs match true positions |
Attention collapse to single token | Softmax numerical instability | Increase numerical precision, check for NaN |
Gradual coherence loss | Window too small for task | Increase window size or use hierarchical sinks |