Sliding Window Attention

Learn how Sliding Window Attention enables efficient processing of long sequences by limiting attention to local context windows, used in Mistral and Longformer.


Sliding Window Attention: Efficient Local Context Processing

Sliding Window Attention restricts each token to attend only to a fixed-size window of surrounding tokens, dramatically reducing computational complexity while maintaining strong performance through clever architectural design.

Interactive Sliding Window Visualization

Explore how sliding windows create efficient attention patterns through this step-by-step walkthrough:


What is Sliding Window Attention?

Sliding Window Attention is an efficient attention mechanism where each token attends only to a fixed-size window of surrounding tokens, reducing complexity from O(n²) to O(n×w).

Key Benefits:
  • Linear complexity with sequence length
  • Constant memory per position
  • Receptive field grows with layers
  • Used in Mistral, Longformer

The Core Concept

Instead of full O(n²) attention, each token attends to:

  • w tokens to the left
  • w tokens to the right
  • Total window size: 2w + 1

This reduces complexity to O(n × w) where w is much less than n.

How Sliding Window Works

Basic Mechanism

For token at position i with window size w:

Attention(i) = softmax(Q_i · K[i-w : i+w]^T / √d) · V[i-w : i+w]

Only positions within [i-w, i+w] are attended to.
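
As a minimal sketch, the formula can be written directly in PyTorch for a single query position (single head, no batching; the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the formula for one query position i (single head, no batching).
n, d, w, i = 12, 16, 3, 5
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

start, end = max(0, i - w), min(n, i + w + 1)    # window [i-w, i+w], clipped to the sequence
scores = Q[i] @ K[start:end].T / d ** 0.5        # Q_i · K_window^T / sqrt(d)
weights = F.softmax(scores, dim=-1)              # softmax over the window only
out_i = weights @ V[start:end]                   # weighted sum of the window's values
```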

Attention Pattern

Window size w = 3:
Token 0: [0, 1, 2, 3] → Can see 4 tokens
Token 5: [2, 3, 4, 5, 6, 7, 8] → Can see 7 tokens (full window)
Token n: [n-3, n-2, n-1, n] → Can see 4 tokens

Implementation Architecture

Core Components

1. Standard Multi-Head Attention Foundation

Sliding window attention builds on top of standard multi-head attention with the same fundamental components:

  • Query Projections: Transform input embeddings into query vectors for each attention head
  • Key/Value Projections: Create key and value representations for attention computation
  • Head Dimension Scaling: Divide model dimension across multiple heads (head_dim = d_model / n_heads)
  • Attention Scaling Factor: Use 1/√(d_head) to normalize attention scores
  • Output Projection: Combine multi-head outputs back into model dimension

2. Window Boundary Computation

For each token at position i, the attention window is dynamically calculated:

  • Start Position: max(0, i - w) to handle sequence beginning
  • End Position: min(n, i + w + 1) to handle sequence end
  • Window Content: All tokens from start to end positions
  • Edge Handling: Tokens near boundaries naturally get smaller windows

3. Position-by-Position Attention

The sliding window mechanism processes each query token individually:

  • Query Isolation: Extract single query vector for current position
  • Key/Value Window: Extract only the keys and values within the defined window
  • Local Attention Scores: Compute Q_i · K_window^T only for window positions
  • Normalization: Apply softmax over window positions only
  • Value Aggregation: Weighted sum of values within the window
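
The steps above can be combined into a straightforward, deliberately unoptimized reference implementation. This is a sketch assuming single-head tensors of shape (seq_len, head_dim); a production version would be batched, multi-head, and fused:

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(Q, K, V, w):
    """Naive per-position sliding window attention (single head, no batching).

    Q, K, V: (seq_len, head_dim) tensors; w is the window radius on each side.
    """
    n, d = Q.shape
    out = torch.empty_like(Q)
    for i in range(n):                                     # query isolation: one position at a time
        start, end = max(0, i - w), min(n, i + w + 1)      # window boundary computation
        scores = Q[i] @ K[start:end].T / d ** 0.5          # local attention scores
        out[i] = F.softmax(scores, dim=-1) @ V[start:end]  # normalize and aggregate window values
    return out
```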

4. Efficiency Through Masking

An alternative implementation uses attention masks for parallelization:

  • Full Score Matrix: Compute all n × n attention scores
  • Sliding Window Mask: Binary matrix where 1 indicates valid attention within window
  • Mask Application: Set out-of-window scores to -∞ before softmax
  • Parallel Processing: All positions processed simultaneously with masking
  • Memory Trade-off: Faster computation but requires full attention matrix storage
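
A sketch of this masked variant, assuming the same single-head tensor shapes as before; the band-diagonal mask is what restricts attention to the window:

```python
import torch
import torch.nn.functional as F

def sliding_window_attention_masked(Q, K, V, w):
    """Parallel sliding window attention via a band-diagonal mask (single head)."""
    n, d = Q.shape
    scores = Q @ K.T / d ** 0.5                              # full n x n score matrix
    pos = torch.arange(n)
    in_window = (pos[None, :] - pos[:, None]).abs() <= w     # True where |i - j| <= w
    scores = scores.masked_fill(~in_window, float("-inf"))   # out-of-window scores -> -inf
    return F.softmax(scores, dim=-1) @ V                     # each row normalizes over its window
```

Both variants produce the same output; the masked one is fully parallel but materializes the full n × n score matrix.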

Multi-Layer Sliding Window

Effective Receptive Field

With L layers and window size w:

  • Layer 1: Can see w tokens
  • Layer 2: Can see up to 2w tokens
  • Layer L: Can see up to L×w tokens
Receptive Field = L × w

Example: 32 layers × window 4096 = 131,072 token receptive field!

Information Flow

Layer 1: [----window----]
Layer 2: [--------window×2--------]
Layer 3: [------------window×3------------]
...
Layer L: [----------------window×L----------------]

Mistral's Implementation Strategy

Model Configuration

Mistral 7B Architecture:

  • Hidden Size: 4096-dimensional embeddings
  • Attention Heads: 32 heads (128 dimensions per head)
  • Sliding Window: 4096 tokens per window
  • Maximum Positions: 32,768 tokens with RoPE position encoding
  • Number of Layers: 32 transformer layers
  • Effective Receptive Field: 32 layers × 4096 window = 131,072 tokens

This configuration enables processing sequences up to 32K tokens while each layer only attends to 4K local tokens, dramatically reducing memory and computation requirements.

Rolling Buffer Cache System

Cache Architecture Design:

Mistral uses a sophisticated rolling buffer for KV caching that only stores values within the sliding window:

1. Fixed-Size Buffer

  • Allocates memory for exactly window_size positions (4096 tokens)
  • Stores key and value tensors for each layer and head
  • Memory size: L × 2 × w × h × d where L=layers, w=window, h=heads, d=head_dim
  • Constant memory regardless of total sequence length
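
Plugging Mistral 7B's configuration into this formula gives a sense of scale. The dtype is an assumption here (16-bit keys/values); the exact layout is an implementation detail:

```python
# KV rolling-buffer size for Mistral 7B's configuration (assuming 16-bit keys/values)
layers, window, heads, head_dim = 32, 4096, 32, 128
bytes_per_value = 2                                       # fp16/bf16 assumption
elements = layers * 2 * window * heads * head_dim         # the 2 covers keys and values
print(f"{elements * bytes_per_value / 2**30:.1f} GiB")    # 2.0 GiB, independent of sequence length
```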

2. Buffer Update Mechanism

  • Maintains current position pointer within the window
  • When new tokens arrive, writes them to the buffer at current position
  • Advances position pointer by number of new tokens
  • When buffer fills up, shifts old values out (circular buffer behavior)

3. Rolling Strategy

  • If position + new_tokens > window_size, triggers roll operation
  • Calculates shift amount: how many positions to discard from beginning
  • Shifts entire buffer content left by shift amount
  • Resets position to accommodate new tokens
  • Writes new key/value pairs at updated position

4. Memory Efficiency

  • Full attention cache grows with sequence: O(n) memory per layer
  • Rolling buffer stays constant: O(w) memory per layer
  • For 32K sequence with 4K window: 87.5% memory reduction
  • Enables inference on much longer sequences with same hardware
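
The following is a simplified sketch of the rolling strategy described above; it is not Mistral's actual code, the class and variable names are illustrative, and it assumes at most window-many new tokens arrive per update:

```python
import torch

class RollingKVCache:
    """Fixed-size per-layer KV buffer that keeps only the last `window` positions."""

    def __init__(self, window, n_heads, head_dim, dtype=torch.float16):
        self.window = window
        self.keys = torch.zeros(window, n_heads, head_dim, dtype=dtype)
        self.values = torch.zeros(window, n_heads, head_dim, dtype=dtype)
        self.pos = 0                                  # next write position in the buffer

    def update(self, new_k, new_v):
        """Append (t, n_heads, head_dim) keys/values, rolling out the oldest entries."""
        t = new_k.shape[0]                            # assumes t <= window
        shift = self.pos + t - self.window            # how many old positions to discard
        if shift > 0:                                 # roll: drop the oldest `shift` entries
            self.keys[: self.pos - shift] = self.keys[shift : self.pos].clone()
            self.values[: self.pos - shift] = self.values[shift : self.pos].clone()
            self.pos -= shift
        self.keys[self.pos : self.pos + t] = new_k    # write new keys/values at the updated position
        self.values[self.pos : self.pos + t] = new_v
        self.pos += t
        return self.keys[: self.pos], self.values[: self.pos]
```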

Comparison with Other Approaches

| Feature | Full Attention | Sliding Window | Hierarchical | Sparse |
|---|---|---|---|---|
| Complexity | O(n²) | O(n×w) | O(n×√n) | O(n×k) |
| Receptive Field | Global | L×w | Global | Varies |
| Memory | O(n²) | O(n×w) | O(n×√n) | O(n×k) |
| Quality | Best | Good | Good | Good |
| Implementation | Simple | Medium | Complex | Complex |

Hybrid Approaches

Sliding Window + Global Attention

Longformer-Style Global Tokens:

Combining sliding window attention with a few global tokens creates a powerful hybrid pattern that balances efficiency with global context access.

Architecture Design:

Global Token Strategy:

  • Reserve first N tokens (typically 1-8) as "global" tokens
  • Global tokens use full attention: can attend to and be attended by all positions
  • Remaining tokens use sliding window attention among themselves
  • All tokens can attend to global tokens (asymmetric pattern)

Attention Mask Structure:

For a sequence with G global tokens and sliding window w:

  • Global tokens (0 to G-1): Full attention to all positions (rows of all 1s in mask)
  • Content tokens (G to n): Sliding window + attention to global tokens
  • Pattern: Global region (full) + Band diagonal (sliding window)
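
A small sketch of how such a hybrid mask can be built (boolean convention: True means the position may be attended; the function name is illustrative):

```python
import torch

def longformer_style_mask(n, w, g):
    """Boolean mask (True = may attend): g global tokens plus a sliding window of radius w."""
    pos = torch.arange(n)
    mask = (pos[None, :] - pos[:, None]).abs() <= w   # band-diagonal sliding window
    mask[:g, :] = True                                 # global tokens attend to every position
    mask[:, :g] = True                                 # every position attends to global tokens
    return mask
```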

Use Cases and Benefits:

| Use Case | Global Token Usage | Benefits |
|---|---|---|
| Document QA | Document title, metadata | Query can access summary info + local context |
| Classification | CLS token (like BERT) | Global aggregator + efficient local processing |
| Summarization | Section headers | Structural anchors + detailed local content |
| Code Analysis | Function signatures | High-level structure + detailed implementation |

Complexity:

  • Memory: O(G×n + n×w) ≈ O(n×w) for small G
  • Computation: Similar to memory, dominated by sliding window when G ≪ n

Dilated Sliding Window

Multi-Scale Attention Pattern:

Dilated windows capture both local and distant context by attending to positions at regular intervals (similar to dilated convolutions).

Dilation Mechanism:

Standard Sliding Window:

  • Attends to consecutive positions: [i-w, i-w+1, ..., i, ..., i+w-1, i+w]
  • Dense local coverage

Dilated Sliding Window with dilation d:

  • Attends to positions at intervals: [i-w×d, i-(w-1)×d, ..., i, ..., i+(w-1)×d, i+w×d]
  • Sparse coverage over larger range
  • Same number of attended positions but wider spatial coverage

Hybrid Dilated Pattern:

  • Local window (dilation=1): Immediate neighbors for fine-grained context
  • Dilated window (dilation=2,4,8): Distant positions for long-range patterns
  • Combined: Dense local + sparse distant = effective multi-scale attention

Example Configuration:

For token at position 100 with window size 4:

  • Standard: attends to [96, 97, 98, 99, 100, 101, 102, 103, 104] (9 positions)
  • Dilated (d=2): attends to [92, 94, 96, 98, 100, 102, 104, 106, 108] (9 positions)
  • Hybrid: Local [98, 99, 100, 101, 102] + Dilated [92, 96, 104, 108] (9 positions)
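
A small helper makes the three patterns above concrete (the function name and the particular dilation rates are illustrative):

```python
def window_positions(i, w, n, dilation=1):
    """Positions attended by token i: offsets -w·d, ..., -d, 0, d, ..., w·d, clipped to [0, n)."""
    offsets = range(-w * dilation, w * dilation + 1, dilation)
    return [i + o for o in offsets if 0 <= i + o < n]

n = 1024
print(window_positions(100, 4, n))                  # [96, 97, ..., 104]  standard, 9 positions
print(window_positions(100, 4, n, dilation=2))      # [92, 94, ..., 108]  dilated,  9 positions

# Hybrid: dense local neighbours plus a sparse dilated ring further out (9 unique positions)
local = window_positions(100, 2, n)                 # [98, 99, 100, 101, 102]
distant = window_positions(100, 2, n, dilation=4)   # [92, 96, 100, 104, 108]
print(sorted(set(local) | set(distant)))
```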

Benefits:

  • Wider receptive field without increasing computation
  • Multi-scale features: Local details + global structure
  • Hierarchical patterns: Similar to stacked dilated convolutions

Trade-offs:

  • Misses some intermediate positions (gaps in coverage)
  • Works best when patterns appear at regular intervals (e.g., code indentation, document structure)
  • May need multiple dilation rates for full coverage

Memory and Speed Analysis

Memory Savings

For sequence length n = 32K, window w = 4K:

| Method | Attention Matrix | Relative Size |
|---|---|---|
| Full | 32K × 32K ≈ 1B entries | 100% |
| Sliding | 32K × 4K = 128M entries | 12.5% |
| Savings | ≈896M entries | 87.5% reduction |

Speed Comparison

Performance Characteristics:

Sliding window attention provides dramatic speedups, especially for long sequences:

Sequence Length 8K with Window 512:

  • Full attention: Computes 8K × 8K = 64M attention scores
  • Sliding window: Computes 8K × 512 = 4M attention scores
  • Speedup: ~16× faster, 93.75% fewer operations

Sequence Length 32K with Window 4K:

  • Full attention: 32K × 32K = 1B attention scores
  • Sliding window: 32K × 4K = 128M attention scores
  • Speedup: ~8× faster, 87.5% fewer operations

Key Factors Affecting Speed:

  • Window-to-sequence ratio: Smaller windows provide greater speedups
  • Hardware parallelization: GPUs can process windowed attention very efficiently
  • Memory bandwidth: Reduced matrix size means less data movement
  • Cache utilization: Smaller working set fits better in GPU cache

Best Practices

1. Window Size Selection

Decision Framework:

Choosing the right window size involves balancing context coverage, memory constraints, and computational efficiency:

Short Sequences (≤ 2048 tokens):

  • Recommendation: Use full attention (window = sequence length)
  • Rationale: Performance overhead is minimal, full context improves quality
  • Memory: Still manageable at O(n²) for small n

Medium Sequences (2K-8K tokens):

  • Recommendation: Window size 512-1024
  • Rationale: Balances local context with computational efficiency
  • Receptive field: With 16-32 layers, achieves 8K-32K effective range
  • Use case: Document processing, code analysis

Long Sequences (8K-32K+ tokens):

  • Recommendation: Window size 1024-4096
  • Rationale: Maintains manageable computation while ensuring coverage
  • Receptive field calculation: window × layers ≥ sequence length
  • Example: Mistral uses 4096 window × 32 layers = 131K receptive field for 32K sequences
  • Use case: Long documents, books, transcription

Target Receptive Field Method:

  • Calculate required receptive field for your task
  • Divide by number of layers: window = target_RF / n_layers
  • Round up to nearest power of 2 for efficiency
  • Validate that window × layers ≥ sequence length
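
A sketch of this method as a helper function (the name and the power-of-two rounding heuristic simply follow the steps above):

```python
import math

def choose_window(target_receptive_field, n_layers, seq_len):
    """Window size such that window × layers covers the target receptive field."""
    window = math.ceil(target_receptive_field / n_layers)
    window = 2 ** math.ceil(math.log2(window))       # round up to the nearest power of 2
    assert window * n_layers >= seq_len, "window too small to cover the sequence"
    return window

# e.g. a 32-layer model that should cover 32K-token inputs end to end
print(choose_window(target_receptive_field=32_768, n_layers=32, seq_len=32_768))  # 1024
```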

2. Layer-Specific Windows

Variable Window Strategy:

Using different window sizes across layers can optimize the trade-off between local and global context:

Progressive Window Expansion:

  • Early layers (1-8): Smaller windows (512-1024) focus on local patterns
  • Middle layers (9-16): Medium windows (1024-2048) capture phrase-level context
  • Deep layers (17-32): Larger windows (2048-4096) enable long-range dependencies

Implementation Patterns:

Grouped Window Scaling:

  • Divide layers into groups (e.g., groups of 4)
  • Each group uses progressively larger windows
  • Example: Layers 1-4: 512, Layers 5-8: 1024, Layers 9-12: 2048, etc.

Exponential Growth:

  • window(layer) = base_window × 2^⌊layer/4⌋
  • Starts small for efficiency, grows for coverage
  • Balances computation across layers
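
As a sketch, the schedule above can be expressed as follows (0-indexed layers; the base window of 512 is assumed for illustration):

```python
def layer_window(layer, base_window=512):
    """window(layer) = base_window × 2^⌊layer / 4⌋, with 0-indexed layers."""
    return base_window * 2 ** (layer // 4)

print([layer_window(l) for l in range(16)])
# [512, 512, 512, 512, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 4096, 4096, 4096, 4096]
```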

Task-Adaptive Windows:

  • Code tasks: Larger windows in middle layers (function-level context)
  • Chat tasks: Uniform smaller windows (recent conversation focus)
  • Long documents: Increasing windows toward later layers (narrative flow)

3. Attention Sinks Integration

Combining Sliding Windows with Attention Sinks:

For very long sequences, combining sliding windows with attention sinks provides both efficiency and stability:

Hybrid Attention Pattern:

  • Sink tokens (positions 0-3): Always attended to by all tokens
  • Sliding window: Local context window around current position
  • Effective pattern: Global access to sinks + local sliding window

Attention Mask Structure:

For token at position i with window w and 4 sink tokens:

  1. Positions 0-3: Always accessible (attention sinks)
  2. Positions [i-w, i+w]: Local window (if not overlapping with sinks)
  3. All other positions: Masked out
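
A sketch of the combined mask, using the same boolean convention as the earlier examples (this bidirectional version ignores causality; a decoder would intersect it with a causal mask):

```python
import torch

def sink_plus_window_mask(n, w, n_sinks=4):
    """Boolean mask (True = may attend): local sliding window plus always-visible sink tokens."""
    pos = torch.arange(n)
    mask = (pos[None, :] - pos[:, None]).abs() <= w   # local window around each position
    mask[:, :n_sinks] = True                           # every token can attend to the sinks
    return mask
```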

Benefits:

  • Stability: Sinks prevent attention distribution collapse during streaming
  • Efficiency: Window limits computational cost to O(n × (w + sinks))
  • Quality: Maintains both global anchors and local context

Use Cases:

  • Streaming inference: Process 100K+ token sequences with constant memory
  • Document QA: Sinks hold document metadata, window processes content
  • Long-form generation: Sinks maintain narrative anchors, window handles recent context

Memory Calculation:

  • Standard sliding window: O(n × w)
  • With sinks: O(n × (w + s)) where s = number of sinks (typically 4)
  • Overhead: ~0.1-1% for typical configurations

Limitations and Solutions

Limitation 1: Information Bottleneck

Problem: Information from outside the window must propagate through several layers before it can influence a token.
Solution: Add skip connections or global tokens.

Limitation 2: Fixed Window Size

Problem: Some tasks need variable context.
Solution: Dynamic window sizing based on content.

Limitation 3: Edge Effects

Problem: Tokens at sequence boundaries see less context.
Solution: Padding or cyclic attention.

Production Optimizations

Flash Attention with Sliding Window

Integration Strategy:

Flash Attention's memory-efficient algorithm can be combined with sliding window patterns for maximum efficiency:

Key Integration Points:

1. Attention Masking:

  • Create sliding window mask before attention computation
  • Pass mask to Flash Attention's scaled dot-product function
  • Flash Attention applies mask during tiling, avoiding unnecessary computation

2. Memory Efficiency:

  • Flash Attention already reduces memory from O(n²) to O(n) for full attention
  • Sliding window further reduces computation from O(n²) to O(n×w)
  • Combined: Minimal memory (Flash) + minimal computation (sliding window)

3. Performance Benefits:

  • Speed: 2-3× faster than naive sliding window implementation
  • Memory: Constant memory regardless of sequence length
  • Scalability: Enables 100K+ token sequences on consumer GPUs

4. Implementation Considerations:

  • Use PyTorch's scaled_dot_product_attention with custom mask
  • Set is_causal=False unless using causal sliding window
  • Ensure mask is properly shaped: [batch, heads, seq_len, seq_len] or broadcastable
  • Disable dropout during inference for maximum speed
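
A sketch using PyTorch's scaled_dot_product_attention with an explicit band mask. Note that with a dense attn_mask, recent PyTorch versions may select the memory-efficient backend rather than the FlashAttention kernel, so the exact speedup depends on version and hardware:

```python
import torch
import torch.nn.functional as F

def windowed_sdpa(q, k, v, w):
    """Sliding window attention through PyTorch's fused scaled_dot_product_attention.

    q, k, v: (batch, heads, seq_len, head_dim). The boolean band mask (True = may attend)
    is broadcast across batch and heads.
    """
    n = q.shape[-2]
    pos = torch.arange(n, device=q.device)
    band = (pos[None, :] - pos[:, None]).abs() <= w    # (seq_len, seq_len) sliding window mask
    return F.scaled_dot_product_attention(q, k, v, attn_mask=band[None, None], is_causal=False)
```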

GPU Kernel Optimization

Efficient CUDA Implementation:

Optimized sliding window attention kernels leverage GPU parallelism and memory hierarchy:

Parallelization Strategy:

Thread Organization:

  • One thread per query position (or multiple threads for large head dimensions)
  • Block size: Typically 128-256 threads for optimal occupancy
  • Grid size: Ceil(sequence_length / block_size) blocks

Memory Access Pattern:

1. Coalesced Global Memory Reads:

  • Load query vectors in a coalesced manner (consecutive threads access consecutive memory)
  • Stream keys/values within window bounds from global memory
  • Write outputs coalesced back to global memory

2. Shared Memory Utilization:

  • Load query for current position into shared memory (reused across window)
  • Optionally cache window keys/values in shared memory for reuse
  • Store partial attention scores in shared memory before reduction

3. Register Optimization:

  • Keep frequently used values (scaling factor, position indices) in registers
  • Accumulate attention scores in registers before final write
  • Minimize register spills through careful variable management

Kernel Execution Flow:

  1. Load Query: Thread loads query vector for its assigned position
  2. Compute Window Bounds: Calculate start = max(0, pos - w), end = min(n, pos + w + 1)
  3. Score Computation Loop: Iterate through window positions, compute dot products
  4. Softmax Reduction: Parallel reduction to find max, then compute exp and normalize
  5. Value Aggregation: Weighted sum of values using attention weights
  6. Write Output: Store result to global memory

Performance Optimizations:

| Optimization | Impact | Notes |
|---|---|---|
| Shared memory for queries | 2-3× speedup | Reduces global memory reads |
| Warp-level primitives | 1.5-2× speedup | Use shuffle operations for reductions |
| Loop unrolling | 1.2-1.5× speedup | For fixed small window sizes |
| FP16 computation | 2× speedup on Ampere+ | Use tensor cores when available |
