Sliding Window Attention

Learn how Sliding Window Attention enables efficient processing of long sequences by limiting attention to local context windows, used in Mistral and Longformer.


Sliding Window Attention: Efficient Local Context Processing

Sliding Window Attention restricts each token to attend only to a fixed-size window of surrounding tokens, dramatically reducing computational complexity while maintaining strong performance through clever architectural design.

Interactive Sliding Window Visualization

Explore how sliding windows create efficient attention patterns through this step-by-step walkthrough:


What is Sliding Window Attention?

Sliding Window Attention is an efficient attention mechanism where each token attends only to a fixed-size window of surrounding tokens, reducing complexity from O(n²) to O(n×w).

Key Benefits:
  • Linear complexity with sequence length
  • Constant memory per position
  • Receptive field grows with layers
  • Used in Mistral, Longformer

The Core Concept

Instead of full O(n²) attention, each token attends to:

  • w tokens to the left
  • w tokens to the right
  • Total window size: 2w + 1

This reduces complexity to O(n × w) where w is much less than n.

How Sliding Window Works

Basic Mechanism

For token at position i with window size w:

Attention(i) = softmax(Q_i · K[i-w : i+w]^T / √d) · V[i-w : i+w]

Only positions within [i-w, i+w] are attended to.
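
As a minimal sketch, the formula can be written directly in PyTorch for a single query position (single head, no batching; the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the formula for one query position i (single head, no batching).
n, d, w, i = 12, 16, 3, 5
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

start, end = max(0, i - w), min(n, i + w + 1)    # window [i-w, i+w], clipped to the sequence
scores = Q[i] @ K[start:end].T / d ** 0.5        # Q_i · K_window^T / sqrt(d)
weights = F.softmax(scores, dim=-1)              # softmax over the window only
out_i = weights @ V[start:end]                   # weighted sum of the window's values
```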

Attention Pattern

Window size w = 3:
Token 0: [0, 1, 2, 3] → Can see 4 tokens
Token 5: [2, 3, 4, 5, 6, 7, 8] → Can see 7 tokens (full window)
Token n: [n-3, n-2, n-1, n] → Can see 4 tokens

Implementation Architecture

Core Components

1. Standard Multi-Head Attention Foundation

Sliding window attention builds on top of standard multi-head attention with the same fundamental components:

  • Query Projections: Transform input embeddings into query vectors for each attention head
  • Key/Value Projections: Create key and value representations for attention computation
  • Head Dimension Scaling: Divide model dimension across multiple heads (head_dim = d_model / n_heads)
  • Attention Scaling Factor: Use 1/√(d_head) to normalize attention scores
  • Output Projection: Combine multi-head outputs back into model dimension

2. Window Boundary Computation

For each token at position i, the attention window is dynamically calculated:

  • Start Position: max(0, i - w) to handle sequence beginning
  • End Position: min(n, i + w + 1) to handle sequence end
  • Window Content: All tokens from start to end positions
  • Edge Handling: Tokens near boundaries naturally get smaller windows

3. Position-by-Position Attention

The sliding window mechanism processes each query token individually:

  • Query Isolation: Extract single query vector for current position
  • Key/Value Window: Extract only the keys and values within the defined window
  • Local Attention Scores: Compute Q_i · K_window^T only for window positions
  • Normalization: Apply softmax over window positions only
  • Value Aggregation: Weighted sum of values within the window
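
The steps above can be combined into a straightforward, deliberately unoptimized reference implementation. This is a sketch assuming single-head tensors of shape (seq_len, head_dim); a production version would be batched, multi-head, and fused:

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(Q, K, V, w):
    """Naive per-position sliding window attention (single head, no batching).

    Q, K, V: (seq_len, head_dim) tensors; w is the window radius on each side.
    """
    n, d = Q.shape
    out = torch.empty_like(Q)
    for i in range(n):                                     # query isolation: one position at a time
        start, end = max(0, i - w), min(n, i + w + 1)      # window boundary computation
        scores = Q[i] @ K[start:end].T / d ** 0.5          # local attention scores
        out[i] = F.softmax(scores, dim=-1) @ V[start:end]  # normalize and aggregate window values
    return out
```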

4. Efficiency Through Masking

An alternative implementation uses attention masks for parallelization:

  • Full Score Matrix: Compute all n × n attention scores
  • Sliding Window Mask: Binary matrix where 1 indicates valid attention within window
  • Mask Application: Set out-of-window scores to -∞ before softmax
  • Parallel Processing: All positions processed simultaneously with masking
  • Memory Trade-off: Faster computation but requires full attention matrix storage
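
A sketch of this masked variant, assuming the same single-head tensor shapes as before; the band-diagonal mask is what restricts attention to the window:

```python
import torch
import torch.nn.functional as F

def sliding_window_attention_masked(Q, K, V, w):
    """Parallel sliding window attention via a band-diagonal mask (single head)."""
    n, d = Q.shape
    scores = Q @ K.T / d ** 0.5                              # full n x n score matrix
    pos = torch.arange(n)
    in_window = (pos[None, :] - pos[:, None]).abs() <= w     # True where |i - j| <= w
    scores = scores.masked_fill(~in_window, float("-inf"))   # out-of-window scores -> -inf
    return F.softmax(scores, dim=-1) @ V                     # each row normalizes over its window
```

Both variants produce the same output; the masked one is fully parallel but materializes the full n × n score matrix.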

Multi-Layer Sliding Window

Effective Receptive Field

With L layers and window size w:

  • Layer 1: Can see w tokens
  • Layer 2: Can see up to 2w tokens
  • Layer L: Can see up to L×w tokens
Receptive Field = L × w

Example: 32 layers × window 4096 = 131,072 token receptive field!

Information Flow

Layer 1: [----window----]
Layer 2: [--------window×2--------]
Layer 3: [------------window×3------------]
...
Layer L: [----------------window×L----------------]

Mistral's Implementation Strategy

Model Configuration

Mistral 7B Architecture:

  • Hidden Size: 4096-dimensional embeddings
  • Attention Heads: 32 heads (128 dimensions per head)
  • Sliding Window: 4096 tokens per window
  • Maximum Positions: 32,768 tokens with RoPE position encoding
  • Number of Layers: 32 transformer layers
  • Effective Receptive Field: 32 layers × 4096 window = 131,072 tokens

This configuration enables processing sequences up to 32K tokens while each layer only attends to 4K local tokens, dramatically reducing memory and computation requirements.

Rolling Buffer Cache System

Cache Architecture Design:

Mistral uses a sophisticated rolling buffer for KV caching that only stores values within the sliding window:

1. Fixed-Size Buffer

  • Allocates memory for exactly window_size positions (4096 tokens)
  • Stores key and value tensors for each layer and head
  • Memory size: L × 2 × w × h × d where L=layers, w=window, h=heads, d=head_dim
  • Constant memory regardless of total sequence length
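
Plugging Mistral 7B's configuration into this formula gives a sense of scale. The dtype is an assumption here (16-bit keys/values); the exact layout is an implementation detail:

```python
# KV rolling-buffer size for Mistral 7B's configuration (assuming 16-bit keys/values)
layers, window, heads, head_dim = 32, 4096, 32, 128
bytes_per_value = 2                                       # fp16/bf16 assumption
elements = layers * 2 * window * heads * head_dim         # the 2 covers keys and values
print(f"{elements * bytes_per_value / 2**30:.1f} GiB")    # 2.0 GiB, independent of sequence length
```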

2. Buffer Update Mechanism

  • Maintains current position pointer within the window
  • When new tokens arrive, writes them to the buffer at current position
  • Advances position pointer by number of new tokens
  • When buffer fills up, shifts old values out (circular buffer behavior)

3. Rolling Strategy

  • If position + new_tokens > window_size, triggers roll operation
  • Calculates shift amount: how many positions to discard from beginning
  • Shifts entire buffer content left by shift amount
  • Resets position to accommodate new tokens
  • Writes new key/value pairs at updated position

4. Memory Efficiency

  • Full attention cache grows with sequence: O(n) memory per layer
  • Rolling buffer stays constant: O(w) memory per layer
  • For 32K sequence with 4K window: 87.5% memory reduction
  • Enables inference on much longer sequences with same hardware
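
The following is a simplified sketch of the rolling strategy described above; it is not Mistral's actual code, the class and variable names are illustrative, and it assumes at most window-many new tokens arrive per update:

```python
import torch

class RollingKVCache:
    """Fixed-size per-layer KV buffer that keeps only the last `window` positions."""

    def __init__(self, window, n_heads, head_dim, dtype=torch.float16):
        self.window = window
        self.keys = torch.zeros(window, n_heads, head_dim, dtype=dtype)
        self.values = torch.zeros(window, n_heads, head_dim, dtype=dtype)
        self.pos = 0                                  # next write position in the buffer

    def update(self, new_k, new_v):
        """Append (t, n_heads, head_dim) keys/values, rolling out the oldest entries."""
        t = new_k.shape[0]                            # assumes t <= window
        shift = self.pos + t - self.window            # how many old positions to discard
        if shift > 0:                                 # roll: drop the oldest `shift` entries
            self.keys[: self.pos - shift] = self.keys[shift : self.pos].clone()
            self.values[: self.pos - shift] = self.values[shift : self.pos].clone()
            self.pos -= shift
        self.keys[self.pos : self.pos + t] = new_k    # write new keys/values at the updated position
        self.values[self.pos : self.pos + t] = new_v
        self.pos += t
        return self.keys[: self.pos], self.values[: self.pos]
```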

Comparison with Other Approaches

| Feature | Full Attention | Sliding Window | Hierarchical | Sparse |
|---|---|---|---|---|
| Complexity | O(n²) | O(n×w) | O(n×√n) | O(n×k) |
| Receptive Field | Global | L×w | Global | Varies |
| Memory | O(n²) | O(n×w) | O(n×√n) | O(n×k) |
| Quality | Best | Good | Good | Good |
| Implementation | Simple | Medium | Complex | Complex |

Hybrid Approaches

Sliding Window + Global Attention

Longformer-Style Global Tokens:

Combining sliding window attention with a few global tokens creates a powerful hybrid pattern that balances efficiency with global context access.

Architecture Design:

Global Token Strategy:

  • Reserve first N tokens (typically 1-8) as "global" tokens
  • Global tokens use full attention: can attend to and be attended by all positions
  • Remaining tokens use sliding window attention among themselves
  • All tokens can attend to global tokens (asymmetric pattern)

Attention Mask Structure:

For a sequence with G global tokens and sliding window w:

  • Global tokens (0 to G-1): Full attention to all positions (rows of all 1s in mask)
  • Content tokens (G to n): Sliding window + attention to global tokens
  • Pattern: Global region (full) + Band diagonal (sliding window)
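
A small sketch of how such a hybrid mask can be built (boolean convention: True means the position may be attended; the function name is illustrative):

```python
import torch

def longformer_style_mask(n, w, g):
    """Boolean mask (True = may attend): g global tokens plus a sliding window of radius w."""
    pos = torch.arange(n)
    mask = (pos[None, :] - pos[:, None]).abs() <= w   # band-diagonal sliding window
    mask[:g, :] = True                                 # global tokens attend to every position
    mask[:, :g] = True                                 # every position attends to global tokens
    return mask
```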

Use Cases and Benefits:

| Use Case | Global Token Usage | Benefits |
|---|---|---|
| Document QA | Document title, metadata | Query can access summary info + local context |
| Classification | CLS token (like BERT) | Global aggregator + efficient local processing |
| Summarization | Section headers | Structural anchors + detailed local content |
| Code Analysis | Function signatures | High-level structure + detailed implementation |

Complexity:

  • Memory: O(G×n + n×w) ≈ O(n×w) for small G
  • Computation: Similar to memory, dominated by sliding window when G ≪ n

Dilated Sliding Window

Multi-Scale Attention Pattern:

Dilated windows capture both local and distant context by attending to positions at regular intervals (similar to dilated convolutions).

Dilation Mechanism:

Standard Sliding Window:

  • Attends to consecutive positions: [i-w, i-w+1, ..., i, ..., i+w-1, i+w]
  • Dense local coverage

Dilated Sliding Window with dilation d:

  • Attends to positions at intervals: [i-w×d, i-(w-1)×d, ..., i, ..., i+(w-1)×d, i+w×d]
  • Sparse coverage over larger range
  • Same number of attended positions but wider spatial coverage

Hybrid Dilated Pattern:

  • Local window (dilation=1): Immediate neighbors for fine-grained context
  • Dilated window (dilation=2,4,8): Distant positions for long-range patterns
  • Combined: Dense local + sparse distant = effective multi-scale attention

Example Configuration:

For token at position 100 with window size 4:

  • Standard: attends to [96, 97, 98, 99, 100, 101, 102, 103, 104] (9 positions)
  • Dilated (d=2): attends to [92, 94, 96, 98, 100, 102, 104, 106, 108] (9 positions)
  • Hybrid: Local [98, 99, 100, 101, 102] + Dilated [92, 96, 104, 108] (9 positions)
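
A small helper makes the three patterns above concrete (the function name and the particular dilation rates are illustrative):

```python
def window_positions(i, w, n, dilation=1):
    """Positions attended by token i: offsets -w·d, ..., -d, 0, d, ..., w·d, clipped to [0, n)."""
    offsets = range(-w * dilation, w * dilation + 1, dilation)
    return [i + o for o in offsets if 0 <= i + o < n]

n = 1024
print(window_positions(100, 4, n))                  # [96, 97, ..., 104]  standard, 9 positions
print(window_positions(100, 4, n, dilation=2))      # [92, 94, ..., 108]  dilated,  9 positions

# Hybrid: dense local neighbours plus a sparse dilated ring further out (9 unique positions)
local = window_positions(100, 2, n)                 # [98, 99, 100, 101, 102]
distant = window_positions(100, 2, n, dilation=4)   # [92, 96, 100, 104, 108]
print(sorted(set(local) | set(distant)))
```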

Benefits:

  • Wider receptive field without increasing computation
  • Multi-scale features: Local details + global structure
  • Hierarchical patterns: Similar to stacked dilated convolutions

Trade-offs:

  • Misses some intermediate positions (gaps in coverage)
  • Works best when patterns appear at regular intervals (e.g., code indentation, document structure)
  • May need multiple dilation rates for full coverage

Memory and Speed Analysis

Memory Savings

For sequence length n = 32K, window w = 4K:

| Method | Attention Matrix | Relative Size |
|---|---|---|
| Full | 32K × 32K ≈ 1B entries | 100% |
| Sliding | 32K × 4K = 128M entries | 12.5% |
| Savings | ≈896M entries | 87.5% reduction |

Speed Comparison

Performance Characteristics:

Sliding window attention provides dramatic speedups, especially for long sequences:

Sequence Length 8K with Window 512:

  • Full attention: Computes 8K × 8K = 64M attention scores
  • Sliding window: Computes 8K × 512 = 4M attention scores
  • Speedup: ~16× faster, 93.75% fewer operations

Sequence Length 32K with Window 4K:

  • Full attention: 32K × 32K = 1B attention scores
  • Sliding window: 32K × 4K = 128M attention scores
  • Speedup: ~8× faster, 87.5% fewer operations

Key Factors Affecting Speed:

  • Window-to-sequence ratio: Smaller windows provide greater speedups
  • Hardware parallelization: GPUs can process windowed attention very efficiently
  • Memory bandwidth: Reduced matrix size means less data movement
  • Cache utilization: Smaller working set fits better in GPU cache

Best Practices

1. Window Size Selection

Decision Framework:

Choosing the right window size involves balancing context coverage, memory constraints, and computational efficiency:

Short Sequences (≤ 2048 tokens):

  • Recommendation: Use full attention (window = sequence length)
  • Rationale: Performance overhead is minimal, full context improves quality
  • Memory: Still manageable at O(n²) for small n

Medium Sequences (2K-8K tokens):

  • Recommendation: Window size 512-1024
  • Rationale: Balances local context with computational efficiency
  • Receptive field: With 16-32 layers, achieves 8K-32K effective range
  • Use case: Document processing, code analysis

Long Sequences (8K-32K+ tokens):

  • Recommendation: Window size 1024-4096
  • Rationale: Maintains manageable computation while ensuring coverage
  • Receptive field calculation: window × layers ≥ sequence length
  • Example: Mistral uses 4096 window × 32 layers = 131K receptive field for 32K sequences
  • Use case: Long documents, books, transcription

Target Receptive Field Method:

  • Calculate required receptive field for your task
  • Divide by number of layers: window = target_RF / n_layers
  • Round up to nearest power of 2 for efficiency
  • Validate that window × layers ≥ sequence length
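
A sketch of this method as a helper function (the name and the power-of-two rounding heuristic simply follow the steps above):

```python
import math

def choose_window(target_receptive_field, n_layers, seq_len):
    """Window size such that window × layers covers the target receptive field."""
    window = math.ceil(target_receptive_field / n_layers)
    window = 2 ** math.ceil(math.log2(window))       # round up to the nearest power of 2
    assert window * n_layers >= seq_len, "window too small to cover the sequence"
    return window

# e.g. a 32-layer model that should cover 32K-token inputs end to end
print(choose_window(target_receptive_field=32_768, n_layers=32, seq_len=32_768))  # 1024
```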

2. Layer-Specific Windows

Variable Window Strategy:

Using different window sizes across layers can optimize the trade-off between local and global context:

Progressive Window Expansion:

  • Early layers (1-8): Smaller windows (512-1024) focus on local patterns
  • Middle layers (9-16): Medium windows (1024-2048) capture phrase-level context
  • Deep layers (17-32): Larger windows (2048-4096) enable long-range dependencies

Implementation Patterns:

Grouped Window Scaling:

  • Divide layers into groups (e.g., groups of 4)
  • Each group uses progressively larger windows
  • Example: Layers 1-4: 512, Layers 5-8: 1024, Layers 9-12: 2048, etc.

Exponential Growth:

  • window(layer) = base_window × 2^⌊layer/4⌋
  • Starts small for efficiency, grows for coverage
  • Balances computation across layers
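
As a sketch, the schedule above can be expressed as follows (0-indexed layers; the base window of 512 is assumed for illustration):

```python
def layer_window(layer, base_window=512):
    """window(layer) = base_window × 2^⌊layer / 4⌋, with 0-indexed layers."""
    return base_window * 2 ** (layer // 4)

print([layer_window(l) for l in range(16)])
# [512, 512, 512, 512, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 4096, 4096, 4096, 4096]
```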

Task-Adaptive Windows:

  • Code tasks: Larger windows in middle layers (function-level context)
  • Chat tasks: Uniform smaller windows (recent conversation focus)
  • Long documents: Increasing windows toward later layers (narrative flow)

3. Attention Sinks Integration

Combining Sliding Windows with Attention Sinks:

For very long sequences, combining sliding windows with attention sinks provides both efficiency and stability:

Hybrid Attention Pattern:

  • Sink tokens (positions 0-3): Always attended to by all tokens
  • Sliding window: Local context window around current position
  • Effective pattern: Global access to sinks + local sliding window

Attention Mask Structure:

For token at position i with window w and 4 sink tokens:

  1. Positions 0-3: Always accessible (attention sinks)
  2. Positions [i-w, i+w]: Local window (if not overlapping with sinks)
  3. All other positions: Masked out
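
A sketch of the combined mask, using the same boolean convention as the earlier examples (this bidirectional version ignores causality; a decoder would intersect it with a causal mask):

```python
import torch

def sink_plus_window_mask(n, w, n_sinks=4):
    """Boolean mask (True = may attend): local sliding window plus always-visible sink tokens."""
    pos = torch.arange(n)
    mask = (pos[None, :] - pos[:, None]).abs() <= w   # local window around each position
    mask[:, :n_sinks] = True                           # every token can attend to the sinks
    return mask
```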

Benefits:

  • Stability: Sinks prevent attention distribution collapse during streaming
  • Efficiency: Window limits computational cost to O(n × (w + sinks))
  • Quality: Maintains both global anchors and local context

Use Cases:

  • Streaming inference: Process 100K+ token sequences with constant memory
  • Document QA: Sinks hold document metadata, window processes content
  • Long-form generation: Sinks maintain narrative anchors, window handles recent context

Memory Calculation:

  • Standard sliding window: O(n × w)
  • With sinks: O(n × (w + s)) where s = number of sinks (typically 4)
  • Overhead: ~0.1-1% for typical configurations

Limitations and Solutions

Limitation 1: Information Bottleneck

Problem: Information from outside the window must propagate through several layers before it can influence a token.
Solution: Add skip connections or global tokens.

Limitation 2: Fixed Window Size

Problem: Some tasks need variable context.
Solution: Dynamic window sizing based on content.

Limitation 3: Edge Effects

Problem: Tokens at sequence boundaries see less context.
Solution: Padding or cyclic attention.

Production Optimizations

Flash Attention with Sliding Window

Integration Strategy:

Flash Attention's memory-efficient algorithm can be combined with sliding window patterns for maximum efficiency:

Key Integration Points:

1. Attention Masking:

  • Create sliding window mask before attention computation
  • Pass mask to Flash Attention's scaled dot-product function
  • Flash Attention applies mask during tiling, avoiding unnecessary computation

2. Memory Efficiency:

  • Flash Attention already reduces memory from O(n²) to O(n) for full attention
  • Sliding window further reduces computation from O(n²) to O(n×w)
  • Combined: Minimal memory (Flash) + minimal computation (sliding window)

3. Performance Benefits:

  • Speed: 2-3× faster than naive sliding window implementation
  • Memory: Constant memory regardless of sequence length
  • Scalability: Enables 100K+ token sequences on consumer GPUs

4. Implementation Considerations:

  • Use PyTorch's scaled_dot_product_attention with custom mask
  • Set is_causal=False unless using causal sliding window
  • Ensure mask is properly shaped: [batch, heads, seq_len, seq_len] or broadcastable
  • Disable dropout during inference for maximum speed
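
A sketch using PyTorch's scaled_dot_product_attention with an explicit band mask. Note that with a dense attn_mask, recent PyTorch versions may select the memory-efficient backend rather than the FlashAttention kernel, so the exact speedup depends on version and hardware:

```python
import torch
import torch.nn.functional as F

def windowed_sdpa(q, k, v, w):
    """Sliding window attention through PyTorch's fused scaled_dot_product_attention.

    q, k, v: (batch, heads, seq_len, head_dim). The boolean band mask (True = may attend)
    is broadcast across batch and heads.
    """
    n = q.shape[-2]
    pos = torch.arange(n, device=q.device)
    band = (pos[None, :] - pos[:, None]).abs() <= w    # (seq_len, seq_len) sliding window mask
    return F.scaled_dot_product_attention(q, k, v, attn_mask=band[None, None], is_causal=False)
```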

GPU Kernel Optimization

Efficient CUDA Implementation:

Optimized sliding window attention kernels leverage GPU parallelism and memory hierarchy:

Parallelization Strategy:

Thread Organization:

  • One thread per query position (or multiple threads for large head dimensions)
  • Block size: Typically 128-256 threads for optimal occupancy
  • Grid size: Ceil(sequence_length / block_size) blocks

Memory Access Pattern:

1. Coalesced Global Memory Reads:

  • Load query vectors in a coalesced manner (consecutive threads access consecutive memory)
  • Stream keys/values within window bounds from global memory
  • Write outputs coalesced back to global memory

2. Shared Memory Utilization:

  • Load query for current position into shared memory (reused across window)
  • Optionally cache window keys/values in shared memory for reuse
  • Store partial attention scores in shared memory before reduction

3. Register Optimization:

  • Keep frequently used values (scaling factor, position indices) in registers
  • Accumulate attention scores in registers before final write
  • Minimize register spills through careful variable management

Kernel Execution Flow:

  1. Load Query: Thread loads query vector for its assigned position
  2. Compute Window Bounds: Calculate start = max(0, pos - w), end = min(n, pos + w + 1)
  3. Score Computation Loop: Iterate through window positions, compute dot products
  4. Softmax Reduction: Parallel reduction to find max, then compute exp and normalize
  5. Value Aggregation: Weighted sum of values using attention weights
  6. Write Output: Store result to global memory

Performance Optimizations:

| Optimization | Impact | Notes |
|---|---|---|
| Shared memory for queries | 2-3× speedup | Reduces global memory reads |
| Warp-level primitives | 1.5-2× speedup | Use shuffle operations for reductions |
| Loop unrolling | 1.2-1.5× speedup | For fixed small window sizes |
| FP16 computation | 2× speedup on Ampere+ | Use tensor cores when available |
