
Sparse Attention Patterns: Efficient Long-Range Modeling

Sparse attention patterns reduce the quadratic complexity of self-attention by limiting which positions can attend to each other, enabling efficient processing of sequences with thousands or millions of tokens.


The Problem: Quadratic Complexity

Full attention requires computing scores between all token pairs. For a sequence of 32 tokens, that's 32² = 1024 operations. This explodes for long sequences!

Full attention matrix for n = 32: O(n²) time complexity, 1,024 score computations, 1,024 attention weights held in memory.

Why This Is Bad

  • 1K tokens → 1M operations
  • 10K tokens → 100M operations
  • 100K tokens → 10B operations
  • Memory grows quadratically as well

⚠️ The Bottleneck

At 32 tokens, every query looks at 32 keys. Most of these connections aren't even important! We're wasting computation on irrelevant pairs.

The Insight: Most Attention Is Wasted

Studies show tokens mostly attend to nearby positions or a few global ones. What if we only compute attention where it matters?

Sparse vs. full comparison (n = 32, fixed pattern with stride 4 and local window 4):

  • Full attention: 1,024 connections; every token attends to every other token
  • Sparse attention: 327 connections, about 10.2 per token (68% sparsity)
  • Result: roughly 3.1× fewer attention score computations

Real-World Impact

Sparse attention enables processing sequences that would be impossible with full attention.

📄 Longformer
  • 4K → 16K tokens
  • Global + local pattern
  • Document processing

🐦 BigBird
  • Random + window + global pattern
  • 4K token context
  • Question answering

🎬 Axial Transformer
  • O(n√n) complexity
  • Images & video
  • 2D factorization

Complexity Comparison

| Pattern | Complexity | Use Case | Models |
|---|---|---|---|
| Full Attention | O(n²) | Short sequences | BERT, GPT |
| Fixed Sparse | O(n×k) | Structured data | Sparse Transformer |
| Block-Local | O(n×b) | Hierarchical | BlockBERT |
| Global+Local | O(n×w) | Documents | Longformer, BigBird |
| Axial | O(n×√n) | Images/Video | Axial Transformer |
| Random | O(n×r) | General | BigBird |

Where n = sequence length, k = fixed positions, b = block size, w = window size, r = random connections.

The Sparsity Principle

Instead of computing attention between all pairs:

Attention_full(Q, K, V) = softmax(QKᵀ / √d) V        O(n²)

Sparse attention uses patterns:

Attention_sparse(Q, K, V) = softmax((QKᵀ ⊙ M) / √d) V        O(n × k)

where M is a sparse binary mask and k ≪ n is the average number of positions each token attends to. In practice the masked-out scores are set to −∞ before the softmax (rather than multiplied by zero), so their attention weights are exactly zero.
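
A minimal dense PyTorch sketch of this formula (the function name, shapes, and boolean-mask convention are assumptions for illustration). It still materializes the full n × n score matrix, so the real savings come from the sparse kernels discussed later; the point here is only how a mask turns full attention into sparse attention.

```python
import math
import torch

def sparse_attention(Q, K, V, mask):
    """Masked attention: positions where mask is False get -inf scores,
    so their softmax weights are exactly zero.

    Assumed shapes: Q, K, V are (batch, n, d); mask is a boolean (n, n)
    pattern in which every row allows at least one position (e.g. self).
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)    # (batch, n, n)
    scores = scores.masked_fill(~mask, float("-inf"))  # drop disallowed pairs
    return torch.softmax(scores, dim=-1) @ V
```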

Major Sparse Attention Patterns

1. Fixed Pattern (Sparse Transformer)

Concept: Attend to fixed positions at regular intervals (e.g., every k-th token).

How it works:

  • Each token attends to positions at regular stride intervals
  • Combines with local attention window for nearby tokens
  • Reduces complexity from O(n²) to O(n×k) where k is the number of attended positions

Use cases: Structured data with periodic patterns, regular time series
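
A sketch of how such a mask could be built (a token-level simplification, not the exact Sparse Transformer kernel; the function name and defaults are illustrative). The result can be fed to the sparse_attention sketch above.

```python
import torch

def fixed_pattern_mask(n, stride=4, local_window=4):
    """Causal fixed-pattern mask: each query attends to a local window of
    preceding tokens plus the last position of every stride block."""
    idx = torch.arange(n)
    q = idx.unsqueeze(1)                            # query positions
    k = idx.unsqueeze(0)                            # key positions
    local = (q - k >= 0) & (q - k < local_window)   # recent neighbours (incl. self)
    fixed = (k % stride == stride - 1) & (k <= q)   # fixed "summary" positions
    return local | fixed                            # boolean (n, n) pattern
```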

2. Strided/Dilated Pattern

Concept: Skip connections with regular stride, creating diagonal patterns in the attention matrix.

How it works:

  • Different dilation rates per attention head
  • Each head captures different scale dependencies
  • Combines dilated attention with immediate neighbors

Use cases: Multi-scale feature detection, structured sequences
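
A sketch of a per-head dilated mask (the name and defaults are illustrative): giving each head a different dilation rate lets the heads jointly cover several scales.

```python
import torch

def dilated_mask(n, dilation, window=4):
    """Causal mask where each query attends to `window` earlier positions
    spaced `dilation` apart (including itself)."""
    idx = torch.arange(n)
    offset = idx.unsqueeze(1) - idx.unsqueeze(0)    # query index minus key index
    return (offset >= 0) & (offset < window * dilation) & (offset % dilation == 0)

# One dilation rate per head: head 0 sees immediate neighbours, head 3 every 8th token.
head_masks = [dilated_mask(32, dilation=d) for d in (1, 2, 4, 8)]
```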

3. Block-Local Pattern

Concept: Divide sequence into blocks with local attention within each block.

How it works:

  • Sequence split into fixed-size blocks
  • Full attention within blocks
  • No cross-block attention (can be combined with other patterns)
  • Complexity: O(n×b) where b is block size

Use cases: Hierarchical data, document sections, paragraph-level processing
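
A minimal sketch of the block-local mask (the helper name is illustrative); for n = 32 and a block size of 8 it keeps only a quarter of the connections.

```python
import torch

def block_local_mask(n, block_size):
    """Full attention inside fixed-size blocks, none across block boundaries."""
    block_id = torch.arange(n) // block_size        # which block each token is in
    return block_id.unsqueeze(1) == block_id.unsqueeze(0)

mask = block_local_mask(32, block_size=8)
print(mask.float().mean())                          # 0.25 -> 75% sparsity
```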

4. Global + Local (Longformer)

Concept: Combine global tokens that attend to everything with local sliding windows.

How it works:

  • Special global tokens (e.g., CLS) attend to all positions
  • All positions attend to global tokens
  • Regular tokens have local sliding window attention
  • Complexity: O(n×(g+w)) where g = global tokens, w = window size

Use cases: Document classification, long-form QA, information retrieval
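
A token-level sketch of the global + local idea (not the Longformer implementation; names and parameters are illustrative): a symmetric sliding window for every token, plus a few designated global positions.

```python
import torch

def global_local_mask(n, window, global_idx):
    """Sliding-window attention plus global tokens that attend everywhere
    and are attended to by everyone."""
    idx = torch.arange(n)
    dist = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs()
    mask = dist <= window                           # local window (incl. self)
    g = torch.zeros(n, dtype=torch.bool)
    g[list(global_idx)] = True
    mask |= g.unsqueeze(0)                          # everyone -> global tokens
    mask |= g.unsqueeze(1)                          # global tokens -> everyone
    return mask

mask = global_local_mask(32, window=2, global_idx=[0])   # position 0 acts like CLS
```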

5. Axial Attention

Concept: Factorize 2D attention into separate row and column attention.

How it works:

  • For images/video: reshape sequence to 2D grid
  • Apply attention along rows, then along columns
  • Complexity: O(n×√n) for square grids

Use cases: Image processing, video understanding, 2D structured data
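
A self-contained sketch of the row-then-column factorization (no heads or projections; the shapes are assumptions): the grid holds H·W tokens, but each attention matrix is only W×W or H×H.

```python
import math
import torch

def axial_attention(x):
    """Attend along rows, then along columns, of a (H, W, d) grid."""
    def attend(seq):                                # full attention over dim -2
        d = seq.size(-1)
        scores = seq @ seq.transpose(-2, -1) / math.sqrt(d)
        return torch.softmax(scores, dim=-1) @ seq

    x = attend(x)                                   # rows: H sequences of length W
    x = attend(x.transpose(0, 1)).transpose(0, 1)   # columns: W sequences of length H
    return x

out = axial_attention(torch.randn(16, 16, 64))      # 256 tokens, never a 256x256 matrix
```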

6. Random Sparse Pattern (BigBird)

Concept: Combine random connections, sliding window, and global attention.

How it works:

  • Global blocks attend to everything
  • Sliding window for local context
  • Random connections for long-range dependencies
  • Complexity: O(n×(g+w+r)) where r = random connections

Use cases: General-purpose long sequences, question answering, summarization
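
A simplified token-level sketch of combining the three components (the real BigBird operates on blocks of tokens, not single positions; names, defaults, and the seeding are illustrative).

```python
import torch

def bigbird_style_mask(n, window=3, num_global=2, num_random=3, seed=0):
    """Sliding window + a few global tokens + random long-range links."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(n)
    mask = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs() <= window   # sliding window
    mask[:num_global, :] = True                     # global tokens attend to all
    mask[:, :num_global] = True                     # all attend to global tokens
    rand = torch.randint(0, n, (n, num_random), generator=gen)
    mask[torch.arange(n).unsqueeze(1), rand] = True # random keys per query
    return mask
```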

Complexity Analysis

| Pattern | Time Complexity | Space Complexity | Effective Range |
|---|---|---|---|
| Full Attention | O(n²) | O(n²) | Global |
| Fixed Sparse | O(n × k) | O(n × k) | k positions |
| Strided | O(n × n/s) | O(n × n/s) | Every s-th position |
| Block-Local | O(n × b) | O(n × b) | Block size b |
| Global+Local | O(n × (g+w)) | O(n × (g+w)) | Global + window w |
| Axial | O(n × √n) | O(n × √n) | Row + column |
| Random | O(n × r) | O(n × r) | r random positions |

Efficient Implementation

Sparse Matrix Operations

Key optimization strategies:

  1. Sparse Tensor Representation

    • Store only non-zero attention weights
    • Use coordinate (COO) format for sparse matrices (sketched after this list)
    • Reduces memory from O(n²) to O(k×n) where k << n
  2. Block-Sparse Operations

    • Group tokens into blocks
    • Compute attention only for specified block pairs
    • Enables GPU optimization with larger tile sizes
    • Memory-efficient for hierarchical patterns
  3. Implementation Considerations

    • Use specialized sparse matrix kernels (cuSPARSE, PyTorch sparse)
    • Batch sparse operations when possible
    • Balance sparsity ratio vs kernel overhead (too sparse can be slower)
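
A minimal sketch of the COO idea from point 1 (the stand-in pattern and scores here are random placeholders): only the entries allowed by the mask are stored.

```python
import torch

n = 32
mask = torch.rand(n, n) > 0.7                       # stand-in sparse pattern
scores = torch.randn(n, n)                          # stand-in attention scores

indices = mask.nonzero().t()                        # (2, nnz) coordinate list
values = scores[mask]                               # (nnz,) kept entries, same order
sparse_scores = torch.sparse_coo_tensor(indices, values, (n, n))

# Memory: a few numbers per kept entry instead of n*n floats for the dense matrix.
print(values.numel(), "stored entries vs", n * n, "dense entries")
```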

Production Models

Longformer (Global + Local)

Architecture:

  • 768 hidden size, 12 attention heads
  • Window size: 512 tokens per layer
  • CLS token has global attention
  • Max sequence: 4,096 tokens

Key features: Efficient document processing, QA tasks

BigBird (Random + Window + Global)

Architecture:

  • 768 hidden size, 12 attention heads
  • Block size: 64 tokens
  • 3 random blocks + 2 global blocks + 3 sliding window blocks
  • Max sequence: 4,096 tokens

Key features: Theoretically proven to approximate full attention

Sparse Transformer (Fixed Pattern)

Architecture:

  • 1,024 hidden size, 16 attention heads
  • Stride: 128 (attend every 128th position)
  • Local context: 128 tokens
  • Max sequence: 8,192 tokens

Key features: Early sparse attention work, generative modeling

Combining Patterns

Hybrid Sparse Attention

Concept: Use different sparse patterns in different attention heads for better coverage (a code sketch follows the lists below).

Strategy:

  • Divide attention heads across multiple pattern types
  • Each head group uses a different sparsity pattern
  • Fixed pattern heads: capture regular intervals
  • Strided heads: multi-scale dependencies
  • Block-local heads: local context
  • Global-local heads: long-range connections

Benefits:

  • Better coverage of sequence dependencies
  • Redundancy for important connections
  • Pattern specialization per head
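
A sketch of the head-assignment idea, reusing the mask helpers sketched in the pattern sections above (the particular mix of heads and parameters is illustrative, not taken from any published model).

```python
import torch

n, num_heads = 32, 8
head_masks = torch.stack([
    fixed_pattern_mask(n, stride=4, local_window=4),   # heads 0-1: fixed intervals
    fixed_pattern_mask(n, stride=8, local_window=4),
    dilated_mask(n, dilation=2),                        # heads 2-3: multi-scale
    dilated_mask(n, dilation=4),
    block_local_mask(n, block_size=8),                  # heads 4-5: local blocks
    block_local_mask(n, block_size=8),
    global_local_mask(n, window=2, global_idx=[0]),     # heads 6-7: global + local
    global_local_mask(n, window=4, global_idx=[0]),
])                                                      # (num_heads, n, n)

# Apply one pattern per head in a single batched call.
scores = torch.randn(4, num_heads, n, n)                # (batch, heads, n, n)
scores = scores.masked_fill(~head_masks, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```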

Best Practices

1. Pattern Selection Guidelines

Choose based on task type:

| Task Type | Recommended Pattern | Reason |
|---|---|---|
| Classification | Global + Local | Global tokens (CLS) need full context |
| Generation | Sliding Window | Local context most important |
| Image/Video | Axial | Natural 2D structure |
| Long Documents | BigBird / Longformer | Balanced coverage |
| Structured Data | Fixed / Strided | Exploit known patterns |
| Very Long (>4K) | BigBird | Proven theoretical guarantees |

2. Dynamic Sparsity

Concept: Adapt the sparsity pattern to content importance rather than a fixed structure (a code sketch follows the trade-offs below).

How it works:

  • Score each position's importance
  • Select top-k positions to attend to
  • Sparsity ratio adjusts based on content
  • More compute for important sequences

Trade-offs:

  • More flexible than fixed patterns
  • Adds scoring overhead
  • May be less predictable for optimization
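
A sketch of per-query top-k selection (names and shapes are assumptions). Note the scores are still computed densely here, which illustrates the scoring-overhead trade-off mentioned above; a production version would avoid materializing the full score matrix.

```python
import math
import torch

def topk_attention(Q, K, V, k=8):
    """Keep only each query's k highest-scoring keys; mask out the rest."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (batch, n, n)
    kth = scores.topk(k, dim=-1).values[..., -1:]     # k-th best score per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```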

Performance Tips

Memory-Efficient Training

Gradient Checkpointing:

  • Store only sparse pattern indices during forward pass
  • Recompute attention values during backward pass
  • Reduces memory from O(n²) to O(k×n)
  • Essential for very long sequences

Mixed Precision:

  • Use FP16 for sparse matrix operations
  • Keep FP32 for accumulation
  • 2× memory reduction with minimal quality loss
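
A minimal PyTorch autocast/GradScaler sketch of the FP16-compute, FP32-accumulate recipe (assumes a CUDA device; the tiny linear model and random batch are placeholders standing in for a sparse-attention model).

```python
import torch

device = "cuda"
model = torch.nn.Linear(64, 64).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    batch = torch.randn(8, 64, device=device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch).float().pow(2).mean()   # matmul in FP16, reduce in FP32
    scaler.scale(loss).backward()                   # scale to avoid FP16 underflow
    scaler.step(optimizer)                          # unscales, then steps
    scaler.update()
```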

Hardware Optimization

Custom CUDA Kernels:

  • Specialized kernels for block-sparse operations
  • Shared memory for block tiles
  • Coalesced memory access patterns
  • Significant speedup over generic implementations

Implementation Libraries:

  • DeepSpeed: Sparse attention primitives
  • FairScale: Memory-efficient sparse transformers
  • xFormers: Facebook's sparse attention toolkit
  • Triton: Easy custom sparse kernels
