
Sparse Attention Patterns: Efficient Long-Range Modeling

Sparse attention patterns reduce the quadratic complexity of self-attention by limiting which positions can attend to each other, enabling efficient processing of sequences with thousands or millions of tokens.


The Problem: Quadratic Complexity

Full attention requires computing scores between all token pairs. For a sequence of 32 tokens, that's 32² = 1024 operations. This explodes for long sequences!

Full attention matrix for n = 32: O(n²) time complexity, 1,024 score computations, 1,024 attention weights held in memory.

Why This Is Bad

  • 1K tokens → 1M operations
  • 10K tokens → 100M operations
  • 100K tokens → 10B operations
  • Memory grows quadratically as well

⚠️ The Bottleneck

At 32 tokens, every query looks at 32 keys. Most of these connections aren't even important! We're wasting computation on irrelevant pairs.

The Insight: Most Attention Is Wasted

Studies show tokens mostly attend to nearby positions or a few global ones. What if we only compute attention where it matters?

Sparse vs. full comparison (n = 32, fixed pattern with stride 4 and local window 4):

  • Full attention: 1,024 connections; every token attends to every other token
  • Sparse attention: 327 connections, about 10.2 per token (68% sparsity)
  • Result: roughly 3.1× fewer attention score computations

Real-World Impact

Sparse attention enables processing sequences that would be impossible with full attention.

📄 Longformer
  • 4K → 16K tokens
  • Global + local pattern
  • Document processing

🐦 BigBird
  • Random + window + global pattern
  • 4K token context
  • Question answering

🎬 Axial Transformer
  • O(n√n) complexity
  • Images & video
  • 2D factorization

Complexity Comparison

| Pattern | Complexity | Use Case | Models |
|---|---|---|---|
| Full Attention | O(n²) | Short sequences | BERT, GPT |
| Fixed Sparse | O(n×k) | Structured data | Sparse Transformer |
| Block-Local | O(n×b) | Hierarchical | BlockBERT |
| Global+Local | O(n×w) | Documents | Longformer, BigBird |
| Axial | O(n×√n) | Images/Video | Axial Transformer |
| Random | O(n×r) | General | BigBird |

Where n = sequence length, k = fixed positions, b = block size, w = window size, r = random connections.

The Sparsity Principle

Instead of computing attention between all pairs:

Attention_full(Q, K, V) = softmax(QKᵀ / √d) V        O(n²)

Sparse attention uses patterns:

Attention_sparse(Q, K, V) = softmax((QKᵀ ⊙ M) / √d) V        O(n × k)

where M is a sparse binary mask and k ≪ n is the average number of positions each token attends to. In practice the masked-out scores are set to −∞ before the softmax (rather than multiplied by zero), so their attention weights are exactly zero.
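
A minimal dense PyTorch sketch of this formula (the function name, shapes, and boolean-mask convention are assumptions for illustration). It still materializes the full n × n score matrix, so the real savings come from the sparse kernels discussed later; the point here is only how a mask turns full attention into sparse attention.

```python
import math
import torch

def sparse_attention(Q, K, V, mask):
    """Masked attention: positions where mask is False get -inf scores,
    so their softmax weights are exactly zero.

    Assumed shapes: Q, K, V are (batch, n, d); mask is a boolean (n, n)
    pattern in which every row allows at least one position (e.g. self).
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)    # (batch, n, n)
    scores = scores.masked_fill(~mask, float("-inf"))  # drop disallowed pairs
    return torch.softmax(scores, dim=-1) @ V
```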

Major Sparse Attention Patterns

1. Fixed Pattern (Sparse Transformer)

Concept: Attend to fixed positions at regular intervals (e.g., every k-th token).

How it works:

  • Each token attends to positions at regular stride intervals
  • Combines with local attention window for nearby tokens
  • Reduces complexity from O(n²) to O(n×k) where k is the number of attended positions

Use cases: Structured data with periodic patterns, regular time series
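
A sketch of how such a mask could be built (a token-level simplification, not the exact Sparse Transformer kernel; the function name and defaults are illustrative). The result can be fed to the sparse_attention sketch above.

```python
import torch

def fixed_pattern_mask(n, stride=4, local_window=4):
    """Causal fixed-pattern mask: each query attends to a local window of
    preceding tokens plus the last position of every stride block."""
    idx = torch.arange(n)
    q = idx.unsqueeze(1)                            # query positions
    k = idx.unsqueeze(0)                            # key positions
    local = (q - k >= 0) & (q - k < local_window)   # recent neighbours (incl. self)
    fixed = (k % stride == stride - 1) & (k <= q)   # fixed "summary" positions
    return local | fixed                            # boolean (n, n) pattern
```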

2. Strided/Dilated Pattern

Concept: Skip connections with regular stride, creating diagonal patterns in the attention matrix.

How it works:

  • Different dilation rates per attention head
  • Each head captures different scale dependencies
  • Combines dilated attention with immediate neighbors

Use cases: Multi-scale feature detection, structured sequences
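
A sketch of a per-head dilated mask (the name and defaults are illustrative): giving each head a different dilation rate lets the heads jointly cover several scales.

```python
import torch

def dilated_mask(n, dilation, window=4):
    """Causal mask where each query attends to `window` earlier positions
    spaced `dilation` apart (including itself)."""
    idx = torch.arange(n)
    offset = idx.unsqueeze(1) - idx.unsqueeze(0)    # query index minus key index
    return (offset >= 0) & (offset < window * dilation) & (offset % dilation == 0)

# One dilation rate per head: head 0 sees immediate neighbours, head 3 every 8th token.
head_masks = [dilated_mask(32, dilation=d) for d in (1, 2, 4, 8)]
```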

3. Block-Local Pattern

Concept: Divide sequence into blocks with local attention within each block.

How it works:

  • Sequence split into fixed-size blocks
  • Full attention within blocks
  • No cross-block attention (can be combined with other patterns)
  • Complexity: O(n×b) where b is block size

Use cases: Hierarchical data, document sections, paragraph-level processing
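
A minimal sketch of the block-local mask (the helper name is illustrative); for n = 32 and a block size of 8 it keeps only a quarter of the connections.

```python
import torch

def block_local_mask(n, block_size):
    """Full attention inside fixed-size blocks, none across block boundaries."""
    block_id = torch.arange(n) // block_size        # which block each token is in
    return block_id.unsqueeze(1) == block_id.unsqueeze(0)

mask = block_local_mask(32, block_size=8)
print(mask.float().mean())                          # 0.25 -> 75% sparsity
```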

4. Global + Local (Longformer)

Concept: Combine global tokens that attend to everything with local sliding windows.

How it works:

  • Special global tokens (e.g., CLS) attend to all positions
  • All positions attend to global tokens
  • Regular tokens have local sliding window attention
  • Complexity: O(n×(g+w)) where g = global tokens, w = window size

Use cases: Document classification, long-form QA, information retrieval
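
A token-level sketch of the global + local idea (not the Longformer implementation; names and parameters are illustrative): a symmetric sliding window for every token, plus a few designated global positions.

```python
import torch

def global_local_mask(n, window, global_idx):
    """Sliding-window attention plus global tokens that attend everywhere
    and are attended to by everyone."""
    idx = torch.arange(n)
    dist = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs()
    mask = dist <= window                           # local window (incl. self)
    g = torch.zeros(n, dtype=torch.bool)
    g[list(global_idx)] = True
    mask |= g.unsqueeze(0)                          # everyone -> global tokens
    mask |= g.unsqueeze(1)                          # global tokens -> everyone
    return mask

mask = global_local_mask(32, window=2, global_idx=[0])   # position 0 acts like CLS
```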

5. Axial Attention

Concept: Factorize 2D attention into separate row and column attention.

How it works:

  • For images/video: reshape sequence to 2D grid
  • Apply attention along rows, then along columns
  • Complexity: O(n×√n) for square grids

Use cases: Image processing, video understanding, 2D structured data
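
A self-contained sketch of the row-then-column factorization (no heads or projections; the shapes are assumptions): the grid holds H·W tokens, but each attention matrix is only W×W or H×H.

```python
import math
import torch

def axial_attention(x):
    """Attend along rows, then along columns, of a (H, W, d) grid."""
    def attend(seq):                                # full attention over dim -2
        d = seq.size(-1)
        scores = seq @ seq.transpose(-2, -1) / math.sqrt(d)
        return torch.softmax(scores, dim=-1) @ seq

    x = attend(x)                                   # rows: H sequences of length W
    x = attend(x.transpose(0, 1)).transpose(0, 1)   # columns: W sequences of length H
    return x

out = axial_attention(torch.randn(16, 16, 64))      # 256 tokens, never a 256x256 matrix
```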

6. Random Sparse Pattern (BigBird)

Concept: Combine random connections, sliding window, and global attention.

How it works:

  • Global blocks attend to everything
  • Sliding window for local context
  • Random connections for long-range dependencies
  • Complexity: O(n×(g+w+r)) where r = random connections

Use cases: General-purpose long sequences, question answering, summarization
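
A simplified token-level sketch of combining the three components (the real BigBird operates on blocks of tokens, not single positions; names, defaults, and the seeding are illustrative).

```python
import torch

def bigbird_style_mask(n, window=3, num_global=2, num_random=3, seed=0):
    """Sliding window + a few global tokens + random long-range links."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(n)
    mask = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs() <= window   # sliding window
    mask[:num_global, :] = True                     # global tokens attend to all
    mask[:, :num_global] = True                     # all attend to global tokens
    rand = torch.randint(0, n, (n, num_random), generator=gen)
    mask[torch.arange(n).unsqueeze(1), rand] = True # random keys per query
    return mask
```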

Complexity Analysis

| Pattern | Time Complexity | Space Complexity | Effective Range |
|---|---|---|---|
| Full Attention | O(n²) | O(n²) | Global |
| Fixed Sparse | O(n × k) | O(n × k) | k positions |
| Strided | O(n × n/s) | O(n × n/s) | Every s-th position |
| Block-Local | O(n × b) | O(n × b) | Block size b |
| Global+Local | O(n × (g+w)) | O(n × (g+w)) | Global + window w |
| Axial | O(n × √n) | O(n × √n) | Row + column |
| Random | O(n × r) | O(n × r) | r random positions |

Efficient Implementation

Sparse Matrix Operations

Key optimization strategies:

  1. Sparse Tensor Representation

    • Store only non-zero attention weights
    • Use coordinate (COO) format for sparse matrices (sketched after this list)
    • Reduces memory from O(n²) to O(k×n) where k << n
  2. Block-Sparse Operations

    • Group tokens into blocks
    • Compute attention only for specified block pairs
    • Enables GPU optimization with larger tile sizes
    • Memory-efficient for hierarchical patterns
  3. Implementation Considerations

    • Use specialized sparse matrix kernels (cuSPARSE, PyTorch sparse)
    • Batch sparse operations when possible
    • Balance sparsity ratio vs kernel overhead (too sparse can be slower)
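
A minimal sketch of the COO idea from point 1 (the stand-in pattern and scores here are random placeholders): only the entries allowed by the mask are stored.

```python
import torch

n = 32
mask = torch.rand(n, n) > 0.7                       # stand-in sparse pattern
scores = torch.randn(n, n)                          # stand-in attention scores

indices = mask.nonzero().t()                        # (2, nnz) coordinate list
values = scores[mask]                               # (nnz,) kept entries, same order
sparse_scores = torch.sparse_coo_tensor(indices, values, (n, n))

# Memory: a few numbers per kept entry instead of n*n floats for the dense matrix.
print(values.numel(), "stored entries vs", n * n, "dense entries")
```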

Production Models

Longformer (Global + Local)

Architecture:

  • 768 hidden size, 12 attention heads
  • Window size: 512 tokens per layer
  • CLS token has global attention
  • Max sequence: 4,096 tokens

Key features: Efficient document processing, QA tasks

BigBird (Random + Window + Global)

Architecture:

  • 768 hidden size, 12 attention heads
  • Block size: 64 tokens
  • 3 random blocks + 2 global blocks + 3 sliding window blocks
  • Max sequence: 4,096 tokens

Key features: Theoretically proven to approximate full attention

Sparse Transformer (Fixed Pattern)

Architecture:

  • 1,024 hidden size, 16 attention heads
  • Stride: 128 (attend every 128th position)
  • Local context: 128 tokens
  • Max sequence: 8,192 tokens

Key features: Early sparse attention work, generative modeling

Combining Patterns

Hybrid Sparse Attention

Concept: Use different sparse patterns in different attention heads for better coverage (a code sketch follows the lists below).

Strategy:

  • Divide attention heads across multiple pattern types
  • Each head group uses a different sparsity pattern
  • Fixed pattern heads: capture regular intervals
  • Strided heads: multi-scale dependencies
  • Block-local heads: local context
  • Global-local heads: long-range connections

Benefits:

  • Better coverage of sequence dependencies
  • Redundancy for important connections
  • Pattern specialization per head
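
A sketch of the head-assignment idea, reusing the mask helpers sketched in the pattern sections above (the particular mix of heads and parameters is illustrative, not taken from any published model).

```python
import torch

n, num_heads = 32, 8
head_masks = torch.stack([
    fixed_pattern_mask(n, stride=4, local_window=4),   # heads 0-1: fixed intervals
    fixed_pattern_mask(n, stride=8, local_window=4),
    dilated_mask(n, dilation=2),                        # heads 2-3: multi-scale
    dilated_mask(n, dilation=4),
    block_local_mask(n, block_size=8),                  # heads 4-5: local blocks
    block_local_mask(n, block_size=8),
    global_local_mask(n, window=2, global_idx=[0]),     # heads 6-7: global + local
    global_local_mask(n, window=4, global_idx=[0]),
])                                                      # (num_heads, n, n)

# Apply one pattern per head in a single batched call.
scores = torch.randn(4, num_heads, n, n)                # (batch, heads, n, n)
scores = scores.masked_fill(~head_masks, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```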

Best Practices

1. Pattern Selection Guidelines

Choose based on task type:

| Task Type | Recommended Pattern | Reason |
|---|---|---|
| Classification | Global + Local | Global tokens (CLS) need full context |
| Generation | Sliding Window | Local context most important |
| Image/Video | Axial | Natural 2D structure |
| Long Documents | BigBird / Longformer | Balanced coverage |
| Structured Data | Fixed / Strided | Exploit known patterns |
| Very Long (>4K) | BigBird | Proven theoretical guarantees |

2. Dynamic Sparsity

Concept: Adapt the sparsity pattern to content importance rather than a fixed structure (a code sketch follows the trade-offs below).

How it works:

  • Score each position's importance
  • Select top-k positions to attend to
  • Sparsity ratio adjusts based on content
  • More compute for important sequences

Trade-offs:

  • More flexible than fixed patterns
  • Adds scoring overhead
  • May be less predictable for optimization
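
A sketch of per-query top-k selection (names and shapes are assumptions). Note the scores are still computed densely here, which illustrates the scoring-overhead trade-off mentioned above; a production version would avoid materializing the full score matrix.

```python
import math
import torch

def topk_attention(Q, K, V, k=8):
    """Keep only each query's k highest-scoring keys; mask out the rest."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (batch, n, n)
    kth = scores.topk(k, dim=-1).values[..., -1:]     # k-th best score per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```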

Performance Tips

Memory-Efficient Training

Gradient Checkpointing:

  • Store only sparse pattern indices during forward pass
  • Recompute attention values during backward pass
  • Reduces memory from O(n²) to O(k×n)
  • Essential for very long sequences

Mixed Precision:

  • Use FP16 for sparse matrix operations
  • Keep FP32 for accumulation
  • 2× memory reduction with minimal quality loss
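
A minimal PyTorch autocast/GradScaler sketch of the FP16-compute, FP32-accumulate recipe (assumes a CUDA device; the tiny linear model and random batch are placeholders standing in for a sparse-attention model).

```python
import torch

device = "cuda"
model = torch.nn.Linear(64, 64).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    batch = torch.randn(8, 64, device=device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch).float().pow(2).mean()   # matmul in FP16, reduce in FP32
    scaler.scale(loss).backward()                   # scale to avoid FP16 underflow
    scaler.step(optimizer)                          # unscales, then steps
    scaler.update()
```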

Hardware Optimization

Custom CUDA Kernels:

  • Specialized kernels for block-sparse operations
  • Shared memory for block tiles
  • Coalesced memory access patterns
  • Significant speedup over generic implementations

Implementation Libraries:

  • DeepSpeed: Sparse attention primitives
  • FairScale: Memory-efficient sparse transformers
  • xFormers: Facebook's sparse attention toolkit
  • Triton: Easy custom sparse kernels
