ALiBi: Attention with Linear Biases

Understand ALiBi, the position encoding method that adds linear biases to attention scores, enabling exceptional length extrapolation without position embeddings.


Attention with Linear Biases (ALiBi) revolutionizes position encoding in transformers by directly modifying attention scores based on token distance. Unlike traditional methods that add position embeddings to inputs, ALiBi applies a simple linear penalty: the farther apart two tokens are, the less they attend to each other.

This page provides a comprehensive, step-by-step exploration of ALiBi. Use the interactive visualization below to understand how linear biases create position-aware attention patterns without any learned parameters.

The Position Encoding Challenge

  • Problem: Transformers need position information, but traditional methods limit extrapolation to longer sequences
  • Traditional Solutions: Learned embeddings (fail on longer sequences), sinusoidal encodings (mediocre extrapolation), RoPE (good but complex)
  • ALiBi's Innovation: Add distance-based bias directly to attention scores—zero parameters, excellent extrapolation

How to Use This Visualization

The interactive component below walks you through 8 key steps. Use the navigation controls to explore:

  1. Regular Attention - See the baseline without position info
  2. Distance Matrix - Understand how ALiBi computes token distances
  3. Slope Selection - Learn why different heads use different slopes
  4. Applying Bias - Watch distance × slope create penalties
  5. Modified Scores - See how bias affects attention scores
  6. Attention Patterns - Compare regular vs ALiBi attention distributions
  7. Multi-Head View - Observe diversity across attention heads
  8. Extrapolation - Discover why ALiBi works on unseen sequence lengths

[Interactive visualization: "ALiBi: Attention with Linear Biases - position encoding that adds distance-based penalties to attention scores." Step 1, Regular Attention Scores, shows standard scaled dot-product attention (Q·K^T / √d) with no position encoding: scores depend purely on content similarity, and the attention distribution for any query position is nearly uniform across the sequence.]

Problem: Without position information, tokens at different positions are treated identically. The model can't distinguish "cat sat on mat" from "mat sat on cat"!

ALiBi Technical Details

Advantages

  • ✅ Zero learned parameters
  • ✅ Excellent length extrapolation
  • ✅ Built-in locality bias
  • ✅ Simple implementation
  • ✅ Used in BLOOM (176B), MPT

Limitations

  • ❌ Fixed linear decay pattern
  • ❌ Less flexible than learned PE
  • ❌ Slope tuning may be needed
  • ❌ May not suit all tasks

Formula

Attention_ALiBi(Q, K, V) = softmax((Q·K^T / √d) - m·|i-j|) V

where m is the head-specific slope and |i-j| is the distance between positions

The Core Innovation: Bias in Attention Scores

ALiBi's key insight is deceptively simple: add a distance-based penalty directly to attention scores before softmax.

Attention_ALiBi(Q, K, V) = softmax(Q·K^T / √d - m·|i - j|) V

Breaking Down the Formula:

| Component | Meaning | Effect |
| --- | --- | --- |
| Q·K^T / √d | Standard scaled dot-product | Content-based similarity |
| m | Head-specific slope | Controls penalty strength |
| \|i - j\| | Absolute position distance | How far apart tokens are |
| -m · \|i - j\| | Linear penalty | Farther = more negative = less attention |

Key Advantages:

  • Zero parameters - No position embeddings to learn
  • Natural locality - Built-in bias toward nearby tokens
  • Excellent extrapolation - Works on sequences longer than training

Step-by-Step: How ALiBi Modifies Attention

The visualization above demonstrates this process interactively. Here's the mathematical flow:

Step 1: Compute Standard Attention Scores

Start with regular scaled dot-product attention:

S_ij = (q_i · k_j) / √d_k

At this stage, there's no position information—only content-based similarity.

Step 2: Calculate Position Distances

For every pair of positions (i, j), compute absolute distance:

D_ij = |i - j|

This creates a symmetric distance matrix whose diagonal elements (i = j) are zero.

Step 3: Apply Head-Specific Slopes

Each attention head h gets a unique slope m_h:

m_h = 2^(-8(h+1)/n) = 2^(-8/n) · (2^(-8/n))^h    for head h ∈ [0, n-1]

Why geometric progression? This gives heads exponentially different ranges:

  • Head 0 (steep slope): Strong local focus
  • Head n-1 (gentle slope): More global attention

Step 4: Compute Bias Matrix

B^h_ij = -m_h · D_ij

Negative values penalize distant positions. The farther apart, the more negative.

Step 5: Add Bias to Scores

S'^h_ij = S_ij + B^h_ij

This directly modifies attention scores before softmax.

Step 6: Apply Softmax

A^h_ij = exp(S'^h_ij) / Σ_k exp(S'^h_ik)

Because distant positions have more negative scores, they receive exponentially less attention after softmax!
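The following sketch (a minimal NumPy walkthrough with toy random scores and the head-0 slope of an 8-head model) traces these six steps for a single head; all names and values are illustrative:

```python
import numpy as np

np.random.seed(0)
seq_len, d_k = 6, 16

# Step 1: standard scaled dot-product scores S_ij = q_i . k_j / sqrt(d_k)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
S = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)

# Step 2: absolute position distances D_ij = |i - j|
pos = np.arange(seq_len)
D = np.abs(pos[:, None] - pos[None, :])         # symmetric, zero diagonal

# Step 3: a head-specific slope, e.g. head 0 of an 8-head model -> m = 0.5
m = 0.5

# Step 4: bias matrix B_ij = -m * D_ij (more negative for distant pairs)
B = -m * D

# Step 5: add the bias to the scores before softmax
S_alibi = S + B

# Step 6: softmax over keys; distant tokens get exponentially less weight
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)

A_plain = softmax(S)
A_alibi = softmax(S_alibi)
print(np.round(A_plain[3], 3))   # fairly flat, content-only attention
print(np.round(A_alibi[3], 3))   # concentrated around position 3
```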

Implementation Architecture Overview

ALiBi modifies standard transformer attention with minimal changes. Understanding the architecture helps clarify why it's so effective.

Core Components:

1. Standard Attention Foundation

ALiBi builds on top of regular multi-head attention. The Q, K, V projections remain unchanged:

  • Query, Key, Value are still learned linear transformations
  • Multi-head splitting still happens (e.g., 8 heads for a 512-dim model means 64-dim per head)
  • Output projection combines heads as usual

The only difference: Add a bias matrix before softmax.

2. Slope Generation Strategy

For models with varying head counts, ALiBi uses a smart approach:

Power-of-2 Heads (2, 4, 8, 16, 32...):

  • Use the clean geometric formula directly
  • Each head's slope is the previous one multiplied by 2^(-8/n) (exactly half when n = 8)
  • Mathematically elegant and efficient

Non-Power-of-2 Heads (3, 6, 12, 24...):

  • Compute slopes for the nearest power of 2
  • Interpolate additional slopes from the next power of 2
  • Example: 12 heads uses slopes from 8-head and 16-head configurations
  • Ensures smooth coverage across attention ranges
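A sketch of this slope-selection routine, following the scheme described above (the function name get_alibi_slopes is illustrative):

```python
import math

def get_alibi_slopes(n_heads: int) -> list[float]:
    """Geometric slopes for ALiBi; non-power-of-2 head counts borrow
    every other slope from the next power of 2."""
    def power_of_2_slopes(n: int) -> list[float]:
        start = 2 ** (-8 / n)                          # base factor 2^(-8/n)
        return [start ** (i + 1) for i in range(n)]    # m_h = 2^(-8(h+1)/n)

    if math.log2(n_heads).is_integer():
        return power_of_2_slopes(n_heads)

    closest = 2 ** math.floor(math.log2(n_heads))      # nearest lower power of 2
    extra = power_of_2_slopes(2 * closest)[0::2]       # interpolated slopes
    return power_of_2_slopes(closest) + extra[: n_heads - closest]

print(get_alibi_slopes(8))        # [0.5, 0.25, ..., 0.00390625]
print(len(get_alibi_slopes(12)))  # 12 slopes drawn from the 8- and 16-head sets
```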

3. Bias Matrix Construction

The bias matrix has specific properties that enable efficiency:

Precomputation Strategy:

  • Create a maximum-size bias matrix once (e.g., for 2048 tokens)
  • Store it as a model buffer (not a trainable parameter)
  • For shorter sequences, slice the needed portion
  • For longer sequences during inference, compute on-the-fly

Memory Efficiency:

  • Option 1: Store full matrix (fast but memory-heavy)
  • Option 2: Compute on-the-fly (slower but memory-light)
  • Option 3: Cache common lengths (balanced approach)

Symmetry Property: The bias matrix is symmetric because distance is symmetric: |i - j| = |j - i|. This means:

  • Token 2 → Token 5 has the same bias as Token 5 → Token 2
  • Only need to compute upper/lower triangle
  • Can optimize storage with this property
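A sketch of this precomputation in PyTorch (names are illustrative; slopes hard-coded for 8 heads):

```python
import torch

def build_alibi_bias(slopes: torch.Tensor, max_len: int) -> torch.Tensor:
    """Precompute an (n_heads, max_len, max_len) ALiBi bias tensor.

    The distance matrix is symmetric (|i - j| = |j - i|) with a zero
    diagonal, so the bias is too; it is stored once, not learned.
    """
    pos = torch.arange(max_len)
    dist = (pos[None, :] - pos[:, None]).abs()       # (max_len, max_len)
    return -slopes.view(-1, 1, 1) * dist             # broadcast one slope per head

slopes = torch.tensor([0.5 ** (i + 1) for i in range(8)])   # 8-head slopes
alibi_full = build_alibi_bias(slopes, max_len=2048)

# Inside a module this would typically live as a non-trainable buffer, e.g.
#   self.register_buffer("alibi_bias", alibi_full, persistent=False)
# and be sliced for shorter sequences:
alibi_512 = alibi_full[:, :512, :512]
```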

4. Forward Pass Integration

ALiBi integrates into attention with three steps:

Step A: Compute Standard Scores

  • Multiply Query and Key matrices (Q·K^T)
  • Scale by square root of dimension (√d_k)
  • This gives content-based similarity scores

Step B: Add Bias

  • Select or compute bias matrix for current sequence length
  • Add bias element-wise to scores
  • This injects position information

Step C: Standard Attention Completion

  • Apply causal mask (for decoder models)
  • Compute softmax across key dimension
  • Multiply by Values
  • Combine heads and project
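A sketch of steps A-C for a causal decoder (a standalone function with illustrative names, assuming batch-first (batch, heads, seq, dim) tensors):

```python
import math
import torch
import torch.nn.functional as F

def alibi_attention(q, k, v, alibi_bias):
    """q, k, v: (batch, n_heads, seq_len, head_dim).
    alibi_bias: (n_heads, max_len, max_len), precomputed as -m_h * |i - j|."""
    seq_len, head_dim = q.shape[-2], q.shape[-1]

    # Step A: content-based similarity, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)

    # Step B: inject position information by adding the ALiBi bias
    scores = scores + alibi_bias[:, :seq_len, :seq_len]

    # Step C: causal mask, softmax over keys, weighted sum of values
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v
```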

5. Handling Variable Sequence Lengths

Production models see varying lengths. ALiBi handles this elegantly:

Training Time:

  • Fixed maximum length (e.g., 2048 tokens)
  • Precomputed bias matrix covers all positions
  • Efficient batching with padding

Inference Time:

  • Can exceed training length!
  • Two approaches:
    • Extend bias matrix using the same formula
    • Compute bias on-demand for new positions
  • No retraining needed—formula generalizes
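A minimal sketch of the second bullet above, recomputing the bias with the same formula when a request exceeds the cached length (the helper name is illustrative):

```python
import torch

def alibi_bias_for(seq_len: int, cached_bias: torch.Tensor,
                   slopes: torch.Tensor) -> torch.Tensor:
    """Slice the precomputed bias when it is large enough; otherwise rebuild
    it with the same -m_h * |i - j| formula. No retraining is involved."""
    if seq_len <= cached_bias.shape[-1]:
        return cached_bias[:, :seq_len, :seq_len]
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()
    return -slopes.view(-1, 1, 1) * dist
```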

6. Memory Considerations

For a model with:

  • 8 attention heads
  • Maximum sequence length 2048
  • Requires: 8 × 2048 × 2048 = 33M entries

Memory optimization strategies:

Lazy Computation:

  • Don't store full matrix
  • Compute only needed entries during forward pass
  • Trade: CPU time for memory savings

Caching Common Lengths:

  • Precompute for common sizes (512, 1024, 2048)
  • On-demand computation for others
  • Best balance for production

Sharing Across Layers:

  • Same bias matrix for all transformer layers
  • One-time computation, reuse everywhere
  • Massive memory savings in deep models

Key Properties

1. No Position Embeddings

  • Zero parameters for position encoding
  • No memory overhead
  • Simpler architecture

2. Linear Decay with Distance

  • Nearby tokens: Strong attention
  • Distant tokens: Exponentially suppressed
  • Natural locality bias

3. Superior Extrapolation

ALiBi's extrapolation capability is its standout feature:

Training Phase:

  • Train model on sequences up to 1,024 tokens
  • Model learns attention patterns at this scale
  • Bias formula is applied consistently

Inference Phase:

  • Deploy on sequences of 10,000+ tokens
  • Same bias formula extends naturally: -m · |i-j|
  • No retraining, no fine-tuning needed
  • Performance degrades gracefully (not catastrophically)

Why This Works:

  • No position-specific parameters to overfit
  • Relative distance relationship is scale-invariant
  • Linear penalty maintains proportional behavior
  • Model sees longer distances but same pattern

Practical Impact:

  • Train small (cheaper, faster)
  • Deploy large (better context, better results)
  • Critical for production LLMs with varying input lengths

4. Head-Specific Slopes

Different heads learn different ranges:

  • Some heads: Very local (steep slope)
  • Other heads: More global (gentle slope)

Why Geometric Progression for Slopes?

ALiBi uses a carefully designed geometric progression for head slopes. This isn't arbitrary—it creates multi-scale attention naturally.

The Formula:

m_h = 2^(-8(h+1)/n) = 2^(-8/n) · (2^(-8/n))^h    for head h ∈ [0, n-1]

Why this specific formula?

  1. Base factor 2^(-8/n): Chosen empirically to work well across model sizes
  2. Geometric decay: Each head's slope is the previous one multiplied by 2^(-8/n) (halved at each head when n = 8)
  3. Exponential diversity: Creates heads spanning from very local to relatively global

Example Slopes for Common Configurations:

4 Heads (Typical in Small Models):

  • Head 0: 0.25 → Strong locality (range ~4 tokens)
  • Head 1: 0.0625 → Medium-local (range ~16 tokens)
  • Head 2: 0.015625 → Medium-range (range ~64 tokens)
  • Head 3: 0.00390625 → Most global (range ~256 tokens)

8 Heads (Medium Models):

  • Head 0: 0.5 → Very local
  • Head 1: 0.25 → Local
  • Head 2: 0.125 → Medium-local
  • Head 3: 0.0625 → Medium
  • Head 4: 0.03125 → Medium-global
  • Head 5: 0.015625 → Global
  • Head 6: 0.0078125 → Very global
  • Head 7: 0.00390625 → Most global

This creates a spectrum from syntax-focused (local) to semantics-focused (global) heads.

16 Heads (Larger Models):

  • Head 0: 0.707 → Extremely local
  • Head 4: 0.177 → Medium
  • Head 8: 0.044 → Relatively global
  • Head 12: 0.011 → Very global
  • Head 15: 0.0039 → Most global attention range

Doubling heads provides finer-grained coverage of attention ranges.

Attention Range Interpretation:

For a slope m, tokens beyond distance ~1/m receive minimal attention:

  • Slope 0.5 → Effective range ~2 tokens
  • Slope 0.125 → Effective range ~8 tokens
  • Slope 0.0156 → Effective range ~64 tokens
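A quick back-of-envelope check of this heuristic: the linear penalty at distance d multiplies a token's pre-softmax weight by e^(-m·d), so at distance 1/m the factor is e^(-1) ≈ 0.37, and a few multiples beyond that it becomes negligible:

```python
import math

# Suppression factor exp(-m * d) relative to a token at distance 0
# with the same content score.
for m in (0.5, 0.125, 0.015625):
    cutoff = round(1 / m)
    for d in (cutoff, 4 * cutoff):
        print(f"slope {m:<9} distance {d:>4}: factor {math.exp(-m * d):.3f}")
# e.g. slope 0.5: distance 2 -> 0.368, distance 8 -> 0.018
```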

Comparison with Other Methods

| Feature | ALiBi | RoPE | Learned PE | Sinusoidal |
| --- | --- | --- | --- | --- |
| Parameters | 0 | 0 | O(L×D) | 0 |
| Extrapolation | Excellent | Good | Poor | Good |
| Simplicity | Very High | Medium | High | High |
| Local Bias | Built-in | No | No | No |
| Used In | BLOOM, MPT | LLaMA | BERT | Original Transformer |

Why ALiBi Excels at Extrapolation

The Fundamental Insight:

Traditional position encodings (learned or sinusoidal) add position information to input embeddings. The model learns to associate specific embedding patterns with specific positions. When you go beyond training length, the model sees position patterns it's never encountered → performance collapses.

ALiBi's Advantage:

ALiBi doesn't encode position in embeddings. Instead, it encodes relative distance in attention scores. The formula -m · |i-j| works the same whether |i-j| = 5 or |i-j| = 500!

Three Keys to Extrapolation:

  1. Parameter-Free: No learned position parameters that overfit to training lengths
  2. Linear Penalty: The penalty grows in direct proportion to distance, so distance 1,000 incurs exactly 10× the penalty of distance 100
  3. Consistent Relative Behavior: The attention pattern shape stays similar at any scale

Experimental Results (Perplexity)

| Train Length | Test Length | Extrapolation | ALiBi | RoPE | Learned PE |
| --- | --- | --- | --- | --- | --- |
| 1024 | 1024 | 1× (baseline) | 10.2 | 10.3 | 10.2 |
| 1024 | 2048 | 2× | 10.4 | 11.1 | 15.3 ❌ |
| 1024 | 4096 | 4× | 10.6 | 13.2 ❌ | 25.7 ❌ |
| 1024 | 8192 | 8× | 10.9 | 18.5 ❌ | Diverges ❌ |
| 1024 | 16384 | 16× | 11.3 | 35+ ❌ | N/A |

Lower perplexity is better. ALiBi maintains near-baseline performance even at 16× training length!

Key Takeaway: Train on 1K tokens, deploy on 10K+ tokens with minimal degradation. This is transformative for production LLMs where context length matters.

Production Models Using ALiBi

ALiBi has proven itself in real-world, large-scale language models. Here's how major models configure and use it:

BLOOM (176B Parameters)

Configuration:

  • Model Size: 14,336 hidden dimensions
  • Attention Heads: 112 heads
  • Position Encoding: ALiBi only (no learned embeddings)
  • Training Length: 2,048 tokens
  • Inference Capability: Can handle significantly longer sequences

Why BLOOM Chose ALiBi:

  • Zero position embedding parameters saves memory in a 176B model
  • Multilingual training benefits from not overfitting to specific sequence lengths
  • Inference flexibility for various downstream tasks
  • Proven extrapolation allows deployment on longer contexts

Architecture Impact:

  • 112 heads means very fine-grained attention range coverage
  • Slopes range from extremely local to relatively global
  • Each head specializes in a specific distance scale
  • No position embedding layer → simpler architecture

MPT (7B and 30B Parameters)

Configuration:

  • Model Size: 4,096 hidden dimensions (7B variant)
  • Attention Heads: 32 heads
  • Position Encoding: ALiBi exclusively
  • Training Length: 2,048 tokens
  • Extrapolation Factor: 4× (can handle 8K at inference)
  • Special Feature: alibi_bias_max parameter for clamping

MPT's Refinements:

  • Bias Clamping: Limits maximum penalty to prevent extreme suppression
  • Optimized for Code: ALiBi's locality bias helps with programming syntax
  • Training Efficiency: Smaller position encoding footprint speeds training
  • Commercial Deployment: Reliable extrapolation critical for production

Performance Characteristics:

  • Trained on 2K, deploys on 8K with <5% degradation
  • Faster inference than models with position embeddings
  • Better few-shot learning due to flexible context

Combining ALiBi with Other Techniques

ALiBi's simplicity makes it compatible with advanced attention optimizations:

ALiBi + Flash Attention

How They Combine: Flash Attention optimizes the attention computation itself (reducing memory and increasing speed), while ALiBi provides position information. They're orthogonal improvements.

Integration Approach:

  • Flash Attention accepts custom bias matrices
  • Pass ALiBi bias to Flash Attention's bias parameter
  • Get both speed (Flash) and extrapolation (ALiBi)
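One way to sketch this wiring is with PyTorch's torch.nn.functional.scaled_dot_product_attention, which accepts an additive floating-point attn_mask; whether the flash kernel or another fused backend is actually selected depends on the PyTorch version, dtype, and hardware, and recent versions of the dedicated flash-attn library also accept ALiBi slopes directly (check your version). Treat this as illustrative:

```python
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# ALiBi bias for 8 heads, combined with a causal mask into one additive mask
slopes = torch.tensor([0.5 ** (i + 1) for i in range(n_heads)])
pos = torch.arange(seq_len)
dist = (pos[None, :] - pos[:, None]).abs()
alibi = -slopes.view(-1, 1, 1) * dist                              # (H, L, L)
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
attn_mask = alibi + causal                                         # broadcasts over batch

# A float attn_mask is added to the attention scores before softmax;
# the fused backend chosen depends on the runtime environment.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```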

Benefits:

  • 2-4× faster attention with Flash
  • Zero position parameters with ALiBi
  • Excellent length extrapolation maintained
  • Critical for long-context models (GPT-4 scale)

Used In:

  • MPT models use this combination
  • Many recent open-source LLMs
  • Production systems requiring both speed and long context

ALiBi + Sliding Window Attention

Motivation: For extremely long sequences (100K+ tokens), even ALiBi can struggle. Sliding window restricts attention to nearby tokens.

Combined Strategy:

  1. Sliding Window: Hard mask limiting attention to ±window_size
  2. ALiBi Bias: Soft preference for nearby tokens within window
  3. Together: Window provides hard cutoff, ALiBi provides gradient within window

Architecture:

  • Each token attends to only W nearby tokens (e.g., W=512)
  • Within that window, ALiBi bias still applies
  • Computational complexity: O(n × W) instead of O(n²)
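A sketch of the combined mask (dense for clarity; a production kernel would use a banded layout to realize the O(n × W) cost, and the window parameter is illustrative):

```python
import torch

def windowed_alibi_mask(slopes: torch.Tensor, seq_len: int,
                        window: int) -> torch.Tensor:
    """Additive attention mask: -m_h * |i - j| inside the window, -inf outside."""
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()            # (seq_len, seq_len)
    mask = -slopes.view(-1, 1, 1) * dist                  # soft ALiBi gradient
    return mask.masked_fill(dist[None] > window, float("-inf"))

slopes = torch.tensor([0.5 ** (i + 1) for i in range(8)])
mask = windowed_alibi_mask(slopes, seq_len=2048, window=512)
# Add `mask` to the attention scores before softmax, as in the earlier sketches.
```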

Use Cases:

  • Document processing (100K+ tokens)
  • Long-form code analysis
  • Multi-document question answering
  • Extreme context scenarios

Trade-offs:

  • Lose true global attention
  • Gain linear complexity
  • ALiBi helps make window feel less abrupt
  • Good for tasks where distant context less critical

ALiBi + Grouped Query Attention (GQA)

Synergy: GQA reduces memory by sharing K/V across multiple query heads. ALiBi adds no memory overhead.

Combined Benefits:

  • GQA: Fewer K/V projections → less memory
  • ALiBi: No position embeddings → less memory
  • Together: Extremely memory-efficient architecture

Production Impact:

  • Enables larger models on same hardware
  • Faster inference due to reduced memory movement
  • Maintains quality with both techniques

When to Use ALiBi vs Alternatives

Use ALiBi When:

  • ✅ Long context is critical - documents, code, conversations
  • ✅ Inference length >> training length - deploy on longer sequences
  • ✅ Memory is constrained - no position embedding parameters
  • ✅ Locality matters - tasks benefit from strong local bias
  • ✅ Simplicity is valued - easy to implement and debug

Example Applications:

  • Long-form text generation (stories, articles)
  • Code completion with large context
  • Conversational AI with long chat histories
  • Document Q&A with full context

Consider Alternatives When:

RoPE (Rotary Position Embeddings):

  • Used in: LLaMA, PaLM, GPT-NeoX
  • Best for: Balance of extrapolation and flexibility
  • Trade-off: More complex implementation

Learned Position Embeddings:

  • Used in: BERT, original GPT
  • Best for: Fixed-length tasks, when you won't need extrapolation
  • Trade-off: Terrible extrapolation, extra parameters

No Position Encoding (rare):

  • Used in: Some vision transformers after initial patches
  • Best for: When position truly doesn't matter
  • Trade-off: Loses sequence order information

Comparison Table:

| Feature | ALiBi | RoPE | Learned PE | Sinusoidal |
| --- | --- | --- | --- | --- |
| Parameters | 0 | 0 | O(L×D) | 0 |
| Extrapolation | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | ⭐⭐ |
| Locality Bias | Built-in ✅ | No ❌ | No ❌ | No ❌ |
| Implementation | Very Simple | Medium | Simple | Simple |
| Memory | Minimal | Minimal | High | Minimal |
| Production Use | BLOOM, MPT | LLaMA, PaLM | BERT, GPT-2 | Transformer (2017) |

Implementation Strategies for Production

Strategy 1: Precomputation for Speed

Approach: Cache bias matrices for common sequence lengths during model initialization.

When to Use:

  • Inference-heavy workloads
  • Fixed or predictable sequence lengths
  • Memory is available

How It Works:

  • During initialization, compute bias matrices for lengths like 512, 1024, 2048, 4096, 8192
  • Store these as non-trainable buffers
  • At inference, retrieve the appropriate cached matrix
  • If needed length not cached, compute on-the-fly

Memory Trade-off:

  • 8 heads × 2048 × 2048 ≈ 34M floats ≈ 134MB for the 2048-token matrix
  • Cost grows quadratically with length, so the largest cached length dominates: the 4096-token matrix is ~537MB and the 8192-token matrix ~2.1GB
  • Still acceptable for many production scenarios; trim the list of cached lengths if memory is tight

Speed Benefit:

  • Avoids recomputation every forward pass
  • Particularly beneficial for batch processing
  • Can be >10× faster than computing each time

Strategy 2: On-the-Fly Computation

Approach: Compute bias matrix during each forward pass.

When to Use:

  • Memory-constrained environments
  • Highly variable sequence lengths
  • Training (where speed less critical)

How It Works:

  • Create position indices for current sequence length
  • Compute distances: |i - j| for all pairs
  • Multiply by slopes
  • Add to attention scores

Memory Benefit:

  • No storage overhead
  • Scales to any sequence length
  • Ideal for mobile/edge deployment

Speed Trade-off:

  • Small overhead per forward pass
  • Negligible for large models (attention dominates)
  • Modern hardware makes this fast enough

Strategy 3: Hybrid Caching

Approach: Cache common lengths, compute rare lengths on-demand.

Recommended Setup:

  • Cache: 512, 1024, 2048 (covers 95% of use cases)
  • Compute: Everything else
  • Best of both worlds

Production Best Practice: This is the recommended approach for most deployments—balances speed and memory efficiently.
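A sketch of this hybrid strategy (the class and method names are illustrative; it simply combines the precomputation and on-the-fly paths above):

```python
import torch

class AlibiBiasCache:
    """Hybrid strategy: precompute bias for common lengths, compute the rest."""

    def __init__(self, slopes: torch.Tensor, cached_lengths=(512, 1024, 2048)):
        self.slopes = slopes
        self.cache = {n: self._compute(n) for n in sorted(cached_lengths)}

    def _compute(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        dist = (pos[None, :] - pos[:, None]).abs()
        return -self.slopes.view(-1, 1, 1) * dist

    def get(self, seq_len: int) -> torch.Tensor:
        # Use the smallest cached matrix that covers the request, sliced to
        # size; anything longer than the largest entry is computed on the fly.
        for n in self.cache:
            if seq_len <= n:
                return self.cache[n][:, :seq_len, :seq_len]
        return self._compute(seq_len)

cache = AlibiBiasCache(torch.tensor([0.5 ** (i + 1) for i in range(8)]))
bias = cache.get(777)    # sliced from the cached 1024-token matrix
bias = cache.get(4096)   # not cached: computed on demand
```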

Handling Variable-Length Batches

Challenge: In practice, batches contain sequences of different lengths (with padding).

Solution Approaches:

Approach A: Masking

  • Use maximum length bias matrix
  • Apply attention mask for padding
  • Padded positions don't affect results (masked to -∞)

Approach B: Per-Sequence Bias

  • Compute exact-length bias for each sequence
  • More memory but no wasted computation
  • Better for extreme length variation

Approach C: Bucketed Lengths

  • Group sequences into length buckets (e.g., 0-512, 513-1024)
  • Use bias matrix for bucket maximum
  • Reduces variety while maintaining efficiency
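A sketch of Approach A above: one max-length ALiBi bias shared by the batch, plus an additive padding mask (tensor names are illustrative):

```python
import torch

def batched_alibi_mask(alibi_bias: torch.Tensor,
                       padding_mask: torch.Tensor) -> torch.Tensor:
    """alibi_bias: (n_heads, max_len, max_len) for the padded batch length.
    padding_mask: (batch, max_len) bool, True where a position is a real token.
    Returns an additive mask of shape (batch, n_heads, max_len, max_len)."""
    pad = torch.zeros(padding_mask.shape, dtype=alibi_bias.dtype)
    pad = pad.masked_fill(~padding_mask, float("-inf"))   # block padded *keys*
    # Queries at padded positions end up with all -inf rows; their outputs
    # are discarded downstream, as is the usual convention.
    return alibi_bias[None] + pad[:, None, None, :]
```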

Task-Specific Slope Tuning

While the standard formula works well generally, some tasks benefit from adjusted slopes:

Local-Heavy Tasks (syntax-focused):

  • Steeper slopes: Use base factor 2^(-4/n) instead of 2^(-8/n)
  • Effect: Stronger locality bias
  • Use cases: Code completion, grammar correction
  • Example: Head 0 gets slope 0.7 instead of 0.5

Global-Aware Tasks (semantics-focused):

  • Gentler slopes: Use base factor 2^(-16/n)
  • Effect: Weaker locality bias, more global attention
  • Use cases: Long-form summarization, topic modeling
  • Example: Head 0 gets slope 0.25 instead of 0.5 (for 8 heads)

Balanced Tasks (standard):

  • Default slopes: 2^(-8/n) works well
  • Effect: Proven multi-scale coverage
  • Use cases: General language modeling, chat
  • Recommendation: Start here, tune only if needed
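A sketch of such tuning, parameterizing the base exponent (the 8 in 2^(-8/n)); the function name and argument are my own, and this simplified version assumes a power-of-2 head count:

```python
def tuned_slopes(n_heads: int, base_exponent: float = 8.0) -> list[float]:
    """base_exponent=8 reproduces the standard slopes; 4 gives steeper
    (more local) heads, 16 gives gentler (more global) heads."""
    start = 2 ** (-base_exponent / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

print(round(tuned_slopes(8)[0], 3))        # 0.5   (standard)
print(round(tuned_slopes(8, 4.0)[0], 3))   # 0.707 (steeper: local-heavy tasks)
print(round(tuned_slopes(8, 16.0)[0], 3))  # 0.25  (gentler: global-aware tasks)
```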

Visualizing ALiBi Patterns

To understand ALiBi's effect, visualizing attention patterns helps:

What to Visualize:

  1. Distance Matrix: Shows the symmetric |i-j| structure
  2. Bias Matrix: Distance × slope for each head
  3. Attention Heatmap: After softmax, shows final attention weights

Typical Patterns:

Diagonal Concentration:

  • Strongest attention on diagonal (self-attention)
  • Decays as you move away from diagonal
  • Rate of decay = slope

Head Diversity:

  • Head 0: Tight diagonal band (local)
  • Head N-1: Broader, more diffuse (global)
  • Middle heads: Gradual transition

Length Scaling:

  • Pattern shape stays similar at any sequence length
  • Longer sequences show same relative structure
  • This constancy enables extrapolation

Conclusion: The Power of Simplicity

ALiBi demonstrates that sometimes the simplest solutions are the most effective. By adding a straightforward linear bias to attention scores, ALiBi achieves:

Key Achievements:

  1. Zero-parameter position encoding that matches or exceeds learned approaches
  2. Exceptional length extrapolation - train short, deploy long
  3. Natural inductive bias toward local attention patterns
  4. Production success in models like BLOOM (176B) and MPT (7B, 30B)

The Broader Lesson:

ALiBi's success teaches us to question complexity. Instead of adding position embeddings (parameters, memory, limited generalization), ALiBi asks: "What if we just penalized distant tokens proportionally to their distance?"

This principle—find the simplest modification that encodes the right inductive bias—is powerful across machine learning. ALiBi proves that with the right insight, you can often eliminate parameters while improving performance.

Practical Recommendation:

For new transformer models targeting long-context applications, ALiBi should be your default choice for position encoding. Its combination of simplicity, zero parameters, and excellent extrapolation makes it ideal for production LLMs.

Interactive Learning:

Use the visualization above to build intuition. Watch how:

  • Distance matrices create symmetric patterns
  • Slopes control attention range
  • Linear biases create natural locality
  • Different heads complement each other
  • Extrapolation "just works"
