ALiBi: Attention with Linear Biases

Understand ALiBi, the position encoding method that adds linear biases to attention scores, enabling exceptional length extrapolation without position embeddings.


Attention with Linear Biases (ALiBi) revolutionizes position encoding in transformers by directly modifying attention scores based on token distance. Unlike traditional methods that add position embeddings to inputs, ALiBi applies a simple linear penalty: the farther apart two tokens are, the less they attend to each other.

This page provides a comprehensive, step-by-step exploration of ALiBi. Use the interactive visualization below to understand how linear biases create position-aware attention patterns without any learned parameters.

The Position Encoding Challenge

  • Problem: Transformers need position information, but traditional methods limit extrapolation to longer sequences
  • Traditional Solutions: Learned embeddings (fail on longer sequences), sinusoidal encodings (mediocre extrapolation), RoPE (good but complex)
  • ALiBi's Innovation: Add distance-based bias directly to attention scores—zero parameters, excellent extrapolation

How to Use This Visualization

The interactive component below walks you through 8 key steps. Use the navigation controls to explore:

  1. Regular Attention - See the baseline without position info
  2. Distance Matrix - Understand how ALiBi computes token distances
  3. Slope Selection - Learn why different heads use different slopes
  4. Applying Bias - Watch distance × slope create penalties
  5. Modified Scores - See how bias affects attention scores
  6. Attention Patterns - Compare regular vs ALiBi attention distributions
  7. Multi-Head View - Observe diversity across attention heads
  8. Extrapolation - Discover why ALiBi works on unseen sequence lengths

[Interactive visualization: "ALiBi: Attention with Linear Biases - position encoding that adds distance-based penalties to attention scores." Step 1, Regular Attention Scores, shows standard scaled dot-product attention (Q·K^T / √d) with no position encoding: scores depend purely on content similarity, and the attention distribution for any query position is nearly uniform across the sequence.]

Problem: Without position information, tokens at different positions are treated identically. The model can't distinguish "cat sat on mat" from "mat sat on cat"!

ALiBi Technical Details

Advantages

  • ✅ Zero learned parameters
  • ✅ Excellent length extrapolation
  • ✅ Built-in locality bias
  • ✅ Simple implementation
  • ✅ Used in BLOOM (176B), MPT

Limitations

  • ❌ Fixed linear decay pattern
  • ❌ Less flexible than learned PE
  • ❌ Slope tuning may be needed
  • ❌ May not suit all tasks

Formula

Attention_ALiBi(Q, K, V) = softmax((Q·K^T / √d) - m·|i-j|) V

where m is the head-specific slope and |i-j| is the distance between positions

The Core Innovation: Bias in Attention Scores

ALiBi's key insight is deceptively simple: add a distance-based penalty directly to attention scores before softmax.

Attention_ALiBi(Q, K, V) = softmax(Q·K^T / √d - m·|i - j|) V

Breaking Down the Formula:

| Component | Meaning | Effect |
| --- | --- | --- |
| Q·K^T / √d | Standard scaled dot-product | Content-based similarity |
| m | Head-specific slope | Controls penalty strength |
| \|i - j\| | Absolute position distance | How far apart tokens are |
| -m · \|i - j\| | Linear penalty | Farther = more negative = less attention |

Key Advantages:

  • Zero parameters - No position embeddings to learn
  • Natural locality - Built-in bias toward nearby tokens
  • Excellent extrapolation - Works on sequences longer than training

Step-by-Step: How ALiBi Modifies Attention

The visualization above demonstrates this process interactively. Here's the mathematical flow:

Step 1: Compute Standard Attention Scores

Start with regular scaled dot-product attention:

S_ij = (q_i · k_j) / √d_k

At this stage, there's no position information—only content-based similarity.

Step 2: Calculate Position Distances

For every pair of positions (i, j), compute absolute distance:

D_ij = |i - j|

This creates a symmetric distance matrix whose diagonal elements (i = j) are zero.

Step 3: Apply Head-Specific Slopes

Each attention head h gets a unique slope m_h:

m_h = 2^(-8(h+1)/n) = 2^(-8/n) · (2^(-8/n))^h    for head h ∈ [0, n-1]

Why geometric progression? This gives heads exponentially different ranges:

  • Head 0 (steep slope): Strong local focus
  • Head n-1 (gentle slope): More global attention

Step 4: Compute Bias Matrix

B^h_ij = -m_h · D_ij

Negative values penalize distant positions. The farther apart, the more negative.

Step 5: Add Bias to Scores

S'^h_ij = S_ij + B^h_ij

This directly modifies attention scores before softmax.

Step 6: Apply Softmax

A^h_ij = exp(S'^h_ij) / Σ_k exp(S'^h_ik)

Because distant positions have more negative scores, they receive exponentially less attention after softmax!
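The following sketch (a minimal NumPy walkthrough with toy random scores and the head-0 slope of an 8-head model) traces these six steps for a single head; all names and values are illustrative:

```python
import numpy as np

np.random.seed(0)
seq_len, d_k = 6, 16

# Step 1: standard scaled dot-product scores S_ij = q_i . k_j / sqrt(d_k)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
S = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)

# Step 2: absolute position distances D_ij = |i - j|
pos = np.arange(seq_len)
D = np.abs(pos[:, None] - pos[None, :])         # symmetric, zero diagonal

# Step 3: a head-specific slope, e.g. head 0 of an 8-head model -> m = 0.5
m = 0.5

# Step 4: bias matrix B_ij = -m * D_ij (more negative for distant pairs)
B = -m * D

# Step 5: add the bias to the scores before softmax
S_alibi = S + B

# Step 6: softmax over keys; distant tokens get exponentially less weight
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)

A_plain = softmax(S)
A_alibi = softmax(S_alibi)
print(np.round(A_plain[3], 3))   # fairly flat, content-only attention
print(np.round(A_alibi[3], 3))   # concentrated around position 3
```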

Implementation Architecture Overview

ALiBi modifies standard transformer attention with minimal changes. Understanding the architecture helps clarify why it's so effective.

Core Components:

1. Standard Attention Foundation

ALiBi builds on top of regular multi-head attention. The Q, K, V projections remain unchanged:

  • Query, Key, Value are still learned linear transformations
  • Multi-head splitting still happens (e.g., 8 heads for a 512-dim model means 64-dim per head)
  • Output projection combines heads as usual

The only difference: Add a bias matrix before softmax.

2. Slope Generation Strategy

For models with varying head counts, ALiBi uses a smart approach:

Power-of-2 Heads (2, 4, 8, 16, 32...):

  • Use the clean geometric formula directly
  • Each head's slope is the previous one multiplied by 2^(-8/n) (exactly half when n = 8)
  • Mathematically elegant and efficient

Non-Power-of-2 Heads (3, 6, 12, 24...):

  • Compute slopes for the nearest power of 2
  • Interpolate additional slopes from the next power of 2
  • Example: 12 heads uses slopes from 8-head and 16-head configurations
  • Ensures smooth coverage across attention ranges
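A sketch of this slope-selection routine, following the scheme described above (the function name get_alibi_slopes is illustrative):

```python
import math

def get_alibi_slopes(n_heads: int) -> list[float]:
    """Geometric slopes for ALiBi; non-power-of-2 head counts borrow
    every other slope from the next power of 2."""
    def power_of_2_slopes(n: int) -> list[float]:
        start = 2 ** (-8 / n)                          # base factor 2^(-8/n)
        return [start ** (i + 1) for i in range(n)]    # m_h = 2^(-8(h+1)/n)

    if math.log2(n_heads).is_integer():
        return power_of_2_slopes(n_heads)

    closest = 2 ** math.floor(math.log2(n_heads))      # nearest lower power of 2
    extra = power_of_2_slopes(2 * closest)[0::2]       # interpolated slopes
    return power_of_2_slopes(closest) + extra[: n_heads - closest]

print(get_alibi_slopes(8))        # [0.5, 0.25, ..., 0.00390625]
print(len(get_alibi_slopes(12)))  # 12 slopes drawn from the 8- and 16-head sets
```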

3. Bias Matrix Construction

The bias matrix has specific properties that enable efficiency:

Precomputation Strategy:

  • Create a maximum-size bias matrix once (e.g., for 2048 tokens)
  • Store it as a model buffer (not a trainable parameter)
  • For shorter sequences, slice the needed portion
  • For longer sequences during inference, compute on-the-fly

Memory Efficiency:

  • Option 1: Store full matrix (fast but memory-heavy)
  • Option 2: Compute on-the-fly (slower but memory-light)
  • Option 3: Cache common lengths (balanced approach)

Symmetry Property: The bias matrix is symmetric because distance is symmetric: |i - j| = |j - i|. This means:

  • Token 2 → Token 5 has the same bias as Token 5 → Token 2
  • Only need to compute upper/lower triangle
  • Can optimize storage with this property
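A sketch of this precomputation in PyTorch (names are illustrative; slopes hard-coded for 8 heads):

```python
import torch

def build_alibi_bias(slopes: torch.Tensor, max_len: int) -> torch.Tensor:
    """Precompute an (n_heads, max_len, max_len) ALiBi bias tensor.

    The distance matrix is symmetric (|i - j| = |j - i|) with a zero
    diagonal, so the bias is too; it is stored once, not learned.
    """
    pos = torch.arange(max_len)
    dist = (pos[None, :] - pos[:, None]).abs()       # (max_len, max_len)
    return -slopes.view(-1, 1, 1) * dist             # broadcast one slope per head

slopes = torch.tensor([0.5 ** (i + 1) for i in range(8)])   # 8-head slopes
alibi_full = build_alibi_bias(slopes, max_len=2048)

# Inside a module this would typically live as a non-trainable buffer, e.g.
#   self.register_buffer("alibi_bias", alibi_full, persistent=False)
# and be sliced for shorter sequences:
alibi_512 = alibi_full[:, :512, :512]
```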

4. Forward Pass Integration

ALiBi integrates into attention with three steps:

Step A: Compute Standard Scores

  • Multiply Query and Key matrices (Q·K^T)
  • Scale by square root of dimension (√d_k)
  • This gives content-based similarity scores

Step B: Add Bias

  • Select or compute bias matrix for current sequence length
  • Add bias element-wise to scores
  • This injects position information

Step C: Standard Attention Completion

  • Apply causal mask (for decoder models)
  • Compute softmax across key dimension
  • Multiply by Values
  • Combine heads and project
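A sketch of steps A-C for a causal decoder (a standalone function with illustrative names, assuming batch-first (batch, heads, seq, dim) tensors):

```python
import math
import torch
import torch.nn.functional as F

def alibi_attention(q, k, v, alibi_bias):
    """q, k, v: (batch, n_heads, seq_len, head_dim).
    alibi_bias: (n_heads, max_len, max_len), precomputed as -m_h * |i - j|."""
    seq_len, head_dim = q.shape[-2], q.shape[-1]

    # Step A: content-based similarity, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)

    # Step B: inject position information by adding the ALiBi bias
    scores = scores + alibi_bias[:, :seq_len, :seq_len]

    # Step C: causal mask, softmax over keys, weighted sum of values
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v
```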

5. Handling Variable Sequence Lengths

Production models see varying lengths. ALiBi handles this elegantly:

Training Time:

  • Fixed maximum length (e.g., 2048 tokens)
  • Precomputed bias matrix covers all positions
  • Efficient batching with padding

Inference Time:

  • Can exceed training length!
  • Two approaches:
    • Extend bias matrix using the same formula
    • Compute bias on-demand for new positions
  • No retraining needed—formula generalizes
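A minimal sketch of the second bullet above, recomputing the bias with the same formula when a request exceeds the cached length (the helper name is illustrative):

```python
import torch

def alibi_bias_for(seq_len: int, cached_bias: torch.Tensor,
                   slopes: torch.Tensor) -> torch.Tensor:
    """Slice the precomputed bias when it is large enough; otherwise rebuild
    it with the same -m_h * |i - j| formula. No retraining is involved."""
    if seq_len <= cached_bias.shape[-1]:
        return cached_bias[:, :seq_len, :seq_len]
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()
    return -slopes.view(-1, 1, 1) * dist
```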

6. Memory Considerations

For a model with:

  • 8 attention heads
  • Maximum sequence length 2048
  • Requires: 8 × 2048 × 2048 = 33M entries

Memory optimization strategies:

Lazy Computation:

  • Don't store full matrix
  • Compute only needed entries during forward pass
  • Trade: CPU time for memory savings

Caching Common Lengths:

  • Precompute for common sizes (512, 1024, 2048)
  • On-demand computation for others
  • Best balance for production

Sharing Across Layers:

  • Same bias matrix for all transformer layers
  • One-time computation, reuse everywhere
  • Massive memory savings in deep models

Key Properties

1. No Position Embeddings

  • Zero parameters for position encoding
  • No memory overhead
  • Simpler architecture

2. Linear Decay with Distance

  • Nearby tokens: Strong attention
  • Distant tokens: Exponentially suppressed
  • Natural locality bias

3. Superior Extrapolation

ALiBi's extrapolation capability is its standout feature:

Training Phase:

  • Train model on sequences up to 1,024 tokens
  • Model learns attention patterns at this scale
  • Bias formula is applied consistently

Inference Phase:

  • Deploy on sequences of 10,000+ tokens
  • Same bias formula extends naturally: -m · |i-j|
  • No retraining, no fine-tuning needed
  • Performance degrades gracefully (not catastrophically)

Why This Works:

  • No position-specific parameters to overfit
  • Relative distance relationship is scale-invariant
  • Linear penalty maintains proportional behavior
  • Model sees longer distances but same pattern

Practical Impact:

  • Train small (cheaper, faster)
  • Deploy large (better context, better results)
  • Critical for production LLMs with varying input lengths

4. Head-Specific Slopes

Different heads learn different ranges:

  • Some heads: Very local (steep slope)
  • Other heads: More global (gentle slope)

Why Geometric Progression for Slopes?

ALiBi uses a carefully designed geometric progression for head slopes. This isn't arbitrary—it creates multi-scale attention naturally.

The Formula:

m_h = 2^(-8(h+1)/n) = 2^(-8/n) · (2^(-8/n))^h    for head h ∈ [0, n-1]

Why this specific formula?

  1. Base factor 2^(-8/n): Chosen empirically to work well across model sizes
  2. Geometric decay: Each head's slope is the previous one multiplied by 2^(-8/n) (halved at each head when n = 8)
  3. Exponential diversity: Creates heads spanning from very local to relatively global

Example Slopes for Common Configurations:

4 Heads (Typical in Small Models):

  • Head 0: 0.25 → Strong locality (range ~4 tokens)
  • Head 1: 0.0625 → Medium-local (range ~16 tokens)
  • Head 2: 0.015625 → Medium-range (range ~64 tokens)
  • Head 3: 0.00390625 → Most global (range ~256 tokens)

8 Heads (Medium Models):

  • Head 0: 0.5 → Very local
  • Head 1: 0.25 → Local
  • Head 2: 0.125 → Medium-local
  • Head 3: 0.0625 → Medium
  • Head 4: 0.03125 → Medium-global
  • Head 5: 0.015625 → Global
  • Head 6: 0.0078125 → Very global
  • Head 7: 0.00390625 → Most global

This creates a spectrum from syntax-focused (local) to semantics-focused (global) heads.

16 Heads (Larger Models):

  • Head 0: 0.707 → Extremely local
  • Head 4: 0.177 → Medium
  • Head 8: 0.044 → Relatively global
  • Head 12: 0.011 → Very global
  • Head 15: 0.0039 → Most global attention range

Doubling heads provides finer-grained coverage of attention ranges.

Attention Range Interpretation:

For a slope m, tokens beyond distance ~1/m receive minimal attention:

  • Slope 0.5 → Effective range ~2 tokens
  • Slope 0.125 → Effective range ~8 tokens
  • Slope 0.0156 → Effective range ~64 tokens
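A quick back-of-envelope check of this heuristic: the linear penalty at distance d multiplies a token's pre-softmax weight by e^(-m·d), so at distance 1/m the factor is e^(-1) ≈ 0.37, and a few multiples beyond that it becomes negligible:

```python
import math

# Suppression factor exp(-m * d) relative to a token at distance 0
# with the same content score.
for m in (0.5, 0.125, 0.015625):
    cutoff = round(1 / m)
    for d in (cutoff, 4 * cutoff):
        print(f"slope {m:<9} distance {d:>4}: factor {math.exp(-m * d):.3f}")
# e.g. slope 0.5: distance 2 -> 0.368, distance 8 -> 0.018
```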

Comparison with Other Methods

| Feature | ALiBi | RoPE | Learned PE | Sinusoidal |
| --- | --- | --- | --- | --- |
| Parameters | 0 | 0 | O(L×D) | 0 |
| Extrapolation | Excellent | Good | Poor | Good |
| Simplicity | Very High | Medium | High | High |
| Local Bias | Built-in | No | No | No |
| Used In | BLOOM, MPT | LLaMA | BERT | Original Transformer |

Why ALiBi Excels at Extrapolation

The Fundamental Insight:

Traditional position encodings (learned or sinusoidal) add position information to input embeddings. The model learns to associate specific embedding patterns with specific positions. When you go beyond training length, the model sees position patterns it's never encountered → performance collapses.

ALiBi's Advantage:

ALiBi doesn't encode position in embeddings. Instead, it encodes relative distance in attention scores. The formula -m · |i-j| works the same whether |i-j| = 5 or |i-j| = 500!

Three Keys to Extrapolation:

  1. Parameter-Free: No learned position parameters that overfit to training lengths
  2. Linear Penalty: The penalty grows in direct proportion to distance, so distance 1,000 incurs exactly 10× the penalty of distance 100
  3. Consistent Relative Behavior: The attention pattern shape stays similar at any scale

Experimental Results (Perplexity)

| Train Length | Test Length | Extrapolation | ALiBi | RoPE | Learned PE |
| --- | --- | --- | --- | --- | --- |
| 1024 | 1024 | 1× (baseline) | 10.2 | 10.3 | 10.2 |
| 1024 | 2048 | 2× | 10.4 | 11.1 | 15.3 ❌ |
| 1024 | 4096 | 4× | 10.6 | 13.2 ❌ | 25.7 ❌ |
| 1024 | 8192 | 8× | 10.9 | 18.5 ❌ | Diverges ❌ |
| 1024 | 16384 | 16× | 11.3 | 35+ ❌ | N/A |

Lower perplexity is better. ALiBi maintains near-baseline performance even at 16× training length!

Key Takeaway: Train on 1K tokens, deploy on 10K+ tokens with minimal degradation. This is transformative for production LLMs where context length matters.

Production Models Using ALiBi

ALiBi has proven itself in real-world, large-scale language models. Here's how major models configure and use it:

BLOOM (176B Parameters)

Configuration:

  • Model Size: 14,336 hidden dimensions
  • Attention Heads: 112 heads
  • Position Encoding: ALiBi only (no learned embeddings)
  • Training Length: 2,048 tokens
  • Inference Capability: Can handle significantly longer sequences

Why BLOOM Chose ALiBi:

  • Zero position embedding parameters saves memory in a 176B model
  • Multilingual training benefits from not overfitting to specific sequence lengths
  • Inference flexibility for various downstream tasks
  • Proven extrapolation allows deployment on longer contexts

Architecture Impact:

  • 112 heads means very fine-grained attention range coverage
  • Slopes range from extremely local to relatively global
  • Each head specializes in a specific distance scale
  • No position embedding layer → simpler architecture

MPT (7B and 30B Parameters)

Configuration:

  • Model Size: 4,096 hidden dimensions (7B variant)
  • Attention Heads: 32 heads
  • Position Encoding: ALiBi exclusively
  • Training Length: 2,048 tokens
  • Extrapolation Factor: 4× (can handle 8K at inference)
  • Special Feature: alibi_bias_max parameter for clamping

MPT's Refinements:

  • Bias Clamping: Limits maximum penalty to prevent extreme suppression
  • Optimized for Code: ALiBi's locality bias helps with programming syntax
  • Training Efficiency: Smaller position encoding footprint speeds training
  • Commercial Deployment: Reliable extrapolation critical for production

Performance Characteristics:

  • Trained on 2K, deploys on 8K with <5% degradation
  • Faster inference than models with position embeddings
  • Better few-shot learning due to flexible context

Combining ALiBi with Other Techniques

ALiBi's simplicity makes it compatible with advanced attention optimizations:

ALiBi + Flash Attention

How They Combine: Flash Attention optimizes the attention computation itself (reducing memory and increasing speed), while ALiBi provides position information. They're orthogonal improvements.

Integration Approach:

  • Flash Attention accepts custom bias matrices
  • Pass ALiBi bias to Flash Attention's bias parameter
  • Get both speed (Flash) and extrapolation (ALiBi)
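One way to sketch this wiring is with PyTorch's torch.nn.functional.scaled_dot_product_attention, which accepts an additive floating-point attn_mask; whether the flash kernel or another fused backend is actually selected depends on the PyTorch version, dtype, and hardware, and recent versions of the dedicated flash-attn library also accept ALiBi slopes directly (check your version). Treat this as illustrative:

```python
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# ALiBi bias for 8 heads, combined with a causal mask into one additive mask
slopes = torch.tensor([0.5 ** (i + 1) for i in range(n_heads)])
pos = torch.arange(seq_len)
dist = (pos[None, :] - pos[:, None]).abs()
alibi = -slopes.view(-1, 1, 1) * dist                              # (H, L, L)
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
attn_mask = alibi + causal                                         # broadcasts over batch

# A float attn_mask is added to the attention scores before softmax;
# the fused backend chosen depends on the runtime environment.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```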

Benefits:

  • 2-4× faster attention with Flash
  • Zero position parameters with ALiBi
  • Excellent length extrapolation maintained
  • Critical for long-context models (GPT-4 scale)

Used In:

  • MPT models use this combination
  • Many recent open-source LLMs
  • Production systems requiring both speed and long context

ALiBi + Sliding Window Attention

Motivation: For extremely long sequences (100K+ tokens), even ALiBi can struggle. Sliding window restricts attention to nearby tokens.

Combined Strategy:

  1. Sliding Window: Hard mask limiting attention to ±window_size
  2. ALiBi Bias: Soft preference for nearby tokens within window
  3. Together: Window provides hard cutoff, ALiBi provides gradient within window

Architecture:

  • Each token attends to only W nearby tokens (e.g., W=512)
  • Within that window, ALiBi bias still applies
  • Computational complexity: O(n × W) instead of O(n²)
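A sketch of the combined mask (dense for clarity; a production kernel would use a banded layout to realize the O(n × W) cost, and the window parameter is illustrative):

```python
import torch

def windowed_alibi_mask(slopes: torch.Tensor, seq_len: int,
                        window: int) -> torch.Tensor:
    """Additive attention mask: -m_h * |i - j| inside the window, -inf outside."""
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()            # (seq_len, seq_len)
    mask = -slopes.view(-1, 1, 1) * dist                  # soft ALiBi gradient
    return mask.masked_fill(dist[None] > window, float("-inf"))

slopes = torch.tensor([0.5 ** (i + 1) for i in range(8)])
mask = windowed_alibi_mask(slopes, seq_len=2048, window=512)
# Add `mask` to the attention scores before softmax, as in the earlier sketches.
```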

Use Cases:

  • Document processing (100K+ tokens)
  • Long-form code analysis
  • Multi-document question answering
  • Extreme context scenarios

Trade-offs:

  • Lose true global attention
  • Gain linear complexity
  • ALiBi helps make window feel less abrupt
  • Good for tasks where distant context less critical

ALiBi + Grouped Query Attention (GQA)

Synergy: GQA reduces memory by sharing K/V across multiple query heads. ALiBi adds no memory overhead.

Combined Benefits:

  • GQA: Fewer K/V projections → less memory
  • ALiBi: No position embeddings → less memory
  • Together: Extremely memory-efficient architecture

Production Impact:

  • Enables larger models on same hardware
  • Faster inference due to reduced memory movement
  • Maintains quality with both techniques

When to Use ALiBi vs Alternatives

Use ALiBi When:

  • ✅ Long context is critical - documents, code, conversations
  • ✅ Inference length >> training length - deploy on longer sequences
  • ✅ Memory is constrained - no position embedding parameters
  • ✅ Locality matters - tasks benefit from strong local bias
  • ✅ Simplicity is valued - easy to implement and debug

Example Applications:

  • Long-form text generation (stories, articles)
  • Code completion with large context
  • Conversational AI with long chat histories
  • Document Q&A with full context

Consider Alternatives When:

RoPE (Rotary Position Embeddings):

  • Used in: LLaMA, PaLM, GPT-NeoX
  • Best for: Balance of extrapolation and flexibility
  • Trade-off: More complex implementation

Learned Position Embeddings:

  • Used in: BERT, original GPT
  • Best for: Fixed-length tasks, when you won't need extrapolation
  • Trade-off: Terrible extrapolation, extra parameters

No Position Encoding (rare):

  • Used in: Some vision transformers after initial patches
  • Best for: When position truly doesn't matter
  • Trade-off: Loses sequence order information

Comparison Table:

| Feature | ALiBi | RoPE | Learned PE | Sinusoidal |
| --- | --- | --- | --- | --- |
| Parameters | 0 | 0 | O(L×D) | 0 |
| Extrapolation | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | ⭐⭐ |
| Locality Bias | Built-in ✅ | No ❌ | No ❌ | No ❌ |
| Implementation | Very Simple | Medium | Simple | Simple |
| Memory | Minimal | Minimal | High | Minimal |
| Production Use | BLOOM, MPT | LLaMA, PaLM | BERT, GPT-2 | Transformer (2017) |

Implementation Strategies for Production

Strategy 1: Precomputation for Speed

Approach: Cache bias matrices for common sequence lengths during model initialization.

When to Use:

  • Inference-heavy workloads
  • Fixed or predictable sequence lengths
  • Memory is available

How It Works:

  • During initialization, compute bias matrices for lengths like 512, 1024, 2048, 4096, 8192
  • Store these as non-trainable buffers
  • At inference, retrieve the appropriate cached matrix
  • If needed length not cached, compute on-the-fly

Memory Trade-off:

  • 8 heads × 2048 × 2048 ≈ 34M floats ≈ 134MB for the 2048-token matrix
  • Cost grows quadratically with length, so the largest cached length dominates: the 4096-token matrix is ~537MB and the 8192-token matrix ~2.1GB
  • Still acceptable for many production scenarios; trim the list of cached lengths if memory is tight

Speed Benefit:

  • Avoids recomputation every forward pass
  • Particularly beneficial for batch processing
  • Can be >10× faster than computing each time

Strategy 2: On-the-Fly Computation

Approach: Compute bias matrix during each forward pass.

When to Use:

  • Memory-constrained environments
  • Highly variable sequence lengths
  • Training (where speed less critical)

How It Works:

  • Create position indices for current sequence length
  • Compute distances: |i - j| for all pairs
  • Multiply by slopes
  • Add to attention scores

Memory Benefit:

  • No storage overhead
  • Scales to any sequence length
  • Ideal for mobile/edge deployment

Speed Trade-off:

  • Small overhead per forward pass
  • Negligible for large models (attention dominates)
  • Modern hardware makes this fast enough

Strategy 3: Hybrid Caching

Approach: Cache common lengths, compute rare lengths on-demand.

Recommended Setup:

  • Cache: 512, 1024, 2048 (covers 95% of use cases)
  • Compute: Everything else
  • Best of both worlds

Production Best Practice: This is the recommended approach for most deployments—balances speed and memory efficiently.
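A sketch of this hybrid strategy (the class and method names are illustrative; it simply combines the precomputation and on-the-fly paths above):

```python
import torch

class AlibiBiasCache:
    """Hybrid strategy: precompute bias for common lengths, compute the rest."""

    def __init__(self, slopes: torch.Tensor, cached_lengths=(512, 1024, 2048)):
        self.slopes = slopes
        self.cache = {n: self._compute(n) for n in sorted(cached_lengths)}

    def _compute(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        dist = (pos[None, :] - pos[:, None]).abs()
        return -self.slopes.view(-1, 1, 1) * dist

    def get(self, seq_len: int) -> torch.Tensor:
        # Use the smallest cached matrix that covers the request, sliced to
        # size; anything longer than the largest entry is computed on the fly.
        for n in self.cache:
            if seq_len <= n:
                return self.cache[n][:, :seq_len, :seq_len]
        return self._compute(seq_len)

cache = AlibiBiasCache(torch.tensor([0.5 ** (i + 1) for i in range(8)]))
bias = cache.get(777)    # sliced from the cached 1024-token matrix
bias = cache.get(4096)   # not cached: computed on demand
```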

Handling Variable-Length Batches

Challenge: In practice, batches contain sequences of different lengths (with padding).

Solution Approaches:

Approach A: Masking

  • Use maximum length bias matrix
  • Apply attention mask for padding
  • Padded positions don't affect results (masked to -∞)

Approach B: Per-Sequence Bias

  • Compute exact-length bias for each sequence
  • More memory but no wasted computation
  • Better for extreme length variation

Approach C: Bucketed Lengths

  • Group sequences into length buckets (e.g., 0-512, 513-1024)
  • Use bias matrix for bucket maximum
  • Reduces variety while maintaining efficiency
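A sketch of Approach A above: one max-length ALiBi bias shared by the batch, plus an additive padding mask (tensor names are illustrative):

```python
import torch

def batched_alibi_mask(alibi_bias: torch.Tensor,
                       padding_mask: torch.Tensor) -> torch.Tensor:
    """alibi_bias: (n_heads, max_len, max_len) for the padded batch length.
    padding_mask: (batch, max_len) bool, True where a position is a real token.
    Returns an additive mask of shape (batch, n_heads, max_len, max_len)."""
    pad = torch.zeros(padding_mask.shape, dtype=alibi_bias.dtype)
    pad = pad.masked_fill(~padding_mask, float("-inf"))   # block padded *keys*
    # Queries at padded positions end up with all -inf rows; their outputs
    # are discarded downstream, as is the usual convention.
    return alibi_bias[None] + pad[:, None, None, :]
```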

Task-Specific Slope Tuning

While the standard formula works well generally, some tasks benefit from adjusted slopes:

Local-Heavy Tasks (syntax-focused):

  • Steeper slopes: Use base factor 2^(-4/n) instead of 2^(-8/n)
  • Effect: Stronger locality bias
  • Use cases: Code completion, grammar correction
  • Example: Head 0 gets slope 0.7 instead of 0.5

Global-Aware Tasks (semantics-focused):

  • Gentler slopes: Use base factor 2^(-16/n)
  • Effect: Weaker locality bias, more global attention
  • Use cases: Long-form summarization, topic modeling
  • Example: Head 0 gets slope 0.25 instead of 0.5 (for 8 heads)

Balanced Tasks (standard):

  • Default slopes: 2^(-8/n) works well
  • Effect: Proven multi-scale coverage
  • Use cases: General language modeling, chat
  • Recommendation: Start here, tune only if needed
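A sketch of such tuning, parameterizing the base exponent (the 8 in 2^(-8/n)); the function name and argument are my own, and this simplified version assumes a power-of-2 head count:

```python
def tuned_slopes(n_heads: int, base_exponent: float = 8.0) -> list[float]:
    """base_exponent=8 reproduces the standard slopes; 4 gives steeper
    (more local) heads, 16 gives gentler (more global) heads."""
    start = 2 ** (-base_exponent / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

print(round(tuned_slopes(8)[0], 3))        # 0.5   (standard)
print(round(tuned_slopes(8, 4.0)[0], 3))   # 0.707 (steeper: local-heavy tasks)
print(round(tuned_slopes(8, 16.0)[0], 3))  # 0.25  (gentler: global-aware tasks)
```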

Visualizing ALiBi Patterns

To understand ALiBi's effect, visualizing attention patterns helps:

What to Visualize:

  1. Distance Matrix: Shows the symmetric |i-j| structure
  2. Bias Matrix: Distance × slope for each head
  3. Attention Heatmap: After softmax, shows final attention weights

Typical Patterns:

Diagonal Concentration:

  • Strongest attention on diagonal (self-attention)
  • Decays as you move away from diagonal
  • Rate of decay = slope

Head Diversity:

  • Head 0: Tight diagonal band (local)
  • Head N-1: Broader, more diffuse (global)
  • Middle heads: Gradual transition

Length Scaling:

  • Pattern shape stays similar at any sequence length
  • Longer sequences show same relative structure
  • This constancy enables extrapolation

Conclusion: The Power of Simplicity

ALiBi demonstrates that sometimes the simplest solutions are the most effective. By adding a straightforward linear bias to attention scores, ALiBi achieves:

Key Achievements:

  1. Zero-parameter position encoding that matches or exceeds learned approaches
  2. Exceptional length extrapolation - train short, deploy long
  3. Natural inductive bias toward local attention patterns
  4. Production success in models like BLOOM (176B) and MPT (7B, 30B)

The Broader Lesson:

ALiBi's success teaches us to question complexity. Instead of adding position embeddings (parameters, memory, limited generalization), ALiBi asks: "What if we just penalized distant tokens proportionally to their distance?"

This principle—find the simplest modification that encodes the right inductive bias—is powerful across machine learning. ALiBi proves that with the right insight, you can often eliminate parameters while improving performance.

Practical Recommendation:

For new transformer models targeting long-context applications, ALiBi should be your default choice for position encoding. Its combination of simplicity, zero parameters, and excellent extrapolation makes it ideal for production LLMs.

Interactive Learning:

Use the visualization above to build intuition. Watch how:

  • Distance matrices create symmetric patterns
  • Slopes control attention range
  • Linear biases create natural locality
  • Different heads complement each other
  • Extrapolation "just works"
