Context Windows: The Memory Limits of LLMs

Interactive visualization of context window mechanisms in LLMs - sliding windows, expanding contexts, and attention patterns that define what models can "remember".

Context Windows in Large Language Models

Context windows define the maximum amount of text an LLM can process at once - the model's "working memory." This fundamental constraint shapes how models understand and generate text, from simple queries to complex documents.

Interactive Context Window Explorer

Explore different windowing strategies and see how they affect token processing:

[Interactive widget: an 8-token window steps over a 32-token stream ("The quick brown fox jumps over the lazy dog while the cat watches from the tree and birds fly overhead in the blue sky during a sunny afternoon with clouds passing by"). Controls select the window type and position, with running counts of active, processed, and remaining tokens.]

Context Sizes Across Models

Model        Context size (tokens)
GPT-3        2,048
GPT-4        8,192
GPT-4-32k    32,768
Claude 2     100,000
Claude 3     200,000

* Context sizes are approximate.

Sliding Window Details

Fixed-size window that slides over the sequence

Advantages:
  • Memory efficient
  • O(n) complexity
  • Easy to implement
Limitations:
  • Limited long-range dependencies
  • Information loss

Memory Complexity

Attention memory scales as O(n²) where n is the context length. A 100K context requires ~40GB of memory just for attention scores!

The Context Length Challenge

Memory Complexity

The quadratic complexity of self-attention is the primary bottleneck:

O(n² · d)

Where:

  • n = sequence length (context size)
  • d = model dimension

For a 100K token context:

  • Attention matrix: 100,000² = 10 billion elements
  • Memory required: ~40GB (float32)
  • Computation: 10 billion dot products per layer
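
These numbers are easy to check with a quick back-of-the-envelope sketch (one attention score matrix, single head, single layer, float32 assumed):

    def attention_score_memory_gb(seq_len, bytes_per_element=4):
        # one seq_len x seq_len matrix of float32 scores, single head, single layer
        return seq_len ** 2 * bytes_per_element / 1e9

    print(attention_score_memory_gb(100_000))  # -> 40.0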

The Information Bottleneck

I(X; Y) ≤ min(H(X), H(Y))

Context windows create an information bottleneck - the model must compress all relevant information into a fixed-size representation.

Windowing Strategies

1. Sliding Window Attention

Used in models like Longformer and BigBird:

    import torch

    def sliding_window_attention(seq_len, window_size):
        # each token attends to a fixed-size window centred on its own position
        mask = torch.zeros(seq_len, seq_len)
        for i in range(seq_len):
            start = max(0, i - window_size // 2)
            end = min(seq_len, i + window_size // 2 + 1)
            mask[i, start:end] = 1
        return mask

Complexity: O(n · w) where w is window size

2. Dilated/Strided Attention

Increases receptive field without increasing computation:

r_effective = w · d

Where:

  • w = window size
  • d = dilation rate
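
A mask-based sketch in the same style as the sliding-window helper above; the window size and dilation rate here are illustrative parameters, not values from any particular model:

    import torch

    def dilated_window_attention(seq_len, window_size, dilation=2):
        # each token attends to positions within ±window_size//2 steps,
        # with each step `dilation` apart, reaching roughly w · d positions
        mask = torch.zeros(seq_len, seq_len)
        for i in range(seq_len):
            for step in range(-(window_size // 2), window_size // 2 + 1):
                j = i + step * dilation
                if 0 <= j < seq_len:
                    mask[i, j] = 1
        return mask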

3. Block-Local Attention

Divides sequence into non-overlapping blocks:

    import torch

    def block_attention(seq_len, block_size):
        # tokens attend only within their own non-overlapping block
        n_blocks = seq_len // block_size
        mask = torch.zeros(seq_len, seq_len)
        for i in range(n_blocks):
            start = i * block_size
            end = (i + 1) * block_size
            mask[start:end, start:end] = 1
        return mask

4. Global + Local Attention

Combines global tokens with local windows:

  • Global tokens attend to everything
  • Local tokens use sliding windows
  • Enables long-range dependencies
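
A rough sketch of such a combined mask; which tokens are marked global (e.g. [CLS] or task tokens) is model-specific and chosen arbitrarily here:

    import torch

    def global_local_attention(seq_len, window_size, global_idx):
        mask = torch.zeros(seq_len, seq_len)
        for i in range(seq_len):
            start = max(0, i - window_size // 2)
            end = min(seq_len, i + window_size // 2 + 1)
            mask[i, start:end] = 1      # local sliding window
        for g in global_idx:
            mask[g, :] = 1              # global token attends to everything
            mask[:, g] = 1              # every token attends to the global token
        return mask

    # e.g. global_local_attention(512, window_size=64, global_idx=[0])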

Position Encodings for Long Contexts

Rotary Position Embeddings (RoPE)

f(x, m) = x · e^(imθ)

RoPE enables better extrapolation to longer contexts:

  • Relative position information
  • No absolute position limits
  • Used in LLaMA, GPT-NeoX
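
A minimal sketch of the rotation applied to a query or key vector, using the split-half pairing convention found in LLaMA-style implementations; base=10000 is the common default:

    import torch

    def rope(x, base=10000.0):
        # x: (seq_len, dim) query or key; rotate each (x1_k, x2_k) pair by m * theta_k
        seq_len, dim = x.shape
        half = dim // 2
        theta = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * theta[None, :]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, :half], x[:, half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    # apply to queries and keys before computing attention scores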

ALiBi (Attention with Linear Biases)

a_ij = q_i · k_j − m · |i − j|

Adds position-dependent bias to attention scores:

  • No position embeddings needed
  • Better length generalization
  • Linear complexity addition
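
A sketch of the bias term for a single head; the slope m is a per-head hyperparameter (the ALiBi paper assigns a geometric sequence of slopes across heads):

    import torch

    def alibi_bias(seq_len, slope):
        # -m * |i - j|, added to attention scores before the softmax
        pos = torch.arange(seq_len)
        dist = (pos[:, None] - pos[None, :]).abs()
        return -slope * dist.float()

    # scores = q @ k.T / d ** 0.5 + alibi_bias(seq_len, slope)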

Context Extension Techniques

1. Position Interpolation

Extend context by interpolating position encodings:

    import torch

    def interpolate_positions(max_len, original_max=2048):
        # squeeze the new, longer position range back into the original trained range
        scale = original_max / max_len
        positions = torch.arange(max_len) * scale
        return positions

2. Attention Sinks

Keep initial tokens as "attention sinks":

  • First few tokens receive disproportionate attention
  • Preserving them stabilizes long contexts
  • Enables streaming with fixed memory
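
A sketch of the resulting attention pattern, in the spirit of StreamingLLM; the number of sink tokens and the size of the recent window are illustrative:

    import torch

    def sink_plus_recent_mask(seq_len, n_sink=4, recent=1024):
        # causal mask: keep the first n_sink tokens plus the most recent `recent` tokens
        mask = torch.zeros(seq_len, seq_len)
        for i in range(seq_len):
            mask[i, :min(n_sink, i + 1)] = 1           # attention-sink tokens
            mask[i, max(0, i - recent + 1):i + 1] = 1  # recent causal window
        return mask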

3. Memory-Efficient Attention

Flash Attention

Reduces memory from O(n²) to O(n):

  • Tiling and recomputation
  • Kernel fusion
  • IO-aware algorithm
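
The core trick can be sketched in plain PyTorch as an online softmax over key/value blocks; the real gains come from fused, IO-aware GPU kernels, which this toy version does not attempt:

    import torch

    def tiled_attention(q, k, v, block=128):
        # q, k, v: (n, d); never materializes the full (n, n) score matrix
        n, d = q.shape
        scale = d ** -0.5
        out = torch.zeros_like(q)
        row_max = torch.full((n, 1), float('-inf'))
        row_sum = torch.zeros(n, 1)
        for s in range(0, n, block):
            scores = q @ k[s:s + block].T * scale             # (n, b) block of scores
            block_max = scores.max(dim=-1, keepdim=True).values
            new_max = torch.maximum(row_max, block_max)
            rescale = torch.exp(row_max - new_max)            # rescale running stats
            p = torch.exp(scores - new_max)
            out = out * rescale + p @ v[s:s + block]
            row_sum = row_sum * rescale + p.sum(dim=-1, keepdim=True)
            row_max = new_max
        return out / row_sum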

Ring Attention

Distributes attention across devices:

  • Each device handles part of sequence
  • Ring communication pattern
  • Enables million-token contexts

Practical Implications

Effective Context Utilization

Not all of the context is used equally; a rough heuristic is:

u_effective ≈ √(n_context)

Studies show models primarily use:

  • Beginning (primacy effect)
  • End (recency effect)
  • Semantically relevant sections

The Lost-in-the-Middle Problem

Performance degrades in the middle of long contexts:

  1. Start: High attention (prompts, instructions)
  2. Middle: Low attention (often ignored)
  3. End: High attention (recent context)

Context Length vs Quality Trade-off

Context size    Benefits               Drawbacks
2K              Fast, cheap            Limited applications
8K              Good for most tasks    May truncate documents
32K             Full documents         Slower, more expensive
100K+           Books, codebases       Very slow, costly

Memory Management Strategies

1. Chunking and Summarization

Break long documents into chunks:

    def chunk_document(text, chunk_size=1024, overlap=128):
        # split text into overlapping chunks so context carries across boundaries
        chunks = []
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunks.append(text[start:end])
            start = end - overlap
        return chunks

2. Hierarchical Processing

Process at multiple granularities:

  1. Sentence-level encoding
  2. Paragraph-level aggregation
  3. Document-level summary
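
A sketch of the bottom-up pattern; `summarize` stands in for any call that condenses text (e.g. an LLM with a summarization prompt) and is hypothetical here:

    def hierarchical_summary(document, summarize, chunk_size=1024):
        # 1) split, 2) summarize each chunk, 3) summarize the summaries
        chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
        chunk_summaries = [summarize(c) for c in chunks]
        return summarize("\n".join(chunk_summaries))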

3. Retrieval-Augmented Generation (RAG)

Extend effective context via retrieval:

  • Store documents in vector database
  • Retrieve relevant chunks
  • Augment prompt with retrieved context
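
A minimal retrieval sketch; `embed` is a hypothetical function returning unit-norm vectors, standing in for any embedding model or vector database:

    import numpy as np

    def retrieve(query, chunks, embed, k=4):
        # rank chunks by cosine similarity to the query (vectors assumed unit-norm)
        q = embed(query)
        scores = [float(np.dot(q, embed(c))) for c in chunks]
        top = sorted(range(len(chunks)), key=scores.__getitem__, reverse=True)[:k]
        return [chunks[i] for i in top]

    def rag_prompt(query, chunks, embed):
        context = "\n\n".join(retrieve(query, chunks, embed))
        return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"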

Optimizing Context Usage

Token Budgeting

Allocate tokens strategically:

    MAX_CONTEXT = 8192  # illustrative: an 8K-token model

    def budget_tokens(system=500, examples=1000, query=500, response=1000):
        # reserve fixed budgets, then split what remains between document and history
        total = system + examples + query + response
        context_remaining = MAX_CONTEXT - total
        return {
            'document': context_remaining * 0.7,
            'history': context_remaining * 0.3,
        }

Context Compression

Reduce token count while preserving information:

  1. Remove redundancy: Deduplicate similar content
  2. Summarize: Condense verbose sections
  3. Filter: Remove irrelevant information
  4. Compress: Use more efficient encoding
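
A sketch of the first and third steps (deduplication and filtering); summarization and smarter encodings would need a model in the loop:

    def compress_context(chunks, query_terms):
        seen, kept = set(), []
        for chunk in chunks:
            normalized = " ".join(chunk.lower().split())
            if normalized in seen:
                continue                               # drop exact duplicates
            seen.add(normalized)
            if query_terms and not any(t.lower() in normalized for t in query_terms):
                continue                               # drop chunks with no relevant terms
            kept.append(chunk)
        return kept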

Future Directions

Infinite Context Models

Emerging approaches for unlimited context:

  • Memorizing Transformers: External memory
  • Recurrent Memory Transformers: Compressed memory
  • Infini-Attention: Compressive memory

Adaptive Context

Dynamic context allocation:

  • Expand for complex tasks
  • Contract for simple queries
  • Content-aware sizing

Performance Metrics

Perplexity vs Context Length

PPL(n) = PPL_0 · (1 + α · log n)

Perplexity typically improves roughly logarithmically with context length (α is negative when additional context helps).

Needle in a Haystack Test

Measures retrieval accuracy across context positions:

  • Place fact at various positions
  • Query for the fact
  • Measure retrieval accuracy
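
A sketch of such a harness; `llm` is a hypothetical callable mapping a prompt string to an answer string, and the recall check is a simple substring match:

    def needle_recall(llm, filler_chunks, needle, question, expected, depth):
        # insert the needle at a relative depth in [0, 1], then ask for it back
        idx = int(len(filler_chunks) * depth)
        doc = "\n".join(filler_chunks[:idx] + [needle] + filler_chunks[idx:])
        answer = llm(f"{doc}\n\nQuestion: {question}\nAnswer:")
        return expected.lower() in answer.lower()

    # sweep depths to see position-dependent accuracy
    # scores = [needle_recall(llm, chunks, needle, q, a, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]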

Conclusion

Context windows represent a fundamental trade-off in LLM design between capability and efficiency. Understanding these limitations and workarounds is crucial for effectively using and developing LLM applications.

If you found this explanation helpful, consider sharing it with others.
