Tokenization: Converting Text to Numbers

Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.

Tokenization in Large Language Models

Tokenization is the foundational step in LLM processing - converting raw text into numerical tokens that models can understand. The choice of tokenization method significantly impacts model performance, vocabulary size, and handling of rare words.

Interactive Tokenization Explorer

Experiment with different tokenization methods and see how they break down text:

[Interactive explorer: pick a tokenization method and an input text to see the step-by-step BPE merge process, the resulting tokens at word, subword, and character granularity, and a sample of the learned vocabulary.]

About Byte Pair Encoding (BPE)

Pros:
  • Efficient compression
  • Handles OOV words
  • Language agnostic
Cons:
  • Multiple tokenizations possible
  • Not deterministic

Why Tokenization Matters

The Vocabulary Size Dilemma

V_char = 256 ≪ V_subword ≈ 50k ≪ V_word ≈ 170k

Different granularities offer different trade-offs (a quick illustration follows the list):

  • Character-level: Small vocabulary, long sequences
  • Subword-level: Balanced vocabulary and sequence length
  • Word-level: Large vocabulary, short sequences
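
As a quick, hypothetical illustration of the trade-off, the same sentence produces very different sequence lengths at character and word granularity; subword methods land in between. A naive whitespace split stands in for word-level tokenization here.

# Granularity trade-off on a single sentence
text = "Tokenization converts text to numbers"
print(len(text))           # 37 character-level tokens
print(len(text.split()))   # 5 word-level tokens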

The OOV Problem

Out-of-vocabulary (OOV) words are a critical challenge:

  • Word-level tokenization fails on new/rare words
  • Subword methods can handle any word by decomposition
  • Character-level always works but loses semantic meaning

Tokenization Methods

Byte Pair Encoding (BPE)

BPE builds its vocabulary through iterative merging (a minimal training sketch follows the steps):

  1. Initialize with character-level tokens
  2. Count all adjacent token pairs
  3. Merge the most frequent pair
  4. Add new token to vocabulary
  5. Repeat until vocabulary size reached
P(w) = ∏_{i=1}^{n} P(t_i | t_{<i})

where w is split into tokens t_1, …, t_n and t_{<i} denotes the tokens before position i.
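
Here is that sketch: a toy corpus of word frequencies (made up for illustration), characters as the initial symbols, and a fixed merge budget.

from collections import Counter
import re

def get_pair_counts(corpus):
    # Step 2: count all adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    # Steps 3-4: replace every whole-symbol occurrence of the pair with a new merged symbol
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in corpus.items()}

# Step 1: words start as space-separated characters (toy frequencies)
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(10):                      # Step 5: repeat until the merge budget is reached
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent pair; ties broken by first occurrence
    corpus = merge_pair(best, corpus)
    merges.append(best)

print(merges[:3])   # [('e', 's'), ('es', 't'), ('l', 'o')] for this toy corpus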

SentencePiece

Key differences from BPE (the ▁ convention is illustrated below):

  • Treats input text as a raw stream of Unicode characters
  • Includes whitespace in tokens (▁ prefix)
  • Language-agnostic (no pre-tokenization)
  • Fully reversible
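
A tiny illustration of the ▁ convention and why it makes encoding reversible; the pieces are written by hand here, as a trained model would produce them.

# Detokenization is just concatenation plus turning the ▁ marker back into a space
pieces = ["▁Hello", "▁world", "!"]
text = "".join(pieces).replace("▁", " ").lstrip()
print(text)   # "Hello world!"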

WordPiece

Used by the BERT family of models (a greedy-matching sketch follows the list):

  • Maximizes likelihood of training data
  • Uses ## prefix for subwords
  • Requires pre-tokenization
  • More deterministic than BPE
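
The sketch below shows the greedy longest-match-first segmentation WordPiece applies at inference time; the vocabulary is a toy one chosen purely for illustration.

# Greedy longest-match-first segmentation with ## continuation pieces
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1                      # shrink the match until a piece is in vocabulary
        if cur is None:
            return [unk]                  # no piece matches: fall back to the unknown token
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "##a", "##ble"}
print(wordpiece_tokenize("unaffable", vocab))   # ['un', '##aff', '##able']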

Mathematical Framework

Compression Rate

C = |text_chars| / |text_tokens|

Typical compression rates (a measurement helper is sketched below):

  • BPE: 3-4 characters per token
  • SentencePiece: 3.5-4.5 characters per token
  • WordPiece: 2.5-3.5 characters per token
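
A minimal helper for measuring the rate of any tokenizer object that exposes an encode method; the figures above are typical values, not guarantees.

def compression_rate(text, tokenizer):
    # Characters per token: higher means more text is packed into each token
    tokens = tokenizer.encode(text)
    return len(text) / len(tokens)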

Vocabulary Entropy

H(V) = -Σ_{t ∈ V} P(t) log P(t)

Entropy measures the average information (in bits, for log base 2) carried per token; comparing H(V) with log |V| shows how evenly a tokenizer actually uses its vocabulary.
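
A sketch of computing this entropy from observed token frequencies; the counts here are made up, and in practice they would come from tokenizing a corpus.

import math

def vocab_entropy(counts):
    # counts: token -> number of occurrences in a tokenized corpus
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

print(vocab_entropy({"▁the": 50, "▁quick": 10, "▁fox": 5}))   # entropy in bits per token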

Tokenization in Practice

GPT Models (BPE)

# GPT tokenization example
tokens = tokenizer.encode("Hello world!")
# Result: [15496, 995, 0]

BERT Models (WordPiece)

# BERT tokenization example ([CLS] and [SEP] are added automatically)
tokens = tokenizer.encode("Hello world!")
# Result: [101, 7592, 2088, 999, 102]

T5 Models (SentencePiece)

# T5 tokenization example (tokenize() returns pieces; encode() would return their integer IDs)
pieces = tokenizer.tokenize("Hello world!")
# Result: ['▁Hello', '▁world', '!']

Performance Impact

Sequence Length

Tokenization directly affects the computational cost of self-attention (a worked example follows):

O(n² · d)

Where:

  • n = sequence length in tokens
  • d = model dimension
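
A worked example with made-up numbers, showing how the characters-per-token ratio feeds the quadratic term.

# 8,000 characters of text, model dimension 1,024 (both numbers are hypothetical)
chars, d = 8_000, 1_024
for chars_per_token in (4, 2):
    n = chars // chars_per_token
    print(chars_per_token, n, n * n * d)
# 4 chars/token -> n = 2,000, n²·d ≈ 4.1e9
# 2 chars/token -> n = 4,000, n²·d ≈ 1.6e10 (4x the cost for the same text)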

Memory Requirements

Memory = V × d_embed + n × d_model

  • Vocabulary embeddings: V × d_embed
  • Sequence processing: n × d_model

A rough sizing example for the embedding term follows.
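
Using GPT-2-like numbers and fp16 storage (both assumptions, not figures from the text above):

V, d_embed = 50_257, 768          # GPT-2-style vocabulary size and embedding width
bytes_per_param = 2               # fp16
embedding_mb = V * d_embed * bytes_per_param / 1e6
print(f"{embedding_mb:.0f} MB")   # ~77 MB for the embedding table alone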

Special Tokens

Common Special Tokens

  • [PAD] - Padding for batch processing
  • [CLS] - Classification/start token
  • [SEP] - Separator between segments
  • [MASK] - Masked token for MLM
  • [UNK] - Unknown token fallback
  • <s> / </s> - Start/end of sequence

Position in Vocabulary

Special tokens are assigned reserved IDs so they are easy to identify and handle separately. BERT-style vocabularies place them near the start (e.g., [PAD] = 0, [CLS] = 101, [SEP] = 102), while GPT-2 puts <|endoftext|> at the very end (50256).
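
With a Hugging Face tokenizer (an assumption here; the snippets above don't name a library), the reserved tokens and their IDs can be inspected directly:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.all_special_tokens)   # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tok.all_special_ids)      # e.g. [100, 102, 0, 101, 103]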

Multilingual Considerations

Character Coverage

  • ASCII: 128 characters
  • Latin Extended: ~1,000 characters
  • CJK: ~20,000 characters
  • Full Unicode: ~140,000 characters

Tokenization Efficiency by Language

Language   | Avg. Tokens/Word | Compression
English    | 1.3              | High
German     | 1.5              | Medium-High
Chinese    | 2.1              | Medium
Japanese   | 2.5              | Medium-Low
Korean     | 3.2              | Low

Advanced Techniques

Byte-level BPE

Used in GPT-2 and later models (illustrated below):

  • Operates on bytes, not characters
  • Can encode any string
  • Vocabulary size: 50,257 tokens
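
The reason any string can be encoded is that the base alphabet is the 256 possible byte values, so every input reduces to in-vocabulary symbols before any merges apply. The sketch below shows the idea; GPT-2's actual byte-to-unicode mapping is more involved.

text = "café 🙂"                              # mixes accented Latin and an emoji
base_symbols = list(text.encode("utf-8"))
print(base_symbols)                           # every character becomes one or more bytes in 0-255
print(bytes(base_symbols).decode("utf-8"))    # and the mapping is fully reversible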

Unigram Language Model

An alternative to BPE (a segmentation sketch follows):

  • Starts with large vocabulary
  • Removes tokens to minimize loss
  • More principled than BPE
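
At inference time a unigram model scores candidate segmentations and keeps the most probable one; the dynamic-programming sketch below uses made-up token probabilities.

import math

# Toy per-token log-probabilities (a real model estimates these with EM during training)
logp = {"un": math.log(0.05), "related": math.log(0.03),
        "rel": math.log(0.01), "ated": math.log(0.01),
        "u": math.log(0.002), "n": math.log(0.002)}

def best_segmentation(word):
    # best[i] holds (log-prob, tokens) for the best segmentation of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j][1] is not None:
                score = best[j][0] + logp[piece]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[-1][1]

print(best_segmentation("unrelated"))   # ['un', 'related'] beats ['un', 'rel', 'ated']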

Common Issues and Solutions

1. Tokenization Artifacts

Problem: Weird token boundaries
Solution: Use consistent preprocessing

2. Case Sensitivity

Problem: "Hello" ≠ "hello" Solution: Lowercase normalization or cased models

3. Number Handling

Problem: Each digit as separate token
Solution: Special number tokenization

4. Whitespace Handling

Problem: Inconsistent space tokenization
Solution: SentencePiece with whitespace tokens

Performance Optimization

Caching Strategies

# Cache frequently used tokenizations
token_cache = {}

def cached_tokenize(text):
    if text not in token_cache:
        token_cache[text] = tokenizer.encode(text)
    return token_cache[text]
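
A shorter equivalent using the standard library's cache decorator, assuming the same tokenizer object as above:

from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_tokenize(text):
    return tuple(tokenizer.encode(text))   # tuples are hashable, so results can be cached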

Batch Processing

# Tokenize multiple texts in one call (Hugging Face tokenizers accept a list of strings)
texts = ["Hello", "World", "AI"]
tokens = tokenizer(texts)["input_ids"]

Conclusion

Tokenization is more than just splitting text - it's a crucial design decision that affects model performance, multilingual capabilities, and computational efficiency. Understanding tokenization helps explain many LLM behaviors and limitations.

If you found this explanation helpful, consider sharing it with others.
