Tokenization: Converting Text to Numbers
Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.
Tokenization in Large Language Models
Tokenization is the foundational step in LLM processing - converting raw text into numerical tokens that models can understand. The choice of tokenization method significantly impacts model performance, vocabulary size, and handling of rare words.
Interactive Tokenization Explorer
Experiment with different tokenization methods and see how they break down text.
About Byte Pair Encoding (BPE)
- Efficient compression of common character sequences
- Handles out-of-vocabulary (OOV) words by decomposing them into subwords
- Language agnostic
- Multiple tokenizations of the same string are possible
- Not deterministic across different learned merge orders
Why Tokenization Matters
The Vocabulary Size Dilemma
Different granularities offer different trade-offs:
- Character-level: Small vocabulary, long sequences
- Subword-level: Balanced vocabulary and sequence length
- Word-level: Large vocabulary, short sequences
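A minimal illustration of the sequence-length side of these trade-offs (character and word counts only; a subword tokenizer would land in between):

# Character- vs. word-level sequence length for the same sentence.
text = "Tokenization converts raw text into numbers."

char_tokens = list(text)     # character-level: tiny vocabulary, long sequences
word_tokens = text.split()   # word-level: huge vocabulary, short sequences

print(len(char_tokens))      # 44 tokens at character level
print(len(word_tokens))      # 6 tokens at word level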
The OOV Problem
Out-of-vocabulary (OOV) words are a critical challenge:
- Word-level tokenization fails on new/rare words
- Subword methods can handle any word by decomposition
- Character-level always works but loses semantic meaning
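A short sketch (assuming Hugging Face transformers) of how a subword tokenizer copes with a word it has never seen; the exact split depends on the learned vocabulary:

# A subword tokenizer never raises an OOV error, because any unseen word
# decomposes into known subword pieces.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
pieces = tokenizer.tokenize("floccinaucinihilipilification")
print(pieces)  # several subword fragments, no unknown-token fallback needed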
Tokenization Methods
Byte Pair Encoding (BPE)
BPE builds vocabulary through iterative merging:
- Initialize with character-level tokens
- Count all adjacent token pairs
- Merge the most frequent pair
- Add new token to vocabulary
- Repeat until vocabulary size reached
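A minimal sketch of this merge loop on a toy corpus, assuming a word-frequency dictionary as input; the fixed number of merges stands in for the target vocabulary size:

# BPE training sketch: start from characters, repeatedly merge the most
# frequent adjacent pair. Toy corpus, not a production implementation.
from collections import Counter

def get_pair_counts(word_freqs):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Words as tuples of characters, with their corpus frequencies.
word_freqs = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges = []
for _ in range(6):                     # stop after 6 merges (stand-in for a vocab-size target)
    pairs = get_pair_counts(word_freqs)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    word_freqs = merge_pair(best, word_freqs)
    merges.append(best)

print(merges)  # learned merge rules, in the order they were added to the vocabulary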
SentencePiece
Key differences from BPE:
- Treats the input text as a raw character stream
- Includes whitespace in tokens (▁ prefix)
- Language-agnostic (no pre-tokenization)
- Fully reversible
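A toy illustration of the whitespace convention and reversibility, not the SentencePiece library itself: because spaces are kept as the visible ▁ marker, detokenization is a pure string operation.

MARKER = "\u2581"  # the "▁" character SentencePiece uses to mark word-initial whitespace

def detokenize(pieces):
    # Decoding is a pure string operation: concatenate, map the marker back to
    # a space, and drop the leading space that the first word's marker creates.
    return "".join(pieces).replace(MARKER, " ").lstrip(" ")

# The kind of output a trained SentencePiece model produces for "Hello world!"
pieces = [MARKER + "Hello", MARKER + "world", "!"]
assert detokenize(pieces) == "Hello world!"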
WordPiece
Used by BERT family models:
- Maximizes likelihood of training data
- Uses ## prefix for subwords
- Requires pre-tokenization
- More deterministic than BPE
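A minimal sketch of WordPiece inference, greedy longest-match-first with the ## continuation prefix; the vocabulary below is invented for illustration:

# Greedy longest-match-first segmentation of a single pre-tokenized word.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                      # try the longest substring first
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub                # word-internal pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:                       # no piece matches: fall back to [UNK]
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))   # ['un', '##aff', '##able']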
Mathematical Framework
Compression Rate
Typical compression rates:
- BPE: 3-4 characters per token
- SentencePiece: 3.5-4.5 characters per token
- WordPiece: 2.5-3.5 characters per token
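Compression rate here means average characters per token. A simple helper to measure it on your own data, assuming any tokenizer object that exposes an encode method:

# Average characters per token for a given text and tokenizer.
def chars_per_token(text, tokenizer):
    ids = tokenizer.encode(text)
    return len(text) / max(len(ids), 1)

# Usage (illustrative): pass a GPT-2, BERT, or T5 tokenizer and some sample text;
# English prose typically lands in the ranges listed above.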
Vocabulary Entropy
Lower entropy indicates more efficient encoding.
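One standard way to quantify this is the Shannon entropy of the empirical token distribution over a corpus; a sketch:

# Shannon entropy (in bits) of the token distribution: H = -sum(p * log2(p)).
import math
from collections import Counter

def vocab_entropy(token_ids):
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(vocab_entropy([1, 1, 2, 3]))  # 1.5 bits for this toy distribution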
Tokenization in Practice
GPT Models (BPE)
# GPT-2 tokenization example (byte-level BPE), e.g. via Hugging Face transformers
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello world!")
# Result: [15496, 995, 0]  ->  "Hello", " world", "!"
BERT Models (WordPiece)
# BERT tokenization example, e.g. via Hugging Face transformers;
# encode() adds the [CLS] and [SEP] special tokens itself
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.encode("Hello world!")
# Result: [101, 7592, 2088, 999, 102]  ->  [CLS], hello, world, !, [SEP]
T5 Models (SentencePiece)
# T5 tokenization example, e.g. via Hugging Face transformers;
# tokenize() returns the SentencePiece pieces, encode() would return their IDs
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokens = tokenizer.tokenize("Hello world!")
# Result: ['▁Hello', '▁world', '!']
Performance Impact
Sequence Length
Tokenization directly affects computational cost. Self-attention compute scales roughly as:
n² × d
Where:
- n = sequence length in tokens
- d = model dimension
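Back-of-the-envelope arithmetic with hypothetical token counts for the same document:

# Effect of the quadratic term: the same document tokenized into 1,024 vs. 768
# tokens (hypothetical counts) costs noticeably more attention compute.
n_fine, n_coarse = 1024, 768
print((n_fine ** 2) / (n_coarse ** 2))  # ~1.78x more pairwise attention work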
Memory Requirements
- Vocabulary embeddings: V × d_embed
- Sequence processing: n × d_model
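A worked example using GPT-2-scale numbers (V = 50,257 and d_embed = 768):

# Embedding-table size at GPT-2 scale.
V, d_embed = 50_257, 768
params = V * d_embed       # 38,597,376 embedding parameters
print(params * 4 / 1e6)    # ~154 MB in fp32 (4 bytes per parameter)
print(params * 2 / 1e6)    # ~77 MB in fp16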
Special Tokens
Common Special Tokens
- [PAD]: Padding for batch processing
- [CLS]: Classification/start token
- [SEP]: Separator between segments
- [MASK]: Masked token for MLM
- [UNK]: Unknown token fallback
- <s> / </s>: Start/end of sequence
Position in Vocabulary
Special tokens are typically reserved at the low end of the ID range (roughly IDs 0-100 in BERT-style vocabularies) so they are easy to identify and handle separately.
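With Hugging Face transformers (one option; other tokenizer libraries expose similar metadata), the reserved tokens and their IDs can be inspected directly:

# List a model's special tokens and the IDs reserved for them.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.all_special_tokens)   # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)      # the low IDs reserved for them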
Multilingual Considerations
Character Coverage
- ASCII: 128 characters
- Latin Extended: ~1,000 characters
- CJK: ~20,000 characters
- Full Unicode: ~140,000 characters
Tokenization Efficiency by Language
| Language | Avg Tokens/Word | Compression |
|---|---|---|
| English | 1.3 | High |
| German | 1.5 | Medium-High |
| Chinese | 2.1 | Medium |
| Japanese | 2.5 | Medium-Low |
| Korean | 3.2 | Low |
Advanced Techniques
Byte-level BPE
Used in GPT-2 and later:
- Operates on bytes, not characters
- Can encode any string
- Vocabulary size: 50,257 tokens
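A quick check with the tiktoken package (an assumption; any byte-level BPE implementation behaves the same way):

# Byte-level BPE encodes any string, including emoji and mixed scripts,
# because unknown characters fall back to individual byte tokens.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                           # 50257
ids = enc.encode("naïve 🙂 日本語")
assert enc.decode(ids) == "naïve 🙂 日本語"   # lossless round trip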
Unigram Language Model
Alternative to BPE:
- Starts with large vocabulary
- Removes tokens to minimize loss
- More principled than BPE
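A sketch of the segmentation step a unigram model relies on: among all ways to split a word with the current vocabulary, dynamic programming picks the split with the highest total log-probability. The vocabulary and log-probabilities below are toy values.

# Best segmentation under a unigram language model (Viterbi-style DP).
import math

def best_segmentation(word, logprobs):
    # best[i] = (score of best segmentation of word[:i], split point)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprobs and best[start][0] + logprobs[piece] > best[end][0]:
                best[end] = (best[start][0] + logprobs[piece], start)
    # Walk back through the split points to recover the tokens.
    pieces, i = [], len(word)
    while i > 0:
        start = best[i][1]
        pieces.append(word[start:i])
        i = start
    return pieces[::-1]

logprobs = {"un": -2.0, "do": -2.5, "able": -2.2, "undo": -5.5, "doable": -6.0}
print(best_segmentation("undoable", logprobs))  # ['un', 'do', 'able'] scores higher than ['undo', 'able'] here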
Common Issues and Solutions
1. Tokenization Artifacts
Problem: Weird token boundaries
Solution: Use consistent preprocessing
2. Case Sensitivity
Problem: "Hello" ≠ "hello" Solution: Lowercase normalization or cased models
3. Number Handling
Problem: Each digit as separate token
Solution: Special number tokenization
4. Whitespace Handling
Problem: Inconsistent space tokenization
Solution: SentencePiece with whitespace tokens
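A minimal normalization pass addressing several of these issues; whether to lowercase depends on whether the downstream model is cased, so treat the default here as illustrative:

import re

def normalize(text, lowercase=True):
    # Collapse inconsistent whitespace, then optionally lowercase for
    # uncased vocabularies.
    text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    return text

print(normalize("Hello   world!\n"))  # "hello world!"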
Performance Optimization
Caching Strategies
# Cache frequently used tokenizations
token_cache = {}

def cached_tokenize(text):
    if text not in token_cache:
        token_cache[text] = tokenizer.encode(text)
    return token_cache[text]
Batch Processing
# Tokenize multiple texts efficiently: calling a Hugging Face tokenizer on a
# list batches the texts in one pass
texts = ["Hello", "World", "AI"]
tokens = tokenizer(texts)["input_ids"]
Related Concepts
- Context Windows - How tokens fit in context
- KV Cache - Caching token representations
- Flash Attention - Efficient token processing
- Embeddings - Token to vector conversion
Conclusion
Tokenization is more than just splitting text - it's a crucial design decision that affects model performance, multilingual capabilities, and computational efficiency. Understanding tokenization helps explain many LLM behaviors and limitations.