Tokenization: Converting Text to Numbers

Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.

Tokenization in Large Language Models

Tokenization is the foundational step in LLM processing - converting raw text into numerical tokens that models can understand. The choice of tokenization method significantly impacts model performance, vocabulary size, and handling of rare words.

Interactive Tokenization Explorer

Experiment with different tokenization methods and see how they break down text:

[Interactive explorer: pick a tokenization method and an input text to see the step-by-step BPE merge process, the resulting tokens at word, subword, and character granularity, and a sample of the learned vocabulary.]

About Byte Pair Encoding (BPE)

Pros:
  • Efficient compression
  • Handles OOV words
  • Language agnostic
Cons:
  • Multiple tokenizations possible
  • Not deterministic

Why Tokenization Matters

The Vocabulary Size Dilemma

V_char = 256 ≪ V_subword ≈ 50k ≪ V_word ≈ 170k

Different granularities offer different trade-offs (a quick illustration follows the list):

  • Character-level: Small vocabulary, long sequences
  • Subword-level: Balanced vocabulary and sequence length
  • Word-level: Large vocabulary, short sequences
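
As a quick, hypothetical illustration of the trade-off, the same sentence produces very different sequence lengths at character and word granularity; subword methods land in between. A naive whitespace split stands in for word-level tokenization here.

# Granularity trade-off on a single sentence
text = "Tokenization converts text to numbers"
print(len(text))           # 37 character-level tokens
print(len(text.split()))   # 5 word-level tokens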

The OOV Problem

Out-of-vocabulary (OOV) words are a critical challenge:

  • Word-level tokenization fails on new/rare words
  • Subword methods can handle any word by decomposition
  • Character-level always works but loses semantic meaning

Tokenization Methods

Byte Pair Encoding (BPE)

BPE builds its vocabulary through iterative merging (a minimal training sketch follows the steps):

  1. Initialize with character-level tokens
  2. Count all adjacent token pairs
  3. Merge the most frequent pair
  4. Add new token to vocabulary
  5. Repeat until vocabulary size reached
P(w) = ∏_{i=1}^{n} P(t_i | t_{<i})

where w is split into tokens t_1, …, t_n and t_{<i} denotes the tokens before position i.
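
Here is that sketch: a toy corpus of word frequencies (made up for illustration), characters as the initial symbols, and a fixed merge budget.

from collections import Counter
import re

def get_pair_counts(corpus):
    # Step 2: count all adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    # Steps 3-4: replace every whole-symbol occurrence of the pair with a new merged symbol
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in corpus.items()}

# Step 1: words start as space-separated characters (toy frequencies)
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(10):                      # Step 5: repeat until the merge budget is reached
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent pair; ties broken by first occurrence
    corpus = merge_pair(best, corpus)
    merges.append(best)

print(merges[:3])   # [('e', 's'), ('es', 't'), ('l', 'o')] for this toy corpus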

SentencePiece

Key differences from BPE (the ▁ convention is illustrated below):

  • Treats input text as a raw stream of Unicode characters
  • Includes whitespace in tokens (▁ prefix)
  • Language-agnostic (no pre-tokenization)
  • Fully reversible
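
A tiny illustration of the ▁ convention and why it makes encoding reversible; the pieces are written by hand here, as a trained model would produce them.

# Detokenization is just concatenation plus turning the ▁ marker back into a space
pieces = ["▁Hello", "▁world", "!"]
text = "".join(pieces).replace("▁", " ").lstrip()
print(text)   # "Hello world!"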

WordPiece

Used by the BERT family of models (a greedy-matching sketch follows the list):

  • Maximizes likelihood of training data
  • Uses ## prefix for subwords
  • Requires pre-tokenization
  • More deterministic than BPE
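
The sketch below shows the greedy longest-match-first segmentation WordPiece applies at inference time; the vocabulary is a toy one chosen purely for illustration.

# Greedy longest-match-first segmentation with ## continuation pieces
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1                      # shrink the match until a piece is in vocabulary
        if cur is None:
            return [unk]                  # no piece matches: fall back to the unknown token
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "##a", "##ble"}
print(wordpiece_tokenize("unaffable", vocab))   # ['un', '##aff', '##able']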

Mathematical Framework

Compression Rate

C = |text_chars| / |text_tokens|

Typical compression rates (a measurement helper is sketched below):

  • BPE: 3-4 characters per token
  • SentencePiece: 3.5-4.5 characters per token
  • WordPiece: 2.5-3.5 characters per token
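
A minimal helper for measuring the rate of any tokenizer object that exposes an encode method; the figures above are typical values, not guarantees.

def compression_rate(text, tokenizer):
    # Characters per token: higher means more text is packed into each token
    tokens = tokenizer.encode(text)
    return len(text) / len(tokens)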

Vocabulary Entropy

H(V) = -Σ_{t ∈ V} P(t) log P(t)

Entropy measures the average information (in bits, for log base 2) carried per token; comparing H(V) with log |V| shows how evenly a tokenizer actually uses its vocabulary.
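
A sketch of computing this entropy from observed token frequencies; the counts here are made up, and in practice they would come from tokenizing a corpus.

import math

def vocab_entropy(counts):
    # counts: token -> number of occurrences in a tokenized corpus
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

print(vocab_entropy({"▁the": 50, "▁quick": 10, "▁fox": 5}))   # entropy in bits per token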

Tokenization in Practice

GPT Models (BPE)

# GPT tokenization example
tokens = tokenizer.encode("Hello world!")
# Result: [15496, 995, 0]

BERT Models (WordPiece)

# BERT tokenization example ([CLS] and [SEP] are added automatically)
tokens = tokenizer.encode("Hello world!")
# Result: [101, 7592, 2088, 999, 102]

T5 Models (SentencePiece)

# T5 tokenization example (tokenize() returns pieces; encode() would return their integer IDs)
pieces = tokenizer.tokenize("Hello world!")
# Result: ['▁Hello', '▁world', '!']

Performance Impact

Sequence Length

Tokenization directly affects the computational cost of self-attention (a worked example follows):

O(n² · d)

Where:

  • n = sequence length in tokens
  • d = model dimension
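
A worked example with made-up numbers, showing how the characters-per-token ratio feeds the quadratic term.

# 8,000 characters of text, model dimension 1,024 (both numbers are hypothetical)
chars, d = 8_000, 1_024
for chars_per_token in (4, 2):
    n = chars // chars_per_token
    print(chars_per_token, n, n * n * d)
# 4 chars/token -> n = 2,000, n²·d ≈ 4.1e9
# 2 chars/token -> n = 4,000, n²·d ≈ 1.6e10 (4x the cost for the same text)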

Memory Requirements

Memory = V × d_embed + n × d_model

  • Vocabulary embeddings: V × d_embed
  • Sequence processing: n × d_model

A rough sizing example for the embedding term follows.
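
Using GPT-2-like numbers and fp16 storage (both assumptions, not figures from the text above):

V, d_embed = 50_257, 768          # GPT-2-style vocabulary size and embedding width
bytes_per_param = 2               # fp16
embedding_mb = V * d_embed * bytes_per_param / 1e6
print(f"{embedding_mb:.0f} MB")   # ~77 MB for the embedding table alone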

Special Tokens

Common Special Tokens

  • [PAD] - Padding for batch processing
  • [CLS] - Classification/start token
  • [SEP] - Separator between segments
  • [MASK] - Masked token for MLM
  • [UNK] - Unknown token fallback
  • <s> / </s> - Start/end of sequence

Position in Vocabulary

Special tokens are assigned reserved IDs so they are easy to identify and handle separately. BERT-style vocabularies place them near the start (e.g., [PAD] = 0, [CLS] = 101, [SEP] = 102), while GPT-2 puts <|endoftext|> at the very end (50256).
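
With a Hugging Face tokenizer (an assumption here; the snippets above don't name a library), the reserved tokens and their IDs can be inspected directly:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.all_special_tokens)   # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tok.all_special_ids)      # e.g. [100, 102, 0, 101, 103]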

Multilingual Considerations

Character Coverage

  • ASCII: 128 characters
  • Latin Extended: ~1,000 characters
  • CJK: ~20,000 characters
  • Full Unicode: ~140,000 characters

Tokenization Efficiency by Language

Language   | Avg. Tokens/Word | Compression
English    | 1.3              | High
German     | 1.5              | Medium-High
Chinese    | 2.1              | Medium
Japanese   | 2.5              | Medium-Low
Korean     | 3.2              | Low

Advanced Techniques

Byte-level BPE

Used in GPT-2 and later models (illustrated below):

  • Operates on bytes, not characters
  • Can encode any string
  • Vocabulary size: 50,257 tokens
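
The reason any string can be encoded is that the base alphabet is the 256 possible byte values, so every input reduces to in-vocabulary symbols before any merges apply. The sketch below shows the idea; GPT-2's actual byte-to-unicode mapping is more involved.

text = "café 🙂"                              # mixes accented Latin and an emoji
base_symbols = list(text.encode("utf-8"))
print(base_symbols)                           # every character becomes one or more bytes in 0-255
print(bytes(base_symbols).decode("utf-8"))    # and the mapping is fully reversible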

Unigram Language Model

An alternative to BPE (a segmentation sketch follows):

  • Starts with large vocabulary
  • Removes tokens to minimize loss
  • More principled than BPE
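
At inference time a unigram model scores candidate segmentations and keeps the most probable one; the dynamic-programming sketch below uses made-up token probabilities.

import math

# Toy per-token log-probabilities (a real model estimates these with EM during training)
logp = {"un": math.log(0.05), "related": math.log(0.03),
        "rel": math.log(0.01), "ated": math.log(0.01),
        "u": math.log(0.002), "n": math.log(0.002)}

def best_segmentation(word):
    # best[i] holds (log-prob, tokens) for the best segmentation of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j][1] is not None:
                score = best[j][0] + logp[piece]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[-1][1]

print(best_segmentation("unrelated"))   # ['un', 'related'] beats ['un', 'rel', 'ated']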

Common Issues and Solutions

1. Tokenization Artifacts

Problem: Weird token boundaries
Solution: Use consistent preprocessing

2. Case Sensitivity

Problem: "Hello" ≠ "hello" Solution: Lowercase normalization or cased models

3. Number Handling

Problem: Each digit as separate token
Solution: Special number tokenization

4. Whitespace Handling

Problem: Inconsistent space tokenization
Solution: SentencePiece with whitespace tokens

Performance Optimization

Caching Strategies

# Cache frequently used tokenizations
token_cache = {}

def cached_tokenize(text):
    if text not in token_cache:
        token_cache[text] = tokenizer.encode(text)
    return token_cache[text]
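
A shorter equivalent using the standard library's cache decorator, assuming the same tokenizer object as above:

from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_tokenize(text):
    return tuple(tokenizer.encode(text))   # tuples are hashable, so results can be cached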

Batch Processing

# Tokenize multiple texts in one call (Hugging Face tokenizers accept a list of strings)
texts = ["Hello", "World", "AI"]
tokens = tokenizer(texts)["input_ids"]

Conclusion

Tokenization is more than just splitting text - it's a crucial design decision that affects model performance, multilingual capabilities, and computational efficiency. Understanding tokenization helps explain many LLM behaviors and limitations.

If you found this explanation helpful, consider sharing it with others.
