ViT Input: Patching & Positional Embeddings
Visualizing how Vision Transformers process images into sequences with spatial context.
Processing Steps
1. Input
2. Patch
3. Project
4. Flatten
5. Add Pos
Input Image
The process starts with the raw input image (e.g., 224×224 pixels); the steps that follow split it into patches, project them, flatten the result into a sequence, and add positional embeddings.
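To make the patching step (step 2 above) concrete, here is a minimal PyTorch sketch; the framework choice and the helper name `image_to_patches` are illustrative, not part of the visualization. It splits a `(C, H, W)` image into `N` flattened patches:

```python
import torch

def image_to_patches(img: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split a (C, H, W) image into N flattened patches of shape (N, P*P*C)."""
    c, h, w = img.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    # (C, H, W) -> (C, H/P, P, W/P, P): carve out the grid of patches
    patches = img.reshape(c, h // p, p, w // p, p)
    # reorder to (H/P, W/P, P, P, C), then flatten each patch into one row
    patches = patches.permute(1, 3, 2, 4, 0).reshape(-1, p * p * c)
    return patches

img = torch.randn(3, 224, 224)       # dummy RGB image
patches = image_to_patches(img, 16)
print(patches.shape)                 # torch.Size([196, 768])
```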
Technical Implementation Details
Patch Embedding (Linear Projection)
`z_0 = [x_class; E·x_p^1; E·x_p^2; ... ; E·x_p^N] + E_pos`
Input sequence `z_0` to the Transformer: `x_class` is the learnable CLS token embedding, `x_p^i` is the flattened `i`-th patch, `E` is the learnable linear projection matrix, and `E_pos` is the positional embedding matrix. The addition is element-wise.
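A minimal sketch of this equation, assuming PyTorch; the module name `PatchEmbedding` and the zero-initialized parameters are illustrative choices, not prescribed by the formula:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of z_0 = [x_class; E·x_p^1; ...; E·x_p^N] + E_pos."""

    def __init__(self, num_patches: int = 196, patch_dim: int = 768, d: int = 768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d, bias=False)                    # E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d))                # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d))  # E_pos

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, P*P*C) -> (B, N, D) after projection by E
        x = self.proj(patches)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # one CLS per batch item
        x = torch.cat([cls, x], dim=1)                   # (B, N+1, D)
        return x + self.pos_embed                        # element-wise add of E_pos

z0 = PatchEmbedding()(torch.randn(2, 196, 768))
print(z0.shape)  # torch.Size([2, 197, 768])
```

In practice, libraries often fuse patching and projection into a single strided convolution (`nn.Conv2d` with kernel size and stride both equal to P), which is mathematically equivalent to the linear projection above.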
Typical Dimensions (ViT-Base)
- Input Image: 224 × 224 × 3
- Patch Size (P): 16 × 16
- Number of Patches (N): (224/16)² = 14² = 196
- Sequence Length: 1 (CLS) + 196 = 197
- Embedding Dimension (D): 768
- Projection Matrix (E): (P² * 3) × D = (16² * 3) × 768 = 768 × 768
- Pos. Embedding (E_pos): 197 × 768
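These numbers follow directly from the definitions above; a quick plain-Python check (variable names are illustrative):

```python
# Plain-Python check of the ViT-Base numbers listed above.
image_size, patch_size, channels, d = 224, 16, 3, 768

n_patches = (image_size // patch_size) ** 2   # 14**2 = 196
seq_len = 1 + n_patches                       # CLS + patches = 197
patch_dim = patch_size ** 2 * channels        # 16**2 * 3 = 768 (input dim of E)

print(n_patches, seq_len, patch_dim)          # 196 197 768
print(patch_dim * d, seq_len * d)             # params in E and E_pos: 589824 151296
```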
Variations & Considerations
- Larger models (ViT-L, ViT-H) use a larger embedding dimension D (1024 and 1280, respectively).
- Higher-resolution images (e.g., 384×384) produce more patches (N = 576) and longer sequences; the learned positional embeddings then typically need to be interpolated to the new grid (see the sketch after this list).
- Smaller patch sizes (e.g., 8×8) increase N significantly, demanding more compute but potentially capturing finer detail.
- Other architectures (Swin, CaiT) use different patching/embedding strategies (shifted windows, LayerScale).
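As referenced above, a common way to adapt a learned positional-embedding table to a larger grid is 2D interpolation. A hedged sketch, assuming PyTorch and a `(1, 1 + N, D)` table with a leading CLS entry; the function name `resize_pos_embed` is illustrative:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Bicubically interpolate a learned positional embedding to a new grid.

    pos_embed: (1, 1 + g*g, D) with a leading CLS entry; new_grid: target g'.
    """
    cls_pos, grid_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d = grid_pos.shape[-1]
    g = int(grid_pos.shape[1] ** 0.5)
    grid_pos = grid_pos.reshape(1, g, g, d).permute(0, 3, 1, 2)  # (1, D, g, g)
    grid_pos = F.interpolate(grid_pos, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_pos = grid_pos.permute(0, 2, 3, 1).reshape(1, new_grid ** 2, d)
    return torch.cat([cls_pos, grid_pos], dim=1)  # re-attach the CLS entry

pos = torch.randn(1, 197, 768)            # 14x14 grid + CLS (224/16)
print(resize_pos_embed(pos, 24).shape)    # torch.Size([1, 577, 768]) for 384/16
```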
Interactive ViT input visualization. The patch size is fixed at 4×4 for demonstration, and the embedding dimension is simplified to 8.