ViT Input: Patching & Positional Embeddings

Visualizing how Vision Transformers process images into sequences with spatial context.

Processing Steps

1. Input
2. Patch
3. Project
4. Flatten
5. Add Positional Embeddings

The process starts with the raw input image (e.g., 224x224 pixels).
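To make the patching step concrete, here is a minimal NumPy sketch (illustrative, not the page's implementation) that splits an image into non-overlapping P×P patches and flattens each one into a vector of length P²·C, ready for the linear projection:

```python
import numpy as np

def image_to_patches(img: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (N, P*P*C) with N = (H/P) * (W/P).
    Assumes H and W are divisible by patch_size.
    """
    H, W, C = img.shape
    P = patch_size
    patches = img.reshape(H // P, P, W // P, P, C)   # (H/P, P, W/P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)       # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)            # (N, P*P*C)

img = np.random.rand(224, 224, 3).astype(np.float32)
patches = image_to_patches(img, patch_size=16)
print(patches.shape)  # (196, 768)
```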

Technical Implementation Details

Patch Embedding (Linear Projection)

z_0 = [x_class; E·x_p^1; E·x_p^2; ... ; E·x_p^N] + E_pos

Input sequence `z_0` to the Transformer: `x_class` is the learnable CLS token embedding, `x_p^i` is the flattened `i`-th patch, `E` is the learnable linear projection matrix, and `E_pos` is the positional embedding matrix. The addition is element-wise.
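A minimal NumPy sketch of the same computation, with randomly initialised arrays standing in for the learned parameters `E`, `x_class`, and `E_pos` (dimensions match ViT-Base, listed below):

```python
import numpy as np

rng = np.random.default_rng(0)

# ViT-Base dimensions (see the list below): 196 patches of 16x16x3, embedding dim 768.
N, P, C, D = 196, 16, 3, 768

# Flattened patches x_p^i stacked into an (N, P*P*C) matrix (random stand-in here).
x_p = rng.standard_normal((N, P * P * C)).astype(np.float32)

# Learned parameters, randomly initialised purely for illustration.
E = rng.standard_normal((P * P * C, D)).astype(np.float32)    # linear projection matrix
x_class = rng.standard_normal((1, D)).astype(np.float32)      # CLS token embedding
E_pos = rng.standard_normal((N + 1, D)).astype(np.float32)    # positional embeddings

# z_0 = [x_class; E·x_p^1; ...; E·x_p^N] + E_pos
patch_embeddings = x_p @ E                                     # (196, 768)
z_0 = np.concatenate([x_class, patch_embeddings], axis=0) + E_pos
print(z_0.shape)                                               # (197, 768)
```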

Typical Dimensions (ViT-Base)

  • Input Image: 224 × 224 × 3
  • Patch Size (P): 16 × 16
  • Number of Patches (N): (224/16)² = 14² = 196
  • Sequence Length: 1 (CLS) + 196 = 197
  • Embedding Dimension (D): 768
  • Projection Matrix (E): (P² * 3) × D = (16² * 3) × 768 = 768 × 768
  • Pos. Embedding (E_pos): 197 × 768
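These shapes follow directly from the image size and patch size; a few lines of Python reproduce the arithmetic:

```python
# Recomputing the ViT-Base shapes listed above.
image_size, patch_size, channels, embed_dim = 224, 16, 3, 768

num_patches = (image_size // patch_size) ** 2   # (224/16)^2 = 14^2 = 196
seq_len = 1 + num_patches                       # CLS token + patches = 197
patch_dim = patch_size ** 2 * channels          # 16*16*3 = 768

print(num_patches, seq_len)                     # 196 197
print("E:", (patch_dim, embed_dim))             # E:     (768, 768)
print("E_pos:", (seq_len, embed_dim))           # E_pos: (197, 768)
```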

Variations & Considerations

  • Larger models (ViT-L, ViT-H) use larger D (1024, 1280).
  • Higher resolution images (e.g., 384x384) result in more patches (N = 576) and longer sequences; the pretrained positional embeddings then typically need to be interpolated to the new grid (see the sketch after this list).
  • Smaller patch sizes (e.g., 8x8) increase N significantly, demanding more compute but potentially capturing finer details.
  • Other architectures (Swin, CaiT) use different patching/embedding strategies (shifted windows, layer-scale).
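As one possible sketch of that positional-embedding interpolation (assuming cubic resampling via `scipy.ndimage.zoom`; real implementations vary in the resampling method), the embeddings trained on a 14×14 patch grid can be resized for the 24×24 grid of a 384x384 input:

```python
import numpy as np
from scipy.ndimage import zoom

def interpolate_pos_embed(E_pos: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Resize ViT positional embeddings to a new patch-grid size.

    E_pos has shape (1 + old_grid**2, D). The CLS position is kept as-is;
    the patch positions are reshaped to a 2D grid and resampled.
    """
    cls_pos, patch_pos = E_pos[:1], E_pos[1:]
    D = E_pos.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, D)
    scale = new_grid / old_grid
    resized = zoom(grid, (scale, scale, 1), order=3)   # cubic spline resampling
    return np.concatenate([cls_pos, resized.reshape(new_grid * new_grid, D)], axis=0)

E_pos = np.random.rand(197, 768).astype(np.float32)    # trained at 224x224 (14x14 grid)
E_pos_384 = interpolate_pos_embed(E_pos, old_grid=14, new_grid=24)
print(E_pos_384.shape)                                  # (577, 768): 1 CLS + 576 patches
```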
Note: the interactive ViT input visualization uses a patch size fixed at 4x4 for demonstration, and the embedding dimension is simplified to 8.