ViT Input: Patching & Positional Embeddings

Visualizing how Vision Transformers process images into sequences with spatial context.

Processing Steps

1. Input
2. Patch
3. Project
4. Flatten
5. Add Positional Embeddings

The process starts with the raw input image (e.g., 224x224 pixels).
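To make the patching step concrete, here is a minimal NumPy sketch (illustrative, not the page's implementation) that splits an image into non-overlapping P×P patches and flattens each one into a vector of length P²·C, ready for the linear projection:

```python
import numpy as np

def image_to_patches(img: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (N, P*P*C) with N = (H/P) * (W/P).
    Assumes H and W are divisible by patch_size.
    """
    H, W, C = img.shape
    P = patch_size
    patches = img.reshape(H // P, P, W // P, P, C)   # (H/P, P, W/P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)       # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)            # (N, P*P*C)

img = np.random.rand(224, 224, 3).astype(np.float32)
patches = image_to_patches(img, patch_size=16)
print(patches.shape)  # (196, 768)
```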

Technical Implementation Details

Patch Embedding (Linear Projection)

z_0 = [x_class; E·x_p^1; E·x_p^2; ... ; E·x_p^N] + E_pos

Input sequence `z_0` to the Transformer: `x_class` is the learnable CLS token embedding, `x_p^i` is the flattened `i`-th patch, `E` is the learnable linear projection matrix, and `E_pos` is the positional embedding matrix. The addition is element-wise.
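A minimal NumPy sketch of the same computation, with randomly initialised arrays standing in for the learned parameters `E`, `x_class`, and `E_pos` (dimensions match ViT-Base, listed below):

```python
import numpy as np

rng = np.random.default_rng(0)

# ViT-Base dimensions (see the list below): 196 patches of 16x16x3, embedding dim 768.
N, P, C, D = 196, 16, 3, 768

# Flattened patches x_p^i stacked into an (N, P*P*C) matrix (random stand-in here).
x_p = rng.standard_normal((N, P * P * C)).astype(np.float32)

# Learned parameters, randomly initialised purely for illustration.
E = rng.standard_normal((P * P * C, D)).astype(np.float32)    # linear projection matrix
x_class = rng.standard_normal((1, D)).astype(np.float32)      # CLS token embedding
E_pos = rng.standard_normal((N + 1, D)).astype(np.float32)    # positional embeddings

# z_0 = [x_class; E·x_p^1; ...; E·x_p^N] + E_pos
patch_embeddings = x_p @ E                                     # (196, 768)
z_0 = np.concatenate([x_class, patch_embeddings], axis=0) + E_pos
print(z_0.shape)                                               # (197, 768)
```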

Typical Dimensions (ViT-Base)

  • Input Image: 224 × 224 × 3
  • Patch Size (P): 16 × 16
  • Number of Patches (N): (224/16)² = 14² = 196
  • Sequence Length: 1 (CLS) + 196 = 197
  • Embedding Dimension (D): 768
  • Projection Matrix (E): (P² * 3) × D = (16² * 3) × 768 = 768 × 768
  • Pos. Embedding (E_pos): 197 × 768
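These shapes follow directly from the image size and patch size; a few lines of Python reproduce the arithmetic:

```python
# Recomputing the ViT-Base shapes listed above.
image_size, patch_size, channels, embed_dim = 224, 16, 3, 768

num_patches = (image_size // patch_size) ** 2   # (224/16)^2 = 14^2 = 196
seq_len = 1 + num_patches                       # CLS token + patches = 197
patch_dim = patch_size ** 2 * channels          # 16*16*3 = 768

print(num_patches, seq_len)                     # 196 197
print("E:", (patch_dim, embed_dim))             # E:     (768, 768)
print("E_pos:", (seq_len, embed_dim))           # E_pos: (197, 768)
```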

Variations & Considerations

  • Larger models (ViT-L, ViT-H) use larger D (1024, 1280).
  • Higher resolution images (e.g., 384x384) result in more patches (N = 576) and longer sequences; the pretrained positional embeddings then typically need to be interpolated to the new grid (see the sketch after this list).
  • Smaller patch sizes (e.g., 8x8) increase N significantly, demanding more compute but potentially capturing finer details.
  • Other architectures (Swin, CaiT) use different patching/embedding strategies (shifted windows, layer-scale).
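As one possible sketch of that positional-embedding interpolation (assuming cubic resampling via `scipy.ndimage.zoom`; real implementations vary in the resampling method), the embeddings trained on a 14×14 patch grid can be resized for the 24×24 grid of a 384x384 input:

```python
import numpy as np
from scipy.ndimage import zoom

def interpolate_pos_embed(E_pos: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Resize ViT positional embeddings to a new patch-grid size.

    E_pos has shape (1 + old_grid**2, D). The CLS position is kept as-is;
    the patch positions are reshaped to a 2D grid and resampled.
    """
    cls_pos, patch_pos = E_pos[:1], E_pos[1:]
    D = E_pos.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, D)
    scale = new_grid / old_grid
    resized = zoom(grid, (scale, scale, 1), order=3)   # cubic spline resampling
    return np.concatenate([cls_pos, resized.reshape(new_grid * new_grid, D)], axis=0)

E_pos = np.random.rand(197, 768).astype(np.float32)    # trained at 224x224 (14x14 grid)
E_pos_384 = interpolate_pos_embed(E_pos, old_grid=14, new_grid=24)
print(E_pos_384.shape)                                  # (577, 768): 1 CLS + 576 patches
```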
Note: the interactive ViT input visualization uses a patch size fixed at 4x4 for demonstration, and the embedding dimension is simplified to 8.