Multi-Head Attention Visualization

An interactive exploration of how attention mechanisms work in transformer models


Input Embeddings

Input sequence of 4 token embeddings (dim=8). Focus on "I" as the Query.

Tokens: "I" (Query), "like", "machine", "learning"

X = [x₁, x₂, ..., xₙ]

Each token is converted to a vector representation (embedding) of dimension 8. Token 0 ("I") is our focus as the query token.
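A minimal sketch of this setup in numpy, using random vectors in place of a learned embedding table (the token list and dimension match the example above; everything else is illustrative):

```python
import numpy as np

# Hypothetical setup matching the example: 4 tokens, embedding dim 8.
tokens = ["I", "like", "machine", "learning"]
d_model = 8

rng = np.random.default_rng(0)
# In a real model these rows come from a learned embedding table;
# random vectors stand in for them here.
X = rng.normal(size=(len(tokens), d_model))  # X = [x1, x2, ..., xn]

query = X[0]  # token 0 ("I") is our focus as the query token
print(X.shape)  # → (4, 8)
```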

Multi-Head Attention Concepts

Scaled Dot-Product Attention

The core mechanism computes attention scores between a Query (Q) and all Keys (K). Scores are calculated as `Attention(Q, K, V) = softmax( (Q * K^T) / sqrt(d_k) ) * V`. The scaling factor 1/sqrt(d_k) keeps the dot products from growing with the key dimension; without it, large scores would push the softmax into saturated regions where gradients vanish.
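The formula above can be sketched directly in numpy. This is an illustrative implementation, not the visualization's own code; the random Q, K, V stand in for projected embeddings:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w is a probability distribution over the 4 keys.
```

Row `w[0]` is the attention the query token ("I" in the example) pays to every token in the sequence, which is exactly what the visualization displays.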

Multiple Heads

Instead of one large attention calculation, multi-head attention projects Q, K, and V multiple times with different learned matrices. This allows the model to jointly attend to information from different representation subspaces.
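A sketch of the multi-head version, assuming the standard split of `d_model` across heads with a final output projection. The per-head projection matrices are random here; in a trained model they are learned:

```python
import numpy as np

def multi_head_attention(X, n_heads, rng):
    """Project X into per-head Q, K, V, attend in each subspace, concatenate."""
    n, d_model = X.shape
    d_k = d_model // n_heads          # each head works in a smaller subspace
    heads = []
    for _ in range(n_heads):
        # Different projection matrices per head (random for illustration).
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V)           # (n, d_k) per head
    concat = np.concatenate(heads, axis=-1)   # (n, d_model)
    Wo = rng.normal(size=(d_model, d_model))  # final output projection
    return concat @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # the example's 4 tokens, dim 8
out = multi_head_attention(X, n_heads=2, rng=rng)
# out has shape (4, 8): same as the input, as in a transformer layer.
```

Because each head has its own Q/K/V projections, each attends over a different representation subspace, which is the property the visualization highlights when comparing heads.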
