Multi-Head Attention Visualization
An interactive exploration of how attention mechanisms work in transformer models
Input Embeddings
Input sequence of 4 token embeddings (dim=8). Focus on "I" as the Query.
Each token is converted to a vector representation (embedding) of dimension 8. Token 0 ("I") is our focus as the query token.
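Below is a minimal sketch of the input setup described above. The sequence length (4), embedding dimension (8), and token 0 being "I" come from the demo; the other token strings and the random values standing in for learned embeddings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "love", "machine", "learning"]  # only token 0 ("I") is named in the demo; the rest are hypothetical
d_model = 8                                    # embedding dimension used in the visualization

# A trained model would look these vectors up from a learned embedding table;
# random values stand in for them here.
embeddings = rng.normal(size=(len(tokens), d_model))

query_vec = embeddings[0]  # token 0 ("I") is the Query in the steps that follow
print(embeddings.shape)    # (4, 8)
```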
Multi-Head Attention Concepts
Scaled Dot-Product Attention
The core mechanism computes attention scores between a Query (Q) and all Keys (K). Scores are calculated as `Attention(Q, K, V) = softmax( (Q * K^T) / sqrt(d_k) ) * V`. Dividing by `sqrt(d_k)` keeps the dot products from growing too large when the key dimension is large, which would otherwise push the softmax into regions with vanishing gradients.
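A minimal NumPy sketch of this formula, applied as self-attention over the 4×8 embeddings (Q, K, and V all come from the same inputs here, with no learned projections yet):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return the output and the weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query with every key, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # stand-in for the 4 token embeddings of dim 8
output, attn = scaled_dot_product_attention(X, X, X)
print(attn[0])  # attention weights of token 0 ("I") over all four tokens; each row sums to 1
```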
Multiple Heads
Instead of one large attention calculation, multi-head attention projects Q, K, and V multiple times with different learned matrices. This allows the model to jointly attend to information from different representation subspaces.
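A sketch of that idea, assuming 2 heads over the 8-dimensional embeddings (the head count and the random projection matrices are illustrative assumptions, not values from the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads = 8, 2          # 2 heads is an assumption; each head works in d_k = d_model / num_heads dims
d_k = d_model // num_heads

X = rng.normal(size=(4, d_model))  # the 4 x 8 input embeddings

# One learned projection per head for Q, K, and V, plus an output projection.
W_q = rng.normal(size=(num_heads, d_model, d_k))
W_k = rng.normal(size=(num_heads, d_model, d_k))
W_v = rng.normal(size=(num_heads, d_model, d_k))
W_o = rng.normal(size=(num_heads * d_k, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for h in range(num_heads):
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]   # project into this head's subspace
    weights = softmax(Q @ K.T / np.sqrt(d_k))      # scaled dot-product attention per head
    head_outputs.append(weights @ V)

# Concatenate the heads and project back to d_model dimensions.
multi_head_out = np.concatenate(head_outputs, axis=-1) @ W_o
print(multi_head_out.shape)  # (4, 8)
```

Because each head has its own projection matrices, the heads can learn to attend to different patterns (for example, one tracking nearby tokens and another tracking long-range dependencies) before their outputs are recombined.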