Attention Sinks: Stable Streaming LLMs
Understand attention sinks, the phenomenon where LLMs concentrate attention on initial tokens, and how preserving them enables infinite-length streaming inference.
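To make the idea concrete, below is a minimal sketch of an attention-sink-aware KV cache policy: the first few tokens (the sinks) are never evicted, while the rest of the cache behaves as a sliding window over the most recent tokens. This is an illustration under assumptions, not any library's API; the names `SinkCache`, `num_sink_tokens`, and `window_size`, and the list-of-entries cache layout, are all hypothetical.

```python
from collections import deque

class SinkCache:
    """Rolling KV cache that always preserves the first few 'sink' tokens.

    Illustrative sketch only: the class name, parameters, and cache layout
    are assumptions for exposition, not a specific framework's interface.
    """

    def __init__(self, num_sink_tokens=4, window_size=1020):
        self.num_sink_tokens = num_sink_tokens   # initial tokens that are never evicted
        self.window_size = window_size           # how many recent tokens to retain
        self.sinks = []                          # KV entries for the initial sink tokens
        self.window = deque(maxlen=window_size)  # recent KV entries; oldest auto-evicted

    def append(self, kv_entry):
        """Add one token's key/value entry after a decoding step."""
        if len(self.sinks) < self.num_sink_tokens:
            self.sinks.append(kv_entry)
        else:
            self.window.append(kv_entry)

    def current(self):
        """Entries visible to attention: sink tokens plus the recent window, in order."""
        return self.sinks + list(self.window)


# Usage: stream far more tokens than the cache can hold; memory stays bounded
# while the sink tokens (kv_0..kv_3 here) remain available to attention.
cache = SinkCache(num_sink_tokens=4, window_size=8)
for t in range(100):
    cache.append(f"kv_{t}")
print(cache.current())
# ['kv_0', 'kv_1', 'kv_2', 'kv_3', 'kv_92', ..., 'kv_99']
```

Evicting the sink tokens instead would remove the positions the model has learned to dump excess attention onto, which is reported to destabilize generation once the context exceeds the window.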