Flash Attention: IO-Aware Exact Attention
Interactive visualization of Flash Attention - the breakthrough algorithm that makes attention memory-efficient through tiling, recomputation, and kernel fusion.
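The description names the three techniques without showing how they fit together, so here is a minimal NumPy sketch of the tiled forward pass with an online softmax. The function name, block size, and single-head (N, d) layout are illustrative assumptions; this sketch tiles only the key/value dimension, while the real kernel tiles queries as well, fuses these steps in on-chip SRAM, and recomputes tiles during the backward pass instead of storing the score matrix.

```python
# Minimal sketch, assuming a single attention head with Q, K, V of
# shape (N, d). Illustrative only: the actual Flash Attention kernel
# is a fused CUDA kernel operating on SRAM-resident tiles.
import numpy as np

def flash_attention_forward(Q, K, V, block_size=64):
    """Exact attention computed tile by tile, never materializing
    the full N x N score matrix."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))        # unnormalized output accumulator
    m = np.full(N, -np.inf)     # running row-wise max (numerical stability)
    l = np.zeros(N)             # running softmax denominator

    for j in range(0, N, block_size):   # loop over key/value tiles
        Kj = K[j:j + block_size]
        Vj = V[j:j + block_size]
        S = (Q @ Kj.T) * scale          # scores for this tile only: (N, block)

        m_new = np.maximum(m, S.max(axis=1))   # updated running max
        correction = np.exp(m - m_new)         # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])         # tile's unnormalized weights

        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vj
        m = m_new

    return O / l[:, None]               # final softmax normalization

if __name__ == "__main__":
    # Sanity check against naive attention: the tiled pass is exact.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
    S = (Q @ K.T) / np.sqrt(64)
    ref = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
    assert np.allclose(flash_attention_forward(Q, K, V), ref)
```

Because each tile's partial sums are rescaled by exp(m - m_new) whenever the running maximum grows, the loop reproduces softmax(QK^T / sqrt(d)) V exactly; this is why the title says exact attention rather than an approximation, and why no N x N matrix ever needs to leave fast memory.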