ALiBi: Attention with Linear Biases
Understand ALiBi, the position encoding method that adds linear biases to attention scores, enabling exceptional length extrapolation without position embeddings.
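To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch (illustrative code, not the authors' reference implementation; the names `alibi_slopes` and `alibi_attention` are chosen here for clarity). Each head subtracts its slope times the query-key distance from the raw attention scores, with head-specific slopes following the geometric schedule the ALiBi paper uses for power-of-two head counts.

```python
import math
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    """Head-specific slopes: the geometric sequence 2^(-8/n), 2^(-16/n), ...
    (the schedule used for power-of-two head counts)."""
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (h + 1) for h in range(num_heads)])

def alibi_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                    slopes: torch.Tensor) -> torch.Tensor:
    """Causal attention with ALiBi's linear distance bias.
    q, k, v: (num_heads, seq_len, head_dim); slopes: (num_heads,)."""
    num_heads, seq_len, head_dim = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)        # (H, T, T)

    # ALiBi: subtract slope * distance(query, key). No position embeddings are added.
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)         # (T, T), 0 on the diagonal
    scores = scores - slopes[:, None, None] * distance

    # Standard causal mask: queries cannot attend to future keys.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))

    return torch.softmax(scores, dim=-1) @ v

# Example: 8 heads, 16 tokens, 64-dim heads.
q = k = v = torch.randn(8, 16, 64)
out = alibi_attention(q, k, v, alibi_slopes(8))                   # (8, 16, 64)
```

Because the penalty is a fixed, parameter-free function of distance rather than a learned position embedding, the same computation applies unchanged to sequences longer than any seen during training, which is where the extrapolation benefit comes from.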
Explore more machine learning concepts related to transformers, each covered with clear explanations and practical insights:
Compare Multi-Head, Grouped-Query, and Multi-Query Attention mechanisms to understand their trade-offs and choose the optimal approach for your use case.
Understand attention sinks, the phenomenon where LLMs concentrate attention on initial tokens, and how preserving them enables infinite-length streaming inference.
Understand cross-attention, the mechanism that enables transformers to align and fuse information from different sources, sequences, or modalities.
Learn how Grouped-Query Attention balances the quality of Multi-Head Attention with the efficiency of Multi-Query Attention, enabling faster inference in large language models.
Explore linear complexity attention mechanisms including Performer, Linformer, and other efficient transformers that scale to very long sequences.
Learn how masked attention enables autoregressive generation and prevents information leakage in transformers, essential for language models and sequential generation.
Understand Multi-Query Attention, the radical efficiency optimization that shares keys and values across all attention heads, enabling massive memory savings for inference.
Understand Rotary Position Embeddings (RoPE), the elegant method that encodes relative positions through rotation matrices, used in LLaMA, GPT-NeoX, and most modern LLMs; a minimal sketch of the rotation follows this list.
Master the fundamental building block of transformers - scaled dot-product attention. Learn why scaling is crucial and how the mechanism enables parallel computation.
Learn how Sliding Window Attention enables efficient processing of long sequences by limiting attention to local context windows, used in Mistral and Longformer.
Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.
Deep dive into how different prompt components influence model behavior across transformer layers, from surface patterns to abstract reasoning.
Understand the fundamental differences between independent and joint encoding architectures for neural retrieval systems.
Interactive visualization of context window mechanisms in LLMs - sliding windows, expanding contexts, and attention patterns that define what models can "remember".
Interactive visualization of key-value caching in LLMs - how caching transformer attention states enables efficient text generation without quadratic recomputation.
Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.
Understanding sparse mixture of experts models - architecture, routing mechanisms, load balancing, and efficient scaling strategies for large language models.
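For contrast with ALiBi's additive bias, the rotary approach referenced in the list above encodes position by rotating query and key channels before the dot product. The sketch below is illustrative code following the usual RoPE formulation (consecutive channel pairs, the conventional base of 10000), not any particular library's implementation.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive channel pairs of x (shape: seq_len, dim; dim even)
    by an angle proportional to each token's position."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair frequencies base^(-2i/dim), the conventional RoPE schedule.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) * 2.0 / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                        # paired channels
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Applying the same rotation to queries and keys makes their dot product depend
# only on the relative offset between positions, which is how RoPE encodes
# relative position without an additive bias.
# q, k = apply_rope(q), apply_rope(k)
```

Unlike ALiBi's distance penalty, which leaves queries and keys untouched and biases the scores, RoPE modifies the queries and keys themselves; both approaches avoid adding absolute position embeddings to the token embeddings.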