Dense Embeddings Space Explorer
Interactive visualization of high-dimensional vector spaces, word relationships, and semantic arithmetic operations.
Dense embeddings revolutionized NLP by representing words and sentences as continuous vectors in high-dimensional space, where semantic similarity corresponds to geometric proximity.
Interactive 3D Embedding Space
[Interactive demo: an embedding configuration panel, a 3D view of the embedding space, and a nearest-neighbors panel for the selected word.]
Understanding Dense Embeddings
Key Properties
- Continuous vector representations
- Capture semantic similarity
- Enable arithmetic operations
- Typically 50-1000 dimensions
Common Models
- Word2Vec (CBOW, Skip-gram)
- GloVe (Global Vectors)
- FastText (Subword)
- BERT (Contextual)
Applications
- Semantic search
- Document clustering
- Recommendation systems
- Machine translation
What Are Dense Embeddings?
Dense embeddings are continuous vector representations where:
- Every dimension has a value (unlike sparse representations)
- Semantic similarity = geometric proximity
- Vector arithmetic captures relationships
- Typically 50-1000 dimensions
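A minimal NumPy sketch of the dense-versus-sparse contrast just described (the vector values are toy numbers for illustration):

```python
import numpy as np

# Sparse one-hot vectors: one dimension per vocabulary word, almost all zeros.
one_hot_cat = np.array([1, 0, 0])
one_hot_dog = np.array([0, 1, 0])
print(np.dot(one_hot_cat, one_hot_dog))   # 0 -- every pair of distinct words looks unrelated

# Dense vectors: every dimension has a value, and related words end up nearby.
dense_cat = np.array([0.8, 0.1, 0.3, 0.5])
dense_dog = np.array([0.7, 0.2, 0.4, 0.5])
print(np.dot(dense_cat, dense_dog))       # > 0 -- geometric proximity encodes similarity
```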
Key Concepts
1. Word Embeddings Evolution
The progression of embedding techniques:
| Model | Year | Key Innovation | Dimensions |
|---|---|---|---|
| Word2Vec | 2013 | Skip-gram / CBOW | 50-300 |
| GloVe | 2014 | Global matrix factorization | 50-300 |
| FastText | 2016 | Subword information | 100-300 |
| BERT | 2018 | Contextual embeddings | 768 |
| GPT-3 | 2020 | Scale + few-shot | 12,288 |
2. Training Objectives
Different models use different objectives:
Word2Vec Skip-gram maximizes the average log-probability of context words given each center word:

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$
GloVe minimizes a weighted least-squares objective over the word co-occurrence counts $X_{ij}$:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$
3. Cosine Similarity
The standard metric for comparing embeddings is the cosine of the angle between two vectors:

$$\operatorname{cosine}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
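A plain NumPy version, reused as a reference helper in the snippets below (a minimal sketch):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```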
Vector Arithmetic
The Famous Analogy
The most celebrated property of word embeddings:
king - man + woman ≈ queen
This works because embeddings encode relationships:
- `king - man` ≈ a "royalty" vector
- Adding `woman` applies royalty to the female concept
- The result is closest to `queen`
More Examples
```
# Relationships captured by arithmetic
paris - france + italy ≈ rome
bigger - big + small ≈ smaller
walking - walk + swim ≈ swimming
```
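With gensim, the same analogy queries can be run directly (a sketch; it assumes a Word2Vec model trained on a large corpus — the toy two-sentence model in the next section is far too small to reproduce these results):

```python
# king - man + woman -> ?
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)  # on a large corpus, typically [('queen', ...)]
```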
Implementation Details
Creating Word Embeddings
```python
import numpy as np
from gensim.models import Word2Vec

# Train Word2Vec
sentences = [["cat", "sat", "mat"], ["dog", "stood", "rug"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 -> Skip-gram

# Get embeddings
cat_vector = model.wv['cat']
dog_vector = model.wv['dog']

# Compute similarity
similarity = model.wv.similarity('cat', 'dog')
```
Finding Nearest Neighbors
```python
# cosine_similarity: the NumPy helper defined in the Cosine Similarity section above
def find_nearest(embedding, embeddings, k=5):
    """Find k nearest neighbors using cosine similarity."""
    similarities = []
    for word, vec in embeddings.items():
        sim = cosine_similarity(embedding, vec)
        similarities.append((word, sim))
    # Sort by similarity, highest first
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:k]
```
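Example usage against the gensim model trained above (`index_to_key` lists the vocabulary in gensim 4.x):

```python
# Build a {word: vector} lookup from the trained model and query it
embeddings = {word: model.wv[word] for word in model.wv.index_to_key}
print(find_nearest(model.wv['cat'], embeddings, k=3))
# Note: the query word itself comes back first with similarity ~1.0
```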
Sentence Embeddings
Moving from words to sentences:
Average Pooling
Simple but effective:
```python
# Average the word vectors of the sentence's tokens (e.g. Word2Vec vectors from model.wv)
sentence_emb = np.mean([model.wv[word] for word in sentence], axis=0)
```
Weighted Average
Using TF-IDF or importance weights:
```python
weights = compute_tfidf(sentence)
sentence_emb = np.average(word_embs, weights=weights, axis=0)
```
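`compute_tfidf` above is left abstract; one possible sketch uses scikit-learn's `TfidfVectorizer` and weights each token by its IDF (a common simplification; `corpus` and `word_vectors` are assumed to exist):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)  # corpus: list of raw sentence strings
idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

def weighted_sentence_embedding(tokens, word_vectors):
    """Average word vectors, weighting rarer (higher-IDF) tokens more heavily."""
    weights = np.array([idf.get(token, 1.0) for token in tokens])
    vectors = np.array([word_vectors[token] for token in tokens])
    return np.average(vectors, weights=weights, axis=0)
```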
Sentence-BERT
Specialized models for sentence embeddings:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
```
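These sentence embeddings can be compared directly, for example with `util.cos_sim` from the same package:

```python
from sentence_transformers import util

emb = model.encode(["The cat sat on the mat.", "A cat is resting on a rug."])
print(util.cos_sim(emb[0], emb[1]))  # high similarity despite different wording
```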
Applications
1. Semantic Search
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Index documents
doc_embeddings = model.encode(documents)

# Search
query_embedding = model.encode([query])                        # shape (1, dim)
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_k = np.argsort(similarities)[-k:]                          # indices of the k best matches
```
2. Clustering
```python
from sklearn.cluster import KMeans

# Cluster embeddings
kmeans = KMeans(n_clusters=10)
clusters = kmeans.fit_predict(embeddings)
```
3. Classification
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Use embeddings as features
X = np.array([get_embedding(text) for text in texts])
classifier = LogisticRegression()
classifier.fit(X, labels)
```
Visualization Techniques
t-SNE Projection
Reduce dimensions for visualization:
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30)
embeddings_2d = tsne.fit_transform(embeddings)
```
UMAP
Faster alternative to t-SNE:
```python
import umap

reducer = umap.UMAP(n_components=2)
embeddings_2d = reducer.fit_transform(embeddings)
```
Common Pitfalls
1. Bias in Embeddings
Word embeddings can encode societal biases:
```
# Problematic associations learned from data
doctor - man + woman ≈ nurse              # gender bias
programmer - man + woman ≈ homemaker      # occupation bias
```
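One simple way to probe for such bias is the direction-projection idea from the debiasing literature (a sketch; it assumes a trained gensim `model` as above and uses a single seed pair, which is only a rough approximation):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Rough "gender direction" from one seed pair
gender_direction = unit(model.wv['he'] - model.wv['she'])

for word in ['doctor', 'nurse', 'programmer', 'homemaker']:
    projection = float(np.dot(unit(model.wv[word]), gender_direction))
    print(word, round(projection, 3))  # sign and magnitude hint at gendered associations
```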
2. Out-of-Vocabulary Words
Handling unknown words:
- Use subword tokenization (FastText)
- Fall back to character embeddings
- Use contextual models (BERT)
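The first option in code: FastText builds vectors from character n-grams, so it can embed words it never saw during training (a sketch with gensim's FastText on the toy corpus from earlier):

```python
from gensim.models import FastText

sentences = [["cat", "sat", "mat"], ["dog", "stood", "rug"]]
ft = FastText(sentences, vector_size=100, window=5, min_count=1)

# 'cats' never appears in the corpus, but its subword n-grams still yield a vector
vec = ft.wv['cats']
print(vec.shape)  # (100,)
```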
3. Polysemy
Single vector per word loses context:
- "bank" (financial) vs "bank" (river)
- Solution: Contextual embeddings (BERT, GPT)
Performance Considerations
Memory Usage
- Word2Vec: ~1GB for 1M words × 300 dims
- BERT: ~400MB model + dynamic computation
- Storage: Use float16 or quantization
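The storage point in code (a sketch; sizes are approximate):

```python
import numpy as np

embeddings = np.random.rand(1_000_000, 300).astype(np.float32)
print(embeddings.nbytes / 1e9)        # ~1.2 GB in float32

embeddings_fp16 = embeddings.astype(np.float16)
print(embeddings_fp16.nbytes / 1e9)   # ~0.6 GB, usually with negligible quality loss
```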
Speed Optimization
```python
import numpy as np
from annoy import AnnoyIndex

# Batch operations: one matrix multiplication instead of a Python loop
similarities = np.dot(query_embs, doc_embs.T)

# Approximate nearest neighbor search
index = AnnoyIndex(embedding_dim, 'angular')
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(10)  # 10 trees
```
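Querying the Annoy index uses `get_nns_by_vector`; `query_vec` below stands for any embedding of the same dimensionality:

```python
# IDs of the 10 approximate nearest neighbors for a query vector
neighbor_ids = index.get_nns_by_vector(query_vec, 10)
```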
Modern Developments
1. Contextual Embeddings
BERT and GPT models provide context-dependent embeddings:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Different embeddings for the same word in different contexts
inputs1 = tokenizer("The bank is closed", return_tensors="pt")
inputs2 = tokenizer("The river bank is muddy", return_tensors="pt")

# Token-level contextual embeddings: shape (batch, seq_len, 768)
embeddings1 = model(**inputs1).last_hidden_state
embeddings2 = model(**inputs2).last_hidden_state
```
2. Multilingual Embeddings
Cross-lingual understanding:
- mBERT: 104 languages
- XLM-R: 100 languages
- LaBSE: Language-agnostic sentence embeddings
3. Multimodal Embeddings
Combining text and vision:
- CLIP: Text-image alignment
- ALIGN: Noisy data training
- Flamingo: Few-shot multimodal
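A sketch of CLIP-style text-image scoring with Hugging Face transformers (the image path is a placeholder):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.open('photo.jpg')  # placeholder path
inputs = processor(text=['a photo of a cat', 'a photo of a dog'],
                   images=image, return_tensors='pt', padding=True)
outputs = clip(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # probability of each caption matching the image
```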
Best Practices
1. Choose the right model:
   - Static embeddings for speed
   - Contextual embeddings for accuracy
   - Domain-specific models when available
2. Normalize embeddings:
   ```python
   normalized = embedding / np.linalg.norm(embedding)
   ```
3. Use appropriate similarity metrics:
   - Cosine for normalized vectors
   - Euclidean for positional relationships
   - Dot product for efficiency
4. Consider fine-tuning (see the sketch below):
   - Domain adaptation improves performance
   - Contrastive learning for specific tasks
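A sketch of contrastive fine-tuning with sentence-transformers (the pairs and epoch count are illustrative; `MultipleNegativesRankingLoss` treats the other in-batch examples as negatives):

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')
train_examples = [
    InputExample(texts=['how do I reset my password', 'password reset instructions']),
    InputExample(texts=['refund policy', 'how to get my money back']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```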
Related Concepts
- Quantization Effects - Reducing embedding precision
- Matryoshka Embeddings - Multi-scale representations
- Sparse vs Dense - Comparing embedding types
References
- Mikolov et al. (2013). "Efficient Estimation of Word Representations in Vector Space"
- Pennington et al. (2014). "GloVe: Global Vectors for Word Representation"
- Devlin et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- Reimers & Gurevych (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"