Vector Index Structures


Explore the fundamental data structures powering vector databases: trees, graphs, hash tables, and hybrid approaches for efficient similarity search.



Understanding the fundamental data structures behind vector search is crucial for building efficient similarity search systems. Each structure offers unique tradeoffs between build time, query time, memory usage, and accuracy.

Interactive Index Structure Visualization

Explore how different index structures organize and search through vector data:

The baseline shown first is the flat index, which answers every query by scanning all vectors (brute force). Its key properties:

  • Search complexity: O(N × d)
  • Accuracy: 100% (exact)
  • Memory usage: N × d × 4 bytes (for float32 vectors)
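
To make this baseline concrete, here is a minimal flat-index sketch in NumPy; the FlatIndex class and its interface are illustrative, not taken from any particular library:

```python
import numpy as np

class FlatIndex:
    """Brute-force index: exact results at O(N × d) per query."""
    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float32)  # N × d matrix

    def search(self, query, k=10):
        # Compare the query against every stored vector...
        query = np.asarray(query, dtype=np.float32)
        dists = np.linalg.norm(self.vectors - query, axis=1)
        # ...then keep the k smallest distances
        top_k = np.argsort(dists)[:k]
        return top_k, dists[top_k]
```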

Index Structure Categories

Taxonomy of Vector Indices

```
Vector Indices
├── Tree-Based
│   ├── Space Partitioning
│   │   ├── KD-Tree
│   │   ├── Octree
│   │   └── R-Tree
│   └── Metric Trees
│       ├── Ball Tree
│       ├── Cover Tree
│       └── VP-Tree
├── Graph-Based
│   ├── Delaunay Graph
│   ├── K-NN Graph
│   ├── NSW (Small World)
│   └── HNSW (Hierarchical)
├── Hash-Based
│   ├── LSH (Locality Sensitive)
│   ├── Multi-Index Hashing
│   └── Learning to Hash
└── Learned Indices
    ├── Neural Networks
    ├── Learned Trees
    └── Differentiable Indices
```

Tree-Based Structures

1. KD-Tree (K-Dimensional Tree)

Binary tree that partitions space using axis-aligned hyperplanes:

```python
import heapq
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right

class KDTree:
    def __init__(self, points, leaf_size=10):
        self.leaf_size = leaf_size
        self.root = self._build(points, 0)

    def _build(self, points, depth):
        if len(points) <= self.leaf_size:
            return KDNode(points, -1)  # Leaf node stores the point list

        # Cycle through axes by depth (a common alternative picks
        # the axis with maximum variance)
        axis = depth % len(points[0])

        # Sort and split at the median
        points.sort(key=lambda p: p[axis])
        median = len(points) // 2

        return KDNode(
            point=points[median],
            axis=axis,
            left=self._build(points[:median], depth + 1),
            right=self._build(points[median + 1:], depth + 1)
        )

    def search(self, query, k=1):
        """K-nearest neighbor search"""
        heap = []  # Max-heap of (-distance, point)

        def consider(point):
            dist = euclidean_distance(query, point)
            if len(heap) < k:
                heapq.heappush(heap, (-dist, point))
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, (-dist, point))

        def traverse(node, depth):
            if node.axis == -1:  # Leaf
                for point in node.point:
                    consider(point)
                return

            # The split point is stored only at this node, so check it here
            consider(node.point)

            # Determine which side to explore first
            axis = node.axis
            if query[axis] < node.point[axis]:
                first, second = node.left, node.right
            else:
                first, second = node.right, node.left

            # Explore the closer side
            if first:
                traverse(first, depth + 1)

            # Explore the other side only if the splitting plane is
            # closer than the current k-th best distance
            if len(heap) < k or \
               abs(query[axis] - node.point[axis]) < -heap[0][0]:
                if second:
                    traverse(second, depth + 1)

        traverse(self.root, 0)
        return [point for _, point in sorted(heap, reverse=True)]
```
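
A quick sanity check of the class above, with illustrative values:

```python
import random

points = [[random.random() for _ in range(3)] for _ in range(200)]
tree = KDTree(points, leaf_size=10)
nearest = tree.search([0.5, 0.5, 0.5], k=5)
print(nearest)  # The five stored points closest to the query, nearest first
```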

Complexity:

  • Build: O(N log N)
  • Search: O(log N) average, O(N) worst
  • Space: O(N)

Limitations:

  • Curse of dimensionality (ineffective for d > 20)
  • Unbalanced with skewed data

2. Ball Tree

Hierarchical structure using hyperspheres:

```python
import numpy as np
# Reuses euclidean_distance from the KD-tree example above

class BallNode:
    def __init__(self, center, radius, points, left=None, right=None):
        self.center = center
        self.radius = radius
        self.points = points
        self.left = left
        self.right = right

class BallTree:
    def __init__(self, points, leaf_size=40):
        self.leaf_size = leaf_size
        self.root = self._build(np.asarray(points))

    def _build(self, points):
        if len(points) <= self.leaf_size:
            center = np.mean(points, axis=0)
            radius = max(euclidean_distance(center, p) for p in points)
            return BallNode(center, radius, points)

        # Find the dimension with maximum spread
        spreads = np.max(points, axis=0) - np.min(points, axis=0)
        split_dim = np.argmax(spreads)

        # Partition at the median along that dimension
        indices = np.argsort(points[:, split_dim])
        median = len(points) // 2
        left_points = points[indices[:median]]
        right_points = points[indices[median:]]

        # Build child nodes
        left_child = self._build(left_points)
        right_child = self._build(right_points)

        # Compute the bounding ball for this node
        center = np.mean(points, axis=0)
        radius = max(euclidean_distance(center, p) for p in points)
        return BallNode(center, radius, None, left_child, right_child)
```

Advantages over KD-Tree:

  • Better for high dimensions
  • Adapts to data distribution
  • Efficient pruning using the triangle inequality (see the search sketch below)
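
The pruning bound works because every point p inside a ball with center c and radius r satisfies d(q, p) ≥ d(q, c) − r, so any ball whose lower bound already exceeds the current k-th best distance can be skipped entirely. A minimal search sketch, assuming the BallTree above and the euclidean_distance helper from the KD-tree example:

```python
import heapq

def ball_tree_search(tree, query, k=1):
    """k-NN search with triangle-inequality pruning (sketch)."""
    heap = []  # Max-heap of (-dist, id(point), point)

    def visit(node):
        if node is None:
            return
        # Lower bound on the distance to anything inside this ball
        lower = euclidean_distance(query, node.center) - node.radius
        if len(heap) == k and lower > -heap[0][0]:
            return  # Ball cannot contain a closer point: prune it
        if node.points is not None:  # Leaf
            for p in node.points:
                d = euclidean_distance(query, p)
                if len(heap) < k:
                    heapq.heappush(heap, (-d, id(p), p))
                elif d < -heap[0][0]:
                    heapq.heapreplace(heap, (-d, id(p), p))
            return
        visit(node.left)
        visit(node.right)

    visit(tree.root)
    return [p for _, _, p in sorted(heap, reverse=True)]
```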

3. VP-Tree (Vantage Point Tree)

Partitions based on distances from vantage points:

```python
import random
# Reuses euclidean_distance from the KD-tree example above

class VPNode:
    def __init__(self, vantage, threshold, inside=None, outside=None):
        self.vantage = vantage
        self.threshold = threshold
        self.inside = inside
        self.outside = outside

class VPTree:
    def __init__(self, points, leaf_size=10):
        self.leaf_size = leaf_size
        self.root = self._build(points)

    def _build(self, points):
        if len(points) <= self.leaf_size:
            return points  # Leaf: a plain list of points

        # Choose a vantage point (random or furthest from the parent)
        vantage = random.choice(points)

        # Compute distances from every other point to the vantage
        distances = [(p, euclidean_distance(vantage, p))
                     for p in points if p != vantage]
        distances.sort(key=lambda x: x[1])

        # Split at the median distance
        median = len(distances) // 2
        threshold = distances[median][1]

        inside = [p for p, d in distances[:median]]
        outside = [p for p, d in distances[median:]]

        return VPNode(
            vantage=vantage,
            threshold=threshold,
            inside=self._build(inside) if inside else None,
            outside=self._build(outside) if outside else None
        )
```
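
The class above only builds the tree. Search uses the stored threshold τ: if d(q, v) < τ, every point outside the ball is at least τ − d(q, v) away, and symmetrically for the inside, so whole subtrees can be pruned. A sketch under the same assumptions as before:

```python
import heapq

def vp_tree_search(tree, query, k=1):
    """k-NN search on the VPTree above (sketch)."""
    heap = []  # Max-heap of (-dist, point)

    def consider(point):
        d = euclidean_distance(query, point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, point))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, point))

    def visit(node):
        if node is None:
            return
        if isinstance(node, list):  # Leaf: a plain list of points
            for p in node:
                consider(p)
            return
        d = euclidean_distance(query, node.vantage)
        consider(node.vantage)
        # Visit the side containing the query first, prune the other
        if d < node.threshold:
            visit(node.inside)
            tau = -heap[0][0]  # Current k-th best distance
            if len(heap) < k or d + tau >= node.threshold:
                visit(node.outside)
        else:
            visit(node.outside)
            tau = -heap[0][0]
            if len(heap) < k or d - tau <= node.threshold:
                visit(node.inside)

    visit(tree.root)
    return [p for _, p in sorted(heap, reverse=True)]
```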

Graph-Based Structures

1. K-NN Graph

Connect each point to its k nearest neighbors:

```python
import random
# Reuses euclidean_distance from the KD-tree example above

class KNNGraph:
    def __init__(self, points, k=10):
        self.points = points
        self.k = k
        self.graph = self._build_graph()

    def _build_graph(self):
        """Build the exact k-NN graph (O(N^2) pairwise distances)"""
        n = len(self.points)
        graph = {i: [] for i in range(n)}

        for i in range(n):
            # Find the k nearest neighbors of point i
            distances = []
            for j in range(n):
                if i != j:
                    dist = euclidean_distance(self.points[i], self.points[j])
                    distances.append((j, dist))

            # Sort and keep the k nearest
            distances.sort(key=lambda x: x[1])
            graph[i] = [j for j, _ in distances[:self.k]]

        return graph

    def search(self, query, k=10, max_steps=100):
        """Greedy search on the graph"""
        # Start from a random node
        current = random.randint(0, len(self.points) - 1)
        visited = set()
        candidates = []

        for _ in range(max_steps):
            if current in visited:
                break
            visited.add(current)

            # Record the current node as a candidate
            dist = euclidean_distance(query, self.points[current])
            candidates.append((current, dist))

            # Move to the unvisited neighbor closest to the query
            best_neighbor = None
            best_dist = float('inf')
            for neighbor in self.graph[current]:
                if neighbor not in visited:
                    dist = euclidean_distance(query, self.points[neighbor])
                    if dist < best_dist:
                        best_dist = dist
                        best_neighbor = neighbor

            if best_neighbor is None:
                break  # Local minimum reached
            current = best_neighbor

        # Return the k best candidates seen along the walk
        candidates.sort(key=lambda x: x[1])
        return [idx for idx, _ in candidates[:k]]
```

2. Navigable Small World (NSW)

Add long-range connections for faster navigation:

```python
class NSW:
    def __init__(self, d, m=16, ef_construction=200):
        self.d = d
        self.m = m                  # Number of connections per node
        self.ef = ef_construction   # Search breadth during construction
        self.nodes = []
        self.graph = {}

    def insert(self, point):
        """Insert with small-world connections"""
        idx = len(self.nodes)
        self.nodes.append(point)

        if idx == 0:
            self.graph[0] = []
            return

        # Find nearest neighbors among existing nodes
        # (search_layer and prune_connections are sketched below; using
        # all existing nodes as entry points, as here, degenerates to
        # exhaustive search, whereas real NSW uses a few random entries)
        neighbors = self.search_layer(point, self.m + self.ef,
                                      list(range(idx)))

        # Add bidirectional edges
        self.graph[idx] = []
        for neighbor_idx in neighbors[:self.m]:
            self.graph[idx].append(neighbor_idx)
            self.graph[neighbor_idx].append(idx)

            # Prune connections if the neighbor now has too many
            if len(self.graph[neighbor_idx]) > self.m:
                self.prune_connections(neighbor_idx)
```
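
The two helpers referenced above are not shown in the snippet. Plausible versions are sketched here, assuming search_layer is a greedy beam search over the current graph and prune_connections keeps only the m closest edges (a simplification, not the canonical NSW construction):

```python
import heapq
# Reuses euclidean_distance from the KD-tree example above

def search_layer(self, query, ef, entry_points):
    """Greedy beam search: return up to ef node ids, closest first (sketch)."""
    dist = lambda i: euclidean_distance(query, self.nodes[i])
    visited = set(entry_points)
    candidates = [(dist(i), i) for i in entry_points]
    heapq.heapify(candidates)         # Min-heap frontier
    best = sorted(candidates)[:ef]    # Running result set

    while candidates:
        d, i = heapq.heappop(candidates)
        if len(best) >= ef and d > best[-1][0]:
            break  # The frontier can no longer improve the result set
        for j in self.graph.get(i, []):
            if j not in visited:
                visited.add(j)
                dj = dist(j)
                heapq.heappush(candidates, (dj, j))
                best.append((dj, j))
        best = sorted(best)[:ef]

    return [i for _, i in best]

def prune_connections(self, idx):
    """Keep only the m closest neighbors of node idx (sketch)."""
    by_distance = sorted(
        self.graph[idx],
        key=lambda j: euclidean_distance(self.nodes[idx], self.nodes[j]))
    self.graph[idx] = by_distance[:self.m]

NSW.search_layer = search_layer
NSW.prune_connections = prune_connections
```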

3. Delaunay Graph

Natural neighbor structure in geometric space:

```python
import numpy as np
from scipy.spatial import Delaunay

class DelaunayGraph:
    def __init__(self, points):
        self.points = np.array(points)
        self.tri = Delaunay(self.points)
        self.graph = self._build_graph()

    def _build_graph(self):
        """Extract the neighbor graph from the Delaunay triangulation"""
        graph = {i: set() for i in range(len(self.points))}

        for simplex in self.tri.simplices:
            # Connect every pair of vertices in the simplex
            for i in range(len(simplex)):
                for j in range(i + 1, len(simplex)):
                    graph[simplex[i]].add(simplex[j])
                    graph[simplex[j]].add(simplex[i])

        return {k: list(v) for k, v in graph.items()}
```
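
A quick usage example; note that Delaunay triangulation is only practical in low dimensions, since the number of simplices grows steeply with d:

```python
import numpy as np

points = np.random.rand(100, 2)   # 100 random 2-D points
dg = DelaunayGraph(points)
print(dg.graph[0])                # Indices of point 0's natural neighbors
```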

Inverted Index Structures

1. Inverted File (IVF)

Cluster-based inverted index:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

class InvertedFile:
    def __init__(self, n_clusters=1024):
        self.n_clusters = n_clusters
        self.centroids = None
        self.inverted_lists = {}
        self.vectors = None

    def build(self, vectors):
        """Build the inverted file index"""
        self.vectors = np.asarray(vectors)

        # Cluster the vectors
        kmeans = KMeans(n_clusters=self.n_clusters)
        cluster_ids = kmeans.fit_predict(self.vectors)
        self.centroids = kmeans.cluster_centers_

        # Build one inverted list per cluster
        for i, cluster_id in enumerate(cluster_ids):
            if cluster_id not in self.inverted_lists:
                self.inverted_lists[cluster_id] = []
            self.inverted_lists[cluster_id].append(i)

    def search(self, query, k=10, nprobe=10):
        """Search only the nprobe nearest clusters"""
        # Find the nearest cluster centroids
        distances = euclidean_distances(query.reshape(1, -1),
                                        self.centroids)[0]
        nearest_clusters = np.argsort(distances)[:nprobe]

        # Gather candidates from those clusters
        candidates = []
        for cluster_id in nearest_clusters:
            if cluster_id in self.inverted_lists:
                candidates.extend(self.inverted_lists[cluster_id])

        if not candidates:
            return []

        # Rank candidates by exact distance
        candidate_vectors = self.vectors[candidates]
        distances = euclidean_distances(query.reshape(1, -1),
                                        candidate_vectors)[0]
        top_k = np.argsort(distances)[:k]
        return [candidates[i] for i in top_k]
```

2. Multi-Index Structure

Multiple complementary indices:

```python
class MultiIndex:
    def __init__(self, index_types=['ivf', 'lsh', 'graph']):
        # create_index is an assumed factory returning an index
        # object exposing a search(query, k) method
        self.indices = {}
        for index_type in index_types:
            self.indices[index_type] = create_index(index_type)

    def search(self, query, k=10):
        """Combine results from multiple indices"""
        all_candidates = set()
        for index in self.indices.values():
            # Over-fetch from each index, then merge
            candidates = index.search(query, k * 2)
            all_candidates.update(candidates)

        # Rerank the merged candidate set (see the sketch below)
        return rerank(query, all_candidates, k)
```
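
The rerank helper is left undefined above; a straightforward version recomputes exact distances over the merged candidate ids. A sketch, with the dataset passed explicitly so the function is self-contained (the snippet above presumably closes over it):

```python
import numpy as np

def rerank(query, candidate_ids, k, vectors):
    """Exact re-ranking of a merged candidate set (sketch)."""
    ids = np.fromiter(candidate_ids, dtype=int)           # Set -> id array
    dists = np.linalg.norm(vectors[ids] - query, axis=1)  # Exact distances
    order = np.argsort(dists)[:k]                         # k smallest
    return ids[order].tolist()
```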

Hybrid Structures

1. Tree + Graph

Combine tree partitioning with graph navigation:

```python
class HybridTreeGraph:
    def __init__(self):
        self.tree = None    # Tree for coarse partitioning
        self.graphs = {}    # One k-NN graph per leaf

    def build(self, vectors):
        # Build the tree structure (assumes a KDTree variant
        # exposing get_leaves() and find_leaves())
        self.tree = KDTree(vectors)

        # Build a graph over the points in each leaf
        for leaf in self.tree.get_leaves():
            leaf_vectors = leaf.points
            graph = build_knn_graph(leaf_vectors)
            self.graphs[leaf.id] = graph

    def search(self, query, k=10):
        # Coarse step: find the leaves relevant to the query
        leaves = self.tree.find_leaves(query, radius=None)

        # Fine step: graph search inside each leaf
        results = []
        for leaf in leaves:
            graph = self.graphs[leaf.id]
            candidates = graph_search(query, graph, k)
            results.extend(candidates)

        # merge_results deduplicates and keeps the k best
        return merge_results(results, k)
```

2. Hierarchical Structure

Multi-resolution index:

```python
class HierarchicalIndex:
    def __init__(self, levels=3):
        self.levels = levels
        self.indices = []

    def build(self, vectors):
        current_vectors = vectors

        for level in range(self.levels):
            # Build an index for the current level
            # (build_hnsw and build_ivf are assumed constructors)
            if level == 0:
                # Fine-grained index over the raw vectors
                index = build_hnsw(current_vectors)
            else:
                # Coarser index over representatives
                index = build_ivf(current_vectors,
                                  n_clusters=len(current_vectors) // 10)
            self.indices.append(index)

            # Create representatives for the next, coarser level
            if level < self.levels - 1:
                current_vectors = compute_representatives(current_vectors)
```
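
compute_representatives is also unspecified; one plausible choice, sketched here, uses k-means centroids as the next level's representatives, with a reduction factor matching the n_clusters heuristic above:

```python
import numpy as np
from sklearn.cluster import KMeans

def compute_representatives(vectors, reduction=10):
    """Sketch: summarize every ~reduction vectors by a k-means centroid."""
    n_clusters = max(1, len(vectors) // reduction)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(np.asarray(vectors))
    return kmeans.cluster_centers_
```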

Memory Layouts

1. Cache-Optimized Layout

Keep the fields touched on every traversal step together in a single cache line:

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 64

typedef struct CacheOptimizedNode {
    /* Hot data (accessed on every traversal step) */
    float    center[8];      /* First 8 dimensions only */
    float    radius;
    uint32_t num_points;

    /* Pad the hot fields out to a full cache line */
    char padding[CACHE_LINE_SIZE
                 - 8 * sizeof(float)    /* center     */
                 - sizeof(float)        /* radius     */
                 - sizeof(uint32_t)];   /* num_points */

    /* Cold data (rarely accessed) */
    float*    full_center;   /* All dimensions */
    uint32_t* point_ids;
    struct CacheOptimizedNode* children[2];
} CacheOptimizedNode;
```

2. Compressed Storage

Separate fixed-size metadata from variable-length compressed payloads:

```python
class CompressedIndex:
    def __init__(self):
        self.metadata = {}              # Fixed-size metadata per vector
        self.compressed_vectors = []    # Variable-size compressed bytes
        self.offsets = []               # Start offset of each vector

    def add(self, vector, metadata):
        # Compress the vector (compress_vector is an assumed codec,
        # e.g. product quantization or scalar quantization)
        compressed = compress_vector(vector)

        # Record where this vector's bytes start
        offset = len(self.compressed_vectors)
        self.offsets.append(offset)
        self.compressed_vectors.extend(compressed)

        # Store metadata separately, keyed by insertion order
        self.metadata[len(self.offsets) - 1] = metadata
```

Performance Comparison

Build and Query Complexity

| Structure | Build Time | Query Time | Space | Effective Dimensions |
|-----------|------------|------------|-------|----------------------|
| Linear Scan | O(1) | O(N · d) | O(N · d) | Any |
| KD-Tree | O(N log N) | O(2^d + k) | O(N) | < 20 |
| Ball Tree | O(N log N) | O(d log N) | O(N) | < 50 |
| VP-Tree | O(N log N) | O(log N) | O(N) | Metric spaces |
| LSH | O(N · L) | O(L) | O(N · L) | High-d |
| HNSW | O(N log N) | O(log N) | O(N · M) | Any |
| IVF | O(N · K) | O(√N) | O(N + K · d) | Any |

Here L is the number of LSH hash tables, M the maximum HNSW node degree, and K the number of IVF clusters.

Choosing the Right Structure

Decision Tree

A rough heuristic based on dataset size, dimensionality, memory budget, and update rate:

```python
def choose_index_structure(n_vectors, dimensions,
                           memory_budget, update_frequency):
    if n_vectors < 1000:
        return "BruteForce"   # Just scan everything

    if dimensions > 100:
        if update_frequency == "high":
            return "LSH"      # Handles high-d data and frequent updates
        elif memory_budget == "low":
            return "IVF-PQ"   # Compression
        else:
            return "HNSW"     # Best recall

    if dimensions < 20:
        if update_frequency == "never":
            return "KDTree"   # Classic choice for static low-d data
        else:
            return "BallTree" # More robust to updates

    if memory_budget == "unlimited":
        return "HNSW"         # Best performance
    elif memory_budget == "moderate":
        return "NSW"          # IVF is a reasonable alternative
    else:
        return "IVF-PQ"       # Extreme compression
```

Implementation Tips

1. Benchmarking

Measure build time, query latency, memory, and recall together:

```python
import sys
import time

def benchmark_index(index_class, vectors, queries, ground_truth, k=10):
    """Benchmark index performance"""
    # Build time
    start = time.time()
    index = index_class()
    index.build(vectors)
    build_time = time.time() - start

    # Average query time
    start = time.time()
    results = []
    for query in queries:
        results.append(index.search(query, k))
    query_time = (time.time() - start) / len(queries)

    # Memory usage (getsizeof is shallow; use a heap profiler
    # for a faithful number)
    memory = sys.getsizeof(index)

    # Recall against exact ground truth (see compute_recall below)
    recall = compute_recall(results, ground_truth)

    return {
        'build_time': build_time,
        'query_time_ms': query_time * 1000,
        'memory_mb': memory / 1e6,
        'recall@k': recall
    }
```
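
compute_recall needs exact ground truth, typically from a brute-force scan. A minimal version, assuming results and ground_truth are parallel lists of id lists:

```python
def compute_recall(results, ground_truth):
    """Mean recall@k: fraction of true neighbors that were retrieved."""
    hits = 0
    total = 0
    for found, truth in zip(results, ground_truth):
        hits += len(set(found) & set(truth))
        total += len(truth)
    return hits / total if total else 0.0
```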

2. Dynamic Index Selection

```python
class AdaptiveIndex:
    """Automatically choose the best index based on the data"""
    def __init__(self):
        self.index = None
        self.index_type = None

    def build(self, vectors):
        # Analyze data characteristics (analyze_distribution is an
        # assumed helper; estimate_intrinsic_dimension is sketched below)
        n, d = vectors.shape
        intrinsic_dim = estimate_intrinsic_dimension(vectors)
        distribution = analyze_distribution(vectors)

        # Choose the index accordingly
        if intrinsic_dim < 10:
            self.index = KDTree(vectors)
            self.index_type = "KDTree"
        elif distribution == "clustered":
            self.index = IVF(vectors)
            self.index_type = "IVF"
        else:
            self.index = HNSW(vectors)
            self.index_type = "HNSW"

        print(f"Selected {self.index_type} based on data analysis")
```
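
estimate_intrinsic_dimension can be implemented many ways; a crude but self-contained sketch counts the principal components needed to capture most of the variance (dedicated estimators such as two-NN or MLE handle nonlinear manifolds better):

```python
import numpy as np

def estimate_intrinsic_dimension(vectors, variance_threshold=0.95):
    """Sketch: components needed to explain variance_threshold of variance."""
    centered = vectors - vectors.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)   # Singular values
    variance_ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(variance_ratio, variance_threshold) + 1)
```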

References

  • Friedman, Bentley & Finkel. "An Algorithm for Finding Best Matches in Logarithmic Expected Time" (1977)
  • Omohundro. "Five Balltree Construction Algorithms" (1989)
  • Yianilos. "Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces" (1993)
  • Malkov & Yashunin. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (2018)
