Vector Index Structures


Explore the fundamental data structures powering vector databases: trees, graphs, hash tables, and hybrid approaches for efficient similarity search.



Understanding the fundamental data structures behind vector search is crucial for building efficient similarity search systems. Each structure offers unique tradeoffs between build time, query time, memory usage, and accuracy.

Interactive Index Structure Visualization

Explore how different index structures organize and search through vector data:

The baseline shown first is the flat index, which answers every query by scanning all vectors (brute force). Its key properties:

  • Search complexity: O(N × d)
  • Accuracy: 100% (exact)
  • Memory usage: N × d × 4 bytes (for float32 vectors)
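
To make this baseline concrete, here is a minimal flat-index sketch in NumPy; the FlatIndex class and its interface are illustrative, not taken from any particular library:

```python
import numpy as np

class FlatIndex:
    """Brute-force index: exact results at O(N × d) per query."""
    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float32)  # N × d matrix

    def search(self, query, k=10):
        # Compare the query against every stored vector...
        query = np.asarray(query, dtype=np.float32)
        dists = np.linalg.norm(self.vectors - query, axis=1)
        # ...then keep the k smallest distances
        top_k = np.argsort(dists)[:k]
        return top_k, dists[top_k]
```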

Index Structure Categories

Taxonomy of Vector Indices

```
Vector Indices
├── Tree-Based
│   ├── Space Partitioning
│   │   ├── KD-Tree
│   │   ├── Octree
│   │   └── R-Tree
│   └── Metric Trees
│       ├── Ball Tree
│       ├── Cover Tree
│       └── VP-Tree
├── Graph-Based
│   ├── Delaunay Graph
│   ├── K-NN Graph
│   ├── NSW (Small World)
│   └── HNSW (Hierarchical)
├── Hash-Based
│   ├── LSH (Locality Sensitive)
│   ├── Multi-Index Hashing
│   └── Learning to Hash
└── Learned Indices
    ├── Neural Networks
    ├── Learned Trees
    └── Differentiable Indices
```

Tree-Based Structures

1. KD-Tree (K-Dimensional Tree)

Binary tree that partitions space using axis-aligned hyperplanes:

```python
import heapq
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right

class KDTree:
    def __init__(self, points, leaf_size=10):
        self.leaf_size = leaf_size
        self.root = self._build(points, 0)

    def _build(self, points, depth):
        if len(points) <= self.leaf_size:
            return KDNode(points, -1)  # Leaf node stores the point list

        # Cycle through axes by depth (a common alternative picks
        # the axis with maximum variance)
        axis = depth % len(points[0])

        # Sort and split at the median
        points.sort(key=lambda p: p[axis])
        median = len(points) // 2

        return KDNode(
            point=points[median],
            axis=axis,
            left=self._build(points[:median], depth + 1),
            right=self._build(points[median + 1:], depth + 1)
        )

    def search(self, query, k=1):
        """K-nearest neighbor search"""
        heap = []  # Max-heap of (-distance, point)

        def consider(point):
            dist = euclidean_distance(query, point)
            if len(heap) < k:
                heapq.heappush(heap, (-dist, point))
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, (-dist, point))

        def traverse(node, depth):
            if node.axis == -1:  # Leaf
                for point in node.point:
                    consider(point)
                return

            # The split point is stored only at this node, so check it here
            consider(node.point)

            # Determine which side to explore first
            axis = node.axis
            if query[axis] < node.point[axis]:
                first, second = node.left, node.right
            else:
                first, second = node.right, node.left

            # Explore the closer side
            if first:
                traverse(first, depth + 1)

            # Explore the other side only if the splitting plane is
            # closer than the current k-th best distance
            if len(heap) < k or \
               abs(query[axis] - node.point[axis]) < -heap[0][0]:
                if second:
                    traverse(second, depth + 1)

        traverse(self.root, 0)
        return [point for _, point in sorted(heap, reverse=True)]
```
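
A quick sanity check of the class above, with illustrative values:

```python
import random

points = [[random.random() for _ in range(3)] for _ in range(200)]
tree = KDTree(points, leaf_size=10)
nearest = tree.search([0.5, 0.5, 0.5], k=5)
print(nearest)  # The five stored points closest to the query, nearest first
```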

Complexity:

  • Build: O(N log N)
  • Search: O(log N) average, O(N) worst
  • Space: O(N)

Limitations:

  • Curse of dimensionality (ineffective for d > 20)
  • Unbalanced with skewed data

2. Ball Tree

Hierarchical structure using hyperspheres:

```python
import numpy as np
# Reuses euclidean_distance from the KD-tree example above

class BallNode:
    def __init__(self, center, radius, points, left=None, right=None):
        self.center = center
        self.radius = radius
        self.points = points
        self.left = left
        self.right = right

class BallTree:
    def __init__(self, points, leaf_size=40):
        self.leaf_size = leaf_size
        self.root = self._build(np.asarray(points))

    def _build(self, points):
        if len(points) <= self.leaf_size:
            center = np.mean(points, axis=0)
            radius = max(euclidean_distance(center, p) for p in points)
            return BallNode(center, radius, points)

        # Find the dimension with maximum spread
        spreads = np.max(points, axis=0) - np.min(points, axis=0)
        split_dim = np.argmax(spreads)

        # Partition at the median along that dimension
        indices = np.argsort(points[:, split_dim])
        median = len(points) // 2
        left_points = points[indices[:median]]
        right_points = points[indices[median:]]

        # Build child nodes
        left_child = self._build(left_points)
        right_child = self._build(right_points)

        # Compute the bounding ball for this node
        center = np.mean(points, axis=0)
        radius = max(euclidean_distance(center, p) for p in points)
        return BallNode(center, radius, None, left_child, right_child)
```

Advantages over KD-Tree:

  • Better for high dimensions
  • Adapts to data distribution
  • Efficient pruning using the triangle inequality (see the search sketch below)
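
The pruning bound works because every point p inside a ball with center c and radius r satisfies d(q, p) ≥ d(q, c) − r, so any ball whose lower bound already exceeds the current k-th best distance can be skipped entirely. A minimal search sketch, assuming the BallTree above and the euclidean_distance helper from the KD-tree example:

```python
import heapq

def ball_tree_search(tree, query, k=1):
    """k-NN search with triangle-inequality pruning (sketch)."""
    heap = []  # Max-heap of (-dist, id(point), point)

    def visit(node):
        if node is None:
            return
        # Lower bound on the distance to anything inside this ball
        lower = euclidean_distance(query, node.center) - node.radius
        if len(heap) == k and lower > -heap[0][0]:
            return  # Ball cannot contain a closer point: prune it
        if node.points is not None:  # Leaf
            for p in node.points:
                d = euclidean_distance(query, p)
                if len(heap) < k:
                    heapq.heappush(heap, (-d, id(p), p))
                elif d < -heap[0][0]:
                    heapq.heapreplace(heap, (-d, id(p), p))
            return
        visit(node.left)
        visit(node.right)

    visit(tree.root)
    return [p for _, _, p in sorted(heap, reverse=True)]
```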

3. VP-Tree (Vantage Point Tree)

Partitions based on distances from vantage points:

```python
import random
# Reuses euclidean_distance from the KD-tree example above

class VPNode:
    def __init__(self, vantage, threshold, inside=None, outside=None):
        self.vantage = vantage
        self.threshold = threshold
        self.inside = inside
        self.outside = outside

class VPTree:
    def __init__(self, points, leaf_size=10):
        self.leaf_size = leaf_size
        self.root = self._build(points)

    def _build(self, points):
        if len(points) <= self.leaf_size:
            return points  # Leaf: a plain list of points

        # Choose a vantage point (random or furthest from the parent)
        vantage = random.choice(points)

        # Compute distances from every other point to the vantage
        distances = [(p, euclidean_distance(vantage, p))
                     for p in points if p != vantage]
        distances.sort(key=lambda x: x[1])

        # Split at the median distance
        median = len(distances) // 2
        threshold = distances[median][1]

        inside = [p for p, d in distances[:median]]
        outside = [p for p, d in distances[median:]]

        return VPNode(
            vantage=vantage,
            threshold=threshold,
            inside=self._build(inside) if inside else None,
            outside=self._build(outside) if outside else None
        )
```
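
The class above only builds the tree. Search uses the stored threshold τ: if d(q, v) < τ, every point outside the ball is at least τ − d(q, v) away, and symmetrically for the inside, so whole subtrees can be pruned. A sketch under the same assumptions as before:

```python
import heapq

def vp_tree_search(tree, query, k=1):
    """k-NN search on the VPTree above (sketch)."""
    heap = []  # Max-heap of (-dist, point)

    def consider(point):
        d = euclidean_distance(query, point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, point))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, point))

    def visit(node):
        if node is None:
            return
        if isinstance(node, list):  # Leaf: a plain list of points
            for p in node:
                consider(p)
            return
        d = euclidean_distance(query, node.vantage)
        consider(node.vantage)
        # Visit the side containing the query first, prune the other
        if d < node.threshold:
            visit(node.inside)
            tau = -heap[0][0]  # Current k-th best distance
            if len(heap) < k or d + tau >= node.threshold:
                visit(node.outside)
        else:
            visit(node.outside)
            tau = -heap[0][0]
            if len(heap) < k or d - tau <= node.threshold:
                visit(node.inside)

    visit(tree.root)
    return [p for _, p in sorted(heap, reverse=True)]
```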

Graph-Based Structures

1. K-NN Graph

Connect each point to its k nearest neighbors:

```python
import random
# Reuses euclidean_distance from the KD-tree example above

class KNNGraph:
    def __init__(self, points, k=10):
        self.points = points
        self.k = k
        self.graph = self._build_graph()

    def _build_graph(self):
        """Build the exact k-NN graph (O(N^2) pairwise distances)"""
        n = len(self.points)
        graph = {i: [] for i in range(n)}

        for i in range(n):
            # Find the k nearest neighbors of point i
            distances = []
            for j in range(n):
                if i != j:
                    dist = euclidean_distance(self.points[i], self.points[j])
                    distances.append((j, dist))

            # Sort and keep the k nearest
            distances.sort(key=lambda x: x[1])
            graph[i] = [j for j, _ in distances[:self.k]]

        return graph

    def search(self, query, k=10, max_steps=100):
        """Greedy search on the graph"""
        # Start from a random node
        current = random.randint(0, len(self.points) - 1)
        visited = set()
        candidates = []

        for _ in range(max_steps):
            if current in visited:
                break
            visited.add(current)

            # Record the current node as a candidate
            dist = euclidean_distance(query, self.points[current])
            candidates.append((current, dist))

            # Move to the unvisited neighbor closest to the query
            best_neighbor = None
            best_dist = float('inf')
            for neighbor in self.graph[current]:
                if neighbor not in visited:
                    dist = euclidean_distance(query, self.points[neighbor])
                    if dist < best_dist:
                        best_dist = dist
                        best_neighbor = neighbor

            if best_neighbor is None:
                break  # Local minimum reached
            current = best_neighbor

        # Return the k best candidates seen along the walk
        candidates.sort(key=lambda x: x[1])
        return [idx for idx, _ in candidates[:k]]
```

2. Navigable Small World (NSW)

Add long-range connections for faster navigation:

```python
class NSW:
    def __init__(self, d, m=16, ef_construction=200):
        self.d = d
        self.m = m                  # Number of connections per node
        self.ef = ef_construction   # Search breadth during construction
        self.nodes = []
        self.graph = {}

    def insert(self, point):
        """Insert with small-world connections"""
        idx = len(self.nodes)
        self.nodes.append(point)

        if idx == 0:
            self.graph[0] = []
            return

        # Find nearest neighbors among existing nodes
        # (search_layer and prune_connections are sketched below; using
        # all existing nodes as entry points, as here, degenerates to
        # exhaustive search, whereas real NSW uses a few random entries)
        neighbors = self.search_layer(point, self.m + self.ef,
                                      list(range(idx)))

        # Add bidirectional edges
        self.graph[idx] = []
        for neighbor_idx in neighbors[:self.m]:
            self.graph[idx].append(neighbor_idx)
            self.graph[neighbor_idx].append(idx)

            # Prune connections if the neighbor now has too many
            if len(self.graph[neighbor_idx]) > self.m:
                self.prune_connections(neighbor_idx)
```
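
The two helpers referenced above are not shown in the snippet. Plausible versions are sketched here, assuming search_layer is a greedy beam search over the current graph and prune_connections keeps only the m closest edges (a simplification, not the canonical NSW construction):

```python
import heapq
# Reuses euclidean_distance from the KD-tree example above

def search_layer(self, query, ef, entry_points):
    """Greedy beam search: return up to ef node ids, closest first (sketch)."""
    dist = lambda i: euclidean_distance(query, self.nodes[i])
    visited = set(entry_points)
    candidates = [(dist(i), i) for i in entry_points]
    heapq.heapify(candidates)         # Min-heap frontier
    best = sorted(candidates)[:ef]    # Running result set

    while candidates:
        d, i = heapq.heappop(candidates)
        if len(best) >= ef and d > best[-1][0]:
            break  # The frontier can no longer improve the result set
        for j in self.graph.get(i, []):
            if j not in visited:
                visited.add(j)
                dj = dist(j)
                heapq.heappush(candidates, (dj, j))
                best.append((dj, j))
        best = sorted(best)[:ef]

    return [i for _, i in best]

def prune_connections(self, idx):
    """Keep only the m closest neighbors of node idx (sketch)."""
    by_distance = sorted(
        self.graph[idx],
        key=lambda j: euclidean_distance(self.nodes[idx], self.nodes[j]))
    self.graph[idx] = by_distance[:self.m]

NSW.search_layer = search_layer
NSW.prune_connections = prune_connections
```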

3. Delaunay Graph

Natural neighbor structure in geometric space:

```python
import numpy as np
from scipy.spatial import Delaunay

class DelaunayGraph:
    def __init__(self, points):
        self.points = np.array(points)
        self.tri = Delaunay(self.points)
        self.graph = self._build_graph()

    def _build_graph(self):
        """Extract the neighbor graph from the Delaunay triangulation"""
        graph = {i: set() for i in range(len(self.points))}

        for simplex in self.tri.simplices:
            # Connect every pair of vertices in the simplex
            for i in range(len(simplex)):
                for j in range(i + 1, len(simplex)):
                    graph[simplex[i]].add(simplex[j])
                    graph[simplex[j]].add(simplex[i])

        return {k: list(v) for k, v in graph.items()}
```
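
A quick usage example; note that Delaunay triangulation is only practical in low dimensions, since the number of simplices grows steeply with d:

```python
import numpy as np

points = np.random.rand(100, 2)   # 100 random 2-D points
dg = DelaunayGraph(points)
print(dg.graph[0])                # Indices of point 0's natural neighbors
```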

Inverted Index Structures

1. Inverted File (IVF)

Cluster-based inverted index:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

class InvertedFile:
    def __init__(self, n_clusters=1024):
        self.n_clusters = n_clusters
        self.centroids = None
        self.inverted_lists = {}
        self.vectors = None

    def build(self, vectors):
        """Build the inverted file index"""
        self.vectors = np.asarray(vectors)

        # Cluster the vectors
        kmeans = KMeans(n_clusters=self.n_clusters)
        cluster_ids = kmeans.fit_predict(self.vectors)
        self.centroids = kmeans.cluster_centers_

        # Build one inverted list per cluster
        for i, cluster_id in enumerate(cluster_ids):
            if cluster_id not in self.inverted_lists:
                self.inverted_lists[cluster_id] = []
            self.inverted_lists[cluster_id].append(i)

    def search(self, query, k=10, nprobe=10):
        """Search only the nprobe nearest clusters"""
        # Find the nearest cluster centroids
        distances = euclidean_distances(query.reshape(1, -1),
                                        self.centroids)[0]
        nearest_clusters = np.argsort(distances)[:nprobe]

        # Gather candidates from those clusters
        candidates = []
        for cluster_id in nearest_clusters:
            if cluster_id in self.inverted_lists:
                candidates.extend(self.inverted_lists[cluster_id])

        if not candidates:
            return []

        # Rank candidates by exact distance
        candidate_vectors = self.vectors[candidates]
        distances = euclidean_distances(query.reshape(1, -1),
                                        candidate_vectors)[0]
        top_k = np.argsort(distances)[:k]
        return [candidates[i] for i in top_k]
```

2. Multi-Index Structure

Multiple complementary indices:

```python
class MultiIndex:
    def __init__(self, index_types=['ivf', 'lsh', 'graph']):
        # create_index is an assumed factory returning an index
        # object exposing a search(query, k) method
        self.indices = {}
        for index_type in index_types:
            self.indices[index_type] = create_index(index_type)

    def search(self, query, k=10):
        """Combine results from multiple indices"""
        all_candidates = set()
        for index in self.indices.values():
            # Over-fetch from each index, then merge
            candidates = index.search(query, k * 2)
            all_candidates.update(candidates)

        # Rerank the merged candidate set (see the sketch below)
        return rerank(query, all_candidates, k)
```
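
The rerank helper is left undefined above; a straightforward version recomputes exact distances over the merged candidate ids. A sketch, with the dataset passed explicitly so the function is self-contained (the snippet above presumably closes over it):

```python
import numpy as np

def rerank(query, candidate_ids, k, vectors):
    """Exact re-ranking of a merged candidate set (sketch)."""
    ids = np.fromiter(candidate_ids, dtype=int)           # Set -> id array
    dists = np.linalg.norm(vectors[ids] - query, axis=1)  # Exact distances
    order = np.argsort(dists)[:k]                         # k smallest
    return ids[order].tolist()
```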

Hybrid Structures

1. Tree + Graph

Combine tree partitioning with graph navigation:

```python
class HybridTreeGraph:
    def __init__(self):
        self.tree = None    # Tree for coarse partitioning
        self.graphs = {}    # One k-NN graph per leaf

    def build(self, vectors):
        # Build the tree structure (assumes a KDTree variant
        # exposing get_leaves() and find_leaves())
        self.tree = KDTree(vectors)

        # Build a graph over the points in each leaf
        for leaf in self.tree.get_leaves():
            leaf_vectors = leaf.points
            graph = build_knn_graph(leaf_vectors)
            self.graphs[leaf.id] = graph

    def search(self, query, k=10):
        # Coarse step: find the leaves relevant to the query
        leaves = self.tree.find_leaves(query, radius=None)

        # Fine step: graph search inside each leaf
        results = []
        for leaf in leaves:
            graph = self.graphs[leaf.id]
            candidates = graph_search(query, graph, k)
            results.extend(candidates)

        # merge_results deduplicates and keeps the k best
        return merge_results(results, k)
```

2. Hierarchical Structure

Multi-resolution index:

```python
class HierarchicalIndex:
    def __init__(self, levels=3):
        self.levels = levels
        self.indices = []

    def build(self, vectors):
        current_vectors = vectors

        for level in range(self.levels):
            # Build an index for the current level
            # (build_hnsw and build_ivf are assumed constructors)
            if level == 0:
                # Fine-grained index over the raw vectors
                index = build_hnsw(current_vectors)
            else:
                # Coarser index over representatives
                index = build_ivf(current_vectors,
                                  n_clusters=len(current_vectors) // 10)
            self.indices.append(index)

            # Create representatives for the next, coarser level
            if level < self.levels - 1:
                current_vectors = compute_representatives(current_vectors)
```
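
compute_representatives is also unspecified; one plausible choice, sketched here, uses k-means centroids as the next level's representatives, with a reduction factor matching the n_clusters heuristic above:

```python
import numpy as np
from sklearn.cluster import KMeans

def compute_representatives(vectors, reduction=10):
    """Sketch: summarize every ~reduction vectors by a k-means centroid."""
    n_clusters = max(1, len(vectors) // reduction)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(np.asarray(vectors))
    return kmeans.cluster_centers_
```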

Memory Layouts

1. Cache-Optimized Layout

Keep the fields touched on every traversal step together in a single cache line:

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 64

typedef struct CacheOptimizedNode {
    /* Hot data (accessed on every traversal step) */
    float    center[8];      /* First 8 dimensions only */
    float    radius;
    uint32_t num_points;

    /* Pad the hot fields out to a full cache line */
    char padding[CACHE_LINE_SIZE
                 - 8 * sizeof(float)    /* center     */
                 - sizeof(float)        /* radius     */
                 - sizeof(uint32_t)];   /* num_points */

    /* Cold data (rarely accessed) */
    float*    full_center;   /* All dimensions */
    uint32_t* point_ids;
    struct CacheOptimizedNode* children[2];
} CacheOptimizedNode;
```

2. Compressed Storage

Separate fixed-size metadata from variable-length compressed payloads:

```python
class CompressedIndex:
    def __init__(self):
        self.metadata = {}              # Fixed-size metadata per vector
        self.compressed_vectors = []    # Variable-size compressed bytes
        self.offsets = []               # Start offset of each vector

    def add(self, vector, metadata):
        # Compress the vector (compress_vector is an assumed codec,
        # e.g. product quantization or scalar quantization)
        compressed = compress_vector(vector)

        # Record where this vector's bytes start
        offset = len(self.compressed_vectors)
        self.offsets.append(offset)
        self.compressed_vectors.extend(compressed)

        # Store metadata separately, keyed by insertion order
        self.metadata[len(self.offsets) - 1] = metadata
```

Performance Comparison

Build and Query Complexity

| Structure | Build Time | Query Time | Space | Effective Dimensions |
|-----------|------------|------------|-------|----------------------|
| Linear Scan | O(1) | O(N · d) | O(N · d) | Any |
| KD-Tree | O(N log N) | O(2^d + k) | O(N) | < 20 |
| Ball Tree | O(N log N) | O(d log N) | O(N) | < 50 |
| VP-Tree | O(N log N) | O(log N) | O(N) | Metric spaces |
| LSH | O(N · L) | O(L) | O(N · L) | High-d |
| HNSW | O(N log N) | O(log N) | O(N · M) | Any |
| IVF | O(N · K) | O(√N) | O(N + K · d) | Any |

Here L is the number of LSH hash tables, M the maximum HNSW node degree, and K the number of IVF clusters.

Choosing the Right Structure

Decision Tree

A rough heuristic based on dataset size, dimensionality, memory budget, and update rate:

```python
def choose_index_structure(n_vectors, dimensions,
                           memory_budget, update_frequency):
    if n_vectors < 1000:
        return "BruteForce"   # Just scan everything

    if dimensions > 100:
        if update_frequency == "high":
            return "LSH"      # Handles high-d data and frequent updates
        elif memory_budget == "low":
            return "IVF-PQ"   # Compression
        else:
            return "HNSW"     # Best recall

    if dimensions < 20:
        if update_frequency == "never":
            return "KDTree"   # Classic choice for static low-d data
        else:
            return "BallTree" # More robust to updates

    if memory_budget == "unlimited":
        return "HNSW"         # Best performance
    elif memory_budget == "moderate":
        return "NSW"          # IVF is a reasonable alternative
    else:
        return "IVF-PQ"       # Extreme compression
```

Implementation Tips

1. Benchmarking

Measure build time, query latency, memory, and recall together:

```python
import sys
import time

def benchmark_index(index_class, vectors, queries, ground_truth, k=10):
    """Benchmark index performance"""
    # Build time
    start = time.time()
    index = index_class()
    index.build(vectors)
    build_time = time.time() - start

    # Average query time
    start = time.time()
    results = []
    for query in queries:
        results.append(index.search(query, k))
    query_time = (time.time() - start) / len(queries)

    # Memory usage (getsizeof is shallow; use a heap profiler
    # for a faithful number)
    memory = sys.getsizeof(index)

    # Recall against exact ground truth (see compute_recall below)
    recall = compute_recall(results, ground_truth)

    return {
        'build_time': build_time,
        'query_time_ms': query_time * 1000,
        'memory_mb': memory / 1e6,
        'recall@k': recall
    }
```
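
compute_recall needs exact ground truth, typically from a brute-force scan. A minimal version, assuming results and ground_truth are parallel lists of id lists:

```python
def compute_recall(results, ground_truth):
    """Mean recall@k: fraction of true neighbors that were retrieved."""
    hits = 0
    total = 0
    for found, truth in zip(results, ground_truth):
        hits += len(set(found) & set(truth))
        total += len(truth)
    return hits / total if total else 0.0
```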

2. Dynamic Index Selection

```python
class AdaptiveIndex:
    """Automatically choose the best index based on the data"""
    def __init__(self):
        self.index = None
        self.index_type = None

    def build(self, vectors):
        # Analyze data characteristics (analyze_distribution is an
        # assumed helper; estimate_intrinsic_dimension is sketched below)
        n, d = vectors.shape
        intrinsic_dim = estimate_intrinsic_dimension(vectors)
        distribution = analyze_distribution(vectors)

        # Choose the index accordingly
        if intrinsic_dim < 10:
            self.index = KDTree(vectors)
            self.index_type = "KDTree"
        elif distribution == "clustered":
            self.index = IVF(vectors)
            self.index_type = "IVF"
        else:
            self.index = HNSW(vectors)
            self.index_type = "HNSW"

        print(f"Selected {self.index_type} based on data analysis")
```
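
estimate_intrinsic_dimension can be implemented many ways; a crude but self-contained sketch counts the principal components needed to capture most of the variance (dedicated estimators such as two-NN or MLE handle nonlinear manifolds better):

```python
import numpy as np

def estimate_intrinsic_dimension(vectors, variance_threshold=0.95):
    """Sketch: components needed to explain variance_threshold of variance."""
    centered = vectors - vectors.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)   # Singular values
    variance_ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(variance_ratio, variance_threshold) + 1)
```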

References

  • Friedman, Bentley & Finkel. "An Algorithm for Finding Best Matches in Logarithmic Expected Time" (1977)
  • Omohundro. "Five Balltree Construction Algorithms" (1989)
  • Yianilos. "Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces" (1993)
  • Malkov & Yashunin. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (2018)
