Hash Tables: Fast Lookups with Collision Resolution

Master hash tables through interactive visualizations of hash functions, collision resolution strategies, load factors, and performance characteristics.


Understanding Hash Tables

Hash tables are one of the most important data structures in computer science, providing average O(1) time complexity for insertions, deletions, and lookups. They achieve this efficiency by using a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

The key challenge in hash table design is handling collisions: cases where multiple keys hash to the same index. Several collision resolution strategies exist, each with different trade-offs in performance, memory usage, and implementation complexity.
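A collision is easy to produce with a toy division hash. The sketch below relies on a CPython detail (small non-negative integers hash to themselves), so the two keys land in the same slot of a 10-slot table:

```python
def bucket_index(key, table_size=10):
    # Map a key to a slot with Python's built-in hash and the division method
    return hash(key) % table_size

# In CPython, hash(12) == 12 and hash(22) == 22, so both keys
# map to slot 2 of a 10-slot table: a collision.
print(bucket_index(12))  # 2
print(bucket_index(22))  # 2
```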

Interactive Hash Table Explorer

Visualize hash table operations, collision resolution strategies, and performance characteristics:

[Interactive widget: a 10-slot hash table, all slots initially empty, with live counters for total items, collisions, load factor, and average/maximum chain length]

Current Configuration

  • Hash Function: division
  • Collision Strategy: chaining
  • Table Size: 10
  • Time Complexity: O(1) average, O(n) worst case

Performance Tips

  • Works well with high load factors
  • Extra memory for pointers
  • No clustering issues

What is a Hash Table?

A hash table consists of:

  1. Array/Bucket Array: Fixed-size array to store elements
  2. Hash Function: Maps keys to array indices
  3. Collision Resolution: Strategy to handle key collisions

Basic Operations

```python
class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [None] * size
        self.count = 0

    def hash(self, key):
        # Simple hash function
        return hash(key) % self.size

    def insert(self, key, value):
        index = self.hash(key)
        # Handle collision...

    def search(self, key):
        index = self.hash(key)
        # Find value...

    def delete(self, key):
        index = self.hash(key)
        # Remove value...
```

Hash Functions

Properties of Good Hash Functions

  1. Deterministic: Same input always produces same output
  2. Uniform Distribution: Minimizes collisions
  3. Fast Computation: O(1) time complexity
  4. Avalanche Effect: Small input changes → large output changes
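The avalanche effect can be observed directly. FNV-1a is not discussed in this article, but it is a compact hash with reasonable avalanche behavior, so it serves as an illustration: changing one input byte flips many output bits.

```python
def fnv1a_32(data: bytes) -> int:
    # FNV-1a, 32-bit: XOR each byte in, then multiply by the FNV prime
    h = 0x811C9DC5  # offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

a = fnv1a_32(b"key1")
b = fnv1a_32(b"key2")
# Count how many of the 32 output bits differ for a one-byte input change
diff_bits = bin(a ^ b).count("1")
```

A hash without avalanche (like the identity hash on integers) would flip only the bits that changed in the input, producing clustered indices.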

Common Hash Functions

Division Method:

h(k) = k mod m

Multiplication Method:

h(k) = ⌊m · (kA mod 1)⌋

Where A = (√5 − 1)/2 ≈ 0.618 (the reciprocal of the golden ratio)

Universal Hashing:

h_a,b(k) = ((a·k + b) mod p) mod m

where p is a prime larger than any possible key, and a ∈ {1, …, p−1}, b ∈ {0, …, p−1} are chosen at random.
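The three methods can be sketched in a few lines. The constants here are illustrative: A is Knuth's suggested multiplier, and p is the Mersenne prime 2³¹ − 1, standing in for any prime larger than the key universe.

```python
import random

def division_hash(k, m):
    return k % m

def multiplication_hash(k, m, A=0.6180339887):
    # A ≈ (√5 − 1)/2, Knuth's suggested multiplier
    return int(m * ((k * A) % 1))

def make_universal_hash(m, p=2**31 - 1):
    # Randomly drawn a, b give one member of a universal family
    # with provably low collision probability for keys below p
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m
```

Each function returns an index in range(m); re-drawing a and b yields a fresh member of the universal family, which is what defeats adversarially chosen keys.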

Collision Resolution Strategies

1. Separate Chaining

Store colliding elements in a linked list at each index:

```python
class ChainedHashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [[] for _ in range(size)]

    def hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self.hash(key)
        # Update in place if the key already exists
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                self.table[index][i] = (key, value)
                return
        # Add new key-value pair
        self.table[index].append((key, value))

    def search(self, key):
        index = self.hash(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        return None
```

Advantages:

  • Simple implementation
  • Handles high load factors well
  • Deletion is straightforward

Disadvantages:

  • Extra memory for pointers
  • Cache performance issues
  • Possible long chains
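The "deletion is straightforward" point is worth seeing: with chaining, delete is just a list removal, with none of the tombstone bookkeeping open addressing needs (covered below). A minimal sketch against the same list-of-lists layout:

```python
def chained_delete(table, size, key):
    # Remove the (key, value) pair from its bucket; True if found
    index = hash(key) % size
    bucket = table[index]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket.pop(i)
            return True
    return False

table = [[] for _ in range(10)]
table[hash("x") % 10].append(("x", 1))
chained_delete(table, 10, "x")  # True; the bucket is empty again
```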

2. Open Addressing

Find alternative locations within the table when collision occurs:

Linear Probing

h(k, i) = (h'(k) + i) mod m

```python
def linear_probe_insert(self, key, value):
    index = self.hash(key)
    i = 0
    while i < self.size:
        pos = (index + i) % self.size
        if self.table[pos] is None or self.table[pos][0] == key:
            self.table[pos] = (key, value)
            return True
        i += 1
    return False  # Table full
```

Quadratic Probing

h(k, i) = (h'(k) + c₁·i + c₂·i²) mod m

```python
def quadratic_probe_insert(self, key, value):
    index = self.hash(key)
    i = 0
    while i < self.size:
        # c1 = c2 = 1 here, giving offsets of i + i²
        pos = (index + i + i*i) % self.size
        if self.table[pos] is None or self.table[pos][0] == key:
            self.table[pos] = (key, value)
            return True
        i += 1
    return False
```

Double Hashing

h(k, i) = (h₁(k) + i·h₂(k)) mod m

```python
def double_hash_insert(self, key, value):
    h1 = self.hash1(key)
    h2 = self.hash2(key)
    i = 0
    while i < self.size:
        pos = (h1 + i * h2) % self.size
        if self.table[pos] is None or self.table[pos][0] == key:
            self.table[pos] = (key, value)
            return True
        i += 1
    return False
```
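The code above assumes hash1 and hash2 helpers that are not shown. One common (illustrative) choice: h₂ must never be zero, or the probe sequence would never move, and with m prime the sequence visits every slot.

```python
def hash1(key, m):
    return hash(key) % m

def hash2(key, m):
    # Range 1..m-1: never 0, so the probe step always advances;
    # with m prime, the step is coprime to m and covers all slots
    return 1 + (hash(key) % (m - 1))
```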

Comparison

| Method    | Probe Sequence  | Clustering           | Cache Performance    |
|-----------|-----------------|----------------------|----------------------|
| Linear    | Sequential      | Primary clustering   | Good                 |
| Quadratic | Quadratic jumps | Secondary clustering | Moderate             |
| Double    | Pseudo-random   | Minimal clustering   | Poor                 |
| Chaining  | N/A             | No clustering        | Poor for long chains |

Load Factor and Rehashing

Load Factor

α = n / m

Where:

  • n = number of entries
  • m = table size

Performance vs Load Factor

Separate Chaining:

  • Average search: O(1 + α)
  • Works well even with α > 1

Open Addressing:

  • Average successful search: O(1 / (1 − α))
  • Must have α < 1
  • Typically rehash when α > 0.7
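The 1/(1 − α) bound makes the 0.7 threshold concrete; a quick calculation shows why open-addressing performance degrades so sharply as the table fills:

```python
def expected_probes(alpha):
    # 1/(1 - α): the uniform-hashing probe bound for open addressing
    return 1 / (1 - alpha)

for alpha in (0.5, 0.7, 0.9, 0.99):
    print(f"α = {alpha}: ~{expected_probes(alpha):.1f} probes")
# Cost is modest through α = 0.7, then climbs steeply:
# 10 probes at α = 0.9 and 100 at α = 0.99
```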

Dynamic Resizing

```python
def rehash(self):
    old_table = self.table
    self.size = self.size * 2
    self.table = [None] * self.size
    self.count = 0
    # Re-insert all elements
    for item in old_table:
        if item is not None:
            if isinstance(item, list):  # Chaining
                for key, value in item:
                    self.insert(key, value)
            else:  # Open addressing
                key, value = item
                self.insert(key, value)
```
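Rehashing is usually wired into insert via a load-factor check. A self-contained sketch with linear probing and a 0.7 threshold (the class and method names here are illustrative, not the article's earlier classes):

```python
class ResizingTable:
    MAX_LOAD = 0.7

    def __init__(self, size=8):
        self.size = size
        self.table = [None] * size
        self.count = 0

    def _index(self, key):
        # Linear-probe to the key's slot or the first empty slot
        i = hash(key) % self.size
        while self.table[i] is not None and self.table[i][0] != key:
            i = (i + 1) % self.size
        return i

    def insert(self, key, value):
        # Grow before inserting once the table would exceed ~70% full
        if (self.count + 1) / self.size > self.MAX_LOAD:
            self._rehash()
        i = self._index(key)
        if self.table[i] is None:
            self.count += 1
        self.table[i] = (key, value)

    def _rehash(self):
        old = [e for e in self.table if e is not None]
        self.size *= 2
        self.table = [None] * self.size
        self.count = 0
        for k, v in old:
            self.insert(k, v)
```

Because the load factor is halved by each doubling, _rehash can safely re-insert through the same insert path without re-triggering itself.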

Advanced Techniques

1. Robin Hood Hashing

Minimize variance in probe distances:

```python
def robin_hood_insert(self, key, value):
    index = self.hash(key)
    distance = 0
    while True:
        pos = (index + distance) % self.size
        if self.table[pos] is None:
            self.table[pos] = (key, value, distance)
            return
        # Check if the resident element has traveled less
        _, _, existing_distance = self.table[pos]
        if distance > existing_distance:
            # Swap and continue probing with the displaced element
            self.table[pos], (key, value, distance) = \
                (key, value, distance), self.table[pos]
        distance += 1
```

2. Cuckoo Hashing

Guaranteed O(1) worst-case lookup:

```python
class CuckooHashTable:
    def __init__(self, size=10):
        self.size = size
        self.table1 = [None] * size
        self.table2 = [None] * size

    def insert(self, key, value):
        # Try table 1
        pos1 = self.hash1(key)
        if self.table1[pos1] is None:
            self.table1[pos1] = (key, value)
            return
        # Evict and bounce between the two tables
        for _ in range(self.size):
            if self.table1[pos1] is None:
                self.table1[pos1] = (key, value)
                return
            # Swap with the resident of table 1
            self.table1[pos1], (key, value) = \
                (key, value), self.table1[pos1]
            # Try table 2
            pos2 = self.hash2(key)
            if self.table2[pos2] is None:
                self.table2[pos2] = (key, value)
                return
            # Swap with the resident of table 2
            self.table2[pos2], (key, value) = \
                (key, value), self.table2[pos2]
            pos1 = self.hash1(key)
        # Rehash if a cycle is detected
        self.rehash()
        self.insert(key, value)
```
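The O(1) worst-case lookup follows from a key having only two possible homes: its hash1 slot in table 1 or its hash2 slot in table 2. A self-contained sketch of the search side (the hash2 definition here is an arbitrary stand-in for a second independent hash):

```python
class MiniCuckoo:
    def __init__(self, size=10):
        self.size = size
        self.table1 = [None] * size
        self.table2 = [None] * size

    def hash1(self, key):
        return hash(key) % self.size

    def hash2(self, key):
        # Illustrative second hash; any independent hash works
        return (hash(key) // self.size) % self.size

    def search(self, key):
        # At most two probes, regardless of table contents
        e1 = self.table1[self.hash1(key)]
        if e1 is not None and e1[0] == key:
            return e1[1]
        e2 = self.table2[self.hash2(key)]
        if e2 is not None and e2[0] == key:
            return e2[1]
        return None
```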

3. Consistent Hashing

For distributed systems:

```python
import bisect

class ConsistentHash:
    def __init__(self, nodes=None, virtual_nodes=150):
        self.ring = {}
        self.sorted_keys = []
        self.virtual_nodes = virtual_nodes
        if nodes:
            for node in nodes:
                self.add_node(node)

    def hash(self, key):
        return hash(key) & 0xffffffff

    def add_node(self, node):
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            hash_value = self.hash(virtual_key)
            self.ring[hash_value] = node
            self.sorted_keys.append(hash_value)
        self.sorted_keys.sort()

    def get_node(self, key):
        if not self.ring:
            return None
        hash_value = self.hash(key)
        # Binary search for the first node clockwise of hash_value
        idx = bisect.bisect_right(self.sorted_keys, hash_value)
        # Wrap around the ring
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]
```
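The motivation is easiest to see by contrast with naive modulo assignment: adding one server remaps most keys, which is exactly what a consistent-hashing ring avoids (it moves only about 1/n of keys). This sketch relies on CPython hashing small non-negative integers to themselves:

```python
def assign_modulo(keys, n_servers):
    # Naive placement: server = hash(key) mod number of servers
    return {k: hash(k) % n_servers for k in keys}

keys = list(range(1000))
before = assign_modulo(keys, 4)
after = assign_modulo(keys, 5)   # one server added
moved = sum(1 for k in keys if before[k] != after[k])
# moved == 800: 80% of keys relocate when going from 4 to 5 servers
```

Only keys with k mod 20 in {0, 1, 2, 3} keep the same server (where k mod 4 equals k mod 5), so 800 of the 1000 keys move, invalidating most of a distributed cache at once.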

Real-World Applications

1. Database Indexing

```sql
-- Hash index in PostgreSQL
CREATE INDEX idx_users_email ON users USING HASH (email);
```

2. Caching Systems

```python
import collections

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # Hash table for O(1) value lookup
        self.order = collections.OrderedDict()  # Tracks recency

    def get(self, key):
        if key in self.cache:
            # Move to end (most recently used)
            self.order.move_to_end(key)
            return self.cache[key]
        return None

    def put(self, key, value):
        if key in self.cache:
            self.order.move_to_end(key)
        else:
            if len(self.cache) >= self.capacity:
                # Evict the least recently used key
                oldest = next(iter(self.order))
                del self.cache[oldest]
                del self.order[oldest]
        self.cache[key] = value
        self.order[key] = None
```

3. Distributed Systems

  • Memcached/Redis: Distributed caching
  • DHT (Distributed Hash Tables): P2P networks
  • Sharding: Database partitioning

Performance Analysis

Time Complexity

| Operation | Average Case | Worst Case (Chaining) | Worst Case (Open Addr) |
|-----------|--------------|-----------------------|------------------------|
| Insert    | O(1)         | O(n)                  | O(n)                   |
| Search    | O(1)         | O(n)                  | O(n)                   |
| Delete    | O(1)         | O(n)                  | O(n)                   |

Space Complexity

  • Chaining: O(n + m) where m is table size
  • Open Addressing: O(m)

Cache Performance

Open addressing with linear probing often performs better due to cache locality:

```python
# Cache-friendly iteration (open addressing: one contiguous array)
def iterate_linear_probe(self):
    for i in range(self.size):
        if self.table[i] is not None:
            yield self.table[i]

# vs chaining (potential cache misses)
def iterate_chains(self):
    for chain in self.table:
        for item in chain:  # May jump around memory
            yield item
```

Implementation Tips

1. Choose Good Table Sizes

For division method, use prime numbers:

```python
PRIMES = [53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593]

def next_prime_size(current_size):
    # Pick the first precomputed prime at least double the current size
    for prime in PRIMES:
        if prime > current_size * 2:
            return prime
    # Fallback past the table: odd, but not guaranteed prime
    return current_size * 2 + 1
```

2. Handle Deletions Properly

For open addressing, use tombstones:

```python
class OpenAddressHashTable:
    DELETED = object()  # Sentinel for deleted entries

    def delete(self, key):
        index = self.search_position(key)
        if index is not None:
            self.table[index] = self.DELETED
            self.count -= 1
            return True
        return False
```
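The delete above calls a search_position helper that is not shown. One way it could look, assuming linear probing, is sketched below; it also shows why the tombstone matters: probing must skip DELETED slots rather than stop at them, or keys inserted past a deleted slot would become unreachable.

```python
class OpenAddressHashTable:
    DELETED = object()  # Sentinel for deleted entries

    def __init__(self, size=8):
        self.size = size
        self.table = [None] * size

    def search_position(self, key):
        # Linear-probe until the key or a truly empty slot is found;
        # DELETED slots do NOT end the search
        index = hash(key) % self.size
        for i in range(self.size):
            pos = (index + i) % self.size
            entry = self.table[pos]
            if entry is None:
                return None  # Hit an empty slot: key absent
            if entry is not self.DELETED and entry[0] == key:
                return pos
        return None  # Scanned the whole table
```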

3. Monitor Performance

```python
class InstrumentedHashTable:
    def __init__(self):
        self.collisions = 0
        self.probes = 0
        self.max_probe_length = 0

    def insert_with_stats(self, key, value):
        index = self.hash(key)
        probes = 0
        while self.table[index] is not None:
            self.collisions += 1
            probes += 1
            index = self.probe_next(index)
        self.probes += probes
        self.max_probe_length = max(self.max_probe_length, probes)
        self.table[index] = (key, value)
```

Common Pitfalls

  1. Poor Hash Functions: Lead to clustering and collisions
  2. Fixed Size Tables: Forgetting to implement rehashing
  3. Ignoring Load Factor: Performance degrades at high load
  4. Key Mutability: Changing keys after insertion
  5. Thread Safety: Hash tables aren't thread-safe by default

Best Practices

  1. Start with standard library implementations (dict in Python, unordered_map in C++)
  2. Profile before optimizing - Modern implementations are highly optimized
  3. Consider memory layout for performance-critical applications
  4. Use consistent hashing for distributed systems
  5. Implement proper equals() and hashCode() for custom objects
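Point 5 uses Java terminology; the Python equivalent is implementing __eq__ and __hash__ together, since equal objects must hash equally. Frozen dataclasses make this safe by construction (a minimal sketch):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    # frozen=True makes instances immutable and auto-generates a
    # __hash__ consistent with the generated __eq__
    x: int
    y: int

locations = {Point(1, 2): "origin-ish"}
print(locations[Point(1, 2)])  # "origin-ish": equal points hash alike
```

Immutability also guards against pitfall 4 above: a key that cannot change cannot silently move to the wrong bucket.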

Understanding hash tables connects to:

  • Binary Search Trees: Alternative for ordered operations
  • Bloom Filters: Probabilistic data structure using hashing
  • Cache Design: LRU/LFU caches use hash tables
  • Database Indexing: Hash indexes vs B-tree indexes
  • Cryptographic Hashing: Different requirements but similar principles

Conclusion

Hash tables are fundamental to efficient programming, providing constant-time operations that make many algorithms practical. While the basic concept is simple, proper implementation requires careful attention to hash functions, collision resolution, and load factor management. Understanding these details helps in choosing the right strategy for your specific use case and performance requirements.
