Hash Tables and Sets
The data structure ML engineers use more than any other — for deduping data, counting features, caching embeddings, and 80% of interview problems.
Data Structures & Algorithms · Beginner
#dsa
#hash-tables
#python
#interview
Prereqs: Basic Python
Why hash tables are the most-used DS in ML
Before you train a model, you have to clean data. Cleaning data is mostly:
- Deduping — did we see this row before? Put it in a set.
- Counting — how often does this token appear? Increment a dict value.
- Joining — match rows from table A to rows from table B. Hash the join keys.
- Caching — have we embedded this text before? Lookup in a dict.
Every one of these is an average O(1) lookup with a hash table, versus an O(n) scan through a list without one.
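A minimal sketch of the dedup-and-count half of that list, with toy rows standing in for real data:

```python
# Dedup and count in one pass — each membership test and
# each increment is an average O(1) hash operation.
rows = ["a,1", "b,2", "a,1", "c,3", "b,2", "a,1"]

seen: set[str] = set()       # dedup: have we seen this row before?
counts: dict[str, int] = {}  # count: how often does each row appear?

unique_rows = []
for row in rows:
    counts[row] = counts.get(row, 0) + 1
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)

print(unique_rows)    # ['a,1', 'b,2', 'c,3']
print(counts["a,1"])  # 3
```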
The mental model
A hash table maps keys to values using a hash function to index into an array. Insertion, lookup, and deletion are all amortized O(1) — meaning occasionally you hit a resize cost, but on average it’s constant.
In Python:
- dict — hash map. d[k] = v, d.get(k).
- set — hash set. s.add(x), x in s, s & other, s | other.
- collections.Counter — hash map with auto-increment. Your best friend for frequency counts.
- collections.defaultdict(list) — hash map where missing keys auto-initialize. Cleaner than d.setdefault.
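A quick demo of the two collections helpers, with toy tokens:

```python
from collections import Counter, defaultdict

# Counter: frequency counts without the get(key, 0) dance.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
freq = Counter(tokens)
print(freq["the"])      # 2
print(freq["missing"])  # 0 — missing keys count as zero, no KeyError

# defaultdict(list): group values under a key without setdefault.
groups = defaultdict(list)
for word in tokens:
    groups[len(word)].append(word)
print(groups[2])  # ['on']
```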
The three ML-specific patterns
1. Vocabulary building
from collections import Counter
counts = Counter(token for doc in corpus for token in tokenize(doc))
vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common(30_000))}
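The snippet above assumes a corpus and a tokenize function. Here is a self-contained toy version, with str.split standing in for the tokenizer:

```python
from collections import Counter

# Toy corpus; str.split stands in for a real tokenizer.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
counts = Counter(token for doc in corpus for token in doc.split())

# Keep the 3 most frequent tokens; most_common sorts by count,
# breaking ties by first-seen order.
vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common(3))}
print(vocab)
```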
2. Caching embeddings
cache: dict[str, list[float]] = {}
def embed(text: str) -> list[float]:
    if text not in cache:
        cache[text] = model.encode(text)
    return cache[text]
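If the cache can be bounded, the standard library already implements this pattern: functools.lru_cache memoizes by hashing the call arguments (which is why they must be hashable). A sketch with a stand-in encoder, since the real model.encode is assumed rather than shown:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed(text: str) -> tuple[float, ...]:
    # Stand-in for model.encode(text). Returns a tuple rather than
    # a list so the result is itself hashable.
    return (float(len(text)), float(text.count(" ")))

embed("hello world")
embed("hello world")            # second call hits the cache
print(embed.cache_info().hits)  # 1
```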
3. Set intersection for evals
# What fraction of ground-truth entities did the model extract?
gold = set(extract_entities(ground_truth))
pred = set(extract_entities(model_output))
recall = len(gold & pred) / len(gold) if gold else 0
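The same snippet made runnable with toy entity sets, plus precision for symmetry:

```python
# Toy eval: recall = fraction of gold entities the model found.
gold = {"Paris", "France", "Seine"}
pred = {"Paris", "Seine", "Louvre"}

recall = len(gold & pred) / len(gold) if gold else 0
precision = len(gold & pred) / len(pred) if pred else 0
print(round(recall, 3))     # 0.667
print(round(precision, 3))  # 0.667
```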
The gotchas
- Keys must be hashable — no lists, no dicts. Use tuples or frozensets.
- Dict order is insertion order in Python 3.7+, and that part is deterministic. Set iteration order is not: string hashes are randomized per process (see PYTHONHASHSEED), so don’t rely on set order across runs.
- Hash collisions on adversarial inputs — rare but real. Someone feeding you hand-crafted colliding strings can degrade each dict operation to O(n); per-process hash randomization exists partly to blunt exactly this.
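The first gotcha in action — a list key raises, and the usual fix is to freeze it:

```python
# Lists are unhashable; tuples and frozensets of hashables are fine.
try:
    {[1, 2]: "nope"}
except TypeError as e:
    print(e)  # unhashable type: 'list'

# Common fix: freeze the key.
d = {(1, 2): "ok"}
s = {frozenset({"a", "b"})}
print((1, 2) in d)                  # True
print(frozenset({"b", "a"}) in s)   # True — frozensets compare by contents
```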
Interview patterns
- “Find the first duplicate” → set
- “Two-sum” → dict mapping value → index
- “Group anagrams” → dict mapping sorted key → list
- “Longest substring without repeating characters” → sliding window + set
- “Top K frequent elements” → Counter + heapq