Hash Tables and Sets

The data structure ML engineers use more than any other — for deduping data, counting features, caching embeddings, and a large share of interview problems.

Data Structures & Algorithms beginner #dsa #hash-tables #python #interview
Prereqs: Basic Python

Why hash tables are the most-used DS in ML

Before you train a model, you have to clean data. Cleaning data is mostly:

  • Deduping — did we see this row before? Put it in a set.
  • Counting — how often does this token appear? Increment a dict value.
  • Joining — match rows from table A to rows from table B. Hash the join keys.
  • Caching — have we embedded this text before? Lookup in a dict.

Every one of these is O(1) per item with a hash table, versus an O(n) scan per item without one.
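The join bullet is the only one not illustrated later, so here is a minimal sketch of a hash join over rows-as-dicts (the `hash_join` helper and the toy tables are illustrative, not a real API):

```python
from collections import defaultdict

def hash_join(rows_a, rows_b, key):
    """Inner-join two lists of dicts on `key` by hashing one side."""
    # Build phase: bucket rows of A by join key, one O(1) insert per row.
    index = defaultdict(list)
    for row in rows_a:
        index[row[key]].append(row)
    # Probe phase: one O(1) lookup per row of B instead of rescanning all of A.
    return [{**a, **b} for b in rows_b for a in index.get(b[key], [])]

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
events = [{"id": 1, "event": "click"}, {"id": 3, "event": "view"}]
hash_join(users, events, "id")  # -> [{"id": 1, "name": "ada", "event": "click"}]
```

This is the same build/probe structure a database uses for a hash join; the nested-loop alternative is O(len(A) * len(B)).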

The mental model

A hash table maps keys to values using a hash function to index into an array. Insertion, lookup, and deletion are all amortized O(1) — meaning occasionally you hit a resize cost, but on average it’s constant.

In Python:

  • dict — hash map. d[k] = v, d.get(k).
  • set — hash set. s.add(x), x in s, s & other, s | other.
  • collections.Counter — hash map with auto-increment. Your best friend for frequency counts.
  • collections.defaultdict(list) — hash map where missing keys auto-initialize. Cleaner than d.setdefault.
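A quick demo of the last two tools on toy tokens (nothing ML-specific assumed):

```python
from collections import Counter, defaultdict

tokens = ["the", "cat", "sat", "the", "mat"]

# Counter: frequency counts in one line.
counts = Counter(tokens)
counts["the"]          # -> 2
counts.most_common(1)  # -> [("the", 2)]

# defaultdict(list): group without checking "if key not in d" first.
by_first_letter = defaultdict(list)
for tok in tokens:
    by_first_letter[tok[0]].append(tok)
by_first_letter["t"]   # -> ["the", "the"]
```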

The three ML-specific patterns

1. Vocabulary building

from collections import Counter
counts = Counter(token for doc in corpus for token in tokenize(doc))
vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common(30_000))}
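One practical follow-up not shown above (an addition, not part of the snippet): reserve an index for out-of-vocabulary tokens, so unseen words at inference time map somewhere instead of raising KeyError. A self-contained sketch with a toy corpus and a stand-in tokenizer:

```python
from collections import Counter

# Toy stand-ins for the real corpus and tokenizer.
corpus = ["the cat sat", "the cat ran"]
tokenize = str.split

counts = Counter(token for doc in corpus for token in tokenize(doc))
# Shift ids by one so index 0 is reserved for unknown tokens.
vocab = {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(30_000))}

def encode(text: str) -> list[int]:
    # .get with a default: OOV tokens fall back to the unk id 0.
    return [vocab.get(tok, 0) for tok in tokenize(text)]

encode("the dog sat")  # "dog" is OOV, so its position gets id 0
```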

2. Caching embeddings

cache: dict[str, list[float]] = {}
def embed(text: str) -> list[float]:
    if text not in cache:
        cache[text] = model.encode(text)
    return cache[text]
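When the cache can live purely in memory, functools.lru_cache gives the same pattern with a size bound; a sketch with a stub model standing in for a real encoder:

```python
from functools import lru_cache

class StubModel:
    """Stand-in for a real embedding model (assumption for the sketch)."""
    calls = 0
    def encode(self, text: str) -> list[float]:
        StubModel.calls += 1
        return [float(len(text))]

model = StubModel()

@lru_cache(maxsize=100_000)  # evicts least-recently-used entries when full
def embed(text: str) -> tuple[float, ...]:
    # Return a tuple so one caller can't mutate the cached value for everyone.
    return tuple(model.encode(text))

embed("hello")
embed("hello")  # second call is a cache hit: encode runs only once
```

The tuple return is a design choice worth copying even in the hand-rolled dict version: the bare-dict cache hands the same list object to every caller, so one in-place mutation corrupts the cache for all of them.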

3. Set intersection for evals

# What fraction of ground-truth entities did the model extract?
gold = set(extract_entities(ground_truth))
pred = set(extract_entities(model_output))
recall = len(gold & pred) / len(gold) if gold else 0

The gotchas

  • Keys must be hashable — no lists, no dicts. Use tuples or frozensets.
  • Dict order is insertion order in Python 3.7+, so it is reproducible; set iteration order is not, because string hashes are randomized per process (PYTHONHASHSEED). Sort a set before relying on its order across runs.
  • Hash collisions on adversarial inputs are rare but real: hand-crafted colliding keys can degrade a dict to O(n) per lookup, which is exactly why Python randomizes string hashes.
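The hashability rule in miniature: feature rows often arrive as lists, so convert before keying. A short illustration:

```python
# Lists are unhashable; convert to tuples before using them as keys.
row = [1.0, 2.0, 3.0]

seen: set[tuple[float, ...]] = set()
seen.add(tuple(row))   # OK
tuple(row) in seen     # -> True

try:
    seen.add(row)      # type: ignore[arg-type]
except TypeError as e:
    print(e)           # unhashable type: 'list'

# For key sets where order doesn't matter, use frozenset:
tags = frozenset({"nlp", "dedup"})
config = {tags: "config-a"}  # frozensets can be dict keys
```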

Interview patterns

  • “Find the first duplicate” → set
  • “Two-sum” → dict mapping value → index
  • “Group anagrams” → dict mapping sorted key → list
  • “Longest substring without repeating characters” → sliding window + set
  • “Top K frequent elements” → Counter + heapq