Hash Tables

5 min read

Reading Progress0%
Dsa Index
Dsa Index

Hash Tables

1. What It Is and Why It Exists

A stores key→value pairs in an array of "buckets," using a hash function to map each key to a bucket index. Lookup, insert, and delete become "compute the hash, jump to that bucket" — average O(1) — instead of the O(n) scan a list or O(log n) walk a tree would need. It exists to answer "have I seen this?" / "what's associated with this?" in constant time, which is the single most common optimization in interviews: replace a nested-loop scan with one hash-map pass and an O(n^2) solution collapses to O(n).

The "why it works" intuition: hashing trades space for time — you allocate more buckets than entries so collisions stay rare, and a good hash spreads keys uniformly so each bucket holds O(1) elements on average. Collisions are resolved by chaining (a list per bucket) or open addressing (probe to the next slot). The catch every interviewer probes: this is average O(1); the worst case is O(n) when many keys collide into one bucket, and keys must be hashable/immutable (in Python: no list or dict keys — use tuple).

QUICK CHECK

A backend service processes a list of user IDs to find duplicates. The current implementation uses a nested loop, resulting in O(n²) time complexity. A developer refactors it to use a hash set, reducing it to O(n). Which property of hash tables makes this improvement possible?

Choose one answer

2. Operations and Complexity

OperationAverageWorstNotes
Insert / updateO(1) amortizedO(n)Worst = all keys collide; resize is amortized
Lookup d[k] / k in dO(1)O(n)The reason hash maps win interviews
DeleteO(1)O(n)
Iterate all entriesO(n)O(n)Order = insertion order in modern Python
Resize / rehashO(n) (rare)O(n)Amortizes to O(1) per insert (doubling)

Space is O(n). The most-probed points: (1) average vs worst — say "average O(1), worst O(n) under adversarial collisions"; (2) keys must be hashable (immutable); (3) a hash set is just a without values — use it for membership/dedup. Note dict/set give no sorted order; if you need ordering, that is a tree's job, not a 's.

QUICK CHECK

A backend service uses a Python dict to cache user session data, keyed by session token. Under normal load, lookups are fast. However, a security researcher discovers that an attacker can craft session tokens that all hash to the same bucket, dramatically slowing the service. What are the correct average-case and worst-case time complexities for a dict lookup, and what does this attack exploit?

Choose one answer

3. Implementation

A separate-chaining with the collision list made explicit:

class HashMap:
    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]   # chaining: list per bucket
        self._n = 0
        self._cap = capacity

    def _index(self, key):
        return hash(key) % self._cap        # hash -> bucket; needs a hashable key

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket): # collision scan: O(1) avg, O(n) worst
            if k == key:
                bucket[i] = (key, value)    # update existing
                return
        bucket.append((key, value))
        self._n += 1
        if self._n > self._cap:             # load factor > 1 -> grow
            self._resize()                  # O(n), but amortizes to O(1)/insert

    def get(self, key):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self):
        old = [pair for b in self._buckets for pair in b]
        self._cap *= 2                      # DOUBLE keeps amortized O(1)
        self._buckets = [[] for _ in range(self._cap)]
        self._n = 0
        for k, v in old:
            self.put(k, v)

In a real interview, use Python's built-in dict and set — never hand-roll this unless explicitly asked (e.g., LeetCode 706 "Design HashMap"). dict for key→value, set for membership/dedup, collections.Counter for frequency counts, collections.defaultdict(list) for grouping.

QUICK CHECK

A separate-chaining hash table currently has 8 buckets and 8 stored key-value pairs. When a 9th pair is inserted, the table triggers a resize that doubles the bucket count to 16 and rehashes all existing entries. Why does doubling the capacity (rather than, say, adding just 1 bucket) keep the amortized cost of insertions at O(1)?

Choose one answer

4. When to Reach for It

  • You need O(1) membership / "have I seen this?"set.
  • You need fast key→value lookup (two-sum complement, caching, , index-of) → dict.
  • You are computing frequencies / counts / groupingCounter / defaultdict.
  • A brute-force solution does a nested scan to find a matching element → trade the inner loop for a hash map and go from O(n^2) to O(n).
  • Do not use a when you need sorted order or range queries — that's a balanced BST / .
QUICK CHECK

A backend service receives a list of API request IDs and needs to find all duplicate IDs — ones that appear more than once. The naive implementation uses a nested loop, resulting in O(n²) time. Which refactored approach best reduces this to O(n)?

Choose one answer

5. Common Pitfalls

  • Unhashable key: using a list (or set/dict) as a key raises TypeError. Convert to tuple (or frozenset).
  • Claiming guaranteed O(1): it's average O(1); worst case is O(n). Say "average" or the interviewer will correct you.
  • Mutating a key after insertion: if a key object's hash changes, you can never find it again. Keys must be effectively immutable.
  • Relying on dict for ordering semantics: insertion order is preserved in modern Python, but it is not sorted order — don't confuse the two.
  • KeyError on missing key: use d.get(k, default) or defaultdict/Counter instead of bare d[k].
  • Mutating a dict/set while iterating it: raises RuntimeError; iterate over list(d) or collect changes first.
QUICK CHECK

A backend developer writes a caching layer where cached results are keyed by a list of user-selected filter IDs. The code crashes with a TypeError when inserting into the cache dictionary. What is the correct fix, and why?

Choose one answer

6. Interview Cheat Sheet

  • " is average O(1) insert/lookup/delete by trading space for time; worst case is O(n) under collisions."
  • "I use a set for membership, a dict for key→value, and Counter/defaultdict for frequencies — turning many O(n^2) scans into O(n)."
  • "Keys must be hashable/immutable; in Python that means tuple, not list."
  • "If I need sorted order or range queries, a is the wrong tool — that's a balanced BST or ."

Follow-ups:

  • "How are collisions handled?" — Chaining (a list per bucket) or open addressing (linear/quadratic probing). Either keeps average O(1) with a low load factor; resize (double + rehash) keeps inserts amortized O(1).
  • "When is it O(n)?" — When many keys hash to the same bucket (bad hash or adversarial input); the bucket degenerates to a linear scan.
  • "Hash set vs hash map?" — A set is a hash table storing only keys (membership/dedup); a map stores key→value. Same complexity.