Space/time trade-offs in hash coding with allowable errors

https://dl.acm.org/doi/10.1145/362686.362692

This paper, written by Burton H. Bloom, introduces a clever way to save a lot of computer memory by allowing for a small, controlled number of mistakes. The main idea is a trade-off: less space for less-than-perfect accuracy.

The core problem it addresses is membership testing—quickly checking if an item is part of a very large collection, especially when most items you check are not in the collection.

The Standard Way: Conventional Hashing (No Errors)

Before introducing his new methods, the author describes the traditional approach to this problem, which is 100% accurate.

How it Works: Imagine you have a set of words you need to store, like {"apple", "banana", "cherry"}. A conventional hash table is like a set of numbered bins. To store "apple," a "hash function" converts the word "apple" into a number, say '4'. You then place the word "apple" into bin #4. If bin #4 is already full, you have a system to find the next empty bin (e.g., trying bin #5, then #6, and so on).
Checking for a Word: To check if "banana" is in your set, you hash "banana" to get its bin number, say '2'. You look in bin #2. If you find "banana" there, it's in the set. If you find an empty bin during your search, you know for sure the word is not in the set.
The Downside: This method requires a lot of memory because you have to store the entire word in each bin. For a very large set of items (like all the words in a dictionary), the hash table might be too big to fit in a computer's fast main memory (RAM).

The New Ideas: Hashing with Allowable Errors

The paper proposes two new methods that dramatically reduce memory usage by permitting a small fraction of false positives.

Method 1: Hashing with Short Codes

This is a simple modification of the conventional method.

How it Works: Instead of storing the full word "banana" in a bin, you store a much shorter, fixed-size code derived from it, like "ban". This obviously saves a lot of space.
The Error: The problem is that another word you want to check, like "bandana" (which is not in your set), might also produce the code "ban". When you check for "bandana", the system finds the code "ban" and mistakenly tells you it's in the set. This is a false positive. The size of the code is a trade-off: a longer code is more accurate but uses more space.

Method 2: The Bit Array (The Bloom Filter)

This is the paper's main and most powerful contribution, a technique now famously known as a Bloom Filter. It completely abandons the idea of storing items in bins.

Setup: Imagine a very long row of light switches, all initially in the OFF position (a bit value of 0). You also pick several different hash functions (e.g., three of them).
Adding an Item:
1. To add the word "cat" to your set, you feed it to all three hash functions.
2. The first hash function might output the number '7', the second '34', and the third '81'.
3. You go to switches #7, #34, and #81 and flip them to the
  ON position (a bit value of 1).
4. You repeat this for every word in your set. Note that a single switch might be flipped ON by multiple different words.
Checking for an Item:
1. Now, say you want to check if the word "dog" is in your set. You feed "dog" to the same three hash functions.
2. They might output the numbers '15', '34', and '92'.
3. You go and look at the state of switches #15, #34, and #92.
4. Case 1: A "No" Answer. If even one of these switches is still OFF (e.g., switch #15 is OFF), you can say with 100% certainty that "dog" is not in your set. If it had been added, all its switches would have been flipped ON.
5. Case 2: A "Maybe" Answer. If all three switches are ON, the system reports that "dog" is probably in the set. It's only "probably" because switch #34 might have been turned on by "cat", and switches #15 and #92 might have been turned on by other words. The fact that all three are ON could be a coincidence. This is the source of the false positives.

Space/time trade-offs in hash coding with allowable errors

The Standard Way: Conventional Hashing (No Errors)

The New Ideas: Hashing with Allowable Errors

Method 1: Hashing with Short Codes

Method 2: The Bit Array (The Bloom Filter)

Comments

More from this blog

Bigtable: A Distributed Storage System for Structured Data

WTF: The Who to Follow Service at Twitter

Calculators Vs AI Vs AGI

Null reference exception with a twist.

Command Palette

The Standard Way: Conventional Hashing (No Errors)

The New Ideas: Hashing with Allowable Errors

Method 1: Hashing with Short Codes

Method 2: The Bit Array (The Bloom Filter)

Comments

More from this blog