128. Bloom filter

The Bloom filter is a fast and memory-efficient data structure capable of providing a probabilistic answer to the question Is value X in the given set?

Commonly, this algorithm is useful when the set is huge and most searching algorithms are facing memory and speed issues.

The speed and memory efficiency of the Bloom filter come from the fact that this data structure relies on an array of bits (for example, java.util.BitSet). Initially, the bits of this array are set to 0 or false.

The array of bits is the first main ingredient of the Bloom filter. The second main ingredient consists of one or more hash functions. Ideally, these are pairwise independent and uniformly distributed hash functions. Also, it is very important to be extremely fast. Murmur, the fnv series, and HashMix are some of the hash functions that respect these constraints to an acceptable extent for being used by the Bloom filter.

Now, when we add an element to the Bloom filter, we need to hash this element (pass it through each available hash function) and set the bits in the bit array at the index of those hashes to 1 or true.

The following snippet of code should clarify the main idea:

private BitSet bitset; // the array of bits
private static final Charset CHARSET = StandardCharsets.UTF_8;
...
public void add(T element) {

add(element.toString().getBytes(CHARSET));
}

public void add(byte[] bytes) {

int[] hashes = hash(bytes, numberOfHashFunctions);

for (int hash: hashes) {
bitset.set(Math.abs(hash % bitSetSize), true);
}

numberOfAddedElements++;
}

Now, when we search for an element, we pass this element through the same hash functions. Furthermore, we check whether the resultant values are marked in the array of bits as 1 or true. If they are not, then the element is not in the set for sure. But if they are, then we know with a certain probability that the element is in the set. This is not 100% certain since another element or combination of elements may have been flipped up those bits. Wrong answers are known as false positives.

In terms of code lines, we have the following:

private BitSet bitset; // the array of bits
private static final Charset CHARSET = StandardCharsets.UTF_8;
...

public boolean contains(T element) {

return contains(element.toString().getBytes(CHARSET));
}

public boolean contains(byte[] bytes) {

int[] hashes = hash(bytes, numberOfHashFunctions);

for (int hash: hashes) {
if (!bitset.get(Math.abs(hash % bitSetSize))) {

return false;
}
}

return true;
}

In a graphical representation, we can represent a Bloom filter with an array of bits of size 11 and three hash functions as follows (we have added two elements):

Obviously, we want to reduce the number of false positives as much as possible. While we cannot totally eliminate them, we can still affect their rate by joggling with the size of the bit array, the number of hash functions, and the number of elements in the set.

The following mathematical formulas can be used to shape the optimal Bloom filter:

  • Number of items in the filter (can be estimated based on m, k, and p):

n = ceil(m / (-k / log(1 - exp(log(p) / k))));

  • Probability of false positives, a fraction between 0 and 1, or a number indicating 1-in-p:

p = pow(1 - exp(-k / (m / n)), k);

  • Number of bits in the filter (or size in terms of KB, KiB, MB, Mb, GiB, and so on):

m = ceil((n * log(p)) / log(1 / pow(2, log(2))));

  • Number of hash functions (can be estimated based on m and n):

k = round((m / n) * log(2));

As a rule of thumb, a larger filter will have fewer false positives than a smaller one. Moreover, by increasing the number of hash functions, we obtain fewer false positives, but we slow down the filter and will fill it up quickly. The performance of the Bloom filter is O(h), where h is the number of hash functions.

In the code bundled to this book, there is an implementation of the Bloom filter using hash functions based on SHA-256 and murmur. Since this code is too big to be listed in this book, please consider as a starting point the example from the Main class.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.156.251