Skip to content

Guava BloomFilter implementation

The Guava BloomFilter is a practical implementation of the Bloom filter probabilistic data structure, provided by the Google Guava library.^[600-developer-principle-bloomfilter.md]

Basic Usage

To utilize a BloomFilter in Guava, one must import the necessary classes and define a Funnel (which describes how to map an object to primitives) and a Charset.^[600-developer-principle-bloomfilter.md] The filter is instantiated using the create method, which typically accepts the expected number of insertions (n) and the desired false positive probability (p).^[600-developer-principle-bloomfilter.md]

Elements are added to the set using the put method.^[600-developer-principle-bloomfilter.md]

import java.nio.charset.Charset;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Create a filter expecting 1000 insertions with a low [False Positive Rate](<./false-positive-rate.md>)
BloomFilter<String> filter = BloomFilter.create(
    Funnels.stringFunnel(Charset.forName("utf-8")), 
    1000, 
    0.000001
);

filter.put("123");

Testing Membership

To check if an element might exist in the filter, the mightContain method is used.^[600-developer-principle-bloomfilter.md] This method returns true if the element is definitely in the set or is a false positive, and false only if the element is definitely not in the set.

boolean exists = filter.mightContain("123");

Combining Filters

The implementation supports merging two filters together using the putAll method.^[600-developer-principle-bloomfilter.md] This is useful for distributed scenarios or batch processing where separate filter instances need to be consolidated.

BloomFilter<String> filterA = ...;
BloomFilter<String> filterB = ...;
filterA.putAll(filterB); // Absorbs filterB into filterA

Use Case: Cache penetration

A primary application of this implementation is mitigating cache penetration.^[600-developer-principle-bloomfilter.md] By placing a Bloom filter in front of a storage layer (like a database), the system can quickly determine if a key exists. If the filter indicates the key is not present, the database access is skipped entirely, preventing the load from overwhelming the database when querying for non-existent data.

Sources