Go package implementing Bloom filters

Bloom filters

A Bloom filter is a representation of a set of n items, where the main requirement is to make membership queries; i.e., whether an item is a member of a set.

A Bloom filter has two parameters: m, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.
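
As a rule of thumb, for n items and a target false positive rate p, the optimal choices are roughly m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions. A minimal sketch of letting the library pick these values, assuming your version of the package exports EstimateParameters and NewWithEstimates:

n := uint(10000)
p := 0.01                              // target false positive rate
m, k := bloom.EstimateParameters(n, p) // suggested number of bits and hash functions
filter := bloom.NewWithEstimates(n, p) // shorthand: estimate and allocate in one call
fmt.Printf("m=%d bits, k=%d hash functions\n", m, k)
filter.Add([]byte("example"))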

In this implementation, the hash function used is murmurhash, a non-cryptographic hashing function.

This implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

n := uint(1000)
filter := bloom.New(20*n, 5) // m = 20n bits, k = 5 hash functions
filter.Add([]byte("Love"))

Similarly, to test if "Love" is in the filter:

if filter.Test([]byte("Love")) {
	// "Love" is probably in the filter (false positives are possible)
}

For numeric data, I recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

i := uint32(100)
n1 := make([]byte, 4)
binary.BigEndian.PutUint32(n1, i)
filter.Add(n1)

Finally, there is a method to estimate the false positive rate of a particular Bloom filter for a set of size n:

if filter.EstimateFalsePositiveRate(1000) > 0.001 {
	// the filter is likely undersized for 1000 elements
}

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

Discussion here: Bloom filter

Godoc documentation: https://godoc.org/github.com/willf/bloom

Installation

go get -u github.com/willf/bloom

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa
Comments
  • Applying the bloom filter for HASH160 addresses

    Example Address List.

    0af80f98e511577e5c87d56069896c70fe4a6263
    0af81021340c23bb20429e47c3bdd62137c322d3
    0af811994346cdca9d6dbe8e7da4f3393130fc71
    0af81250856d377cf7f6914ebe196a9570ada313
    0af814b46c4ef5c57dbc53ae03c812eea37d4aac
    0af816bb57c0adac8bca0401d295724cd80b0956
    64dfb83dcca1f516d2bac93c8c9e4eb6513fd364
    64dfb8c23802f3aa67c9078c0e7970fa17584892
    64dfb8c5d21743ba3dd8cef5b576e1ddf6d4c239
    64dfb931e9d68aa68842efa3c5d0bc3c1101b5ec
    64dfba34a39823c273a33fadf49112c063ec3724

    It gives quite a lot of false positives, and for this reason I cannot use it. I want to use a Bloom filter to filter about 25 million HASH160 addresses. How can I choose the settings more efficiently?

    I want to use it for this project. https://github.com/oguzdelioglu/odelbrain
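
    For what it's worth, the filter can be sized directly from the target false positive rate; a minimal sketch, assuming NewWithEstimates is available and that addresses is a hypothetical []string of hex-encoded HASH160 values:

    // Size for 25 million items at a one-in-a-million false positive rate
    // (roughly 7.2e8 bits, about 90 MB, and about 20 hash functions).
    filter := bloom.NewWithEstimates(25000000, 1e-6)
    for _, addr := range addresses {
        raw, err := hex.DecodeString(addr) // add the raw 20 bytes, not the hex string
        if err != nil {
            continue // skip malformed entries
        }
        filter.Add(raw)
    }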

  • Incorrect tests

    Some of the tests are faulty, TestDirect20_5 for example. If we use the formula from estimateParameters (which is the same as the formula on Wikipedia), we get that for 10,000 entries and a 0.0001% precision (p = 0.000001, because EstimateFalsePositiveRate (after fix #20) returns a 100-0 range while estimateParameters uses a 1-0 range) the filter should have 287,552 bits and 20 hashing functions. The values used in the test (200,000 bits and only 5 hashing functions) therefore result in a much higher false positive chance of about 0.03%.
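
    For reference, these numbers follow from the standard formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2; a quick check in Go:

    n, p := 10000.0, 1e-6
    m := math.Ceil(-n * math.Log(p) / (math.Ln2 * math.Ln2)) // 287552 bits
    k := math.Round(m / n * math.Ln2)                        // 20 hash functions
    fmt.Println(m, k)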

    Even with the #20 fix the tests still pass, but this is only because the FNV hashing functions used result in a really weird false positive rate for different inputs. Some inputs result in a much lower false positive rate (which I still don't understand) and many others result in a much higher false positive rate than expected.

    For testing I quickly hacked a version which uses the SHA-256 hash function and the approach from Less Hashing, Same Performance: Building a Better Bloom Filter (also used in my Redis bloom filter). This version always results in the correct false positive rate but is around 26 times slower. If you are interested I can make a fork with this code so you can see if it's interesting.

  • WriteTo() variant that exports only the bitset

    Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats.

    I'm eager to submit a PR to change that if we could agree on the design.

    Off the top of my head we could do one of these things:

    • expose BloomFilter.b – the simplest solution; however, it will allow developers to shoot themselves in the foot.
    • add a function that would return a copy of BloomFilter.b. *BloomFilter.BitSetCopy?
    • add a *BloomFilter.WriteBitSetTo() function?
    • and my favourite: add a *BloomFilter.BitSetWriter() function that would return a
    type BitSetWriter interface {
      io.WriterTo
    }
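
    A minimal sketch of how that last option could look, assuming the filter keeps its bitset in the unexported field b and that *bitset.BitSet continues to implement io.WriterTo:

    // Hypothetical sketch only; BitSetWriter is the interface proposed above.
    func (f *BloomFilter) BitSetWriter() BitSetWriter {
        return f.b // *bitset.BitSet already satisfies io.WriterTo
    }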
    
  • Problem with uniformity of FNV implementation

    Hi Will,

    Really appreciate your time implementing Bloom. We have opted to use it in a new project of ours at work, and while reviewing I have found a problem with your custom implementation of FNV. I threw together a test in a branch of my fork to help illustrate. You can find that here: https://github.com/turgon/bloom-1/blob/uniformity/uniformity_test.go

    The arrangement of the test is pretty straightforward. The hashing method under test is used to determine the location of a bit to set in the filter, just like it is during your Add() method. I count up the number of times each bit is hit, and then compare the histogram to the uniform distribution using a Chi Square goodness of fit test.
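
    A minimal sketch of that kind of check (not the linked test itself; the hash function, bin count, and threshold below are stand-ins):

    // Hash many random keys to bit locations mod m, then compare the histogram
    // of hits against the uniform distribution with a chi-square statistic.
    const m = 1024          // number of bit locations (bins)
    const samples = 1 << 20 // number of random keys
    counts := make([]float64, m)
    buf := make([]byte, 8)
    for i := 0; i < samples; i++ {
        rand.Read(buf)   // crypto/rand
        h := fnv.New64() // stand-in for whichever hash is under test
        h.Write(buf)
        counts[h.Sum64()%m]++
    }
    expected := float64(samples) / float64(m)
    chi2 := 0.0
    for _, observed := range counts {
        d := observed - expected
        chi2 += d * d / expected
    }
    // For m-1 = 1023 degrees of freedom the 5% critical value is roughly 1098;
    // a statistic far above that suggests the bit locations are not uniform.
    fmt.Printf("chi-square statistic: %.1f\n", chi2)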

    I tested five different methods of finding bit locations:

    1. your current implementation
    2. an implementation using Murmur3
    3. your implementation preceding changeset ed0c3ad552eb74b37999995a9678f3723ca44454
    4. an implementation using Go's native FNV-1
    5. an implementation using Go's native FNV-1a

    The results indicate that your two custom implementations of FNV are not uniformly distributing mod m. In practice one could size up the filter in m and k and not suffer terribly, however, it certainly will increase the false positive rate in very unpredictable ways.

    I think the problem stems from your addition of the index on line 77. That addition makes the initial value differ from FNV for indices i greater than zero. My suspicion is that fnv64Hash(0, data) is ok, but the others are not. Also, the others (1, 2, 3) have no influence on the bit location when ii is zero on line 99. I believe this causes the first bit location to be uniformly distributed, but not the rest.

    After further research I introduced Go's native FNV-1 and FNV-1a methods as a baseline, and although they performed surprisingly well, there are only 32 and 64 bit implementations. Therefore the g(x) = h0(x) + i*h1(x) trick can only be applied for filters with m < 2^32 when splitting 64-bit FNV.

    Cassandra uses Murmur3 instead of FNV for its bloom filters and the linked article hints at why. It satisfies the requirement of Kirsch and Mitzenmacher: "Our k hash functions will be of the form gi(x) = h1(x)+ih2(x) mod p, where h1(x) and h2(x) are two independent, uniform random hash functions on the universe with range {0, 1, 2, . . . , p − 1}, and throughout we assume that i ranges from 0 to k − 1." I.e., uniformity is critical or the math doesn't hold up.

    I also prefer Murmur3 as it produces 128-bit values, which lends itself well to addressing 64-bit locations, while not suffering the kind of performance hit a cryptographic hash takes.

    I hope this was clear enough, and you can find it useful. Thank you!

  • Concurrent

    I use the package for filtering some data in web requests. Since the BloomFilter structure uses shared locBuff and hasher fields, it is not safe to call the Test function from multiple goroutines (I included a test case to demonstrate the problem). I have added another method (TestConcurrent) that does not use these shared fields. Feel free to reject if you don't like the idea.
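
    If you need to share one filter across goroutines today, the usual workaround is an external lock; a minimal sketch (the wrapper type and its names are mine, not part of the package):

    // Because this issue reports shared internal buffers even in Test,
    // the sketch takes an exclusive lock for both operations.
    type safeFilter struct {
        mu sync.Mutex
        f  *bloom.BloomFilter
    }

    func (s *safeFilter) Add(data []byte) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.f.Add(data)
    }

    func (s *safeFilter) Test(data []byte) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        return s.f.Test(data)
    }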

  • OOB read using ReadFrom

    When upgrading to the latest version we observe this error:

    panic: runtime error: index out of range [14977] with length 14977
    
    goroutine 761 [running]:
    github.com/bits-and-blooms/bitset.(*BitSet).ReadFrom(0xc00036ff00, {0x50e95a0?, 0xc0003b8dc0?})
    	github.com/bits-and-blooms/[email protected]/bitset.go:955 +0x392
    github.com/bits-and-blooms/bloom/v3.(*BloomFilter).ReadFrom(0xc001096a50, {0x50e95a0, 0xc0003b8dc0})
    	github.com/bits-and-blooms/bloom/[email protected]/bloom.go:367 +0xd6
    github.com/minio/minio/cmd.(*dataUpdateTracker).deserialize(0xc000eb4850, {0x50e95a0, 0xc0003b8dc0}, {0x1b?, 0xc001344f00?, 0x0?})
    	github.com/minio/minio/cmd/data-update-tracker.go:434 +0x487
    github.com/minio/minio/cmd.(*dataUpdateTracker).load(0xc000eb4850, {0x50fa338, 0xc000ff7280}, {0xc001254700?, 0x64, 0xc000128c80?})
    [...]
    

    We are de-serializing into a new instance of &bloom.BloomFilter{}.
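
    For context, the round trip in question looks roughly like this (the buffer and error handling are placeholders):

    var buf bytes.Buffer
    if _, err := filter.WriteTo(&buf); err != nil { // filter: an existing *bloom.BloomFilter
        log.Fatal(err)
    }
    restored := &bloom.BloomFilter{}
    if _, err := restored.ReadFrom(&buf); err != nil { // the panic above occurs inside ReadFrom
        log.Fatal(err)
    }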

    Possible regression from https://github.com/bits-and-blooms/bitset/pull/103

  • Is one hash enough?

    Hi Will, is one hash really enough for a bloom filter?

    Using one is efficient, but if fnv(A) and fnv(B) collide then surely the multiplication you're doing will simply spread the collision across all the buckets in the same way? That doesn't provide any net benefit to avoiding false-positives by increasing the number of hashes.

    Non-FNV collisions (the same bucket chosen from different hashes) do still benefit, though, and there will be more of those.

    I'm interested in the rationale. Thanks for sharing your implementation too, I'm grateful.

    Cheers, Bruce

  • Issue when importing package from master branch

    When I import bloom from the master branch with go get -u github.com/bits-and-blooms/bloom@master, an error appears saying the bitset package has an invalid go.mod: module github.com/willf/bitset.

  • Heap-free hashing

    There is no good reason why hashing a block of data should require heap allocation or even a non-trivial dynamic buffer.

    Unfortunately, the murmur library makes this a bit difficult. Thankfully, we can just roll our own thin functions. It is possible to do while preserving exactly the same hash values. That's what this PR does.

    The goal of this PR is not to improve speed. Of course, it does so... On my machine...

    Before:

    BenchmarkEstimated
    BenchmarkEstimated-8                    1000000000               0.1333 ns/op          0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            2706183               508.2 ns/op           194 B/op          4 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            3312842               437.6 ns/op            97 B/op          2 allocs/op
    PASS
    

    After...

    BenchmarkEstimated-8                    1000000000               0.05285 ns/op         0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            6219651               451.9 ns/op             0 B/op          0 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            5010103               393.2 ns/op             0 B/op          0 allocs/op
    PASS
    

    So we see about a 10% boost in performance in this instance on my system. Results will vary.

    The important part is that it does away entirely with heap allocation.

    The PR includes tests to ensure that the hash values are preserved.

    Note that if we break backward compatibility (with exact hash value reproduction), there are many great optimization opportunities. But that should be addressed in other PRs.

    Related to PRs https://github.com/willf/bloom/pull/51 and https://github.com/willf/bloom/pull/57

  • Remove the side effect of the `EstimateFalsePositiveRate` function

    This pull request includes three changes.

    • Remove the side effect of the EstimateFalsePositiveRate function: by using the Copy function, the estimate no longer modifies the current object.
    • Remove the now-redundant TestOrAdd function. In TestAndAdd, when f.b.Test(l) returns true, executing f.b.Set(l) or not makes no difference, so calling f.b.Set(l) only when the test returns false saves a step. With that change, TestAndAdd becomes the same as TestOrAdd.
    • Update the .gitignore file to ignore the .idea directory.
  • Backward compatibility break: test failures on s390x

    With version 3.0.1 I'm seeing TestHashBasic and TestHashRandom fail on s390x:

    --- FAIL: TestHashBasic (0.00s)
        murmur_test.go:29: Backward compatibillity break.
        murmur_test.go:29: Backward compatibillity break.
    [...]
    --- FAIL: TestHashRandom (0.02s)
        murmur_test.go:60: Backward compatibillity break.
        murmur_test.go:60: Backward compatibillity break.
    [...]
    FAIL
    exit status 1
    FAIL	github.com/willf/bloom	0.217s
    

    You can see the full output in https://koji.fedoraproject.org/koji/taskinfo?taskID=73367600 (look at build.log).

  • This improves slightly the documentation.

    The TestAndAdd and TestOrAdd methods differ in a small manner that might not always be clear to the user. This PR tries to improve the comments/documentation.
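
    For readers following along, my understanding of the difference (a sketch, not the package's documentation):

    // TestAndAdd: reports whether data was (probably) already present, then adds it unconditionally.
    wasPresent := filter.TestAndAdd([]byte("key"))

    // TestOrAdd: reports whether data was (probably) already present, and adds it only if it was not.
    wasPresent = filter.TestOrAdd([]byte("key"))
    _ = wasPresent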

    This relates to PR https://github.com/bits-and-blooms/bloom/pull/81 by @SimFG

    I am honestly not sure what TestAndAdd is good for given that we have TestOrAdd. It was added by @chkno in 2013...

    https://github.com/bits-and-blooms/bloom/commit/7924c81d327f2e9b2bfef5376106c7064e4be7fe

    Meanwhile TestOrAdd is a more recent addition (2021?) by @moredure ...

    https://github.com/bits-and-blooms/bloom/commit/dd9e96df53ad3229c2693b8b0fa58ae1351749e3

    It might be time to do away with TestAndAdd.

  • 2x 128 bit hash instead of 2x 64 bit hash. Why?

    I noticed that for computing multiple hashes you make use of the work in Less Hashing, Same Performance, where gi(x) = h1(x) + i*h2(x). For that you generate a 256-bit hash partitioned into 4 uint64s, so I wonder why you decided on a 256-bit hash partitioned into 4 uint64s instead of a 128-bit hash partitioned into 2 uint64s 🤔 Wouldn't it be equivalent in terms of hashing, but with a performance improvement?
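
    For reference, the construction under discussion derives all k bit locations from two base hash values; a sketch of the textbook form (this is not the package's exact indexing scheme):

    // Kirsch–Mitzenmacher double hashing: g_i(x) = h1(x) + i*h2(x) mod m,
    // where h1 and h2 can be the two 64-bit halves of a single 128-bit hash.
    func locations(h1, h2, k, m uint64) []uint64 {
        locs := make([]uint64, k)
        for i := uint64(0); i < k; i++ {
            locs[i] = (h1 + i*h2) % m
        }
        return locs
    }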
