Go package implementing Bloom filters


A Bloom filter is a concise/compressed representation of a set whose main purpose is to support membership queries, i.e., to ask whether an item is a member of the set. A Bloom filter will always correctly report the presence of an element in the set when the element is indeed present. A Bloom filter can use much less storage than the original set, but it allows for some 'false positives': it may sometimes report that an element is in the set when it is not.

When you construct a Bloom filter, you need to know how many elements it will hold (the desired capacity) and what false-positive rate you are willing to tolerate. A common false-positive rate is 1%. The lower the false-positive rate, the more memory you will require. Similarly, the higher the capacity, the more memory you will use. You may construct a Bloom filter capable of receiving 1 million elements with a false-positive rate of 1% in the following manner.

    filter := bloom.NewWithEstimates(1000000, 0.01) 

You should call NewWithEstimates conservatively: if you specify a number of elements that is too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure: you must know your desired capacity ahead of time.

Our implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

    filter.Add([]byte("Love"))

Similarly, to test whether "Love" is in the filter:

    if filter.Test([]byte("Love")) {
        // "Love" is probably in the filter (false positives are possible)
    }

For numerical data, we recommend that you look into the encoding/binary standard library. For example, to add a uint32 to the filter:

    i := uint32(100)
    n1 := make([]byte, 4)
    binary.BigEndian.PutUint32(n1, i)
    filter.Add(n1)
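
Putting the pieces together, a complete program might look as follows (a minimal sketch using only the calls shown above; adjust the capacity and error rate to your needs):

    package main

    import (
        "encoding/binary"
        "fmt"

        "github.com/bits-and-blooms/bloom/v3"
    )

    func main() {
        // Size for 1 million elements at a 1% false-positive rate.
        filter := bloom.NewWithEstimates(1000000, 0.01)

        // Add a string key.
        filter.Add([]byte("Love"))

        // Add a numeric key via encoding/binary.
        n1 := make([]byte, 4)
        binary.BigEndian.PutUint32(n1, uint32(100))
        filter.Add(n1)

        // Membership queries: no false negatives, occasional false positives.
        fmt.Println(filter.Test([]byte("Love"))) // true
        fmt.Println(filter.Test([]byte("Hate"))) // almost certainly false
    }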

Discussion here: Bloom filter

Godoc documentation: https://pkg.go.dev/github.com/bits-and-blooms/bloom

Installation

    go get -u github.com/bits-and-blooms/bloom/v3

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow").

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

    make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

    make deps
    make qa

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter of this implementation.) A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is tested by checking whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is (probably) in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.
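
To see the m and k that a given capacity and error rate imply, you can use the package's EstimateParameters helper (a small sketch; the exact values may vary across library versions):

    package main

    import (
        "fmt"

        "github.com/bits-and-blooms/bloom/v3"
    )

    func main() {
        // m bits of storage and k hash functions for 1M elements at a 1% FP rate.
        m, k := bloom.EstimateParameters(1000000, 0.01)
        fmt.Printf("m = %d bits, k = %d hash functions\n", m, k)
    }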

In this implementation, the hashing function used is murmurhash, a non-cryptographic hashing function.

Given the particular hashing scheme, it's best to be empirical about the false-positive rate you actually achieve. Note that estimating the FP rate will clear the Bloom filter.
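
For instance, you might measure the observed false-positive rate on a filter you built yourself (a sketch; it probes keys derived from integers that were never added):

    package main

    import (
        "encoding/binary"
        "fmt"

        "github.com/bits-and-blooms/bloom/v3"
    )

    func main() {
        filter := bloom.NewWithEstimates(1000000, 0.01)
        buf := make([]byte, 8)

        // Insert one million distinct keys.
        for i := uint64(0); i < 1000000; i++ {
            binary.BigEndian.PutUint64(buf, i)
            filter.Add(buf)
        }

        // Probe one million keys that were never inserted and count hits.
        falsePositives := 0
        for i := uint64(1000000); i < 2000000; i++ {
            binary.BigEndian.PutUint64(buf, i)
            if filter.Test(buf) {
                falsePositives++
            }
        }
        fmt.Printf("observed FP rate: %.4f\n", float64(falsePositives)/1000000)
    }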

Comments
  • Applying the bloom filter for HASH160 addresses

    Example Address List.

        0af80f98e511577e5c87d56069896c70fe4a6263
        0af81021340c23bb20429e47c3bdd62137c322d3
        0af811994346cdca9d6dbe8e7da4f3393130fc71
        0af81250856d377cf7f6914ebe196a9570ada313
        0af814b46c4ef5c57dbc53ae03c812eea37d4aac
        0af816bb57c0adac8bca0401d295724cd80b0956
        64dfb83dcca1f516d2bac93c8c9e4eb6513fd364
        64dfb8c23802f3aa67c9078c0e7970fa17584892
        64dfb8c5d21743ba3dd8cef5b576e1ddf6d4c239
        64dfb931e9d68aa68842efa3c5d0bc3c1101b5ec
        64dfba34a39823c273a33fadf49112c063ec3724

    It gives quite a lot of false positives, and for this reason I cannot use it. I want to use a bloom filter to filter about 25 million hash160 addresses. How can I choose the settings more efficiently?

    I want to use it for this project. https://github.com/oguzdelioglu/odelbrain

  • Incorrect tests

    Some of the tests are faulty, TestDirect20_5 for example. If we use the formula from estimateParameters (which is the same as the formula on Wikipedia), we get that for 10000 entries and a 0.0001% precision (p = 0.000001, because EstimateFalsePositiveRate (after fix #20) returns a 100-0 range and estimateParameters uses a 1-0 range) the filter should have 287,552 bits and 20 hashing functions. The values used in the test (200,000 bits and only 5 hashing functions) therefore result in a much higher false-positive chance of 0.03%.

    Even with the #20 fix the tests still pass, but this is only because the FNV hashing function used results in a really weird false positive rate for different inputs. Some inputs result in a much lower false positive rate (which I still don't understand) and many others result in a much higher false positive rate than expected.

    For testing I quickly hacked together a version which uses the SHA-256 hashing function and the research from Less Hashing, Same Performance: Building a Better Bloom Filter (also used in my redis bloom filter). This version always results in the correct false positive rate but is around 26 times slower. If you are interested I can make a fork with this code so you can see if it's interesting.
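
    To illustrate the scheme from that paper: derive two 64-bit values h1 and h2 from a single hash of the key, then take bit locations gi(x) = h1(x) + i*h2(x) mod m. A minimal sketch along the lines described here, using SHA-256 (the function is illustrative, not the library's actual code):

        package main

        import (
            "crypto/sha256"
            "encoding/binary"
            "fmt"
        )

        // bitLocations derives k bit positions in [0, m) from a single SHA-256
        // digest, following g_i(x) = h1(x) + i*h2(x) mod m (Kirsch & Mitzenmacher).
        func bitLocations(data []byte, k, m uint64) []uint64 {
            sum := sha256.Sum256(data)
            h1 := binary.BigEndian.Uint64(sum[0:8])
            h2 := binary.BigEndian.Uint64(sum[8:16])
            locs := make([]uint64, k)
            for i := uint64(0); i < k; i++ {
                locs[i] = (h1 + i*h2) % m
            }
            return locs
        }

        func main() {
            // 20 hash functions over 287,552 bits, as computed in this thread.
            fmt.Println(bitLocations([]byte("Love"), 20, 287552))
        }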

  • WriteTo() variant that exports only the bitset

    Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats (see the round-trip sketch at the end of this comment for what the current API offers).

    I'm eager to submit a PR to change that if we could agree on the design.

    Off the top of my head we could do one of these things:

    • expose BloomFilter.b – the simplest solution. However, it will allow developers to shoot themselves in the foot.
    • add a function that would return a copy of BloomFilter.b. *BloomFilter.BitSetCopy?
    • add a *BloomFilter.WriteBitSetTo() function?
    • and my favourite: add a *BloomFilter.BitSetWriter() function that would return a
    type BitSetWriter interface {
        io.WriterTo
    }
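
    For context, the existing whole-filter serialization round-trips like this (a sketch using the package's WriteTo/ReadFrom; the proposal asks for a similar hook scoped to just the bitset; assumes imports of bytes, log, and the bloom package):

        var buf bytes.Buffer
        if _, err := filter.WriteTo(&buf); err != nil {
            log.Fatal(err)
        }
        restored := &bloom.BloomFilter{}
        if _, err := restored.ReadFrom(&buf); err != nil {
            log.Fatal(err)
        }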
    
  • Problem with uniformity of FNV implementation

    Hi Will,

    Really appreciate your time implementing Bloom. We have opted to use it in a new project of ours at work, and while reviewing I have found a problem with your custom implementation of FNV. I threw together a test in a branch of my fork to help illustrate. You can find that here: https://github.com/turgon/bloom-1/blob/uniformity/uniformity_test.go

    The arrangement of the test is pretty straightforward. The hashing method under test is used to determine the location of a bit to set in the filter, just like it is during your Add() method. I count up the number of times each bit is hit, and then compare the histogram to the uniform distribution using a Chi Square goodness of fit test.

    I tested five different methods of finding bit locations:

    1. your current implementation
    2. an implementation using Murmur3
    3. your implementation preceding changeset ed0c3ad552eb74b37999995a9678f3723ca44454
    4. an implementation using Go's native FNV-1
    5. an implementation using Go's native FNV-1a

    The results indicate that your two custom implementations of FNV do not distribute bit locations uniformly mod m. In practice one could size up the filter in m and k and not suffer terribly; however, it certainly will increase the false positive rate in very unpredictable ways.

    I think the problem stems from your addition of the index on line 77. That addition makes the initial value differ from FNV for indices i greater than zero. My suspicion is that fnv64Hash(0, data) is ok, but the others are not. Also, the others (1, 2, 3) have no influence on the bit location when i is zero on line 99. I believe this causes the first bit location to be uniformly distributed, but not the rest.

    After further research I introduced Go's native FNV-1 and FNV-1a methods as a baseline, and although they performed surprisingly well, there are only 32 and 64 bit implementations. Therefore the g(x) = h0(x) + i*h1(x) trick can only be applied for filters with m < 2^32 when splitting 64-bit FNV.

    Cassandra uses Murmur3 instead of FNV for its bloom filters, and the linked article hints at why. It satisfies the requirement of Kirsch and Mitzenmacher: "Our k hash functions will be of the form gi(x) = h1(x) + i*h2(x) mod p, where h1(x) and h2(x) are two independent, uniform random hash functions on the universe with range {0, 1, 2, . . . , p − 1}, and throughout we assume that i ranges from 0 to k − 1." I.e.: uniformity is critical or the math doesn't hold up.

    I also prefer Murmur3 as it produces 128-bit values, which lends itself well to addressing 64-bit locations, while not suffering the kind of performance hit a cryptographic hash takes.

    I hope this was clear enough, and you can find it useful. Thank you!

  • Concurrent

    I use the package for filtering some data in web requests. Since the BloomFilter structure uses shared locBuff and hasher fields, it is not safe to call the Test function from multiple goroutines (I included a test case to demonstrate the problem). I have added another method (TestConcurrent) that does not use these shared fields. Feel free to reject if you don't like the idea.
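
    Until something like that is available, a common workaround is to serialize access yourself. A minimal sketch (the wrapper type is illustrative, not part of the package; assumes imports of sync and the bloom package; an exclusive mutex is used because, per the above, even Test mutates shared state):

        // SafeFilter guards a BloomFilter so Add and Test can be called
        // from multiple goroutines.
        type SafeFilter struct {
            mu sync.Mutex
            f  *bloom.BloomFilter
        }

        func (s *SafeFilter) Add(data []byte) {
            s.mu.Lock()
            defer s.mu.Unlock()
            s.f.Add(data)
        }

        func (s *SafeFilter) Test(data []byte) bool {
            s.mu.Lock()
            defer s.mu.Unlock()
            return s.f.Test(data)
        }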

  • OOB read using ReadFrom

    When upgrading to the latest version we observe this error:

    panic: runtime error: index out of range [14977] with length 14977
    
    goroutine 761 [running]:
    github.com/bits-and-blooms/bitset.(*BitSet).ReadFrom(0xc00036ff00, {0x50e95a0?, 0xc0003b8dc0?})
    	github.com/bits-and-blooms/[email protected]/bitset.go:955 +0x392
    github.com/bits-and-blooms/bloom/v3.(*BloomFilter).ReadFrom(0xc001096a50, {0x50e95a0, 0xc0003b8dc0})
    	github.com/bits-and-blooms/bloom/[email protected]/bloom.go:367 +0xd6
    github.com/minio/minio/cmd.(*dataUpdateTracker).deserialize(0xc000eb4850, {0x50e95a0, 0xc0003b8dc0}, {0x1b?, 0xc001344f00?, 0x0?})
    	github.com/minio/minio/cmd/data-update-tracker.go:434 +0x487
    github.com/minio/minio/cmd.(*dataUpdateTracker).load(0xc000eb4850, {0x50fa338, 0xc000ff7280}, {0xc001254700?, 0x64, 0xc000128c80?})
    [...]
    

    We are de-serializing into a new instance of &bloom.BloomFilter{}.

    Possible regression from https://github.com/bits-and-blooms/bitset/pull/103

  • Is one hash enough?

    Hi Will, is one hash really enough for a bloom filter?

    Using one is efficient, but if fnv(A) and fnv(B) collide then surely the multiplication you're doing will simply spread the collision across all the buckets in the same way? That doesn't provide any net benefit to avoiding false-positives by increasing the number of hashes.

    Non-fnv-collisions (same bucket chosen from a different hash) do still benefit, though, and there will be more of those.

    I'm interested in the rationale. Thanks for sharing your implementation too, I'm grateful.

    Cheers, Bruce

  • Issue when importing package from master branch

    When I import bloom from the master branch with go get -u github.com/bits-and-blooms/bloom@master, an error appears saying the bitset package has an invalid go.mod (module github.com/willf/bitset).

  • Heap-free hashing

    There is no good reason why hashing a block of data should require heap allocation or even a non-trivial dynamic buffer.

    Unfortunately, the murmur library makes this a bit difficult. Thankfully, we can just roll our own thin functions. It is possible to do so while preserving exactly the same hash values. That's what this PR does.

    The goal of this PR is not to improve speed. Of course, it does so... On my machine...

    Before:

    BenchmarkEstimated
    BenchmarkEstimated-8                    1000000000               0.1333 ns/op          0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            2706183               508.2 ns/op           194 B/op          4 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            3312842               437.6 ns/op            97 B/op          2 allocs/op
    PASS
    

    After...

    BenchmarkEstimated-8                    1000000000               0.05285 ns/op         0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            6219651               451.9 ns/op             0 B/op          0 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            5010103               393.2 ns/op             0 B/op          0 allocs/op
    PASS
    

    So we see about a 10% boost in performance in this instance on my system. Results will vary.

    The important part is that it does away entirely with heap allocation.

    The PR includes tests to ensure that the hash values are preserved.

    Note that if we break backward compatibility (with exact hash value reproduction), there are many great optimization opportunities. But that should be addressed in other PRs.

    Related to PRs https://github.com/willf/bloom/pull/51 and https://github.com/willf/bloom/pull/57
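
    To reproduce numbers like the above, the standard Go tooling suffices; -benchmem is what produces the B/op and allocs/op columns (run from the repository root):

        go test -run=NONE -bench=. -benchmem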

  • Remove the side effect of the `EstimateFalsePositiveRate` function

    This pull request includes three changes.

    • Remove the side effect of the EstimateFalsePositiveRate function: by working on a Copy, estimating no longer affects the current object.
    • Remove the duplicated TestOrAdd function. In TestAndAdd, when f.b.Test(l) returns true, executing f.b.Set(l) makes no difference, so calling f.b.Set(l) only when the test returns false saves one step. With that change, TestAndAdd is the same as TestOrAdd.
    • Update the .gitignore file to ignore the .idea directory.
  • Backward compatibility break test failures on s390x

    With version 3.0.1 I'm seeing TestHashBasic and TestHashRandom fail on s390x:

    --- FAIL: TestHashBasic (0.00s)
        murmur_test.go:29: Backward compatibillity break.
        murmur_test.go:29: Backward compatibillity break.
    [...]
    --- FAIL: TestHashRandom (0.02s)
        murmur_test.go:60: Backward compatibillity break.
        murmur_test.go:60: Backward compatibillity break.
    [...]
    FAIL
    exit status 1
    FAIL	github.com/willf/bloom	0.217s
    

    You can see the full output in https://koji.fedoraproject.org/koji/taskinfo?taskID=73367600 (look at build.log).

  • This slightly improves the documentation.

    The TestAndAdd and TestOrAdd methods differ in a small manner that might not always be clear to the user. This PR tries to improve the comments/documentation.

    This relates to PR https://github.com/bits-and-blooms/bloom/pull/81 by @SimFG

    I am honestly not sure what TestAndAdd is good for given that we have TestOrAdd. It was added by @chkno in 2013...

    https://github.com/bits-and-blooms/bloom/commit/7924c81d327f2e9b2bfef5376106c7064e4be7fe

    Meanwhile TestOrAdd is a more recent addition (2021?) by @moredure ...

    https://github.com/bits-and-blooms/bloom/commit/dd9e96df53ad3229c2693b8b0fa58ae1351749e3

    It might be time to do away with TestAndAdd.
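
    For readers finding this later, the difference can be shown in a short sketch (a hedged reading of the documented semantics: both return whether the key tested as present; TestAndAdd then adds unconditionally, while TestOrAdd adds only when the test came back negative):

        filter := bloom.NewWithEstimates(1000, 0.01)

        // TestAndAdd: test, then add unconditionally.
        wasPresent := filter.TestAndAdd([]byte("key"))

        // TestOrAdd: test, then add only if the key was not already present.
        wasPresent = filter.TestOrAdd([]byte("key"))

        _ = wasPresent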

  • 2x 128 bit hash instead of 2x 64 bit hash. Why?

    I noticed that for computing multiple hashes, you make use of the work of Less Hashing, Same Performance, which is calculated by gi(x) = h1(x) + i*h2(x). For that you generate a 256-bit hash partitioned into four uint64s. So I wonder why you decided on a 256-bit hash partitioned into four uint64s instead of a 128-bit hash partitioned into two uint64s 🤔 Wouldn't it be the same regarding hashing, but with a performance improvement?
