Go package implementing Bloom filters

Bloom filters

A Bloom filter is a representation of a set of n items, where the main requirement is to make membership queries; i.e., whether an item is a member of a set.

A Bloom filter has two parameters: m, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.
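
As a rule of thumb, for n items and a target false positive rate p, the optimal choices are roughly m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions. A minimal sketch of letting the library pick these values, assuming your version of the package exports EstimateParameters and NewWithEstimates:

n := uint(10000)
p := 0.01                              // target false positive rate
m, k := bloom.EstimateParameters(n, p) // suggested number of bits and hash functions
filter := bloom.NewWithEstimates(n, p) // shorthand: estimate and allocate in one call
fmt.Printf("m=%d bits, k=%d hash functions\n", m, k)
filter.Add([]byte("example"))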

In this implementation, the hash function used is murmurhash, a non-cryptographic hashing function.

This implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

n := uint(1000)
filter := bloom.New(20*n, 5) // m = 20n bits, k = 5 hash functions
filter.Add([]byte("Love"))

Similarly, to test if "Love" is in the filter:

if filter.Test([]byte("Love")) {
	// "Love" is probably in the filter (false positives are possible)
}

For numeric data, I recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

i := uint32(100)
n1 := make([]byte, 4)
binary.BigEndian.PutUint32(n1, i)
filter.Add(n1)

Finally, there is a method to estimate the false positive rate of a particular Bloom filter for a set of size n:

if filter.EstimateFalsePositiveRate(1000) > 0.001 {
	// the filter is likely undersized for 1000 elements
}

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

Discussion here: Bloom filter

Godoc documentation: https://godoc.org/github.com/willf/bloom

Installation

go get -u github.com/willf/bloom

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa
Comments
  • Applying the bloom filter for HASH160 addresses

    Example Address List.

    0af80f98e511577e5c87d56069896c70fe4a6263
    0af81021340c23bb20429e47c3bdd62137c322d3
    0af811994346cdca9d6dbe8e7da4f3393130fc71
    0af81250856d377cf7f6914ebe196a9570ada313
    0af814b46c4ef5c57dbc53ae03c812eea37d4aac
    0af816bb57c0adac8bca0401d295724cd80b0956
    64dfb83dcca1f516d2bac93c8c9e4eb6513fd364
    64dfb8c23802f3aa67c9078c0e7970fa17584892
    64dfb8c5d21743ba3dd8cef5b576e1ddf6d4c239
    64dfb931e9d68aa68842efa3c5d0bc3c1101b5ec
    64dfba34a39823c273a33fadf49112c063ec3724

    It gives quite a lot of false positives, and for this reason I cannot use it. I want to use a Bloom filter to filter about 25 million HASH160 addresses. How can I choose the settings more efficiently?

    I want to use it for this project. https://github.com/oguzdelioglu/odelbrain
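
    For what it's worth, the filter can be sized directly from the target false positive rate; a minimal sketch, assuming NewWithEstimates is available and that addresses is a hypothetical []string of hex-encoded HASH160 values:

    // Size for 25 million items at a one-in-a-million false positive rate
    // (roughly 7.2e8 bits, about 90 MB, and about 20 hash functions).
    filter := bloom.NewWithEstimates(25000000, 1e-6)
    for _, addr := range addresses {
        raw, err := hex.DecodeString(addr) // add the raw 20 bytes, not the hex string
        if err != nil {
            continue // skip malformed entries
        }
        filter.Add(raw)
    }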

  • Incorrect tests

    Some of the tests are faulty, TestDirect20_5 for example. If we use the formula from estimateParameters (which is the same as the formula on Wikipedia), we get that for 10,000 entries and a 0.0001% precision (p = 0.000001, because EstimateFalsePositiveRate (after fix #20) returns a 100-0 range while estimateParameters uses a 1-0 range) the filter should have 287,552 bits and 20 hashing functions. The values used in the test (200,000 bits and only 5 hashing functions) therefore result in a much higher false positive chance of about 0.03%.
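
    For reference, these numbers follow from the standard formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2; a quick check in Go:

    n, p := 10000.0, 1e-6
    m := math.Ceil(-n * math.Log(p) / (math.Ln2 * math.Ln2)) // 287552 bits
    k := math.Round(m / n * math.Ln2)                        // 20 hash functions
    fmt.Println(m, k)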

    Even with the #20 fix the tests still pass, but this is only because the FNV hashing functions used result in a really weird false positive rate for different inputs. Some inputs result in a much lower false positive rate (which I still don't understand) and many others result in a much higher false positive rate than expected.

    For testing I quickly hacked a version which uses the SHA-256 hash function and the approach from Less Hashing, Same Performance: Building a Better Bloom Filter (also used in my Redis bloom filter). This version always results in the correct false positive rate but is around 26 times slower. If you are interested I can make a fork with this code so you can see if it's interesting.

  • WriteTo() variant that exports only the bitset

    Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats.

    I'm eager to submit a PR to change that if we could agree on the design.

    Off the top of my head we could do one of these things:

    • expose BloomFilter.b – the simplest solution; however, it will allow developers to shoot themselves in the foot.
    • add a function that would return a copy of BloomFilter.b. *BloomFilter.BitSetCopy?
    • add a *BloomFilter.WriteBitSetTo() function?
    • and my favourite: add a *BloomFilter.BitSetWriter() function that would return a
    type BitSetWriter interface {
      io.WriterTo
    }
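
    A minimal sketch of how that last option could look, assuming the filter keeps its bitset in the unexported field b and that *bitset.BitSet continues to implement io.WriterTo:

    // Hypothetical sketch only; BitSetWriter is the interface proposed above.
    func (f *BloomFilter) BitSetWriter() BitSetWriter {
        return f.b // *bitset.BitSet already satisfies io.WriterTo
    }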
    
  • Problem with uniformity of FNV implementation

    Hi Will,

    Really appreciate your time implementing Bloom. We have opted to use it in a new project of ours at work, and while reviewing I have found a problem with your custom implementation of FNV. I threw together a test in a branch of my fork to help illustrate. You can find that here: https://github.com/turgon/bloom-1/blob/uniformity/uniformity_test.go

    The arrangement of the test is pretty straightforward. The hashing method under test is used to determine the location of a bit to set in the filter, just like it is during your Add() method. I count up the number of times each bit is hit, and then compare the histogram to the uniform distribution using a Chi Square goodness of fit test.
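
    A minimal sketch of that kind of check (not the linked test itself; the hash function, bin count, and threshold below are stand-ins):

    // Hash many random keys to bit locations mod m, then compare the histogram
    // of hits against the uniform distribution with a chi-square statistic.
    const m = 1024          // number of bit locations (bins)
    const samples = 1 << 20 // number of random keys
    counts := make([]float64, m)
    buf := make([]byte, 8)
    for i := 0; i < samples; i++ {
        rand.Read(buf)   // crypto/rand
        h := fnv.New64() // stand-in for whichever hash is under test
        h.Write(buf)
        counts[h.Sum64()%m]++
    }
    expected := float64(samples) / float64(m)
    chi2 := 0.0
    for _, observed := range counts {
        d := observed - expected
        chi2 += d * d / expected
    }
    // For m-1 = 1023 degrees of freedom the 5% critical value is roughly 1098;
    // a statistic far above that suggests the bit locations are not uniform.
    fmt.Printf("chi-square statistic: %.1f\n", chi2)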

    I tested five different methods of finding bit locations:

    1. your current implementation
    2. an implementation using Murmur3
    3. your implementation preceding changeset ed0c3ad552eb74b37999995a9678f3723ca44454
    4. an implementation using Go's native FNV-1
    5. an implementation using Go's native FNV-1a

    The results indicate that your two custom implementations of FNV are not uniformly distributing mod m. In practice one could size up the filter in m and k and not suffer terribly, however, it certainly will increase the false positive rate in very unpredictable ways.

    I think the problem stems from your addition of the index on line 77. That addition makes the initial value differ from FNV for indices i greater than zero. My suspicion is that fnv64Hash(0, data) is ok, but the others are not. Also, the others (1, 2, 3) have no influence on the bit location when ii is zero on line 99. I believe this causes the first bit location to be uniformly distributed, but not the rest.

    After further research I introduced Go's native FNV-1 and FNV-1a methods as a baseline, and although they performed surprisingly well, there are only 32 and 64 bit implementations. Therefore the g(x) = h0(x) + i*h1(x) trick can only be applied for filters with m < 2^32 when splitting 64-bit FNV.

    Cassandra uses Murmur3 instead of FNV for its bloom filters and the linked article hints at why. It satisfies the requirement of Kirsch and Mitzenmacher: "Our k hash functions will be of the form gi(x) = h1(x)+ih2(x) mod p, where h1(x) and h2(x) are two independent, uniform random hash functions on the universe with range {0, 1, 2, . . . , p − 1}, and throughout we assume that i ranges from 0 to k − 1." I.e., uniformity is critical or the math doesn't hold up.

    I also prefer Murmur3 as it produces 128-bit values, which lends itself well to addressing 64-bit locations, while not suffering the kind of performance hit a cryptographic hash takes.

    I hope this was clear enough, and you can find it useful. Thank you!

  • Concurrent

    I use the package for filtering some data in web requests. Since the BloomFilter structure uses shared locBuff and hasher fields, it is not safe to call the Test function from multiple goroutines (I included a test case to demonstrate the problem). I have added another method (TestConcurrent) that does not use these shared fields. Feel free to reject if you don't like the idea.
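
    If you need to share one filter across goroutines today, the usual workaround is an external lock; a minimal sketch (the wrapper type and its names are mine, not part of the package):

    // Because this issue reports shared internal buffers even in Test,
    // the sketch takes an exclusive lock for both operations.
    type safeFilter struct {
        mu sync.Mutex
        f  *bloom.BloomFilter
    }

    func (s *safeFilter) Add(data []byte) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.f.Add(data)
    }

    func (s *safeFilter) Test(data []byte) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        return s.f.Test(data)
    }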

  • OOB read using ReadFrom

    When upgrading to the latest version we observe this error:

    panic: runtime error: index out of range [14977] with length 14977
    
    goroutine 761 [running]:
    github.com/bits-and-blooms/bitset.(*BitSet).ReadFrom(0xc00036ff00, {0x50e95a0?, 0xc0003b8dc0?})
    	github.com/bits-and-blooms/[email protected]/bitset.go:955 +0x392
    github.com/bits-and-blooms/bloom/v3.(*BloomFilter).ReadFrom(0xc001096a50, {0x50e95a0, 0xc0003b8dc0})
    	github.com/bits-and-blooms/bloom/[email protected]/bloom.go:367 +0xd6
    github.com/minio/minio/cmd.(*dataUpdateTracker).deserialize(0xc000eb4850, {0x50e95a0, 0xc0003b8dc0}, {0x1b?, 0xc001344f00?, 0x0?})
    	github.com/minio/minio/cmd/data-update-tracker.go:434 +0x487
    github.com/minio/minio/cmd.(*dataUpdateTracker).load(0xc000eb4850, {0x50fa338, 0xc000ff7280}, {0xc001254700?, 0x64, 0xc000128c80?})
    [...]
    

    We are de-serializing into a new instance of &bloom.BloomFilter{}.
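
    For context, the round trip in question looks roughly like this (the buffer and error handling are placeholders):

    var buf bytes.Buffer
    if _, err := filter.WriteTo(&buf); err != nil { // filter: an existing *bloom.BloomFilter
        log.Fatal(err)
    }
    restored := &bloom.BloomFilter{}
    if _, err := restored.ReadFrom(&buf); err != nil { // the panic above occurs inside ReadFrom
        log.Fatal(err)
    }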

    Possible regression from https://github.com/bits-and-blooms/bitset/pull/103

  • Is one hash enough?

    Hi Will, is one hash really enough for a bloom filter?

    Using one is efficient, but if fnv(A) and fnv(B) collide then surely the multiplication you're doing will simply spread the collision across all the buckets in the same way? That doesn't provide any net benefit to avoiding false-positives by increasing the number of hashes.

    Non-FNV collisions (the same bucket chosen from different hashes) do still benefit, though, and there will be more of those.

    I'm interested in the rationale. Thanks for sharing your implementation too, I'm grateful.

    Cheers, Bruce

  • Issue when importing package from master branch

    When I import bloom from the master branch with go get -u github.com/bits-and-blooms/bloom@master, an error appears saying the bitset package has an invalid go.mod: module github.com/willf/bitset.

  • Heap-free hashing

    There is no good reason why hashing a block of data should require heap allocation or even a non-trivial dynamic buffer.

    Unfortunately, the murmur library makes this a bit difficult. Thankfully, we can just roll our own thin functions. It is possible to do while preserving exactly the same hash values. That's what this PR does.

    The goal of this PR is not to improve speed. Of course, it does so... On my machine...

    Before:

    BenchmarkEstimated
    BenchmarkEstimated-8                    1000000000               0.1333 ns/op          0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            2706183               508.2 ns/op           194 B/op          4 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            3312842               437.6 ns/op            97 B/op          2 allocs/op
    PASS
    

    After...

    BenchmarkEstimated-8                    1000000000               0.05285 ns/op         0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            6219651               451.9 ns/op             0 B/op          0 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            5010103               393.2 ns/op             0 B/op          0 allocs/op
    PASS
    

    So we see about a 10% boost in performance in this instance on my system. Results will vary.

    The important part is that it does away entirely with heap allocation.

    The PR includes tests to ensure that the hash values are preserved.

    Note that if we break backward compatibility (with exact hash value reproduction), there are many great optimization opportunities. But that should be addressed in other PRs.

    Related to PRs https://github.com/willf/bloom/pull/51 and https://github.com/willf/bloom/pull/57

  • Remove the side effect of the `EstimateFalsePositiveRate` function

    This pull request includes three changes.

    • Remove the side effect of the EstimateFalsePositiveRate function: by using the Copy function, the estimate no longer modifies the current object.
    • Remove the now-redundant TestOrAdd function. In TestAndAdd, when f.b.Test(l) returns true, executing f.b.Set(l) or not makes no difference, so calling f.b.Set(l) only when the test returns false saves a step. With that change, TestAndAdd becomes the same as TestOrAdd.
    • Update the .gitignore file to ignore the .idea directory.
  • Backward compatibility break: test failures on s390x

    With version 3.0.1 I'm seeing TestHashBasic and TestHashRandom fail on s390x:

    --- FAIL: TestHashBasic (0.00s)
        murmur_test.go:29: Backward compatibillity break.
        murmur_test.go:29: Backward compatibillity break.
    [...]
    --- FAIL: TestHashRandom (0.02s)
        murmur_test.go:60: Backward compatibillity break.
        murmur_test.go:60: Backward compatibillity break.
    [...]
    FAIL
    exit status 1
    FAIL	github.com/willf/bloom	0.217s
    

    You can see the full output in https://koji.fedoraproject.org/koji/taskinfo?taskID=73367600 (look at build.log).

  • This improves slightly the documentation.

    The TestAndAdd and TestOrAdd methods differ in a small manner that might not always be clear to the user. This PR tries to improve the comments/documentation.
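
    For readers following along, my understanding of the difference (a sketch, not the package's documentation):

    // TestAndAdd: reports whether data was (probably) already present, then adds it unconditionally.
    wasPresent := filter.TestAndAdd([]byte("key"))

    // TestOrAdd: reports whether data was (probably) already present, and adds it only if it was not.
    wasPresent = filter.TestOrAdd([]byte("key"))
    _ = wasPresent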

    This relates to PR https://github.com/bits-and-blooms/bloom/pull/81 by @SimFG

    I am honestly not sure what TestAndAdd is good for given that we have TestOrAdd. It was added by @chkno in 2013...

    https://github.com/bits-and-blooms/bloom/commit/7924c81d327f2e9b2bfef5376106c7064e4be7fe

    Meanwhile TestOrAdd is a more recent addition (2021?) by @moredure ...

    https://github.com/bits-and-blooms/bloom/commit/dd9e96df53ad3229c2693b8b0fa58ae1351749e3

    It might be time to do away with TestAndAdd.

  • 2x 128 bit hash instead of 2x 64 bit hash. Why?

    I noticed that for computing multiple hashes you make use of the work in Less Hashing, Same Performance, where gi(x) = h1(x) + i*h2(x). For that you generate a 256-bit hash partitioned into 4 uint64s, so I wonder why you decided on a 256-bit hash partitioned into 4 uint64s instead of a 128-bit hash partitioned into 2 uint64s 🤔 Wouldn't it be equivalent in terms of hashing, but with a performance improvement?
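
    For reference, the construction under discussion derives all k bit locations from two base hash values; a sketch of the textbook form (this is not the package's exact indexing scheme):

    // Kirsch–Mitzenmacher double hashing: g_i(x) = h1(x) + i*h2(x) mod m,
    // where h1 and h2 can be the two 64-bit halves of a single 128-bit hash.
    func locations(h1, h2, k, m uint64) []uint64 {
        locs := make([]uint64, k)
        for i := uint64(0); i < k; i++ {
            locs[i] = (h1 + i*h2) % m
        }
        return locs
    }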
