Cuckoo Filter: Practically Better Than Bloom

Cuckoo Filter

GoDoc CodeHunt.io

Cuckoo filter is a Bloom filter replacement for approximated set-membership queries. While Bloom filters are well-known space-efficient data structures to serve queries like "if item x is in a set?", they do not support deletion. Their variances to enable deletion (like counting Bloom filters) usually require much more space.

Cuckoo filters provide the flexibility to add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (and therefore named as cuckoo filter). It is essentially a cuckoo hash table storing each key's fingerprint. Cuckoo hash tables can be highly compact, thus a cuckoo filter could use less space than conventional Bloom filters, for applications that require low false positive rates (< 3%).

For details about the algorithm and citations please use this article for now

"Cuckoo Filter: Better Than Bloom" by Bin Fan, Dave Andersen and Michael Kaminsky

Implementation details

The paper cited above leaves several parameters to choose. In this implementation

  1. Every element has 2 possible bucket indices
  2. Buckets have a static size of 4 fingerprints
  3. Fingerprints have a static size of 8 bits

1 and 2 are suggested to be the optimum by the authors. The choice of 3 comes down to the desired false positive rate. Given a target false positive rate of r and a bucket size b, they suggest choosing the fingerprint size f using

f >= log2(2b/r) bits

With the 8 bit fingerprint size in this repository, you can expect r ~= 0.03. Other implementations use 16 bit, which correspond to a false positive rate of r ~= 0.0001.

Example usage:

package main

import "fmt"
import "github.com/seiflotfy/cuckoofilter"

func main() {
  cf := cuckoo.NewFilter(1000)
  cf.InsertUnique([]byte("geeky ogre"))

  // Lookup a string (and it a miss) if it exists in the cuckoofilter
  cf.Lookup([]byte("hello"))

  count := cf.Count()
  fmt.Println(count) // count == 1

  // Delete a string (and it a miss)
  cf.Delete([]byte("hello"))

  count = cf.Count()
  fmt.Println(count) // count == 1

  // Delete a string (a hit)
  cf.Delete([]byte("geeky ogre"))

  count = cf.Count()
  fmt.Println(count) // count == 0
  
  cf.Reset()    // reset
}

Documentation:

"Cuckoo Filter on GoDoc"

Owner
Seif Lotfy
eat, sleep, , repeat
Seif Lotfy
Comments
  • Efficiency

    Efficiency

    Cuckoo filter should be memory and CPU efficient. Using "slice" for buckets and fingerprints destroys both cause sizeof(slice)==24byte and every slice is a pointer inderection.

  • ../pkg/mod/github.com/seiflotfy/cuckoofilter@v0.0.0-20200323075608-c8f23b6b6cef/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)

    ../pkg/mod/github.com/seiflotfy/[email protected]/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)

    Hi everyone, I am working on a project called infinicache, which uses this filter. The system is composed from a Client a Proxy and some Nodes. The client is the one which uses the filter. I modified some code on the proxy side (very little modifications, I just added a field to a structure) and is 2 days now since the cuckoo filter gives problems. Basically when I run the client file, I obtain this.

    $ go run client/example/main.go
    # github.com/seiflotfy/cuckoofilter
    ../pkg/mod/github.com/seiflotfy/[email protected]/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)
    

    Given that I am not a pro of Golang, I am do not understand very well which kind of error is it. For sure it talks about cuckoofilter, so for me it seems a compile error, but I am not sure about that. Basically I am not able to log anything to the screen when I run the file, so that's why I think is a compile errror. However, I didn't see any change on your master branch since 24 days, hence the library code should have not changed.

    Please help me in understanding why I have this error.

    Thanks in advance

  • added save/load to/from file for filter

    added save/load to/from file for filter

    Added Save and Load to the filter to write a filter to a file as gob and load it again. I had to export the fields in the CuckooFilter struct to make gob happy. Had to rename Count due to the Conflict with the function.

  • Effective Method Naming

    Effective Method Naming

    just an fyi, according to https://golang.org/doc/effective_go.html#Getters you can drop the Get in getters i.e. GetCount() becomes Count()

    Thanks for the package!

  • PR #36 broke this repo

    PR #36 broke this repo

    The patch modified the .mod -file so it is now referring the module as coming from @seven4x which breaks the import.

    go: github.com/seiflotfy/cuckoofilter: github.com/seiflotfy/[email protected]: parsing go.mod:
    	module declares its path as: github.com/seven4x/cuckoofilter
    	        but was required as: github.com/seiflotfy/cuckoofilter
    
  • Is thread-safe concurrent by default?

    Is thread-safe concurrent by default?

    Hi, thanks for the contribution. May I have a question that InsertUnique is concurrency safe? Thanks for looking.

    I had met this data race issue:

    ==================
    WARNING: DATA RACE
    Read at 0x00c0b79cac1c by goroutine 276:
      github.com/seiflotfy/cuckoofilter.(*bucket).getFingerprintIndex()
          /Users/wenwei/work/go/src/github.com/balabalabala/vendor/github.com/seiflotfy/cuckoofilter/bucket.go:33 +0x54
      github.com/seiflotfy/cuckoofilter.(*Filter).Lookup()
          /Users/wenwei/work/go/src/github.com/balabala/vendor/github.com/seiflotfy/cuckoofilter/cuckoofilter.go:37 +0xde
      github.com/seiflotfy/cuckoofilter.(*Filter).InsertUnique()
          /Users/wenwei/work/go/src/github.com/balabalavendor/github.com/seiflotfy/cuckoofilter/cuckoofilter.go:74 +0x5a
    
  • getAltIndex implementation problem

    getAltIndex implementation problem

    Given what I know about cuckoo filters, given a fingerprint f there should be two indices i1 and i2 such that i1 ^ f = i2.

    The problem is in util.go, here:

    func getAltIndex(fp byte, i uint, numBuckets uint) uint {
    	hash := uint(metro.Hash64([]byte{fp}, 1337))
    	return (i ^ hash) % numBuckets
    }
    

    For example, take f = 216, i = 8, and numBuckets = 100 (example taken when data = []byte("hello world"))

    Then getAltIndex(216, 8, 100) = 70

    One would expect that getAltIndex(216, 70, 100) would produce 8. However, this is not the case:

    getAltIndex(216, 70, 100) = 36

    I believe the problem is the implementation of getAltIndex. Basically, xor (^) and mod (%) do not commute, and the modulo operation needs to be done first. That is:

    func getAltIndex(fp byte, i uint, numBuckets uint) uint {
    	hash := (uint(metro.Hash64([]byte{fp}, 1337))) % numBuckets
    	return (i % numBuckets) ^ hash
    }
    

    This produces the desired result.

  • Reduce repetition in naming

    Reduce repetition in naming

    Not a big deal, but cuckoofilter.NewCuckooFilter() is a bit repetitive for my taste -- would you be open to renaming the types/functions to New() and NewDefault() (not sure if NewDefault is really necessary either, if you want to shrink your API surface a bit) and Filter?

    In fact, the package could be renamed to just cuckoo and then you'd have cuckoo.NewFilter() and cuckoo.Filter which is a little more natural.

    Just my 2 cents!

  • Algorithm improvements

    Algorithm improvements

    I have a test that checks a space utilization:

    	cf := cuckoofilter.NewCuckooFilter(cap)
    	for i := uint(0); i < cap; i++ {
    		ok := cf.Insert(randomBytes())
    		if !ok {
    			fmt.Printf("Error on %v/%v\n", i, cap)
    		}
    	}
    

    With unmodified version of the cuckoofilter I have an error on 64073/1048576. It means that reinsert failed after 6% entries occupied.

    With this PR I have no errors up to 96% space utilized. I believe that the problem is the similar hashing algorithms of fingerprint and i1. Your version is trying to reinsert fp to the same bucket and after maxCuckooCount it fails.

    Also I removed the unnecessary lock in getFingerprint because it affected the use of multiple instances of the filter and didn't make the filter thread-safe. I think that lock should be in a filter object and lock/unlock should protect each public method.

  • fixed imports and generalized test

    fixed imports and generalized test

    /word is more common between linux distros. It contains ~10k words, should be sufficient. Not sure if the number of words is consistent so I replaced the hard-coded value with a counter.

  • warning message when get source to local

    warning message when get source to local

    go version is "go1.5beta2 darwin/amd64"

    //util.go
    warning: code.google.com is shutting down; import path code.google.com/p/gofarmhash will stop working
    warning: package github.com/seiflotfy/cuckoofilter
        imports code.google.com/p/gofarmhash
    

    prefer using github gofarmhash Thank you for makes this project.

  • undo changes when filter is full

    undo changes when filter is full

    fixes #42

    I undo the changes whenever inserting is not possible. However, the correct way to address this is to rebuild the filter with a different hash function which, I believe, requires major changes to the structure of the project.

  • Element in cuckoo filter may lose when inserting after the filter is full

    Element in cuckoo filter may lose when inserting after the filter is full

    The Insert method in cuckoofilter behave unexpected when the filter is full. In general, insert method should return true when it can store the element, and return false when it can't store. But when the cuckoo filter is full, insert method will store the element, and withdraw a random one inside filter while returning false.

    I make a simple test about this:

    func TestExhausted(t *testing.T){
    	ckflter := seiflotfy_filter.NewFilter(10000)
    	var cached [maxNumForBenchmark]bool
    	elementList := make([]int,0)
    	var lastElement int
    
    	isFinish := false
    	for !isFinish{
    		randNum := rand.Intn(maxNumForBenchmark)
    		for cached[randNum]{
    			randNum = rand.Intn(maxNumForBenchmark)
    		}
    
    		finish := ckflter.Insert([]byte(strconv.Itoa(randNum)))
    		if !finish{
    			t.Logf("Last element is %v",randNum)
    			lastElement = randNum
    			isFinish = true
    			break
    		}else{
    			cached[randNum] = true
    			elementList = append(elementList,randNum)
    		}
    	}
    
    	for i:=0;i<len(elementList);i++{
    		isInside := ckflter.Lookup([]byte(strconv.Itoa(elementList[i])))
    		if !isInside{
    			t.Errorf("%v should inside but not",elementList[i])
    		}
    	}
    	isInside := ckflter.Lookup([]byte(strconv.Itoa(lastElement)))
    	t.Logf("%v should not inside but got %v",lastElement,isInside)
    }
    

    The output of code above is

    === RUN   TestExhausted
    --- FAIL: TestExhausted (0.00s)
        standardCuckooFilter_test.go:308: Last element is 1776
        standardCuckooFilter_test.go:321: 16055 should inside but not
        standardCuckooFilter_test.go:325: 1776 should not inside but got true
    FAIL
    
    Process finished with exit code 1
    

    My solution of this is adding a backup array for all buckets in cuckoo filter, and recover when insertion failure. But this costs a lot when insertion doesn't fail.

  • Why force power of 2 for capacity?

    Why force power of 2 for capacity?

    Perhaps I am not seeing it in the paper, but why force the capacity up to the next power of 2? When rebuilding as in the scalable filter, this decreases the potential load factor.

  • Great work! Can you use xxh3?

    Great work! Can you use xxh3?

    https://github.com/seiflotfy/cuckoofilter/blob/master/util.go

    This one instead of metrohash. https://github.com/zeebo/xxh3

    Will wait for your update. It's great piece of software.

    1. By the way, what's the recommended size of NewFilter(1000000) <- what do you suggest? and roughly how much memory is taken by increasing this value?

    Sorry I'm a bit dense on this cuckoo filter thing. Can you suggest a value for NewFilter?

    I would like to perform matching against 16mil ip addresses.

    1. What is a better use case for panmari 16bit cuckoo filter u mentioned? wouldnt everyone want a lower false positive match?
  • Adversarial resistance

    Adversarial resistance

    Are the cuckoo filters implemented in this library resistant to adversarial attempts to induce false positives on chosen elements as described in this paper?

    Thanks for reading and thank you for this implementation!

  • Faster insert when the filter is filled

    Faster insert when the filter is filled

    Profiling a bit this package, I found that about 75% of the time spend for Insert is calling rand.Intn(bucketSize). This is with an almost empty filter so I expect it's getting worse as the it fill.

    I expect the requirement for randomness here is fairly low (it's just drawing a number between 0 and 3 to not do the same thing each time), there should be ways to do that much faster, and especially without mutex locking as rand.Intn has.

A simple Bloom Filter implementation in Go

This is a simple Bloom filter implementation written in Go. For the theory behind Bloom filters, read http://en.wikipedia.org/wiki/Bloom_filter This

Apr 26, 2018
go/golang: fast bit set Bloom filter

package implements a fast bloom filter with real 'bitset' and JSONMarshal/JSONUnmarshal to store/reload the Bloom filter.

Nov 3, 2022
Leaked password check library with bloom filter

Leaked password check With this library you can check the password is probably leaked or not. Pre generated bitset DB includes 6 Million leaked passwo

Aug 24, 2022
A bloom filter is a probabilistic data structure that is used to determine if an element is present in a set

A Go implementation of a bloom filter, with support for boltdb and badgerdb as optional in-memory persistent storage.

Jul 4, 2022
A Go implementation of an in-memory bloom filter, with support for boltdb and badgerdb as optional data persistent storage.

Sprout A bloom filter is a probabilistic data structure that is used to determine if an element is present in a set. Bloom filters are fast and space

Jul 4, 2022
My clean Go solution that's faster than 100%, takes up less memory than 100%.

Remove-element My very clean Go solution that's faster than 100% of all solutions on Leetcode. Leetcode Challenge: "Given an integer array nums and an

Dec 24, 2021
Go package implementing Bloom filters

Bloom filters A Bloom filter is a representation of a set of n items, where the main requirement is to make membership queries; i.e., whether an item

Dec 30, 2022
Go package implementing Bloom filters

Bloom filters A Bloom filter is a concise/compressed representation of a set, where the main requirement is to make membership queries; i.e., whether

Dec 28, 2022
Sync distributed sets using bloom filters

goSetReconciliation An implementation to sync distributed sets using bloom filters. Based on the paper "Low complexity set reconciliation using Bloom

Jan 4, 2022
skipmap is a high-performance concurrent sorted map based on skip list. Up to 3x ~ 10x faster than sync.Map in the typical pattern.
skipmap is a high-performance concurrent sorted map based on skip list. Up to 3x ~ 10x faster than sync.Map in the typical pattern.

Introduction skipmap is a high-performance concurrent map based on skip list. In typical pattern(one million operations, 90%LOAD 9%STORE 1%DELETE), th

Jan 8, 2023
TBH, there are probably better/faster implementations out there.

Scott's Skiplist TBH, there are probably better/faster implementations out there. See https://github.com/MauriceGit/skiplist-survey I wrote this one b

Feb 10, 2022
cuckoo-filter go implement. config by you 布谷鸟过滤器的Go实现,可以定制化过滤器参数

Cuckoo filter is a Bloom filter replacement for approximated set-membership queries. While Bloom filters are well-known space-efficient data structures to serve queries like "if item x is in a set?", they do not support deletion. Their variances to enable deletion (like counting Bloom filters) usually require much more space.

Jan 1, 2023
Package ring provides a high performance and thread safe Go implementation of a bloom filter.

ring - high performance bloom filter Package ring provides a high performance and thread safe Go implementation of a bloom filter. Usage Please see th

Nov 20, 2022
A simple Bloom Filter implementation in Go

This is a simple Bloom filter implementation written in Go. For the theory behind Bloom filters, read http://en.wikipedia.org/wiki/Bloom_filter This

Apr 26, 2018
go/golang: fast bit set Bloom filter

package implements a fast bloom filter with real 'bitset' and JSONMarshal/JSONUnmarshal to store/reload the Bloom filter.

Nov 3, 2022
Leaked password check library with bloom filter

Leaked password check With this library you can check the password is probably leaked or not. Pre generated bitset DB includes 6 Million leaked passwo

Aug 24, 2022
A bloom filter is a probabilistic data structure that is used to determine if an element is present in a set

A Go implementation of a bloom filter, with support for boltdb and badgerdb as optional in-memory persistent storage.

Jul 4, 2022
A Go implementation of an in-memory bloom filter, with support for boltdb and badgerdb as optional data persistent storage.

Sprout A bloom filter is a probabilistic data structure that is used to determine if an element is present in a set. Bloom filters are fast and space

Jul 4, 2022
My highly maintainable Go solution that's faster than 94%, takes up less memory than 100%

Longest-Palindromic-Substring My simple Go solution that's faster than 94%, takes up less memory than 100% Given a string s, return the longest palind

Dec 1, 2021
My clean Go solution that's faster than 100%, takes up less memory than 100%.

Remove-element My very clean Go solution that's faster than 100% of all solutions on Leetcode. Leetcode Challenge: "Given an integer array nums and an

Dec 24, 2021