Cuckoo Filter: Practically Better Than Bloom

Last update: Jan 4, 2023

Comments: 17

Cuckoo Filter

Cuckoo filter is a Bloom filter replacement for approximated set-membership queries. While Bloom filters are well-known space-efficient data structures to serve queries like "if item x is in a set?", they do not support deletion. Their variances to enable deletion (like counting Bloom filters) usually require much more space.

Cuckoo ﬁlters provide the ﬂexibility to add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (and therefore named as cuckoo filter). It is essentially a cuckoo hash table storing each key's fingerprint. Cuckoo hash tables can be highly compact, thus a cuckoo filter could use less space than conventional Bloom ﬁlters, for applications that require low false positive rates (< 3%).

For details about the algorithm and citations please use this article for now

"Cuckoo Filter: Better Than Bloom" by Bin Fan, Dave Andersen and Michael Kaminsky

Implementation details

The paper cited above leaves several parameters to choose. In this implementation

Every element has 2 possible bucket indices
Buckets have a static size of 4 fingerprints
Fingerprints have a static size of 8 bits

1 and 2 are suggested to be the optimum by the authors. The choice of 3 comes down to the desired false positive rate. Given a target false positive rate of r and a bucket size b, they suggest choosing the fingerprint size f using

f >= log2(2b/r) bits

With the 8 bit fingerprint size in this repository, you can expect r ~= 0.03. Other implementations use 16 bit, which correspond to a false positive rate of r ~= 0.0001.

Example usage:

package main

import "fmt"
import "github.com/seiflotfy/cuckoofilter"

func main() {
  cf := cuckoo.NewFilter(1000)
  cf.InsertUnique([]byte("geeky ogre"))

  // Lookup a string (and it a miss) if it exists in the cuckoofilter
  cf.Lookup([]byte("hello"))

  count := cf.Count()
  fmt.Println(count) // count == 1

  // Delete a string (and it a miss)
  cf.Delete([]byte("hello"))

  count = cf.Count()
  fmt.Println(count) // count == 1

  // Delete a string (a hit)
  cf.Delete([]byte("geeky ogre"))

  count = cf.Count()
  fmt.Println(count) // count == 0
  
  cf.Reset()    // reset
}

Documentation:

"Cuckoo Filter on GoDoc"

Owner

Seif Lotfy

eat, sleep,

, repeat



https://github.com/seiflotfy/cuckoofilter



Comments












Efficiency



Cuckoo filter should be memory and CPU efficient.
Using "slice" for buckets and fingerprints destroys both cause sizeof(slice)==24byte and every slice is a pointer inderection.














../pkg/mod/github.com/seiflotfy/[email protected]/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)



Hi everyone, I am working on a project called infinicache, which uses this filter. The system is composed from a Client a Proxy and some Nodes. The client is the one which uses the filter. I modified some code on the proxy side (very little modifications, I just added a field to a structure) and is 2 days now since the cuckoo filter gives problems. Basically when I run the client file, I obtain this.
$ go run client/example/main.go
# github.com/seiflotfy/cuckoofilter
../pkg/mod/github.com/seiflotfy/[email protected]/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)

Given that I am not a pro of Golang, I am do not understand very well which kind of error is it. For sure it talks about cuckoofilter, so for me it seems a compile error, but I am not sure about that. Basically I am not able to log anything to the screen when I run the file, so that's why I think is a compile errror. However, I didn't see any change on your master branch since 24 days, hence the library code should have not changed.
Please help me in understanding why I have this error.
Thanks in advance














added save/load to/from file for filter



Added Save and Load to the filter to write a filter to a file as gob and load it again. I had to export the fields in the CuckooFilter struct to make gob happy. Had to rename Count due to the Conflict with the function.














Effective Method Naming



just an fyi, according to https://golang.org/doc/effective_go.html#Getters you can drop the Get in getters i.e. GetCount() becomes Count()
Thanks for the package!














PR #36 broke this repo



The patch modified the .mod -file so it is now referring the module as coming from @seven4x which breaks the import.
go: github.com/seiflotfy/cuckoofilter: github.com/seiflotfy/[email protected]: parsing go.mod:
	module declares its path as: github.com/seven4x/cuckoofilter
	        but was required as: github.com/seiflotfy/cuckoofilter















Is thread-safe concurrent by default?



Hi, thanks for the contribution.
May I have a question that InsertUnique is concurrency safe? Thanks for looking.
I had met this data race issue:
==================
WARNING: DATA RACE
Read at 0x00c0b79cac1c by goroutine 276:
  github.com/seiflotfy/cuckoofilter.(*bucket).getFingerprintIndex()
      /Users/wenwei/work/go/src/github.com/balabalabala/vendor/github.com/seiflotfy/cuckoofilter/bucket.go:33 +0x54
  github.com/seiflotfy/cuckoofilter.(*Filter).Lookup()
      /Users/wenwei/work/go/src/github.com/balabala/vendor/github.com/seiflotfy/cuckoofilter/cuckoofilter.go:37 +0xde
  github.com/seiflotfy/cuckoofilter.(*Filter).InsertUnique()
      /Users/wenwei/work/go/src/github.com/balabalavendor/github.com/seiflotfy/cuckoofilter/cuckoofilter.go:74 +0x5a















getAltIndex implementation problem



Given what I know about cuckoo filters, given a fingerprint f there should be two indices i1 and i2 such that i1 ^ f = i2.
The problem is in util.go, here:
func getAltIndex(fp byte, i uint, numBuckets uint) uint {
	hash := uint(metro.Hash64([]byte{fp}, 1337))
	return (i ^ hash) % numBuckets
}

For example, take f = 216, i = 8, and numBuckets = 100 (example taken when data = []byte("hello world"))
Then getAltIndex(216, 8, 100) = 70
One would expect that getAltIndex(216, 70, 100) would produce 8. However, this is not the case:
getAltIndex(216, 70, 100) = 36
I believe the problem is the implementation of getAltIndex. Basically, xor (^) and mod (%) do not commute, and the modulo operation needs to be done first. That is:
func getAltIndex(fp byte, i uint, numBuckets uint) uint {
	hash := (uint(metro.Hash64([]byte{fp}, 1337))) % numBuckets
	return (i % numBuckets) ^ hash
}

This produces the desired result.














Reduce repetition in naming



Not a big deal, but cuckoofilter.NewCuckooFilter() is a bit repetitive for my taste -- would you be open to renaming the types/functions to New() and NewDefault() (not sure if NewDefault is really necessary either, if you want to shrink your API surface a bit) and Filter?
In fact, the package could be renamed to just cuckoo and then you'd have cuckoo.NewFilter() and cuckoo.Filter which is a little more natural.
Just my 2 cents!














Algorithm improvements



I have a test that checks a space utilization:
	cf := cuckoofilter.NewCuckooFilter(cap)
	for i := uint(0); i < cap; i++ {
		ok := cf.Insert(randomBytes())
		if !ok {
			fmt.Printf("Error on %v/%v\n", i, cap)
		}
	}

With unmodified version of the cuckoofilter I have an error on 64073/1048576. It means that reinsert failed after 6% entries occupied.
With this PR I have no errors up to 96% space utilized. I believe that the problem is the similar hashing algorithms of fingerprint and i1. Your version is trying to reinsert fp to the same bucket and after maxCuckooCount it fails.
Also I removed the unnecessary lock in getFingerprint because it affected the use of multiple instances of the filter and didn't make the filter thread-safe. I think that lock should be in a filter object and lock/unlock should protect each public method.














fixed imports and generalized test



/word is more common between linux distros. It contains ~10k words, should be sufficient. Not sure if the number of words is consistent so I replaced the hard-coded value with a counter.














warning message when get source to local



go version is "go1.5beta2 darwin/amd64"
//util.go
warning: code.google.com is shutting down; import path code.google.com/p/gofarmhash will stop working
warning: package github.com/seiflotfy/cuckoofilter
    imports code.google.com/p/gofarmhash

prefer using github gofarmhash
Thank you for makes this project.














undo changes when filter is full



fixes #42
I undo the changes whenever inserting is not possible.
However, the correct way to address this is to rebuild the filter with a different hash function which, I believe, requires major changes to the structure of the project.














Element in cuckoo filter may lose when inserting after the filter is full



The Insert method in cuckoofilter behave unexpected when the filter is full. In general, insert method should return true when it can store the element, and return false when it can't store. But when the cuckoo filter is full, insert method will store the element, and withdraw a random one inside filter while returning false.
I make a simple test about this:
func TestExhausted(t *testing.T){
	ckflter := seiflotfy_filter.NewFilter(10000)
	var cached [maxNumForBenchmark]bool
	elementList := make([]int,0)
	var lastElement int

	isFinish := false
	for !isFinish{
		randNum := rand.Intn(maxNumForBenchmark)
		for cached[randNum]{
			randNum = rand.Intn(maxNumForBenchmark)
		}

		finish := ckflter.Insert([]byte(strconv.Itoa(randNum)))
		if !finish{
			t.Logf("Last element is %v",randNum)
			lastElement = randNum
			isFinish = true
			break
		}else{
			cached[randNum] = true
			elementList = append(elementList,randNum)
		}
	}

	for i:=0;i<len(elementList);i++{
		isInside := ckflter.Lookup([]byte(strconv.Itoa(elementList[i])))
		if !isInside{
			t.Errorf("%v should inside but not",elementList[i])
		}
	}
	isInside := ckflter.Lookup([]byte(strconv.Itoa(lastElement)))
	t.Logf("%v should not inside but got %v",lastElement,isInside)
}

The output of code above is
=== RUN   TestExhausted
--- FAIL: TestExhausted (0.00s)
    standardCuckooFilter_test.go:308: Last element is 1776
    standardCuckooFilter_test.go:321: 16055 should inside but not
    standardCuckooFilter_test.go:325: 1776 should not inside but got true
FAIL

Process finished with exit code 1

My solution of this is adding a backup array for all buckets in cuckoo filter, and recover when insertion failure. But this costs a lot when insertion doesn't fail.














Why force power of 2 for capacity?



Perhaps I am not seeing it in the paper, but why force the capacity up to the next power of 2? When rebuilding as in the scalable filter, this decreases the potential load factor.














Great work! Can you use xxh3?



https://github.com/seiflotfy/cuckoofilter/blob/master/util.go
This one instead of metrohash.
https://github.com/zeebo/xxh3
Will wait for your update. It's great piece of software.

By the way, what's the recommended size of NewFilter(1000000) <- what do you suggest? and roughly how much memory is taken by increasing this value?

Sorry I'm a bit dense on this cuckoo filter thing. Can you suggest a value for NewFilter?
I would like to perform matching against 16mil ip addresses.

What is a better use case for panmari 16bit cuckoo filter u mentioned? wouldnt everyone want a lower false positive match?















Adversarial resistance



Are the cuckoo filters implemented in this library resistant to adversarial attempts to induce false positives on chosen elements as described in this paper?
Thanks for reading and thank you for this implementation!














Faster insert when the filter is filled



Profiling a bit this package, I found that about 75% of the time spend for Insert is calling rand.Intn(bucketSize). This is with an almost empty filter so I expect it's getting worse as the it fill.
I expect the requirement for randomness here is fairly low (it's just drawing a number between 0 and 3 to not do the same thing each time), there should be ways to do that much faster, and especially without mutex locking as rand.Intn has.