Go library implementing xor filters

xorfilter: Go library implementing xor filters

GoDoc Build Status

Bloom filters are used to quickly check whether an element is part of a set. Xor filters are a faster and more concise alternative to Bloom filters. They are also smaller than cuckoo filters.

We are assuming that your set is made of 64-bit integers. If you have strings or other data structures, you need to hash them first to a 64-bit integer. It is not important to have a good hash function, but collision should be unlikely (~1/2^64).

The current implementation has a false positive rate of about 0.3% and a memory usage of less than 9 bits per entry for sizeable sets.

You construct the filter as follows starting from a slice of 64-bit integers:

filter,_ := xorfilter.Populate(keys) // keys is of type []uint64

It returns an object of type Xor8. The 64-bit integers would typically be hash values of your objects.

You can then query it as follows:

filter.Contains(v) // v is of type uint64

It will always return true if v was part of the initial construction (Populate) and almost always return false otherwise.

An xor filter is immutable, it is concurrent. The expectation is that you build it once and use it many times.

Though the filter itself does not use much memory, the construction of the filter needs many bytes of memory per set entry.

For persistence, you only need to serialize the following data structure:

type Xor8 struct {
	Seed         uint64
	BlockLength  uint32
	Fingerprints []uint8
}

Duplicate keys

When constructing the filter, you should ensure that there is no duplicate keys. If you think that this might happen, then you should check the error condition.

filter,err := xorfilter.Populate(keys) // keys is of type []uint64
if err != nil {
	// you have duplicate keys, de-duplicate them?
}

Effectively, an error is returned when the filter could not be build after MaxIterations iterations (default to 100).

Implementations of xor filters in other programming languages

Comments
  • Implement the compressed xor+

    Implement the compressed xor+

    I noticed your paper mentioned xor+ but didn't see any mention or implementation of compression. Just curious if you intend to provide that here as well?

  • Cut releases using tags

    Cut releases using tags

    Would you be able to cut release tags so that tooling like dependabot or renovate will not propose an upgrade on every commit?

    You could use something like the really nifty release-please to automate that process along with a nice changelog.

  • chore: Configure release-please

    chore: Configure release-please

    Related to https://github.com/FastFilter/xorfilter/issues/25

    Release-Please will create a running PR with all commits made since the last release. Once you're ready to cut a new release, just merge the PR.

    One small caveat is that commits will only show up in the changelog if they follow semantic commit message guidelines (e.g. start with fix/feat/chore. Release-please uses them to determine what kind of version bump it should do. Very useful is that you can mark breaking changes with a bang, e.g. fix!: bla and it will do a major bump. To make sure the commits follow this, i added an action for linting the PR title (assuming you squash-merge PRs).

    Do you want to start with a specific version number? E.g. 0.1.0 or something like that?

  • 3 points on 2 segments Xor8Plus variant

    3 points on 2 segments Xor8Plus variant

    This PR is to demonstrate alternative points placement to save memory lookup in Xor8Plus variant.

    It trades a bit of BitsPerValue (9.17 vs 9.13 for 3 segment) of a Xor8Plus variant for collocating first two lookups in a single cache line.

    (this PR contains 1 commit over #10)

  • add a test which shows binary fuse filter duplicate issue

    add a test which shows binary fuse filter duplicate issue

    This test shows the issue I reported in #30, with just 121 keys I encounter:

    --- FAIL: Test_DuplicateKeysBinaryFuseDup_Issue30 (0.00s)
        binaryfusefilter_test.go:261: Unexpected error: too many iterations, you probably have duplicate keys
    

    Signed-off-by: Stephen Gutekanst [email protected]

  • chore(master): release 0.1.1

    chore(master): release 0.1.1

    :robot: I have created a release beep boop

    0.1.1 (2022-01-08)

    Bug Fixes

    • Use correct default branch name for release-please (6785e2b)

    This PR was generated with Release Please. See documentation.

  • `PopulateBinaryFuse8` returns `too many iterations` for keys without duplicates

    `PopulateBinaryFuse8` returns `too many iterations` for keys without duplicates

    I ran into a case where PopulateBinaryFuse8 returned too many iterations, you probably have duplicate keys for a set of 11501 keys that doesn't contain any duplicates. I found that increasing MaxIterations to 217 allowed the filter to be built successfully.

    I was able to reproduce the issue in an existing test by setting NUM_KEYS keys to 11500 and running the tests multiple times with the -count option: i.e. go test -run TestBinaryFuse8Basic -count 20. This makes it fail reliably for me, whereas with the original NUM_KEYS or with some lower number like 10000 it never fails.

  • Golang Interface in place for Source64 and Hash64?

    Golang Interface in place for Source64 and Hash64?

    Hi,

    First, Thank you for your paper and implementation/s!

    I'm wondering if it's possible to re-write this that would accept a more idiomatic 'Go' way? say, accepting Source64 and Hash64 instead of using SplitMix64 and Murmur (with that being said, I'm wondering if the random and hash functions can also be changed, say using Mersenne or XoRoShiRo generators and/or T1ha or Metro for the hash). I'm not sure if the literature mentions anything specific random generators but I know you specifically chose Murmur for the hash (in which case, Hash64 interface isn't necessary).

  • implement 8-bit fuse xor filters

    implement 8-bit fuse xor filters

    This commit adds a Fuse8 filter, a xor filter with 8-bit fingerprints that uses the fuse graph data structure to achieves better space ratios than the Xor8 filter (and is probably faster to populate). The implementation is very similar to the C one, which was added in https://github.com/FastFilter/xor_singleheader/commit/4392b5e03f82037fa61e70cf0ca824ecbe038a65.

    Fuse8 filters require a large number of keys (> 100_000), but have the same false positive percentage as Xor8 filters (< 0.4%) while using much less space (~9.1 bits/entry as compared to Xor8's ~9.9 bits/entry).

    16-bit fingerprint versions of the xor filter using fuse graphs should be trivial to implement.

    Closes #8

  • permit empty filter construction

    permit empty filter construction

    PopulateBinaryFuse8([]uint64{}) currently panics. It should return an error or preferably just construct a trivial tiny filter. Adding if size == 0 { size = 1 } is sufficient to prevent panicking.

  • [Proposal] Improve xorfilter.go codebase

    [Proposal] Improve xorfilter.go codebase

    I would like to refactor the xorfilter.go to improve code quality & readability by doing the following:

    1. Separate the struct definitions & hashing functions to other files so that it can be made more modular.

    2. Reduce the number of lines by combining the same procedures to its own function so that it can be reused easily.

    3. Probably add more comments, tests, and benchmarks along the way.

    Is there anything else that can be added?

  • binaryfusefilter cannot handle some sets of large, duplicate keys

    binaryfusefilter cannot handle some sets of large, duplicate keys

    My understanding is that after https://github.com/FastFilter/xorfilter/pull/22 duplicate keys should be supported in binary fuse filters. I am working with uint64 keys that are CRC64 checksums (so quite large numbers) and am regularly finding sets that result in:

    Unexpected error: too many iterations, you probably have duplicate keys
    

    I've managed to reduce one such set down to just 121 keys, I will send a PR to add a test for this. Is it possible to fix this, or are there limitations to the duplicate key checking in some way?

Go package implementing Bloom filters

Bloom filters A Bloom filter is a concise/compressed representation of a set, where the main requirement is to make membership queries; i.e., whether

Dec 28, 2022
Sync distributed sets using bloom filters

goSetReconciliation An implementation to sync distributed sets using bloom filters. Based on the paper "Low complexity set reconciliation using Bloom

Jan 4, 2022
Go package implementing bitsets

bitset Go language library to map between non-negative integers and boolean values Description Package bitset implements bitsets, a mapping between no

Jan 1, 2023
Go package implementing an indexable ordered multimap

PACKAGE package skiplist import "github.com/glenn-brown/skiplist" Package skiplist implements fast indexable ordered multimaps. This sk

Jul 2, 2022
A small flexible merge library in go
A small flexible merge library in go

conjungo A merge utility designed for flexibility and customizability. The library has a single simple point of entry that works out of the box for mo

Dec 27, 2022
Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Go-edlib : Edit distance and string comparison library Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Ham

Dec 20, 2022
Go native library for fast point tracking and K-Nearest queries

Geo Index Geo Index library Overview Splits the earth surface in a grid. At each cell we can store data, such as list of points, count of points, etc.

Dec 3, 2022
Data structure and algorithm library for go, designed to provide functions similar to C++ STL

GoSTL English | 简体中文 Introduction GoSTL is a data structure and algorithm library for go, designed to provide functions similar to C++ STL, but more p

Dec 26, 2022
Zero allocation Nullable structures in one library with handy conversion functions, marshallers and unmarshallers

nan - No Allocations Nevermore Package nan - Zero allocation Nullable structures in one library with handy conversion functions, marshallers and unmar

Dec 20, 2022
A Go library for an efficient implementation of a skip list: https://godoc.org/github.com/MauriceGit/skiplist
A Go library for an efficient implementation of a skip list: https://godoc.org/github.com/MauriceGit/skiplist

Fast Skiplist Implementation This Go-library implements a very fast and efficient Skiplist that can be used as direct substitute for a balanced tree o

Dec 30, 2022
Go Library [DEPRECATED]

Tideland Go Library Description The Tideland Go Library contains a larger set of useful Google Go packages for different purposes. ATTENTION: The cell

Nov 15, 2022
an R-Tree library for Go

rtreego A library for efficiently storing and querying spatial data in the Go programming language. About The R-tree is a popular data structure for e

Jan 3, 2023
Golang library for reading and writing Microsoft Excel™ (XLSX) files.
Golang library for reading and writing Microsoft Excel™ (XLSX) files.

Excelize Introduction Excelize is a library written in pure Go providing a set of functions that allow you to write to and read from XLSX / XLSM / XLT

Jan 9, 2023
Golang library for querying and parsing OFX

OFXGo OFXGo is a library for querying OFX servers and/or parsing the responses. It also provides an example command-line client to demonstrate the use

Nov 25, 2022
Go (golang) library for reading and writing XLSX files.

XLSX Introduction xlsx is a library to simplify reading and writing the XML format used by recent version of Microsoft Excel in Go programs. Tutorial

Jan 5, 2023
indexing library for Go

Bluge modern text indexing in go - blugelabs.com Features Supported field types: Text, Numeric, Date, Geo Point Supported query types: Term, Phrase, M

Jan 3, 2023
A feature complete and high performance multi-group Raft library in Go.
A feature complete and high performance multi-group Raft library in Go.

Dragonboat - A Multi-Group Raft library in Go / 中文版 News 2021-01-20 Dragonboat v3.3 has been released, please check CHANGELOG for all changes. 2020-03

Jan 5, 2023
Library for hashing any Golang interface

recursive-deep-hash Library for hashing any Golang interface Making huge struct comparison fast & easy How to use package main import ( "fmt" "git

Mar 3, 2022
The Go library that will drive you to AOP world!

Beyond The Golang library that will drive you to the AOP paradigm world! Check Beyond Documentation What's AOP? In computing, aspect-oriented programm

Dec 6, 2022