HyperLogLog with lots of sugar (Sparse, LogLog-Beta bias correction and TailCut space reduction)

Hyperloglog Logo


GoDoc Go Report Card CircleCI

An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 33-50% less space than other usual HyperLogLog implementations.

This work is based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen".

Implementation

The core differences between this and other implementations are:

  • use metro hash instead of xxhash
  • sparse representation for lower cardinalities (like HyperLogLog++)
  • loglog-beta for dynamic bias correction medium and high cardinalities.
  • 4-bit register instead of 5 (HLL) and 6 (HLL++), but most implementations use 1-byte registers out of convenience

In general it borrows a lot from InfluxData's fork of Clark Duvall's HyperLogLog++ implementation, but uses 50% less space.

Results

A direct comparison with the HyperLogLog++ implementation used by InfluxDB yielded the following results:

Exact Axiom (8.2 KB) Influx (16.39 KB)
10 10 (0.0% off) 10 (0.0% off)
50 50 (0.0% off) 50 (0.0% off)
250 250 (0.0% off) 250 (0.0% off)
1250 1249 (0.08% off) 1249 (0.08% off)
6250 6250 (0.0% off) 6250 (0.0% off)
31250 31008 (0.7744% off) 31565 (1.0080% off)
156250 156013 (0.1517% off) 156652 (0.2573% off)
781250 782364 (0.1426% off) 775988 (0.6735% off)
3906250 3869332 (0.9451% off) 3889909 (0.4183% off)
10000000 9952682 (0.4732% off) 9889556 (1.1044% off)

Note

A big thank you to Prof. Shigang Chen and his team at the University of Florida who are actively conducting research around "Big Network Data".


An Axiom production.

Do you enjoy solving problems like these? If so, get in touch with us at [email protected]!

Comments
  • make new() thread safe

    make new() thread safe

    https://github.com/axiomhq/hyperloglog/blob/master/hyperloglog.go#L50 is not threadsafe (compile with -race and run in a multi-threaded program to detect it). this can be worked around by adding a package init function.

    func init() {
        hash = hashFunc
    }
    

    the new() function is not threadsafe because it is modifying a module level variable without locks

  • request: insert-hash method

    request: insert-hash method

    It would be useful to have an insert method that just takes the hash value directly. This would allow the caller to choose the hash function. It is particularly useful if our data is just integers - we could use some function that works on integers directly instead of converting them to []byte.

  • fatal error: concurrent map writes

    fatal error: concurrent map writes

    var Counter *hyperloglog.Sketch
    
    ...
    
    Counter = hyperloglog.New16()
    
    ...
    
    		bpk := make([]byte, 8)
    		binary.LittleEndian.PutUint64(bpk, pk)
    		Counter.Insert(bpk)
    
    fatal error: concurrent map writes
    
    goroutine 458 [running]:
    runtime.throw(0x788fc0, 0x15)
    	/Go/src/runtime/panic.go:1114 +0x79 fp=0xc005f75e48 sp=0xc005f75e18 pc=0x435c49
    runtime.mapassign_fast32(0x742f00, 0xc00813aba0, 0xc002276860, 0xa52e40)
    	/Go/src/runtime/map_fast32.go:101 +0x328 fp=0xc005f75e88 sp=0xc005f75e48 pc=0x411028
    github.com/axiomhq/hyperloglog.set.add(...)
    	/Go/bin/src/github.com/axiomhq/hyperloglog/sparse.go:44
    github.com/axiomhq/hyperloglog.(*Sketch).InsertHash(0xc0000615f0, 0x89da180ed69d24d0, 0x8)
    	/Go/bin/src/github.com/axiomhq/hyperloglog/hyperloglog.go:220 +0x117 fp=0xc005f75ed0 sp=0xc005f75e88 pc=0x6d4797
    github.com/axiomhq/hyperloglog.(*Sketch).Insert(0xc0000615f0, 0xc0020a7388, 0x8, 0x8, 0xc00848fa40)
    	/Go/bin/src/github.com/axiomhq/hyperloglog/hyperloglog.go:214 +0x65 fp=0xc005f75f00 sp=0xc005f75ed0 pc=0x6d4665
    main.BulkWorker(0x18ad287d)
    	/src/BulkRequest.go:92 +0x1cb fp=0xc005f75f58 sp=0xc005f75f00 pc=0x6da52b
    main.main.func1(0x70c3c0, 0xc00205eaa0)
    	/src/api.go:71 +0x46 fp=0xc005f75f80 sp=0xc005f75f58 pc=0x6dccd6
    github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc005d43290)
    	/Go/bin/src/github.com/panjf2000/ants/worker_func.go:68 +0xb7 fp=0xc005f75fd8 sp=0xc005f75f80 pc=0x6d9b07
    runtime.goexit()
    	/Go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc005f75fe0 sp=0xc005f75fd8 pc=0x4626f1
    created by github.com/panjf2000/ants.(*goWorkerWithFunc).run
    	/Go/bin/src/github.com/panjf2000/ants/worker_func.go:48 +0x53
    
  • Reduce size of new sparse sketch

    Reduce size of new sparse sketch

    A new instance of a sparse Sketch requires around 118 bytes of memory on amd64.

    type Sketch struct {
    	p          uint8                // 1 byte
    	b          uint8                // 1 byte
    	m          uint32               // 4 bytes
    	alpha      float64              // 8 bytes
    	tmpSet     set					// 56 bytes
    	sparseList *compressedList      // 40 bytes  (8 + 2 * 4 + 24 (pointer + 2 x uint32 + slice))
    	regs       *registers           // 8 bytes (nil pointer)
    	                                // 118 bytes total
    }
    

    The scenario I would like to optimize for is storing many sketches that contain few elements. In my case I have a more-or-less zipfian distribution of set cardinalities (few large ones and a long tail of small ones). 118 bytes is considerably more than, for example, Redis which uses 21 bytes for a HLL containing a single element. See my gist on that.

    Looking over the code today and previously I'm considering a few optimizations. @seiflotfy I'd like to get you opinion on some of them before I get my hands dirty.

    1. Remove tmpSet. This would already reduce the memory footprint by half. As far as I understand the code it's used to "cache" multiple elements inserted into a the sparse representation of the sketch. I believe the rationale here was to reduce the time spent on extending and traversing the compressed list. Lack of this cache can be mitigated for example by providing a InsertMany function instead.

    2. Remove 2 pointers to compressedList and registers. Both those types represent []uint8 slices with some additional metadata. That metadata could be rolled into the data structure. That's a 16 bytes saved for pointers. The tradeoff here is code readability.

    3. Don't precompute alpha and m. 12 bytes saving. A few CPU cycles are a tradeoff here.

    The above memory optimizations would amount to savings of 75 bytes bringing down the size of an empty sketch to 43 bytes.

    type Sketch struct {
    	sparse bool     // 1 byte - bringing it back since there is no *compressedList to match against
    	p      uint8    // 1 byte
    	b      uint8    // 1 byte
    	regs   []uint32 // 24 bytes
    	nz     uint32   // 4 bytes
    	count  uint32   // 4 bytes
    	last   uint32   // 4 bytes
    	                // 43 bytes total
    }
    

    What are your thoughts?

  • Remove redundant sparse bool

    Remove redundant sparse bool

    The Sketch.sparse bool is redundant. It can be computed by checking whether sparseList is nil.

    Go surprisingly does report any allocation size reductions:

    Benchmark_Size_New_Sparse-8   	 6900733	       167 ns/op	     128 B/op	       3 allocs/op
    

    However it does report that a bool itself is represented by 1 byte.

    	var b bool
    	fmt.Printf("size %d", unsafe.Sizeof(b))
    

    One byte might seem a small saving, but if millions or billions of HLLs are considered it can be a lot.

  • Sparse representation is too big

    Sparse representation is too big

    I'm counting unique metrics accross different dimensions, and the number of used HLLs can go up to several millions (e.g. number of unique IPs for each user agent or vice versa). It seems that even empty hyperloglog.Sketch will allocate at least 2^14 = 16384 bytes for the sparseList, which is quite a lot. Redis hyperloglog has way better memory footprint for small cardinalities - http://download.redis.io/redis-stable/src/hyperloglog.c I wonder if it will be possible to align / merge these two approaches.

  • Explicit public toNormal method

    Explicit public toNormal method

    We have thousands of distinct datasets that we merge together and the underlying Sketch's get Merge called to get a resulting Sketch. We're seeing a significant amount of time spent in maybeToNormal. image

    Would it be better for us to store all of the datasets with the "normal" Sketch instead of a sparse one? In other words, we'd call ToNormal (or something like that) before calling MarshalBinary. Would this reduce the time spent merging?

  • Add continuous integration

    Add continuous integration

    Hi @seiflotfy,

    As a contributor I'd be somehow more confident creating PRs if the project had CI integration.

    Does TravisCI sound good for you? I can make a PR with the changes.

  • Reduce size of sparse representation

    Reduce size of sparse representation

    Reduce size of sparse representation by not preallocating the compressed list's variableLengthList.

    I think this is the right trade-off to make. The sparse representation is intended to save memory. Preallocating the whole buffer ruins that.

    Consider a HLL of precision 16:

    Benchmark_Size_New_Sparse-8   	  210655	      6247 ns/op	   65664 B/op	       4 allocs/op
    

    with change:

    Benchmark_Size_New_Sparse-8   	 7066699	       172 ns/op	     128 B/op	       3 allocs/op
    

    Resolves #21

  • fix underflow of `registers.nz`

    fix underflow of `registers.nz`

    This PR includes:

    • added a test for sparse encoder and decoder
    • fixed registers.nz undlerflow
      • when this underflow occurred, base offset (sk.b) wouldn't increase anymore
      • sk.b was zero in most dense cases by this problem
    • fixed that sk.b wasn't be unmarshalized when precision wasn't matched
  • Add CircleCI

    Add CircleCI

    This configure CircleCI and runs tests and benchmarks on every push. It collects test-results with gotestsum to show a summary.

    It also adds a badge showing the build status of master.

    Screenshot 2019-11-12 at 14 08 06

    Before merging this, please go to Add Projects on CircleCI and click Set Up Project next to this project, then Start Building.

  • Avoid influxdb dependency while importing hyperloglog

    Avoid influxdb dependency while importing hyperloglog

    Currently when hyperloglog is imported via go-module a dependency of influxdata/influxdb is added due to its usage in demo. The PR creates a separate module for the demo which removes the influxdata/influxdb dependency requirement from the actual module.

  • Make New Public Again

    Make New Public Again

    The public constructors at present allow only 14 or 16 registers.

    func New() *Sketch
    func New14() *Sketch
    func New16() *Sketch
    func New16NoSparse() *Sketch
    

    The private constructor newSketch looks like it can take between 4-18 registers, and also a flag to say sparse or normal.

    Looking at the commit history, the public constructor was removed in commit https://github.com/axiomhq/hyperloglog/commit/dba7ba93e183d71f2e4811de2717a7f98d64f185

    Do you see any fundamental issue with the reduced bases? Is it possible to allow those again?

    I have a use case where I need to maintain a large number of these HLL sketches, that can tolerate a higher error rate as well. So I am planning to use just 4 or 8 registers.

  • What is the actual memory usage?

    What is the actual memory usage?

    I see options for both 16384 and 65536 registers. How much memory does each instance actually use? The read-me says each register 4-bits, what about the sketch structure overhead?

  • document requirements for InsertHash

    document requirements for InsertHash

    It would be good to provide some documentation about what properties are expected of the input hash for InsertHash. For example, a reasonable question is: if my data is a set of 64-bit ints, can I use them directly as the "hash"? I ran some experiments with that and got bad results. I also got bad results when I used a simplistic hash function (xor and multiply by large prime). Does it require good avalanche characteristics?

Go package for mapping values to and from space-filling curves, such as Hilbert and Peano curves.
Go package for mapping values to and from space-filling curves, such as Hilbert and Peano curves.

Hilbert Go package for mapping values to and from space-filling curves, such as Hilbert and Peano curves. Documentation available here This is not an

Dec 23, 2022
Surprisingly space efficient trie in Golang(11 bits/key; 100 ns/get).
Surprisingly space efficient trie in Golang(11 bits/key; 100 ns/get).

Slim - surprisingly space efficient data types in Golang Slim is collection of surprisingly space efficient data types, with corresponding serializati

Jan 2, 2023
estruct traverses javascript projects and maps all the dependencies and relationships to a JSON. the output can be used to build network visualizations of the project and document the architecture.
estruct traverses javascript projects and maps all the dependencies and relationships to a JSON. the output can be used to build network visualizations of the project and document the architecture.

EStruct traverses javascript projects and maps all the dependencies and relationships to a JSON. The output can be used to build network visualizations of the project and document the architecture.

Jan 27, 2022
Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Go-edlib : Edit distance and string comparison library Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Ham

Dec 20, 2022
Levenshtein distance and similarity metrics with customizable edit costs and Winkler-like bonus for common prefix.

A Go package for calculating the Levenshtein distance between two strings This package implements distance and similarity metrics for strings, based o

Dec 15, 2022
Decode / encode XML to/from map[string]interface{} (or JSON); extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.

mxj - to/from maps, XML and JSON Decode/encode XML to/from map[string]interface{} (or JSON) values, and extract/modify values from maps by key or key-

Dec 22, 2022
Dasel - Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool.
Dasel - Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool.

Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.

Jan 1, 2023
💯 Go Struct and Field validation, including Cross Field, Cross Struct, Map, Slice and Array diving

Package validator implements value validations for structs and individual fields based on tags.

Nov 9, 2022
Go translations of the algorithms and clients in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne.

Overview Go translations of the Java source code for the algorithms and clients in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin

Dec 13, 2022
Package iter provides generic, lazy iterators, functions for producing them from primitive types, as well as functions and methods for transforming and consuming them.

iter Package iter provides generic, lazy iterators, functions for producing them from primitive types, as well as functions and methods for transformi

Dec 16, 2022
Snackbox - Snackbox can make it easier for customers to order snacks and rice boxes and do tracking
Snackbox - Snackbox can make it easier for customers to order snacks and rice boxes and do tracking

Catering Ecommerce Platform API Docs · Wireflow · Use Case Diagram · Entity Rela

Dec 5, 2022
An app with Trie tree and Breve search Implementation CLI and HTTP both 🥳

Introduction LifeLongLearner project consists of two different parts. My English Vocabulary My Technical Book Notes All of them provided by me within

Jul 1, 2022
A binary stream packer and unpacker

binpacker A binary packer and unpacker. Install go get github.com/zhuangsirui/binpacker Examples Packer buffer := new(bytes.Buffer) packer := binpacke

Dec 1, 2022
A collection of useful, performant, and threadsafe Go datastructures.

go-datastructures Go-datastructures is a collection of useful, performant, and threadsafe Go datastructures. NOTE: only tested with Go 1.3+. Augmented

Dec 29, 2022
Go native library for fast point tracking and K-Nearest queries

Geo Index Geo Index library Overview Splits the earth surface in a grid. At each cell we can store data, such as list of points, count of points, etc.

Dec 3, 2022
:pushpin: State of the art point location and neighbour finding algorithms for region quadtrees, in Go
:pushpin: State of the art point location and neighbour finding algorithms for region quadtrees, in Go

Region quadtrees in Go Region quadtrees and efficient neighbour finding techniques in Go Go-rquad proposes various implementations of region quadtrees

Dec 13, 2022
A simple set type for the Go language. Trusted by Docker, 1Password, Ethereum and Hashicorp.

golang-set The missing set collection for the Go language. Until Go has sets built-in...use this. Coming from Python one of the things I miss is the s

Jan 8, 2023
Data structure and algorithm library for go, designed to provide functions similar to C++ STL

GoSTL English | 简体中文 Introduction GoSTL is a data structure and algorithm library for go, designed to provide functions similar to C++ STL, but more p

Dec 26, 2022
Gota: DataFrames and data wrangling in Go (Golang)

Gota: DataFrames, Series and Data Wrangling for Go This is an implementation of DataFrames, Series and data wrangling methods for the Go programming l

Jan 6, 2023