Data structure and relevant algorithms for extremely fast prefix/fuzzy string searching.

GoDoc

Trie

Data structure and relevant algorithms for extremely fast prefix/fuzzy string searching.

Usage

Create a Trie with:

t := trie.New()

Add Keys with:

// Add can take in meta information which can be stored with the key.
// i.e. you could store any information you would like to associate with
// this particular key.
t.Add("foobar", 1)

Find a key with:

node, ok := t.Find("foobar")
meta := node.Meta()
// use meta with meta.(type)

Remove Keys with:

t.Remove("foobar")

Prefix search with:

t.PrefixSearch("foo")

Fast test for valid prefix:

t.HasKeysWithPrefix("foo")

Fuzzy search with:

t.FuzzySearch("fb")

Contributing

Fork this repo and run tests with:

go test

Create a feature branch, write your tests and code and submit a pull request.

License

MIT

Owner
Derek Parker
Senior Software Engineer at Red Hat. Previously CoreOS, Hashrocket.
Derek Parker
Comments
  • Handle adding distinct meta info for key

    Handle adding distinct meta info for key

    Currently there is no way to append meta information for duplicate keys. This should be resolved by storing meta as []interface{}, and append meta for duplicate keys.

  • defer impact on trie.Add()

    defer impact on trie.Add()

    I want to share a similar exploration I've made in other project regarding the impact on defer.

    I avoid using defer in the Add method with some interesting results. Considering there wasn't a benchmark for Add, I created one. Using this new benchmark, I tested the actual implementation and this new one.

    The end results are:

    benchmark               old ns/op     new ns/op     delta
    BenchmarkAdd/128B-8     55666075      49689240      -10.74%
    BenchmarkAdd/256B-8     54631261      50188483      -8.13%
    BenchmarkAdd/512B-8     57324361      50005511      -12.77%
    BenchmarkAdd/1K-8       55605380      51304633      -7.73%
    BenchmarkAdd/2K-8       55761200      50059447      -10.23%
    BenchmarkAdd/4K-8       56268353      50719990      -9.86%
    BenchmarkAdd/8K-8       55517450      50338050      -9.33%
    BenchmarkAdd/16K-8      57047413      50159699      -12.07%
    BenchmarkAdd/32K-8      56182766      50119649      -10.79%
    
    benchmark               old allocs     new allocs     delta
    BenchmarkAdd/128B-8     306915         306915         +0.00%
    BenchmarkAdd/256B-8     306915         306915         +0.00%
    BenchmarkAdd/512B-8     306915         306915         +0.00%
    BenchmarkAdd/1K-8       306915         306915         +0.00%
    BenchmarkAdd/2K-8       306915         306915         +0.00%
    BenchmarkAdd/4K-8       306915         306915         +0.00%
    BenchmarkAdd/8K-8       306915         306915         +0.00%
    BenchmarkAdd/16K-8      306915         306915         +0.00%
    BenchmarkAdd/32K-8      306915         306915         +0.00%
    
    benchmark               old bytes     new bytes     delta
    BenchmarkAdd/128B-8     14731932      14731929      -0.00%
    BenchmarkAdd/256B-8     14731921      14731923      +0.00%
    BenchmarkAdd/512B-8     14731922      14731920      -0.00%
    BenchmarkAdd/1K-8       14731920      14731920      +0.00%
    BenchmarkAdd/2K-8       14731921      14731920      -0.00%
    BenchmarkAdd/4K-8       14731921      14731920      -0.00%
    BenchmarkAdd/8K-8       14731921      14731920      -0.00%
    BenchmarkAdd/16K-8      14731921      14731920      -0.00%
    BenchmarkAdd/32K-8      14731920      14731920      +0.00%
    

    This means that not using defer seems to have a considerable impact in ns/op of Add.

    What are those sub-benchs? Just applying the same idea of bitcask to test if the meta size has some influence on the results. Seeing the implementation, it shouldn't and the results seem to reflect that. (so we can ignore the sub-benchmarks if you like, I kept them just in case).

    How to reproduce these results? (not completely elegant, but enough to show the experiment)

    git clone --branch master_avoiddefer https://github.com/jsign/trie.git
    rm *.txt ; go test -benchmem -benchtime=5s -bench="BenchmarkAdd\b" > before.txt && go test -benchmem -benchtime=5s -bench=BenchmarkAddWithoutDefer > after.txt && sed -i 's/WithoutDefer//g' after.txt && benchcmp before.txt after.txt
    

    Small explanation:

    • Clone my fork wich has the new benchmark plus both the original a new implementation of Add
    • Run both benchmarks, fix the ouput to be compared with benchcmp.

    This idea may apply to other parts where defer sync.Mutex.Unlock() exists.

    In the case of bitcask, the impact was close to 4%, here seems to be greater. If I'm not missing something, looks like the Add function is quite fast and defer overhead is relatively big (relatively).

  • Optimize collect

    Optimize collect

    Was

    $ go test -bench . -benchtime 4s -benchmem
    TieKeys-4        	 1000000	      8335 ns/op	     847 B/op	      67 allocs/op
    PrefixSearch-4   	   10000	    617542 ns/op	   71653 B/op	    6287 allocs/op
    FuzzySearch-4    	    1000	   8789504 ns/op	  793950 B/op	   52051 allocs/op
    BuildTree-4      	      30	 172820241 ns/op	64263468 B/op	  908713 allocs/op
    PASS
    ok  	_trie	30.907s
    

    And now its

    $ go test -bench . -benchtime 4s -benchmem
    TieKeys-4        	 1000000	      4583 ns/op	     566 B/op	       7 allocs/op
    PrefixSearch-4   	   20000	    219790 ns/op	   33746 B/op	      16 allocs/op
    FuzzySearch-4    	    1000	   5759077 ns/op	  479819 B/op	    5065 allocs/op
    BuildTree-4      	      20	 240212152 ns/op	78269795 B/op	 1347616 allocs/op
    PASS
    ok  	_trie	24.888s
    

    It's better than #16, but there's additional memory footprint and time to build trie.

  • What is mask for?

    What is mask for?

    type Node struct {
    	val       rune
    	path      string
    	term      bool
    	depth     int
    	meta      interface{}
    	mask      uint64
    	parent    *Node
    	children  map[rune]*Node
    	termCount int
    }
    

    hello derekparker: I don't understand the function of the field mask in the Node, could you explain it? And why this trie is so fast. Thank you so much!

  • Fix find of non-term node

    Fix find of non-term node

    Non terminal nodes need to check for their [nul] children before using them. Adding ("foobar" and "foodar") to trie creates the non-terminal node "foo" which should not be 'findable'. This used to cause a panic without the check for node,ok := children[nul]. which is fixed by this commit.

  • Fix panic during searches when trie is empty

    Fix panic during searches when trie is empty

    make([]T, len, cap) panics if len > cap. The offending slice is already hardcoded to have starting length 1, so this change is the simplest way to ensure that the capacity is always valid. Per https://pkg.go.dev/builtin#make:

    A second integer argument may be provided to specify a different capacity; it must be no smaller than the length.

    This fixes the issue described by https://github.com/go-delve/delve/issues/3217.

  • fix fuzzysearch on empty rune

    fix fuzzysearch on empty rune

    Fixes #18

    I preferred to make this case a direct exception than trying to pollute the logic and cpu cycles just to cover this case.

    • fixes issue
    • add two test cases: one for the issue and one with zero-results.
  • Trie.Keys() panic

    Trie.Keys() panic

    desc

    new trie object, and call 'Keys'

    panic code

    trie.New().Keys()
    

    panic stack

    panic: runtime error: makeslice: cap out of range
    
    goroutine 1 [running]:
    github.com/derekparker/trie.collect(0xc00007e2a0, 0xc0000dde58, 0x0, 0x20)
    	/Users/fmt/work/gopath1/src/github.com/derekparker/trie/trie.go:245 +0xad
    github.com/derekparker/trie.Trie.PrefixSearch(0x0, 0xc00007e2a0, 0x0, 0x0, 0x0, 0x0, 0x4, 0xc00000e360)
    	/Users/fmt/work/gopath1/src/github.com/derekparker/trie/trie.go:143 +0xaa
    github.com/derekparker/trie.(*Trie).Keys(...)
    	/Users/fmt/work/gopath1/src/github.com/derekparker/trie/trie.go:126
    
    
  • avoid refer in Add and Remove

    avoid refer in Add and Remove

    Closes #21

    Apart from the actual change, I added two new benchmarks to see the impact on Add and Remove.

    Here's the comparison with master (with these added benchmarks):

    benchmark                   old ns/op     new ns/op     delta
    BenchmarkTieKeys-8          2384          2391          +0.29%
    BenchmarkPrefixSearch-8     118198        116171        -1.71%
    BenchmarkFuzzySearch-8      4172439       4265630       +2.23%
    BenchmarkBuildTree-8        120406301     115108497     -4.40%
    BenchmarkAdd-8              117840555     112938392     -4.16%
    BenchmarkAddRemove-8        7538          6616          -12.23%
    
    benchmark                   old allocs     new allocs     delta
    BenchmarkTieKeys-8          3              3              +0.00%
    BenchmarkPrefixSearch-8     3              3              +0.00%
    BenchmarkFuzzySearch-8      8009           8009           +0.00%
    BenchmarkBuildTree-8        1010968        1010968        +0.00%
    BenchmarkAdd-8              908710         908710         +0.00%
    BenchmarkAddRemove-8        56             56             +0.00%
    
    benchmark                   old bytes     new bytes     delta
    BenchmarkTieKeys-8          256           256           +0.00%
    BenchmarkPrefixSearch-8     13143         13121         -0.17%
    BenchmarkFuzzySearch-8      475965        476646        +0.14%
    BenchmarkBuildTree-8        76167905      76167968      +0.00%
    BenchmarkAdd-8              75030785      75030708      -0.00%
    BenchmarkAddRemove-8        4592          4592          +0.00%
    

    In short, it improves both Add and Remove. The other differences seem quite volatile for me, but I encourage anyone to do the benchmarks again to double check.

    My original tests for Remove involved using the /usr/share/dict/words, but I re-discovered that there's a bug when removing keys that aren't leafs in the trie. (I later saw that there was already an issue with a pending PR).

  • Improve performance tuning slice capacity

    Improve performance tuning slice capacity

    I worked on top of #17.

    • deleted unused function: Just to cleanup agreed on @derekparker review.
    • created a count of leaves per node: this allows to create the keys slice in collect with the correct capacity to avoid allocations while appending.
    • in collect the nodes capacity is set to a lower-bound which save more allocations.

    Here's the benchmark comparison to master now:

    benchmark                   old ns/op     new ns/op     delta
    BenchmarkTieKeys-8          5098          2385          -53.22%
    BenchmarkPrefixSearch-8     393791        118249        -69.97%
    BenchmarkFuzzySearch-8      6001483       4151116       -30.83%
    
    benchmark                   old allocs     new allocs     delta
    BenchmarkTieKeys-8          56             3              -94.64%
    BenchmarkPrefixSearch-8     5548           3              -99.95%
    BenchmarkFuzzySearch-8      47089          8009           -82.99%
    
    benchmark                   old bytes     new bytes     delta
    BenchmarkTieKeys-8          840           256           -69.52%
    BenchmarkPrefixSearch-8     71628         13120         -81.68%
    BenchmarkFuzzySearch-8      793796        476874        -39.92%
    
  • Significant improvement in all benchmarks

    Significant improvement in all benchmarks

    The collect method was doing too much allocations, and we could take advantage of the calculated depth per node.

    Here are the before/after benchmarks:

    benchmark                   old ns/op     new ns/op     delta
    BenchmarkTieKeys-8          5116          4569          -10.69%
    BenchmarkPrefixSearch-8     399301        286591        -28.23%
    BenchmarkFuzzySearch-8      5985155       5120769       -14.44%
    
    benchmark                   old allocs     new allocs     delta
    BenchmarkTieKeys-8          56             29             -48.21%
    BenchmarkPrefixSearch-8     5548           1494           -73.07%
    BenchmarkFuzzySearch-8      47089          15004          -68.14%
    
    benchmark                   old bytes     new bytes     delta
    BenchmarkTieKeys-8          840           756           -10.00%
    BenchmarkPrefixSearch-8     71613         52985         -26.01%
    BenchmarkFuzzySearch-8      794541        623999        -21.46%
    
  • Two Bug !

    Two Bug !

    bug1: add duplicated key into the trie will incr the size but the key was replaced. bug2: the meta was not store in the node of last character of key but the new child with val=0x00 `

    trie.Add("apple\x00p1", "1") // key contains 0 trie.Add("apple\x00p2", "1") // key contains 0 trie.Add("apple", "1") // after apple inserted, the above 2 keys are dispear.

    `

  • High RAM usage

    High RAM usage

    This package is pulled in as a dependency of one of my dependencies, and it's causing unusually high RAM usage. Related issue: https://github.com/prologic/bitcask/issues/67

    This looks like a memory leak to me since it's unlikely the garbage collector ran in the short time my program was running, and during that time it managed to slurp up 13 GB of memory.

  • Use of a Read Write Mutex

    Use of a Read Write Mutex

    I was wondering if this library supports concurrency, and saw that a recent commit seems to imply that the support has been added: https://github.com/derekparker/trie/commit/1ce4922c7ad906a094c6298977755068ba76ad31

    However, wouldn't your trie still be open to concurrent read operations while you have mutex locked Add or Remove. In that case, would it make sense to shift to sync.RWMutex.

    Also to support custom concurrent walks, it would make sense to have GetChild(rune) *Node function instead of Children() map[rune]*Node. So that a read lock can be taken inside GetChild. Whereas, using Children() would allow the caller to easily ignore an internal mutex lock.

  • Simple collect optimization

    Simple collect optimization

    Was

    $ go test -bench . -benchtime 4s
    BenchmarkTieKeys-4        	 1000000	      8525 ns/op
    BenchmarkPrefixSearch-4   	   10000	    627035 ns/op
    BenchmarkFuzzySearch-4    	    1000	   8602394 ns/op
    BenchmarkBuildTree-4      	      30	 175859534 ns/op
    PASS
    ok  	trie	31.224s
    

    After fix

    $ go test -bench . -benchtime 4s
    BenchmarkTieKeys-4        	 1000000	      6502 ns/op
    BenchmarkPrefixSearch-4   	   10000	    403476 ns/op
    BenchmarkFuzzySearch-4    	    1000	   6794919 ns/op
    BenchmarkBuildTree-4      	      30	 173354546 ns/op
    PASS
    ok  	trie	24.806s
    
Related tags
Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Go-edlib : Edit distance and string comparison library Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Ham

Dec 20, 2022
Data Structure Libraries and Algorithms implementation

Algorithms Data Structure Libraries and Algorithms implementation in C++ Disclaimer This repository is meant to be used as a reference to learn data s

Dec 8, 2022
Data structure,Algorithms implemented in Go (for education)
 Data structure,Algorithms implemented in Go (for education)

Data structure,Algorithms implemented in Go (for education) List of Content : 1. Math - 2. String - 3. Conversions - 4. Sort - 5. Search - 6. Data str

Dec 13, 2022
Go translations of the algorithms and clients in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne.

Overview Go translations of the Java source code for the algorithms and clients in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin

Dec 13, 2022
Levenshtein distance and similarity metrics with customizable edit costs and Winkler-like bonus for common prefix.

A Go package for calculating the Levenshtein distance between two strings This package implements distance and similarity metrics for strings, based o

Dec 15, 2022
A prefix tree implementation in go

Trie (Prefix tree) This library is compatible with Go 1.11+ Please refer to CHANGELOG.md if you encounter breaking changes. Motivation Introduction Us

Nov 3, 2022
A prefix-enhanced map in Go

PrefixMap PrefixMap is a prefix-enhanced map that eases the retrieval of values based on key prefixes. Quick Start Creating a PrefixMap // creates the

Jun 13, 2022
Document Indexing and Searching Library in Go

Fehrist Fehrist is a pure Go library for indexing different types of documents. Currently it supports only CSV and JSON but flexible architecture give

May 22, 2022
Data structure and algorithm library for go, designed to provide functions similar to C++ STL

GoSTL English | 简体中文 Introduction GoSTL is a data structure and algorithm library for go, designed to provide functions similar to C++ STL, but more p

Dec 26, 2022
Data Structure Series: Heaps and heap applications

heaps Data Structure Series: Heaps and heap applications Current Updates: GO: heaps.go: max-heap implementation standard add, remove, and restore func

Mar 17, 2022
Bitset data structure

Your basic bit Set data structure for positive numbers A bit array, or bit set, is an efficient set data structure. It consists of an array that compa

Dec 29, 2022
Probabilistic set data structure
Probabilistic set data structure

Your basic Bloom filter Golang probabilistic set data structure A Bloom filter is a fast and space-efficient probabilistic data structure used to test

Dec 15, 2022
A simple Set data structure implementation in Go (Golang) using LinkedHashMap.

Set Set is a simple Set data structure implementation in Go (Golang) using LinkedHashMap. This library allow you to get a set of int64 or string witho

Sep 26, 2022
BTree provides a simple, ordered, in-memory data structure for Go programs.

BTree implementation for Go This package provides an in-memory B-Tree implementation for Go, useful as an ordered, mutable data structure. The API is

Dec 30, 2022
Set data structure for Go

Archived project. No maintenance. This project is not maintained anymore and is archived.. Please create your own map[string]Type or use one of the ot

Nov 21, 2022
Set data structure for Go

Archived project. No maintenance. This project is not maintained anymore and is archived.. Please create your own map[string]Type or use one of the ot

Nov 21, 2022
Generates data structure definitions from JSON files for any kind of programming language

Overview Archivist generates data structure definitions from JSON files for any kind of programming language. It also provides a library for golang to

Jun 28, 2022
A data structure for storing points.
A data structure for storing points.

ptree This package provides an in-memory data structure for storing points. Under the hood it stores points in a tree structure where nodes are spatia

Apr 18, 2022
Trie data structure implementation in Golang 🌳

Gotri Gotri is an Unicode character based Trie/prefix tree implementation in Go, with the suggestion/auto-complete feature for character searching. Si

Jun 17, 2022