I want to share a similar exploration I made in another project regarding the impact of `defer`. I avoided using `defer` in the `Add` method, with some interesting results. Since there wasn't a benchmark for `Add`, I created one and used it to test both the current implementation and the new one. The end results are:
```
benchmark               old ns/op   new ns/op   delta
BenchmarkAdd/128B-8     55666075    49689240    -10.74%
BenchmarkAdd/256B-8     54631261    50188483    -8.13%
BenchmarkAdd/512B-8     57324361    50005511    -12.77%
BenchmarkAdd/1K-8       55605380    51304633    -7.73%
BenchmarkAdd/2K-8       55761200    50059447    -10.23%
BenchmarkAdd/4K-8       56268353    50719990    -9.86%
BenchmarkAdd/8K-8       55517450    50338050    -9.33%
BenchmarkAdd/16K-8      57047413    50159699    -12.07%
BenchmarkAdd/32K-8      56182766    50119649    -10.79%

benchmark               old allocs  new allocs  delta
BenchmarkAdd/128B-8     306915      306915      +0.00%
BenchmarkAdd/256B-8     306915      306915      +0.00%
BenchmarkAdd/512B-8     306915      306915      +0.00%
BenchmarkAdd/1K-8       306915      306915      +0.00%
BenchmarkAdd/2K-8       306915      306915      +0.00%
BenchmarkAdd/4K-8       306915      306915      +0.00%
BenchmarkAdd/8K-8       306915      306915      +0.00%
BenchmarkAdd/16K-8      306915      306915      +0.00%
BenchmarkAdd/32K-8      306915      306915      +0.00%

benchmark               old bytes   new bytes   delta
BenchmarkAdd/128B-8     14731932    14731929    -0.00%
BenchmarkAdd/256B-8     14731921    14731923    +0.00%
BenchmarkAdd/512B-8     14731922    14731920    -0.00%
BenchmarkAdd/1K-8       14731920    14731920    +0.00%
BenchmarkAdd/2K-8       14731921    14731920    -0.00%
BenchmarkAdd/4K-8       14731921    14731920    -0.00%
BenchmarkAdd/8K-8       14731921    14731920    -0.00%
BenchmarkAdd/16K-8      14731921    14731920    -0.00%
BenchmarkAdd/32K-8      14731920    14731920    +0.00%
```
This means that not using `defer` seems to have a considerable impact on the ns/op of `Add`.
What are those sub-benchmarks? They just apply the same idea as in bitcask, to test whether the meta size has any influence on the results. Looking at the implementation, it shouldn't, and the results seem to reflect that (so we can ignore the sub-benchmarks if you like; I kept them just in case).
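For readers unfamiliar with how such size-parameterized sub-benchmarks are usually structured in Go, here is a rough, self-contained sketch. The `addSized` helper and the size table are placeholders for illustration only, not the fork's real benchmark code; the actual benchmark exercises the trie's `Add`:

```go
package main

import (
	"flag"
	"fmt"
	"testing"
)

// sizes mirrors the sub-benchmark labels seen in the results (128B..32K).
var sizes = []struct {
	name string
	n    int
}{
	{"128B", 128}, {"256B", 256}, {"512B", 512},
	{"1K", 1 << 10}, {"2K", 2 << 10}, {"4K", 4 << 10},
	{"8K", 8 << 10}, {"16K", 16 << 10}, {"32K", 32 << 10},
}

// addSized is a hypothetical stand-in for the real Add call; it only
// touches the meta slice so the loop has something to measure.
func addSized(meta []byte) int { return len(meta) }

func main() {
	// Register the testing flags so testing.Benchmark can run outside
	// "go test", and keep each run short.
	testing.Init()
	flag.Set("test.benchtime", "10ms")

	for _, s := range sizes {
		meta := make([]byte, s.n)
		res := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				addSized(meta)
			}
		})
		fmt.Printf("BenchmarkAdd/%s\t%s\n", s.name, res)
	}
}
```

In a real `_test.go` file the same table would drive `b.Run(s.name, ...)` sub-benchmarks, which is what produces the `BenchmarkAdd/128B-8`-style names above.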
How to reproduce these results? (Not completely elegant, but enough to show the experiment.)

```sh
git clone --branch master_avoiddefer https://github.com/jsign/trie.git
rm *.txt ; go test -benchmem -benchtime=5s -bench="BenchmarkAdd\b" > before.txt && go test -benchmem -benchtime=5s -bench=BenchmarkAddWithoutDefer > after.txt && sed -i 's/WithoutDefer//g' after.txt && benchcmp before.txt after.txt
```
Small explanation:
- Clone my fork, which has the new benchmark plus both the original and the new implementation of `Add`.
- Run both benchmarks, then fix the output so it can be compared with `benchcmp`.
This idea may apply to other places where `defer sync.Mutex.Unlock()` is used.
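To make the pattern concrete, here is a minimal sketch of the two styles on a hypothetical mutex-protected type (not the trie's actual `Add`). The trade-off: `defer` guarantees the unlock runs even on panic or early return, while the explicit unlock skips the `defer` machinery but must cover every exit path by hand:

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical type used only to illustrate the pattern.
type counter struct {
	mu sync.Mutex
	n  int
}

// AddWithDefer uses the common idiom: the unlock always runs, even if
// the body panics, at the cost of the defer overhead.
func (c *counter) AddWithDefer(v int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n += v
	return c.n
}

// AddWithoutDefer unlocks explicitly. This is only safe when every
// return path unlocks and the critical section cannot panic.
func (c *counter) AddWithoutDefer(v int) int {
	c.mu.Lock()
	c.n += v
	n := c.n
	c.mu.Unlock()
	return n
}

func main() {
	c := &counter{}
	c.AddWithDefer(1)
	c.AddWithoutDefer(2)
	fmt.Println(c.n) // 3
}
```

Note that the result is read into a local before unlocking, so the return value is not touched outside the critical section.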
In the case of bitcask the impact was close to 4%; here it seems to be greater. Unless I'm missing something, it looks like the `Add` function is quite fast, so the `defer` overhead is relatively large.