MA-FSA for Go

Package mafsa implements Minimal Acyclic Finite State Automata (MA-FSA) with Minimal Perfect Hashing (MPH). Basically, it's a set of strings that lets you test for membership, do spelling correction (fuzzy matching) and autocomplete, but with higher memory efficiency than a regular trie. With MPH, you can associate each entry in the tree with data from your own application.

In this package, MA-FSA is implemented by two types:

  • BuildTree
  • MinTree

A BuildTree is used to construct the tree from scratch. Once all the elements have been inserted, the BuildTree can be serialized into a byte slice or written to a file directly. It can then be decoded into a MinTree, which uses significantly less memory. MinTrees are read-only, but that restriction is what makes them so space-efficient.

Tutorial

Create a BuildTree and insert your items in lexicographical order. Be sure to call Finish() when you're done.

bt := mafsa.New()
bt.Insert("cities") // an error will be returned if input is < last input
bt.Insert("city")
bt.Insert("pities")
bt.Insert("pity")
bt.Finish()

The tree is now compressed to a minimum number of nodes and is ready to be saved.

err := bt.Save("filename")
if err != nil {
    log.Fatal("Could not save data to file:", err)
}

In your production application, then, you can read the file into a MinTree directly:

mt, err := mafsa.Load("filename")
if err != nil {
    log.Fatal("Could not load data from file:", err)
}

The mt variable is a *MinTree which has the same data as the original BuildTree, but without all the extra "scaffolding" that was required for adding new elements. In our testing, we were able to store over 8 million phrases (average length 24, much longer than words in a typical dictionary) in as little as 2 GB on a 64-bit system.

The package provides some basic read mechanisms.

// See if a string is a member of the set
fmt.Println("Does tree contain 'cities'?", mt.Contains("cities"))
fmt.Println("Does tree contain 'pitiful'?", mt.Contains("pitiful"))

// You can traverse down to a certain node, if it exists
fmt.Printf("'y' node is at: %p\n", mt.Traverse([]rune("city")))

// To traverse the tree and get the number of elements inserted
// before the prefix specified
node, idx := mt.IndexedTraverse([]rune("pit"))
fmt.Printf("Node %p; index number for 'pit': %d\n", node, idx)

To associate entries in the tree with data from your own application, create a slice with the data in the same order as the elements were inserted into the tree:

myData := []string{
    "The plural of city",
    "Noun; a large town",
    "The state of pitying",
    "A feeling of sorrow and compassion",
}

The index number returned by IndexedTraverse(), minus 1, will get you to the element in your slice if you traverse directly to a final node:

node, i := mt.IndexedTraverse([]rune("pities"))
if node != nil && node.Final {
    fmt.Println(myData[i-1])
}

If you do IndexedTraverse() directly to a word in the tree, you must subtract 1 because the index it returns is 1-based: traversing to a final node counts the word at that node itself, whereas Go slices are 0-indexed.
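
Putting this together, a small helper can resolve a word straight to its associated data. This is a sketch (the lookup function is ours, not part of the package), relying only on IndexedTraverse() and the Final field shown above:

// lookup returns the data associated with word, if word is in the tree.
// It applies the minus-1 adjustment described above.
func lookup(mt *mafsa.MinTree, data []string, word string) (string, bool) {
    node, i := mt.IndexedTraverse([]rune(word))
    if node == nil || !node.Final {
        return "", false // not a complete word in the tree
    }
    return data[i-1], true
}

With the tutorial tree and myData above, lookup(mt, myData, "pity") returns "A feeling of sorrow and compassion".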

There are many ways to apply MA-FSA with minimal perfect hashing, so the package only provides the basic utilities. Along with Traverse() and IndexedTraverse(), the edges of each node are exported so you may conduct your own traversals according to your needs.
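
For example, the autocomplete mentioned at the top can be built on those exported edges: traverse to the prefix, then walk everything beneath it. Below is a minimal sketch, assuming (as the examples above do) that each node exposes its Edges as a map of runes to child nodes along with a Final flag; the function names are ours, not part of the package:

// autocomplete returns every entry in the tree that starts with prefix.
func autocomplete(mt *mafsa.MinTree, prefix string) []string {
    var results []string
    collect(mt.Traverse([]rune(prefix)), []rune(prefix), &results)
    return results
}

// collect walks the subtree under node depth-first, appending each
// complete word it finds to out.
func collect(node *mafsa.MinTreeNode, prefix []rune, out *[]string) {
    if node == nil {
        return
    }
    if node.Final {
        *out = append(*out, string(prefix))
    }
    for r, child := range node.Edges {
        // Copy the prefix so sibling branches don't share a backing array.
        collect(child, append(append([]rune{}, prefix...), r), out)
    }
}

Because Go map iteration order is randomized, the results come back unsorted; sort them if you need lexicographical order. With the tutorial tree, autocomplete(mt, "pit") yields "pities" and "pity".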

Encoding Format

This section describes the file format used by Encoder and Decoder. You probably will never need to implement this yourself; it's already done for you in this package.

BuildTrees can be encoded as bytes and then stored on disk or decoded into a MinTree. The encoding of a BuildTree is a binary file composed of a sequence of words, where every word is the same length and is a sequence of big-endian bytes. The file is read one word at a time.

The first word contains file format information. Byte 0 is the file version (right now, 1). Byte 1 is the word length. Byte 2 is the character length. Byte 3 is the pointer length. The rest of the bytes of the first word are 0s.

The word length must be at least 4, and must equal Byte 2 + Byte 3 + 1 (for the flag byte).

Package smartystreets/mafsa initializes the file with this word:

[]byte{0x01, 0x06, 0x01, 0x04, 0x00, 0x00}

This indicates that using version 1 of the file format, each word is 6 bytes long, each character is 1 byte, and each pointer to another node is 4 bytes. This pointer size allows us to encode trees with up to 2^32 (a little over 4.2 billion) edges.
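
As an illustration of reading that initial word, here is a hypothetical helper (a sketch of the layout above, not the package's actual decoder):

// parseHeader interprets the first word of the file per the layout above.
func parseHeader(first []byte) (version, wordLen, charLen, ptrLen int, err error) {
    version = int(first[0]) // file format version (currently 1)
    wordLen = int(first[1]) // length of every word in the file
    charLen = int(first[2]) // bytes per character
    ptrLen = int(first[3])  // bytes per pointer
    if wordLen < 4 || wordLen != charLen+ptrLen+1 {
        err = fmt.Errorf("inconsistent header: word length %d, expected %d",
            wordLen, charLen+ptrLen+1)
    }
    return
}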

Every other word in the file represents an edge (not a node). Those words consist of bytes according to this format:

[Character] [Flags] [Pointer]

The lengths of the character and pointer parts are specified in the initial word, and the flags part is always a single byte. This allows up to 8 flags per edge, but currently only 2 are used.

Flags are:

  • 0x01 = End of Word (EOW, or final)
  • 0x02 = End of Node (EON)

A node essentially consists of a consecutive, variable-length sequence of words, where each word represents an edge. To encode a node, write each of its edges sequentially. Set the EOW (final) flag on an edge if the node it points to is a final node, and set the EON flag on the node's last edge; the EON flag indicates that the next word is an edge belonging to a different node. The pointer in each word should be the word index at which the target node's edges begin.
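
To make the layout concrete, here is a sketch of decoding a single edge word (illustrative only, not the package's actual decoder), using the lengths from the initial word:

// decodeEdgeWord splits one word into its character bytes, flags, and
// big-endian pointer.
func decodeEdgeWord(word []byte, charLen, ptrLen int) (char []byte, eow, eon bool, ptr uint64) {
    char = word[:charLen]
    flags := word[charLen]
    eow = flags&0x01 != 0 // the node this edge points to is final
    eon = flags&0x02 != 0 // this is the last edge of the current node
    for _, b := range word[charLen+1 : charLen+1+ptrLen] {
        ptr = ptr<<8 | uint64(b) // accumulate the big-endian word index
    }
    return
}

Applied to word 1 of the example below (63 00 0000 0003), this yields the character 'c', neither flag set, and pointer 3.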

For example, the tree containing these words:

  • cities
  • city
  • pities
  • pity

Encodes to:

0   01 06 0104 0000
1   63 00 0000 0003
2   70 02 0000 0004
3   69 02 0000 0005
4   69 02 0000 0005
5   74 02 0000 0006
6   69 00 0000 0008
7   79 03 0000 0000
8   65 02 0000 0009
9   73 03 0000 0000

Or here's the hexdump:

00000000  01 06 01 04 00 00 63 00  00 00 00 03 70 02 00 00  |......c.....p...|
00000010  00 04 69 02 00 00 00 05  69 02 00 00 00 05 74 02  |..i.....i.....t.|
00000020  00 00 00 06 69 00 00 00  00 08 79 03 00 00 00 00  |....i.....y.....|
00000030  65 02 00 00 00 09 73 03  00 00 00 00              |e.....s.....|

Now, this tree isn't a thorough enough example to test your implementation against, but it should illustrate the idea. First notice that word 0 is the initializer word. Words 1 and 2 are the edges coming off the root node ('c' and 'p'); word 2 has the EON flag set (0x02), which indicates the end of the edges coming off that node. You'll see that the edges at word indices 3 and 4 both point to the node starting at word index 5: that is the shared 't' node after 'ci' and 'pi'.

If a node has no outgoing edges, the pointer bits are 0.

Comments
  • support encoding FSA with characterLen > 1

    BuildTree tracks the largest rune seen, which allows the Encoder to choose a characterLen large enough for all characters used in the FSA.

    Encoder chooses the smallest characterLen that will work, and generates correct header.

    Encoder encodes each character into the correct space dictated by the characterLen. The encoding is designed to be compatible with the existing decodeCharacter method (already present in the Decoder).

    Encoder tests were added to verify the encoded output of files where a characterLen of 2 or 4 is required.

    Decoder tests were added to verify correct behavior with inputs containing characterLen 2 and 4, though the Decoder itself was not modified.

    Encoder was changed to make WriteTo the primary API, since it can stream words out to the underlying Writer rather than building them all in memory first. The Encode() method is now a wrapper around the streaming WriteTo.

    This enables a performance optimization: because the Encoder writes one word at a time and the words are fixed size, it can reuse a single buffer. With this change the Encoder generates significantly less garbage and makes fewer calls into the GC subsystem (see the sketch after these comments).

    A new benchmark BenchmarkEncodeTo was added to quantify this:

    $ benchcmp before.txt after.txt
    benchmark              old ns/op     new ns/op     delta
    BenchmarkEncodeTo-4    442626        377830        -14.64%

    benchmark              old allocs    new allocs    delta
    BenchmarkEncodeTo-4    4968          2833          -42.98%

    benchmark              old bytes     new bytes     delta
    BenchmarkEncodeTo-4    144112        86688         -39.85%

    NOTE: one side-effect of this change is that Encoder is not thread-safe. It can be reused, but should only be used by a single goroutine at a time.

  • Support Unicode encodings

    The encoding currently only supports ASCII (oops - forgot to upgrade that) since each character only gets 1 byte. This should be expanded to more bytes when the user wants to encode trees with Unicode characters.

    However, this greatly increases the file size and may not always be necessary, so the option should be configurable.

  • When encoding, adjust pointer size

    To minimize the size of the file even more, the size of the pointer should be adjusted based on how many items are in the tree that are being encoded. Right now we simply use 4 bytes, but small trees don't need this many. 4 bytes allows over 4 billion items...

    The pointer size should be automatically adjusted or configurable.

  • Impossible to use UTF-8

    It seems something is wrong with the encoding or decoding of UTF-8 text. Consider this test:

    import (
    	"testing"

    	"github.com/smartystreets/mafsa"
    	"github.com/stretchr/testify/require"
    )

    func TestRussian(t *testing.T) {
    	a := mafsa.New()
    	a.Insert("я") // "I" in Russian
    	a.Finish()
    	a.Save("test")
    	require.True(t, a.Contains("я")) // Fine

    	b, err := mafsa.Load("test")
    	require.NoError(t, err)
    	require.True(t, b.Contains("я")) // Failure!!
    }

    func TestSpanish(t *testing.T) {
    	a := mafsa.New()
    	a.Insert("Gracías") // "Thank you" in Spanish
    	a.Finish()
    	a.Save("test")
    	require.True(t, a.Contains("Gracías")) // Fine

    	b, err := mafsa.Load("test")
    	require.NoError(t, err)
    	require.True(t, b.Contains("Gracías")) // Failure!!
    }
    
  • unused Entry and unnecessary entry

    I've been getting a deeper understanding of the code and found the following.

    1. The type Entry defined in mintree.go does not appear to be used anywhere. It is defined here: https://github.com/smartystreets/mafsa/blob/1575156d598714bb1c41713191eefb7f5ccf3e8b/mintree.go#L26

    2. The entry argument to decodeEdge(...) seems to serve no purpose. We only ever append to it and pass it to recursive calls, but it's never used for anything. Further, when I remove it and all usage of it, all tests in the package continue to pass. See: https://github.com/smartystreets/mafsa/blob/1575156d598714bb1c41713191eefb7f5ccf3e8b/decoder.go#L85

    Is this just left over from some previous version of the implementation? Or am I missing some key aspect of the way things are working?

  • Share the testing data

    Thanks for making this. Can you please share the 8M strings of testing data you used for this? This will give a good point of comparison with other libs out there.

  • Unusual Encoder/Decoder signatures?

    Please correct me if I'm wrong. I don't think there's a standard interface for Encoder/Decoder types, but as far as I can see, they all typically take an io.Reader to create a Decoder, and io.Writer to create an Encoder.

    For example, look at:

    http://godoc.org/encoding/json#NewDecoder
    http://godoc.org/encoding/xml#NewDecoder
    http://godoc.org/encoding/ascii85#NewDecoder

    Then, their Encode/Decode methods take one parameter and return just error.

    The Encoder/Decoder in this package do not take a writer/reader, but instead return the encoded/decoded data alongside the error.

    My question is, would it be a better API if they did take a Reader/Writer?

  • Elaborate on fuzzy matching and autocomplete?

    From the README,

    Basically, it's a set of strings that lets you test for membership, do spelling correction (fuzzy matching) and autocomplete, but with higher memory efficiency than a regular trie.

    Can you please elaborate on how this can be used for fuzzy matching and autocomplete?

    As far as I can tell, you can only test for strict presence of a full string. It's not possible to check if a substring or a fuzzy match exists in the set, is it?
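
As a footnote to the buffer-reuse change described in the first comment above, the general pattern looks roughly like this. This is a sketch of the idea only (using encoding/binary and io), assuming the default header with 1-byte characters and 4-byte pointers; the type and method names are ours, not the package's API:

// wordWriter streams fixed-size words to w, reusing one scratch buffer
// instead of allocating a new slice per word.
type wordWriter struct {
    w   io.Writer
    buf []byte
}

func newWordWriter(w io.Writer, wordLen int) *wordWriter {
    return &wordWriter{w: w, buf: make([]byte, wordLen)}
}

// writeEdge encodes one edge word into the shared buffer and writes it.
// Reusing buf means this type must not be used by multiple goroutines
// at once, which mirrors the thread-safety note in that comment.
func (ww *wordWriter) writeEdge(char, flags byte, ptr uint32) error {
    ww.buf[0] = char
    ww.buf[1] = flags
    binary.BigEndian.PutUint32(ww.buf[2:], ptr) // 4-byte big-endian pointer
    _, err := ww.w.Write(ww.buf)
    return err
}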
