A Go library implementing an FST (finite state transducer)

vellum vellum

Tests Coverage Status GoDoc Go Report Card License

A Go library implementing an FST (finite state transducer) capable of:

  • mapping between keys ([]byte) and a value (uint64)
  • enumerating keys in lexicographic order

Some additional goals of this implementation:

  • bounded memory use while building the FST
  • streaming out FST data while building
  • mmap FST runtime to support very large FTSs (optional)

Usage

Building an FST

To build an FST, create a new builder using the New() method. This method takes an io.Writer as an argument. As the FST is being built, data will be streamed to the writer as soon as possible. With this builder you MUST insert keys in lexicographic order. Inserting keys out of order will result in an error. After inserting the last key into the builder, you MUST call Close() on the builder. This will flush all remaining data to the underlying writer.

In memory:

  var buf bytes.Buffer
  builder, err := vellum.New(&buf, nil)
  if err != nil {
    log.Fatal(err)
  }

To disk:

  f, err := os.Create("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }
  builder, err := vellum.New(f, nil)
  if err != nil {
    log.Fatal(err)
  }

MUST insert keys in lexicographic order:

err = builder.Insert([]byte("cat"), 1)
if err != nil {
  log.Fatal(err)
}

err = builder.Insert([]byte("dog"), 2)
if err != nil {
  log.Fatal(err)
}

err = builder.Insert([]byte("fish"), 3)
if err != nil {
  log.Fatal(err)
}

err = builder.Close()
if err != nil {
  log.Fatal(err)
}

Using an FST

After closing the builder, the data can be used to instantiate an FST. If the data was written to disk, you can use the Open() method to mmap the file. If the data is already in memory, or you wish to load/mmap the data yourself, you can instantiate the FST with the Load() method.

Load in memory:

  fst, err := vellum.Load(buf.Bytes())
  if err != nil {
    log.Fatal(err)
  }

Open from disk:

  fst, err := vellum.Open("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }

Get key/value:

  val, exists, err = fst.Get([]byte("dog"))
  if err != nil {
    log.Fatal(err)
  }
  if exists {
    fmt.Printf("contains dog with val: %d\n", val)
  } else {
    fmt.Printf("does not contain dog")
  }

Iterate key/values:

  itr, err := fst.Iterator(startKeyInclusive, endKeyExclusive)
  for err == nil {
    key, val := itr.Current()
    fmt.Printf("contains key: %s val: %d", key, val)
    err = itr.Next()
  }
  if err != nil {
    log.Fatal(err)
  }

How does the FST get built?

A full example of the implementation is beyond the scope of this README, but let's consider a small example where we want to insert 3 key/value pairs.

First we insert "are" with the value 4.

step1

Next, we insert "ate" with the value 2.

step2

Notice how the values associated with the transitions were adjusted so that by summing them while traversing we still get the expected value.

At this point, we see that state 5 looks like state 3, and state 4 looks like state 2. But, we cannot yet combine them because future inserts could change this.

Now, we insert "see" with value 3. Once it has been added, we now know that states 5 and 4 can longer change. Since they are identical to 3 and 2, we replace them.

step3

Again, we see that states 7 and 8 appear to be identical to 2 and 3.

Having inserted our last key, we call Close() on the builder.

step4

Now, states 7 and 8 can safely be replaced with 2 and 3.

For additional information, see the references at the bottom of this document.

What does the serialized format look like?

We've broken out a separate document on the vellum disk format v1.

What if I want to use this on a system that doesn't have mmap?

The mmap library itself is guarded with system/architecture build tags, but we've also added an additional build tag in vellum. If you'd like to Open() a file based representation of an FST, but not use mmap, you can build the library with the nommap build tag. NOTE: if you do this, the entire FST will be read into memory.

Can I use this with Unicode strings?

Yes, however this implementation is only aware of the byte representation you choose. In order to find matches, you must work with some canonical byte representation of the string. In the future, some encoding-aware traversals may be possible on top of the lower-level byte transitions.

How did this library come to be?

In my work on the Bleve project I became aware of the power of the FST for many search-related tasks. The obvious starting point for such a thing in Go was the mafsa project. While working with mafsa I encountered some issues. First, it did not stream data to disk while building. Second, it chose to use a rune as the fundamental unit of transition in the FST, but I felt using a byte would be more powerful in the end. My hope is that higher-level encoding-aware traversals will be possible when necessary. Finally, as I reported bugs and submitted PRs I learned that the mafsa project was mainly a research project and no longer being maintained. I wanted to build something that could be used in production. As the project advanced more and more techniques from the BurntSushi/fst were adapted to our implementation.

Are there tools to work with vellum files?

Under the cmd/vellum subdirectory, there's a command-line tool which features subcommands that can allow you to create, inspect and query vellum files.

How can I generate a state transition diagram from a vellum file?

The vellum command-line tool has a "dot" subcommand that can emit graphviz dot output data from an input vellum file. The dot file can in turn be converted into an image using graphviz tools. Example...

$ vellum dot myFile.vellum > output.dot
$ dot -Tpng output.dot -o output.png

Related Work

Much credit goes to two existing projects:

Most of the original implementation here started with my digging into the internals of mafsa. As the implementation progressed, I continued to borrow ideas/approaches from the BurntSushi/fst library as well.

For a great introduction to this topic, please read the blog post Index 1,600,000,000 Keys with Automata and Rust

Owner
bleve
modern text indexing for go - supported and sponsored by Couchbase
bleve
Similar Resources

go-paddle is a Go client library for accessing the Paddle API.

go-paddle go-paddle is a Go client library for accessing the Paddle API. Installation go get github.com/Fakerr/go-paddle Alternatively the same can be

Aug 22, 2022

Self-contained Machine Learning and Natural Language Processing library in Go

Self-contained Machine Learning and Natural Language Processing library in Go

Self-contained Machine Learning and Natural Language Processing library in Go

Jan 8, 2023

Go (Golang) encrypted deep learning library; Fully homomorphic encryption over neural network graphs

DC DarkLantern A lantern is a portable case that protects light, A dark lantern is one who's light can be hidden at will. DC DarkLantern is a golang i

Oct 31, 2022

Go-hubspot - An auto-generated go client library of all of Hubspot's API

go-hubspot This is an auto-generated go client library of all of Hubspot's API (

Jan 22, 2022

Bcfm-study-case - A simple http server using the Echo library in Go language

Bcfm-study-case - A simple http server using the Echo library in Go language

Task 1 Hakkında Burada Go dilinde Echo kütüphanesini kullanarak basit bir http s

Feb 2, 2022

State observer - StateObserver used to synchronize the local(cached) state of the remote object with the real state

state observer StateObserver used to synchronize the local(cached) state of the

Jan 19, 2022

Finite State Machine for Go

FSM for Go FSM is a finite state machine for Go. It is heavily based on two FSM implementations: Javascript Finite State Machine, https://github.com/j

Dec 27, 2022

Package mafsa implements Minimal Acyclic Finite State Automata in Go, essentially a high-speed, memory-efficient, Unicode-friendly set of strings.

MA-FSA for Go Package mafsa implements Minimal Acyclic Finite State Automata (MA-FSA) with Minimal Perfect Hashing (MPH). Basically, it's a set of str

Oct 27, 2022

Package fsm allows you to add finite-state machines to your Go code.

fsm Package fsm allows you to add finite-state machines to your Go code. States and Events are defined as int consts: const ( StateFoo fsm.State =

Dec 9, 2022

Finite State Machine for Go

Finite State Machine for Go

FSM for Go Finite State Machine for Go It is heavily inspired from looplab/fsm library but with more abstractions and optimizations License FSM is lic

Nov 30, 2021

Finite-state machine with processors

FSM Finite-state machine with processors. Builder Register state processors type state1Processor struct { // clients } func NewState1Processor(..

Jan 7, 2022

cluster-api-state-metrics (CASM) is a service that listens to the Kubernetes API server and generates metrics about the state of custom resource objects related of Kubernetes Cluster API.

Overview cluster-api-state-metrics (CASM) is a service that listens to the Kubernetes API server and generates metrics about the state of custom resou

Oct 27, 2022

Us-api: a simple service that returns the US state code based on the state

us-api us-api is a simple service that returns the US state code based on the state. It does not support creating, updating nor deleting data. Local D

Dec 13, 2021

Using finite projective planes to make card (maybe video) games

pairwise What it is Using finite projective plane to generate card (maybe video) games. Running Run with go run . Right now uses Go 1.17 but 1.18 just

Jan 24, 2022

A Go library for implementing command-line interfaces.

Go CLI Library cli is a library for implementing powerful command-line interfaces in Go. cli is the library that powers the CLI for Packer, Serf, Cons

Dec 25, 2022

A Go library for implementing command-line interfaces.

Go CLI Library cli is a library for implementing powerful command-line interfaces in Go. cli is the library that powers the CLI for Packer, Serf, Cons

Jan 1, 2023

Go library implementing xor filters

Go library implementing xor filters

xorfilter: Go library implementing xor filters Bloom filters are used to quickly check whether an element is part of a set. Xor filters are a faster a

Dec 30, 2022

A simple Go library for 3D ray-tracing rendering, implementing the book Ray Tracing in One Weekend.

A simple Go library for 3D ray-tracing rendering, implementing the book Ray Tracing in One Weekend.

Ray Tracing in Go A Go implementation of the book Ray Tracing in One Weekend. The repository provides a library to describe and render your own scenes

Sep 23, 2022

Go library for creating state machines

Go library for creating state machines

Stateless Create state machines and lightweight state machine-based workflows directly in Go code: phoneCall := stateless.NewStateMachine(stateOffHook

Jan 6, 2023
Comments
  • Signature Mismatch on v1.0.6

    Signature Mismatch on v1.0.6

    Keep getting signature mismatch problems between my Athens caching module proxy and https://proxy.golang.org/ was the current release by chance retagged?

  • Moved github.com/willf/bitset import to github.com/bits-and-blooms/bitset

    Moved github.com/willf/bitset import to github.com/bits-and-blooms/bitset

    This is a minor but critical change, as the bitset repo has been renamed, and using the legacy import still results in "module declares its path as: github.com/bits-and-blooms/bitset but was required as: github.com/willf/bitset" errors in some build systems. Specifically, this import has been causing issues when using Bazel and Go.

  • Search doesn't seem to work with character replacements

    Search doesn't seem to work with character replacements

    The distance between cat and car is either 1 or 2, right? But the iterator in the code below returns no matches.

    If the distance is 1:

    package main
    
    import (
    	"bytes"
    	"errors"
    	"fmt"
    	"sort"
    
    	"github.com/blevesearch/vellum"
    	"github.com/blevesearch/vellum/levenshtein"
    )
    
    func main() {
    	var buf bytes.Buffer
    	builder, err := vellum.New(&buf, nil)
    	if err != nil {
    		panic(err)
    	}
    	words := []string{
    		"cat",
    	}
    	sort.Strings(words)
    	for _, w := range words {
    		if err := builder.Insert([]byte(w), 0); err != nil {
    			panic(err)
    		}
    	}
    	if err := builder.Close(); err != nil {
    		panic(err)
    	}
    
    	lb1, err := levenshtein.NewLevenshteinAutomatonBuilder(1, false)
    	if err != nil {
    		panic(err)
    	}
    
    	dfa, err := lb1.BuildDfa("car", 1)
    	if err != nil {
    		panic(err)
    	}
    
    	fst, err := vellum.Load(buf.Bytes())
    	if err != nil {
    		panic(err)
    	}
    
    	it, err := fst.Search(dfa, nil, nil)
    	// it, err := fst.Search(dfa, []byte("aaa"), []byte("zzz"))
    	if err != nil {
    		panic(err)
    	}
    
    	for {
    		if err := it.Next(); err != nil {
    			if errors.Is(err, vellum.ErrIteratorDone) {
    				break
    			}
    			panic(err)
    		}
    		m, _ := it.Current()
    
    		fmt.Println(string(m))
    	}
    }
    

    If the distance is 2:

    package main
    
    import (
    	"bytes"
    	"errors"
    	"fmt"
    	"sort"
    
    	"github.com/blevesearch/vellum"
    	"github.com/blevesearch/vellum/levenshtein"
    )
    
    func main() {
    	var buf bytes.Buffer
    	builder, err := vellum.New(&buf, nil)
    	if err != nil {
    		panic(err)
    	}
    	words := []string{
    		"cat",
    	}
    	sort.Strings(words)
    	for _, w := range words {
    		if err := builder.Insert([]byte(w), 0); err != nil {
    			panic(err)
    		}
    	}
    	if err := builder.Close(); err != nil {
    		panic(err)
    	}
    
    	lb2, err := levenshtein.NewLevenshteinAutomatonBuilder(2, false)
    	if err != nil {
    		panic(err)
    	}
    
    	dfa, err := lb2.BuildDfa("car", 2)
    	if err != nil {
    		panic(err)
    	}
    
    	fst, err := vellum.Load(buf.Bytes())
    	if err != nil {
    		panic(err)
    	}
    
    	it, err := fst.Search(dfa, nil, nil)
    	// it, err := fst.Search(dfa, []byte("aaa"), []byte("zzz"))
    	if err != nil {
    		panic(err)
    	}
    
    	for {
    		if err := it.Next(); err != nil {
    			if errors.Is(err, vellum.ErrIteratorDone) {
    				break
    			}
    			panic(err)
    		}
    		m, _ := it.Current()
    
    		fmt.Println(string(m))
    	}
    }
    
  • dot format cannot represent the whole graph

    dot format cannot represent the whole graph

    Thank you for your great work. But I have a question. To reconstruct the whole graph , I try to use dot output .dt file , but it seems that this format loss the final value. For example:
    ban => b/1 + a/109 + n/306 which is not true , when compared to the dumped format. Because dumped format has a final state value, like:

    State: 785381 (0xbfbe5) final (2887)

    So, why .dt format loss the last value(2887)?

Gorgonia is a library that helps facilitate machine learning in Go.
Gorgonia is a library that helps facilitate machine learning in Go.

Gorgonia is a library that helps facilitate machine learning in Go. Write and evaluate mathematical equations involving multidimensional arrays easily

Dec 30, 2022
Go package for OCR (Optical Character Recognition), by using Tesseract C++ library

gosseract OCR Golang OCR package, by using Tesseract C++ library. OCR Server Do you just want OCR server, or see the working example of this package?

Jan 3, 2023
onnx-go gives the ability to import a pre-trained neural network within Go without being linked to a framework or library.
onnx-go gives the ability to import a pre-trained neural network within Go without being linked to a framework or library.

This is a Go Interface to Open Neural Network Exchange (ONNX). Overview onnx-go contains primitives to decode a onnx binary model into a computation b

Dec 24, 2022
A neural network library built in Go

go-mind A neural network library built in Go. Usage import "github.com/stevenmiller888/go-mind" m := mind.New(0.7, 10000, 3, "sigmoid") m.Learn([][]

Aug 27, 2022
Gorgonia is a library that helps facilitate machine learning in Go.
Gorgonia is a library that helps facilitate machine learning in Go.

Gorgonia is a library that helps facilitate machine learning in Go. Write and evaluate mathematical equations involving multidimensional arrays easily

Dec 27, 2022
A High-level Machine Learning Library for Go
A High-level Machine Learning Library for Go

Overview Goro is a high-level machine learning library for Go built on Gorgonia. It aims to have the same feel as Keras. Usage import ( . "github.

Nov 20, 2022
Library for multi-armed bandit selection strategies, including efficient deterministic implementations of Thompson sampling and epsilon-greedy.
Library for multi-armed bandit selection strategies, including efficient deterministic implementations of Thompson sampling and epsilon-greedy.

Mab Multi-Armed Bandits Go Library Description Installation Usage Creating a bandit and selecting arms Numerical integration with numint Documentation

Jan 2, 2023
Bigmachine is a library for self-managing serverless computing in Go

Bigmachine Bigmachine is a toolkit for building self-managing serverless applications in Go. Bigmachine provides an API that lets a driver process for

Nov 15, 2022
A high performance go implementation of Wappalyzer Technology Detection Library

wappalyzergo A high performance port of the Wappalyzer Technology Detection Library to Go. Inspired by https://github.com/rverton/webanalyze. Features

Jan 8, 2023
A high-performance timeline tracing library for Golang, used by TiDB

Minitrace-Go A high-performance, ergonomic timeline tracing library for Golang. Basic Usage package main import ( "context" "fmt" "strcon

Nov 28, 2022