efficient string matching in Golang via the aho-corasick algorithm.

aho-corasick

Efficient string matching in Golang via the aho-corasick algorithm.

x20 faster than https://github.com/cloudflare/ahocorasick and x3 faster than https://github.com/anknown/ahocorasick

Memory consuption is a eigth of https://github.com/cloudflare/ahocorasick and half of https://github.com/anknown/ahocorasick

This library is heavily inspired by https://github.com/BurntSushi/aho-corasick

Usage

builder := NewAhoCorasickBuilder(Opts{
  AsciiCaseInsensitive: true,
  MatchOnlyWholeWords:  true,
  MatchKind:            LeftMostLongestMatch,
  DFA:                  true,
})

ac := builder.Build([]string{"bear", "masha"})
haystack := "The Bear and Masha"
matches := ac.FindAll(haystack)

for _, match := range matches {
  println(haystack[match.Start():match.End()])
}

Matching can be done via NFA or DFA. NFA has runtime complexity O(N + M) in relation to the haystack and number of matches. DFA has runtime complexity O(N), but it uses more memory.

Replacing of matches in the haystack.

replaceWith needs to be the same length as the patterns

r := NewReplacer(ac)
replaced := r.ReplaceAll(haystack, replaceWith)

ReplaceAllFunc is useful, for example, if you want to use the original text cassing but you are matching case insensitively. You can replace partially by return false and from that point, the original string will be preserved.

replaced := r.ReplaceAllFunc(haystack, func(match Match) (string, bool) {
    return `<a>` + haystack[match.Start():match.End()] + `<\a>`, true
})

Search for matches one at a time via the iterator

iter := ac.Iter(haystack)

for next := iter.Next(); next != nil; next = iter.Next() {
    ...
}

It's plenty fast but if you want to use it in parallel, that is also possible.

Memory consumption won't increase because the read-only automaton is not actually copied, only the counters are.

The magic line is ac := ac

var w sync.WaitGroup

w.Add(50)
for i := 0; i < 50; i++ {
    go func() {
        ac := ac
        matches := ac.FindAll(haystack)
        println(len(matches))
        w.Done()
    }()
}
w.Wait()
Owner
Petar Dambovaliev
(Go/Rust) developer/Musician
Petar Dambovaliev
Comments
  • Provide methods that accept `[]byte`.

    Provide methods that accept `[]byte`.

    Hi,

    I really like the implementation but would love to have a FindAll(haystack []byte) and also a Build(patterns [][]byte). I saw that all patterns and haystacks are converted to []byte internally. What do you think about accepting []byte by default and have the string methods call these? E.g. FindAllByte(haystack []byte) and then FindAll(haystack string) { FindAllByte([]byte(haystack)) }.

  • Should Opts structs fields be exported ?

    Should Opts structs fields be exported ?

    Hi, thanks for the work on this. I'm having a few issues with the Opts struct (I'm relatively new to Go). I seem to get an exported issue with it, so wondering if the fields should be in caps ?

  • Provide byte functions.

    Provide byte functions.

    This patch adds FindAllByte and BuildeByte so that []byte patterns and haystacks can be passed directly without any conversion. The original string to []byte conversions are push up to the constructor methods.

    Closes #6

  • feat: adds support for findN.

    feat: adds support for findN.

    This PR adds support for findN function which would improve the user experience when the matching has a limit.

    One example is this snippet where we match all but only want to deal with at most 10 matches.

    No significant performance impact:

    Before

    goos: darwin
    goarch: arm64
    pkg: github.com/petar-dambovaliev/aho-corasick
    BenchmarkAhoCorasick_LeftmostInsensitiveWholeWord-8   	   11784	     99401 ns/op	   71360 B/op	    1661 allocs/op
    PASS
    

    After

    goos: darwin
    goarch: arm64
    pkg: github.com/petar-dambovaliev/aho-corasick
    BenchmarkAhoCorasick_LeftmostInsensitiveWholeWord-8   	   11911	    101753 ns/op	   71360 B/op	    1661 allocs/op
    PASS
    

    Ping @anuraaga @jptosso

  • High memory usage compared to other implementation

    High memory usage compared to other implementation

    Hi, I found this while looking for a lower memory usage alternative to anknown/ahocorasick.

    I have a dataset of around 6 million strings. The total memory usage, as shown by pprof, after building the automaton is just over 30GB, compared to 6.5GB for the anknown version.

    Do you have any tips for working out why it's using so much more RAM?

    Thanks in advance.

Json-match - Command line util for matching values in a JSON input

json-match Match JSON input by specifying key and value > json-match -json '{\"p

Jan 12, 2022
Fast, secure, efficient backup program
Fast, secure, efficient backup program

Introduction restic is a backup program that is fast, efficient and secure. It supports the three major operating systems (Linux, macOS, Windows) and

Dec 31, 2022
Procmon is a Linux reimagining of the classic Procmon tool from the Sysinternals suite of tools for Windows. Procmon provides a convenient and efficient way for Linux developers to trace the syscall activity on the system.
Procmon is a Linux reimagining of the classic Procmon tool from the Sysinternals suite of tools for Windows. Procmon provides a convenient and efficient way for Linux developers to trace the syscall activity on the system.

Process Monitor for Linux (Preview) Process Monitor (Procmon) is a Linux reimagining of the classic Procmon tool from the Sysinternals suite of tools

Dec 29, 2022
Robust, flexible and resource-efficient pipelines using Go and the commandline
Robust, flexible and resource-efficient pipelines using Go and the commandline

Robust, flexible and resource-efficient pipelines using Go and the commandline Project links: Documentation & Main Website | Issue Tracker | Chat Why

Dec 25, 2022
argv - Go library to split command line string as arguments array using the bash syntax.

Argv Argv is a library for Go to split command line string into arguments array. Documentation Documentation can be found at Godoc Example func TestAr

Nov 19, 2022
A simple go program which checks if your websites are running and runs forever (stop it with ctrl+c). It takes two optional arguments, comma separated string with urls and an interval.

uptime A simple go program which checks if your websites are running and runs forever (stop it with ctrl+c). It takes two optional arguments: -interva

Dec 15, 2022
cross-platform, cli app to perform various operations on string
cross-platform, cli app to perform various operations on string

sttr is command line software that allows you to quickly run various transformation operations on the string.

Dec 30, 2022
sttr is command line software that allows you to quickly run various transformation operations on the string.
sttr is command line software that allows you to quickly run various transformation operations on the string.

sttr is command line software that allows you to quickly run various transformation operations on the string.

Sep 21, 2021
This is a Go Cli app that receives an string path to a log file, and based on it generates and prints in console an encoded polyline with the locations found in the log file.
This is a Go Cli app that receives an string path to a log file, and based on it generates  and prints in console an encoded polyline with the locations found in the log file.

GEOENCODE GO CLI APP DESCRIPTION This is a Go Cli app that receives an string path to a log file, and based on it generates and prints in console an e

Oct 1, 2021
Code examples for Algorithm Analysis and design (CS311) [School project]

Introduction Algorithm Analysis and design 2021/2022 Code examples implemeneted using golang Why Golang? Low Level programming language Awesome garbag

Dec 5, 2021
Cli-algorithm - A cli program with A&DS in go!

cli-algorithm Objectives The objective of this cli is to implement 4 basic algorithms to sort arrays been Merge Sort Insertion Sort Bubble Sort Quick

Jan 2, 2022
File backup via GoLang

SnakeSync Simple File Backup Tool Getting Started Requirements Go >= 1.17 CLI Usage Scan How to use the program via scan go run ./cmd -scan With Fla

Jul 9, 2022
Listen podcast via CLI for golang

goplay Listen podcast via CLI.

Dec 2, 2022
:mag: Search the Go packages via command-line

GoSearch Search the Go packages for pkg.go.dev via command-line. It supports all search options in Search Help. Installation go get github.com/mingram

Jun 23, 2022
Upterm is an open-source solution for sharing terminal sessions instantly over the public internet via secure tunnels.
Upterm is an open-source solution for sharing terminal sessions instantly over the public internet via secure tunnels.

Upterm is an open-source solution for sharing terminal sessions instantly over the public internet via secure tunnels.

Jan 8, 2023
A multiple reverse shell sessions/clients manager via terminal written in go

A multiple reverse shell sessions/clients manager via terminal written in go

Nov 9, 2022
A set of Go scripts to monitor YAGPDB status via the command-line.
A set of Go scripts to monitor YAGPDB status via the command-line.

A set of Go scripts to monitor YAGPDB status by making GET requests to the YAGPDB status endpoint.

Apr 20, 2022
A CLI to execute AT Commands via serial port connections.
A CLI to execute AT Commands via serial port connections.

AT Command CLI A CLI to execute AT Commands via serial port connections. Development Install Go Run go run main.go

Dec 13, 2022
A command line http test tool. Maintain the case via git and pure text
A command line http test tool. Maintain the case via git and pure text

httptest A command line http test tool Maintain the api test cases via git and pure text We want to test the APIs via http requests and assert the res

Dec 16, 2022