Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Go-edlib : Edit distance and string comparison library

Travis CI Test coverage Go Report Card License: MIT Documentation link PkgGoDev

Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...


Table of Contents


Requirements

  • Go (v1.13+)

Introduction

Golang open-source library which includes most (and soon all) edit-distance and string comparision algorithms with some extra!
Designed to be fully compatible with Unicode characters!
This library is 100% test covered 😁

Features

  • Levenshtein

  • LCS (Longest common subsequence) with edit distance, backtrack and diff functions

  • Hamming

  • Damerau-Levenshtein, with following variants :

    • OSA (Optimal string alignment)
    • Adjacent transpositions
  • Jaro & Jaro-Winkler similarity algorithms

  • Cosine Similarity algorithm to compare strings

  • Computed similarity percentage functions based on all available edit distance algorithms in this lib

  • Fuzzy search functions based on edit distance with unique or multiples strings output

  • Unicode compatibility ! 🥳

  • And many more to come !

Benchmarks

You can check an interactive Google chart with few benchmark cases for all similarity algorithms in this library through StringSimilarity function here

However, if you want or need more details, you can also viewing benchmark raw output here, which also includes memory allocations and test cases output (similarity result and errors).

If you are on Linux and want to run them on your setup, you can run ./tests/benchmark.sh script.

Installation

Open bash into your project folder and run:

go get github.com/hbollon/go-edlib

And import it into your project:

import (
	"github.com/hbollon/go-edlib"
)

Run tests

If you are on Linux and want to run all unit tests just run ./tests/tests.sh script.

For Windows users you can run:

go test ./... # Add desired parameters to this command if you want

Documentation

You can find all the documentation here : Documentation

Examples

Calculate string similarity index between two string

You can use StringSimilarity(str1, str2, algorithm) function. algorithm parameter must one of the following constants:

// Algorithm identifiers
const (
	Levenshtein Algorithm = iota
	DamerauLevenshtein
	OSADamerauLevenshtein
	Lcs
	Hamming
	Jaro
	JaroWinkler
	Cosine
)

Example with levenshtein:

res, err := edlib.StringsSimilarity("string1", "string2", edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Similarity: %f", res)
}

Execute fuzzy search based on string similarity algorithm

1. Most matching unique result without threshold

You can use FuzzySearch(str, strList, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearch("testnig", strList, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result: %s", res)
}
Result: testing 

2. Most matching unique result with threshold

You can use FuzzySearchThreshold(str, strList, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchThreshold("testnig", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig': %s", res)
}

res, err = edlib.FuzzySearchThreshold("hello", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'hello': %s", res)
}
Result for 'testnig': testing
Result for 'hello':

3. Most matching result set without threshold

You can use FuzzySearchSet(str, strList, resultQuantity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSet("testnig", strList, 3, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Results: %s", strings.Join(res, ", "))
}
Results: testing, test, tester 

4. Most matching result set with threshold

You can use FuzzySearchSetThreshold(str, strList, resultQuantity, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.5, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.5' threshold: %s", strings.Join(res, " "))
}

res, err = edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.7' threshold: %s", strings.Join(res, " "))
}
Result for 'testnig' with '0.5' threshold: testing test tester
Result for 'testnig' with '0.7' threshold: testing

Get raw edit distance (Levenshtein, LCS, Damerau–Levenshtein, Hamming)

You can use one of the following function to get an edit distance between two strings :

Example with Levenshtein distance:

res := edlib.LevenshteinDistance("kitten", "sitting")
fmt.Printf("Result: %d", res)
Result: 3

LCS, LCS Backtrack and LCS Diff

1. Compute LCS(Longuest Common Subsequence) between two strings

You can use LCS(str1, str2) function.

lcs := edlib.LCS("ABCD", "ACBAD")
fmt.Printf("Length of their LCS: %d", lcs)
Length of their LCS: 3

2. Backtrack their LCS

You can use LCSBacktrack(str1, str2) function.

res, err := edlib.LCSBacktrack("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", res)
}
LCS: ABD

3. Backtrack all their LCS

You can use LCSBacktrackAll(str1, str2) function.

res, err := edlib.LCSBacktrackAll("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", strings.Join(res, ", "))
}
LCS: ABD, ACD

4. Get LCS Diff between two strings

You can use LCSDiff(str1, str2) function.

res, err := edlib.LCSDiff("computer", "houseboat")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: \n%s\n%s", res[0], res[1])
}
LCS Diff: 
 h c o m p u s e b o a t e r
 + -   - -   + + + + +   - -

Author

👤 Hugo Bollon

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page.

Show your support

Give a ⭐️ if this project helped you!

📝 License

Copyright © 2020 Hugo Bollon.
This project is MIT License licensed.

Owner
Hugo Bollon
Student at University Savoie Mont-Blanc in CS Master. Plan to become DevOps Engineer. I don't know everything but I can learn anything
Hugo Bollon
Comments
  • Missing methods for string shingling

    Missing methods for string shingling

    Missing methods for string shingling(https://en.wikipedia.org/wiki/W-shingling) aka n-grams for strings. In general, Cosine similarity and Jaccard index, among other methods, are computed on n-grams of the string, instead of the string itself(eg: https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/). Right now, we just split the string on spaces and compute the index based on the words in the string, which is fine but is probably not technically correct. I propose we introduce a method to shingle the given strings so that we can later use the shingled string in finding cosine similarity, etc. This has been done before here: Java: https://github.com/tdebatty/java-string-similarity Julia: https://github.com/matthieugomez/StringDistances.jl

    I think we should introduce a function that does something like:

    // Function signature
    shingle(str string, n int) map[string]int
    // Example usage:
    shingled_s1 := shingle("ABCD", 2) // returns map[AB:1 BC:1 CD:1]
    shingled_s2 := shingle("ABCE", 2) // returns map[AB:1 BC:1 CE:1]
    

    This will enable us to use these maps to later compute stuff, eg:

    res := JaccardSimilarityWithShingles(shingled_s1, shingled_s2) // should return 0.5.
    
  • Add Shingle and ShingleSlidingWindow

    Add Shingle and ShingleSlidingWindow

    Fixes #10 Added two methods, one that slices length n each time and the other uses sliding window. The first one was slightly faster for all values I checked, so I think we should remove the sliding window method. Sorry for the delay, I wanted to benchmark them both properly but couldn't find the time.

  • Q-gram index and Sorensen-Dice coefficient

    Q-gram index and Sorensen-Dice coefficient

    Currently, we do not have methods for Q-gram and Sorensen-Dice coefficient(https://en.wikipedia.org/wiki/Sørensen–Dice_coefficient) which are fairly simple to implement since they're based on n-grams.

  • LCSBacktrackAll not work

    LCSBacktrackAll not work

    My code:

    ress,err:=edlib.LCSBacktrackAll(" 是ab是cde22f123g", "222222是ab是cd123") if err!=nil{ fmt.Println(err) }else { fmt.Printf("%s", strings.Join(ress, "\r\n")) }

    Shoud be print ` 是ab是cd

    123 `

    But is print 是ab是cd123

  • Jaccard Similarity missing

    Jaccard Similarity missing

    Jaccard similarity coefficient(https://en.wikipedia.org/wiki/Jaccard_index) is a simple way to measure string similarity. I cannot find a function implementing this, is there one? If it is isn't there yet, it can be easily implemented using existing functions from cosine.go

  • reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m*n))

    reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m*n))

    Hi,this is the pr for issue. I run the benchmark on OSADamerauLevenshtein. Here is the result. | name | before | after | alloc_before | alloc_after | | :-----:| :----: | :----: | :----: | :----: | | sameLengthStringInput | 3389 ns/op |2416 ns/op |23 allocs/op |4 allocs/op | | pneumonoultramicroscopi | 15462 ns/op| 10253 ns/op |51 allocs/op| 8 allocs/op | |こにんちこにんちこに | 4027 ns/op | 3342 ns/op |18 allocs/op | 4 allocs/op| | I_love_horror_movi | 4282 ns/op | 3073 ns/op | 22 allocs/op | 4 allocs/op |

  • Reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n))

    Reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n))

    Discussed in https://github.com/hbollon/go-edlib/discussions/15

    Originally posted by tang-hi July 2, 2022 When we calculate the distance of OSADamerau–Levenshtein, we actually calculate the matrix with dimension(m*n) Suppose we have a matrix looks like

    1656691342850 the value in (3,0) is always 3 the value in (3,1) depends on (3,0) (2,0) (2,1) the value in (3,2) depends on (3,1) (2,1) (2,2) and (1,1) we can found that the value in (row : i) only depends values in (row:i-1,i-2) eg. the value in (row : 3) only depends values in (row:2,1)

    So any row number >= 3, we could write it in (row %3, j), eg. write the value of (3,0) in (0,0) .... (3,j) in (0,j) write the value of (4,0) in (1,0) .... (4,j) in (1,j) write the value of (i,0) in (i%3,0) .... (i,j) in (i%3,j)

    In this way, we only need the matrix with dimension(3, min(m,n)) to calculate OSADamerau–Levenshtein distance Here is the code to illustrate my idea

  • chore: release 1.5.0

    chore: release 1.5.0

  • StringsSimilarity is broken for unicode

    StringsSimilarity is broken for unicode

    The test is failing

    func TestStringsSimilarity(t *testing.T) {
    	str1 := "abcde"
    	str2 := "бвгдж"
    	str3 := "fghjk"
    	sim12, _ := edlib.StringsSimilarity(str1, str2, edlib.Levenshtein)
    	sim13, _ := edlib.StringsSimilarity(str1, str3, edlib.Levenshtein)
    	if sim12 != sim13 {
    		t.Errorf("different similarities, %v != %v", sim12, sim13)
    	}
    }
    

    output: different similarities, 0.5 != 0

    Similarity between "abcde" and "бвгдж" should be the same as between "abcde" and "fghjk". Looks like bug in the matchingIndex() because it compares length of strings but ...Distance() functions use runes.

  • chore: release 1.6.0

    chore: release 1.6.0

    :robot: I have created a release *beep* *boop*

    1.6.0 (2022-07-03)

    Features

    • reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n)) (#17) (37da2e1)

    This PR was generated with Release Please. See documentation.

Some algorithms in go: maxflow(min-cuts or graph-cuts), edit-distance.

Algorithms In this repository, some algorithms are implemented in go language. GoDoc link: ed maxflow About Max-flow problem: A flow network is repres

Sep 8, 2022
Go implementation to calculate Levenshtein Distance.

levenshtein Go package to calculate the Levenshtein Distance The library is fully capable of working with non-ascii strings. But the strings are not n

Dec 14, 2022
Recursively searches a map[string]interface{} structure for another map[string]interface{} structure

msirecurse Recursively searches a map[string]interface{} structure for existence of a map[string]interface{} structure Motivation I wrote this package

Mar 3, 2022
Data structure and relevant algorithms for extremely fast prefix/fuzzy string searching.

Trie Data structure and relevant algorithms for extremely fast prefix/fuzzy string searching. Usage Create a Trie with: t := trie.New() Add Keys with:

Dec 27, 2022
Go translations of the algorithms and clients in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne.

Overview Go translations of the Java source code for the algorithms and clients in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin

Dec 13, 2022
An in-memory string-interface{} map with various expiration options for golang

TTLCache - an in-memory cache with expiration TTLCache is a simple key/value cache in golang with the following functions: Expiration of items based o

Jan 8, 2023
An in-memory string-interface{} map with various expiration options for golang

TTLCache - an in-memory cache with expiration TTLCache is a simple key/value cache in golang with the following functions: Expiration of items based o

Dec 28, 2022
Convert json string to Golang struct

json-to-go-cli Convert json string to Golang struct How to install git clone https://github.com/tiancheng92/json-to-go-cli.git cd json-to-go-cli go bu

May 10, 2022
Decode / encode XML to/from map[string]interface{} (or JSON); extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.

mxj - to/from maps, XML and JSON Decode/encode XML to/from map[string]interface{} (or JSON) values, and extract/modify values from maps by key or key-

Dec 22, 2022
Multi-String Pattern Matching Algorithm Using TrieHashNode

Multi-String Pattern Matching algorithm. This implementation is inspired from Aho-Corasick algorithm Getting Started modelA = mspm.NewModel("mspm_mode

Dec 9, 2022
Implementation of Boyer-Moore fast string search algorithm in Go

boyermoore Implementation of Boyer-Moore fast string search algorithm in Go

Oct 7, 2022
A fast (5x) string keyed read-only map for Go - particularly good for keys using a small set of nearby runes.

faststringmap faststringmap is a fast read-only string keyed map for Go (golang). For our use case it is approximately 5 times faster than using Go's

Jan 8, 2023
A memory-efficient trie for testing the existence/prefixes of string only(for now).

Succinct Trie A memory-efficient trie for testing the existence/prefixes of string only(for now). Install go get -u github.com/nobekanai/sutrie Docume

Mar 10, 2022
Algorithms and Data Structures Solved in Golang

Algorithms and Data Structures Solved in Golang Hi! I'm Bruno Melo and this repository contains a lot challenges solved on many plataforms using go as

Oct 20, 2022
Some data structures and algorithms using golang

Some data structures and algorithms using golang

Aug 13, 2022
Convert arbitrary formats to Go Struct (including json, toml, yaml, etc.)

go2struct Convert arbitrary formats to Go Struct (including json, toml, yaml, etc.) Installation Run the following command under your project: go get

Nov 15, 2022
An interesting go struct tag expression syntax for field validation, etc.

An interesting go struct tag expression syntax for field validation, etc.

Jan 8, 2023
Generic types that are missing from Go, including sets, trees, sorted lists, etc.

go-typ Generic types that are missing from Go, including sets, trees, sorted lists, etc. All code is implemented with 0 dependencies and in pure Go co

Dec 4, 2022
Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

docconv A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain tex

Jan 5, 2023