Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Go-edlib : Edit distance and string comparison library

Travis CI Test coverage Go Report Card License: MIT Documentation link PkgGoDev

Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...


Table of Contents


Requirements

  • Go (v1.13+)

Introduction

Golang open-source library which includes most (and soon all) edit-distance and string comparision algorithms with some extra!
Designed to be fully compatible with Unicode characters!
This library is 100% test covered 😁

Features

  • Levenshtein

  • LCS (Longest common subsequence) with edit distance, backtrack and diff functions

  • Hamming

  • Damerau-Levenshtein, with following variants :

    • OSA (Optimal string alignment)
    • Adjacent transpositions
  • Jaro & Jaro-Winkler similarity algorithms

  • Cosine Similarity algorithm to compare strings

  • Computed similarity percentage functions based on all available edit distance algorithms in this lib

  • Fuzzy search functions based on edit distance with unique or multiples strings output

  • Unicode compatibility ! 🥳

  • And many more to come !

Benchmarks

You can check an interactive Google chart with few benchmark cases for all similarity algorithms in this library through StringSimilarity function here

However, if you want or need more details, you can also viewing benchmark raw output here, which also includes memory allocations and test cases output (similarity result and errors).

If you are on Linux and want to run them on your setup, you can run ./tests/benchmark.sh script.

Installation

Open bash into your project folder and run:

go get github.com/hbollon/go-edlib

And import it into your project:

import (
	"github.com/hbollon/go-edlib"
)

Run tests

If you are on Linux and want to run all unit tests just run ./tests/tests.sh script.

For Windows users you can run:

go test ./... # Add desired parameters to this command if you want

Documentation

You can find all the documentation here : Documentation

Examples

Calculate string similarity index between two string

You can use StringSimilarity(str1, str2, algorithm) function. algorithm parameter must one of the following constants:

// Algorithm identifiers
const (
	Levenshtein Algorithm = iota
	DamerauLevenshtein
	OSADamerauLevenshtein
	Lcs
	Hamming
	Jaro
	JaroWinkler
	Cosine
)

Example with levenshtein:

res, err := edlib.StringsSimilarity("string1", "string2", edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Similarity: %f", res)
}

Execute fuzzy search based on string similarity algorithm

1. Most matching unique result without threshold

You can use FuzzySearch(str, strList, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearch("testnig", strList, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result: %s", res)
}
Result: testing 

2. Most matching unique result with threshold

You can use FuzzySearchThreshold(str, strList, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchThreshold("testnig", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig': %s", res)
}

res, err = edlib.FuzzySearchThreshold("hello", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'hello': %s", res)
}
Result for 'testnig': testing
Result for 'hello':

3. Most matching result set without threshold

You can use FuzzySearchSet(str, strList, resultQuantity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSet("testnig", strList, 3, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Results: %s", strings.Join(res, ", "))
}
Results: testing, test, tester 

4. Most matching result set with threshold

You can use FuzzySearchSetThreshold(str, strList, resultQuantity, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.5, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.5' threshold: %s", strings.Join(res, " "))
}

res, err = edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.7' threshold: %s", strings.Join(res, " "))
}
Result for 'testnig' with '0.5' threshold: testing test tester
Result for 'testnig' with '0.7' threshold: testing

Get raw edit distance (Levenshtein, LCS, Damerau–Levenshtein, Hamming)

You can use one of the following function to get an edit distance between two strings :

Example with Levenshtein distance:

res := edlib.LevenshteinDistance("kitten", "sitting")
fmt.Printf("Result: %d", res)
Result: 3

LCS, LCS Backtrack and LCS Diff

1. Compute LCS(Longuest Common Subsequence) between two strings

You can use LCS(str1, str2) function.

lcs := edlib.LCS("ABCD", "ACBAD")
fmt.Printf("Length of their LCS: %d", lcs)
Length of their LCS: 3

2. Backtrack their LCS

You can use LCSBacktrack(str1, str2) function.

res, err := edlib.LCSBacktrack("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", res)
}
LCS: ABD

3. Backtrack all their LCS

You can use LCSBacktrackAll(str1, str2) function.

res, err := edlib.LCSBacktrackAll("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", strings.Join(res, ", "))
}
LCS: ABD, ACD

4. Get LCS Diff between two strings

You can use LCSDiff(str1, str2) function.

res, err := edlib.LCSDiff("computer", "houseboat")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: \n%s\n%s", res[0], res[1])
}
LCS Diff: 
 h c o m p u s e b o a t e r
 + -   - -   + + + + +   - -

Author

👤 Hugo Bollon

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page.

Show your support

Give a ⭐️ if this project helped you!

📝 License

Copyright © 2020 Hugo Bollon.
This project is MIT License licensed.

Owner
Hugo Bollon
Student at University Savoie Mont-Blanc in CS Master. Plan to become DevOps Engineer. I don't know everything but I can learn anything
Hugo Bollon
Comments
  • Missing methods for string shingling

    Missing methods for string shingling

    Missing methods for string shingling(https://en.wikipedia.org/wiki/W-shingling) aka n-grams for strings. In general, Cosine similarity and Jaccard index, among other methods, are computed on n-grams of the string, instead of the string itself(eg: https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/). Right now, we just split the string on spaces and compute the index based on the words in the string, which is fine but is probably not technically correct. I propose we introduce a method to shingle the given strings so that we can later use the shingled string in finding cosine similarity, etc. This has been done before here: Java: https://github.com/tdebatty/java-string-similarity Julia: https://github.com/matthieugomez/StringDistances.jl

    I think we should introduce a function that does something like:

    // Function signature
    shingle(str string, n int) map[string]int
    // Example usage:
    shingled_s1 := shingle("ABCD", 2) // returns map[AB:1 BC:1 CD:1]
    shingled_s2 := shingle("ABCE", 2) // returns map[AB:1 BC:1 CE:1]
    

    This will enable us to use these maps to later compute stuff, eg:

    res := JaccardSimilarityWithShingles(shingled_s1, shingled_s2) // should return 0.5.
    
  • Add Shingle and ShingleSlidingWindow

    Add Shingle and ShingleSlidingWindow

    Fixes #10 Added two methods, one that slices length n each time and the other uses sliding window. The first one was slightly faster for all values I checked, so I think we should remove the sliding window method. Sorry for the delay, I wanted to benchmark them both properly but couldn't find the time.

  • Q-gram index and Sorensen-Dice coefficient

    Q-gram index and Sorensen-Dice coefficient

    Currently, we do not have methods for Q-gram and Sorensen-Dice coefficient(https://en.wikipedia.org/wiki/Sørensen–Dice_coefficient) which are fairly simple to implement since they're based on n-grams.

  • LCSBacktrackAll not work

    LCSBacktrackAll not work

    My code:

    ress,err:=edlib.LCSBacktrackAll(" 是ab是cde22f123g", "222222是ab是cd123") if err!=nil{ fmt.Println(err) }else { fmt.Printf("%s", strings.Join(ress, "\r\n")) }

    Shoud be print ` 是ab是cd

    123 `

    But is print 是ab是cd123

  • Jaccard Similarity missing

    Jaccard Similarity missing

    Jaccard similarity coefficient(https://en.wikipedia.org/wiki/Jaccard_index) is a simple way to measure string similarity. I cannot find a function implementing this, is there one? If it is isn't there yet, it can be easily implemented using existing functions from cosine.go

  • reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m*n))

    reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m*n))

    Hi,this is the pr for issue. I run the benchmark on OSADamerauLevenshtein. Here is the result. | name | before | after | alloc_before | alloc_after | | :-----:| :----: | :----: | :----: | :----: | | sameLengthStringInput | 3389 ns/op |2416 ns/op |23 allocs/op |4 allocs/op | | pneumonoultramicroscopi | 15462 ns/op| 10253 ns/op |51 allocs/op| 8 allocs/op | |こにんちこにんちこに | 4027 ns/op | 3342 ns/op |18 allocs/op | 4 allocs/op| | I_love_horror_movi | 4282 ns/op | 3073 ns/op | 22 allocs/op | 4 allocs/op |

  • Reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n))

    Reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n))

    Discussed in https://github.com/hbollon/go-edlib/discussions/15

    Originally posted by tang-hi July 2, 2022 When we calculate the distance of OSADamerau–Levenshtein, we actually calculate the matrix with dimension(m*n) Suppose we have a matrix looks like

    1656691342850 the value in (3,0) is always 3 the value in (3,1) depends on (3,0) (2,0) (2,1) the value in (3,2) depends on (3,1) (2,1) (2,2) and (1,1) we can found that the value in (row : i) only depends values in (row:i-1,i-2) eg. the value in (row : 3) only depends values in (row:2,1)

    So any row number >= 3, we could write it in (row %3, j), eg. write the value of (3,0) in (0,0) .... (3,j) in (0,j) write the value of (4,0) in (1,0) .... (4,j) in (1,j) write the value of (i,0) in (i%3,0) .... (i,j) in (i%3,j)

    In this way, we only need the matrix with dimension(3, min(m,n)) to calculate OSADamerau–Levenshtein distance Here is the code to illustrate my idea

  • chore: release 1.5.0

    chore: release 1.5.0

  • StringsSimilarity is broken for unicode

    StringsSimilarity is broken for unicode

    The test is failing

    func TestStringsSimilarity(t *testing.T) {
    	str1 := "abcde"
    	str2 := "бвгдж"
    	str3 := "fghjk"
    	sim12, _ := edlib.StringsSimilarity(str1, str2, edlib.Levenshtein)
    	sim13, _ := edlib.StringsSimilarity(str1, str3, edlib.Levenshtein)
    	if sim12 != sim13 {
    		t.Errorf("different similarities, %v != %v", sim12, sim13)
    	}
    }
    

    output: different similarities, 0.5 != 0

    Similarity between "abcde" and "бвгдж" should be the same as between "abcde" and "fghjk". Looks like bug in the matchingIndex() because it compares length of strings but ...Distance() functions use runes.

  • chore: release 1.6.0

    chore: release 1.6.0

    :robot: I have created a release *beep* *boop*

    1.6.0 (2022-07-03)

    Features

    • reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n)) (#17) (37da2e1)

    This PR was generated with Release Please. See documentation.

GNU Aspell spell checking library bindings for Go (golang)

Aspell library bindings for Go GNU Aspell is a spell checking tool written in C/C++. This package provides simplified Aspell bindings for Go. It uses

Nov 14, 2022
A Go package for n-gram based text categorization, with support for utf-8 and raw text

A Go package for n-gram based text categorization, with support for utf-8 and raw text. To do: write documentation make it faster Keywords: text categ

Nov 28, 2022
Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Go-edlib : Edit distance and string comparison library Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Ham

Dec 20, 2022
Levenshtein distance and similarity metrics with customizable edit costs and Winkler-like bonus for common prefix.

A Go package for calculating the Levenshtein distance between two strings This package implements distance and similarity metrics for strings, based o

Dec 15, 2022
Some algorithms in go: maxflow(min-cuts or graph-cuts), edit-distance.

Algorithms In this repository, some algorithms are implemented in go language. GoDoc link: ed maxflow About Max-flow problem: A flow network is repres

Sep 8, 2022
Levenshtein distance for golang

levenshtein An implementation of the Levenshtein distance for Golang Sources Levenshtein distance Golang Install go get github.com/gitchander/levensht

Sep 26, 2022
Go implementation to calculate Levenshtein Distance.

levenshtein Go package to calculate the Levenshtein Distance The library is fully capable of working with non-ascii strings. But the strings are not n

Dec 14, 2022
Inflection is a string transformation library. It transforms strings from CamelCase to underscored string.

Inflection Inflection is a string transformation library. It transforms strings from CamelCase to underscored string. This is an implement of Inflecti

Jul 25, 2022
View, edit, and save text files via http to the file system.

go-wiki View, edit, and save text files via http to the file system. (DONE) https://golang.org/doc/articles/wiki/ Instructions go run main.go In a web

Nov 25, 2021
go-editor is the clean go module that refractors from Kubernetes to help you edit resources in a command-line way.

go-editor The source code of go-editor comes from Kubernetes and refractor as the clean Go module. You can embed go-editor in your command-line tool l

Dec 5, 2021
Todo-cmd: an app you can add your tasks , edit or delete them

TODO CMD APP! ??‍♂️ Table of contents General info Update Requirements set-up usage General info todo-cmd is an app you can add your tasks , edit or d

Dec 13, 2021
Golang metrics for calculating string similarity and other string utility functions

strutil strutil provides string metrics for calculating string similarity as well as other string utility functions. Full documentation can be found a

Jan 3, 2023
✨ #PTerm is a modern go module to beautify console output. Featuring charts, progressbars, tables, trees, and many more 🚀 It's completely configurable and 100% cross-platform compatible.
✨ #PTerm is a modern go module to beautify console output. Featuring charts, progressbars, tables, trees, and many more 🚀 It's completely configurable and 100% cross-platform compatible.

?? PTerm | Pretty Terminal Printer A golang module to print pretty text Show Demo Code PTerm.sh | Installation | Documentation | Quick Start | Example

Dec 27, 2022
Headless CMS with automatic JSON API. Featuring auto-HTTPS from Let's Encrypt, HTTP/2 Server Push, and flexible server framework written in Go.
Headless CMS with automatic JSON API. Featuring auto-HTTPS from Let's Encrypt, HTTP/2 Server Push, and flexible server framework written in Go.

Ponzu Watch the video introduction Ponzu is a powerful and efficient open-source HTTP server framework and CMS. It provides automatic, free, and secur

Dec 28, 2022
⚗ The most advanced CLI template on earth! Featuring automatic releases, website generation and a custom CI-System out of the box.
⚗ The most advanced CLI template on earth! Featuring automatic releases, website generation and a custom CI-System out of the box.

cli-template ✨ ⚗ A template for beautiful, modern, cross-platform compatible CLI tools written with Go! Getting Started | Wiki This template features

Dec 4, 2022
Jaken - A general purpose IRC bot featuring user acls and arbitrary plugins

Design principles This bot is based on the premise of a loosely coupling between

Jul 21, 2022
"We will game" blockchain featuring the "Willgame coin".

We will game "We will game" blockchain featuring the "Willgame coin". Our vision is to become the number one design company for play to earn decentral

Jan 2, 2022
Recursively searches a map[string]interface{} structure for another map[string]interface{} structure

msirecurse Recursively searches a map[string]interface{} structure for existence of a map[string]interface{} structure Motivation I wrote this package

Mar 3, 2022
Extremely flexible golang deep comparison, extends the go testing package and tests HTTP APIs
Extremely flexible golang deep comparison, extends the go testing package and tests HTTP APIs

go-testdeep Extremely flexible golang deep comparison, extends the go testing package. Latest news Synopsis Description Installation Functions Availab

Dec 22, 2022
Extremely flexible golang deep comparison, extends the go testing package, tests HTTP APIs and provides tests suite
Extremely flexible golang deep comparison, extends the go testing package, tests HTTP APIs and provides tests suite

go-testdeep Extremely flexible golang deep comparison, extends the go testing package. Latest news Synopsis Description Installation Functions Availab

Jan 5, 2023