gomtch - find text even if it doesn't want to be found

Do your users have clever ways to hide some terms from you? Forbidden terms can be hard to find when the user doesn't want them to be found.

gomtch aims to help you find tokens in real-life text, offering the flexibility that most out-of-the-box algorithms lack. Ever wanted to find instances of a split word in a text corpus (s p l i t)? Most NLP algorithms require a lot of normalization, which can harm the integrity of the corpus you are working with. gomtch looks for instances of split words directly, making the whole process easier. It also takes care of the classic duplicated-character problem (reeeeal). Finally, gomtch lets you choose how to analyze a potentially dangerous text by treating special characters and digits as wildcards and letting you decide how much (%) of a term must match (e.g. h4rd matches the word hard at 90%).
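To make the split-word problem concrete, here is a minimal standalone sketch that merges runs of single-character tokens back into a word. It uses only the standard library; `joinSplitLetters` is a hypothetical helper written for illustration and is not part of the gomtch API, nor a description of how gomtch works internally.

```go
package main

import (
	"fmt"
	"strings"
)

// joinSplitLetters merges runs of two or more consecutive single-character
// tokens into one word, so "s p l i t" becomes "split".
// Hypothetical helper for illustration only; not part of gomtch.
func joinSplitLetters(text string) string {
	var out, run []string
	// flush writes the pending run to out: joined if it has several
	// letters, unchanged if it is a lone single-character token.
	flush := func() {
		if len(run) > 1 {
			out = append(out, strings.Join(run, ""))
		} else {
			out = append(out, run...)
		}
		run = run[:0]
	}
	for _, tok := range strings.Fields(text) {
		if len([]rune(tok)) == 1 {
			run = append(run, tok)
			continue
		}
		flush()
		out = append(out, tok)
	}
	flush()
	return strings.Join(out, " ")
}

func main() {
	fmt.Println(joinSplitLetters("this is a text c o r p o r a"))
	// prints "this is a text corpora"
}
```

This naive version also merges legitimate neighbouring one-letter words, which is one reason rewriting the text up front can damage it.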

https://nicolasassi.medium.com/gomtch-find-text-even-if-it-doesnt-want-to-be-found-a2229aed2a88

Installation

Just your good old go get:

$ go get github.com/nicolasassi/gomtch

(optional) To run unit tests:

$ cd $GOPATH/src/github.com/nicolasassi/gomtch
$ go test

(optional) To run benchmarks (warning: it might take some time):

$ cd $GOPATH/src/github.com/nicolasassi/gomtch
$ go test -bench=".*"

Docs

https://pkg.go.dev/github.com/nicolasassi/gomtch

API

gomtch exposes a Document interface. The purpose of a Document is to be compared with other Documents. Keep in mind that one Document can be compared with as many Documents as necessary. Call Scan() on the reference Document, passing the Documents to be compared as arguments (e.g. referenceDocument.Scan(doc1, doc2, doc3...)).

gomtch provides a variety of text normalization features. Some features already implemented are:

  • HTML parsing (remove any HTML tags and keep the text)
  • Sequential character removal (reaaal = real)
  • Upper and lower normalization
  • Unicode normalization (canção = cancao)
  • Replace any unwanted token with a regexp
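
As a rough illustration of what a few of these normalization steps do to a string, the sketch below applies lower-casing, a regexp replacement, and sequential character removal using only the standard library. The function names `collapseRepeats` and `normalize`, and the order of the steps, are assumptions made for this example; gomtch's own implementation may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// collapseRepeats removes sequential duplicates of the same character,
// so "reaaal" becomes "real". It also shortens legitimate doubles
// ("hello" becomes "helo"), which is the trade-off of this step.
func collapseRepeats(s string) string {
	var b strings.Builder
	var prev rune
	for i, r := range s {
		if i > 0 && r == prev {
			continue
		}
		b.WriteRune(r)
		prev = r
	}
	return b.String()
}

// punctuation is the same pattern used with WithReplacer in the examples.
var punctuation = regexp.MustCompile(`[\[\]()\-.,:;{}"'!?]`)

// normalize is a hypothetical pipeline: lower-case, replace punctuation
// with spaces, collapse repeated characters, then tidy whitespace.
func normalize(s string) string {
	s = strings.ToLower(s)
	s = punctuation.ReplaceAllString(s, " ")
	s = collapseRepeats(s)
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Println(normalize("This is REAAAAAAL!!")) // prints "this is real"
}
```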

The implementation of Document requires the field matchScoreFunc, which has the signature func(int, int) bool. This field determines what percentage of a token must match another token.
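
A function with that signature could look like the sketch below. The meaning of the two int arguments is not documented here, so this example assumes the first is the number of matching characters and the second is the token length; `minimumMatchScore` is a hypothetical name, and the rounding-down rule is inferred from the "90% of 7 is about 6" comment in the examples rather than taken from the gomtch source.

```go
package main

import "fmt"

// minimumMatchScore returns a func(int, int) bool that reports whether
// enough characters matched. Assumption: matched is the number of
// agreeing characters and total is the length of the token searched for.
func minimumMatchScore(percent int) func(matched, total int) bool {
	return func(matched, total int) bool {
		required := total * percent / 100 // integer division rounds down
		return matched >= required
	}
}

func main() {
	strict := minimumMatchScore(100)
	relaxed := minimumMatchScore(90)

	// "corp0ra" vs "corpora": 6 of 7 characters agree.
	fmt.Println(strict(6, 7))  // false: all 7 characters are required
	fmt.Println(relaxed(6, 7)) // true: 7*90/100 = 6 characters are required
}
```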

Note that the matching behaves differently when comparing digits, letters and special characters.

Examples

Adapted from examples_test.go:

Simple document

package main

import (
    "bytes"
    "github.com/nicolasassi/gomtch"
    "log"
)

func main() {
    text := []byte("this is a text c o r p o r a")
    tokenToFind := []byte("corpora")
    corp, err := gomtch.NewDoc(bytes.NewReader(text))
    if err != nil {
        log.Fatal(err)
    }
    match, err := gomtch.NewDoc(bytes.NewReader(tokenToFind))
    if err != nil {
        log.Fatal(err)
    }
    for index, match := range corp.Scan(match) {
        log.Printf("index: %v match: %s", index, string(match))
    }
}

Playing with matching scores

package main

import (
  "bytes"
  "github.com/nicolasassi/gomtch"
  "log"
)

func main() {
  text := []byte("this is a text corp0ra")
  tokenToFind := []byte("corpora")
  corp, err := gomtch.NewDoc(bytes.NewReader(text))
  if err != nil {
    log.Fatal(err)
  }
  // this will not match because the default minimum match score of NewDoc is 100 and
  // "corp0ra" != "corpora"
  match1, err := gomtch.NewDoc(bytes.NewReader(tokenToFind))
  if err != nil {
    log.Fatal(err)
  }
  // this will match because 90% of len(tokenToFind) is roughly 6, which leaves room for
  // one non-matching letter.
  match2, err := gomtch.NewDoc(bytes.NewReader(tokenToFind), gomtch.WithMinimumMatchScore(90))
  if err != nil {
    log.Fatal(err)
  }
  for index, match := range corp.Scan(match1, match2) {
    log.Printf("index: %v match: %s", index, string(match))
  }
}

Complex document and conditions

package main

import (
  "bytes"
  "github.com/nicolasassi/gomtch"
  "log"
  "regexp"
)

func main() {
  // HTML input: WithHMTLParsing strips the tags and keeps the text.
  text := []byte("<p>This is REAAAAAAL WORLD example of a téxt quite h4rd to match!!</p>")
  corp, err := gomtch.NewDoc(bytes.NewReader(text),
    gomtch.WithSetLower(),
    gomtch.WithSequentialEqualCharsRemoval(),
    gomtch.WithHMTLParsing(),
    gomtch.WithReplacer(regexp.MustCompile(`[\[\]()\-.,:;{}"'!?]`), " "),
    gomtch.WithTransform(gomtch.NewASCII()))
  if err != nil {
    log.Fatal(err)
  }
  // this will match because sequential equal characters are removed and the
  // text is converted to lower case, so REAAAAAAL becomes real
  match1, err := gomtch.NewDoc(bytes.NewReader([]byte("real world")))
  if err != nil {
    log.Fatal(err)
  }
  // this will match because we allow each token to have a 90% minimum match score.
  match2, err := gomtch.NewDoc(bytes.NewReader([]byte("text quite hard to match")), gomtch.WithMinimumMatchScore(90))
  if err != nil {
    log.Fatal(err)
  }
  for index, match := range corp.Scan(match1, match2) {
    log.Printf("index: %v match: %s", index, string(match))
  }
}

Support

There are a number of ways you can support the project:

  • Use it, star it, build something with it, spread the word!
  • Raise issues to improve the project (note: doc typos and clarifications are issues too!)
    • Please search existing issues before opening a new one - it may have already been addressed.
  • Pull requests: please discuss new code in an issue first, unless the fix is really trivial.
    • Make sure new code is tested.
    • Be mindful of existing code - PRs that break existing code have a high probability of being declined, unless they fix a serious issue.

License

The BSD 3-Clause license, the same as the Go language.

Owner
Nicolas Augusto Sassi