Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang

Natural Language Processing


nlp

Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing, with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.


Features

Planned

  • Expanded persistence support
  • Stemming to treat words with a common root as the same, e.g. "go" and "going"
  • Clustering algorithms, e.g. hierarchical, k-means, etc.
  • Classification algorithms, e.g. SVM, KNN, random forest, etc.

Owner
James Bowman
CTO @Fresh8 Gaming. Go | Microservices | Machine Learning | NLP
Comments
  • func CosineSimilarity() returns NaN

    Dear James Bowman, I am using your library to calculate similarity. The function CosineSimilarity() returns many NaN values, so I can't continue my work. The only file I have changed is your vectorisers.go; all my changes are as follows.

    func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error) {
    	mat := sparse.NewDOK(len(v.Vocabulary), len(docs))

    	for d, doc := range docs {
    		v.Tokeniser.ForEachIn(doc, func(word string) {
    			i, exists := v.Vocabulary[word]

    			if exists {
    				weight, weightExists := TrainingData.WeightMap[word]
    				// normal weight value: 2, unimportant weight value: 1, important weight value: 3
    				if weightExists {
    					mat.Set(i, d, mat.At(i, d)+weight)
    				} else {
    					mat.Set(i, d, mat.At(i, d)+1)
    				}
    			}
    		})
    	}
    	return mat.ToCSR(), nil
    }
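
    For reference, the usual cause of NaN from a cosine similarity is an all-zero vector: the measure divides the dot product by the product of the two vector norms, so a document or query containing no vocabulary terms yields 0/0 = NaN. A minimal guard, sketched against gonum's mat package (the zero-norm check is an addition, not the library's code):

    package main

    import (
    	"fmt"

    	"gonum.org/v1/gonum/mat"
    )

    // safeCosineSimilarity returns 0 instead of NaN when either vector has zero norm.
    func safeCosineSimilarity(a, b mat.Vector) float64 {
    	na := mat.Norm(a, 2)
    	nb := mat.Norm(b, 2)
    	if na == 0 || nb == 0 {
    		return 0 // an empty document matches nothing rather than poisoning results
    	}
    	return mat.Dot(a, b) / (na * nb)
    }

    func main() {
    	a := mat.NewVecDense(3, []float64{1, 2, 3})
    	zero := mat.NewVecDense(3, nil)
    	fmt.Println(safeCosineSimilarity(a, a))    // 1
    	fmt.Println(safeCosineSimilarity(a, zero)) // 0, not NaN
    }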

  • OCR

    This looks really nice. Thank you for making this open source.

    I am attempting to do OCR. I can identify all the letters, but then I need to check them against a word list so I can pick up where the OCR has maybe made a mistake.

    This way it can then propagate back to the OCR system to get better.

    There is no reason it couldn't also use the semantic meaning of a sentence to correct the OCR. It's one step up from just using single words.

    I don't have it up on a git repo yet, but figured it would be interesting to you. If you feel like commenting on this idea, that would be great.

    I am really curious too where you get data sources. For semantics you need training data, right?

  • Opening the NLP package to non a-Z languages and custom stopword lists

    1. I changed the vectoriser functions so that they now parse words written in any alphabet; before this, καί etc. were ignored.
    2. Because this means we need to add support for customised stopword lists, a simple boolean no longer works. I changed vectorisers.go so that it accepts stopword lists as a variable and modified its tests accordingly (see the sketch after this list).
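
    A minimal sketch of the idea (the pattern and names here are illustrative, not the package's actual implementation): match any run of Unicode letters and filter against a caller-supplied stopword set.

    package main

    import (
    	"fmt"
    	"regexp"
    	"strings"
    )

    // tokenise splits text into runs of Unicode letters and drops
    // caller-supplied stopwords, so any alphabet is handled.
    func tokenise(text string, stopWords map[string]struct{}) []string {
    	pattern := regexp.MustCompile(`\p{L}+`) // any letter, not just a-Z
    	var tokens []string
    	for _, t := range pattern.FindAllString(strings.ToLower(text), -1) {
    		if _, stop := stopWords[t]; !stop {
    			tokens = append(tokens, t)
    		}
    	}
    	return tokens
    }

    func main() {
    	stops := map[string]struct{}{"καί": {}}
    	fmt.Println(tokenise("Ἐν ἀρχῇ καί ὁ λόγος", stops)) // [ἐν ἀρχῇ ὁ λόγος]
    }
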
  • Vectorisers.go only tokenises a-Z languages

    Hi,

    Thanks for your hard work. I really like your code base. I have noticed that the package, as is, only works for languages that can be expressed in the a-Z alphabet, and the hardcoded stop words make it challenging even for historic or fringe English corpora. I have a fix for both, but did not want to open a PR without creating this issue first to see whether you want to open the project up to non-English, historic English, and non a-Z languages.

    Thanks again!

    Best,

    Thomas

  • LDA model persistence

    Thanks for this library; it seems really useful. I have been playing around with a feature-extraction pipeline of a CountVectoriser and a TF-IDF transformer feeding into an LDA transformer, but I can't seem to save the fitted pipeline to disk and reload it later to Transform new docs. Looking at the serialised pipeline in JSON, the vocabulary is there, as well as the tokeniser info and various LDA params, but I don't see the induced topics (matrices). Maybe this is a problem with the way I serialised it? If you can point to a working example of how to properly serialise a trained LDA model and re-use it later, that would be great. Thanks again!
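
    One workaround sketch in the meantime: gonum's *mat.Dense implements encoding.BinaryMarshaler, so any topic matrix you can obtain as a *mat.Dense can be persisted alongside the JSON pipeline. Whether (and how) the fitted LDA exposes its matrices depends on the library version, so treat the getter side as an assumption.

    package main

    import (
    	"os"

    	"gonum.org/v1/gonum/mat"
    )

    // saveMatrix persists a dense matrix using gonum's binary marshalling.
    func saveMatrix(path string, m *mat.Dense) error {
    	data, err := m.MarshalBinary()
    	if err != nil {
    		return err
    	}
    	return os.WriteFile(path, data, 0o644)
    }

    // loadMatrix restores a matrix previously written by saveMatrix.
    func loadMatrix(path string) (*mat.Dense, error) {
    	data, err := os.ReadFile(path)
    	if err != nil {
    		return nil, err
    	}
    	var m mat.Dense
    	if err := m.UnmarshalBinary(data); err != nil {
    		return nil, err
    	}
    	return &m, nil
    }

    func main() {
    	topics := mat.NewDense(2, 3, []float64{0.1, 0.7, 0.2, 0.5, 0.2, 0.3})
    	if err := saveMatrix("topics.bin", topics); err != nil {
    		panic(err)
    	}
    	if _, err := loadMatrix("topics.bin"); err != nil {
    		panic(err)
    	}
    }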

  • Example fails, possibly due to gonum/matrix being deprecated?

    Hello,

    When running a slightly modified version of your example, I receive the following error:

    # github.com/james-bowman/nlp
    ../../go/src/github.com/james-bowman/nlp/vectorisers.go:163: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
            *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/vectorisers.go:221: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
            *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:43: impossible type assertion:
            *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:62: cannot use sparse.NewDIA(m, weights) (type *sparse.DIA) as type mat64.Matrix in assignment:
            *sparse.DIA does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use t.transform (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
            mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                    have T() mat64.Matrix
                    want T() mat.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use mat (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
            mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                    have T() mat64.Matrix
                    want T() mat.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:81: cannot use product (type *sparse.CSR) as type mat64.Matrix in return argument:
            *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    

    The code of my modified example is below:

    package main
    
    import (
    	"fmt"

    	"github.com/gonum/matrix/mat64"
    	"github.com/james-bowman/nlp"
    )
    
    func main() {
    	testCorpus := []string{
    		"The quick brown fox jumped over the lazy dog",
    		"hey diddle diddle, the cat and the fiddle",
    		"the cow jumped over the moon",
    		"the little dog laughed to see such fun",
    		"and the dish ran away with the spoon",
    	}
    
    	query := "the brown fox ran around the dog"
    
    	vectoriser := nlp.NewCountVectoriser(true)
    	transformer := nlp.NewTfidfTransformer()
    
    	// set k (the number of dimensions following truncation) to 4
    	reducer := nlp.NewTruncatedSVD(4)
    
    	// Transform the corpus into an LSI fitting the model to the documents in the process
    	mat, _ := vectoriser.FitTransform(testCorpus...)
    	mat, _ = transformer.FitTransform(mat)
    	lsi, _ := reducer.FitTransform(mat)
    
    	// run the query through the same pipeline that was fitted to the corpus
    	// to project it into the same dimensional space
    	mat, _ = vectoriser.Transform(query)
    	mat, _ = transformer.Transform(mat)
    	queryVector, _ := reducer.Transform(mat)
    
    	// iterate over document feature vectors (columns) in the LSI and compare
    	// them with the query vector for similarity. Similarity is measured by the
    	// cosine of the angle between the two vectors (the cosine similarity).
    	highestSimilarity := -1.0
    	var matched int
    	_, docs := lsi.Dims()
    	for i := 0; i < docs; i++ {
    		similarity := CosineSimilarity(queryVector.(*mat64.Dense).ColView(0), lsi.(*mat64.Dense).ColView(i))
    		if similarity > highestSimilarity {
    			matched = i
    			highestSimilarity = similarity
    		}
    	}
    
    	fmt.Printf("Matched '%s'", testCorpus[matched])
    	// Output: Matched 'The quick brown fox jumped over the lazy dog'
    } 
    

    I see that gonum/matrix was deprecated a month ago in favor of gonum/gonum and wonder if that could be related.

    Thanks very much for your help!
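
    That is almost certainly the cause: recent releases of nlp (and the james-bowman/sparse package it uses) build against the new gonum.org/v1/gonum/mat API, while the example above imports the deprecated github.com/gonum/matrix/mat64, so the two Matrix interfaces cannot satisfy one another. A sketch of the minimal migration, assuming a current nlp release (the pairwise import is my assumption about where the library keeps its cosine measure):

    import (
    	"fmt"

    	"github.com/james-bowman/nlp"
    	"github.com/james-bowman/nlp/measures/pairwise"
    	"gonum.org/v1/gonum/mat" // replaces github.com/gonum/matrix/mat64
    )

    // ...and in the similarity loop, assert the new dense type:
    // similarity := pairwise.CosineSimilarity(
    // 	queryVector.(*mat.Dense).ColView(0),
    // 	lsi.(*mat.Dense).ColView(i),
    // )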

  • Interface for Tokeniser, Allow Custom Tokenisers?

    Hi James,

    Would you be open to a PR that made some changes to the Tokeniser type, and to dependent types, to allow for custom Tokenisers? This would make nlp more general for different languages, or for handling different tokenisation strategies.

    What I'm imagining is this (note: the changes are designed to avoid breaking the API):

    • Convert Tokeniser to an interface, providing ForEachIn and Tokenise methods.
    • Convert NewTokeniser to a method that returns a default implementation, which would be identical to the current implementation.
    • Add a new method, NewCustomTokeniser(tokenPattern string, stopWordList []string) *Tokeniser, which would enable easy creation of a custom tokeniser.
    • Add new constructors for CountVectoriser and HashingVectoriser to allow use of a custom Tokeniser, OR (your preference) make their vec.tokeniser field into a public field vec.Tokeniser, allowing overrides or manual construction of either.

    I could probably make the required changes quickly enough, if you're interested. :)
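
    For concreteness, a sketch of the proposed shape (signatures inferred from the current concrete type, so treat them as assumptions rather than a final design):

    // Tokeniser as an interface, satisfied by the existing default implementation.
    type Tokeniser interface {
    	// ForEachIn calls f for each token found in text.
    	ForEachIn(text string, f func(token string))
    	// Tokenise returns the tokens extracted from text.
    	Tokenise(text string) []string
    }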

  • Adding SetComponents method to RandomIndexing type

    Not sure if you would like the doc comment updated to be more descriptive, like the rest of your comments.

    I just want to add this so that I can load a pre-fitted model into the RandomIndexing transformer; a sketch of the proposed shape follows.
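
    A sketch of the shape being proposed (the field name is hypothetical; the real type may store its fitted components differently):

    // SetComponents loads a pre-fitted set of component vectors so that
    // Transform can be used without calling Fit first.
    func (r *RandomIndexing) SetComponents(m mat.Matrix) {
    	r.components = m // internal field name is an assumption
    }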

  • Replace the default regexp in the tokeniser with a more universal one

    The old regexp, "[\p{L}]+", converts documents like "os24120z R2D2" to ["os", "z", "R", "D"]. I replaced it with \S+ (not whitespace, i.e. [^\t\n\f\r ]).
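
    To make the difference concrete, a standalone demo (not the package's code):

    package main

    import (
    	"fmt"
    	"regexp"
    )

    func main() {
    	letters := regexp.MustCompile(`[\p{L}]+`) // old: runs of letters only
    	nonSpace := regexp.MustCompile(`\S+`)     // new: runs of non-whitespace
    	doc := "os24120z R2D2"
    	fmt.Println(letters.FindAllString(doc, -1))  // [os z R D]
    	fmt.Println(nonSpace.FindAllString(doc, -1)) // [os24120z R2D2]
    }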

  • [Question] Optimal number of topics in LDA

    Hi! I'm planning to use the LDA functionality and, as I read (I'm very new to the matter), Gensim has a coherence score that can be used to determine that (magic) key number. I wonder if any similar functionality is implemented in this library? I have also read that perplexity may be brought into the game to help decide the number of topics, but I'm not quite sure whether that is correct or how to use it. I would really appreciate any clarification.

    Thank you!
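
    One common, if crude, approach is to fit a model for each candidate topic count and keep the best-scoring one. The scoring function below is a placeholder for whatever coherence or perplexity measure you settle on; this sketch does not assume the library provides either.

    package main

    import (
    	"fmt"
    	"math"
    )

    // bestTopicCount evaluates each candidate k with a caller-supplied scoring
    // function (higher is better, e.g. topic coherence) and returns the best k.
    func bestTopicCount(ks []int, score func(k int) float64) (best int) {
    	bestScore := math.Inf(-1)
    	for _, k := range ks {
    		if s := score(k); s > bestScore {
    			best, bestScore = k, s
    		}
    	}
    	return best
    }

    func main() {
    	// Toy stand-in for "fit an LDA with k topics and measure its coherence".
    	score := func(k int) float64 { return -math.Abs(float64(k) - 7) }
    	fmt.Println(bestTopicCount([]int{2, 5, 7, 10, 20}, score)) // 7
    }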

  • Methods for large corpora?

    Sort of related to #8...

    You have methods in the API, like in your example, that take a slice of strings (docs).

    matrix, _ := vectoriser.FitTransform(testCorpus...)

    I'd like to use this for very large corpora, with tens or hundreds of millions of (not tiny) documents. Putting these all into a single slice of strings does not sound optimal. Any chance the methods that now take a string slice for the documents could be altered to accept a function or interface that allows iteration over all the docs? (Or new methods that support this? See the sketch below.)

    Thanks, Glen
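
    One possible shape for this, purely as a sketch of the suggestion (nothing here exists in the package today):

    // DocumentSource yields documents one at a time so that a corpus never
    // has to be materialised as a single []string in memory.
    type DocumentSource interface {
    	// Next returns the next document, with ok == false once exhausted.
    	Next() (doc string, ok bool)
    }

    // A streaming variant of the existing method could then pull documents
    // from the source instead of receiving them all at once, e.g.:
    // func (v *CountVectoriser) FitTransformFrom(src DocumentSource) (mat.Matrix, error)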

  • Online/streaming LDA?

    Is it possible to run LDA (or the other processing algorithms) in a streaming/online fashion, as is done with Gensim? It seems the current implementation would not easily support online processing, but I thought I'd bounce the question off of you since you know the internals much better.

Natural language detection package in pure Go

getlang getlang provides fast natural language detection in Go. Features Offline -- no internet connection required Supports 29 languages Provides ISO

Dec 26, 2022
Natural language detection library for Go

Whatlanggo Natural language detection for Go. Features Supports 84 languages 100% written in Go No external dependencies Fast Recognizes not only a la

Dec 28, 2022
A natural language date/time parser with pluggable rules

when when is a natural language date/time parser with pluggable rules and merge strategies Examples tonight at 11:10 pm at Friday afternoon the deadli

Dec 26, 2022
A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

Jan 4, 2023
Stemmer packages for Go programming language. Includes English, German and Dutch stemmers.

Stemmer package for Go Stemmer package provides an interface for stemmers and includes English, German and Dutch stemmers as sub-packages: porter2 sub

Dec 14, 2022
Gopher-translator - An HTTP API that accepts English words or sentences and translates them to Gopher language

Gopher Translator Service An interview assignment project. To see the full assig

Jan 25, 2022
Complete Translation - translate a document to another language

Complete Translation This project is to translate a document to another language. The initial target is English to Korean. Consider this project is no

Feb 25, 2022
Go efficient text segmentation and NLP; supports English, Chinese, Japanese and others. High-performance word segmentation for the Go language

gse Go efficient text segmentation; supports English, Chinese, Japanese and others. Simplified Chinese. Dictionary with double-array trie (Double-Array Trie) to achiev

Jan 8, 2023
Unicode transliterator for #golang

Unicode transliterator (also known as unidecode) for Go Use the following command to install gounidecode go get -u github.com/fiam/gounidecode/unideco

Sep 27, 2022
Golang implementation of the Paice/Husk Stemming Algorithm

##Golang Implementation of the Paice/Husk stemming algorithm This project was created for the QUT course INB344. Details on the algorithm can be found

Sep 27, 2022
Golang port of Petrovich - an inflector for Russian anthroponyms.

Petrovich is the library which inflects Russian names to given grammatical case. This is the Go port of https://github.com/petrovich. Installation go

Dec 25, 2022
A multilingual command line sentence tokenizer in Golang

Sentences - A command line sentence tokenizer This command line utility will convert a blob of text into a list of sentences. Demo Docs Install go get

Dec 30, 2022
Cross platform locale detection for Golang

go-locale go-locale is a Golang lib for cross platform locale detection. OS Support Support all OS that Golang supported, except android: aix: IBM AIX

Aug 20, 2022
Golang RESTful client for HanLP: Chinese word segmentation, part-of-speech tagging, named-entity recognition, dependency parsing, semantic dependency analysis, new-word discovery, keyword and phrase extraction, automatic summarisation, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing

gohanlp is the Golang interface to HanLP (Chinese word segmentation, part-of-speech tagging, named-entity recognition, dependency parsing, semantic dependency analysis, new-word discovery, keyword and phrase extraction, automatic summarisation, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing). The lightweight online RESTful API is only a few KB, suited to agile development and mobile apps; server compute is limited, so anonymous users get a small quota

Dec 16, 2022
📖 Tutorial: An easy way to translate your Golang application

📖 Tutorial: An easy way to translate your Golang application. The full article is published on April 13, 2021, on Dev.to: https://dev.to/koddr/an-e

Feb 9, 2022
i18n of golang

i18n i18n for golang. Usage: download i18n with go get https://github.com/itmisx/i18n. Define code language packs: var langPack1 = map[string]map[interface{}]interface{}{ "zh-cn": {

Dec 11, 2021
Licence-server - Building a golang Swagger API with Echo

Building a golang Swagger API with Echo Known Issues References [1] https://deve

Jan 9, 2022
Go-i18n - i18n for Golang

I18n for Go Installation go get -u github.com/fitv/go-i18n Usage YAML files ├──

Oct 18, 2022
i18n (Internationalization and localization) engine written in Go, used for translating locale strings.

go-localize Simple and easy to use i18n (Internationalization and localization) engine written in Go, used for translating locale strings. Use with go

Nov 29, 2022