Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang

Natural Language Processing


nlp

Implementations of selected machine learning algorithms for natural language processing in golang. The primary focus for the package is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.


Features

Planned

  • Expanded persistence support
  • Stemming to treat words with common root as the same e.g. "go" and "going"
  • Clustering algorithms e.g. Hierarchical, K-means, etc.
  • Classification algorithms e.g. SVM, KNN, random forest, etc.

References

  1. Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
  2. Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
  3. Thomo, Alex. Latent Semantic Analysis (Tutorial).
  4. Latent Semantic Indexing. Stanford NLP Course
  5. Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms" in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing - STOC ’02, 2002, p. 380.
  6. M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” Proc. 14th Int. Conf. World Wide Web - WWW ’05, p. 651, 2005.
  7. A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” VLDB ’99 Proc. 25th Int. Conf. Very Large Data Bases, vol. 99, no. 1, pp. 518–529, 1999.
  8. Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). Random Indexing of Text Samples for Latent Semantic Analysis
  9. Rangan, Venkat. Discovery of Related Terms in a corpus using Reflective Random Indexing
  10. Vasuki, Vidya and Cohen, Trevor. Reflective random indexing for semi-automatic indexing of the biomedical literature
  11. QasemiZadeh, Behrang and Handschuh, Siegfried. Random Indexing Explained with High Probability
  12. Foulds, James; Boyles, Levi; Dubois, Christopher; Smyth, Padhraic; Welling, Max (2013). Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation
Owner
James Bowman
CTO @Fresh8 Gaming. Go | Microservices | Machine Learning | NLP
Comments
  • func CosineSimilarity() returns NaN

    Dear James Bowman, I use your library to calculate similarity. The function CosineSimilarity() returns many NaN values, so I can't continue my work. I have only changed your vectorisers.go file. All my changes are as follows.

    func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error) {
    	mat := sparse.NewDOK(len(v.Vocabulary), len(docs))
    
    	for d, doc := range docs {
    		v.Tokeniser.ForEachIn(doc, func(word string) {
    			i, exists := v.Vocabulary[word]
    			if exists {
    				// normal weight value: 2, unimportant weight value: 1, important weight value: 3
    				weight, weightExists := TrainingData.WeightMap[word]
    				if weightExists {
    					mat.Set(i, d, mat.At(i, d)+weight)
    				} else {
    					mat.Set(i, d, mat.At(i, d)+1)
    				}
    			}
    		})
    	}
    	return mat.ToCSR(), nil
    }
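
The NaN values almost certainly come from a division by zero: cosine similarity divides the dot product by the product of the vector magnitudes, and a document or query vector with no in-vocabulary terms has magnitude zero. A minimal standalone sketch of a zero-guarded cosine similarity (not the package's own implementation):

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of a and b, returning 0 rather
// than NaN when either vector has zero magnitude.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0 // an all-zero vector has no direction to compare
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(cosine([]float64{3, 4}, []float64{3, 4})) // 1
	fmt.Println(cosine([]float64{3, 4}, []float64{0, 0})) // 0, not NaN
}
```

Checking for an all-zero query vector before comparing (or skipping such documents) avoids the NaN without touching the vectoriser.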

  • OCR

    This looks really nice. Thank you for making this open source.

    I am attempting to do OCR. I can identify all the letters, but then I need to check them against a word list so I can pick up where the OCR may have made a mistake.

    This way it can then propagate back to the OCR system to improve.

    There is also no reason why it can't use the semantic meaning of a sentence to correct the OCR. It's one step up from just using single words.

    I don't have it up on a git repo yet, but figured it would be interesting to you. If you feel like commenting on this idea, that would be great.

    I am really curious where you get data sources. For semantics you need training data, right?
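
The word-list check described above can be sketched with a plain Levenshtein distance: pick the dictionary word with the smallest edit distance from the OCR token. The helper names here are illustrative, not part of nlp:

```go
package main

import "fmt"

// levenshtein computes the edit distance between two strings,
// using two rolling rows of the standard DP table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// correct returns the dictionary word closest to the OCR token.
func correct(token string, dict []string) string {
	best, bestDist := token, len(token)+1
	for _, w := range dict {
		if d := levenshtein(token, w); d < bestDist {
			best, bestDist = w, d
		}
	}
	return best
}

func main() {
	dict := []string{"fox", "dog", "laughed"}
	fmt.Println(correct("laugbed", dict)) // laughed
}
```

Semantic context (as the comment suggests) would then re-rank candidates that are tied on edit distance.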

  • Opening the NLP package to non a-Z languages and custom stopword-lists

    1. I changed the vectoriser functions so that they now parse words written in any alphabet. Before this, καί etc. was ignored.
    2. Because this means we need to add support for customised stop-word lists, a simple boolean does not work any more. I changed vectorisers.go so that it accepts stop-word lists as a variable and modified its tests accordingly.
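
A sketch of what a caller-supplied stop-word list might look like in place of the boolean flag; newStopFilter is an illustrative name, not the package's API:

```go
package main

import (
	"fmt"
	"strings"
)

// newStopFilter builds a predicate reporting whether a token is a
// stop word, from a caller-supplied list in any alphabet.
func newStopFilter(stopWords []string) func(string) bool {
	set := make(map[string]struct{}, len(stopWords))
	for _, w := range stopWords {
		set[strings.ToLower(w)] = struct{}{}
	}
	return func(token string) bool {
		_, stop := set[strings.ToLower(token)]
		return stop
	}
}

func main() {
	isStop := newStopFilter([]string{"καί", "the"})
	fmt.Println(isStop("καί"), isStop("λόγος")) // true false
}
```

A map-backed set keeps the lookup O(1) per token regardless of the list's language or size.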
  • Vectorisers.go only tokenises a-Z languages.

    Hi,

    Thanks for your hard work. I really like your code base. I have noticed that the package as-is only works for languages that can be expressed in a-Z alphabets and, in addition, the hardcoded stop words make it a bit challenging for even historic or fringe English corpora. I have a fix for both, but did not want to open a PR without creating the issue first to see if you want to open up the project to non-English, historic English, and non a-Z languages.

    Thanks again!

    Best,

    Thomas

  • LDA model persistence

    Thanks for this library; it seems really useful. I have been playing around with a feature extractor pipeline of CountVectoriser and TF-IDF transformer feeding into an LDA transformer, but I can't seem to save the fitted pipeline to disk and reload it later to Transform new docs. Looking at the serialised pipeline in JSON, the vocabulary is there, as well as the tokeniser info and various LDA params, but I don't see the induced topics (matrices). Maybe this is a problem with the way I serialised it? If you can point to a working example of how to properly serialise a trained LDA model and re-use it later, that would be great. Thanks again!

  • Example fails, possibly due to gonum/matrix being deprecated?

    Hello,

    When running a slightly modified version of your example, I receive the following error:

    # github.com/james-bowman/nlp
    ../../go/src/github.com/james-bowman/nlp/vectorisers.go:163: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
            *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/vectorisers.go:221: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
            *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:43: impossible type assertion:
            *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:62: cannot use sparse.NewDIA(m, weights) (type *sparse.DIA) as type mat64.Matrix in assignment:
            *sparse.DIA does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use t.transform (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
            mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                    have T() mat64.Matrix
                    want T() mat.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use mat (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
            mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                    have T() mat64.Matrix
                    want T() mat.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:81: cannot use product (type *sparse.CSR) as type mat64.Matrix in return argument:
            *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    

    The code of my modified example is below:

    package main
    
    import (
    	"fmt"
    
    	"github.com/gonum/matrix/mat64"
    	"github.com/james-bowman/nlp"
    )
    
    func main() {
    	testCorpus := []string{
    		"The quick brown fox jumped over the lazy dog",
    		"hey diddle diddle, the cat and the fiddle",
    		"the cow jumped over the moon",
    		"the little dog laughed to see such fun",
    		"and the dish ran away with the spoon",
    	}
    
    	query := "the brown fox ran around the dog"
    
    	vectoriser := nlp.NewCountVectoriser(true)
    	transformer := nlp.NewTfidfTransformer()
    
    	// set k (the number of dimensions following truncation) to 4
    	reducer := nlp.NewTruncatedSVD(4)
    
    	// Transform the corpus into an LSI fitting the model to the documents in the process
    	mat, _ := vectoriser.FitTransform(testCorpus...)
    	mat, _ = transformer.FitTransform(mat)
    	lsi, _ := reducer.FitTransform(mat)
    
    	// run the query through the same pipeline that was fitted to the corpus
    	// to project it into the same dimensional space
    	mat, _ = vectoriser.Transform(query)
    	mat, _ = transformer.Transform(mat)
    	queryVector, _ := reducer.Transform(mat)
    
    	// iterate over document feature vectors (columns) in the LSI and compare with the
    	// query vector for similarity.  Similarity is determined by the difference between
    	// the angles of the vectors known as the cosine similarity
    	highestSimilarity := -1.0
    	var matched int
    	_, docs := lsi.Dims()
    	for i := 0; i < docs; i++ {
    		similarity := CosineSimilarity(queryVector.(*mat64.Dense).ColView(0), lsi.(*mat64.Dense).ColView(i))
    		if similarity > highestSimilarity {
    			matched = i
    			highestSimilarity = similarity
    		}
    	}
    
    	fmt.Printf("Matched '%s'", testCorpus[matched])
    	// Output: Matched 'The quick brown fox jumped over the lazy dog'
    } 
    

    I see that gonum/matrix was deprecated a month ago in favor of gonum/gonum and wonder if that could be related.

    Thanks very much for your help!

  • Interface for Tokeniser, Allow Custom Tokenisers?

    Hi James,

    Would you be open to a PR that made some changes to the Tokeniser type, and to dependent types, to allow for custom Tokenisers? This would make nlp more general for different languages, or for handling different tokenisation strategies.

    What I'm imagining is this (note, the workflow is designed to avoid breaking API changes):

    • Convert Tokeniser to an interface, providing ForEachIn and Tokenise methods.
    • Convert NewTokeniser to a method that returns a default implementation, which would be identical to the current implementation.
    • Add a new method, NewCustomTokeniser(tokenPattern string, stopWordList []string) *Tokeniser, which would enable easy creation of a custom tokeniser.
    • Add new constructors for CountVectoriser and HashingVectoriser to allow use of a custom Tokeniser, OR (your preference) make their vec.tokeniser field into a public field vec.Tokeniser, allowing overrides or manual construction of either.

    I could probably make the required changes quickly enough, if you're interested. :)
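
The proposal above might look roughly like this; the regexpTokeniser implementation is an illustrative sketch, and the stop-word parameter from the proposed NewCustomTokeniser signature is omitted for brevity:

```go
package main

import (
	"fmt"
	"regexp"
)

// Tokeniser is the proposed interface: any implementation could
// drive a vectoriser.
type Tokeniser interface {
	ForEachIn(text string, f func(token string))
	Tokenise(text string) []string
}

// regexpTokeniser sketches a default implementation backed by a
// configurable token pattern.
type regexpTokeniser struct {
	pattern *regexp.Regexp
}

// NewCustomTokeniser builds a tokeniser from a caller-supplied pattern.
func NewCustomTokeniser(tokenPattern string) (Tokeniser, error) {
	re, err := regexp.Compile(tokenPattern)
	if err != nil {
		return nil, err
	}
	return &regexpTokeniser{pattern: re}, nil
}

func (t *regexpTokeniser) Tokenise(text string) []string {
	return t.pattern.FindAllString(text, -1)
}

func (t *regexpTokeniser) ForEachIn(text string, f func(string)) {
	for _, tok := range t.Tokenise(text) {
		f(tok)
	}
}

func main() {
	tok, _ := NewCustomTokeniser(`[\p{L}]+`)
	fmt.Println(tok.Tokenise("the quick brown fox"))
}
```

Because the interface mirrors the existing methods, a default implementation keeps current callers working unchanged.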

  • Adding SetComponents method to RandomIndexing type

    Not sure if you would like the comment updated to be more descriptive, as the rest of your comments are.

    Just wanting to add this so that I can load a pre-fitted model into the RandomIndexer.

  • Replace default regexp in tokeniser with a more universal one

    The old regexp, "[\p{L}]+", converts documents like "os24120z R2D2" into ["os", "z", "R", "D"]. Replaced it with \S (not whitespace, i.e. [^\t\n\f\r ]).

  • [Question] Optimal number of topics in LDA

    Hi! I'm planning to use the LDA functionality and, as I read (I'm very new to the matter), in Gensim there's a coherence score that could be used to determine that (magic) key number. I wonder if there's any similar functionality implemented in the library? I have also read that perplexity may be added to the game to help decide the number of topics, but I'm not quite sure whether that is correct or how to use it. I would really appreciate any clarification.

    Thank you!

  • Methods for large corpora?

    Sort of related to #8...

    You have methods in the API, like in your example, that take an array of strings (docs).

    matrix, _ := vectoriser.FitTransform(testCorpus...)

    I'd like to use this for very large corpora, with 10s or 100s of millions of (not tiny) documents. Putting these all into a single array of strings does not sound optimal. Any chance the methods that now have a string array parameter for the documents could be altered to take in a function or interface that could allow iteration to get all the docs? (Or new methods that support this?)

    Thanks, Glen
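
One shape such an API could take is a pull-based iterator, sketched below; DocumentSource and sliceSource are illustrative names, not part of the package:

```go
package main

import "fmt"

// DocumentSource is a hypothetical iteration interface: rather than
// materialising the whole corpus as []string, a vectoriser could
// pull documents one at a time.
type DocumentSource interface {
	Next() (doc string, ok bool)
}

// sliceSource adapts an in-memory slice; a real source could read
// from files or a database cursor instead.
type sliceSource struct {
	docs []string
	i    int
}

func (s *sliceSource) Next() (string, bool) {
	if s.i >= len(s.docs) {
		return "", false
	}
	doc := s.docs[s.i]
	s.i++
	return doc, true
}

func main() {
	var src DocumentSource = &sliceSource{docs: []string{"doc one", "doc two"}}
	n := 0
	for doc, ok := src.Next(); ok; doc, ok = src.Next() {
		_ = doc // a vectoriser would tokenise and count here
		n++
	}
	fmt.Println(n) // 2
}
```

Because only one document is held in memory at a time, this shape scales to corpora far larger than RAM.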

  • Online/streaming LDA?

    Is it possible to run LDA (or other processing algorithms) in a streaming/online fashion, such as is done with gensim? It seems that this would not easily support online processing, but I thought I'd bounce the question off of you since you know the internals much better.
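
Short of a true online implementation, a corpus can at least be partitioned into the mini-batches an online variational update would consume; the update step itself is the hard part and is not sketched here:

```go
package main

import "fmt"

// batches splits a corpus into fixed-size mini-batches, the shape
// an online/streaming LDA update loop would iterate over.
func batches(corpus []string, size int) [][]string {
	var out [][]string
	for start := 0; start < len(corpus); start += size {
		end := start + size
		if end > len(corpus) {
			end = len(corpus)
		}
		out = append(out, corpus[start:end])
	}
	return out
}

func main() {
	corpus := []string{"a", "b", "c", "d", "e"}
	for _, b := range batches(corpus, 2) {
		fmt.Println(len(b)) // batch sizes: 2, 2, 1
	}
}
```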
