Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang

Natural Language Processing


nlp

Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing, with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.


Features

Planned

  • Expanded persistence support
  • Stemming to treat words with a common root as the same, e.g. "go" and "going"
  • Clustering algorithms, e.g. hierarchical, k-means, etc.
  • Classification algorithms, e.g. SVM, KNN, random forest, etc.

Owner
James Bowman
CTO @Fresh8 Gaming. Go | Microservices | Machine Learning | NLP
Comments
  • func CosineSimilarity() returns NaN

    Dear James Bowman, I am using your library to calculate similarity. The function CosineSimilarity() returns many NaN values, so I can't continue my work. The only file I have changed is your vectorisers.go; all my changes are as follows.

    func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error) {
    	mat := sparse.NewDOK(len(v.Vocabulary), len(docs))

    	for d, doc := range docs {
    		v.Tokeniser.ForEachIn(doc, func(word string) {
    			i, exists := v.Vocabulary[word]

    			if exists {
    				weight, weightExists := TrainingData.WeightMap[word]
    				// normal weight value: 2, unimportant weight value: 1, important weight value: 3
    				if weightExists {
    					mat.Set(i, d, mat.At(i, d)+weight)
    				} else {
    					mat.Set(i, d, mat.At(i, d)+1)
    				}
    			}
    		})
    	}
    	return mat.ToCSR(), nil
    }
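
    For reference, the usual cause of NaN from a cosine similarity is an all-zero vector: the measure divides the dot product by the product of the two vector norms, so a document or query containing no vocabulary terms yields 0/0 = NaN. A minimal guard, sketched against gonum's mat package (the zero-norm check is an addition, not the library's code):

    package main

    import (
    	"fmt"

    	"gonum.org/v1/gonum/mat"
    )

    // safeCosineSimilarity returns 0 instead of NaN when either vector has zero norm.
    func safeCosineSimilarity(a, b mat.Vector) float64 {
    	na := mat.Norm(a, 2)
    	nb := mat.Norm(b, 2)
    	if na == 0 || nb == 0 {
    		return 0 // an empty document matches nothing rather than poisoning results
    	}
    	return mat.Dot(a, b) / (na * nb)
    }

    func main() {
    	a := mat.NewVecDense(3, []float64{1, 2, 3})
    	zero := mat.NewVecDense(3, nil)
    	fmt.Println(safeCosineSimilarity(a, a))    // 1
    	fmt.Println(safeCosineSimilarity(a, zero)) // 0, not NaN
    }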

  • OCR

    This looks really nice. Thank you for making this open source.

    I am attempting to do OCR. I can identify all the letters, but then I need to check them against a word list so I can pick up where the OCR has maybe made a mistake.

    This way it can then propagate back to the OCR system to get better.

    There is no reason it couldn't also use the semantic meaning of a sentence to correct the OCR. It's one step up from just using single words.

    I don't have it up on a git repo yet, but figured it would be interesting to you. If you feel like commenting on this idea, that would be great.

    I am really curious too where you get data sources. For semantics you need training data, right?

  • Opening the NLP package to non a-Z languages and custom stopword lists

    1. I changed the vectoriser functions so that they now parse words written in any alphabet; before this, καί etc. were ignored.
    2. Because this means we need to add support for customised stopword lists, a simple boolean no longer works. I changed vectorisers.go so that it accepts stopword lists as a variable and modified its tests accordingly (see the sketch after this list).
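
    A minimal sketch of the idea (the pattern and names here are illustrative, not the package's actual implementation): match any run of Unicode letters and filter against a caller-supplied stopword set.

    package main

    import (
    	"fmt"
    	"regexp"
    	"strings"
    )

    // tokenise splits text into runs of Unicode letters and drops
    // caller-supplied stopwords, so any alphabet is handled.
    func tokenise(text string, stopWords map[string]struct{}) []string {
    	pattern := regexp.MustCompile(`\p{L}+`) // any letter, not just a-Z
    	var tokens []string
    	for _, t := range pattern.FindAllString(strings.ToLower(text), -1) {
    		if _, stop := stopWords[t]; !stop {
    			tokens = append(tokens, t)
    		}
    	}
    	return tokens
    }

    func main() {
    	stops := map[string]struct{}{"καί": {}}
    	fmt.Println(tokenise("Ἐν ἀρχῇ καί ὁ λόγος", stops)) // [ἐν ἀρχῇ ὁ λόγος]
    }
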
  • Vectorisers.go only tokenises a-Z languages

    Hi,

    Thanks for your hard work. I really like your code base. I have noticed that the package, as is, only works for languages that can be expressed in the a-Z alphabet, and the hardcoded stop words make it challenging even for historic or fringe English corpora. I have a fix for both, but did not want to open a PR without creating this issue first to see whether you want to open the project up to non-English, historic English, and non a-Z languages.

    Thanks again!

    Best,

    Thomas

  • LDA model persistence

    Thanks for this library; it seems really useful. I have been playing around with a feature-extraction pipeline of a CountVectoriser and a TF-IDF transformer feeding into an LDA transformer, but I can't seem to save the fitted pipeline to disk and reload it later to Transform new docs. Looking at the serialised pipeline in JSON, the vocabulary is there, as well as the tokeniser info and various LDA params, but I don't see the induced topics (matrices). Maybe this is a problem with the way I serialised it? If you can point to a working example of how to properly serialise a trained LDA model and re-use it later, that would be great. Thanks again!
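
    One workaround sketch in the meantime: gonum's *mat.Dense implements encoding.BinaryMarshaler, so any topic matrix you can obtain as a *mat.Dense can be persisted alongside the JSON pipeline. Whether (and how) the fitted LDA exposes its matrices depends on the library version, so treat the getter side as an assumption.

    package main

    import (
    	"os"

    	"gonum.org/v1/gonum/mat"
    )

    // saveMatrix persists a dense matrix using gonum's binary marshalling.
    func saveMatrix(path string, m *mat.Dense) error {
    	data, err := m.MarshalBinary()
    	if err != nil {
    		return err
    	}
    	return os.WriteFile(path, data, 0o644)
    }

    // loadMatrix restores a matrix previously written by saveMatrix.
    func loadMatrix(path string) (*mat.Dense, error) {
    	data, err := os.ReadFile(path)
    	if err != nil {
    		return nil, err
    	}
    	var m mat.Dense
    	if err := m.UnmarshalBinary(data); err != nil {
    		return nil, err
    	}
    	return &m, nil
    }

    func main() {
    	topics := mat.NewDense(2, 3, []float64{0.1, 0.7, 0.2, 0.5, 0.2, 0.3})
    	if err := saveMatrix("topics.bin", topics); err != nil {
    		panic(err)
    	}
    	if _, err := loadMatrix("topics.bin"); err != nil {
    		panic(err)
    	}
    }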

  • Example fails, possibly due to gonum/matrix being deprecated?

    Hello,

    When running a slightly modified version of your example, I receive the following error:

    # github.com/james-bowman/nlp
    ../../go/src/github.com/james-bowman/nlp/vectorisers.go:163: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
            *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/vectorisers.go:221: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
            *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:43: impossible type assertion:
            *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:62: cannot use sparse.NewDIA(m, weights) (type *sparse.DIA) as type mat64.Matrix in assignment:
            *sparse.DIA does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use t.transform (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
            mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                    have T() mat64.Matrix
                    want T() mat.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use mat (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
            mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                    have T() mat64.Matrix
                    want T() mat.Matrix
    ../../go/src/github.com/james-bowman/nlp/weightings.go:81: cannot use product (type *sparse.CSR) as type mat64.Matrix in return argument:
            *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                    have T() mat.Matrix
                    want T() mat64.Matrix
    

    The code of my modified example is below:

    package main
    
    import (
    	"fmt"

    	"github.com/gonum/matrix/mat64"
    	"github.com/james-bowman/nlp"
    )
    
    func main() {
    	testCorpus := []string{
    		"The quick brown fox jumped over the lazy dog",
    		"hey diddle diddle, the cat and the fiddle",
    		"the cow jumped over the moon",
    		"the little dog laughed to see such fun",
    		"and the dish ran away with the spoon",
    	}
    
    	query := "the brown fox ran around the dog"
    
    	vectoriser := nlp.NewCountVectoriser(true)
    	transformer := nlp.NewTfidfTransformer()
    
    	// set k (the number of dimensions following truncation) to 4
    	reducer := nlp.NewTruncatedSVD(4)
    
    	// Transform the corpus into an LSI fitting the model to the documents in the process
    	mat, _ := vectoriser.FitTransform(testCorpus...)
    	mat, _ = transformer.FitTransform(mat)
    	lsi, _ := reducer.FitTransform(mat)
    
    	// run the query through the same pipeline that was fitted to the corpus
    	// to project it into the same dimensional space
    	mat, _ = vectoriser.Transform(query)
    	mat, _ = transformer.Transform(mat)
    	queryVector, _ := reducer.Transform(mat)
    
    	// iterate over document feature vectors (columns) in the LSI and compare
    	// them with the query vector for similarity. Similarity is measured by the
    	// cosine of the angle between the two vectors (the cosine similarity).
    	highestSimilarity := -1.0
    	var matched int
    	_, docs := lsi.Dims()
    	for i := 0; i < docs; i++ {
    		similarity := CosineSimilarity(queryVector.(*mat64.Dense).ColView(0), lsi.(*mat64.Dense).ColView(i))
    		if similarity > highestSimilarity {
    			matched = i
    			highestSimilarity = similarity
    		}
    	}
    
    	fmt.Printf("Matched '%s'", testCorpus[matched])
    	// Output: Matched 'The quick brown fox jumped over the lazy dog'
    } 
    

    I see that gonum/matrix was deprecated a month ago in favor of gonum/gonum and wonder if that could be related.

    Thanks very much for your help!
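
    That is almost certainly the cause: recent releases of nlp (and the james-bowman/sparse package it uses) build against the new gonum.org/v1/gonum/mat API, while the example above imports the deprecated github.com/gonum/matrix/mat64, so the two Matrix interfaces cannot satisfy one another. A sketch of the minimal migration, assuming a current nlp release (the pairwise import is my assumption about where the library keeps its cosine measure):

    import (
    	"fmt"

    	"github.com/james-bowman/nlp"
    	"github.com/james-bowman/nlp/measures/pairwise"
    	"gonum.org/v1/gonum/mat" // replaces github.com/gonum/matrix/mat64
    )

    // ...and in the similarity loop, assert the new dense type:
    // similarity := pairwise.CosineSimilarity(
    // 	queryVector.(*mat.Dense).ColView(0),
    // 	lsi.(*mat.Dense).ColView(i),
    // )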

  • Interface for Tokeniser, Allow Custom Tokenisers?

    Hi James,

    Would you be open to a PR that made some changes to the Tokeniser type, and to dependent types, to allow for custom Tokenisers? This would make nlp more general for different languages, or for handling different tokenisation strategies.

    What I'm imagining is this (note: the changes are designed to avoid breaking the API):

    • Convert Tokeniser to an interface, providing ForEachIn and Tokenise methods.
    • Convert NewTokeniser to a method that returns a default implementation, which would be identical to the current implementation.
    • Add a new method, NewCustomTokeniser(tokenPattern string, stopWordList []string) *Tokeniser, which would enable easy creation of a custom tokeniser.
    • Add new constructors for CountVectoriser and HashingVectoriser to allow use of a custom Tokeniser, OR (your preference) make their vec.tokeniser field into a public field vec.Tokeniser, allowing overrides or manual construction of either.

    I could probably make the required changes quickly enough, if you're interested. :)
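
    For concreteness, a sketch of the proposed shape (signatures inferred from the current concrete type, so treat them as assumptions rather than a final design):

    // Tokeniser as an interface, satisfied by the existing default implementation.
    type Tokeniser interface {
    	// ForEachIn calls f for each token found in text.
    	ForEachIn(text string, f func(token string))
    	// Tokenise returns the tokens extracted from text.
    	Tokenise(text string) []string
    }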

  • Adding SetComponents method to RandomIndexing type

    Not sure if you would like the doc comment updated to be more descriptive, like the rest of your comments.

    I just want to add this so that I can load a pre-fitted model into the RandomIndexing transformer; a sketch of the proposed shape follows.
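
    A sketch of the shape being proposed (the field name is hypothetical; the real type may store its fitted components differently):

    // SetComponents loads a pre-fitted set of component vectors so that
    // Transform can be used without calling Fit first.
    func (r *RandomIndexing) SetComponents(m mat.Matrix) {
    	r.components = m // internal field name is an assumption
    }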

  • Replace the default regexp in the tokeniser with a more universal one

    The old regexp, "[\p{L}]+", converts documents like "os24120z R2D2" to ["os", "z", "R", "D"]. I replaced it with \S+ (not whitespace, i.e. [^\t\n\f\r ]).
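
    To make the difference concrete, a standalone demo (not the package's code):

    package main

    import (
    	"fmt"
    	"regexp"
    )

    func main() {
    	letters := regexp.MustCompile(`[\p{L}]+`) // old: runs of letters only
    	nonSpace := regexp.MustCompile(`\S+`)     // new: runs of non-whitespace
    	doc := "os24120z R2D2"
    	fmt.Println(letters.FindAllString(doc, -1))  // [os z R D]
    	fmt.Println(nonSpace.FindAllString(doc, -1)) // [os24120z R2D2]
    }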

  • [Question] Optimal number of topics in LDA

    Hi! I'm planning to use the LDA functionality and, as I read (I'm very new to the matter), Gensim has a coherence score that can be used to determine that (magic) key number. I wonder if any similar functionality is implemented in this library? I have also read that perplexity may be brought into the game to help decide the number of topics, but I'm not quite sure whether that is correct or how to use it. I would really appreciate any clarification.

    Thank you!
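
    One common, if crude, approach is to fit a model for each candidate topic count and keep the best-scoring one. The scoring function below is a placeholder for whatever coherence or perplexity measure you settle on; this sketch does not assume the library provides either.

    package main

    import (
    	"fmt"
    	"math"
    )

    // bestTopicCount evaluates each candidate k with a caller-supplied scoring
    // function (higher is better, e.g. topic coherence) and returns the best k.
    func bestTopicCount(ks []int, score func(k int) float64) (best int) {
    	bestScore := math.Inf(-1)
    	for _, k := range ks {
    		if s := score(k); s > bestScore {
    			best, bestScore = k, s
    		}
    	}
    	return best
    }

    func main() {
    	// Toy stand-in for "fit an LDA with k topics and measure its coherence".
    	score := func(k int) float64 { return -math.Abs(float64(k) - 7) }
    	fmt.Println(bestTopicCount([]int{2, 5, 7, 10, 20}, score)) // 7
    }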

  • Methods for large corpora?

    Sort of related to #8...

    You have methods in the API, like in your example, that take a slice of strings (docs).

    matrix, _ := vectoriser.FitTransform(testCorpus...)

    I'd like to use this for very large corpora, with tens or hundreds of millions of (not tiny) documents. Putting these all into a single slice of strings does not sound optimal. Any chance the methods that now take a string slice for the documents could be altered to accept a function or interface that allows iteration over all the docs? (Or new methods that support this? See the sketch below.)

    Thanks, Glen
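
    One possible shape for this, purely as a sketch of the suggestion (nothing here exists in the package today):

    // DocumentSource yields documents one at a time so that a corpus never
    // has to be materialised as a single []string in memory.
    type DocumentSource interface {
    	// Next returns the next document, with ok == false once exhausted.
    	Next() (doc string, ok bool)
    }

    // A streaming variant of the existing method could then pull documents
    // from the source instead of receiving them all at once, e.g.:
    // func (v *CountVectoriser) FitTransformFrom(src DocumentSource) (mat.Matrix, error)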

  • Online/streaming LDA?

    Is it possible to run LDA (or the other processing algorithms) in a streaming/online fashion, as is done with Gensim? It seems the current implementation would not easily support online processing, but I thought I'd bounce the question off of you since you know the internals much better.

Natural language detection package in pure Go

getlang getlang provides fast natural language detection in Go. Features Offline -- no internet connection required Supports 29 languages Provides ISO

Dec 26, 2022
Natural language detection library for Go

Whatlanggo Natural language detection for Go. Features Supports 84 languages 100% written in Go No external dependencies Fast Recognizes not only a la

Dec 28, 2022
A natural language date/time parser with pluggable rules

when when is a natural language date/time parser with pluggable rules and merge strategies Examples tonight at 11:10 pm at Friday afternoon the deadli

Dec 26, 2022
A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

Jan 4, 2023
Stemmer packages for Go programming language. Includes English, German and Dutch stemmers.

Stemmer package for Go Stemmer package provides an interface for stemmers and includes English, German and Dutch stemmers as sub-packages: porter2 sub

Dec 14, 2022
Gopher-translator - An HTTP API that accepts English words or sentences and translates them to Gopher language

Gopher Translator Service An interview assignment project. To see the full assig

Jan 25, 2022
Complete Translation - translate a document to another language

Complete Translation This project is to translate a document to another language. The initial target is English to Korean. Consider this project is no

Feb 25, 2022
Go efficient text segmentation and NLP; supports English, Chinese, Japanese and others. High-performance word segmentation for the Go language

gse Go efficient text segmentation; supports English, Chinese, Japanese and others. Simplified Chinese. Dictionary with double-array trie (Double-Array Trie) to achiev

Jan 8, 2023
Unicode transliterator for #golang

Unicode transliterator (also known as unidecode) for Go Use the following command to install gounidecode go get -u github.com/fiam/gounidecode/unideco

Sep 27, 2022
Golang implementation of the Paice/Husk Stemming Algorithm

##Golang Implementation of the Paice/Husk stemming algorithm This project was created for the QUT course INB344. Details on the algorithm can be found

Sep 27, 2022
Golang port of Petrovich - an inflector for Russian anthroponyms.

Petrovich is the library which inflects Russian names to given grammatical case. This is the Go port of https://github.com/petrovich. Installation go

Dec 25, 2022
A multilingual command line sentence tokenizer in Golang

Sentences - A command line sentence tokenizer This command line utility will convert a blob of text into a list of sentences. Demo Docs Install go get

Dec 30, 2022
Cross platform locale detection for Golang

go-locale go-locale is a Golang lib for cross platform locale detection. OS Support Support all OS that Golang supported, except android: aix: IBM AIX

Aug 20, 2022
Golang RESTful client for HanLP: Chinese word segmentation, part-of-speech tagging, named-entity recognition, dependency parsing, semantic dependency analysis, new-word discovery, keyword and phrase extraction, automatic summarisation, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing

gohanlp is the Golang interface to HanLP (Chinese word segmentation, part-of-speech tagging, named-entity recognition, dependency parsing, semantic dependency analysis, new-word discovery, keyword and phrase extraction, automatic summarisation, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing). The lightweight online RESTful API is only a few KB, suited to agile development and mobile apps; server compute is limited, so anonymous users get a small quota

Dec 16, 2022
📖 Tutorial: An easy way to translate your Golang application

📖 Tutorial: An easy way to translate your Golang application. The full article is published on April 13, 2021, on Dev.to: https://dev.to/koddr/an-e

Feb 9, 2022
i18n of golang

i18n i18n for golang. Usage: download i18n with go get https://github.com/itmisx/i18n. Define code language packs: var langPack1 = map[string]map[interface{}]interface{}{ "zh-cn": {

Dec 11, 2021
Licence-server - Building a golang Swagger API with Echo

Building a golang Swagger API with Echo Known Issues References [1] https://deve

Jan 9, 2022
Go-i18n - i18n for Golang

I18n for Go Installation go get -u github.com/fitv/go-i18n Usage YAML files ├──

Oct 18, 2022
i18n (Internationalization and localization) engine written in Go, used for translating locale strings.

go-localize Simple and easy to use i18n (Internationalization and localization) engine written in Go, used for translating locale strings. Use with go

Nov 29, 2022