Go efficient text segmentation and NLP; supports English, Chinese, Japanese, and more. High-performance word segmentation in Go.

gse

Go efficient text segmentation; supports English, Chinese, Japanese, and more.


Simplified Chinese README

The dictionary is implemented with a double-array trie; the segmenter computes the shortest path based on word frequency plus dynamic programming, and also provides DAG and HMM word segmentation.

Supports multiple word segmentation modes: normal, search-engine, full, precise, and HMM. Supports user dictionaries and POS tagging, and can run as a JSON RPC service.

Supports HMM text segmentation using the Viterbi algorithm.

Text segmentation speed: 9.2 MB/s single-threaded, 26.8 MB/s with concurrent goroutines. HMM text segmentation: 3.2 MB/s single-threaded (2-core/4-thread MacBook Pro).

Binding:

gse-bind: bindings for JavaScript and other languages.

Install / update

go get -u github.com/go-ego/gse

Build-tools

go get -u github.com/go-ego/re

re gse

To create a new gse application

$ re gse my-gse

re run

To run the application we just created, you can navigate to the application folder and execute:

$ cd my-gse && re run

Use

package main

import (
	"fmt"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! 你好世界."

	new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

	seg    gse.Segmenter
	posSeg pos.Segmenter
)

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)
}

func main() {
	cut()

	segCut()
}

func posAndTrim(cut []string) {
	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	pos.WithGse(seg)
	po := posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func cutPos() {
	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))

	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Load the default dictionary
	seg.LoadDict()
	// Or load a custom dictionary:
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	// Text Segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)

	// Handle the segmentation results.
	// Both normal mode and search mode are supported;
	// see the comments on the ToString function.
	// Search mode is mainly used to provide search engines
	// with as many keywords as possible.
	fmt.Println(gse.ToString(segments, true))
}
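
For comparison, a minimal sketch of the two ToString modes (building on the segments value above; exact output omitted, and the extra sub-word behaviour of search mode is as described in the code comment, not verified here):

	// normal mode: one token per position
	fmt.Println(gse.ToString(segments))

	// search mode: also emits shorter sub-words for search engines
	fmt.Println(gse.ToString(segments, true))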

Look at a custom dictionary example

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

func main() {
	var seg gse.Segmenter
	seg.LoadDict("zh,testdata/test_dict.txt,testdata/test_dict1.txt")

	text1 := "你好世界, Hello world"
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}
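
Custom entries can also be added at runtime. A minimal sketch using the AddToken method that appears in the comments below (the frequency 100 and the POS tag "n" are illustrative values):

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

func main() {
	var seg gse.Segmenter
	seg.LoadDict()

	// register a custom term at runtime
	seg.AddToken("Helloworld", 100, "n")

	fmt.Println(seg.Cut("Hello world, Helloworld.", true))
}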

Look at a Chinese example

Look at a Japanese example

Authors

License

Gse is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0). Thanks to sego and jieba (jiebago).

Comments
  • Hey, the stop-word dictionary never takes effect, even after adding it

    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	text   = "第一次爱的人是谁演唱的"
    	new, _ = gse.New("dict.txt")

    	seg gse.Segmenter
    )

    func main() {
    	cut()
    }

    func cut() {
    	new.LoadStop("stop.txt")
    	new.AddStop("的")
    	new.AddStop("是") // adding this line did not help either
    	fmt.Println("cut: ", new.Cut(text, true))
    	fmt.Println("cut all: ", new.CutAll(text))
    	fmt.Println("cut for search: ", new.CutSearch(text, true))
    	fmt.Println(new.String(text, true))
    }

    // Console output:
    // 2022/02/18 17:44:34 Dict files path: [dict.txt]
    // 2022/02/18 17:44:34 Load the gse dictionary: "dict.txt"
    // 2022/02/18 17:44:34 Gse dictionary loaded finished.
    // 2022/02/18 17:44:34 Load the stop word dictionary: "stop.txt"
    // cut: [第一次爱的人 是 谁 演唱 的]
    // cut all: [第一次爱的人 是 谁 演唱 的]
    // cut for search: [第一次爱的人 是 谁 演唱 的]
    // 第一次爱的人/n 是/x 谁/x 演唱/v 的/x

  • Question: Is there any way to get segment info (not just the string, but also start and end) in HMM and search mode?

    • Gse version (or commit ref): 0.60
    • Go version: 1.14
    • Operating system and bit: macOS 10.14

    Description

    In my case, I need to get the start and end info of each word after segmenting in HMM and search mode. Reading the APIs, I only found:

    • CutSearch(string, true), which only returns []string with no start and end info
    • Segment([]byte(text)), which returns segments with start and end info, but does not accept a parameter to choose search mode

    Is there any way to do something like Segment([]byte(text), searchMode)?
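
    A possible direction (a sketch only, based on the ModeSegment call shown in a later comment; whether it also emits the extra search-mode sub-words is not confirmed here):

    	segments := seg.ModeSegment([]byte(text), true) // true: search mode
    	for _, s := range segments {
    		fmt.Println(s.Token().Text(), s.Start(), s.End())
    	}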

  • Bleve

    Has anyone tried using this with Bleve?

    Bleve does this plus a lot more, but it lacks decent Chinese / Japanese stemmers.

    Using this with Bleve would be a powerful stack.

  • Could not load dictionaries

    I pulled gse through go mod, but the dictionary data in gse was not pulled down, so I got a "Could not load dictionaries" error. I then copied the dictionary data into the gse package to make it run. So I think:

    1. Could you remove the hard-coded dictionary location in gse, or make it configurable through parameters?
    2. If you want to load dictionary data, could the dictionary data be converted into static Go code through "go-bindata" or similar tools?
  • How to build without the embedded dictionary on Go 1.16 or above?

    Hi,

    I noticed that gse leads to a large binary size. After reviewing the code, the problem seems to come from the embedded dictionary.

    The program binary may change often, but the dictionary rarely does. So is there a way to build a gse project without the embedded dictionary?

    Thanks.

    • Gse version (or commit ref): 0.70.1
    • Go version: 1.17
    • Operating system and bit: Ubuntu 20.04 64bit
    • Can you reproduce the bug at Examples:
      • [x] Yes (provide example code)
      • [ ] No
      • [ ] Not relevant
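
    As a general illustration only (hypothetical file, package, and tag names; not gse's actual build setup), Go 1.16+ embedding can be gated behind a build tag so a dictionary is only compiled into the binary when explicitly requested:

    //go:build withdict

    package dict

    import _ "embed"

    // built only with: go build -tags withdict
    //
    //go:embed dictionary.txt
    var Data string
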
  • Is there a bug in seg.ModeSegment?

    The results of seg.Segment and seg.ModeSegment are the same; is this a bug?

    I thought the result of ModeSegment should be like that of seg.CutSearch.

    test code:

    package main
    
    import (
    	"fmt"
    
    	"github.com/go-ego/gse"
    )
    
    var (
    	seg  gse.Segmenter
    	text = "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
    )
    
    func main() {
    	seg.LoadDict()
    	addToken()
    	cut()
    }
    
    func addToken() {
    	seg.AddToken("《复仇者联盟3:无限战争》", 100, "n")
    }
    
    // 使用 DAG 或 HMM 模式分词
    func cut() {
    	// "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
    
    	// use DAG and HMM
    	hmm := seg.Cut(text, true)
    	fmt.Println("cut use hmm: ", hmm)
    	// cut use hmm:  [《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]
    
    	cut := seg.Cut(text)
    	fmt.Println("cut: ", cut)
    	// cut:  [《 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]
    
    	hmm = seg.CutSearch(text, true)
    	fmt.Println("cut search use hmm: ", hmm)
    	//cut search use hmm:  [复仇 仇者 联盟 无限 战争 复仇者 《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影 摄影机 拍摄 制作 的 的 科幻 科幻片 .]
    	fmt.Println("analyze: ", seg.Analyze(hmm, text))
    
    	cut = seg.CutSearch(text)
    	fmt.Println("cut search: ", cut)
    	// cut search:  [《 复仇 者 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影 机 摄影机 拍摄 制作 的 的 科幻 片 科幻片 .]
    
    	segment1 := seg.Segment([]byte(text))
    	for i, token := range segment1 {
    		fmt.Println(i, token.Token().Text())
    	}
    	segment2 := seg.ModeSegment([]byte(text), true)
    	for i, token := range segment2 {
    		fmt.Println(i, token.Token().Text())
    	}
    }
    
  • "\001" in text gets error result

    • Gse version (or commit ref): 1fd1428e78fe
    • Go version: 1.14.2
    • Operating system and bit: any
    • Can you reproduce the bug at Examples:
      • [ ] No
      • [ ] Yes (provide example code)
      • [x] Not relevant
    • Provide example code:
    func TestSegment(t *testing.T) {
    	seg := &gse.Segmenter{}
    	err := seg.LoadDict("../data/dictionary.txt")
    	if err != nil {
    		t.Fatal(err)
    	}
    	data := []byte("\001你好吗")
    	res := seg.Segment(data)
    	for _, re := range res {
    		t.Log(re.Token().Text())
    		t.Log(re.Start())
    		t.Log(re.End())
    	}
    }
    
    • Log gist:
    
        TestSegment: process_test.go:51: 你
        TestSegment: process_test.go:52: 0
        TestSegment: process_test.go:53: 3
        TestSegment: process_test.go:51: 你好
        TestSegment: process_test.go:52: 3
        TestSegment: process_test.go:53: 9
        TestSegment: process_test.go:51: 吗
        TestSegment: process_test.go:52: 9
        TestSegment: process_test.go:53: 12
    

    Description

    The first token should be "\001", but we get the second word instead, and the start of the second token should be 1.

  • How to disable output of dictionary loading?

    There is always log output about dictionary loading:

    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"
    

    My code is as below:

    	segmenter, err := gse.NewEmbed(static.DictFile)
    	segmenter.LoadStopEmbed(static.StopFile)
    	segmenter.MoreLog = false // <-- seems not to work
    	segmenter.SkipLog = true  // <-- seems not to work
    

    Is there any option to disable such output?
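
    One thing worth trying (a sketch only; it assumes SkipLog is read at load time and so must be set before the dictionaries are loaded, and that LoadDictEmbed/LoadStopEmbed accept the same embedded data as NewEmbed):

    	var segmenter gse.Segmenter
    	segmenter.SkipLog = true // set before any dictionary is loaded
    	if err := segmenter.LoadDictEmbed(static.DictFile); err != nil {
    		// handle the error
    	}
    	segmenter.LoadStopEmbed(static.StopFile)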

  • Hey, why does the stop-word dictionary method never take effect?

    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	text   = "第一次爱的人是谁演唱的"
    	new, _ = gse.New("dict.txt")

    	seg gse.Segmenter
    )

    func main() {
    	cut()
    }

    // loadDictEmbed is supported from go1.16
    func loadDictEmbed() {
    	seg.LoadDictEmbed()
    	seg.LoadStopEmbed()
    }

    func cut() {
    	new.LoadStop("stop.txt")
    	new.IsStop("是") // even after adding "是" to the stop dictionary, "是" still appears in the results
    	fmt.Println("cut: ", new.Cut(text, true))
    	fmt.Println("cut all: ", new.CutAll(text))
    	fmt.Println("cut for search: ", new.CutSearch(text, true))
    	fmt.Println(new.String(text, true))
    }

    // Output:
    // cut: [第一次爱的人 是 谁 演唱 的]
    // cut all: [第一次爱的人 是 谁 演 唱 的]
    // cut for search: [第一次爱的人 是 谁 演唱 的]
    // 第一次爱的人/n 是/x 谁/x 演/x 唱/x 的/x
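
    An observation rather than a confirmed answer: judging by the method names used in these comments, AddStop registers a stop word while IsStop only checks whether a word already is one, so the IsStop call above probably does not add anything. A minimal sketch:

    	new.LoadStop("stop.txt")
    	new.AddStop("是")             // presumably registers "是" as a stop word
    	fmt.Println(new.IsStop("是")) // presumably only reports membership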

  • Float should not be split

    Splitting "loss of 76.7", I got "loss / of / 76 / . / 7", but I want "loss / of / 76.7".

    What can I do?

    • Gse version (or commit ref): v0.69.3
    • Go version: 1.16
  • Temporarily disable debug output until there is a better mechanism to turn it on

    Please provide Issues links to:

    • Issues: #1 https://github.com/go-ego/gpy/issues/16

    Provide test code:

    $ pinyin -p 银行
    

    Description

    The output from pinyin -p looks a lot like debug output; disable it temporarily until we have a better way to turn it on.

  • V1 Release?

    Hi, I was looking for a good Chinese/Japanese tokenizer in Go and stumbled across this one.

    Based on the release history, it looks like this library has been in use for quite a while, but it's still v0. Any reason not to issue an official v1 release?

    It would also be nice to see quality metrics in the readme, if you have any, e.g. a comparison against data like https://universaldependencies.org/.
