Go efficient text segmentation and NLP; supports English, Chinese, Japanese, and more. High-performance word segmentation in Go.

gse

Go efficient text segmentation; supports English, Chinese, Japanese, and more.


Simplified Chinese README

The dictionary is implemented with a double-array trie; the segmenter computes the shortest path based on word frequency plus dynamic programming, and also provides DAG and HMM word segmentation.

Supports multiple word segmentation modes: normal, search-engine, full, precise, and HMM. Supports user dictionaries and POS tagging, and can run as a JSON RPC service.

Supports HMM text segmentation using the Viterbi algorithm.

Text segmentation speed: 9.2 MB/s single-threaded, 26.8 MB/s with concurrent goroutines. HMM text segmentation: 3.2 MB/s single-threaded (2-core/4-thread MacBook Pro).

Binding:

gse-bind: bindings for JavaScript and other languages.

Install / update

go get -u github.com/go-ego/gse

Build-tools

go get -u github.com/go-ego/re

re gse

To create a new gse application

$ re gse my-gse

re run

To run the application we just created, you can navigate to the application folder and execute:

$ cd my-gse && re run

Use

package main

import (
	"fmt"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! 你好世界."

	new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

	seg    gse.Segmenter
	posSeg pos.Segmenter
)

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)
}

func main() {
	cut()

	segCut()
}

func posAndTrim(cut []string) {
	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	pos.WithGse(seg)
	po := posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func cutPos() {
	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))

	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Load the default dictionary
	seg.LoadDict()
	// Or load a custom dictionary:
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	// Text Segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)

	// Handle the segmentation results.
	// Both normal mode and search mode are supported;
	// see the comments on the ToString function.
	// Search mode is mainly used to provide search engines
	// with as many keywords as possible.
	fmt.Println(gse.ToString(segments, true))
}
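
For comparison, a minimal sketch of the two ToString modes (building on the segments value above; exact output omitted, and the extra sub-word behaviour of search mode is as described in the code comment, not verified here):

	// normal mode: one token per position
	fmt.Println(gse.ToString(segments))

	// search mode: also emits shorter sub-words for search engines
	fmt.Println(gse.ToString(segments, true))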

Look at a custom dictionary example

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

func main() {
	var seg gse.Segmenter
	seg.LoadDict("zh,testdata/test_dict.txt,testdata/test_dict1.txt")

	text1 := "你好世界, Hello world"
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}
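
Custom entries can also be added at runtime. A minimal sketch using the AddToken method that appears in the comments below (the frequency 100 and the POS tag "n" are illustrative values):

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

func main() {
	var seg gse.Segmenter
	seg.LoadDict()

	// register a custom term at runtime
	seg.AddToken("Helloworld", 100, "n")

	fmt.Println(seg.Cut("Hello world, Helloworld.", true))
}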

Look at a Chinese example

Look at a Japanese example

Authors

License

Gse is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0). Thanks to sego and jieba (jiebago).

Comments
  • Hey, the stop-word dictionary never takes effect, even after adding it

    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	text   = "第一次爱的人是谁演唱的"
    	new, _ = gse.New("dict.txt")

    	seg gse.Segmenter
    )

    func main() {
    	cut()
    }

    func cut() {
    	new.LoadStop("stop.txt")
    	new.AddStop("的")
    	new.AddStop("是") // adding this line did not help either
    	fmt.Println("cut: ", new.Cut(text, true))
    	fmt.Println("cut all: ", new.CutAll(text))
    	fmt.Println("cut for search: ", new.CutSearch(text, true))
    	fmt.Println(new.String(text, true))
    }

    // Console output:
    // 2022/02/18 17:44:34 Dict files path: [dict.txt]
    // 2022/02/18 17:44:34 Load the gse dictionary: "dict.txt"
    // 2022/02/18 17:44:34 Gse dictionary loaded finished.
    // 2022/02/18 17:44:34 Load the stop word dictionary: "stop.txt"
    // cut: [第一次爱的人 是 谁 演唱 的]
    // cut all: [第一次爱的人 是 谁 演唱 的]
    // cut for search: [第一次爱的人 是 谁 演唱 的]
    // 第一次爱的人/n 是/x 谁/x 演唱/v 的/x

  • Question: Is there any way to get segment info (not just the string, but also start and end) in HMM and search mode?

    • Gse version (or commit ref): 0.60
    • Go version: 1.14
    • Operating system and bit: macOS 10.14

    Description

    In my case, I need to get the start and end info of each word after segmenting in HMM and search mode. Reading the APIs, I only found:

    • CutSearch(string, true), which only returns []string with no start and end info
    • Segment([]byte(text)), which returns segments with start and end info, but does not accept a parameter to choose search mode

    Is there any way to do something like Segment([]byte(text), searchMode)?
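
    A possible direction (a sketch only, based on the ModeSegment call shown in a later comment; whether it also emits the extra search-mode sub-words is not confirmed here):

    	segments := seg.ModeSegment([]byte(text), true) // true: search mode
    	for _, s := range segments {
    		fmt.Println(s.Token().Text(), s.Start(), s.End())
    	}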

  • Bleve

    Has anyone tried using this with Bleve?

    Bleve does this plus a lot more, but it lacks decent Chinese / Japanese stemmers.

    Using this with Bleve would be a powerful stack.

  • Could not load dictionaries

    I pulled gse through go mod, but the dictionary data in gse was not pulled down, so I got a "Could not load dictionaries" error. I then copied the dictionary data into the gse package to make it run. So I think:

    1. Could you remove the hard-coded dictionary location in gse, or make it configurable through parameters?
    2. If you want to load dictionary data, could the dictionary data be converted into static Go code through "go-bindata" or similar tools?
  • How to build without the embedded dictionary on Go 1.16 or above?

    Hi,

    I noticed that gse leads to a large binary size. After reviewing the code, the problem seems to come from the embedded dictionary.

    The program binary may change often, but the dictionary rarely does. So is there a way to build a gse project without the embedded dictionary?

    Thanks.

    • Gse version (or commit ref): 0.70.1
    • Go version: 1.17
    • Operating system and bit: Ubuntu 20.04 64bit
    • Can you reproduce the bug at Examples:
      • [x] Yes (provide example code)
      • [ ] No
      • [ ] Not relevant
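
    As a general illustration only (hypothetical file, package, and tag names; not gse's actual build setup), Go 1.16+ embedding can be gated behind a build tag so a dictionary is only compiled into the binary when explicitly requested:

    //go:build withdict

    package dict

    import _ "embed"

    // built only with: go build -tags withdict
    //
    //go:embed dictionary.txt
    var Data string
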
  • Is there a bug in seg.ModeSegment?

    The results of seg.Segment and seg.ModeSegment are the same; is this a bug?

    I thought the result of ModeSegment should be like that of seg.CutSearch.

    test code:

    package main
    
    import (
    	"fmt"
    
    	"github.com/go-ego/gse"
    )
    
    var (
    	seg  gse.Segmenter
    	text = "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
    )
    
    func main() {
    	seg.LoadDict()
    	addToken()
    	cut()
    }
    
    func addToken() {
    	seg.AddToken("《复仇者联盟3:无限战争》", 100, "n")
    }
    
    // 使用 DAG 或 HMM 模式分词
    func cut() {
    	// "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
    
    	// use DAG and HMM
    	hmm := seg.Cut(text, true)
    	fmt.Println("cut use hmm: ", hmm)
    	// cut use hmm:  [《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]
    
    	cut := seg.Cut(text)
    	fmt.Println("cut: ", cut)
    	// cut:  [《 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]
    
    	hmm = seg.CutSearch(text, true)
    	fmt.Println("cut search use hmm: ", hmm)
    	//cut search use hmm:  [复仇 仇者 联盟 无限 战争 复仇者 《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影 摄影机 拍摄 制作 的 的 科幻 科幻片 .]
    	fmt.Println("analyze: ", seg.Analyze(hmm, text))
    
    	cut = seg.CutSearch(text)
    	fmt.Println("cut search: ", cut)
    	// cut search:  [《 复仇 者 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影 机 摄影机 拍摄 制作 的 的 科幻 片 科幻片 .]
    
    	segment1 := seg.Segment([]byte(text))
    	for i, token := range segment1 {
    		fmt.Println(i, token.Token().Text())
    	}
    	segment2 := seg.ModeSegment([]byte(text), true)
    	for i, token := range segment2 {
    		fmt.Println(i, token.Token().Text())
    	}
    }
    
  • "\001" in text gets error result

    • Gse version (or commit ref): 1fd1428e78fe
    • Go version: 1.14.2
    • Operating system and bit: any
    • Can you reproduce the bug at Examples:
      • [ ] No
      • [ ] Yes (provide example code)
      • [x] Not relevant
    • Provide example code:
    func TestSegment(t *testing.T) {
    	seg := &gse.Segmenter{}
    	err := seg.LoadDict("../data/dictionary.txt")
    	if err != nil {
    		t.Fatal(err)
    	}
    	data := []byte("\001你好吗")
    	res := seg.Segment(data)
    	for _, re := range res {
    		t.Log(re.Token().Text())
    		t.Log(re.Start())
    		t.Log(re.End())
    	}
    }
    
    • Log gist:
    
        TestSegment: process_test.go:51: 你
        TestSegment: process_test.go:52: 0
        TestSegment: process_test.go:53: 3
        TestSegment: process_test.go:51: 你好
        TestSegment: process_test.go:52: 3
        TestSegment: process_test.go:53: 9
        TestSegment: process_test.go:51: 吗
        TestSegment: process_test.go:52: 9
        TestSegment: process_test.go:53: 12
    

    Description

    The first token should be "\001", but we get the second word instead, and the start of the second token should be 1.

  • How to disable output of dictionary loading?

    There is always log output about dictionary loading:

    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"
    

    My code is as below:

    	segmenter, err := gse.NewEmbed(static.DictFile)
    	segmenter.LoadStopEmbed(static.StopFile)
    	segmenter.MoreLog = false // <-- seems not to work
    	segmenter.SkipLog = true  // <-- seems not to work
    

    Is there any option to disable such output?
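
    One thing worth trying (a sketch only; it assumes SkipLog is read at load time and so must be set before the dictionaries are loaded, and that LoadDictEmbed/LoadStopEmbed accept the same embedded data as NewEmbed):

    	var segmenter gse.Segmenter
    	segmenter.SkipLog = true // set before any dictionary is loaded
    	if err := segmenter.LoadDictEmbed(static.DictFile); err != nil {
    		// handle the error
    	}
    	segmenter.LoadStopEmbed(static.StopFile)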

  • Hey, why does the stop-word dictionary method never take effect?

    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	text   = "第一次爱的人是谁演唱的"
    	new, _ = gse.New("dict.txt")

    	seg gse.Segmenter
    )

    func main() {
    	cut()
    }

    // loadDictEmbed is supported from go1.16
    func loadDictEmbed() {
    	seg.LoadDictEmbed()
    	seg.LoadStopEmbed()
    }

    func cut() {
    	new.LoadStop("stop.txt")
    	new.IsStop("是") // even after adding "是" to the stop dictionary, "是" still appears in the results
    	fmt.Println("cut: ", new.Cut(text, true))
    	fmt.Println("cut all: ", new.CutAll(text))
    	fmt.Println("cut for search: ", new.CutSearch(text, true))
    	fmt.Println(new.String(text, true))
    }

    // Output:
    // cut: [第一次爱的人 是 谁 演唱 的]
    // cut all: [第一次爱的人 是 谁 演 唱 的]
    // cut for search: [第一次爱的人 是 谁 演唱 的]
    // 第一次爱的人/n 是/x 谁/x 演/x 唱/x 的/x
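
    An observation rather than a confirmed answer: judging by the method names used in these comments, AddStop registers a stop word while IsStop only checks whether a word already is one, so the IsStop call above probably does not add anything. A minimal sketch:

    	new.LoadStop("stop.txt")
    	new.AddStop("是")             // presumably registers "是" as a stop word
    	fmt.Println(new.IsStop("是")) // presumably only reports membership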

  • Float should not be split

    Splitting "loss of 76.7", I got "loss / of / 76 / . / 7", but I want "loss / of / 76.7".

    What can I do?

    • Gse version (or commit ref): v0.69.3
    • Go version: 1.16
  • Temporarily disable debug output until there is a better mechanism to turn it on

    Please provide Issues links to:

    • Issues: #1 https://github.com/go-ego/gpy/issues/16

    Provide test code:

    $ pinyin -p 银行
    

    Description

    The output from pinyin -p looks a lot like debug output; disable it temporarily until we have a better way to turn it on.

  • V1 Release?

    Hi, I was looking for a good Chinese/Japanese tokenizer in Go and stumbled across this one.

    Based on the release history, it looks like this library has been in use for quite a while, but it's still v0. Any reason not to issue an official v1 release?

    It would also be nice to see quality metrics in the readme, if you have any, e.g. a comparison against data like https://universaldependencies.org/.
