Go efficient text segmentation and NLP; supports English, Chinese, Japanese, and other languages. High-performance word segmentation in Go.



The dictionary is implemented with a double-array trie. The segmenter finds the shortest path based on word frequency using dynamic programming, and also supports DAG- and HMM-based word segmentation.
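As a rough illustration of the frequency-plus-dynamic-programming idea, the sketch below picks the split whose words have the highest combined log-frequency. It is not gse's implementation: the `segment` helper, the plain map standing in for the double-array trie, and the sample frequencies are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"math"
)

// segment returns the split of text whose words have the highest summed
// log-probability, where a word's probability is freq[word] / total.
// Hypothetical helper: a map stands in for the double-array trie.
func segment(text string, freq map[string]float64) []string {
	runes := []rune(text)
	n := len(runes)

	total := 0.0
	for _, f := range freq {
		total += f
	}
	if total < 1 {
		total = 1 // avoid log(0) on an empty dictionary
	}
	logTotal := math.Log(total)

	best := make([]float64, n+1) // best[i]: best score for runes[i:]
	next := make([]int, n+1)     // next[i]: end of the word starting at i
	for i := n - 1; i >= 0; i-- {
		best[i] = math.Inf(-1)
		for j := i + 1; j <= n; j++ {
			f, ok := freq[string(runes[i:j])]
			if !ok || f <= 0 {
				if j > i+1 {
					continue // unknown multi-character span: not a candidate
				}
				f = 1 // unknown single character: minimal frequency
			}
			if score := math.Log(f) - logTotal + best[j]; score > best[i] {
				best[i], next[i] = score, j
			}
		}
	}

	var words []string
	for i := 0; i < n; i = next[i] {
		words = append(words, string(runes[i:next[i]]))
	}
	return words
}

func main() {
	// Sample frequencies invented for the example.
	freq := map[string]float64{
		"你好": 100, "世界": 80, "你": 10, "好": 10, "世": 5, "界": 5,
	}
	fmt.Println(segment("你好世界", freq)) // [你好 世界]
}
```

A real dictionary lookup would walk the trie once per start position instead of slicing every candidate substring, which is the main reason to use a double-array trie here.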

Supports several segmentation modes: common, search engine, full, precise, and HMM. Also supports user dictionaries, POS tagging, and running as a JSON RPC service.

HMM text segmentation uses the Viterbi algorithm.
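To make the Viterbi step concrete, here is a minimal sketch that tags each character as B (begin), E (end), or S (single) and then cuts after every E or S. The three-state model, the `toyModel` probabilities, and the helper names are invented for illustration; a real HMM segmenter typically also has an M (middle) state and trained tables, and would work in log space.

```go
package main

import "fmt"

// hmm is a toy hidden Markov model over character observations.
type hmm struct {
	states []string
	start  map[string]float64
	trans  map[string]map[string]float64
	emit   map[string]map[rune]float64
}

// viterbi returns the most probable state sequence for obs.
func (h hmm) viterbi(obs []rune) []string {
	n := len(obs)
	if n == 0 {
		return nil
	}
	prob := make([]map[string]float64, n)
	prev := make([]map[string]string, n)
	prob[0] = map[string]float64{}
	for _, s := range h.states {
		prob[0][s] = h.start[s] * h.emit[s][obs[0]]
	}
	for t := 1; t < n; t++ {
		prob[t] = map[string]float64{}
		prev[t] = map[string]string{}
		for _, s := range h.states {
			best, arg := -1.0, ""
			for _, p := range h.states {
				if v := prob[t-1][p] * h.trans[p][s]; v > best {
					best, arg = v, p
				}
			}
			prob[t][s] = best * h.emit[s][obs[t]]
			prev[t][s] = arg
		}
	}
	// Pick the best final state, then follow the back-pointers.
	last, best := "", -1.0
	for _, s := range h.states {
		if prob[n-1][s] > best {
			best, last = prob[n-1][s], s
		}
	}
	path := make([]string, n)
	path[n-1] = last
	for t := n - 1; t > 0; t-- {
		path[t-1] = prev[t][path[t]]
	}
	return path
}

// toWords cuts obs after every character tagged E or S.
func toWords(obs []rune, tags []string) []string {
	var words []string
	start := 0
	for i, tag := range tags {
		if tag == "E" || tag == "S" {
			words = append(words, string(obs[start:i+1]))
			start = i + 1
		}
	}
	if start < len(obs) {
		words = append(words, string(obs[start:]))
	}
	return words
}

// toyModel returns hand-picked probabilities, not trained values.
func toyModel() hmm {
	return hmm{
		states: []string{"B", "E", "S"},
		start:  map[string]float64{"B": 0.6, "S": 0.4},
		trans: map[string]map[string]float64{
			"B": {"E": 1.0},
			"E": {"B": 0.5, "S": 0.5},
			"S": {"B": 0.5, "S": 0.5},
		},
		emit: map[string]map[rune]float64{
			"B": {'你': 0.6, '好': 0.1, '吗': 0.3},
			"E": {'你': 0.1, '好': 0.8, '吗': 0.1},
			"S": {'你': 0.2, '好': 0.2, '吗': 0.6},
		},
	}
}

func main() {
	obs := []rune("你好吗")
	tags := toyModel().viterbi(obs)
	fmt.Println(tags, toWords(obs, tags)) // [B E S] [你好 吗]
}
```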

Text segmentation speed: 9.2 MB/s single-threaded, 26.8 MB/s with concurrent goroutines. HMM text segmentation: 3.2 MB/s single-threaded. (Measured on a 2-core, 4-thread MacBook Pro.)
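The concurrent figure comes from segmenting with several goroutines. A minimal worker-pool sketch of that pattern follows; `segmentAll` is a hypothetical helper, and `strings.Fields` is only a stand-in for a real segmentation call. It assumes the segmentation function is safe to call from multiple goroutines once its dictionary is loaded.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// segmentAll runs segment over texts with a fixed number of workers and
// returns the results in input order.
func segmentAll(texts []string, workers int, segment func(string) []string) [][]string {
	if workers < 1 {
		workers = 1
	}
	results := make([][]string, len(texts))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handled by exactly one worker,
				// so writes to results do not race.
				results[i] = segment(texts[i])
			}
		}()
	}
	for i := range texts {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	texts := []string{"Hello world", "Winter is coming"}
	// strings.Fields stands in for a real segmenter call here.
	fmt.Println(segmentAll(texts, 4, strings.Fields)) // [[Hello world] [Winter is coming]]
}
```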


gse-bind provides bindings for JavaScript and other languages.

Install / update

go get -u github.com/go-ego/gse


go get -u github.com/go-ego/re

re gse

To create a new gse application

$ re gse my-gse

re run

To run the application we just created, you can navigate to the application folder and execute:

$ cd my-gse && re run


package main

import (
	"fmt"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! 你好世界."

	// gse.New returns a segmenter and an error; the error is ignored here for brevity.
	new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

	seg    gse.Segmenter
	posSeg pos.Segmenter
)

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)
}

func main() {
	cut()

	segCut()
}

func posAndTrim(cut []string) {
	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	po := posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func cutPos() {
	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))

	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Load the default dictionary
	seg.LoadDict()
	// Or load a dictionary from a specific path
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	// Text segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)

	// Handle the word segmentation results.
	// Two output modes are supported, normal and search;
	// see the comments on the ToString function.
	// Search mode is mainly used to provide search engines
	// with as many keywords as possible.
	fmt.Println(gse.ToString(segments, true))
}

Look at a custom dictionary example

package main

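import (
	"fmt"

	"github.com/go-ego/gse"
)

func main() {
	var seg gse.Segmenter
	// Load the Chinese dictionary plus a custom dictionary file. The path is
	// the test dictionary used above (an assumption); replace it with your own.
	if err := seg.LoadDict("zh,testdata/test_dict3.txt"); err != nil {
		fmt.Println("load dict:", err)
		return
	}

	text1 := "你好世界, Hello world"
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}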

Look at a Chinese example

Look at a Japanese example



Gse is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0). Thanks to sego and jieba (jiebago).

  • The stop-word dictionary never takes effect, even after adding stop words


    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	text   = "第一次爱的人是谁演唱的"
    	new, _ = gse.New("dict.txt")

    	seg gse.Segmenter
    )

    func main() { cut() }

    func cut() {
    	new.LoadStop("stop.txt")
    	new.AddStop("的")
    	new.AddStop("是") // adding this line does not help either
    	fmt.Println("cut: ", new.Cut(text, true))
    	fmt.Println("cut all: ", new.CutAll(text))
    	fmt.Println("cut for search: ", new.CutSearch(text, true))
    	fmt.Println(new.String(text, true))
    }

    // The console prints the following:
    // 2022/02/18 17:44:34 Dict files path: [dict.txt]
    // 2022/02/18 17:44:34 Load the gse dictionary: "dict.txt"
    // 2022/02/18 17:44:34 Gse dictionary loaded finished.
    // 2022/02/18 17:44:34 Load the stop word dictionary: "stop.txt"
    // cut: [第一次爱的人 是 谁 演唱 的]
    // cut all: [第一次爱的人 是 谁 演唱 的]
    // cut for search: [第一次爱的人 是 谁 演唱 的]
    // 第一次爱的人/n 是/x 谁/x 演唱/v 的/x
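    The output above suggests that the cut functions return tokens without consulting the stop list. One library-independent workaround is to filter the tokens after cutting. The sketch below is only that: `filterStop` and the map-based stop set are hypothetical helpers written for this example, not part of gse, and the library may offer an equivalent of its own.

```go
package main

import "fmt"

// filterStop returns the tokens that are not in the stop set.
// Hypothetical helper, not a gse function.
func filterStop(tokens []string, stop map[string]bool) []string {
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Tokens copied from the "cut" line of the reported output.
	tokens := []string{"第一次爱的人", "是", "谁", "演唱", "的"}
	stop := map[string]bool{"的": true, "是": true}
	fmt.Println(filterStop(tokens, stop)) // [第一次爱的人 谁 演唱]
}
```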

  • Question: Is there any way to get segment info (not only the string, but also start and end) in HMM and search mode?

    • Gse version (or commit ref): 0.60
    • Go version: 1.14
    • Operating system and bit: macOS 10.14


    In my case, I need the start and end position of each word after segmenting in HMM and search mode. Reading the API, I only found:

    • CutSearch(string, true), which only returns []string with no start and end info
    • Segment([]byte(text)), which returns segments with start and end info, but does not accept a parameter to choose search mode.

    Is there any way to do something like Segment([]byte(text), searchMode)?
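    One library-independent workaround is to recover offsets from the returned strings by scanning the original text. The sketch below uses a hypothetical `locate` helper and `span` type. It assumes the tokens appear verbatim in the text, in order and without overlap, so it would fit ordered output such as Cut's but not the overlapping tokens that search mode produces, and it would miss tokens the segmenter has lowercased.

```go
package main

import (
	"fmt"
	"strings"
)

// span is a token with byte offsets into the original text (End exclusive).
type span struct {
	Text       string
	Start, End int
}

// locate maps ordered, non-overlapping tokens back to byte offsets.
// Tokens that cannot be found after the previous match are skipped.
func locate(text string, tokens []string) []span {
	spans := make([]span, 0, len(tokens))
	pos := 0
	for _, tok := range tokens {
		if tok == "" {
			continue
		}
		i := strings.Index(text[pos:], tok)
		if i < 0 {
			continue
		}
		start := pos + i
		spans = append(spans, span{tok, start, start + len(tok)})
		pos = start + len(tok)
	}
	return spans
}

func main() {
	for _, s := range locate("你好世界, Hello world", []string{"你好", "世界", "Hello", "world"}) {
		fmt.Println(s.Text, s.Start, s.End)
	}
}
```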

  • Bleve


    Has anyone tried using this with bleve?

    Bleve does this plus a lot more, but lacks decent Chinese / Japanese stemmers.

    Using this with bleve would make a powerful stack.

  • Could not load dictionaries

    I pulled gse through go mod, but the dictionary data in gse was not pulled down, so I got a "Could not load dictionaries" error. I then copied the dictionary data into the gse package to get it running. So I have two suggestions:

    1. Can you remove the hard-coded dictionary location in gse, or make it configurable through parameters?
    2. If dictionary data must be loaded, could it be converted into static Go code with "go-bindata" or a similar tool?
  • How to build without embed dictionary on Go1.16 or above?


    I noticed that gse leads to a large binary size. After reviewing the code, I found that the problem may lie here: it is caused by the embedded dictionary.

    The program binary may vary, but the dictionary rarely changes. So is there a way to build a gse project without the embedded dictionary?


    • Gse version (or commit ref): 0.70.1
    • Go version: 1.17
    • Operating system and bit: Ubuntu 20.04 64bit
    • Can you reproduce the bug at Examples:
      • [x] Yes (provide example code)
      • [ ] No
      • [ ] Not relevant
  • Is there any bug of seg.ModeSegment?

    The results of seg.Segment and seg.ModeSegment are the same. Is this a bug?

    I thought the result of ModeSegment should look like that of seg.CutSearch.

    test code:

    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	seg  gse.Segmenter
    	text = "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
    )

    func main() {
    	seg.LoadDict() // load the default dictionary
    	addToken()
    	cut()
    }

    func addToken() {
    	seg.AddToken("《复仇者联盟3:无限战争》", 100, "n")
    }

    // Segment using DAG or HMM mode.
    func cut() {
    	// "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
    	// use DAG and HMM
    	hmm := seg.Cut(text, true)
    	fmt.Println("cut use hmm: ", hmm)
    	// cut use hmm:  [《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]

    	cut := seg.Cut(text)
    	fmt.Println("cut: ", cut)
    	// cut:  [《 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]

    	hmm = seg.CutSearch(text, true)
    	fmt.Println("cut search use hmm: ", hmm)
    	// cut search use hmm:  [复仇 仇者 联盟 无限 战争 复仇者 《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影 摄影机 拍摄 制作 的 的 科幻 科幻片 .]
    	fmt.Println("analyze: ", seg.Analyze(hmm, text))

    	cut = seg.CutSearch(text)
    	fmt.Println("cut search: ", cut)
    	// cut search:  [《 复仇 者 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影 机 摄影机 拍摄 制作 的 的 科幻 片 科幻片 .]

    	segment1 := seg.Segment([]byte(text))
    	for i, token := range segment1 {
    		fmt.Println(i, token.Token().Text())
    	}

    	segment2 := seg.ModeSegment([]byte(text), true)
    	for i, token := range segment2 {
    		fmt.Println(i, token.Token().Text())
    	}
    }
  • "\001" in the text gives a wrong result

    • Gse version (or commit ref): 1fd1428e78fe
    • Go version: 1.14.2
    • Operating system and bit: any
    • Can you reproduce the bug at Examples:
      • [ ] No
      • [ ] Yes (provide example code)
      • [x] Not relevant
    • Provide example code:
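    func TestSegment(t *testing.T) {
    	seg := &gse.Segmenter{}
    	err := seg.LoadDict("../data/dictionary.txt")
    	if err != nil {
    		t.Fatal(err)
    	}
    	data := []byte("\001你好吗")
    	res := seg.Segment(data)
    	for _, re := range res {
    		// Log the token text, start offset, and end offset.
    		t.Log(re.Token().Text())
    		t.Log(re.Start())
    		t.Log(re.End())
    	}
    }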
    • Log gist:
        TestSegment: process_test.go:51: 你
        TestSegment: process_test.go:52: 0
        TestSegment: process_test.go:53: 3
        TestSegment: process_test.go:51: 你好
        TestSegment: process_test.go:52: 3
        TestSegment: process_test.go:53: 9
        TestSegment: process_test.go:51: 吗
        TestSegment: process_test.go:52: 9
        TestSegment: process_test.go:53: 12


    The first token should be "\001", but we get the second word instead. The start of the second token should be 1.
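    The expected offsets can be checked independently of the library by listing where each character starts in the input. The sketch below uses a hypothetical `runeOffsets` helper and relies only on Go's UTF-8 string iteration: the control character takes one byte and each Chinese character takes three, so the text is 10 bytes long and no token should end at 12.

```go
package main

import "fmt"

// runeOffsets returns the byte offset at which each rune of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	text := "\001你好吗"
	fmt.Println(runeOffsets(text), len(text)) // [0 1 4 7] 10
}
```

    Under that layout, the segments would be expected at [0,1) for "\001", [1,7) for 你好, and [7,10) for 吗.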

  • How to disable output of dictionary loading?

    Information about dictionary loading is always printed:

    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Dict files path:  [./xxx]
    2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
    2023/01/07 23:04:15 Gse dictionary loaded finished.
    2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"

    My code is as follows:

    	segmenter, err := gse.NewEmbed(static.DictFile)
    	segmenter.MoreLog = false // seems to have no effect
    	segmenter.SkipLog = true  // seems to have no effect

    Is there an option to disable this output?
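    The timestamps in the output above match the default format of Go's standard `log` package. Assuming these messages go through that default logger (an assumption, not something the report confirms), one blunt workaround is to discard standard-logger output while the dictionaries load and restore it afterwards. `silenceLogs` is a hypothetical helper; note that it also silences any other code using the default logger in the meantime.

```go
package main

import (
	"io"
	"log"
)

// silenceLogs discards everything written through the standard logger and
// returns a function that restores the previous destination.
func silenceLogs() (restore func()) {
	prev := log.Writer()
	log.SetOutput(io.Discard) // io.Discard requires Go 1.16 or later
	return func() { log.SetOutput(prev) }
}

func main() {
	restore := silenceLogs()
	// Dictionary loading would go here; this line stands in for its log output.
	log.Println("Load the gse dictionary ...") // discarded
	restore()

	log.Println("logging is back") // printed to stderr as usual
}
```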

  • The stop-word dictionary method never takes effect. What is going on?


    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	text   = "第一次爱的人是谁演唱的"
    	new, _ = gse.New("dict.txt")

    	seg gse.Segmenter
    )

    func main() { cut() }

    // loadDictEmbed is supported from Go 1.16
    func loadDictEmbed() {
    	seg.LoadDictEmbed()
    	seg.LoadStopEmbed()
    }

    func cut() {
    	new.LoadStop("stop.txt")
    	new.IsStop("是") // after adding "是" to the stop dictionary, "是" still appears in the segmentation result
    	fmt.Println("cut: ", new.Cut(text, true))
    	fmt.Println("cut all: ", new.CutAll(text))
    	fmt.Println("cut for search: ", new.CutSearch(text, true))
    	fmt.Println(new.String(text, true))
    }

    // Output:
    // cut: [第一次爱的人 是 谁 演唱 的]
    // cut all: [第一次爱的人 是 谁 演 唱 的]
    // cut for search: [第一次爱的人 是 谁 演唱 的]
    // 第一次爱的人/n 是/x 谁/x 演/x 唱/x 的/x

  • Float should not be split

    Splitting "loss of 76.7" gives "loss / of / 76 / . / 7", but I want "loss / of / 76.7".

    What can I do?

    • Gse version (or commit ref): v0.69.3
    • Go version: 1.16
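    One library-independent workaround is to post-process the tokens and rejoin any digits, ".", digits run. The sketch below uses hypothetical helpers `mergeFloats` and `isDigits`; it assumes the segmenter emits the dot as its own token, as in the output quoted above.

```go
package main

import "fmt"

// isDigits reports whether s is a non-empty run of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// mergeFloats rejoins token triples of the form digits, ".", digits.
func mergeFloats(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for i := 0; i < len(tokens); i++ {
		if i+2 < len(tokens) && isDigits(tokens[i]) &&
			tokens[i+1] == "." && isDigits(tokens[i+2]) {
			out = append(out, tokens[i]+"."+tokens[i+2])
			i += 2 // skip the dot and the fractional part
			continue
		}
		out = append(out, tokens[i])
	}
	return out
}

func main() {
	fmt.Println(mergeFloats([]string{"loss", "of", "76", ".", "7"})) // [loss of 76.7]
}
```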
  • - temporarily disable debug output until a better mechanism

    ... to turn them on

    Please provide Issues links to:

    • Issues: #1 https://github.com/go-ego/gpy/issues/16

    Provide test code:

    $ pinyin -p 银行


    The output from pinyin -p looks very much like debug output; disable it temporarily until there is a better way to turn it on.

  • V1 Release?

    Hi, I was looking for a good Chinese/Japanese tokenizer in Go and stumbled across this one.

    Based on the release history, it looks like this library has been in use for quite a while, but it's still v0. Is there any reason not to issue an official v1 release?

    It would also be nice to see quality metrics on the readme, if you have any. E.g. comparison to data like https://universaldependencies.org/

