A tokenizer for Go based on a dictionary and Bigram language models. (Currently supports Chinese segmentation only.)

gotokenizer

Motivation

I wanted a simple tokenizer with no unnecessary overhead that uses only the standard library, follows good practices, and has well-tested code.

Features

  • Support Maximum Matching Method
  • Support Minimum Matching Method
  • Support Reverse Maximum Matching
  • Support Reverse Minimum Matching
  • Support Bidirectional Maximum Matching
  • Support Bidirectional Minimum Matching
  • Support using Stop Tokens
  • Support Custom word Filter

Installation

go get -u github.com/xujiajun/gotokenizer

Usage

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch uses NumAndLetterWordFilter as its default word filter
	mm := gotokenizer.NewMaxMatch(dictPath)
	// load the dictionary before calling Get
	mm.LoadDict()

	fmt.Println(mm.Get(text)) //[gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// enable stop-token filtering
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	mm.StopTokens.Load(stopTokenDicPath)

	fmt.Println(mm.Get(text)) //[gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
	fmt.Println(mm.GetFrequency(text)) //map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>

}

For more examples, see the tests.

Contributing

If you'd like to help out with the project, you can open a pull request.

Author

License

gotokenizer is open-source software licensed under the Apache-2.0 license.

Acknowledgements

This package is inspired by the following:

https://github.com/ysc/word

Owner
徐佳军