Kagome v2
Kagome is an open source Japanese morphological analyzer written in pure Go. Dictionaries and statistical models such as MeCab-IPADIC and UniDic (unidic-mecab) can be embedded in the binary.
Improvements from v1.
- Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
- Brushed up and added several APIs.
Dictionaries
dict | source | package |
---|---|---|
MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa |
UniDic | unidic-mecab-2.1.2_src | github.com/ikawaha/kagome-dict/uni |
Experimental Features
dict | source | package |
---|---|---|
mecab-ipadic-NEologd | mecab-ipadic-neologd | github.com/ikawaha/kagome-ipa-neologd |
Korean MeCab | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko |
Segmentation mode for search
Kagome provides segmentation modes for search, similar to those in Kuromoji.
- Normal: Regular segmentation
- Search: Use a heuristic to do additional segmentation useful for search
- Extended: Similar to search mode, but also splits unknown words into uni-grams
Untokenized | Normal | Search | Extended |
---|---|---|---|
関西国際空港 | 関西国際空港 | 関西 国際 空港 | 関西 国際 空港 |
日本経済新聞 | 日本経済新聞 | 日本 経済 新聞 | 日本 経済 新聞 |
シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア ソフトウェア エンジニア | シニア ソフトウェア エンジニア |
デジカメを買った | デジカメ を 買っ た | デジカメ を 買っ た | デ ジ カ メ を 買っ た |
Programming example
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Build a tokenizer with the IPA dictionary, omitting BOS/EOS tokens.
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}

	// wakati: split the text into surface strings only
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize: get tokens with their dictionary features
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
output:
---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
Reference
Commands
Install
Go
env GO111MODULE=on go get -u github.com/ikawaha/kagome/v2
Homebrew tap
brew install ikawaha/kagome/kagome
Usage
$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
[tokenize] - command line tokenize (*default)
server - run tokenize server
lattice - lattice viewer
version - show version
tokenize [-file input_file] [-dict dic_file] [-userdict userdic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)]
-dict string
dict
-file string
input file
-mode string
tokenize mode (normal|search|extended) (default "normal")
-simple
display abbreviated dictionary contents
-sysdict string
system dict type (ipa|uni) (default "ipa")
-udict string
user dict
Tokenize command
% kagome
すもももももももものうち
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
Server command
API
Start a server and try to access the "/tokenize" endpoint.
% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .
Web App
Start a server and access http://localhost:6060. (The demo application uses graphviz to draw a lattice, so you need graphviz installed.)
% kagome server &
Lattice command
A debugging tool for the tokenization process. It outputs a lattice in graphviz dot format.
% kagome lattice 私は鰻 | dot -Tpng -o lattice.png
Docker
Licence
MIT