Self-contained Japanese Morphological Analyzer written in pure Go


Kagome v2

Kagome is an open source Japanese morphological analyzer written in pure Go. Dictionaries and statistical models such as MeCab-IPADIC and UniDic (unidic-mecab) can be embedded in the binary.

Improvements over v1:

  • Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
  • Existing APIs have been refined and several new ones added.

Dictionaries

| dict | source | package |
| --- | --- | --- |
| MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa |
| UniDic | unidic-mecab-2.1.2_src | github.com/ikawaha/kagome-dict/uni |

Experimental Features

| dict | source | package |
| --- | --- | --- |
| mecab-ipadic-NEologd | mecab-ipadic-neologd | github.com/ikawaha/kagome-ipa-neologd |
| Korean MeCab | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko |

Segmentation mode for search

Like Kuromoji, Kagome provides segmentation modes for search:

  • Normal: Regular segmentation
  • Search: Use a heuristic to do additional segmentation useful for search
  • Extended: Similar to Search mode, but also splits unknown words into unigrams (single characters)

| Untokenized | Normal | Search | Extended |
| --- | --- | --- | --- |
| 関西国際空港 | 関西国際空港 | 関西 国際 空港 | 関西 国際 空港 |
| 日本経済新聞 | 日本経済新聞 | 日本 経済 新聞 | 日本 経済 新聞 |
| シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア ソフトウェア エンジニア | シニア ソフトウェア エンジニア |
| デジカメを買った | デジカメ を 買っ た | デジカメ を 買っ た | デ ジ カ メ を 買っ た |

Programming example

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// Wakati: split the sentence into surface forms only.
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// Tokenize: analyze the sentence and print each token with its features.
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

output:

---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

Reference

Practical Morphological Analysis with kagome v2 (実践:形態素解析 kagome v2, in Japanese)

Commands

Install

Go

go install github.com/ikawaha/kagome/v2@latest

(On Go versions older than 1.16, use: env GO111MODULE=on go get -u github.com/ikawaha/kagome/v2)

Homebrew tap

brew install ikawaha/kagome/kagome

Usage

$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   version - show version

tokenize [-file input_file] [-dict dic_file] [-userdict userdic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)]
  -dict string
    	dict
  -file string
    	input file
  -mode string
    	tokenize mode (normal|search|extended) (default "normal")
  -simple
    	display abbreviated dictionary contents
  -sysdict string
    	system dict type (ipa|uni) (default "ipa")
  -udict string
    	user dict

Tokenize command

% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

Server command

API

Start a server and try to access the "/tokenize" endpoint.

% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq . 

Web App

Start a server and open http://localhost:6060 in a browser. (The demo application uses Graphviz to draw the lattice, so Graphviz must be installed.)

% kagome server &

Lattice command

A debugging tool that outputs the lattice built during tokenization in Graphviz DOT format.

% kagome lattice 私は鰻 | dot -Tpng -o lattice.png

Docker


Licence

MIT

Comments
  • Working examples

    See https://github.com/ikawaha/kagome/issues/277#issuecomment-1340182326

    ./sample/
    ├── dict
    │   └── userdict.txt
    ├── example      ← ※ folder for adding working examples
    └── wasm
        ├── README.md
        ├── go.mod
        ├── kagome.html
        └── main.go
    
  • There is nothing better than better documentation

    KEINOS Thank you very much! Maybe this is obvious stuff and one is expected to know this, but I think it would be nice to include something like your comment in the README.

    Originally posted by @CaptainDario in https://github.com/ikawaha/kagome/issues/274#issuecomment-1198047786
