:book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

prose

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get github.com/jdkato/prose/v2

Usage

Overview

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))
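
These options can be combined. As a minimal sketch (using the same WithTagging and WithSegmentation options that appear elsewhere on this page), a tokenization-only pipeline looks like this:

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Tokenization only: skip tagging, entity extraction, and segmentation.
    doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithTagging(false),
        prose.WithExtraction(false),
        prose.WithSegmentation(false))
    if err != nil {
        log.Fatal(err)
    }

    // Only Token.Text is meaningful here, since tagging was skipped.
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text)
    }
}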

Tokenizing

prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.

| Type            | Example                         |
|-----------------|---------------------------------|
| Email addresses | [email protected]               |
| Hashtags        | #trending                       |
| Mentions        | @jdkato                         |
| URLs            | https://github.com/jdkato/prose |
| Emoticons       | :-), >:(, o_0, etc.             |

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.

| Name                | Language | License   | GRS (English)  | GRS (Other) | Speed†  |
|---------------------|----------|-----------|----------------|-------------|---------|
| Pragmatic Segmenter | Ruby     | MIT       | 98.08% (51/52) | 100.00%     | 3.84 s  |
| prose               | Go       | MIT       | 75.00% (39/52) | N/A         | 0.96 s  |
| TactfulTokenizer    | Ruby     | GNU GPLv3 | 65.38% (34/52) | 48.57%      | 46.32 s |
| OpenNLP             | Java     | APLv2     | 59.62% (31/52) | 45.71%      | 1.27 s  |
| Stanford CoreNLP    | Java     | GNU GPLv3 | 59.62% (31/52) | 31.43%      | 0.92 s  |
| Splitta             | Python   | APLv2     | 55.77% (29/52) | 37.14%      | N/A     |
| Punkt               | Python   | APLv2     | 46.15% (24/52) | 48.57%      | 1.79 s  |
| SRX English         | Ruby     | GNU GPLv3 | 30.77% (16/52) | 28.57%      | 6.19 s  |
| Scalpel             | Ruby     | GNU GPLv3 | 28.85% (15/52) | 20.00%      | 0.13 s  |

† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running OS X 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running macOS 10.13.3.

package main

import (
    "fmt"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}

Tagging

prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|---------|----------|---------------------|
| NLTK    | 0.893    | 7.224               |
| prose   | 0.961    | 2.538               |

(See scripts/test_model.py for more information.)

The full list of supported POS tags is given below.

| TAG  | DESCRIPTION                               |
|------|-------------------------------------------|
| (    | left round bracket                        |
| )    | right round bracket                       |
| ,    | comma                                     |
| :    | colon                                     |
| .    | period                                    |
| ''   | closing quotation mark                    |
| ``   | opening quotation mark                    |
| #    | number sign                               |
| $    | currency                                  |
| CC   | conjunction, coordinating                 |
| CD   | cardinal number                           |
| DT   | determiner                                |
| EX   | existential there                         |
| FW   | foreign word                              |
| IN   | conjunction, subordinating or preposition |
| JJ   | adjective                                 |
| JJR  | adjective, comparative                    |
| JJS  | adjective, superlative                    |
| LS   | list item marker                          |
| MD   | verb, modal auxiliary                     |
| NN   | noun, singular or mass                    |
| NNP  | noun, proper singular                     |
| NNPS | noun, proper plural                       |
| NNS  | noun, plural                              |
| PDT  | predeterminer                             |
| POS  | possessive ending                         |
| PRP  | pronoun, personal                         |
| PRP$ | pronoun, possessive                       |
| RB   | adverb                                    |
| RBR  | adverb, comparative                       |
| RBS  | adverb, superlative                       |
| RP   | adverb, particle                          |
| SYM  | symbol                                    |
| TO   | infinitival to                            |
| UH   | interjection                              |
| VB   | verb, base form                           |
| VBD  | verb, past tense                          |
| VBG  | verb, gerund or present participle        |
| VBN  | verb, past participle                     |
| VBP  | verb, non-3rd person singular present     |
| VBZ  | verb, 3rd person singular present         |
| WDT  | wh-determiner                             |
| WP   | wh-pronoun, personal                      |
| WP$  | wh-pronoun, possessive                    |
| WRB  | wh-adverb                                 |
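
Since every Token carries its Tag, filtering by part of speech is a simple loop. Here's a minimal sketch (the sample text is arbitrary) that collects the nouns, i.e. any token whose tag begins with "NN":

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Keep only the noun tokens (NN, NNS, NNP, NNPS).
    var nouns []string
    for _, tok := range doc.Tokens() {
        if strings.HasPrefix(tok.Tag, "NN") {
            nouns = append(nouns, tok.Text)
        }
    }
    fmt.Println(nouns) // e.g., [Go language Google]
}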

NER

prose v2.0.0 includes a much-improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.

package main

import (
    "fmt"

    "github.com/jdkato/prose/v2"
)

func main() {
    doc, _ := prose.NewDocument("LeBron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // LeBron James PERSON
        // Los Angeles GPE
    }
}

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.
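
As a rough sketch of that workflow: the EntityContext, LabeledEntity, UsingEntities, ModelFromData, and UsingModel names below follow the v2 training API described in the tutorial, but treat the details (and the toy training data) as illustrative rather than authoritative.

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // A toy training set: each context marks the labeled spans in its text.
    // Offsets are byte positions; "Gopher" spans [7, 13) in the first example.
    train := []prose.EntityContext{
        {Text: "I like Gopher.",
            Spans:  []prose.LabeledEntity{{Start: 7, End: 13, Label: "PRODUCT"}},
            Accept: true},
        {Text: "They use Gopher at work.",
            Spans:  []prose.LabeledEntity{{Start: 9, End: 15, Label: "PRODUCT"}},
            Accept: true},
    }

    // Train a model for the custom PRODUCT label and use it for extraction.
    model := prose.ModelFromData("PRODUCT", prose.UsingEntities(train))
    doc, err := prose.NewDocument("Gopher is great.", prose.UsingModel(model))
    if err != nil {
        log.Fatal(err)
    }
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
    }
}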

Comments
  • The example in readme does not compile

    I am referring to this:

    package main
    import "gopkg.in/jdkato/prose.v2"
    func main() { prose.NewDocument("Go is ...") }
    

    The NewDocument is actually in gopkg.in/jdkato/prose.v2/summarize now.

    However, go get gopkg.in/jdkato/prose.v2/summarize does not work either. The package does not compile due to its use of an internal package. This is a typical error when using the gopkg.in service: gopkg.in only redirects the git URI, but does not rewrite package import paths. As a result, the referenced imports will actually point back to the master branch, nullifying the essential purpose of versioning. To use gopkg.in properly, you need to manually rewrite the import paths across the entire repo in the release tags/branches (or just stop using gopkg.in for multi-package repos).

    I saw your repo from Hacker News, but your repo fails to build on smallrepo. Detailed build log here:

    https://smallrepo.com/builds/20180717-175536-bc73d63d

    Thanks.

  • Wrap location parsing in a function

    This moves the logic that was in chunk_test.go into a new exported function named Chunk. The primary difference is that, instead of returning a slice of locations, we're now returning a slice of strings (i.e., the actual chunks).

    This changes the usage from

    words := tokenize.TextToWords(text)
    tagger := tag.NewPerceptronTagger()
    tagged := tagger.Tag(words)
    rs := Locate(tagged, TreebankNamedEntities)
    
    for r, loc := range rs {
        res := ""
        for t, tt := range tagged[loc[0]:loc[1]] {
            if t != 0 {
                res += " "
            }
            res += tt.Text
        }
    
        if r >= len(expected) {
            t.Error("ERROR unexpected result: " + res)
        } else {
            if res != expected[r] {
                t.Error("ERROR", res, "!=", expected[r])
            }
        }
    }
    

    to

    words := tokenize.TextToWords(text)
    tagger := tag.NewPerceptronTagger()
    tagged := tagger.Tag(words)
    
    for i, chunk := range Chunk(tagged, TreebankNamedEntities) {
        if i >= len(expected) {
            t.Error("ERROR unexpected result: " + chunk)
        } else {
            if chunk != expected[i] {
                t.Error("ERROR", chunk, "!=", expected[i])
            }
        }
    }
    

    /cc @elliott5

  • Possible enhancements to the "summarize" package

    Have you considered adding the Coleman–Liau index for completeness? Even though "opinion varies on its accuracy": https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index

    On the subject of suspect measures, a composite "years-of-education" metric, taking the average of scores together (and their standard deviation) may be of use: https://github.com/elliott5/readability/blob/master/assess.go

    Finally, for giving feedback to users on how to change their prose to be easier to read, it would be great if your analysis could store:

    • sentences with their word length; and
    • words with their syllable length and frequency (the product of the two ranking non-readability).

    Keep up the good work!

  • go get not working

    I tried to go get the hugo repo. I am using Go 1.14 (with go modules)

    go get github.com/gohugoio/hugo

    package github.com/jdkato/prose/transform: cannot find package "github.com/jdkato/prose/transform" in any of:
    	/usr/local/go/src/github.com/jdkato/prose/transform (from $GOROOT)
    	/Users/x/GOPATH/src/github.com/jdkato/prose/transform (from $GOPATH)
    

    I'm wondering why it is saying transform can't be found?

  • Make it possible to use vendored

    Very useful project.

    I wanted to test the title case functionality for use in Hugo, but we vendor our libraries, and I get a ../vendor/github.com/jdkato/prose/transform/title.go:9:2: use of internal package not allowed when importing github.com/jdkato/prose/transform.

    See https://github.com/gohugoio/hugo/pull/3753

    I have been Googling this, and it seems there is no (simple) workaround other than avoiding the use of internal packages in libraries.

  • Roadmap

    This is a rough outline of some improvements I'd like to make.

    Documentation

    • [x] Improve and update README.
    • [ ] Add .github files.

    tokenize

    • [x] Port the pragmatic_segmenter.
    • [x] Get PunktSentenceTokenizer passing the Golden Rules and possibly submit a PR upstream.

    tag

    • [x] Finish porting the PerceptronTagger (just training functionality left).
    • [x] ~~Port the TnT Tagger~~ (not going to make v1.0).
    • [ ] Improve testing strategy (we currently rely on NLTK).

    transform

    • [x] Improve Title and add support for variations (e.g., AP style).
    • [ ] Port Change Case.

    summarize

    • [x] Finish working on Syllables.
    • [x] Add a composite "years-of-education" metric.
    • [ ] Add the ability to update a Document's content without having to recalculate all of its statistics.
    • [x] Expand test suite.
  • Allow overridable tokenizer parameters.

    This PR changes the iterTokenizer struct to contain the parameters that can be overridden. This will allow future changes to NewDocument that use a custom tokenizer.

    Will help with issues #41 and #32.

    Test: existing unit test

  • Prose installation issues

    I tried installing prose using the command provided go get gopkg.in/jdkato/prose.v2 and this is the error log printed on my terminal.

    # gopkg.in/jdkato/prose.v2
    ../gopkg.in/jdkato/prose.v2/extract.go:323:23: est.AtVec undefined (type *mat.VecDense has no field or method AtVec)
    ../gopkg.in/jdkato/prose.v2/extract.go:335:25: est.AtVec undefined (type *mat.VecDense has no field or method AtVec)
    ../gopkg.in/jdkato/prose.v2/extract.go:355:17: count.AtVec undefined (type *mat.VecDense has no field or method AtVec)
    ../gopkg.in/jdkato/prose.v2/extract.go:525:27: count.AtVec undefined (type *mat.VecDense has no field or method AtVec)
    

    go-lang version : go1.9.7 darwin/amd64

  • slice bounds out of range, title.go:49

    I'm trying to use TitleConverter.Title(), but it panics when the string contains certain multibyte characters.

    Crashing testcases:

    type tc struct {
        name  string
        input string
        want  string
    }
    tests := []tc{
        tc{"panic", "This Agreement, dated [DATE] (the “Effective Date”) for Design Services (the “Agreement”) is between [DESIGNER NAME], of [DESIGNER COMPANY](“Designer”), and [CLIENT NAME], of [CLIENT COMPANY] (“Client”) (together known as the “Parties”), for the performance of said Design Services and the production of Deliverables, as described in Schedule A, attached hereto and incorporated herein by reference.", "panic"},
        tc{"panic", "Crash,”“us,” “our” or “we” means Crash Network, Inc. (d/b/a Crash) and its subsidiaries and affiliates.", "panic"},
        tc{"panic", "a “[“New Entity”],” an [Institution] and [Institution].", "panic"},
    }
    
  • List of Label?

    Hi, I can see that we have a comprehensive list of tags, but I can't find anything for labels (for example, PERSON, GPE, etc.). It would be nice if someone could redirect me to the list, even if it is somewhere in the source code.

  • Introduce Tokenizer interface

    This PR allows the user to provide a different tokenizer.

    Users can specify their own Tokenizer in the DocOpts. This replaces the boolean Tokenize option (set Tokenizer to nil to disable).

    Currently only IterTokenizer is provided, which can be customized with its own Using options. func Tokenize becomes public to allow users to provide their own implementation and completely replace IterTokenizer.

    Model and Extractor need to use the same Tokenizer as Document, so this PR modifies those APIs to be consistent.

    (Also separating makeCorpus from extracterFromData to simplify parameter passing.)

    This solves issues #41 and #32.
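
    For context, the interface being introduced is presumably something minimal along these lines (a hypothetical sketch, not necessarily the PR's exact signature):

    type Tokenizer interface {
        // Tokenize splits raw text into a slice of tokens.
        Tokenize(text string) []*Token
    }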

  • sentences first, then words?

    I'm a bit surprised to see this:

    type Document struct {
            Model *Model
            Text  string
    
            // TODO: Store offsets (begin, end) instead of `text` field.
            entities  []Entity
            sentences []Sentence
            tokens    []*Token
    }
    
    // A Sentence represents a segmented portion of text.
    type Sentence struct {
            Text string // The sentence's text.
    }
    

    If I care about finding sentences first, then the words within them, I need to take two passes, right?

    // First we will do only segmentation, to break up sentences
    doc, err := prose.NewDocument(string(content),
        prose.WithTagging(false),
        prose.WithTokenization(false),
        prose.WithExtraction(false))
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's sentences, and words within them
    sents := doc.Sentences()
    fmt.Println(len(sents))
    for _, sent := range sents {
        fmt.Println(sent.Text)

        sdoc, err := prose.NewDocument(sent.Text,
            prose.WithTagging(false),
            prose.WithExtraction(false),
            prose.WithSegmentation(false))
        if err != nil {
            log.Fatal(err)
        }

        // Iterate over the doc's tokens:
        for _, tok := range sdoc.Tokens() {
            fmt.Println(tok.Text, tok.Tag, tok.Label)
        }

        // Iterate over the doc's named-entities:
        for _, ent := range sdoc.Entities() {
            fmt.Println(ent.Text, ent.Label)
        }
    }


    I suspect it's less efficient that way.

    Likewise, tokens include named entities, but it might make more sense to be able to iterate tokens in a sentence in such a way that each token is either a named entity or a regular token?

    If you intend to store offsets, like the TODO comment says, maybe this can be worked around, by finding overlapping offset ranges (e.g. sentence 2 goes from character position 240 to 267; word 10 goes from 240 to 247; named entity 2 goes from 248 to 257; etc... then I can see that word 10 and named entity 2 are both part of sentence 2, even if you don't offer a hierarchical model).
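
    A quick sketch of that offset idea, with hypothetical types (prose exposes none of this today):

    package main

    import "fmt"

    // Span is a hypothetical offset-annotated unit (sentence, token, or entity).
    type Span struct {
        Text       string
        Begin, End int
    }

    // within returns the spans that fall entirely inside sent's range,
    // e.g. grouping word 10 and named entity 2 under sentence 2.
    func within(sent Span, spans []Span) []Span {
        var out []Span
        for _, s := range spans {
            if s.Begin >= sent.Begin && s.End <= sent.End {
                out = append(out, s)
            }
        }
        return out
    }

    func main() {
        sent := Span{"sentence 2", 240, 267}
        units := []Span{{"word 10", 240, 247}, {"named entity 2", 248, 257}}
        fmt.Println(within(sent, units)) // both fall inside sentence 2
    }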

  • Seeing ~100ms overhead per doc. Did performance break, or am I using the API incorrectly?

    I am seeing a ~100ms overhead to process any document (text), which seems like it can't be correct given the performance data listed for large corpora. I've been porting a large Python/NLTK app to Go, but my legacy Python/NLTK text parsing is running ~250x faster than my new Go/prose implementation, which seems like I must be doing something wrong.

    Did some external dependency break the performance of prose, or am I using the API incorrectly?

    Simple performance test (Go 1.18)

    Here's a simple test that processes the same short sentence twice. Both executions take ~100ms.

    Code:

    package main
    
    import (
    	"fmt"
    	"time"
    
    	"github.com/jdkato/prose/v2"
    )
    
    var text = "This is a simple test."
    
    func main() {
    	for i := 0; i < 2; i++ {
    		start := time.Now()
    		doc, err := prose.NewDocument(
    			text,
    			prose.WithExtraction(false),
    			prose.WithSegmentation(false))
    		duration := time.Since(start)
    		fmt.Println(duration)
    		if err != nil {
    			panic(err)
    		}
    
    		// Iterate over the doc's tokens:
    		fmt.Print("   ")
    		for _, tok := range doc.Tokens() {
    			fmt.Printf("(%v, %v)  ", tok.Text, tok.Tag)
    		}
    		fmt.Println()
    	}
    }
    

    Output:

    $ go run .
    118.549243ms
       (This, DT)  (is, VBZ)  (a, DT)  (simple, JJ)  (test, NN)  (., .)  
    117.214746ms
       (This, DT)  (is, VBZ)  (a, DT)  (simple, JJ)  (test, NN)  (., .)  
    $
    

    Comparison test using NLTK in Python (3.8)

    When I run the same test using NLTK in Python, the first document processed also has ~100ms of overhead, but all subsequent documents are processed very quickly (~400usec in the example below):

    Sample code:

    #!/usr/bin/env python
    import nltk
    from datetime import datetime
    
    text = "This is a simple test."
    
    for _ in range(2):
        start = datetime.now()
        raw_tokens = nltk.word_tokenize(text)
        pos_tokens = nltk.pos_tag(raw_tokens)
        duration = datetime.now() - start
        print(duration)
        print(f'   {pos_tokens}')
    

    Output:

    $ ./test-nltk.py 
    0:00:00.092738
       [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('test', 'NN'), ('.', '.')]
    0:00:00.000415
       [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('test', 'NN'), ('.', '.')]
    $
    
  • Update dependency neurosnap/sentences

    I'd recommend updating the dependency neurosnap/sentences to use its new location (github.com/neurosnap/sentences) and the latest version (v1.0.9). This version fixes a very minor security problem where the readme linked to some Amazon S3 buckets for downloading binaries. When repositories vendor jdkato/prose, they will also get neurosnap/sentences, and it is better if they get this updated version.
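
    With Go modules, that's a one-line bump (version per the suggestion above):

    go get github.com/neurosnap/sentences@v1.0.9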

  • Cannot reproduce tokenization example

    I installed prose with go get github.com/jdkato/prose/v2 and copied the code from the tokenization example in the readme.

    The output I get is

    @jdkato NN
    , ,
    go VB
    to TO
    http://example.com NN
    thanks NNS
    : :
    ) )
    . .
    

    Instead of the expected one reported in the example:

            // @jdkato NN
            // , ,
            // go VB
            // to TO
            // http://example.com NN
            // thanks NNS
            // :) SYM
            // . .
    

    The difference is that the ":)" emoticon is not recognized as a single token.

    $ go version
    go version go1.16.5 linux/amd64

  • Help: Unable to run on AWS lambda

    Hi, I have a function that uses prose/v2.

    It works fine and passes my tests (just the logic, though).

    When I deploy it on AWS Lambda, it doesn't run. I have logs right up to the point where I call

    prose.NewDocument(stringVal)
    

    Nothing logs out after that: no error, nothing. I'm admittedly a noob with AWS Lambda, though, so maybe there are logs I'm not looking at.

    What is it doing internally? I wonder why it would just end.
