:book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

prose

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get github.com/jdkato/prose/v2

Usage

Overview

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))
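
These options can be combined. As a minimal sketch (using the same WithTagging and WithSegmentation options that appear elsewhere on this page), a tokenization-only pipeline looks like this:

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Tokenization only: skip tagging, entity extraction, and segmentation.
    doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithTagging(false),
        prose.WithExtraction(false),
        prose.WithSegmentation(false))
    if err != nil {
        log.Fatal(err)
    }

    // Only Token.Text is meaningful here, since tagging was skipped.
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text)
    }
}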

Tokenizing

prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.

| Type            | Example                         |
|-----------------|---------------------------------|
| Email addresses | [email protected]               |
| Hashtags        | #trending                       |
| Mentions        | @jdkato                         |
| URLs            | https://github.com/jdkato/prose |
| Emoticons       | :-), >:(, o_0, etc.             |

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.

| Name                | Language | License   | GRS (English)  | GRS (Other) | Speed†  |
|---------------------|----------|-----------|----------------|-------------|---------|
| Pragmatic Segmenter | Ruby     | MIT       | 98.08% (51/52) | 100.00%     | 3.84 s  |
| prose               | Go       | MIT       | 75.00% (39/52) | N/A         | 0.96 s  |
| TactfulTokenizer    | Ruby     | GNU GPLv3 | 65.38% (34/52) | 48.57%      | 46.32 s |
| OpenNLP             | Java     | APLv2     | 59.62% (31/52) | 45.71%      | 1.27 s  |
| Stanford CoreNLP    | Java     | GNU GPLv3 | 59.62% (31/52) | 31.43%      | 0.92 s  |
| Splitta             | Python   | APLv2     | 55.77% (29/52) | 37.14%      | N/A     |
| Punkt               | Python   | APLv2     | 46.15% (24/52) | 48.57%      | 1.79 s  |
| SRX English         | Ruby     | GNU GPLv3 | 30.77% (16/52) | 28.57%      | 6.19 s  |
| Scalpel             | Ruby     | GNU GPLv3 | 28.85% (15/52) | 20.00%      | 0.13 s  |

† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running OS X 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running macOS 10.13.3.

package main

import (
    "fmt"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}

Tagging

prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|---------|----------|---------------------|
| NLTK    | 0.893    | 7.224               |
| prose   | 0.961    | 2.538               |

(See scripts/test_model.py for more information.)

The full list of supported POS tags is given below.

| TAG  | DESCRIPTION                               |
|------|-------------------------------------------|
| (    | left round bracket                        |
| )    | right round bracket                       |
| ,    | comma                                     |
| :    | colon                                     |
| .    | period                                    |
| ''   | closing quotation mark                    |
| ``   | opening quotation mark                    |
| #    | number sign                               |
| $    | currency                                  |
| CC   | conjunction, coordinating                 |
| CD   | cardinal number                           |
| DT   | determiner                                |
| EX   | existential there                         |
| FW   | foreign word                              |
| IN   | conjunction, subordinating or preposition |
| JJ   | adjective                                 |
| JJR  | adjective, comparative                    |
| JJS  | adjective, superlative                    |
| LS   | list item marker                          |
| MD   | verb, modal auxiliary                     |
| NN   | noun, singular or mass                    |
| NNP  | noun, proper singular                     |
| NNPS | noun, proper plural                       |
| NNS  | noun, plural                              |
| PDT  | predeterminer                             |
| POS  | possessive ending                         |
| PRP  | pronoun, personal                         |
| PRP$ | pronoun, possessive                       |
| RB   | adverb                                    |
| RBR  | adverb, comparative                       |
| RBS  | adverb, superlative                       |
| RP   | adverb, particle                          |
| SYM  | symbol                                    |
| TO   | infinitival to                            |
| UH   | interjection                              |
| VB   | verb, base form                           |
| VBD  | verb, past tense                          |
| VBG  | verb, gerund or present participle        |
| VBN  | verb, past participle                     |
| VBP  | verb, non-3rd person singular present     |
| VBZ  | verb, 3rd person singular present         |
| WDT  | wh-determiner                             |
| WP   | wh-pronoun, personal                      |
| WP$  | wh-pronoun, possessive                    |
| WRB  | wh-adverb                                 |
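
Since every Token carries its Tag, filtering by part of speech is a simple loop. Here's a minimal sketch (the sample text is arbitrary) that collects the nouns, i.e. any token whose tag begins with "NN":

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Keep only the noun tokens (NN, NNS, NNP, NNPS).
    var nouns []string
    for _, tok := range doc.Tokens() {
        if strings.HasPrefix(tok.Tag, "NN") {
            nouns = append(nouns, tok.Text)
        }
    }
    fmt.Println(nouns) // e.g., [Go language Google]
}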

NER

prose v2.0.0 includes a much-improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.

package main

import (
    "fmt"

    "github.com/jdkato/prose/v2"
)

func main() {
    doc, _ := prose.NewDocument("LeBron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // LeBron James PERSON
        // Los Angeles GPE
    }
}

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.
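
As a rough sketch of that workflow: the EntityContext, LabeledEntity, UsingEntities, ModelFromData, and UsingModel names below follow the v2 training API described in the tutorial, but treat the details (and the toy training data) as illustrative rather than authoritative.

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // A toy training set: each context marks the labeled spans in its text.
    // Offsets are byte positions; "Gopher" spans [7, 13) in the first example.
    train := []prose.EntityContext{
        {Text: "I like Gopher.",
            Spans:  []prose.LabeledEntity{{Start: 7, End: 13, Label: "PRODUCT"}},
            Accept: true},
        {Text: "They use Gopher at work.",
            Spans:  []prose.LabeledEntity{{Start: 9, End: 15, Label: "PRODUCT"}},
            Accept: true},
    }

    // Train a model for the custom PRODUCT label and use it for extraction.
    model := prose.ModelFromData("PRODUCT", prose.UsingEntities(train))
    doc, err := prose.NewDocument("Gopher is great.", prose.UsingModel(model))
    if err != nil {
        log.Fatal(err)
    }
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
    }
}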

Comments
  • The example in readme does not compile

    I am referring to this:

    package main
    import "gopkg.in/jdkato/prose.v2"
    func main() { prose.NewDocument("Go is ...") }
    

    The NewDocument is actually in gopkg.in/jdkato/prose.v2/summarize now.

    However, go get gopkg.in/jdkato/prose.v2/summarize does not work either. The package does not compile due to its use of an internal package. This is a typical error when using the gopkg.in service: gopkg.in only redirects the git URI, but does not rewrite package import paths. As a result, the referenced imports will actually point back to the master branch, nullifying the essential purpose of versioning. To use gopkg.in properly, you need to manually rewrite the import paths across the entire repo in the release tags/branches (or just stop using gopkg.in for multi-package repos).

    I saw your repo from Hacker News, but your repo fails to build on smallrepo. Detailed build log here:

    https://smallrepo.com/builds/20180717-175536-bc73d63d

    Thanks.

  • Wrap location parsing in a function

    This moves the logic that was in chunk_test.go into a new exported function named Chunk. The primary difference is that, instead of returning a slice of locations, we're now returning a slice of strings (i.e., the actual chunks).

    This changes the usage from

    words := tokenize.TextToWords(text)
    tagger := tag.NewPerceptronTagger()
    tagged := tagger.Tag(words)
    rs := Locate(tagged, TreebankNamedEntities)
    
    for r, loc := range rs {
        res := ""
        for t, tt := range tagged[loc[0]:loc[1]] {
            if t != 0 {
                res += " "
            }
            res += tt.Text
        }
    
        if r >= len(expected) {
            t.Error("ERROR unexpected result: " + res)
        } else {
            if res != expected[r] {
                t.Error("ERROR", res, "!=", expected[r])
            }
        }
    }
    

    to

    words := tokenize.TextToWords(text)
    tagger := tag.NewPerceptronTagger()
    tagged := tagger.Tag(words)
    
    for i, chunk := range Chunk(tagged, TreebankNamedEntities) {
        if i >= len(expected) {
            t.Error("ERROR unexpected result: " + chunk)
        } else {
            if chunk != expected[i] {
                t.Error("ERROR", chunk, "!=", expected[i])
            }
        }
    }
    

    /cc @elliott5

  • Possible enhancements to the "summarize" package

    Have you considered adding the Coleman–Liau index for completeness? Even though "opinion varies on its accuracy": https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index

    On the subject of suspect measures, a composite "years-of-education" metric, taking the average of scores together (and their standard deviation) may be of use: https://github.com/elliott5/readability/blob/master/assess.go

    Finally, for giving feedback to users on how to change their prose to be easier to read, it would be great if your analysis could store:

    • sentences with their word length; and
    • words with their syllable length and frequency (the product of the two ranking non-readability).

    Keep up the good work!

  • go get not working

    I tried to go get the hugo repo. I am using Go 1.14 (with go modules)

    go get github.com/gohugoio/hugo

    package github.com/jdkato/prose/transform: cannot find package "github.com/jdkato/prose/transform" in any of:
    	/usr/local/go/src/github.com/jdkato/prose/transform (from $GOROOT)
    	/Users/x/GOPATH/src/github.com/jdkato/prose/transform (from $GOPATH)
    

    I'm wondering why it is saying transform can't be found?

  • Make it possible to use vendored

    Very useful project.

    I wanted to test the title case functionality for use in Hugo, but we vendor our libraries, and I get a ../vendor/github.com/jdkato/prose/transform/title.go:9:2: use of internal package not allowed when importing github.com/jdkato/prose/transform.

    See https://github.com/gohugoio/hugo/pull/3753

    I have been Googling this, and it seems there is no (simple) workaround other than avoiding the use of internal packages in libraries.

  • Roadmap

    This is a rough outline of some improvements I'd like to make.

    Documentation

    • [x] Improve and update README.
    • [ ] Add .github files.

    tokenize

    • [x] Port the pragmatic_segmenter.
    • [x] Get PunktSentenceTokenizer passing the Golden Rules and possibly submit a PR upstream.

    tag

    • [x] Finish porting the PerceptronTagger (just training functionality left).
    • [x] ~~Port the TnT Tagger~~ (not going to make v1.0).
    • [ ] Improve testing strategy (we currently rely on NLTK).

    transform

    • [x] Improve Title and add support for variations (e.g., AP style).
    • [ ] Port Change Case.

    summarize

    • [x] Finish working on Syllables.
    • [x] Add a composite "years-of-education" metric.
    • [ ] Add the ability to update a Document's content without having to recalculate all of its statistics.
    • [x] Expand test suite.
  • Allow overridable tokenizer parameters.

    This PR changes the iterTokenizer struct to contain the parameters that can be overridden. This will allow future changes to NewDocument that use a custom tokenizer.

    Will help with issues #41 and #32.

    Test: existing unit test

  • Prose installation issues

    I tried installing prose using the command provided go get gopkg.in/jdkato/prose.v2 and this is the error log printed on my terminal.

    # gopkg.in/jdkato/prose.v2
    ../gopkg.in/jdkato/prose.v2/extract.go:323:23: est.AtVec undefined (type *mat.VecDense has no field or method AtVec)
    ../gopkg.in/jdkato/prose.v2/extract.go:335:25: est.AtVec undefined (type *mat.VecDense has no field or method AtVec)
    ../gopkg.in/jdkato/prose.v2/extract.go:355:17: count.AtVec undefined (type *mat.VecDense has no field or method AtVec)
    ../gopkg.in/jdkato/prose.v2/extract.go:525:27: count.AtVec undefined (type *mat.VecDense has no field or method AtVec)
    

    go-lang version : go1.9.7 darwin/amd64

  • slice bounds out of range, title.go:49

    I'm trying to use TitleConverter.Title(), but it panics when the string contains certain multibyte characters.

    Crashing testcases:

    type tc struct {
        name  string
        input string
        want  string
    }
    tests := []tc{
        tc{"panic", "This Agreement, dated [DATE] (the “Effective Date”) for Design Services (the “Agreement”) is between [DESIGNER NAME], of [DESIGNER COMPANY](“Designer”), and [CLIENT NAME], of [CLIENT COMPANY] (“Client”) (together known as the “Parties”), for the performance of said Design Services and the production of Deliverables, as described in Schedule A, attached hereto and incorporated herein by reference.", "panic"},
        tc{"panic", "Crash,”“us,” “our” or “we” means Crash Network, Inc. (d/b/a Crash) and its subsidiaries and affiliates.", "panic"},
        tc{"panic", "a “[“New Entity”],” an [Institution] and [Institution].", "panic"},
    }
    
  • List of Label?

    Hi, I can see that we have a comprehensive list of tags, but I can't find anything for labels (for example, PERSON, GPE, etc.). It would be nice if someone could redirect me to the list, even if it is somewhere in the source code.

  • Introduce Tokenizer interface

    This PR allows the user to provide a different tokenizer.

    Users can specify their own Tokenizer in the DocOpts. This replaces the boolean Tokenize option (set Tokenizer to nil to disable).

    Currently only IterTokenizer is provided, which can be customized with its own Using options. func Tokenize becomes public to allow users to provide their own implementation and completely replace IterTokenizer.

    Model and Extractor need to use the same Tokenizer as Document, so this PR modifies those APIs to be consistent.

    (Also separating makeCorpus from extracterFromData to simplify parameter passing.)

    This solves issues #41 and #32.
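
    For context, the interface being introduced is presumably something minimal along these lines (a hypothetical sketch, not necessarily the PR's exact signature):

    type Tokenizer interface {
        // Tokenize splits raw text into a slice of tokens.
        Tokenize(text string) []*Token
    }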

  • sentences first, then words?

    I'm a bit surprised to see this:

    type Document struct {
            Model *Model
            Text  string
    
            // TODO: Store offsets (begin, end) instead of `text` field.
            entities  []Entity
            sentences []Sentence
            tokens    []*Token
    }
    
    // A Sentence represents a segmented portion of text.
    type Sentence struct {
            Text string // The sentence's text.
    }
    

    If I care about finding sentences first, then the words within them, I need to take two passes, right?

    // First we will do only segmentation, to break up sentences
    doc, err := prose.NewDocument(string(content),
        prose.WithTagging(false),
        prose.WithTokenization(false),
        prose.WithExtraction(false))
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's sentences, and words within them
    sents := doc.Sentences()
    fmt.Println(len(sents))
    for _, sent := range sents {
        fmt.Println(sent.Text)

        sdoc, err := prose.NewDocument(sent.Text,
            prose.WithTagging(false),
            prose.WithExtraction(false),
            prose.WithSegmentation(false))
        if err != nil {
            log.Fatal(err)
        }

        // Iterate over the doc's tokens:
        for _, tok := range sdoc.Tokens() {
            fmt.Println(tok.Text, tok.Tag, tok.Label)
        }

        // Iterate over the doc's named-entities:
        for _, ent := range sdoc.Entities() {
            fmt.Println(ent.Text, ent.Label)
        }
    }


    I suspect it's less efficient that way.

    Likewise, tokens include named entities, but it might make more sense to be able to iterate tokens in a sentence in such a way that each token is either a named entity or a regular token?

    If you intend to store offsets, like the TODO comment says, maybe this can be worked around, by finding overlapping offset ranges (e.g. sentence 2 goes from character position 240 to 267; word 10 goes from 240 to 247; named entity 2 goes from 248 to 257; etc... then I can see that word 10 and named entity 2 are both part of sentence 2, even if you don't offer a hierarchical model).
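
    A quick sketch of that offset idea, with hypothetical types (prose exposes none of this today):

    package main

    import "fmt"

    // Span is a hypothetical offset-annotated unit (sentence, token, or entity).
    type Span struct {
        Text       string
        Begin, End int
    }

    // within returns the spans that fall entirely inside sent's range,
    // e.g. grouping word 10 and named entity 2 under sentence 2.
    func within(sent Span, spans []Span) []Span {
        var out []Span
        for _, s := range spans {
            if s.Begin >= sent.Begin && s.End <= sent.End {
                out = append(out, s)
            }
        }
        return out
    }

    func main() {
        sent := Span{"sentence 2", 240, 267}
        units := []Span{{"word 10", 240, 247}, {"named entity 2", 248, 257}}
        fmt.Println(within(sent, units)) // both fall inside sentence 2
    }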

  • Seeing ~100ms overhead per doc. Did performance break, or am I using the API incorrectly?

    I am seeing a ~100ms overhead to process any document (text), which seems like it can't be correct given the performance data listed for large corpora. I've been porting a large Python/NLTK app to Go, but my legacy Python/NLTK text parsing is running ~250x faster than my new Go/prose implementation, which seems like I must be doing something wrong.

    Did some external dependency break the performance of prose, or am I using the API incorrectly?

    Simple performance test (Go 1.18)

    Here's a simple test that processes the same short sentence twice. Both executions take ~100ms.

    Code:

    package main
    
    import (
    	"fmt"
    	"time"
    
    	"github.com/jdkato/prose/v2"
    )
    
    var text = "This is a simple test."
    
    func main() {
    	for i := 0; i < 2; i++ {
    		start := time.Now()
    		doc, err := prose.NewDocument(
    			text,
    			prose.WithExtraction(false),
    			prose.WithSegmentation(false))
    		duration := time.Since(start)
    		fmt.Println(duration)
    		if err != nil {
    			panic(err)
    		}
    
    		// Iterate over the doc's tokens:
    		fmt.Print("   ")
    		for _, tok := range doc.Tokens() {
    			fmt.Printf("(%v, %v)  ", tok.Text, tok.Tag)
    		}
    		fmt.Println()
    	}
    }
    

    Output:

    $ go run .
    118.549243ms
       (This, DT)  (is, VBZ)  (a, DT)  (simple, JJ)  (test, NN)  (., .)  
    117.214746ms
       (This, DT)  (is, VBZ)  (a, DT)  (simple, JJ)  (test, NN)  (., .)  
    $
    

    Comparison test using NLTK in Python (3.8)

    When I run the same test using NLTK in Python, the first document processed also has ~100ms of overhead, but all subsequent documents are processed very quickly (~400usec in the example below):

    Sample code:

    #!/usr/bin/env python
    import nltk
    from datetime import datetime
    
    text = "This is a simple test."
    
    for _ in range(2):
        start = datetime.now()
        raw_tokens = nltk.word_tokenize(text)
        pos_tokens = nltk.pos_tag(raw_tokens)
        duration = datetime.now() - start
        print(duration)
        print(f'   {pos_tokens}')
    

    Output:

    $ ./test-nltk.py 
    0:00:00.092738
       [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('test', 'NN'), ('.', '.')]
    0:00:00.000415
       [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('test', 'NN'), ('.', '.')]
    $
    
  • Update dependency neurosnap/sentences

    I'd recommend updating the dependency neurosnap/sentences to use its new location (github.com/neurosnap/sentences) and the latest version (v1.0.9). This version fixes a very minor security problem where the readme linked to some Amazon S3 buckets for downloading binaries. When repositories vendor jdkato/prose, they will also get neurosnap/sentences, and it is better if they get this updated version.
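
    With Go modules, that's a one-line bump (version per the suggestion above):

    go get github.com/neurosnap/sentences@v1.0.9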

  • Cannot reproduce tokenization example

    I installed prose with go get github.com/jdkato/prose/v2 and copied the code from the tokenization example in the readme.

    The output I get is

    @jdkato NN
    , ,
    go VB
    to TO
    http://example.com NN
    thanks NNS
    : :
    ) )
    . .
    

    Instead of the expected one reported in the example:

            // @jdkato NN
            // , ,
            // go VB
            // to TO
            // http://example.com NN
            // thanks NNS
            // :) SYM
            // . .
    

    The difference is that the ":)" emoticon is not recognized as a single token.

    $ go version
    go version go1.16.5 linux/amd64

  • Help: Unable to run on AWS lambda

    Hi, I have a function that uses prose/v2.

    It works fine and passes my tests (just the logic, though).

    When I deploy it on AWS Lambda, it doesn't run. I have logs right up to the point where I call

    prose.NewDocument(stringVal)
    

    Nothing logs out after that: no error, nothing. I'm admittedly a noob with AWS Lambda, though, so maybe there are logs I'm not looking at.

    What is it doing internally? I wonder why it would just end.
