A multilingual command line sentence tokenizer in Golang

Sentences - A command line sentence tokenizer

This command line utility will convert a blob of text into a list of sentences.

Install

go get gopkg.in/neurosnap/sentences.v1
go install gopkg.in/neurosnap/sentences.v1/_cmd/sentences

Binaries

Linux

Mac

Windows

Command

Command line

Get it

go get gopkg.in/neurosnap/sentences.v1

Use it

package main

import (
    "fmt"

    "gopkg.in/neurosnap/sentences.v1"
    "gopkg.in/neurosnap/sentences.v1/data"
)

func main() {
    text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
    died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
    the long shot. In short order, the Legislature attempted to pass a law allowing
    former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
    law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

    // Compiling language-specific data into a binary file can be accomplished
    // by using `make <lang>` and then loading the `json` data:
    b, err := data.Asset("data/english.json")
    if err != nil {
        panic(err)
    }

    // load the training data
    training, err := sentences.LoadTraining(b)
    if err != nil {
        panic(err)
    }

    // create the default sentence tokenizer
    tokenizer := sentences.NewSentenceTokenizer(training)
    for _, s := range tokenizer.Tokenize(text) {
        fmt.Println(s.Text)
    }
}

English

This package attempts to fix some tokenization problems I noticed for English.

package main

import (
    "fmt"

    "gopkg.in/neurosnap/sentences.v1/english"
)

func main() {
    text := "Hi there. Does this really work?"

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    sentences := tokenizer.Tokenize(text)
    for _, s := range sentences {
        fmt.Println(s.Text)
    }
}

Contributing

I need help maintaining this library. If you are interested in contributing, please start by looking at the golden-rules branch, which tests the Golden Rules for English sentence tokenization created by the Pragmatic Segmenter library.

Pick a particular failing test, then create an issue or submit a PR for it.

I'm happy to help anyone willing to contribute.

Customizable

Sentences was built around composability; most major components of this package can be extended.

Eager to make ad hoc changes but don't know how to start? Have a look at github.com/neurosnap/sentences/english for a solid example.

Notice

I have not tested this tokenizer with any language other than English. By default the command line utility loads English. I welcome anyone willing to test the other languages and submit updates as needed.

A primary goal for this package is to be multilingual so I'm willing to help in any way possible.

This library is a port of NLTK's Punkt tokenizer.

A Punkt Tokenizer

An unsupervised multilingual sentence boundary detection library for Go. The Punkt system accomplishes this by training the tokenizer on text in a given language. Once the likelihoods of abbreviations, collocations, and sentence starters have been determined, finding sentence boundaries becomes much easier.

There are many problems that arise when tokenizing text into sentences, the primary issue being abbreviations. The Punkt system attempts to determine whether a word is an abbreviation, the end of a sentence, or both, by training on text in the given language. Punkt incorporates both token- and type-based analysis of the text through two different phases of annotation.

See Kiss & Strunk, "Unsupervised Multilingual Sentence Boundary Detection" (Computational Linguistics, 2006).

Performance

Using the Brown Corpus, an annotated corpus of American English, we compare this package with other libraries across multiple programming languages.

Library      Avg Speed (s, 10 runs)   Accuracy (%)
Sentences    1.96                     98.95
NLTK         5.22                     99.21
Owner
Eric Bower
Comments
  • Improve performance on Golden Rules

    Hi,

    I did a pass over the Golden Rules and I was able to get the failing cases (excluding the list-related tests) down to 4:

    --- FAIL: TestGoldenRules (0.00s)
            golden_rules_test.go:11: 14. Multi-period abbreviations at the end of a sentence
            golden_rules_test.go:12: Actual: [<Sentence [0:33]>]
            golden_rules_test.go:13: Actual: 1, Expected: 2
            golden_rules_test.go:14: ===
            golden_rules_test.go:11: 15. U.S. as sentence boundary
            golden_rules_test.go:12: Actual: [<Sentence [0:33]>]
            golden_rules_test.go:13: Actual: 1, Expected: 2
            golden_rules_test.go:14: ===
            golden_rules_test.go:11: 18. A.M. / P.M. as non sentence boundary and sentence boundary
            golden_rules_test.go:12: Actual: [<Sentence [0:37]> <Sentence [37:98]>]
            golden_rules_test.go:13: Actual: 2, Expected: 3
            golden_rules_test.go:14: ===
            golden_rules_test.go:20: 43. Geo Coordinates
            golden_rules_test.go:21: Actual: [You can find it at N°.] Expected: [You can find it at N°. 1026.253.553.]
            golden_rules_test.go:22: ===
    

    My changes are all local to the english package and all other tests are still passing. Here's the performance impact on the English test files:

    name              old time/op  new time/op  delta
    EnglishPackage-4  8.70ms ± 1%  9.42ms ± 1%  +8.31%  (p=0.008 n=5+5)
    

    I'd like to hear your thoughts on my progress so far.

    Thanks for your work on this useful package!

  • Fix for word tokenizer dropping last word in input.

    Currently, the word tokenizer fails to return the last token in a given string. For example, "This is a test sentence" is output as {This is a test}. The problem occurs at line 81, where the loop continues if the current character is not a space. I added a condition there to only continue if the current character is not the last in the input.

    Lines 90-95 handle the case where i is the last character: if it is, then we want text[lastSpace:]; otherwise text[len(text)-1] never has the opportunity to be included in a token, and "This is a test sentence" comes out as {This is a test sentenc}.

    The TravisCI build is currently failing but this is due to the sentence tokenizer which I did not alter.

    I'm hoping to write some tests and benchmarks for the word tokenizer, but I thought I'd get the bugfix out ASAP in case others are affected.

    Cheers!

  • optimization suggestion

    Staring at a profile at the moment where it appears that regex compilation happens on each tokenization. Seems like caching compiled regexes would make this (awesome) library twice as fast on a large corpus?


  • Data file structure/creation

    I'd like to train a punkt model on a custom corpus: in this case, a large set of tweets collected from the Twitter API. While it is technically English, I'm not having great results with any off-the-shelf tokenizer available in Go. Twitter obviously has some idiosyncrasies: unique abbreviations, the misspellings inherent to web text, URLs, emoji... and so on. I wanted to take an unsupervised approach first, which led me to punkt and this package.

    I've taken a quick look at the data files provided in the repo, but it isn't completely clear what the structure is or how they were created. I'm happy to make a pull request with my work if I see some results, but if you could point me in the right direction as to the generation of the data file for a custom corpus, I'd really appreciate it. It's entirely possible that I'm missing some existing documentation. If not, I'd be happy to clean up any explanation you can give and make a pull request to include it in the docs. Thanks!

  • LoadTraining fails with

    Instead of compiling the assets with go-bindata, I load the JSON files like so:

    f, err := os.Open(initalisationfilenames[lang].segmentationfilename)
    if err != nil {
    	return nil, err
    }
    b, err := ioutil.ReadAll(f)
    if err != nil {
    	return nil, err
    }
    

    The results seem to be correct, can you confirm?

  • spf13/cobra for command line leads to many recursive deps

    When you use sentences as a module and don't need the command line utility, you're stuck vendoring a massive number of recursive dependencies stemming from spf13/cobra.

    gvt fetch gopkg.in/neurosnap/sentences.v1
    2017/03/05 20:23:42 Fetching: gopkg.in/neurosnap/sentences.v1
    2017/03/05 20:23:47 · Fetching recursive dependency: github.com/spf13/cobra
    2017/03/05 20:23:50 ·· Fetching recursive dependency: github.com/spf13/viper
    2017/03/05 20:23:52 ··· Fetching recursive dependency: github.com/fsnotify/fsnotify
    2017/03/05 20:23:54 ···· Skipping (existing): golang.org/x/sys/unix
    2017/03/05 20:23:54 ··· Fetching recursive dependency: github.com/mitchellh/mapstructure
    2017/03/05 20:23:56 ··· Fetching recursive dependency: github.com/xordataexchange/crypt/config
    2017/03/05 20:23:58 ···· Fetching recursive dependency: github.com/xordataexchange/crypt/backend
    2017/03/05 20:23:59 ····· Fetching recursive dependency: github.com/armon/consul-api
    2017/03/05 20:24:01 ····· Fetching recursive dependency: github.com/coreos/go-etcd/etcd
    2017/03/05 20:24:05 ······ Fetching recursive dependency: github.com/ugorji/go/codec
    2017/03/05 20:24:08 ···· Fetching recursive dependency: github.com/xordataexchange/crypt/encoding/secconf
    2017/03/05 20:24:08 ····· Fetching recursive dependency: golang.org/x/crypto/openpgp
    2017/03/05 20:24:10 ······ Fetching recursive dependency: golang.org/x/crypto/cast5
    2017/03/05 20:24:10 ··· Fetching recursive dependency: github.com/spf13/jwalterweatherman
    2017/03/05 20:24:12 ··· Fetching recursive dependency: github.com/spf13/afero
    2017/03/05 20:24:14 ···· Fetching recursive dependency: github.com/pkg/sftp
    2017/03/05 20:24:17 ····· Fetching recursive dependency: github.com/pkg/errors
    2017/03/05 20:24:19 ····· Fetching recursive dependency: github.com/kr/fs
    2017/03/05 20:24:21 ····· Fetching recursive dependency: golang.org/x/crypto/ssh
    2017/03/05 20:24:21 ····· Deleting existing subpackage to prevent overlap: golang.org/x/crypto/ssh/terminal
    2017/03/05 20:24:21 ······ Fetching recursive dependency: golang.org/x/crypto/ed25519
    2017/03/05 20:24:21 ······ Skipping (existing): golang.org/x/sys/unix
    2017/03/05 20:24:21 ······ Fetching recursive dependency: golang.org/x/crypto/curve25519
    2017/03/05 20:24:21 ···· Fetching recursive dependency: golang.org/x/text/unicode/norm
    2017/03/05 20:24:23 ····· Fetching recursive dependency: golang.org/x/text/transform
    2017/03/05 20:24:23 ····· Fetching recursive dependency: golang.org/x/text/internal/triegen
    2017/03/05 20:24:23 ····· Fetching recursive dependency: golang.org/x/text/internal/ucd
    2017/03/05 20:24:23 ····· Skipping (existing): golang.org/x/text/internal/gen
    2017/03/05 20:24:23 ··· Fetching recursive dependency: github.com/spf13/pflag
    2017/03/05 20:24:26 ··· Fetching recursive dependency: github.com/pelletier/go-toml
    2017/03/05 20:24:28 ···· Fetching recursive dependency: github.com/pelletier/go-buffruneio
    2017/03/05 20:24:30 ··· Fetching recursive dependency: github.com/spf13/cast
    2017/03/05 20:24:32 ··· Fetching recursive dependency: github.com/magiconair/properties
    2017/03/05 20:24:34 ··· Fetching recursive dependency: github.com/hashicorp/hcl
    2017/03/05 20:24:36 ··· Fetching recursive dependency: gopkg.in/yaml.v2
    2017/03/05 20:24:39 ·· Fetching recursive dependency: github.com/inconshreveable/mousetrap
    2017/03/05 20:24:41 ·· Fetching recursive dependency: github.com/cpuguy83/go-md2man/md2man
    2017/03/05 20:24:43 ··· Fetching recursive dependency: github.com/cpuguy83/go-md2man/vendor/github.com/russross/blackfriday
    2017/03/05 20:24:43 ···· Fetching recursive dependency: github.com/cpuguy83/go-md2man/vendor/github.com/shurcooL/sanitized_anchor_name
    2017/03/05 20:24:43 · Fetching recursive dependency: github.com/neurosnap/sentences/english
    
  • Ellipses are split off into sentences

    The output of this selection of text:

    “Can’t, Tom, I’m on Hogwarts business,” said Hagrid, clapping his great hand on Harry’s shoulder and making Harry’s knees buckle. “Good Lord,” said the bartender, peering at Harry, “is this — can this be — ?” The Leaky Cauldron had suddenly gone completely still and silent. “Bless my soul,” whispered the old bartender, “Harry Potter . . . what an honor.” He hurried out from behind the bar, rushed toward Harry and seized his hand, tears in his eyes. “Welcome back, Mr. Potter, welcome back.”

    is:

    “Can’t, Tom, I’m on Hogwarts business,” said Hagrid, clapping his great hand on Harry’s shoulder and making Harry’s knees buckle.
    “Good Lord,” said the bartender, peering at Harry, “is this — can this be — ?”
    The Leaky Cauldron had suddenly gone completely still and silent.
    “Bless my soul,” whispered the old bartender, “Harry Potter . . . what an honor.”
    He hurried out from behind the bar, rushed toward Harry and seized his hand, tears in his eyes.
    “Welcome back, Mr. Potter, welcome back.”
    The above is what I'd expect; the actual output splits the ellipsis:
    “Can’t, Tom, I’m on Hogwarts business,” said Hagrid, clapping his great hand on Harry’s shoulder and making Harry’s knees buckle.
    “Good Lord,” said the bartender, peering at Harry, “is this — can this be — ?”
    The Leaky Cauldron had suddenly gone completely still and silent.
    “Bless my soul,” whispered the old bartender, “Harry Potter .
    .
    .
    what an honor.”
    He hurried out from behind the bar, rushed toward Harry and seized his hand, tears in his eyes.
    “Welcome back, Mr. Potter, welcome back.”
    
  • Installation instructions broken, binary links dead

    When trying to download the binaries, I'm getting something like this for all of them.

    <Error> <Code>AllAccessDisabled</Code> <Message>All access to this object has been disabled</Message><RequestId>7268EB2B3DC8532F</RequestId> <HostId>i2U6tSOMH/7Kyq29rzKr/A7HubUHQRQI/01b8nsYxBshadyeuc1jwRBDtHjaGA26ivrIH9tTEHU=</HostId> </Error>

    Also, the commands in the README are incorrect: sentences/cmd/sentences no longer exists, as cmd was renamed to _cmd.

  • The demo doesn't seem to work with these two paragraphs.

    An excerpt from Adventures with mmap

    This week I started at the Recurse Center, a self directed program where everyone is working at becoming a better programmer. If you’ve been considering it, you should definitely do it! It’s even more awesome than you’ve heard! The first project I’m working on is a distributed in-memory datastore. But it’s primarily an excuse to play around with stuff I’ve been reading about and haven’t gotten around to! This is the story of my adventure with mmap.

  • Sentences get cut off if semi-colon used

    The following sentence gets cut off after the semi-colon.

    I am here; you are over there.
    

    I tested that sentence at sentences.erock.io.

    All I see is

    I am here
    
  • precompile regular expressions for a ~4x speedup

    A hack to precompile regular expressions at application startup time. For unit tests:

    before:

    ok      github.com/neurosnap/sentences  0.163s
    

    after:

    ok      github.com/neurosnap/sentences  0.040s
    

    about a 4x speedup.

    Then, in a more real-world test, I integrated this into a program that processes text in 150k documents of approximately 4 KB apiece. I can't share the corpus unfortunately because it's not public. But before:

    real    1m35.864s
    user    16m13.703s
    sys 0m8.947s
    

    after:

    real    0m15.041s
    user    2m0.930s
    sys 0m1.777s
    

    About 8x less CPU used!

  • double-newlines should always start new sentence?

    I noticed this in the context of cited quotations like

    I think there's a bug here.  — me
    
    And then another paragraph.
    

    I think that should be 3 "sentences". The double-newline might be a reliable clue: continuing a sentence from one paragraph to the next is at least uncommon if not disallowed, right? (depending whether you want to keep them together if one paragraph ends with an ellipsis and the next starts with an ellipsis, perhaps) Another way would be to recognize this cited-quotation form, but I guess that could be risky.

    diff --git a/sentences_test.go b/sentences_test.go
    index e506188..d178f09 100644
    --- a/sentences_test.go
    +++ b/sentences_test.go
    @@ -174,6 +174,19 @@ func TestSpacedPeriod(t *testing.T) {
            compareSentence(t, actualText, expected)
     }
     
    +func TestQuotationSourceAndDoubleNewlines(t *testing.T) {
    +       t.Log("Tokenizer should treat double-newline as end of sentence regardless of ending punctuation")
    +
    +       actualText := "'A witty saying proves nothing.' — Voltaire\n\nAnd yet it commands attention."
    +       expected := []string{
    +               "'A witty saying proves nothing.'",
    +               " — Voltaire",
    +               "And yet it commands attention.",
    +       }
    +
    +       compareSentence(t, actualText, expected)
    +}
    +
    

    I was poking around; I see you have token.ParaStart being set sometimes when a double-newline is detected, but treating ParaStart the same as SentBreak in Tokenize() didn't fix it.

  • How to have all supported languages available at runtime?

    I'm trying to use this library in a multilingual environment. I have a function that receives the raw text and a language name as parameters, then loads the right language package and returns sentences.

    As only English is loaded by default, my test fails for all other languages, which I expected. But then I tried to run "make spanish" in the project folder and hit two different errors:

    • First, a permission error, since data/spanish.json is read-only (installed with go get ...)
    • Then I ran with sudo, which worked fine. But my test fails with this error:

    gopkg.in/neurosnap/sentences.v1/data

    /Users/***/go/pkg/mod/gopkg.in/neurosnap/[email protected]/data/spanish.go:18:6: bindataRead redeclared in this block

    Could you give me some indication on how to compile all supported language packages so they are available to choose at runtime?

    Thanks!

  • More sentence examples

    The Ruby lib pragmatic_segmenter has a list of 50+ sentence-splitting examples that this lib fails to parse. You can use their list to test this lib.

    For example:

    He left the bank at 6 P.M. Mr. Smith then went to the store.
    

    Which neurosnap/sentences assumes is one sentence.
