Go implementation of the Snowball stemmers

Snowball

A Go (golang) implementation of the Snowball stemmer for natural language processing.

Status
Latest release: v0.7.0 (2020-11-30)
Latest Go versions tested: go1.14, go1.15 on macOS, Windows & Ubuntu Linux
Languages available: English, Spanish (español), French (le français), Russian (ру́сский язы́к), Swedish (svenska), Norwegian (norsk)
License: MIT

Usage

Here is a minimal Go program that uses this package to stem a single word.

package main

import (
	"fmt"
	"log"

	"github.com/kljensen/snowball"
)

func main() {
	// Stem a single English word and report any error instead of ignoring it.
	stemmed, err := snowball.Stem("Accumulations", "english", true)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(stemmed) // Prints "accumul"
}

Organization & Implementation

The code is organized as follows:

  • The top-level snowball package has a single exported function snowball.Stem, which is defined in snowball/snowball.go.
  • The stemmer for each language is defined in a "sub-package", e.g. snowball/spanish.
  • Each language exports a Stem function: e.g. spanish.Stem, which is defined in snowball/spanish/stem.go.
  • Code that is common to multiple languages may go in a separate package, e.g. the small romance package.

Some notes about the implementation:

  • In order to ensure the code is easily extended to non-English languages, I avoided using bytes and byte arrays, and instead perform all operations on runes. See snowball/snowballword/snowballword.go and the SnowballWord struct.
  • In order to avoid casting strings into slices of runes numerous times, this implementation uses a single slice of runes stored in the SnowballWord struct for each word that needs to be stemmed.
  • In spite of the foregoing, readability requires that some strings be kept around and repeatedly cast into slices of runes. For example, in the Spanish stemmer, one step requires removing suffixes with acute accents such as "ución", "logía", and "logías". If I were to hard-code those suffixes as slices of runes, the code would be substantially less readable.
  • Instead of carrying around the word regions R1, R2, & RV as separate strings (or slices of runes, or whatever), we carry around the index where each of these regions begins. These are stored as R1start, R2start, & RVstart on the SnowballWord struct. I believe this is a relatively efficient way of storing R1 and R2.
  • The code does not use any maps or regular expressions 1) for kicks, and 2) because I thought they'd negatively impact the performance. (But, mostly for #1; I realize #2 is silly.)
  • I end up refactoring the snowballword package a bit every time I implement a new language.
  • Clearly, the Go implementation of these stemmers is verbose relative to the Snowball language. However, it is much better than the Java version and others.

Testing

To run the tests, run go test ./... in the top-level directory.

Future work

I'd like to implement the Snowball stemmer in more languages. If you can help, I would greatly appreciate it: please fork the project and send a pull request!

(Also, if you are interested in creating a larger NLP project for Go, please get in touch.)

Related work

I know of a few other stemmers available in Go:

Contributors

License (MIT)

Copyright (c) 2013-2020 the Contributors (see above)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Owner
Kyle L. Jensen
Comments
  • Specs of the Swedish stemmer are badly formed

    Hello!

    I implemented the Swedish stemmer according to the "specs", and it seems to work fine according to the tests on that very web page. The issue I'm having is that those specs perform badly in real situations.

    For example

    gevär -> gevär
    (gun) -> (gun)
    geväret   -> geväret
    (the gun) -> (the gun)
    

    If you were to ask me what the word stem of the determined form of the word "geväret" is, I would answer the undetermined form "gevär". This is not a one-off type of deal, but the rules do not seem to handle determined forms of nouns in general, which I would consider to be one of the first things to implement in a stemmer in Swedish.

    I could extend the rules and break from the "specs" or keep to the specs, and accept that the results are not satisfactory.

    Which way do you think I should go?

  • Happy -> Happi?

    Hi,

    I noticed that the library converts the word "happy" to "happi". Is it intentional?

    I've tried this text:

    Are Hunter-Gatherers The Happiest Humans To Inhabit Earth? There's an idea percolating up from the anthropology world that may make you rethink what makes you happy.
    

    Got this result:

    hunter-gatherers happiest humans inhabit earth there's idea percolating anthropology world make rethink makes happi
    
  • Norwegian stemming

  • Moved test code into separate package

    This will get rid of test command line flags when someone uses flags package

      -test.bench string
            regular expression to select benchmarks to run
      -test.benchmem
            print memory allocations for benchmarks
      -test.benchtime duration
    
  • Swedish Stemmer

    Hello there! I have added a stemmer for Swedish in accordance with the Snowball manual, for Swedish.

    I've added the vocab of ~30k words from snowball.tartarus.org, and run tests for them in the swedish_vocab. It passes, and thus I presume it works.

    I've tried to write the package in the same style that you have, with some slight differences. For example I have used the literal runes 'a', 'b', 'c' instead of the unicode number for them.

    I would love for this to get merged, so please let me know what I need to add or change in order for you to accept the PR.

    Best regards Anton Södergren

  • Support for additional languages

    Hello,

    Are the authors of this package interested in supporting additional languages? I am interested in German, Mandarin and Czech myself. (I realize that Mandarin doesn't have inflection in the IE sense, but it does have the equivalent of stop words and particles that attach themselves as pseudo-affixes.)

    The structure of this library looks easily extensible - would you accept PRs that add support for the languages listed above?

    Better yet, do you maybe already have plans to support some of them?

    Best, Adam

  • Badly named file

    Hi

    You named romance/common_testing.go instead of romance/common_test.go

    This file is not seen as a test file, so it is compiled with the whole package.

    So, when I use your package together with the flag package and pass the "-h" flag, I see all of the test.XXX flags.

    I forked your package to rename the file and it works.

    BTW: Thanks for your work, it works great

  • Rename of file in #8 causes tests not to run

    Exceptions look like the following

    # github.com/kljensen/snowball/russian
    russian/russian_test.go:12: undefined: romance.WordBoolTestCase
    russian/russian_test.go:18: undefined: romance.RunRunewiseBoolTest
    russian/russian_test.go:25: undefined: romance.WordBoolTestCase
    russian/russian_test.go:32: undefined: romance.RunWordBoolTest
    russian/russian_test.go:36: undefined: romance.FindRegionsTestCase
    russian/russian_test.go:139: undefined: romance.RunFindRegionsTest
    russian/russian_test.go:145: undefined: romance.StepTestCase
    russian/russian_test.go:298: undefined: romance.RunStepTest
    russian/russian_test.go:305: undefined: romance.StepTestCase
    russian/russian_test.go:408: undefined: romance.RunStepTest
    