A native Go clean room implementation of the Porter Stemming algorithm.

Go Porter Stemmer

A native Go clean room implementation of the Porter Stemming Algorithm.

This algorithm is of interest to people doing Machine Learning or Natural Language Processing (NLP).

This is NOT a port. This is a native Go implementation from the human-readable description of the algorithm.

I've tried to make it (more) efficient by NOT internally using string's, but instead internally using []rune's and using the same (array) buffer used by the []rune slice (and sub-slices) at all steps of the algorithm.

For Porter Stemmer algorithm, see:

http://tartarus.org/martin/PorterStemmer/def.txt (URL #1)

http://tartarus.org/martin/PorterStemmer/ (URL #2)

Departures

Also, since when I initially implemented it, it failed the tests at...

http://tartarus.org/martin/PorterStemmer/voc.txt (URL #3)

http://tartarus.org/martin/PorterStemmer/output.txt (URL #4)

... after reading the human-readble text over and over again to try to figure out what the error I made was (and doing all sorts of things to debug it) I came to the conclusion that the some of these tests were wrong according to the human-readable description of the algorithm.

This led me to wonder if maybe other people's code that was passing these tests had rules that were not in the human-readable description. Which led me to look at the source code here...

http://tartarus.org/martin/PorterStemmer/c.txt (URL #5)

... When I looked there I noticed that there are some items marked as a "DEPARTURE", which differ from the original algorithm. (There are 2 of these.)

I implemented these departures, and the tests at URL #3 and URL #4 all passed.

Usage

To use this Golang library, use with something like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {
  
  word := "Waxes"
  
  stem := porterstemmer.StemString(word)
  
  fmt.Printf("The word [%s] has the stem [%s].\n", word, stem)
}

Alternatively, if you want to be a bit more efficient, use []rune slices instead, with code like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {
  
  word := []rune("Waxes")
  
  stem := porterstemmer.Stem(word)
  
  fmt.Printf("The word [%s] has the stem [%s].\n", string(word), string(stem))
}

Although NOTE that the above code may modify original slice (named "word" in the example) as a side effect, for efficiency reasons. And that the slice named "stem" in the example above may be a sub-slice of the slice named "word".

Also alternatively, if you already know that your word is already lowercase (and you don't need this library to lowercase your word for you) you can instead use code like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {
  
  word := []rune("waxes")
  
  stem := porterstemmer.StemWithoutLowerCasing(word)
  
  fmt.Printf("The word [%s] has the stem [%s].\n", string(word), string(stem))
}

Again NOTE (like with the previous example) that the above code may modify original slice (named "word" in the example) as a side effect, for efficiency reasons. And that the slice named "stem" in the example above may be a sub-slice of the slice named "word".

Comments
  • Renamed test file to add _test

    Renamed test file to add _test

    Without this, any library that imports pkg porterstemmer will also import testing, meaning that any application that imports pkg porterstemmer will have all the testing flags, which certainly appears to not be desired behaviour.

  • Fix panics

    Fix panics

    This pull-request is related to issue #4

    Added a fuzz unit test to identify which inputs (less than 6 chars long) result in a panic. Moved check for measure = 1 before creating slice to avoid panic in rule 5a.

  • removed unused import, fixes test compilation

    removed unused import, fixes test compilation

    prior to this I get:

    $ go test -v
    # github.com/reiver/go-porterstemmer
    ./porterstemmer_fixes_test.go:6: imported and not used: "fmt"
    FAIL    github.com/reiver/go-porterstemmer [build failed]
    
  • StemString(

    StemString("ion") causes runtime exception

    Test Program

    package main
    
    import "fmt"
    import "github.com/reiver/go-porterstemmer"
    
    func main() {
        word := porterstemmer.StemString("ion")
        fmt.Println(word)
    }
    

    Output

    panic: runtime error: index out of range
    
    goroutine 1 [running]:
    github.com/reiver/go-porterstemmer.step4(0xc20003f2c0, 0x3, 0x3, 0xc20003f2c0, 0x3, ...)
        /home/oliver/go/src/github.com/reiver/go-porterstemmer/porterstemmer.go:691 +0x1550
    github.com/reiver/go-porterstemmer.StemWithoutLowerCasing(0xc20003f2c0, 0x3, 0x3, 0x4e0143, 0x4e0142, ...)
        /home/oliver/go/src/github.com/reiver/go-porterstemmer/porterstemmer.go:895 +0xfd
    github.com/reiver/go-porterstemmer.Stem(0xc20003f2c0, 0x3, 0x3, 0x3, 0x3, ...)
        /home/oliver/go/src/github.com/reiver/go-porterstemmer/porterstemmer.go:868 +0xc2
    github.com/reiver/go-porterstemmer.StemString(0x4e0140, 0x3, 0x45ebb5, 0x4e8070)
        /home/oliver/go/src/github.com/reiver/go-porterstemmer/porterstemmer.go:839 +0x51
    main.main()
        /home/oliver/go/src/thedarkesttimeline/test.go:7 +0x32
    
    goroutine 2 [runnable]:
    exit status 2
    
  • Bug crashes on certain inputs

    Bug crashes on certain inputs

    Ex: "eeb"

    Though .Stem checks for words shorter than 3 letters before steps 1-4 execute. By the time it reaches step5a eeb becomes e and there is an index out of bounds error.

Fast (linear time) implementation of the Gaussian Blur algorithm in Go.
Fast (linear time) implementation of the Gaussian Blur algorithm in Go.

Song2 Fast (linear time) implementation of the Gaussian Blur algorithm in Go.

Oct 25, 2022
k-means clustering algorithm implementation written in Go
k-means clustering algorithm implementation written in Go

kmeans k-means clustering algorithm implementation written in Go What It Does k-means clustering partitions a multi-dimensional data set into k cluste

Dec 6, 2022
Menggunakan clean architecture dengan domain driven design

Golang project Menggunakan clean architecture dengan domain driven design Domain - Entity Sebuah model untuk entitas Domain - Model Hampir sama dengan

Jan 4, 2022
Clean Architecture With Golang

Clean Architecture With Golang When init a new project go mod init github.com/samuelterra22/clean-architecture-go Run testes go test ./... Generate a

Aug 2, 2022
Genetic Algorithm and Particle Swarm Optimization

evoli Genetic Algorithm and Particle Swarm Optimization written in Go Example Problem Given f(x,y) = cos(x^2 * y^2) * 1/(x^2 * y^2 + 1) Find (x,y) suc

Dec 22, 2022
Golang Genetic Algorithm
Golang Genetic Algorithm

goga Golang implementation of a genetic algorithm. See ./examples for info on how to use the library. Overview Goga is a genetic algorithm solution wr

Dec 19, 2022
An iterative algorithm to generate high-quality triangulated images.
An iterative algorithm to generate high-quality triangulated images.

An iterative algorithm to generate high quality triangulated images. Introduction Triangula uses a modified genetic algorithm to triangulate images. I

Dec 29, 2022
a* pathfinding algorithm written in go

astar a* (a-star) pathfinding algorithm written in go Wikipedia: EN: A* search algorithm DE: A*-Algorithmus Install go get github.com/jpierer/astar@ma

Mar 21, 2022
A Kubernetes Native Batch System (Project under CNCF)
A Kubernetes Native Batch System (Project under CNCF)

Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms that are commonly required by many classes of batch & elastic workloa

Jan 9, 2023
Katib is a Kubernetes-native project for automated machine learning (AutoML).
Katib is a Kubernetes-native project for automated machine learning (AutoML).

Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports Hyperparameter Tuning, Early Stopping and Neural Architec

Jan 2, 2023
k-modes and k-prototypes clustering algorithms implementation in Go

go-cluster GO implementation of clustering algorithms: k-modes and k-prototypes. K-modes algorithm is very similar to well-known clustering algorithm

Nov 29, 2022
An implementation of Neural Turing Machines
An implementation of Neural Turing Machines

Neural Turing Machines Package ntm implements the Neural Turing Machine architecture as described in A.Graves, G. Wayne, and I. Danihelka. arXiv prepr

Sep 13, 2022
Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch
Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch

EGNN - Pytorch Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch. May be eventually used for Alphafold2 replication.

Dec 23, 2022
A high performance go implementation of Wappalyzer Technology Detection Library

wappalyzergo A high performance port of the Wappalyzer Technology Detection Library to Go. Inspired by https://github.com/rverton/webanalyze. Features

Jan 8, 2023
Go implementation of the yolo v3 object detection system
Go implementation of the yolo v3 object detection system

Go YOLO V3 This repository provides a plug and play implementation of the Yolo V3 object detection system in Go, leveraging gocv. Prerequisites Since

Dec 14, 2022
Golang k-d tree implementation with duplicate coordinate support

Golang k-d tree implementation with duplicate coordinate support

Nov 9, 2022
A discord bot that mumbles the audio from another room into your room (WIP)

DiscordGo Voice Receive Example This example experiments with receiving voice data from Discord. It joins a specified voice channel, listens for 10 se

Dec 11, 2021
Golang implementation of the Paice/Husk Stemming Algorithm

##Golang Implementation of the Paice/Husk stemming algorithm This project was created for the QUT course INB344. Details on the algorithm can be found

Sep 27, 2022
Golang implementation of the Paice/Husk Stemming Algorithm

##Golang Implementation of the Paice/Husk stemming algorithm This project was created for the QUT course INB344. Details on the algorithm can be found

Sep 27, 2022
Eunomia is a distributed application framework that support Gossip protocol, QuorumNWR algorithm, PBFT algorithm, PoW algorithm, and ZAB protocol and so on.

Introduction Eunomia is a distributed application framework that facilitates developers to quickly develop distributed applications and supports distr

Sep 28, 2021