Gota: DataFrames and data wrangling in Go (Golang)

Gota: DataFrames, Series and Data Wrangling for Go

This is an implementation of DataFrames, Series and data wrangling methods for the Go programming language. The API is still in flux so use at your own risk.

DataFrame

The term DataFrame typically refers to a tabular dataset that can be viewed as a two-dimensional table. Often the columns of this dataset refer to a list of features, while the rows represent a number of measurements. As real-world data is not perfect, DataFrame supports missing measurements, represented as NaN elements.

Common examples of DataFrames can be found in Excel sheets, CSV files or SQL database tables, but this data can come in a variety of other formats, like a collection of JSON objects or XML files.

The utility of DataFrames lies in the ability to subset them, merge them, summarize the data for individual features or apply functions to entire rows or columns, all while keeping column type integrity.

Usage

Loading data

DataFrames can be constructed by passing Series to the dataframe.New constructor function:

df := dataframe.New(
	series.New([]string{"b", "a"}, series.String, "COL.1"),
	series.New([]int{1, 2}, series.Int, "COL.2"),
	series.New([]float64{3.0, 4.0}, series.Float, "COL.3"),
)

You can also load the data directly from other formats. The base loading function takes some records in the form [][]string and returns a new DataFrame from there:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)

You can also create DataFrames by loading a slice of arbitrary structs:

type User struct {
	Name     string
	Age      int
	Accuracy float64
	ignored  bool // ignored since unexported
}
users := []User{
	{"Aram", 17, 0.2, true},
	{"Juan", 18, 0.8, true},
	{"Ana", 22, 0.5, true},
}
df := dataframe.LoadStructs(users)

By default, column types are auto-detected, but this can be configured. For example, to make Float the default type while forcing columns A and D to be String and Bool respectively:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
    dataframe.DetectTypes(false),
    dataframe.DefaultType(series.Float),
    dataframe.WithTypes(map[string]series.Type{
        "A": series.String,
        "D": series.Bool,
    }),
)

Similarly, you can load the data stored in a []map[string]interface{}:

df := dataframe.LoadMaps(
    []map[string]interface{}{
        map[string]interface{}{
            "A": "a",
            "B": 1,
            "C": true,
            "D": 0,
        },
        map[string]interface{}{
            "A": "b",
            "B": 2,
            "C": true,
            "D": 0.5,
        },
    },
)

You can also pass an io.Reader to the ReadCSV and ReadJSON functions, and they will work as expected provided that the data is well formed:

csvStr := `
Country,Date,Age,Amount,Id
"United States",2012-02-01,50,112.1,01234
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,17,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,NA,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United States",2012-02-01,32,321.31,54320
Spain,2012-02-01,66,555.42,00241
`
df := dataframe.ReadCSV(strings.NewReader(csvStr))

jsonStr := `[{"COL.2":1,"COL.3":3},{"COL.1":5,"COL.2":2,"COL.3":2},{"COL.1":6,"COL.2":3,"COL.3":1}]`
df := dataframe.ReadJSON(strings.NewReader(jsonStr))

Subsetting

We can subset the rows of a DataFrame with the Subset method. For example, if we want the first and third rows, we can do the following:

sub := df.Subset([]int{0, 2})

Column selection

If instead of subsetting rows we want to select specific columns, we can use the Select method with either column indexes or column names:

sel1 := df.Select([]int{0, 2})
sel2 := df.Select([]string{"A", "C"})

Updating values

In order to update the values of a DataFrame we can use the Set method:

df2 := df.Set(
    []int{0, 2},
    dataframe.LoadRecords(
        [][]string{
            []string{"A", "B", "C", "D"},
            []string{"b", "4", "6.0", "true"},
            []string{"c", "3", "6.0", "false"},
        },
    ),
)

Filtering

For more complex row subsetting we can use the Filter method. For example, if we want the rows where the column "A" is equal to "a" or column "B" is greater than 4:

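// Keep rows where A == "a" OR B > 4.
fil := df.Filter(
    dataframe.F{Colname: "A", Comparator: series.Eq, Comparando: "a"},
    dataframe.F{Colname: "B", Comparator: series.Greater, Comparando: 4},
)
// Chain a second Filter to additionally require D == true (AND).
fil2 := fil.Filter(
    dataframe.F{Colname: "D", Comparator: series.Eq, Comparando: true},
)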

Filters passed to a single Filter call are combined with OR, whereas chained Filter calls behave as AND.
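
To make the OR/AND distinction concrete, here is a minimal sketch in plain Go. It does not use Gota at all: the row type, the pred type and the filterAny helper are hypothetical names introduced only for illustration. Several predicates in one call keep a row if any of them matches, while a second, chained call narrows the previous result.

```go
package main

import "fmt"

// row is a hypothetical stand-in for one DataFrame row with columns A, B, D.
type row struct {
	A string
	B int
	D bool
}

// pred is a hypothetical row predicate, playing the role of one filter.
type pred func(row) bool

// filterAny keeps the rows for which at least one predicate holds,
// mirroring the OR semantics of several filters in one Filter call.
func filterAny(rows []row, preds ...pred) []row {
	var out []row
	for _, r := range rows {
		for _, p := range preds {
			if p(r) {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

func main() {
	rows := []row{
		{"a", 4, true},
		{"k", 5, true},
		{"k", 4, true},
		{"a", 2, false},
	}
	// A == "a" OR B > 4
	fil := filterAny(rows,
		func(r row) bool { return r.A == "a" },
		func(r row) bool { return r.B > 4 },
	)
	// Chaining narrows the result: (A == "a" OR B > 4) AND D == true
	fil2 := filterAny(fil, func(r row) bool { return r.D })
	fmt.Println(len(fil), len(fil2)) // prints "3 2"
}
```

The third row matches neither predicate and is dropped by the first call; the fourth row passes the OR stage but is removed by the chained call.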

Arrange

With Arrange a DataFrame can be sorted by the given column names:

sorted := df.Arrange(
    dataframe.Sort("A"),    // Sort in ascending order
    dataframe.RevSort("B"), // Sort in descending order
)
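
The ordering that such a two-key sort produces can be sketched in plain Go with the standard sort package. This is not Gota code: the row type and the arrange helper are hypothetical, and the sketch assumes that the first key takes precedence and the second key only breaks ties.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical stand-in for a DataFrame row with columns A and B.
type row struct {
	A string
	B int
}

// arrange returns a sorted copy: ascending by A, ties broken by descending B.
func arrange(rows []row) []row {
	out := append([]row(nil), rows...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].A != out[j].A {
			return out[i].A < out[j].A
		}
		return out[i].B > out[j].B
	})
	return out
}

func main() {
	rows := []row{{"a", 4}, {"k", 5}, {"k", 4}, {"a", 2}}
	fmt.Println(arrange(rows)) // prints "[{a 4} {a 2} {k 5} {k 4}]"
}
```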

Mutate

If we want to replace an existing column, or append a new one at the end, based on a given Series, we can use the Mutate method:

// Replace column C with a new one
mut := df.Mutate(
    series.New([]string{"a", "b", "c", "d"}, series.String, "C"),
)
// Add a new column E
mut2 := df.Mutate(
    series.New([]string{"a", "b", "c", "d"}, series.String, "E"),
)

Joins

Several join operations are supported (InnerJoin, LeftJoin, RightJoin, CrossJoin). To use these methods you have to specify which keys are used to join the DataFrames:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)
df2 := dataframe.LoadRecords(
    [][]string{
        []string{"A", "F", "D"},
        []string{"1", "1", "true"},
        []string{"4", "2", "false"},
        []string{"2", "8", "false"},
        []string{"5", "9", "false"},
    },
)
join := df.InnerJoin(df2, "D")
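
For readers unfamiliar with inner-join semantics, the following plain-Go sketch pairs every left row with every right row that shares the same key value. It is not Gota's implementation: the record type and innerJoin helper are hypothetical, the data is a reduced version of the example above (only columns B, F and the key D), and it makes no claim about how Gota orders or renames columns when both sides share a non-key column name.

```go
package main

import "fmt"

// record is a hypothetical row: a map from column name to string value.
type record map[string]string

// innerJoin pairs every left row with every right row that has the same
// value in the key column and merges their columns. If both sides share a
// non-key column name, the right value overwrites the left one in this sketch.
func innerJoin(left, right []record, key string) []record {
	var out []record
	for _, l := range left {
		for _, r := range right {
			if l[key] != r[key] {
				continue
			}
			merged := record{}
			for k, v := range l {
				merged[k] = v
			}
			for k, v := range r {
				merged[k] = v
			}
			out = append(out, merged)
		}
	}
	return out
}

func main() {
	left := []record{
		{"B": "4", "D": "true"},
		{"B": "5", "D": "true"},
		{"B": "4", "D": "true"},
		{"B": "2", "D": "false"},
	}
	right := []record{
		{"F": "1", "D": "true"},
		{"F": "2", "D": "false"},
		{"F": "8", "D": "false"},
		{"F": "9", "D": "false"},
	}
	joined := innerJoin(left, right, "D")
	// Three left rows match one right row, one left row matches three.
	fmt.Println(len(joined)) // prints "6"
}
```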

Function application

Functions can be applied to the rows or columns of a DataFrame, casting the types as necessary:

mean := func(s series.Series) series.Series {
    floats := s.Float()
    sum := 0.0
    for _, f := range floats {
        sum += f
    }
    return series.Floats(sum / float64(len(floats)))
}
df.Capply(mean)
df.Rapply(mean)
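
The difference between applying a function per column and per row can be sketched in plain Go on a small numeric table. This is an illustration only, assuming that Capply works column by column and Rapply row by row; the applyCols and applyRows helpers are hypothetical and are not part of Gota.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// applyRows calls f once per row of a rectangular table.
func applyRows(table [][]float64, f func([]float64) float64) []float64 {
	out := make([]float64, len(table))
	for i, row := range table {
		out[i] = f(row)
	}
	return out
}

// applyCols calls f once per column of a rectangular table.
func applyCols(table [][]float64, f func([]float64) float64) []float64 {
	if len(table) == 0 {
		return nil
	}
	out := make([]float64, len(table[0]))
	for j := range table[0] {
		col := make([]float64, len(table))
		for i := range table {
			col[i] = table[i][j]
		}
		out[j] = f(col)
	}
	return out
}

func main() {
	table := [][]float64{
		{4, 5.5},
		{5, 7.0},
		{4, 6.0},
		{2, 7.5},
	}
	fmt.Println(applyCols(table, mean)) // prints "[3.75 6.5]"
	fmt.Println(applyRows(table, mean)) // prints "[4.75 6 5 4.75]"
}
```

The column-wise result has one value per column, while the row-wise result has one value per row.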

Chaining operations

DataFrames support a number of methods for wrangling data: filtering, subsetting, selecting columns, adding new columns or modifying existing ones. All these methods can be chained one after another, and at the end of the chain you can check whether any error occurred by inspecting the DataFrame's Err field. If any method in the chain returns an error, the remaining operations in the chain become no-ops.

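a = a.Rename("Origin", "Country").
    Filter(dataframe.F{Colname: "Age", Comparator: "<", Comparando: 50}).
    Filter(dataframe.F{Colname: "Origin", Comparator: "==", Comparando: "United States"}).
    Select([]string{"Id", "Origin", "Date"}). // Select takes a single argument
    Subset([]int{1, 3})
if a.Err != nil {
    log.Fatalf("Oh noes! %v", a.Err)
}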
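
The error-carrying chain described above is a general pattern that can be sketched without Gota. In the following plain-Go illustration the table type and its rename and selectCols methods are hypothetical; the point is only that each method returns its receiver untouched once Err is set, so later steps are skipped and the first error survives to the end of the chain.

```go
package main

import (
	"errors"
	"fmt"
)

// table is a hypothetical stand-in for a DataFrame: column names plus an
// Err field that records the first failure in a chain.
type table struct {
	cols []string
	Err  error
}

// rename changes one column name. If the receiver already carries an error,
// it is returned unchanged, so this step becomes a no-op.
func (t table) rename(newName, oldName string) table {
	if t.Err != nil {
		return t
	}
	out := make([]string, len(t.cols))
	found := false
	for i, c := range t.cols {
		if c == oldName {
			c = newName
			found = true
		}
		out[i] = c
	}
	if !found {
		return table{Err: errors.New("rename: unknown column " + oldName)}
	}
	return table{cols: out}
}

// selectCols keeps the named columns, with the same short-circuit on Err.
func (t table) selectCols(names ...string) table {
	if t.Err != nil {
		return t
	}
	have := map[string]bool{}
	for _, c := range t.cols {
		have[c] = true
	}
	for _, n := range names {
		if !have[n] {
			return table{Err: fmt.Errorf("select: unknown column %q", n)}
		}
	}
	return table{cols: names}
}

func main() {
	tbl := table{cols: []string{"Id", "Country", "Date"}}

	ok := tbl.rename("Origin", "Country").selectCols("Id", "Origin")
	fmt.Println(ok.cols, ok.Err)

	// The first step fails; the second is skipped and the error survives.
	bad := tbl.rename("Origin", "Nation").selectCols("Id", "Origin")
	fmt.Println(bad.cols, bad.Err)
}
```

Checking the error once at the end keeps the chain readable, at the cost of not knowing which step failed unless the error message says so.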

Print to console

fmt.Println(flights)

> [336776x20] DataFrame
> 
>     X0    year  month day   dep_time sched_dep_time dep_delay arr_time ...
>  0: 1     2013  1     1     517      515            2         830      ...
>  1: 2     2013  1     1     533      529            4         850      ...
>  2: 3     2013  1     1     542      540            2         923      ...
>  3: 4     2013  1     1     544      545            -1        1004     ...
>  4: 5     2013  1     1     554      600            -6        812      ...
>  5: 6     2013  1     1     554      558            -4        740      ...
>  6: 7     2013  1     1     555      600            -5        913      ...
>  7: 8     2013  1     1     557      600            -3        709      ...
>  8: 9     2013  1     1     557      600            -3        838      ...
>  9: 10    2013  1     1     558      600            -2        753      ...
>     ...   ...   ...   ...   ...      ...            ...       ...      ...
>     <int> <int> <int> <int> <int>    <int>          <int>     <int>    ...
> 
> Not Showing: sched_arr_time <int>, arr_delay <int>, carrier <string>, flight <int>,
> tailnum <string>, origin <string>, dest <string>, air_time <int>, distance <int>, hour <int>,
> minute <int>, time_hour <string>

Interfacing with gonum

A gonum mat.Matrix, or any object that implements the dataframe.Matrix interface, can be loaded as a DataFrame by using the LoadMatrix() function. Converting a DataFrame to a mat.Matrix requires a small wrapper type. Since a DataFrame already implements the Dims() (r, c int) method, only the At and T methods need to be implemented:

type matrix struct {
	dataframe.DataFrame
}

func (m matrix) At(i, j int) float64 {
	return m.Elem(i, j).Float()
}

func (m matrix) T() mat.Matrix {
	return mat.Transpose{m}
}

Series

Series are essentially vectors of elements of the same type with support for missing values. Series are the building blocks for DataFrame columns.

Four types are currently supported:

Int
Float
String
Bool

For more information about the API, make sure to check the package documentation for dataframe and series.

License

Copyright 2016 Alejandro Sanchez Brotons

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • panic: runtime error: invalid memory address or nil pointer dereference

    When I apply an aggregation to a DataFrame, it panics. The full stack trace is as follows:

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal 0xc0000005 code=0x0 addr=0x20 pc=0x38c064]
    
    goroutine 1 [running]:
    github.com/go-gota/gota/series.Series.Len(...)
            D:/ProgramFiles/goplus/pkg/mod/github.com/go-gota/[email protected]/series/series.go:560
    github.com/go-gota/gota/dataframe.Groups.Aggregation(0xc005d41b30, 0xc000270f00, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
            D:/ProgramFiles/goplus/pkg/mod/github.com/go-gota/[email protected]/dataframe/dataframe.go:504 +0x904
    

    My code looks like this:

    agg := gdf.Aggregation([]dataframe.AggregationType{dataframe.Aggregation_COUNT}, []string{"countn"})
    

    How can I solve it? Thank you.

  • Combining filters with AND and user-defined filters

    This PR refines filtering of DataFrames:

    (1) support for combining filters with AND without having to chain the filters, which should be more performant as rows only need to be traversed once (df.FilterAggregation)
    (2) support for user-defined filters (series.CompFunc and func(el series.Element) bool)
    (3) test cases for both new features
    (4) updated README reflecting the additions

    It should be backwards compatible as df.Filter retains OR semantics.

    Cheers Christoph

  • Add DataFrame.Describe for reporting summary statistics

    This PR adds some summary statistics helper functions to the Series struct type and the DataFrame.Describe function to obtain summary statistics for the given DataFrame. This is intended to replicate the behavior of pandas' DataFrame.describe() function in Python.

  • Filter with In on Quoted String returns False

    I'm attempting to Filter on a String list that is quoted. My Filter is as follows:

    df = df.Filter(dataframe.F{ Colname: "XXX", Comparator: series.In, Comparando: "LA", })

    Here are some sample rows from the column I am filtering on. Its Strings can look like: "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "CC,DC,FR,FW,KH,MG,WD,WB" "IS,KH,MG,WD" "CC,FC,FS,SC" "IS,KH,MG,WD" "FC,LA,LC,UQ" "CC,CF,CS,FC,FS,KH,LA,LC,MG,WD,WB" "CC,FC,FS,SC" "DC,FR" "DC,FR" UNK UNK "DC,FR" FW

    This should return 7 rows: "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "FC,LA,LC,UQ" "CC,CF,CS,FC,FS,KH,LA,LC,MG,WD,WB"

    But when I run this I get 0 rows back. I think this could be due to the quoted strings, which I can't control since they come from a CSV file. Or is there a way to pass a regex or a wildcard as my comparando?

  • How I can apply multiple values to a Dataframe Filter function.

    I have the following:

    df = df.Rename("Account", "AccountID").
        Filter(dataframe.F{"Event", "==", "bounced"}).
        Filter(dataframe.F{"Event", "==", "Sent"})

    I would like to write it like this:

    df = df.Rename("Account", "AccountID").
        Filter(dataframe.F{"Event", "==", ["bounced","Sent"]})

    Is that possible?

  • DataFrame ToMatrix function

    Could you, please, implement the dataframe toMatrix (mat.Matrix) function which is hinted in the readme?

    I am new to Golang and am trying to replicate a Python pipeline (as part of transitioning to Golang) which uses StandardScaler, but compilation fails because dataframe.DataFrame does not implement mat.Matrix, as shown in the errors below.

    Note that I am using golang "github.com/pa-m/sklearn/preprocessing" package

    # command-line-arguments
    ./go_csv.go:18:10: m.columns undefined (type matrix has no field or method columns)
    ./go_csv.go:21:21: undefined: mat64
    ./go_csv.go:22:9: undefined: mat64
    ./go_csv.go:123:12: cannot use selDf1 (type dataframe.DataFrame) as type mat.Matrix in argument to scaler.Fit: dataframe.DataFrame does not implement mat.Matrix (missing At method)
    ./go_csv.go:125:27: cannot use selDf1 (type dataframe.DataFrame) as type mat.Matrix in argument to scaler.Transform: dataframe.DataFrame does not implement mat.Matrix (missing At method)

  • Support Append data (new row) to a Series and/or DataFrame

    I can see that func (s *Series) Append(values interface{}) can be used to append data to the end of the Series.

    How can I insert a new row at an arbitrary position?

    It would be great to be able to do it for a DataFrame too. Not just for a Series.

  • Allowing type specification through a map rather than a variadic string argument would be more flexible

    Right now I have to either specify all types or no types at all. Specifying the types in a map[string]string (column name -> type name) would make it possible to specify types only for the columns you want and fall back to auto typing for the other columns.

    It could possibly also shorten the code in ReadRecords by simply checking whether the column name is in the map and, if not, falling back to findType.

    What do you think?

  • Problem installing on windows 10 x64 go 16

    I'm trying to install dataframes with go 16.6 on windows 10 x64 and got the following message. After that the go.mod file is not updated:

    go get: github.com/kniren/[email protected] updating to
            github.com/kniren/[email protected]: parsing go.mod:     
            module declares its path as: github.com/go-gota/gota
                    but was required as: github.com/kniren/gota
    
  • feature: groupby and Aggregation

    I am a fan of Gota, I hope Gota will become better

    Add feature:

    1. Groupby
    2. Aggregation

    groups := df.GroupBy("key1", "key2") // Group by column "key1", and column "key2"
    aggre := groups.Aggregation([]AggregationType{Aggregation_MAX, Aggregation_MIN}, []string{"values", "values2"}) // Maximum value in column "values",  Minimum value in column "values2"
    

  • New release tag

    v0.9.0 does not include fix of broken import.

    https://github.com/go-gota/gota/commit/7d8acfb8259f6135fb372d0966d9b6a23161c430#diff-80cec47501f12ea2f50aa0ff5c6bca95

    Could you please tag a newer release?

  • Feat: add PROD operation in series and aggregation

    It would be convenient to add a PROD operation to series and dataframe aggregation.

    Furthermore, aggregation could be made more flexible with user-defined aggregations. How about this?

    type aggragation interface {
        colname() string
        aggFunc(s series.Series) float64
    }
    
    func (gps Groups) Aggregation(aggs []aggragation) DataFrame
    
  • Project Status

    I am wondering what the status of this project is. I see issues haven't been updated, and there hasn't been a commit in over a year.

    Are there any further updates planned for this project?

    Likewise, I see a gota2 in the GitHub group.

  • Get Values of a Specific Column as Array (without iteration?)

    I have a dataframe that looks like this:

    [image]

    I would like to start adding some new technical indicator columns to it, and I'm looking for the easiest way to do it as I'm new to Go.

    I would like to make use of the EMA function from this repo which requires the data being passed into the function to be of []float64.

    Is there an easier way to convert a column of a dataframe.DataFrame to an array than creating an empty array and iterating over the dataframe (my dataframe is quite large), or is there an even better way to calculate the EMA column for the dataframe? I'm coming from Python, where I think I may have been spoiled, as this was super easy to do in Pandas; Go just doesn't seem to have the same tools for this.

  • Speed read csv

    Hi all. Am I right in thinking that gota is slower than Python pandas anyway? Right now I'm trying to load data and run calculations on it in the shortest time possible. For 48kk lines with a tab separator the timings are: pandas 32s, gota 1m36s.
