Gota: DataFrames and data wrangling in Go (Golang)

Last update: Jan 5, 2023

Comments: 15

Gota: DataFrames, Series and Data Wrangling for Go

This is an implementation of DataFrames, Series and data wrangling methods for the Go programming language. The API is still in flux, so use it at your own risk.

DataFrame

The term DataFrame typically refers to a tabular dataset that can be viewed as a two-dimensional table. Often the columns of this dataset refer to a list of features, while the rows represent a number of measurements. As real-world data is not perfect, DataFrame supports non-measurements, or NaN elements.

Common examples of DataFrames can be found in Excel sheets, CSV files or SQL database tables, but this data can come in a variety of other formats, like a collection of JSON objects or XML files.

The utility of DataFrames resides in the ability to subset them, merge them, summarize the data for individual features or apply functions to entire rows or columns, all while keeping column type integrity.

Usage

Loading data

DataFrames can be constructed by passing Series to the dataframe.New constructor function:

df := dataframe.New(
	series.New([]string{"b", "a"}, series.String, "COL.1"),
	series.New([]int{1, 2}, series.Int, "COL.2"),
	series.New([]float64{3.0, 4.0}, series.Float, "COL.3"),
)

You can also load the data directly from other formats. The base loading function takes some records in the form [][]string and returns a new DataFrame from there:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)

You can also create DataFrames by loading a slice of arbitrary structs:

type User struct {
	Name     string
	Age      int
	Accuracy float64
	ignored  bool // ignored since unexported
}
users := []User{
	{"Aram", 17, 0.2, true},
	{"Juan", 18, 0.8, true},
	{"Ana", 22, 0.5, true},
}
df := dataframe.LoadStructs(users)

By default, the column types are auto-detected, but this can be configured. For example, to make Float the default type while treating columns A and D as String and Bool respectively:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
    dataframe.DetectTypes(false),
    dataframe.DefaultType(series.Float),
    dataframe.WithTypes(map[string]series.Type{
        "A": series.String,
        "D": series.Bool,
    }),
)

Similarly, you can load the data stored in a []map[string]interface{}:

df := dataframe.LoadMaps(
    []map[string]interface{}{
        map[string]interface{}{
            "A": "a",
            "B": 1,
            "C": true,
            "D": 0,
        },
        map[string]interface{}{
            "A": "b",
            "B": 2,
            "C": true,
            "D": 0.5,
        },
    },
)

You can also pass an io.Reader to ReadCSV or ReadJSON, and it will work as expected provided the data is well formed:

csvStr := `
Country,Date,Age,Amount,Id
"United States",2012-02-01,50,112.1,01234
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,17,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,NA,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United States",2012-02-01,32,321.31,54320
Spain,2012-02-01,66,555.42,00241
`
df := dataframe.ReadCSV(strings.NewReader(csvStr))

jsonStr := `[{"COL.2":1,"COL.3":3},{"COL.1":5,"COL.2":2,"COL.3":2},{"COL.1":6,"COL.2":3,"COL.3":1}]`
df := dataframe.ReadJSON(strings.NewReader(jsonStr))

Subsetting

We can subset our DataFrames with the Subset method. For example, if we want the first and third rows, we can do the following:

sub := df.Subset([]int{0, 2})

Column selection

If instead of subsetting the rows we want to select specific columns, we can do so by index or by column name:

sel1 := df.Select([]int{0, 2})
sel2 := df.Select([]string{"A", "C"})

Updating values

In order to update the values of a DataFrame we can use the Set method:

df2 := df.Set(
    []int{0, 2},
    dataframe.LoadRecords(
        [][]string{
            []string{"A", "B", "C", "D"},
            []string{"b", "4", "6.0", "true"},
            []string{"c", "3", "6.0", "false"},
        },
    ),
)

Filtering

For more complex row subsetting we can use the Filter method. For example, if we want the rows where the column "A" is equal to "a" or column "B" is greater than 4:

fil := df.Filter(
    dataframe.F{"A", series.Eq, "a"},
    dataframe.F{"B", series.Greater, 4},
) 
fil2 := fil.Filter(
    dataframe.F{"D", series.Eq, true},
)

Filters passed to a single Filter call are combined with OR, whereas chained Filter calls behave as AND.
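
The OR and AND behaviour described above can be modelled in plain Go, without Gota. The sketch below is only an illustration of the semantics, using the same four example rows. The row and pred types and the filterAny and exampleRows helpers are hypothetical names and are not part of the Gota API.

```go
package main

import "fmt"

// row is a hypothetical stand-in for one record of the example table.
type row struct {
	A string
	B int
	C float64
	D bool
}

// pred is a hypothetical predicate type; it is not part of Gota.
type pred func(row) bool

// filterAny keeps a row when at least one predicate matches (OR),
// which models several filters passed to a single Filter call.
func filterAny(rows []row, preds ...pred) []row {
	var out []row
	for _, r := range rows {
		for _, p := range preds {
			if p(r) {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

// exampleRows returns the four data rows used throughout this README.
func exampleRows() []row {
	return []row{
		{"a", 4, 5.1, true},
		{"k", 5, 7.0, true},
		{"k", 4, 6.0, true},
		{"a", 2, 7.1, false},
	}
}

func main() {
	// A == "a" OR B > 4
	fil := filterAny(exampleRows(),
		func(r row) bool { return r.A == "a" },
		func(r row) bool { return r.B > 4 },
	)
	// A second, chained call narrows the first result further (AND).
	fil2 := filterAny(fil, func(r row) bool { return r.D })
	fmt.Println(len(fil), len(fil2)) // prints "3 2"
}
```

Because the second call only sees rows that already passed the first one, chaining can never add rows back, which is why it behaves as AND.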

Arrange

With Arrange a DataFrame can be sorted by the given column names:

sorted := df.Arrange(
    dataframe.Sort("A"),    // Sort in ascending order
    dataframe.RevSort("B"), // Sort in descending order
)

Mutate

If we want to modify a column, or add a new one at the end based on a given Series, we can use the Mutate method:

// Change column C with a new one
mut := df.Mutate(
    series.New([]string{"a", "b", "c", "d"}, series.String, "C"),
)
// Add a new column E
mut2 := df.Mutate(
    series.New([]string{"a", "b", "c", "d"}, series.String, "E"),
)

Joins

Several join operations are supported (InnerJoin, LeftJoin, RightJoin, CrossJoin). To use these methods you have to specify which keys are used to join the DataFrames:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)
df2 := dataframe.LoadRecords(
    [][]string{
        []string{"A", "F", "D"},
        []string{"1", "1", "true"},
        []string{"4", "2", "false"},
        []string{"2", "8", "false"},
        []string{"5", "9", "false"},
    },
)
join := df.InnerJoin(df2, "D")

Function application

Functions can be applied to the rows or columns of a DataFrame, casting the types as necessary:

mean := func(s series.Series) series.Series {
    floats := s.Float()
    sum := 0.0
    for _, f := range floats {
        sum += f
    }
    return series.Floats(sum / float64(len(floats)))
}
df.Capply(mean)
df.Rapply(mean)
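
To make the difference between column-wise and row-wise application concrete, here is a plain Go sketch that does not use Gota. It applies the same mean function once per column and once per row of a small numeric table. The colMeans and rowMeans helpers are hypothetical names chosen for this illustration; they only model what applying a function per column or per row means.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// colMeans applies mean to every column of a rectangular table.
func colMeans(t [][]float64) []float64 {
	if len(t) == 0 {
		return nil
	}
	out := make([]float64, len(t[0]))
	for j := range out {
		col := make([]float64, len(t))
		for i := range t {
			col[i] = t[i][j]
		}
		out[j] = mean(col)
	}
	return out
}

// rowMeans applies mean to every row of the table.
func rowMeans(t [][]float64) []float64 {
	out := make([]float64, len(t))
	for i, r := range t {
		out[i] = mean(r)
	}
	return out
}

func main() {
	// Columns B and C of the example table used in this README.
	t := [][]float64{{4, 5.1}, {5, 7.0}, {4, 6.0}, {2, 7.1}}
	fmt.Println(colMeans(t)) // one value per column
	fmt.Println(rowMeans(t)) // one value per row
}
```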

Chaining operations

DataFrames support a number of methods for wrangling data: filtering, subsetting, selecting columns, and adding new columns or modifying existing ones. All these methods can be chained one after another, and at the end of the procedure you can check whether any errors occurred by inspecting the DataFrame's Err field. If any method in the chain returns an error, the remaining operations in the chain become no-ops.

a = a.Rename("Origin", "Country").
    Filter(dataframe.F{"Age", "<", 50}).
    Filter(dataframe.F{"Origin", "==", "United States"}).
    Select([]string{"Id", "Origin", "Date"}).
    Subset([]int{1, 3})
if a.Err != nil {
    log.Fatal("Oh noes!")
}
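
The error-carrying chain can be illustrated with a small, self-contained Go sketch. It does not use Gota: the table type and its Rename and Select methods are hypothetical miniatures written for this example. Each method returns early when an earlier step has already recorded an error, so the error only has to be checked once at the end.

```go
package main

import "fmt"

// table is a hypothetical miniature of a DataFrame-like value: a list of
// column names plus an exported Err field holding the first failure.
type table struct {
	cols []string
	Err  error
}

// Rename gives column oldName the name newName.
// If an earlier step failed, it is a no-op.
func (t table) Rename(newName, oldName string) table {
	if t.Err != nil {
		return t
	}
	out := make([]string, len(t.cols))
	copy(out, t.cols)
	for i, c := range out {
		if c == oldName {
			out[i] = newName
			return table{cols: out}
		}
	}
	return table{Err: fmt.Errorf("rename: unknown column %q", oldName)}
}

// Select keeps only the named columns.
// If an earlier step failed, it is a no-op.
func (t table) Select(names ...string) table {
	if t.Err != nil {
		return t
	}
	have := make(map[string]bool, len(t.cols))
	for _, c := range t.cols {
		have[c] = true
	}
	for _, n := range names {
		if !have[n] {
			return table{Err: fmt.Errorf("select: unknown column %q", n)}
		}
	}
	return table{cols: names}
}

func main() {
	t := table{cols: []string{"Country", "Date", "Age"}}

	ok := t.Rename("Origin", "Country").Select("Origin", "Date")
	fmt.Println(ok.cols, ok.Err) // prints "[Origin Date] <nil>"

	// The failed Rename is remembered and Select does nothing.
	bad := t.Rename("Origin", "Nope").Select("Origin")
	fmt.Println(bad.Err) // prints: rename: unknown column "Nope"
}
```

The design trades early exits for readability: the caller writes one linear chain and a single error check instead of testing the result of every step.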

Print to console

fmt.Println(flights)

> [336776x20] DataFrame
> 
>     X0    year  month day   dep_time sched_dep_time dep_delay arr_time ...
>  0: 1     2013  1     1     517      515            2         830      ...
>  1: 2     2013  1     1     533      529            4         850      ...
>  2: 3     2013  1     1     542      540            2         923      ...
>  3: 4     2013  1     1     544      545            -1        1004     ...
>  4: 5     2013  1     1     554      600            -6        812      ...
>  5: 6     2013  1     1     554      558            -4        740      ...
>  6: 7     2013  1     1     555      600            -5        913      ...
>  7: 8     2013  1     1     557      600            -3        709      ...
>  8: 9     2013  1     1     557      600            -3        838      ...
>  9: 10    2013  1     1     558      600            -2        753      ...
>     ...   ...   ...   ...   ...      ...            ...       ...      ...
>     <int> <int> <int> <int> <int>    <int>          <int>     <int>    ...
> 
> Not Showing: sched_arr_time <int>, arr_delay <int>, carrier <string>, flight <int>,
> tailnum <string>, origin <string>, dest <string>, air_time <int>, distance <int>, hour <int>,
> minute <int>, time_hour <string>

Interfacing with gonum

A gonum/mat.Matrix, or any object that implements the dataframe.Matrix interface, can be loaded as a DataFrame by using LoadMatrix(). To convert a DataFrame to a mat.Matrix, you need to define a wrapper struct and the missing methods. Since a DataFrame already implements the Dims() (r, c int) method, only the At and T methods need to be implemented:

type matrix struct {
	dataframe.DataFrame
}

func (m matrix) At(i, j int) float64 {
	return m.Elem(i, j).Float()
}

func (m matrix) T() mat.Matrix {
	return mat.Transpose{m}
}

Series

Series are essentially vectors of elements of the same type with support for missing values. Series are the building blocks for DataFrame columns.

Four types are currently supported:

Int
Float
String
Bool

For more information about the API, make sure to check:

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner

https://github.com/go-gota/gota

Comments

panic: runtime error: invalid memory address or nil pointer dereference

When I apply an aggregation to a dataframe, a panic like this occurs:

panic: runtime error: invalid memory address or nil pointer dereference

The full error stack is as follows:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x20 pc=0x38c064]

goroutine 1 [running]:
github.com/go-gota/gota/series.Series.Len(...)
        D:/ProgramFiles/goplus/pkg/mod/github.com/go-gota/[email protected]/series/series.go:560
github.com/go-gota/gota/dataframe.Groups.Aggregation(0xc005d41b30, 0xc000270f00, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        D:/ProgramFiles/goplus/pkg/mod/github.com/go-gota/[email protected]/dataframe/dataframe.go:504 +0x904

And my code looks like this:

agg := gdf.Aggregation([]dataframe.AggregationType{dataframe.Aggregation_COUNT}, []string{"countn"})

How can I solve it? Thank you.

Combining filters with AND and user-defined filters

This PR refines the filtering of DataFrames:

(1) support for combining filters with AND without having to chain them, which should be more performant as rows only need to be traversed once (df.FilterAggregation)
(2) support for user-defined filters (series.CompFunc and func(el series.Element) bool)
(3) test cases for both new features
(4) an updated README reflecting the additions

It should be backwards compatible as df.Filter retains OR semantics.

Cheers, Christoph

Add DataFrame.Describe for reporting summary statistics

This PR adds some summary statistics helper functions to the Series struct type, and the DataFrame.Describe function to obtain summary statistics for a given DataFrame. This is intended to replicate the behavior of the pandas DataFrame.describe() function in Python.

Filter with In on Quoted String returns False

I'm attempting to Filter on a String list that is quoted. My Filter is as follows:

df = df.Filter(dataframe.F{
    Colname:    "XXX",
    Comparator: series.In,
    Comparando: "LA",
})

Here are some sample rows from the column I am filtering on. The strings can look like this:

"DC,FC,FS,FW,LA,LC,MG"
"DC,FC,FS,FW,LA,LC,MG"
"DC,FC,FS,FW,LA,LC,MG"
"DC,FC,FS,FW,LA,LC,MG"
"DC,FC,FS,FW,LA,LC,MG"
"CC,DC,FR,FW,KH,MG,WD,WB"
"IS,KH,MG,WD"
"CC,FC,FS,SC"
"IS,KH,MG,WD"
"FC,LA,LC,UQ"
"CC,CF,CS,FC,FS,KH,LA,LC,MG,WD,WB"
"CC,FC,FS,SC"
"DC,FR"
"DC,FR"
UNK
UNK
"DC,FR"
FW

This should return 7 rows:

"DC,FC,FS,FW,LA,LC,MG"
"DC,FC,FS,FW,LA,LC,MG"
"DC,FC,FS,FW,LA,LC,MG"
"DC,FC,FS,FW,LA,LC,MG"
"DC,FC,FS,FW,LA,LC,MG"
"FC,LA,LC,UQ"
"CC,CF,CS,FC,FS,KH,LA,LC,MG,WD,WB"

But when I run this I get 0 rows back. I think this could be due to the quoted strings, which I can't control since they come from a CSV file. Or is there a way to pass a regex or a wildcard as my comparando?
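
One possible explanation, which is an assumption here rather than something confirmed in this thread, is that series.In checks whether the whole cell equals one of the supplied values, so a cell holding a comma-separated list never equals "LA". The plain Go sketch below shows the membership test the question is really after. The hasCode and countMatches helpers are hypothetical names; whether such a predicate can be plugged into a filter depends on the installed Gota version supporting user-defined filters such as those described in the PR above.

```go
package main

import (
	"fmt"
	"strings"
)

// hasCode reports whether code is one of the comma-separated entries
// in cell. Surrounding double quotes and spaces are ignored.
func hasCode(cell, code string) bool {
	cell = strings.Trim(strings.TrimSpace(cell), `"`)
	for _, part := range strings.Split(cell, ",") {
		if strings.TrimSpace(part) == code {
			return true
		}
	}
	return false
}

// countMatches counts the cells that contain code as a whole entry.
func countMatches(cells []string, code string) int {
	n := 0
	for _, c := range cells {
		if hasCode(c, code) {
			n++
		}
	}
	return n
}

func main() {
	// A few of the sample values from the question.
	cells := []string{
		`"DC,FC,FS,FW,LA,LC,MG"`,
		`"IS,KH,MG,WD"`,
		`"FC,LA,LC,UQ"`,
		`UNK`,
		`FW`,
	}
	fmt.Println(countMatches(cells, "LA")) // prints 2
}
```

Splitting on commas rather than using a substring search avoids false positives, for example matching "LA" inside a longer code.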

How I can apply multiple values to a Dataframe Filter function.

I have the following:

df = df.Rename("Account", "AccountID").
    Filter(dataframe.F{"Event", "==", "bounced"}).
    Filter(dataframe.F{"Event", "==", "Sent"})

I want to write it like this:

df = df.Rename("Account", "AccountID").
    Filter(dataframe.F{"Event", "==", ["bounced","Sent"]})

Is that possible?

DataFrame ToMatrix function

Could you please implement the DataFrame ToMatrix (mat.Matrix) function that is hinted at in the README?

I am new to golang and am trying to replicate a Python pipeline (as part of transitioning to golang) that uses StandardScaler, but I get errors saying that dataframe.DataFrame does not implement mat.Matrix, as shown in the output below.

Note that I am using golang "github.com/pa-m/sklearn/preprocessing" package

# command-line-arguments
./go_csv.go:18:10: m.columns undefined (type matrix has no field or method columns)
./go_csv.go:21:21: undefined: mat64
./go_csv.go:22:9: undefined: mat64
./go_csv.go:123:12: cannot use selDf1 (type dataframe.DataFrame) as type mat.Matrix in argument to scaler.Fit: dataframe.DataFrame does not implement mat.Matrix (missing At method)
./go_csv.go:125:27: cannot use selDf1 (type dataframe.DataFrame) as type mat.Matrix in argument to scaler.Transform: dataframe.DataFrame does not implement mat.Matrix (missing At method)

Support Append data (new row) to a Series and/or DataFrame

I can see that func (s *Series) Append(values interface{}) can be used to append data to the end of the Series.

How can I insert a new row at an arbitrary position?

It would be great to be able to do it for a DataFrame too, not just for a Series.

Allowing type specification through a map rather than a variadic string argument would be more flexible

Right now I have to either specify all types or no types at all. Specifying the types in a map[string]string (column name -> type name) would make it possible to specify types only for the columns you want and fall back to auto typing for the other columns.

It could possibly also shorten the code in ReadRecords by simply checking whether the column name is in the map and, if not, falling back to findType.

What do you think?

Problem installing on windows 10 x64 go 16

I'm trying to install dataframes with go 16.6 on windows 10 x64 and got the following message. After that the go.mod file is not updated:

go get: github.com/kniren/[email protected] updating to
        github.com/kniren/[email protected]: parsing go.mod:     
        module declares its path as: github.com/go-gota/gota
                but was required as: github.com/kniren/gota

feature: groupby and Aggregation

I am a fan of Gota, I hope Gota will become better

Added features:

Groupby
Aggregation

groups := df.GroupBy("key1", "key2") // Group by column "key1", and column "key2" 
aggre := groups.Aggregation([]AggregationType{Aggregation_MAX, Aggregation_MIN}, []string{"values", "values2"}) // Maximum value in column "values",  Minimum value in column "values2"

New release tag

v0.9.0 does not include the fix for the broken import.

https://github.com/go-gota/gota/commit/7d8acfb8259f6135fb372d0966d9b6a23161c430#diff-80cec47501f12ea2f50aa0ff5c6bca95

Could you please tag a newer release?

Feat: add PROD operation in series and aggregation

It would be convenient to have a PROD operation in series and dataframe aggregation.

Furthermore, aggregation could be made more flexible with user-defined aggregations. How about this?

type aggragation interface {
    colname() string
    aggFunc(s series.Series) float64
}

func (gps Groups) Aggregation(aggs []aggragation) DataFrame

Project Status

I am wondering what the status of this project is. I see that issues haven't been updated, and there hasn't been a commit in over a year.

Are there any further updates planned for this project?

Likewise, I see a gota2 in the GitHub group.

Get Values of a Specific Column as Array (without iteration?)

I have a dataframe that looks like this:

...to which I would like to start adding some new technical indicator columns, and I am looking for the easiest way to do it as I'm new to Go.

I would like to make use of the EMA function from this repo, which requires the data passed into the function to be of type []float64.

Is there any easier way to convert a column of dataframe.Dataframe type values to an array without just creating an empty array and iterating over the dataframe (as my dataframe is quite large), or is there an even better way to calculate the EMA column for the dataframe? I'm coming from Python where I think I may have been spoiled with this as it was super easy to do there in Pandas, where Go just doesn't seem to have the same tools to do this.
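
As the mean example in the Function application section shows, calling Float() on a Series returns a []float64, so a column obtained with something like df.Col("close").Float() could be passed to such a function directly (the column name "close" is an assumption). The sketch below is a plain Go exponential moving average over a []float64. The ema helper is hypothetical and is not the EMA function from the repo mentioned in the question. It uses the common smoothing factor 2/(period+1) and seeds the average with the first value.

```go
package main

import "fmt"

// ema returns the exponential moving average of xs using the smoothing
// factor alpha = 2/(period+1), seeded with the first value.
// This is a hypothetical helper written for illustration only.
func ema(xs []float64, period int) []float64 {
	if len(xs) == 0 || period < 1 {
		return nil
	}
	alpha := 2.0 / float64(period+1)
	out := make([]float64, len(xs))
	out[0] = xs[0]
	for i := 1; i < len(xs); i++ {
		out[i] = alpha*xs[i] + (1-alpha)*out[i-1]
	}
	return out
}

func main() {
	// With Gota, this slice would come from a column of the DataFrame,
	// for example df.Col("close").Float() (column name assumed).
	closes := []float64{1, 2, 3}
	fmt.Println(ema(closes, 3)) // prints "[1 1.5 2.25]"
}
```

The resulting slice can then be turned back into a column with series.New(values, series.Float, "EMA") and added through Mutate, as shown in the Mutate section.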

Speed read csv

Hi all. Am I right in thinking that gota is slower than Python's pandas anyway? Right now I'm trying to load data and run calculations on it in the shortest time possible. For 48kk lines with a tab separator it takes: pandas 32s, gota 1m36s.

