DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration

dataframe-go

Dataframes are used for statistics, machine-learning, and data manipulation/exploration. You can think of a Dataframe as an Excel spreadsheet. This package is designed to be lightweight and intuitive.

⚠️ The package is production ready but the API is not yet stable. Once stability is reached, version 1.0.0 will be tagged. It is recommended that you lock your package manager to a specific commit id rather than tracking the master branch directly. ⚠️
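
For example, with Go modules you can pin the dependency to a specific commit (the commit hash below is just a placeholder):

go get github.com/rocketlaunchr/dataframe-go@<commit-hash>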

Star the project to show your appreciation.

Features

  1. Importing from CSV, JSONL, MySQL & PostgreSQL
  2. Exporting to CSV, JSONL, Excel, Parquet, MySQL & PostgreSQL
  3. Developer Friendly
  4. Flexible - Create custom Series (custom data types)
  5. Performant
  6. Interoperability with the gonum package
  7. pandas sub-package (help required)
  8. Fake data generation
  9. Interpolation (ForwardFill, BackwardFill, Linear, Spline, Lagrange)
  10. Time-series Forecasting (SES, Holt-Winters)
  11. Math functions
  12. Plotting (cross-platform)

See the Tutorial section below.

Installation

go get -u github.com/rocketlaunchr/dataframe-go
import dataframe "github.com/rocketlaunchr/dataframe-go"

DataFrames

Creating a DataFrame

s1 := dataframe.NewSeriesInt64("day", nil, 1, 2, 3, 4, 5, 6, 7, 8)
s2 := dataframe.NewSeriesFloat64("sales", nil, 50.3, 23.4, 56.2, nil, nil, 84.2, 72, 89)
df := dataframe.NewDataFrame(s1, s2)

fmt.Print(df.Table())
  
OUTPUT:
+-----+-------+---------+
|     |  DAY  |  SALES  |
+-----+-------+---------+
| 0:  |   1   |  50.3   |
| 1:  |   2   |  23.4   |
| 2:  |   3   |  56.2   |
| 3:  |   4   |   NaN   |
| 4:  |   5   |   NaN   |
| 5:  |   6   |  84.2   |
| 6:  |   7   |   72    |
| 7:  |   8   |   89    |
+-----+-------+---------+
| 8X2 | INT64 | FLOAT64 |
+-----+-------+---------+

Go Playground

Insert and Remove Row

df.Append(nil, 9, 123.6)

df.Append(nil, map[string]interface{}{
	"day":   10,
	"sales": nil,
})

df.Remove(0)

OUTPUT:
+-----+-------+---------+
|     |  DAY  |  SALES  |
+-----+-------+---------+
| 0:  |   2   |  23.4   |
| 1:  |   3   |  56.2   |
| 2:  |   4   |   NaN   |
| 3:  |   5   |   NaN   |
| 4:  |   6   |  84.2   |
| 5:  |   7   |   72    |
| 6:  |   8   |   89    |
| 7:  |   9   |  123.6  |
| 8:  |  10   |   NaN   |
+-----+-------+---------+
| 9X2 | INT64 | FLOAT64 |
+-----+-------+---------+

Go Playground

Update Row

df.UpdateRow(0, nil, map[string]interface{}{
	"day":   3,
	"sales": 45,
})

Sorting

sks := []dataframe.SortKey{
	{Key: "sales", Desc: true},
	{Key: "day", Desc: true},
}

df.Sort(ctx, sks)

OUTPUT:
+-----+-------+---------+
|     |  DAY  |  SALES  |
+-----+-------+---------+
| 0:  |   9   |  123.6  |
| 1:  |   8   |   89    |
| 2:  |   6   |  84.2   |
| 3:  |   7   |   72    |
| 4:  |   3   |  56.2   |
| 5:  |   2   |  23.4   |
| 6:  |  10   |   NaN   |
| 7:  |   5   |   NaN   |
| 8:  |   4   |   NaN   |
+-----+-------+---------+
| 9X2 | INT64 | FLOAT64 |
+-----+-------+---------+

Go Playground

Iterating

You can change the step and starting row. It may be wise to lock the DataFrame before iterating.

The returned value is a map containing the name of the series (string) and the index of the series (int) as keys.

iterator := df.ValuesIterator(dataframe.ValuesOptions{0, 1, true}) // Don't apply read lock because we are write locking from outside.

df.Lock()
for {
	row, vals, _ := iterator()
	if row == nil {
		break
	}
	fmt.Println(*row, vals)
}
df.Unlock()

OUTPUT:
0 map[day:1 0:1 sales:50.3 1:50.3]
1 map[sales:23.4 1:23.4 day:2 0:2]
2 map[day:3 0:3 sales:56.2 1:56.2]
3 map[1:<nil> day:4 0:4 sales:<nil>]
4 map[day:5 0:5 sales:<nil> 1:<nil>]
5 map[sales:84.2 1:84.2 day:6 0:6]
6 map[day:7 0:7 sales:72 1:72]
7 map[day:8 0:8 sales:89 1:89]

Go Playground

Statistics

You can easily calculate statistics for a Series using the gonum or montanaflynn/stats package.

SeriesFloat64 and SeriesTime provide access to the exported Values field to seamlessly interoperate with external math-based packages.

Example

Some series provide easy conversion using the ToSeriesFloat64 method.

import "gonum.org/v1/gonum/stat"

s := dataframe.NewSeriesInt64("random", nil, 1, 2, 3, 4, 5, 6, 7, 8)
sf, _ := s.ToSeriesFloat64(ctx)

Mean

mean := stat.Mean(sf.Values, nil)

Median

import "github.com/montanaflynn/stats"
median, _ := stats.Median(sf.Values)

Standard Deviation

std := stat.StdDev(sf.Values, nil)

Plotting (cross-platform)

import (
	chart "github.com/wcharczuk/go-chart"
	"github.com/rocketlaunchr/dataframe-go/plot"
	wc "github.com/rocketlaunchr/dataframe-go/plot/wcharczuk/go-chart"
)

sales := dataframe.NewSeriesFloat64("sales", nil, 50.3, nil, 23.4, 56.2, 89, 32, 84.2, 72, 89)
cs, _ := wc.S(ctx, sales, nil, nil)

graph := chart.Chart{Series: []chart.Series{cs}}

plt, _ := plot.Open("Monthly sales", 450, 300)
graph.Render(chart.SVG, plt)
plt.Display(plot.None)
<-plt.Closed

Output:

plot

Math Functions

import "github.com/rocketlaunchr/dataframe-go/math/funcs"

res := 24
sx := dataframe.NewSeriesFloat64("x", nil, utils.Float64Seq(1, float64(res), 1))
sy := dataframe.NewSeriesFloat64("y", &dataframe.SeriesInit{Size: res})
df := dataframe.NewDataFrame(sx, sy)

fn := funcs.RegFunc("sin(2*𝜋*x/24)")
funcs.Evaluate(ctx, df, fn, 1)

Go Playground

Output:

sine wave

Importing Data

The imports sub-package has support for importing csv, jsonl and directly from a SQL database. The DictateDataType option can be set to specify the true underlying data type. Alternatively, InferDataTypes option can be set.

CSV

csvStr := `
Country,Date,Age,Amount,Id
"United States",2012-02-01,50,112.1,01234
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,17,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-05-07,NA,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United States",2012-02-01,32,321.31,54320
Spain,2012-02-01,66,555.42,00241
`
df, err := imports.LoadFromCSV(ctx, strings.NewReader(csvStr))

OUTPUT:
+-----+----------------+------------+-------+---------+-------+
|     |    COUNTRY     |    DATE    |  AGE  | AMOUNT  |  ID   |
+-----+----------------+------------+-------+---------+-------+
| 0:  | United States  | 2012-02-01 |  50   |  112.1  | 1234  |
| 1:  | United States  | 2012-02-01 |  32   | 321.31  | 54320 |
| 2:  | United Kingdom | 2012-02-01 |  17   |  18.2   | 12345 |
| 3:  | United States  | 2012-02-01 |  32   | 321.31  | 54320 |
| 4:  | United Kingdom | 2012-05-07 |  NaN  |  18.2   | 12345 |
| 5:  | United States  | 2012-02-01 |  32   | 321.31  | 54320 |
| 6:  | United States  | 2012-02-01 |  32   | 321.31  | 54320 |
| 7:  |     Spain      | 2012-02-01 |  66   | 555.42  |  241  |
+-----+----------------+------------+-------+---------+-------+
| 8X5 |     STRING     |    TIME    | INT64 | FLOAT64 | INT64 |
+-----+----------------+------------+-------+---------+-------+

Go Playground
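
To dictate the underlying data types of specific columns, you can pass a CSVLoadOptions value to LoadFromCSV. A minimal sketch for the CSV above (how "NA" values are handled depends on the other options you combine it with):

opts := imports.CSVLoadOptions{
	DictateDataType: map[string]interface{}{
		"Country": "",         // load as string
		"Age":     int64(0),   // load as int64
		"Amount":  float64(0), // load as float64
		"Id":      int64(0),   // load as int64
	},
}

df, err := imports.LoadFromCSV(ctx, strings.NewReader(csvStr), opts)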

Exporting Data

The exports sub-package has support for exporting to csv, jsonl, parquet, Excel and directly to a SQL database.
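
A minimal sketch of writing a DataFrame to a CSV file, assuming the exports sub-package exposes an ExportToCSV(ctx, w, df) function that accepts any io.Writer:

import (
	"context"
	"log"
	"os"

	"github.com/rocketlaunchr/dataframe-go/exports"
)

f, err := os.Create("sales.csv")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

// Write df (created earlier) to the file as CSV.
if err := exports.ExportToCSV(context.Background(), f, df); err != nil {
	log.Fatal(err)
}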

Optimizations

  • If you know the number of rows in advance, you can set the capacity of the underlying slice of a series using SeriesInit{}. This will preallocate memory and provide speed improvements (see the sketch below).
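
For example, a minimal sketch that preallocates room for 10,000 rows, assuming SeriesInit exposes Size and Capacity fields:

// Size creates that many (nil-filled) rows up front; Capacity preallocates
// extra room so subsequent appends avoid reallocations.
init := &dataframe.SeriesInit{Size: 0, Capacity: 10000}

s1 := dataframe.NewSeriesInt64("day", init)
s2 := dataframe.NewSeriesFloat64("sales", init)
df := dataframe.NewDataFrame(s1, s2)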

Generic Series

Out of the box, there is support for string, time.Time, float64 and int64. Automatic support exists for float32 and all types of integers. There is a convenience function provided for dealing with bool. There is also support for complex128 inside the xseries subpackage.

There may be times that you want to use your own custom data types. You can either implement your own Series type (more performant) or use the Generic Series (more convenient).

civil.Date

import "time"
import "cloud.google.com/go/civil"

sg := dataframe.NewSeriesGeneric("date", civil.Date{}, nil, civil.Date{2018, time.May, 01}, civil.Date{2018, time.May, 02}, civil.Date{2018, time.May, 03})
s2 := dataframe.NewSeriesFloat64("sales", nil, 50.3, 23.4, 56.2)

df := dataframe.NewDataFrame(sg, s2)

OUTPUT:
+-----+------------+---------+
|     |    DATE    |  SALES  |
+-----+------------+---------+
| 0:  | 2018-05-01 |  50.3   |
| 1:  | 2018-05-02 |  23.4   |
| 2:  | 2018-05-03 |  56.2   |
+-----+------------+---------+
| 3X2 | CIVIL DATE | FLOAT64 |
+-----+------------+---------+

Tutorial

Create some fake data

Let's create a list of 8 "fake" employees with a name, title and base hourly wage rate.

import "golang.org/x/exp/rand"
import "rocketlaunchr/dataframe-go/utils/faker"

src := rand.NewSource(uint64(time.Now().UTC().UnixNano()))
df := faker.NewDataFrame(8, src, faker.S("name", 0, "Name"), faker.S("title", 0.5, "JobTitle"), faker.S("base rate", 0, "Number", 15, 50))
+-----+----------------+----------------+-----------+
|     |      NAME      |     TITLE      | BASE RATE |
+-----+----------------+----------------+-----------+
| 0:  | Cordia Jacobi  |   Consultant   |    42     |
| 1:  | Nickolas Emard |      NaN       |    22     |
| 2:  | Hollis Dickens | Representative |    22     |
| 3:  | Stacy Dietrich |      NaN       |    43     |
| 4:  |  Aleen Legros  |    Officer     |    21     |
| 5:  |  Adelia Metz   |   Architect    |    18     |
| 6:  | Sunny Gerlach  |      NaN       |    28     |
| 7:  | Austin Hackett |      NaN       |    39     |
+-----+----------------+----------------+-----------+
| 8X3 |     STRING     |     STRING     |   INT64   |
+-----+----------------+----------------+-----------+

Apply Function

Let's give a promotion to everyone by doubling their salary.

s := df.Series[2]

applyFn := dataframe.ApplySeriesFn(func(val interface{}, row, nRows int) interface{} {
	return 2 * val.(int64)
})

dataframe.Apply(ctx, s, applyFn, dataframe.FilterOptions{InPlace: true})
+-----+----------------+----------------+-----------+
|     |      NAME      |     TITLE      | BASE RATE |
+-----+----------------+----------------+-----------+
| 0:  | Cordia Jacobi  |   Consultant   |    84     |
| 1:  | Nickolas Emard |      NaN       |    44     |
| 2:  | Hollis Dickens | Representative |    44     |
| 3:  | Stacy Dietrich |      NaN       |    86     |
| 4:  |  Aleen Legros  |    Officer     |    42     |
| 5:  |  Adelia Metz   |   Architect    |    36     |
| 6:  | Sunny Gerlach  |      NaN       |    56     |
| 7:  | Austin Hackett |      NaN       |    78     |
+-----+----------------+----------------+-----------+
| 8X3 |     STRING     |     STRING     |   INT64   |
+-----+----------------+----------------+-----------+

Create a Time series

Let's inform all employees separately on sequential days.

import "rocketlaunchr/dataframe-go/utils/utime"

mts, _ := utime.NewSeriesTime(ctx, "meeting time", "1D", time.Now().UTC(), false, utime.NewSeriesTimeOptions{Size: &[]int{8}[0]})
df.AddSeries(mts, nil)
+-----+----------------+----------------+-----------+--------------------------------+
|     |      NAME      |     TITLE      | BASE RATE |          MEETING TIME          |
+-----+----------------+----------------+-----------+--------------------------------+
| 0:  | Cordia Jacobi  |   Consultant   |    84     |   2020-02-02 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 1:  | Nickolas Emard |      NaN       |    44     |   2020-02-03 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 2:  | Hollis Dickens | Representative |    44     |   2020-02-04 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 3:  | Stacy Dietrich |      NaN       |    86     |   2020-02-05 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 4:  |  Aleen Legros  |    Officer     |    42     |   2020-02-06 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 5:  |  Adelia Metz   |   Architect    |    36     |   2020-02-07 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 6:  | Sunny Gerlach  |      NaN       |    56     |   2020-02-08 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 7:  | Austin Hackett |      NaN       |    78     |   2020-02-09 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
+-----+----------------+----------------+-----------+--------------------------------+
| 8X4 |     STRING     |     STRING     |   INT64   |              TIME              |
+-----+----------------+----------------+-----------+--------------------------------+

Filtering

Let's filter out our senior employees (they have titles) for no reason.

filterFn := dataframe.FilterDataFrameFn(func(vals map[interface{}]interface{}, row, nRows int) (dataframe.FilterAction, error) {
	if vals["title"] == nil {
		return dataframe.DROP, nil
	}
	return dataframe.KEEP, nil
})

seniors, _ := dataframe.Filter(ctx, df, filterFn)
+-----+----------------+----------------+-----------+--------------------------------+
|     |      NAME      |     TITLE      | BASE RATE |          MEETING TIME          |
+-----+----------------+----------------+-----------+--------------------------------+
| 0:  | Cordia Jacobi  |   Consultant   |    84     |   2020-02-02 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 1:  | Hollis Dickens | Representative |    44     |   2020-02-04 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 2:  |  Aleen Legros  |    Officer     |    42     |   2020-02-06 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 3:  |  Adelia Metz   |   Architect    |    36     |   2020-02-07 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
+-----+----------------+----------------+-----------+--------------------------------+
| 4X4 |     STRING     |     STRING     |   INT64   |              TIME              |
+-----+----------------+----------------+-----------+--------------------------------+

Other useful packages

  • awesome-svelte - Resources for killing react
  • dbq - Zero boilerplate database operations for Go
  • electron-alert - SweetAlert2 for Electron Applications
  • google-search - Scrape google search results
  • igo - A Go transpiler with cool new syntax such as fordefer (defer for for-loops)
  • mysql-go - Properly cancel slow MySQL queries
  • react - Build front end applications using Go
  • remember-go - Cache slow database queries
  • testing-go - Testing framework for unit testing

Legal Information

The license is a modified MIT license. Refer to the LICENSE file for more details.

© 2018-20 PJ Engineering and Business Solutions Pty. Ltd.

Comments
  • How to use CSVLoadOptions?

    Hi, I have a CSV file with four fields (USERID, MOVIEID, RATING, TIMESTAMP). LoadFromCSV loads all fields as string by default, but I want them loaded as float64, so I created a CSVLoadOptions:

    var csvOp imports.CSVLoadOptions
    csvOp.DictateDataType = make(map[string]interface{})
    csvOp.DictateDataType["USERID"] = float64(0)
    csvOp.DictateDataType["MOVIEID"] = float64(0)
    csvOp.DictateDataType["RATING"] = float64(0)
    csvOp.DictateDataType["TIMESTAMP"] = float64(0)

    ratingDf, err := imports.LoadFromCSV(ctx, file, csvOp)

    But the load fails with an error and I don't know why. Am I using CSVLoadOptions incorrectly?

  • Getting dataframe.ApplySeriesFn undefined error

    Thanks for creating this library!

    I can get this code to work:

    ctx := context.TODO()
    
    // step 1: open the csv
    csvfile, err := os.Open("data/example.csv")
    if err != nil {
    	log.Fatal(err)
    }
    
    dataframe, err := imports.LoadFromCSV(ctx, csvfile)
    

    Here's the data that's printed:

    fmt.Print(dataframe.Table())
    
    +-----+------------+-----------------+
    |     | FIRST NAME | FAVORITE NUMBER |
    +-----+------------+-----------------+
    | 0:  |  matthew   |       23        |
    | 1:  |   daniel   |        8        |
    | 2:  |  allison   |       42        |
    | 3:  |   david    |       18        |
    +-----+------------+-----------------+
    | 4X2 |   STRING   |     STRING      |
    +-----+------------+-----------------+
    

    I cannot get this code working:

    s := dataframe.Series[2]
    
    applyFn := dataframe.ApplySeriesFn(func(val interface{}, row, nRows int) interface{} {
    	return 2 * val.(int64)
    })
    
    dataframe.Apply(ctx, s, applyFn, dataframe.FilterOptions{InPlace: true})
    
    fmt.Print(dataframe.Table())
    

    Here's the error message:

    ./dataframe_go.go:36:22: dataframe.ApplySeriesFn undefined (type *dataframe.DataFrame has no field or method ApplySeriesFn)
    ./dataframe_go.go:40:11: dataframe.Apply undefined (type *dataframe.DataFrame has no field or method Apply)
    ./dataframe_go.go:40:44: dataframe.FilterOptions undefined (type *dataframe.DataFrame has no field or method FilterOptions)
    

    Here's the code: https://github.com/MrPowers/go-dataframe-examples/blob/master/dataframe_go.go

    Sorry if this is a basic question. I am a Go newbie!

    Thanks again for making this library!

  • Reading from Parquet

    Hello,

    Are there any plans to support reading a Parquet file into a dataframe? I have a need for this and am evaluating this library to use in an application.

    Thanks!

  • Expand docs to include other common dataframe operations, etc.

    Greetings!

    Just a minor suggestion, but if you have the time, it could be useful to expand the docs a bit more to cover some additional common operations applied to dataframe-like structures, where supported.

    For example:

    • retrieving a single row
    • retrieving a single column
    • selecting row/column subsets by indices or ranges
    • selecting a single value by <row, column> indices

    Further, one other thing I noticed when using the package for the first time is that many of the dataframe.xx() function calls include a nil as the first argument.

    From looking at the code for dataframe.go, these appear to be relating to an optional Options struct, so it makes sense that this would be set to nil in many instances. It may just be worth mentioning this explicitly in the examples for .Append() in the docs.

    Finally two other things that could be useful to consider including in the docs:

    • Limitations compared with R/pandas
    • Cheatsheet of commands comparing dataframe-go with R/pandas (more effort, and probably better suited for a separate wiki page, etc., but would be really useful for people coming from these worlds..)

    Thanks for taking the time to put together and share this really useful package!

  • How to convert all DataFrame data to a gonum dense matrix?

    I want to use this package, but I found a problem: why did you make the values property of SeriesInt64 private?

    Could you explain how to convert a dataframe to a gonum dense matrix?

    Also, how should LoadFromCSV(ctx, strings.NewReader(csvStr)) be used? Which ctx should I pass, and how do I define the context.Context?

  • Draw graphs from columns of a dataframe

    Hi! At the moment I have managed to plot a separate dataframe column by this strange method:

    func main() {
            // all values of df are strings representing floating point numbers
    	df, err := imports.LoadFromCSV(ctx, r, imports.CSVLoadOptions{Comma: ';'})
    	s := df.Series[2] // trying to plot column 2
    	series := dataframe.NewSeriesFloat64("test_name", nil, nil)
    
    	i := s.ValuesIterator(dataframe.ValuesOptions{InitialRow: 0, Step: 1, DontReadLock: false})
    	for {
    		row, vals, _ := i()
    		if row == nil {
    			break
    		}
    		val, err := strconv.ParseFloat(vals.(string), 64)
    		if err != nil {
    			continue
    		}
    		series.Append(val)
    	}
    	Plot(series)
    }
    
    func Plot(ser *dataframe.SeriesFloat64) {
    	ctx := context.TODO()
    	cs, _ := wcharczuk_chart.S(ctx, ser, nil, nil)
    	graph := chart.Chart{
    		Title:  "test_graph",
    		Width:  640,
    		Height: 480,
    		Series: []chart.Series{cs},
    	}
    	f, err := os.Create("graph.svg")
    	if err != nil {
    		panic(err)
    	}
    	defer f.Close()
    
    	plt := bufio.NewWriter(f)
    	_ = graph.Render(chart.SVG, plt)
    }
    

    Is there any simpler or more elegant way to do this? Another question: can I plot several columns on one plot? If so, how can I do it? Thanks in advance.

  • LoadFromJSON Not Working

    files, err := ioutil.ReadFile("device.json")
    if err != nil {
    	fmt.Println(err)
    }
    
    var ctx = context.Background()
    df2, _ := imports.LoadFromJSON(ctx, strings.NewReader(string(files)))
    
    fmt.Println(df2.Table())
    
  • Add support for CSV without a header row

    This simply adds support for importing CSV files without a header row.

    If the ColumnNames option is specified, it is used to set the series names instead of reading them from the first row.

    It also moves the if row == 0 { check outside the for loop to avoid performing the check for every row read.

  • Inconsistent behavior for Apply when used with ApplyDataFrameFn

    I'm trying to concatenate two columns in a dataframe and put the result into a new column. The behavior is very inconsistent. Sometimes the strings are concatenated into the new column; sometimes the value is just set to NaN.

    In this run, the value for concat_contact_number in the resulting dataframe was correctly set to 97312345678. The map value for concat_contact_number also reflects the concatenated value.

    Expected output:

    $ go run main.go 
    INFO[0000] In applyConcatDf: vals[contact_number_country_code]: 973 
    INFO[0000] In applyConcatDf: vals[concat_contact_number]: 973 
    INFO[0000] In applyConcatDf: vals[contact_number]: 12345678 
    INFO[0000] In applyConcatDf: vals[concat_contact_number]: 97312345678 
    INFO[0000] In applyConcatDf: vals: map[0:973 1:12345678 2:<nil> concat_contact_number:97312345678 contact_number:12345678 contact_number_country_code:973] 
    INFO[0000] In prepareDataframe:                         
    INFO[0000] +-----+-----------------------------+----------------+-----------------------+
    |     | CONTACT NUMBER COUNTRY CODE | CONTACT NUMBER | CONCAT CONTACT NUMBER |
    +-----+-----------------------------+----------------+-----------------------+
    | 0:  |             973             |    12345678    |      97312345678      |
    +-----+-----------------------------+----------------+-----------------------+
    | 1X3 |           STRING            |     STRING     |        STRING         |
    +-----+-----------------------------+----------------+-----------------------+ 
    INFO[0000] In main:                                     
    INFO[0000] +-----+-----------------------------+----------------+-----------------------+
    |     | CONTACT NUMBER COUNTRY CODE | CONTACT NUMBER | CONCAT CONTACT NUMBER |
    +-----+-----------------------------+----------------+-----------------------+
    | 0:  |             973             |    12345678    |      97312345678      |
    +-----+-----------------------------+----------------+-----------------------+
    | 1X3 |           STRING            |     STRING     |        STRING         |
    +-----+-----------------------------+----------------+-----------------------+ 
    

    In this run, the value for concat_contact_number in the resulting dataframe was incorrectly set to NaN. Same as with the correct run, the map value for concat_contact_number is also set to the expected concatenated value.

    Erroneous output:

    $ go run main.go 
    INFO[0000] In applyConcatDf: vals[contact_number_country_code]: 973 
    INFO[0000] In applyConcatDf: vals[concat_contact_number]: 973 
    INFO[0000] In applyConcatDf: vals[contact_number]: 12345678 
    INFO[0000] In applyConcatDf: vals[concat_contact_number]: 97312345678 
    INFO[0000] In applyConcatDf: vals: map[0:973 1:12345678 2:<nil> concat_contact_number:97312345678 contact_number:12345678 contact_number_country_code:973] 
    INFO[0000] In prepareDataframe:                         
    INFO[0000] +-----+-----------------------------+----------------+-----------------------+
    |     | CONTACT NUMBER COUNTRY CODE | CONTACT NUMBER | CONCAT CONTACT NUMBER |
    +-----+-----------------------------+----------------+-----------------------+
    | 0:  |             973             |    12345678    |          NaN          |
    +-----+-----------------------------+----------------+-----------------------+
    | 1X3 |           STRING            |     STRING     |        STRING         |
    +-----+-----------------------------+----------------+-----------------------+ 
    INFO[0000] In main:                                     
    INFO[0000] +-----+-----------------------------+----------------+-----------------------+
    |     | CONTACT NUMBER COUNTRY CODE | CONTACT NUMBER | CONCAT CONTACT NUMBER |
    +-----+-----------------------------+----------------+-----------------------+
    | 0:  |             973             |    12345678    |          NaN          |
    +-----+-----------------------------+----------------+-----------------------+
    | 1X3 |           STRING            |     STRING     |        STRING         |
    +-----+-----------------------------+----------------+-----------------------+ 
    

    It can be observed that in both cases the map value for 2 is always <nil>. Is this expected?

    Run this code several times to see the deviations in the output. The issue may not show up immediately: sometimes it takes 10 runs, sometimes only 2. Again, the behavior is inconsistent.

    Working code:

    package main
    
    import (
    	"context"
    	"fmt"
    	"strings"
    
    	dataframe "github.com/rocketlaunchr/dataframe-go"
    	"github.com/rocketlaunchr/dataframe-go/imports"
    	log "github.com/sirupsen/logrus"
    )
    
    // applyConcatDf returns an ApplyDataFrameFn that concatenates the given column names into another column
    func applyConcatDf(dest_column string, columns []string) dataframe.ApplyDataFrameFn {
    	return func(vals map[interface{}]interface{}, row, nRows int) map[interface{}]interface{} {
    		vals[dest_column] = ""
    		for _, key := range columns {
    			log.Infof("vals[%s]: %s", key, vals[key].(string))
    			vals[dest_column] = vals[dest_column].(string) + vals[key].(string)
    			log.Infof("vals[%s]: %s", dest_column, vals[dest_column].(string))
    		}
    
    		log.Infof("vals: %v", vals)
    		return vals
    	}
    }
    
    // setupDataframe initializes the dataframe from a CSV string
    func setupDataframe() *dataframe.DataFrame {
    	ctx := context.Background()
    
    	csvStr := `contact_number_country_code,contact_number
    "973","12345678"`
    
    	df, _ := imports.LoadFromCSV(ctx, strings.NewReader(csvStr), imports.CSVLoadOptions{
    		DictateDataType: map[string]interface{}{
    			"contact_number_country_code": "",
    			"contact_number":              "",
    		},
    	})
    
    	return df
    }
    
    // prepareDataframe applies the concatenation on the loaded dataframe
    func prepareDataframe(df *dataframe.DataFrame) {
    	ctx := context.Background()
    
    	sConcatContactNumber := dataframe.NewSeriesString("concat_contact_number", &dataframe.SeriesInit{Size: df.NRows()})
    	df.AddSeries(sConcatContactNumber, nil)
    
    	_, err := dataframe.Apply(ctx, df, applyConcatDf("concat_contact_number", []string{"contact_number_country_code", "contact_number"}), dataframe.FilterOptions{InPlace: true})
    
    	if err != nil {
    		log.WithError(err).Error("concatenation cannot be applied")
    	}
    
    	fmt.Println(df)
    }
    
    func main() {
    	df := setupDataframe()
    	prepareDataframe(df)
    	fmt.Println(df)
    }
    
  • Getting back Float64/Int64/Mixed series from dataframe

    I wanted to know if there is a way to convert a series interface back to the original type of series (Float64/Int64/Mixed) underneath it. I will describe my use case.

    After creating a dataframe, I am trying to use gonum to do some analysis, for example a linear regression of two series from the dataframe. For this I have to iterate over the whole series (using ValuesIterator) to copy each element into a []float64, which is what gonum requires. ToSeriesFloat64 does not help since it is not part of the Series interface.

    Is there an easier way to access the whole underlying series as the corresponding concrete slice?

  • Potential collision and risk from the indirect dependency "github.com/gotestyourself/gotestyourself"

    Background

    The rocketlaunchr/dataframe-go repo uses the old path to import gotestyourself indirectly. This causes github.com/gotestyourself/gotestyourself and gotest.tools to coexist in this repo: https://github.com/rocketlaunchr/dataframe-go/blob/master/go.mod (lines 20 & 40)

    github.com/gotestyourself/gotestyourself v2.2.0+incompatible // indirect
    gotest.tools v2.2.0+incompatible // indirect 
    

    That’s because gotestyourself has renamed its import path from "github.com/gotestyourself/gotestyourself" to "gotest.tools". When you use the old path "github.com/gotestyourself/gotestyourself" to import gotestyourself, it is reintroduced through the "gotest.tools" import statements in gotestyourself's own source files.

    https://github.com/gotestyourself/gotest.tools/blob/v2.2.0/fs/example_test.go#L8

    package fs_test
    import (
    	…
    	"gotest.tools/assert"
    	"gotest.tools/assert/cmp"
    	"gotest.tools/fs"
    	"gotest.tools/golden"
    )
    

    "github.com/gotestyourself/gotestyourself" and "gotest.tools" are the same repos. This will work in isolation, bring about potential risks and problems.

    Solution

    Add replace statement in the go.mod file:

    replace github.com/gotestyourself/gotestyourself => gotest.tools v2.2.0
    

    Then clean the go.mod.

  • Progress for re-write of dataframe-go?

    The README states that "Once Go 1.18 (Generics) is introduced, the ENTIRE package will be rewritten." As Go 1.18 has been released for a while, I'm wondering whether work has started on rewriting the entire package. If so, how is the progress?

  • Indirect dependency `github.com/blend/go-sdk v1.1.1` does not exist

    I suspect that the library maintainers prepended "legacy-" to versions before changing the versioning scheme. At the least, this dependency should be updated to legacy-v1.1.1.

  • Error reading parquet with the latest parquet-go

    1. Create a file with python pandas
    dataframe = pandas.DataFrame({
            "A": ["a", "b", "c", "d"],
            "B": [2, 3, 4, 1],
            "C": [10, 20, None, None]
        })
    
    dataframe.to_parquet("1.parquet")
    

    This file looks like: (image omitted)

    2. Read the file
    func main() {
        ctx := context.Background()
        fr, _ := local.NewLocalFileReader("1.parquet")
        df, err := imports.LoadFromParquet(ctx, fr)
        if err != nil {
            panic(err)
        }
        fmt.Println(df)
    }
    
    3. Got a unique name error
    panic: names of series must be unique: 
    
    goroutine 1 [running]:
    github.com/rocketlaunchr/dataframe-go.NewDataFrame({0xc0001f8000, 0x3, 0xc000149a10?})
            .../rocketlaunchr/[email protected]/dataframe.go:41 +0x33c
    github.com/rocketlaunchr/dataframe-go/imports.LoadFromParquet({0x1497868, 0xc000020080}, {0x1498150?, 0xc00000e798?}, {0xc0000021a0?, 0xc000149f70?, 0x1007599?})
            .../go/pkg/mod/github.com/rocketlaunchr/[email protected]/imports/parquet.go:110 +0x8ae
    main.main()
            .../main.go:13 +0x78
    
    4. Following the stack, I found some useful information (screenshots omitted):
    • All series in imports.LoadFromParquet have empty names.
    • Each key in the goFieldNameToActual map has the prefix "Scheme", but goName does not; maybe that is why a name cannot be found in this map.

    This is the first time I have used Go to read parquet files. Is this error caused by breaking changes in parquet-go, or by something else?

  • Bad import, was an upstream dependency deleted?

    go: github.com/sjwhitworth/[email protected] requires
            github.com/rocketlaunchr/[email protected] requires
            github.com/blend/[email protected]: reading github.com/blend/go-sdk/go.mod at revision v1.1.1: unknown revision v1.1.1
    

    It looks like v1.1.1 of github.com/blend/go-sdk is missing. Are you seeing the same or am I taking crazy pills today?
