A library for reading and writing parquet files.

Last update: Jan 3, 2023

Comments: 6

Parquet

Parquet generates a parquet reader and writer based on a struct. The struct can be defined by you or it can be generated by reading an existing parquet file.

We (Parsyl) will respond to pull requests and issues to the best of our abilities. However, sometimes we will have higher priorities and the response might not be immediate.

NOTE: If you generate the code based on a parquet file there are quite a few limitations. The PageType of each PageHeader must be DATA_PAGE and the Codec (defined in ColumnMetaData) must be PLAIN or SNAPPY. Also, the parquet file's schema must consist of the currently supported types. But wait, there's more! Some of the encodings, like DELTA_BINARY_PACKED, BIT_PACKED, PLAIN_DICTIONARY, and DELTA_BYTE_ARRAY are also not supported. I would guess there are other parquet options that will cause problems since there are so many possibilities.

Installation

go get -u github.com/parsyl/parquet/...

This will also install parquet's only two dependencies: thrift and snappy.

Usage

First define a struct for the data to be written to parquet:

type Person struct {
	ID  int32  `parquet:"id"`
	Age *int32 `parquet:"age"`
}

Next, add a go:generate comment somewhere (in this example all code lives in main.go):

//go:generate parquetgen -input main.go -type Person -package main

Generate the code for the reader and writer:

$ go generate

A new file (parquet.go) has now been written that defines ParquetWriter and ParquetReader. Next, make use of the writer and reader:

package main

import (
    "bytes"
    "encoding/json"
    "log"
    "os"
)

func main() {
    var buf bytes.Buffer
    w, err := NewParquetWriter(&buf)
    if err != nil {
        log.Fatal(err)
    }

    w.Add(Person{ID: 1, Age: getAge(30)})
    w.Add(Person{ID: 2})

    // Each call to Write creates a new parquet row group.
    if err := w.Write(); err != nil {
        log.Fatal(err)
    }

    // Close must be called when you are done.  It writes
    // the parquet metadata at the end of the file.
    if err := w.Close(); err != nil {
        log.Fatal(err)
    }

    r, err := NewParquetReader(bytes.NewReader(buf.Bytes()))
    if err != nil {
        log.Fatal(err)
    }

    enc := json.NewEncoder(os.Stdout)
    for r.Next() {
        var p Person
        r.Scan(&p)
        enc.Encode(p)
    }

    if err := r.Error(); err != nil {
        log.Fatal(err)
    }
}

func getAge(a int32) *int32 { return &a }

NewParquetWriter accepts optional arguments: MaxPageSize, Uncompressed, and Snappy. For example, the following sets the page size (the number of rows in a page before a new one is created) and sets the page data compression to snappy:

w, err := NewParquetWriter(&buf, MaxPageSize(10000), Snappy)
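
Options like MaxPageSize(10000) and Snappy have the shape of Go's functional-options pattern. The sketch below shows how options of that shape could be wired up. It is not the generated code: writer, option, newWriter, the field names, and the default values are all assumptions made for illustration.

```go
package main

import "fmt"

// writer is a hypothetical stand-in for a generated writer.
type writer struct {
	maxPageSize int
	compression string
}

// option mutates a writer during construction.
type option func(*writer)

// MaxPageSize returns an option that sets the number of rows per page.
func MaxPageSize(n int) option {
	return func(w *writer) { w.maxPageSize = n }
}

// Snappy and Uncompressed are options themselves, so they are passed
// by name without parentheses.
func Snappy(w *writer)       { w.compression = "snappy" }
func Uncompressed(w *writer) { w.compression = "uncompressed" }

// newWriter sets defaults first (assumed values), then applies each option in order.
func newWriter(opts ...option) *writer {
	w := &writer{maxPageSize: 1000, compression: "uncompressed"}
	for _, opt := range opts {
		opt(w)
	}
	return w
}

func main() {
	w := newWriter(MaxPageSize(10000), Snappy)
	fmt.Println(w.maxPageSize, w.compression)
}
```

The appeal of this design is that new settings can be added later without changing the constructor's signature.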

See this for a complete example of how to generate the code based on an existing struct.

See this for a complete example of how to generate the code based on an existing parquet file.

Supported Types

The struct used to define the parquet data can have the following types:

int32
uint32
int64
uint64
float32
float64
string
bool

Each of these types may be a pointer to indicate that the data is optional. The struct can also embed another struct:

type Being struct {
	ID  int32  `parquet:"id"`
	Age *int32 `parquet:"age"`
}

type Person struct {
	Being
	Username string `parquet:"username"`
}
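
In plain Go terms, a nil pointer stands for a missing optional value, and the fields of an embedded struct are promoted to the outer struct. The short sketch below is independent of the parquet library; the optional helper is a hypothetical convenience and the struct tags are omitted.

```go
package main

import "fmt"

type Being struct {
	ID  int32
	Age *int32
}

type Person struct {
	Being
	Username string
}

// optional returns a pointer to v; a hypothetical helper for filling optional fields.
func optional[T any](v T) *T { return &v }

// describeAge reports whether the optional Age field is set.
func describeAge(p Person) string {
	if p.Age == nil { // Age is promoted from the embedded Being
		return "age: not set"
	}
	return fmt.Sprintf("age: %d", *p.Age)
}

func main() {
	a := Person{Being: Being{ID: 1, Age: optional(int32(30))}, Username: "a"}
	b := Person{Being: Being{ID: 2}, Username: "b"}
	fmt.Println(a.ID, describeAge(a))
	fmt.Println(b.ID, describeAge(b))
}
```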

Nested and repeated structs are supported too:

type Being struct {
	ID  int32  `parquet:"id"`
	Age *int32 `parquet:"age"`
}

type Person struct {
	Being    Being
	Username string `parquet:"username"`
	Friends  []Being
}

If you want a field to be excluded from parquet you can tag it with a dash or make it unexported like so:

type Being struct {
	ID       int32  `parquet:"id"`
	Password string `parquet:"-"` // will not be written to parquet
	age      int32                // will not be written to parquet
}

Parquetgen

Parquetgen is the command that go generate should call in order to generate the code for your custom type. It also can print the page headers and file metadata from a parquet file:

$ parquetgen --help
Usage of parquetgen:
  -ignore
        ignore unsupported fields in -type, otherwise log.Fatal is called when an unsupported type is encountered (default true)
  -import string
        import statement of -type if it doesn't live in -package
  -input string
        path to the go file that defines -type
  -metadata
        print the metadata of a parquet file (-parquet) and exit
  -output string
        name of the file that is produced, defaults to parquet.go (default "parquet.go")
  -package string
        package of the generated code
  -pageheaders
        print the page headers of a parquet file (-parquet) and exit (also prints the metadata)
  -parquet string
        path to a parquet file (if you are generating code based on an existing parquet file or printing the file metadata or page headers)
  -struct-output string
        name of the file that is produced, defaults to parquet.go (default "generated_struct.go")
  -type string
        name of the struct that will used for writing and reading

Owner

Parsyl Inc.

https://github.com/parsyl/parquet

Comments

Reduce allocations and the footprint

Hello there, here is an additional PR with memory optimisations to reduce the footprint on write.

Both optimisations use a slice trick from standard library strconv.Append...

Reduce allocations in optional fields, changing the signature of the functions.

Fixing bitpack and its gen, so it now has an allocation-free interface like strconv.AppendInt. RLE can also reuse a buffer and allocate it on the stack.

@cswank What do you think?

benchmark               iter    time/iter       bytes alloc       allocs
---------               ----    ---------       -----------       ------
// master
BenchmarkWrite/opt-8      22    486.70 ms/op    89216541 B/op     4308470 allocs/op
// PR
BenchmarkWrite/opt-8      28    440.07 ms/op    69019114 B/op      133440 allocs/op

Performance optimizations

This PR contains a few self-contained improvements focused on reducing the memory usage of the parquet encoding. Each optimization is done in a dedicated commit whose message starts with OPT:, so it might make more sense to review those commits individually.

To be able to compare with the baseline performance, a parquet.go file was generated in performance/base/ before applying any optimizations. The optimized code was generated in performance/parquet.go, and during the benchmarks both were used to compare the performance and to make sure that the binary representation of the data wasn't changed by mistake.

All of the optimizations (except the last one) were done in the generated code, so they didn't affect the base version. Here is the comparison (without the last commit):

benchmark                iter    time/iter       bytes alloc        allocs
---------                ----    ---------       -----------        ------
BenchmarkWrite/base-8      13    865.17 ms/op    672684318 B/op     8938970 allocs/op
BenchmarkWrite/opt-8       22    518.54 ms/op    256610628 B/op     4318254 allocs/op

Here are the results after applying the final commit as well (so there is some improvement for the base version as well):

benchmark                iter    time/iter       bytes alloc        allocs
---------                ----    ---------       -----------        ------
BenchmarkWrite/base-8      13    845.20 ms/op    619640619 B/op     8929125 allocs/op
BenchmarkWrite/opt-8       22    505.66 ms/op    202235762 B/op     4310258 allocs/op

Concurrent usage

Hi, I have a use-case where I'm writing billions of data entries into a file. I couldn't find any information regarding this in the README, so I wanted to ask whether it's safe to use *ParquetWriter across multiple goroutines to speed up the whole process and if you have any recommendations for doing that. I'm not quite sure, but maybe Add could be called concurrently and every once in a while Write should be called depending on the size of the RowGroup.

Update go.mod to include dependency versions

This library does not support the latest apache thrift library, because thrift broke its API by introducing new arguments into its exported functions. That makes it impossible to install this library with go get, because the older version is not selected by default. By adding the specific required versions to go.mod (or by updating this library to be compatible while still adding these versions to go.mod), this library will become immediately usable again.

Nested (not repeated) generated code is invalid

For now only flat parquet schemas are working correctly.

The code that was used to generate the code for writing values to the input struct needs to be brought back for cases when a field is not repeated.

Import bug

PR #14 introduced a bug. After that PR, both int and float types use the math package, but math was only imported when an int was present in the types.

For a struct that contains a float but not an int, the generated file won't import math and won't compile.

To simplify things, I just always import math, and make sure that it is used to avoid an unused import error.
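
One common Go idiom for keeping an import referenced, whatever code ends up being generated, is a package-level blank assignment. The sketch below shows that idiom; it is not necessarily what the PR does, and float32bits is a hypothetical helper standing in for generated code.

```go
package main

import (
	"fmt"
	"math"
)

// Referencing the package once at package level keeps the import valid
// even when no generated function ends up calling into math.
var _ = math.MaxInt32

// float32bits shows the kind of call a float column might need.
func float32bits(f float32) uint32 { return math.Float32bits(f) }

func main() {
	fmt.Printf("%#x\n", float32bits(1.0))
}
```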
