Stream data into Google BigQuery concurrently using InsertAll() or BQ Storage.


A Go package to write data into Google BigQuery concurrently with high throughput. By default the InsertAll() API is used (REST API under the hood), but you can configure it to use the Storage Write API (gRPC under the hood) instead.

The InsertAll API is easier to configure and works pretty much out of the box without any configuration. Nonetheless, the Storage API is recommended, as it is faster and comes at a lower cost. The latter does, however, require a bit more configuration on your side, including a Proto schema file. See the Storage example below on how to do so (TODO).

import "github.com/OTA-Insight/bqwriter"

To install the package on your system, do not clone the repo. Instead:

  1. Change to your project directory:
cd /path/to/my/project
  2. Get the package using the official Go tooling, which will also add it to your go.mod file for you:
go get github.com/OTA-Insight/bqwriter
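
Once installed, creating a streamer and writing rows can look like the following minimal sketch. This uses the default InsertAll-based client; the project and table identifiers are hypothetical, and the exact `Write`/`Close` signatures should be checked against the GoDoc:

```go
package main

import (
	"context"
	"log"

	"github.com/OTA-Insight/bqwriter"
)

func main() {
	ctx := context.Background()

	// Create a streamer with the default (InsertAll-based) configuration;
	// a nil config is assumed here to fall back to sane defaults.
	streamer, err := bqwriter.NewStreamer(
		ctx,
		"my-gcp-project", // hypothetical project ID
		"my-dataset",     // hypothetical dataset ID
		"my-table",       // hypothetical table ID
		nil,              // *bqwriter.StreamerConfig: nil means defaults
	)
	if err != nil {
		log.Fatal(err)
	}
	defer streamer.Close()

	// Write a single row; the InsertAll client accepts any value the
	// BigQuery REST API can encode, such as a map or a struct.
	if err := streamer.Write(map[string]interface{}{
		"name":  "example",
		"value": 42,
	}); err != nil {
		log.Fatal(err)
	}
}
```

Note that this talks to real BigQuery infrastructure using Application Default Credentials, so it cannot run offline.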

NOTE: This package is under development, and may occasionally make backwards-incompatible changes.

Go Versions Supported

We currently support Go versions 1.13 and newer.

Authorization

The streamer client will use Google Application Default Credentials to authorize its calls to the API endpoints. This allows your application to run in many environments without requiring explicit configuration.

Please open an issue should you require more advanced forms of authorization. The issue should come with an example, a clear statement of intent, and motivation on why this is a useful contribution to this package. Even if you wish to implement the patch yourself, it is nonetheless best to create an issue first, so that we can all be aligned on the specifics. Good communication is key here.

It was a deliberate choice not to support these advanced authorization methods for now: the package authors didn't have a need for them, and leaving them out kept the API as simple and small as possible. Some advanced authorization setups are nonetheless still possible.

To conclude: we currently do not support advanced forms of authorization, but we're open to including support for them if there is sufficient interest.

Contributing

Contributions are welcome. Please, see the CONTRIBUTING document for details.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. See Contributor Code of Conduct for more information.

Comments
  • [Proposal]: Connect to local BigQuery emulator

    [Proposal]: Connect to local BigQuery emulator

    Contact Details

    [email protected]

    Summary of your proposal

    To change the default connection URL, you can pass option.ClientOption parameters to bigquery.NewClient.

    To keep the impact as small as possible and not change today's behavior, the option.ClientOption values ("google.golang.org/api/option") can be added as variadic parameters to the methods that end up calling bigquery.NewClient.

    e.g.: func NewStreamer(ctx context.Context, projectID, dataSetID, tableID string, cfg *StreamerConfig, opts ...option.ClientOption) (*Streamer, error)

    from where they can be passed on to the underlying clients (storage, batch and insertall), e.g.:

    client, err := storage.NewClient(
        projectID, dataSetID, tableID,
        encoder, protobufDescriptor,
        logger,
        opts...,
    )

    and then pass them on to the Google BigQuery client: writer, err := managedwriter.NewClient(ctx, projectID, opts...)

    To use in test code:

    bqWriter, err := bqwriter.NewStreamer(
        ctx, projectId, datasetId, tableId,
        &bqwriter.StreamerConfig{
            ...
        },
        option.WithEndpoint("localhost:9050"),
        option.WithoutAuthentication(),
    )
    if err != nil {
        panic(err)
    }

    Motivation for your proposal

    The motivation for this change: to use https://github.com/goccy/bigquery-emulator as part of the integration tests during development and in the deploy pipeline.

    bigquery-emulator is a locally running BigQuery emulator that can easily run during testing. For this to work, though, local connections need to be supported.

    Alternatives for your proposal

    The workaround used today is a fork of the bqwriter master, where the OptionFn parameter is passed to the relevant methods.

    Alternatively, consider whether bqwriter should have its own OptionFn pattern to handle optional parameters passed in on creation.

    Version

    0.4.1 (Latest)

    What platform are you mostly using or planning to use our software on?

    Linux

    Code of Conduct

    • [X] I agree to follow this project's Code of Conduct
  • [Proposal]: use upstream bigquery/storage/managedwriter package instead of our forked version

    [Proposal]: use upstream bigquery/storage/managedwriter package instead of our forked version

    Contact Details

    No response

    Summary of your proposal

    Use the upstream bigquery/storage/managedwriter package instead of our forked version. One of the issues that ideally is resolved prior to doing so is https://github.com/googleapis/google-cloud-go/issues/5094.

    Motivation for your proposal

    The main motivation is that we have less code to maintain ourselves.

    Alternatives for your proposal

    The alternative is to not use a managedWriter at all, in which case we're most likely going to reinvent the wheel. The only other obvious option is to continue using our fork, but that means we'll also need to maintain it ourselves, and do so mostly alone.

    Version

    0.3.1 (Latest)

    What platform are you mostly using or planning to use our software on?

    MacOS

    Code of Conduct

    • [X] I agree to follow this project's Code of Conduct
  • [Proposal]: support batch loading of data

    [Proposal]: support batch loading of data

    Contact Details

    No response

    Summary of your proposal

    We currently support the insertAll API and soon also the storage API. What we do not yet support is batch loading as documented in https://cloud.google.com/bigquery/docs/batch-loading-data. They give an example of a single file, but we could do it for any number of files as well as any reader in general.

    Need to investigate the specifics, but it does look like it is still in scope.
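
    For reference, batch loading with the plain cloud.google.com/go/bigquery client looks roughly like the sketch below (loading a CSV file from GCS; the project, bucket, dataset and table names are hypothetical). This is the kind of flow bqwriter could wrap:

    ```go
    package main

    import (
    	"context"
    	"log"

    	"cloud.google.com/go/bigquery"
    )

    func main() {
    	ctx := context.Background()

    	client, err := bigquery.NewClient(ctx, "my-gcp-project") // hypothetical project ID
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer client.Close()

    	// Describe the source data: a CSV file in Google Cloud Storage.
    	gcsRef := bigquery.NewGCSReference("gs://my-bucket/data.csv") // hypothetical URI
    	gcsRef.SourceFormat = bigquery.CSV
    	gcsRef.SkipLeadingRows = 1 // skip the header row

    	// Configure and run the load job against the target table.
    	loader := client.Dataset("my-dataset").Table("my-table").LoaderFrom(gcsRef)
    	loader.WriteDisposition = bigquery.WriteAppend

    	job, err := loader.Run(ctx)
    	if err != nil {
    		log.Fatal(err)
    	}
    	status, err := job.Wait(ctx)
    	if err != nil {
    		log.Fatal(err)
    	}
    	if status.Err() != nil {
    		log.Fatal(status.Err())
    	}
    }
    ```

    As with streaming, this requires real GCP credentials and infrastructure, so it is not runnable offline.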

    Motivation for your proposal

    It's a different kind of use case for BQWriter: still writing data into BQ, but for an entirely different purpose. For those purposes batch loading is better suited, as timing isn't as critical, and with that comes a reduction in cost.

    Alternatives for your proposal

    Do not support it and explicitly document so instead.

    Version

    0.3.1 (Latest)

    What platform are you mostly using or planning to use our software on?

    MacOS

    Code of Conduct

    • [X] I agree to follow this project's Code of Conduct
  • Support connecting to local BigQuery emulator

    Support connecting to local BigQuery emulator

    Related issues

    A link to each issue (be it a proposal or bug) which this PR aims to resolve. Prior to starting a PR an issue is desired in order to ensure we're all aligned prior to you putting any of your valuable time into this project.

    Closes https://github.com/OTA-Insight/bqwriter/issues/10

    Description

    A few sentences describing the overall goals of the pull request's commits.

    ...

    Important remarks

    A summary/extract of the most important remarks perhaps already discussed in your description above.

    • ...

    Todos

    • [ ] Tests
    • [ ] Documentation
    • [ ] Self-Review
    • [ ] ...

    Impacted Areas in this Golang package:

    List general components of the OTA-Insight/bqwriter Golang package that this PR will affect:

    • ...

    Code of Conduct

    By submitting this pull request (PR), I agree to follow this project's Code of Conduct.

  • update-v0.6.21

    update-v0.6.21

    Upgraded Dependencies:

    • golang.org/x/net: v0.0.0-20220526153639-5463443f8c37 => v0.0.0-20220607020251-c690dde0001d
    • golang.org/x/sync: v0.0.0-20220513210516-0976fa681c29 => v0.0.0-20220601150217-0de741cfad7f
    • google.golang.org/api: v0.81.0 => v0.82.0
    • google.golang.org/genproto: v0.0.0-20220527130721-00d5c0f3be58 => v0.0.0-20220607140733-d738665f6195
    • google.golang.org/grpc: v1.46.2 => v1.47.0
  • Update version v0.6.17

    Update version v0.6.17

    cmd: go get -u && go mod tidy

    Updated Dependencies:

    • update google.golang.org/api to v0.77.0 (was v0.75.0);

    Updated Indirect Dependencies:

    • update golang.org/x/sys, golang.org/x/net, and google.golang.org/genproto to latest (no semver);

    Added Indirect Dependencies:

    • add github.com/google/go-cmp v0.5.8
  • Fix typo

    Fix typo

    Related issues

    /

    Description

    There was a typo in the error message

    Important remarks

    /

    Todos

    • [ ] Tests
    • [ ] Documentation
    • [ ] Self-Review
    • [ ] ...

    Impacted Areas in this Golang package:

    List general components of the OTA-Insight/bqwriter Golang package that this PR will affect:

    • ...

    Code of Conduct

    By submitting this pull request (PR), I agree to follow this project's Code of Conduct.

  • add initial benchmark code (insertAll works, storage fails)

    add initial benchmark code (insertAll works, storage fails)

    Related issues

    N/A

    Description

    A few sentences describing the overall goals of the pull request's commits.

    Be able to have benchmarks which function as end-to-end tests against real production infrastructure, and which also give some insight, be it basic, into some of the different clients and setups possible.

    Important remarks

    N/A

    Todos

    • [ ] Tests
    • [ ] Documentation
    • [ ] Self-Review

    Impacted Areas in this Golang package:

    List general components of the OTA-Insight/bqwriter Golang package that this PR will affect:

    • new benchmark package;
    • fix bugs here and there (std logger, storage API)

    Code of Conduct

    By submitting this pull request (PR), I agree to follow this project's Code of Conduct.

  • Add batch client

    Add batch client

    Related issues

    A link to each issue (be it a proposal or bug) which this PR aims to resolve. Prior to starting a PR an issue is desired in order to ensure we're all aligned prior to you putting any of your valuable time into this project.

    https://github.com/OTA-Insight/bqwriter/issues/2

    Description

    A few sentences describing the overall goals of the pull request's commits.

    Create a new client that supports batch uploading described here: https://cloud.google.com/bigquery/docs/batch-loading-data

    Currently we support these formats

    • CSV
    • JSON
    • Avro
    • Parquet
    • ORC

    Important remarks

    A summary/extract of the most important remarks perhaps already discussed in your description above.

    Adds support for batch uploading of data.

    Todos

    • [x] Tests
    • [x] Documentation
    • [x] Self-Review

    Impacted Areas in this Golang package:

    List general components of the OTA-Insight/bqwriter Golang package that this PR will affect:

    • streamer.go
    • bigquery/batch

    Code of Conduct

    By submitting this pull request (PR), I agree to follow this project's Code of Conduct.

  • initial storage API support (alpha)

    initial storage API support (alpha)

    Related issues

    N/A

    Description

    Adds storage API support. Very basic for now, and only for the DefaultStream.

    For production use it is also not yet recommended, until it is further tested and streamlined internally.

    Important remarks

    Best to use the InsertAll API for production until the Storage API has been tested and streamlined further.

    Code of Conduct

    By submitting this pull request (PR), I agree to follow this project's Code of Conduct.

  • align special files according to GH's name conventions

    align special files according to GH's name conventions

    Related issues

    A link to each issue (be it a proposal or bug) which this PR aims to resolve. Prior to starting a PR an issue is desired in order to ensure we're all aligned prior to you putting any of your valuable time into this project.

    N/A

    Description

    A few sentences describing the overall goals of the pull request's commits.

    Align the special files used by GitHub or as defined by its implicit conventions in order to better align with its ecosystem as a whole.

    Important remarks

    A summary/extract of the most important remarks perhaps already discussed in your description above.

    Todos

    • [ ] Documentation
    • [ ] Self-Review

    Impacted Areas in this Golang package:

    List general components of the OTA-Insight/bqwriter Golang package that this PR will affect:

    N/A
