Run SQL queries against JSON, CSV, Excel, Parquet, and more.

Last update: Dec 31, 2022

Comments: 11

Run SQL queries against JSON, CSV, Excel, Parquet, and more

This is a CLI companion to DataStation (a GUI) for running SQL queries against data files. So if you want the GUI version of this, check out DataStation.

Install

Get Go 1.17+ and then run:

$ go install github.com/multiprocessio/dsq@latest

Usage

You can either pipe data to dsq or you can pass a file name to it.

When piping data to dsq you need to specify the file extension or MIME type.

For example:

$ cat testdata.csv | dsq csv "SELECT * FROM {} LIMIT 1"

Or:

$ cat testdata.parquet | dsq parquet "SELECT COUNT(1) FROM {}"

If you are passing a file, it must have the usual extension for its content type.

For example:

$ dsq testdata.json "SELECT * FROM {} WHERE x > 10"

Or:

$ dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"

Transforming data to JSON without querying

As a shorthand for dsq testdata.csv "SELECT * FROM {}" to convert supported file types to JSON you can skip the query and the converted JSON will be dumped to stdout.

For example:

$ dsq testdata.csv
[{...some csv data...},{...some csv data...},...]

Supported Data Types

Name	File Extension(s)	Notes
CSV	`csv`
TSV	`tsv`, `tab`
JSON	`json`	Must be an array of objects. Nested object fields are ignored.
Newline-delimited JSON	`ndjson`, `jsonl`
Parquet	`parquet`
Excel	`xlsx`, `xls`	Currently only works if there is only one sheet.
ODS	`ods`	Currently only works if there is only one sheet.
Apache Error Logs	`text/apache2error`	Currently only works if being piped in.
Apache Access Logs	`text/apache2access`	Currently only works if being piped in.
Nginx Access Logs	`text/nginxaccess`	Currently only works if being piped in.

Engine

Under the hood dsq uses DataStation as a library and under that hood DataStation uses SQLite to power these kinds of SQL queries on arbitrary (structured) data.

Comparisons

The speed column is based on rough benchmarks based on q's benchmarks. Eventually I'll do a more thorough and public benchmark.

| Name | Link | Speed | Supported File Types | Engine | |----|-|-|-|-|------------------------------------------------------------------------| | q | http://harelba.github.io/q/ | Fast | CSV, TSV | Uses SQLite | | textql | https://github.com/dinedal/textql | Ok | CSV, TSV | Uses SQLite | | octoql | https://github.com/cube2222/octosql | Slow | JSON, CSV, Excel, Parquet | Custom engine missing many features from SQLite | | dsq | Here | Ok | CSV, TSV, JSON, Newline-delimited JSON, Parquet, Excel, ODS (OpenOffice Calc), Logs | Uses SQLite |

Community

Join us on Discord.

How can I help?

Download the app and use it! Report bugs on Discord.

Before starting on any new feature though, check in on Discord!

If you want to hear about new features and how DataStation works under the hood, sign up here.

License

This software is licensed under an Apache 2.0 license.

Owner

Multiprocess Labs LLC

https://github.com/multiprocessio/dsq

Comments

Support caching
Add "--cache"" flag

Create sqlite db file on disk with name as the hash of its content.

Add test entry.

This will fail tests until the changes in datastation is committed.

Closes #29
Add support for repl

Grab https://github.com/chzyer/readline and wrap main.go's query parts (likely line 469-503) in a loop. Allow the loop to keep going if the user passes an -i flag or --interactive flag.

When this flag is in, caching must also be turned on so that the sqlite database is not ephemeral per run (I think).
sqlite3 one-liner(ish) for benchmarking
Follows #69 - the README at https://github.com/multiprocessio/dsq/tree/b7af2679038f8b09802c88f2f088bcc78fa8872c#benchmark says sqlite3 can't do one-liners, but it sort-of-can - this works for running the taxi query:

sqlite3 :memory: <<EOS .mode csv .import taxi.csv taxi SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count EOS
Parquet is missing rows

I have a Parquet file that should have 30,000+ rows, but SELECT COUNT(*) FROM {} returns 7000. Another one with more than 40,000 rows returns exactly 8000. Converting the same data to JSON works fine.
"go install @latest" not working
I tried installing latest (c6daddfd43b61c1a932404d88dd6e68035b0dd01) with go install github.com/multiprocessio/dsq@latest but it failed with

go: downloading github.com/multiprocessio/dsq v0.0.0-20220515195645-c6daddfd43b6 go: github.com/multiprocessio/dsq@latest (in github.com/multiprocessio/[email protected]): The go.mod file for the module providing named packages contains one or more replace directives. It must not contain directives that would cause it to be interpreted differently than if it were the main module.

however, this older version's install command worked go install github.com/multiprocessio/dsq@4ca4a314a0380c55ab9fc12bb03bc2c998d5580c

I think the issue is related to this line https://github.com/multiprocessio/dsq/commit/4140b05c4e9fd6efbcedf23e58d6025385dbc42b#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6R5
Repl support
Issue #53

I implemented it with https://github.com/chzyer/readline.

What i did :

I refactored line 469 to 503 into a function called runQuery()

I added a new field in args called isInteractive

I added a new function repl()

If args.isInteractive is true repl() is called otherwise runQuery() is run directly

if args.isInteractive { return repl(project, &ec, args, files) } return runQuery(lastNonFlagArg, project, &ec, args, files)
Can't run SQL on log files.
Describe the bug and expected behavior

Getting error while running SQL query on log files like auth.log. Also tried with the file testdata/logfmt/log.logfmt (https://github.com/multiprocessio/dsq/blob/main/testdata/logfmt/log.logfmt). Is it possible to run SQL quries on the log files?

Reproduction steps

# cat log.logfmt | dsq -s text/logfmt 'SELECT level FROM {}' Input is not an array of objects: /tmp/dsq-stdin3752862510.

# cat /var/log/auth.log | dsq -s text/logfmt 'SELECT * FROM {}' Input is not an array of objects: /tmp/dsq-stdin1533956686.

Versions

OS: Ubuntu 22.04 LTS (Jammy Jellyfish)

Shell: bash

dsq version: dsq 0.20.2
suggestion to use v prefix for tags

Hi! Based on this approach https://github.com/golang/go/wiki/Modules#how-can-i-track-tool-dependencies-for-a-module to track tool dependencies for a module, as an example, I found a slight inconvenience when working with dsq, because dsq uses a different style to name tags. For go mod, it's recommended by this https://go.dev/ref/mod#versions

Each version starts with the letter v, followed by a semantic version

As a result, I have a pseudo-version like this instead of v0.21.0. It's challenging to understand which version is under the hood, actually. Is it possible to change the naming convention for tags?
How to select a column name containing spaces?
In the attached xlsx sheet, the header contains a column with spaces. How can I select this column? Writing dsq space.xlsx "SELECT \"Header with Space\" as header" yields:

[{"header":"Header with Space"}]

which is not what I meant.
Support of an input SQL file

Like jq or some other CLIs, there is a flag for an external file -f.

For example, with jq, we can use jq -f script.jq

That will be really useful in some cases, for example, when you want to define a lot of queries.

dsq -f query.sql file1.json file2.json file3.json

Thank you

Building with -buildmode=pie exposes crash in parquet test

See https://github.com/multiprocessio/dsq/issues/15 for the original report.

This crash shows up when you go build -buildmode=pie && ./scripts/test.py. This crash does not happen without -buildmode=pie.

panic: runtime error: index out of range [576457833716731764] with length 117670

goroutine 1 [running]:
github.com/goccy/go-json/internal/encoder.CompileToGetCodeSet(0xc000f70f90?, 0x55b1294306cc?)
      github.com/goccy/[email protected]/internal/encoder/compiler_norace.go:11 +0x1df
github.com/goccy/go-json.encode(0xc001161ba0, {0xc0009540c0, 0xc00112a750})
      github.com/goccy/[email protected]/encode.go:224 +0xd0
github.com/goccy/go-json.marshal({0xc0009540c0, 0xc00112a750}, {0x0, 0x0, 0x1?})
      github.com/goccy/[email protected]/encode.go:148 +0xba
github.com/goccy/go-json.MarshalWithOption(...)
      github.com/goccy/[email protected]/json.go:186
github.com/goccy/go-json.Marshal({0xc0009540c0?, 0xc00112a750?})
      github.com/goccy/[email protected]/json.go:171 +0x2a
github.com/multiprocessio/go-json.(*StreamEncoder).EncodeRow(0xc000958060, {0xc0009540c0?, 0xc00112a750})
      github.com/multiprocessio/[email protected]/encoder.go:57 +0x1dd
github.com/multiprocessio/datastation/runner.transformParquet.func1(0x0?)
      github.com/multiprocessio/datastation/[email protected]/file.go:121 +0xc6
github.com/multiprocessio/datastation/runner.withJSONArrayOutWriter({0x55b12b25b338?, 0xc000011218}, 0xc000f71288)
      github.com/multiprocessio/datastation/[email protected]/json.go:36 +0xf6
github.com/multiprocessio/datastation/runner.withJSONArrayOutWriterFile(...)
      github.com/multiprocessio/datastation/[email protected]/json.go:51
github.com/multiprocessio/datastation/runner.transformParquet({0x55b12b26a2c0?, 0xc000c35788?}, {0x55b12b25b338, 0xc000011218})
      github.com/multiprocessio/datastation/[email protected]/file.go:106 +0xd8
github.com/multiprocessio/datastation/runner.transformParquetFile({0x7ffddb498a31?, 0x1b?}, {0x55b12b25b338, 0xc000011218})
      github.com/multiprocessio/datastation/[email protected]/file.go:143 +0xec
github.com/multiprocessio/datastation/runner.TransformFile({0x7ffddb498a31, 0x1b}, {{0x0?, 0x1ff?}, {0x0?, 0xc000aff440?}}, {0x55b12b25b338, 0xc000011218})
      github.com/multiprocessio/datastation/[email protected]/file.go:594 +0x1e5
main.evalFileInto({0x7ffddb498a31, 0x1b}, 0x0?)
      github.com/multiprocessio/dsq/main.go:47 +0xc5
main._main()
      github.com/multiprocessio/dsq/main.go:241 +0xaec
main.main()
      github.com/multiprocessio/dsq/main.go:381 +0x19

Single binary CLI for generating structured JSON, CSV, Excel, etc.

fakegen: Single binary CLI for generating a random schema of M columns to populate N rows of JSON, CSV, Excel, etc. This program generates a random sc

Dec 26, 2022

Get data from .csv files use SQL-like queries.

csvql Get data from .csv files use SQL-like queries. Задание Необходимо написать консольную программу, которая по заданному клиентом запросу осуществл

Dec 7, 2021

A tool to run queries in defined frequency and expose the count as prometheus metrics. Supports MongoDB and SQL

query2metric A tool to run db queries in defined frequency and expose the count as prometheus metrics. Why ? Product metrics play an important role in

Jul 1, 2022

sqlbench measures and compares the execution time of one or more SQL queries.

sqlbench sqlbench measures and compares the execution time of one or more SQL queries. The main use case is benchmarking simple CPU-bound query varian

Dec 6, 2022

Parses a file and associate SQL queries to a map. Useful for separating SQL from code logic

goyesql This package is based on nleof/goyesql but is not compatible with it any more. This package introduces support for arbitrary tag types and cha

Oct 20, 2021

SQL API is designed to be able to run queries on databases without any configuration by simple HTTP call.

SQL API SQL API is designed to be able to run queries on databases without any configuration by simple HTTP call. The request contains the DB credenti

Dec 2, 2022

write APIs using direct SQL queries with no hassle, let's rethink about SQL

SQLer SQL-er is a tiny portable server enables you to write APIs using SQL query to be executed when anyone hits it, also it enables you to define val

Jan 7, 2023

Lightweight SQL database written in Go for prototyping and playing with text (CSV, JSON) data

gopicosql Lightweight SQL database written in Go for prototyping and playing wit

Jul 27, 2022

xlsx2mysql: An tool of helping your fastly generate SQL from Excel.

xlsx2mysql An tool of helping your fastly generate SQL from Excel 中文文档 Origin In order to convert Excel to MySQL and I made a tool to implement.But Wh

Nov 13, 2021

📊 parallel black box PostgreSQL unit tests run against a real database.

?? psql-docker-tests-example Parallel black box PostgreSQL unit tests run against a real database. Consider reading the Medium Story first. This packa

Sep 15, 2022

Querycrate - A simple library that loads and keeps SQL queries from files

QueryCrate A simple library for loading & getting string queries from files. How

Feb 15, 2022

Mergestat - a command-line tool for running SQL queries on git repositories and related data sources

Query git repositories with SQL. Generate reports, perform status checks, analyze codebases. ?? ??

Dec 30, 2022

pggen - generate type safe Go methods from Postgres SQL queries

pggen - generate type safe Go methods from Postgres SQL queries pggen is a tool that generates Go code to provide a typesafe wrapper around Postgres q

Jan 3, 2023

Opionated sql formatter for use with .go files containing backticked queries

fumpt-the-sql Opionated sql formatter for use with .go files containing backticked queries. Uses https://sqlformat.darold.net/ for the actual sql form

Dec 10, 2021

Dumpling is a fast, easy-to-use tool written by Go for dumping data from the database(MySQL, TiDB...) to local/cloud(S3, GCP...) in multifarious formats(SQL, CSV...).

?? Dumpling Dumpling is a tool and a Go library for creating SQL dump from a MySQL-compatible database. It is intended to replace mysqldump and mydump

Nov 9, 2022

Run SQL queries against JSON, CSV, Excel, Parquet, and more.

Run SQL queries against JSON, CSV, Excel, Parquet, and more

Install

Usage

Transforming data to JSON without querying

Supported Data Types

Engine

Comparisons

Community

How can I help?

Subscribe

License

Owner

Multiprocess Labs LLC

Comments

Support caching

Add support for repl

sqlite3 one-liner(ish) for benchmarking

Parquet is missing rows

"go install @latest" not working

Repl support

Can't run SQL on log files.

suggestion to use v prefix for tags

How to select a column name containing spaces?

Support of an input SQL file

Building with -buildmode=pie exposes crash in parquet test

Related tags

Single binary CLI for generating structured JSON, CSV, Excel, etc.

Get data from .csv files use SQL-like queries.

A tool to run queries in defined frequency and expose the count as prometheus metrics. Supports MongoDB and SQL

sqlbench measures and compares the execution time of one or more SQL queries.

Parses a file and associate SQL queries to a map. Useful for separating SQL from code logic

SQL API is designed to be able to run queries on databases without any configuration by simple HTTP call.

write APIs using direct SQL queries with no hassle, let's rethink about SQL

Lightweight SQL database written in Go for prototyping and playing with text (CSV, JSON) data

xlsx2mysql: An tool of helping your fastly generate SQL from Excel.

📊 parallel black box PostgreSQL unit tests run against a real database.

Querycrate - A simple library that loads and keeps SQL queries from files

Mergestat - a command-line tool for running SQL queries on git repositories and related data sources

pggen - generate type safe Go methods from Postgres SQL queries

Opionated sql formatter for use with .go files containing backticked queries

Dumpling is a fast, easy-to-use tool written by Go for dumping data from the database(MySQL, TiDB...) to local/cloud(S3, GCP...) in multifarious formats(SQL, CSV...).

Use SQL to data from CSV files. Open source CLI. No DB required.

Use SQL to instantly query users, groups, applications and more from Okta. Open source CLI. No DB required.

Use SQL to instantly query instances, networks, databases, and more from Scaleway. Open source CLI. No DB required.

Use SQL to query information including Users, Groups, Clients, Roles and more from Keycloak.