Run SQL queries against JSON, CSV, Excel, Parquet, and more.

Run SQL queries against JSON, CSV, Excel, Parquet, and more

This is a CLI companion to DataStation (a GUI) for running SQL queries against data files. So if you want the GUI version of this, check out DataStation.

Install

Get Go 1.17+ and then run:

$ go install github.com/multiprocessio/dsq@latest

Usage

You can either pipe data to dsq or you can pass a file name to it.

When piping data to dsq you need to specify the file extension or MIME type.

For example:

$ cat testdata.csv | dsq csv "SELECT * FROM {} LIMIT 1"

Or:

$ cat testdata.parquet | dsq parquet "SELECT COUNT(1) FROM {}"

If you are passing a file, it must have the usual extension for its content type.

For example:

$ dsq testdata.json "SELECT * FROM {} WHERE x > 10"

Or:

$ dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"

Transforming data to JSON without querying

As a shorthand for dsq testdata.csv "SELECT * FROM {}" to convert supported file types to JSON you can skip the query and the converted JSON will be dumped to stdout.

For example:

$ dsq testdata.csv
[{...some csv data...},{...some csv data...},...]

Supported Data Types

Name File Extension(s) Notes
CSV csv
TSV tsv, tab
JSON json Must be an array of objects. Nested object fields are ignored.
Newline-delimited JSON ndjson, jsonl
Parquet parquet
Excel xlsx, xls Currently only works if there is only one sheet.
ODS ods Currently only works if there is only one sheet.
Apache Error Logs text/apache2error Currently only works if being piped in.
Apache Access Logs text/apache2access Currently only works if being piped in.
Nginx Access Logs text/nginxaccess Currently only works if being piped in.

Engine

Under the hood dsq uses DataStation as a library and under that hood DataStation uses SQLite to power these kinds of SQL queries on arbitrary (structured) data.

Comparisons

The speed column is based on rough benchmarks based on q's benchmarks. Eventually I'll do a more thorough and public benchmark.

| Name | Link | Speed | Supported File Types | Engine | |----|-|-|-|-|------------------------------------------------------------------------| | q | http://harelba.github.io/q/ | Fast | CSV, TSV | Uses SQLite | | textql | https://github.com/dinedal/textql | Ok | CSV, TSV | Uses SQLite | | octoql | https://github.com/cube2222/octosql | Slow | JSON, CSV, Excel, Parquet | Custom engine missing many features from SQLite | | dsq | Here | Ok | CSV, TSV, JSON, Newline-delimited JSON, Parquet, Excel, ODS (OpenOffice Calc), Logs | Uses SQLite |

Community

Join us on Discord.

How can I help?

Download the app and use it! Report bugs on Discord.

Before starting on any new feature though, check in on Discord!

Subscribe

If you want to hear about new features and how DataStation works under the hood, sign up here.

License

This software is licensed under an Apache 2.0 license.

Owner
Comments
  • Support caching

    Support caching

    • Add "--cache"" flag
    • Create sqlite db file on disk with name as the hash of its content.
    • Add test entry.
    • This will fail tests until the changes in datastation is committed.

    Closes #29

  • Add support for repl

    Add support for repl

    Grab https://github.com/chzyer/readline and wrap main.go's query parts (likely line 469-503) in a loop. Allow the loop to keep going if the user passes an -i flag or --interactive flag.

    When this flag is in, caching must also be turned on so that the sqlite database is not ephemeral per run (I think).

  • sqlite3 one-liner(ish) for benchmarking

    sqlite3 one-liner(ish) for benchmarking

    Follows #69 - the README at https://github.com/multiprocessio/dsq/tree/b7af2679038f8b09802c88f2f088bcc78fa8872c#benchmark says sqlite3 can't do one-liners, but it sort-of-can - this works for running the taxi query:

    sqlite3 :memory: <<EOS
    .mode csv
    .import taxi.csv taxi
    SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count
    EOS
    
  • Parquet is missing rows

    Parquet is missing rows

    I have a Parquet file that should have 30,000+ rows, but SELECT COUNT(*) FROM {} returns 7000. Another one with more than 40,000 rows returns exactly 8000. Converting the same data to JSON works fine.

  • "go install @latest" not working

    I tried installing latest (c6daddfd43b61c1a932404d88dd6e68035b0dd01) with go install github.com/multiprocessio/dsq@latest but it failed with

    go: downloading github.com/multiprocessio/dsq v0.0.0-20220515195645-c6daddfd43b6
    go: github.com/multiprocessio/dsq@latest (in github.com/multiprocessio/[email protected]):
    	The go.mod file for the module providing named packages contains one or
    	more replace directives. It must not contain directives that would cause
    	it to be interpreted differently than if it were the main module.
    

    however, this older version's install command worked go install github.com/multiprocessio/dsq@4ca4a314a0380c55ab9fc12bb03bc2c998d5580c

    I think the issue is related to this line https://github.com/multiprocessio/dsq/commit/4140b05c4e9fd6efbcedf23e58d6025385dbc42b#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6R5

  • Repl support

    Repl support

    Issue #53

    I implemented it with https://github.com/chzyer/readline.

    What i did :

    • I refactored line 469 to 503 into a function called runQuery()
    • I added a new field in args called isInteractive
    • I added a new function repl()

    If args.isInteractive is true repl() is called otherwise runQuery() is run directly

    if args.isInteractive {
    	return repl(project, &ec, args, files)
    }
    return runQuery(lastNonFlagArg, project, &ec, args, files)
    
  • Can't run SQL on log files.

    Can't run SQL on log files.

    Describe the bug and expected behavior

    Getting error while running SQL query on log files like auth.log. Also tried with the file testdata/logfmt/log.logfmt (https://github.com/multiprocessio/dsq/blob/main/testdata/logfmt/log.logfmt). Is it possible to run SQL quries on the log files?

    Reproduction steps

    # cat log.logfmt | dsq -s text/logfmt 'SELECT level FROM {}'
    Input is not an array of objects: /tmp/dsq-stdin3752862510.
    
    # cat /var/log/auth.log | dsq -s text/logfmt 'SELECT * FROM {}'
    Input is not an array of objects: /tmp/dsq-stdin1533956686.
    

    Versions

    • OS: Ubuntu 22.04 LTS (Jammy Jellyfish)
    • Shell: bash
    • dsq version: dsq 0.20.2
  • suggestion to use v prefix for tags

    suggestion to use v prefix for tags

    Hi! Based on this approach https://github.com/golang/go/wiki/Modules#how-can-i-track-tool-dependencies-for-a-module to track tool dependencies for a module, as an example, I found a slight inconvenience when working with dsq, because dsq uses a different style to name tags. For go mod, it's recommended by this https://go.dev/ref/mod#versions

    Each version starts with the letter v, followed by a semantic version

    As a result, I have a pseudo-version like this instead of v0.21.0. It's challenging to understand which version is under the hood, actually. Is it possible to change the naming convention for tags?

  • How to select a column name containing spaces?

    How to select a column name containing spaces?

    In the attached xlsx sheet, the header contains a column with spaces. How can I select this column? Writing dsq space.xlsx "SELECT \"Header with Space\" as header" yields:

    [{"header":"Header with Space"}]
    

    which is not what I meant.

  • Support of an input SQL file

    Support of an input SQL file

    Like jq or some other CLIs, there is a flag for an external file -f.

    For example, with jq, we can use jq -f script.jq

    That will be really useful in some cases, for example, when you want to define a lot of queries.

    dsq -f query.sql file1.json file2.json file3.json

    Thank you

  • Building with -buildmode=pie exposes crash in parquet test

    Building with -buildmode=pie exposes crash in parquet test

    See https://github.com/multiprocessio/dsq/issues/15 for the original report.

    This crash shows up when you go build -buildmode=pie && ./scripts/test.py. This crash does not happen without -buildmode=pie.

    panic: runtime error: index out of range [576457833716731764] with length 117670
    
    goroutine 1 [running]:
    github.com/goccy/go-json/internal/encoder.CompileToGetCodeSet(0xc000f70f90?, 0x55b1294306cc?)
          github.com/goccy/[email protected]/internal/encoder/compiler_norace.go:11 +0x1df
    github.com/goccy/go-json.encode(0xc001161ba0, {0xc0009540c0, 0xc00112a750})
          github.com/goccy/[email protected]/encode.go:224 +0xd0
    github.com/goccy/go-json.marshal({0xc0009540c0, 0xc00112a750}, {0x0, 0x0, 0x1?})
          github.com/goccy/[email protected]/encode.go:148 +0xba
    github.com/goccy/go-json.MarshalWithOption(...)
          github.com/goccy/[email protected]/json.go:186
    github.com/goccy/go-json.Marshal({0xc0009540c0?, 0xc00112a750?})
          github.com/goccy/[email protected]/json.go:171 +0x2a
    github.com/multiprocessio/go-json.(*StreamEncoder).EncodeRow(0xc000958060, {0xc0009540c0?, 0xc00112a750})
          github.com/multiprocessio/[email protected]/encoder.go:57 +0x1dd
    github.com/multiprocessio/datastation/runner.transformParquet.func1(0x0?)
          github.com/multiprocessio/datastation/[email protected]/file.go:121 +0xc6
    github.com/multiprocessio/datastation/runner.withJSONArrayOutWriter({0x55b12b25b338?, 0xc000011218}, 0xc000f71288)
          github.com/multiprocessio/datastation/[email protected]/json.go:36 +0xf6
    github.com/multiprocessio/datastation/runner.withJSONArrayOutWriterFile(...)
          github.com/multiprocessio/datastation/[email protected]/json.go:51
    github.com/multiprocessio/datastation/runner.transformParquet({0x55b12b26a2c0?, 0xc000c35788?}, {0x55b12b25b338, 0xc000011218})
          github.com/multiprocessio/datastation/[email protected]/file.go:106 +0xd8
    github.com/multiprocessio/datastation/runner.transformParquetFile({0x7ffddb498a31?, 0x1b?}, {0x55b12b25b338, 0xc000011218})
          github.com/multiprocessio/datastation/[email protected]/file.go:143 +0xec
    github.com/multiprocessio/datastation/runner.TransformFile({0x7ffddb498a31, 0x1b}, {{0x0?, 0x1ff?}, {0x0?, 0xc000aff440?}}, {0x55b12b25b338, 0xc000011218})
          github.com/multiprocessio/datastation/[email protected]/file.go:594 +0x1e5
    main.evalFileInto({0x7ffddb498a31, 0x1b}, 0x0?)
          github.com/multiprocessio/dsq/main.go:47 +0xc5
    main._main()
          github.com/multiprocessio/dsq/main.go:241 +0xaec
    main.main()
          github.com/multiprocessio/dsq/main.go:381 +0x19
    
Single binary CLI for generating structured JSON, CSV, Excel, etc.

fakegen: Single binary CLI for generating a random schema of M columns to populate N rows of JSON, CSV, Excel, etc. This program generates a random sc

Dec 26, 2022
Get data from .csv files use SQL-like queries.

csvql Get data from .csv files use SQL-like queries. Задание Необходимо написать консольную программу, которая по заданному клиентом запросу осуществл

Dec 7, 2021
A tool to run queries in defined frequency and expose the count as prometheus metrics. Supports MongoDB and SQL
A tool to run queries in defined frequency and expose the count as prometheus metrics. Supports MongoDB and SQL

query2metric A tool to run db queries in defined frequency and expose the count as prometheus metrics. Why ? Product metrics play an important role in

Jul 1, 2022
sqlbench measures and compares the execution time of one or more SQL queries.

sqlbench sqlbench measures and compares the execution time of one or more SQL queries. The main use case is benchmarking simple CPU-bound query varian

Dec 6, 2022
Parses a file and associate SQL queries to a map. Useful for separating SQL from code logic

goyesql This package is based on nleof/goyesql but is not compatible with it any more. This package introduces support for arbitrary tag types and cha

Oct 20, 2021
SQL API is designed to be able to run queries on databases without any configuration by simple HTTP call.

SQL API SQL API is designed to be able to run queries on databases without any configuration by simple HTTP call. The request contains the DB credenti

Dec 2, 2022
write APIs using direct SQL queries with no hassle, let's rethink about SQL

SQLer SQL-er is a tiny portable server enables you to write APIs using SQL query to be executed when anyone hits it, also it enables you to define val

Jan 7, 2023
Lightweight SQL database written in Go for prototyping and playing with text (CSV, JSON) data

gopicosql Lightweight SQL database written in Go for prototyping and playing wit

Jul 27, 2022
xlsx2mysql: An tool of helping your fastly generate SQL from Excel.

xlsx2mysql An tool of helping your fastly generate SQL from Excel 中文文档 Origin In order to convert Excel to MySQL and I made a tool to implement.But Wh

Nov 13, 2021
📊 parallel black box PostgreSQL unit tests run against a real database.
📊   parallel black box PostgreSQL unit tests run against a real database.

?? psql-docker-tests-example Parallel black box PostgreSQL unit tests run against a real database. Consider reading the Medium Story first. This packa

Sep 15, 2022
Querycrate - A simple library that loads and keeps SQL queries from files

QueryCrate A simple library for loading & getting string queries from files. How

Feb 15, 2022
Mergestat - a command-line tool for running SQL queries on git repositories and related data sources
Mergestat - a command-line tool for running SQL queries on git repositories and related data sources

Query git repositories with SQL. Generate reports, perform status checks, analyze codebases. ?? ??

Dec 30, 2022
pggen - generate type safe Go methods from Postgres SQL queries

pggen - generate type safe Go methods from Postgres SQL queries pggen is a tool that generates Go code to provide a typesafe wrapper around Postgres q

Jan 3, 2023
Opionated sql formatter for use with .go files containing backticked queries

fumpt-the-sql Opionated sql formatter for use with .go files containing backticked queries. Uses https://sqlformat.darold.net/ for the actual sql form

Dec 10, 2021
Dumpling is a fast, easy-to-use tool written by Go for dumping data from the database(MySQL, TiDB...) to local/cloud(S3, GCP...) in multifarious formats(SQL, CSV...).

?? Dumpling Dumpling is a tool and a Go library for creating SQL dump from a MySQL-compatible database. It is intended to replace mysqldump and mydump

Nov 9, 2022
Use SQL to data from CSV files. Open source CLI. No DB required.
Use SQL to data from CSV files. Open source CLI. No DB required.

CSV Plugin for Steampipe Use SQL to query data from CSV files. Get started → Documentation: Table definitions & examples Community: Slack Channel Get

Nov 10, 2022
Use SQL to instantly query users, groups, applications and more from Okta. Open source CLI. No DB required.
Use SQL to instantly query users, groups, applications and more from Okta. Open source CLI. No DB required.

Okta Plugin for Steampipe Use SQL to query infrastructure including users, groups, applications and more from Okta. Get started → Documentation: Table

Nov 10, 2022
Use SQL to instantly query instances, networks, databases, and more from Scaleway. Open source CLI. No DB required.
Use SQL to instantly query instances, networks, databases, and more from Scaleway. Open source CLI. No DB required.

Scaleway Plugin for Steampipe Use SQL to query infrastructure servers, networks, databases and more from your Scaleway project. Get started → Document

Nov 16, 2022
Use SQL to query information including Users, Groups, Clients, Roles and more from Keycloak.

Keycloak Plugin for Steampipe [WIP] THIS IS NOT ACTIVE NOR WORKING YET - DO NOT USE Use SQL to query information including Users, Groups, Clients, Rol

Jan 6, 2023