OctoSQL

OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases, streaming sources and file formats using SQL.

Problems OctoSQL Solves

  • You need to join / analyze data from multiple datasources.
    • Think of enriching an Excel file by joining it with a PostgreSQL database.
  • You need to aggregate streams over time, with live output updates.
    • Think of a live-updated leaderboard with cat images based on a "like" event stream.
  • You need to aggregate streams per time window, with live output updates.
    • Think of a unique user count per hour, per country live summary.

What is OctoSQL?

OctoSQL is a SQL query engine which allows you to write standard SQL queries on data stored in multiple SQL databases, NoSQL databases, streaming sources and files in various formats, trying to push down as much of the work as possible to the source databases and avoiding the transfer of unnecessary data.

OctoSQL does that by creating an internal representation of your query and later translating parts of it into the query languages or APIs of the source databases. Whenever a datasource doesn't support a given operation, OctoSQL will execute it in memory, so you don't have to worry about the specifics of the underlying datasources.
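
For instance (a hypothetical sketch, assuming the configuration defines an orders PostgreSQL table and a customers CSV file), the filter below can be translated into a WHERE clause that PostgreSQL executes itself, while the join and grouping run inside OctoSQL:

-- Hypothetical cross-source query: "orders" lives in PostgreSQL, "customers" in a CSV file.
-- The o.status = 'paid' predicate is pushed down to PostgreSQL, so only matching rows are transferred.
SELECT c.name, SUM(o.total) revenue
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'paid'
GROUP BY c.name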

OctoSQL also includes temporal SQL extensions, to operate ergonomically on streams and respect their event-time (not the current system-time when the records are being processed).

With OctoSQL you don't need O(n) client tools or a large data analysis system deployment. Everything's contained in a single binary.

Why the name?

OctoSQL stems from Octopus SQL.

Octopus, because octopi have many arms, so they can grasp and manipulate multiple objects, like OctoSQL is able to handle multiple datasources simultaneously.

Installation

Either download the binary for your operating system (Linux, OS X and Windows are supported) from the Releases page, or install using the go command line tool:

GO111MODULE=on go get -u github.com/cube2222/octosql/cmd/octosql

Quickstart

Let's say we have a csv file with cats, and a redis database with people (potential cat owners). Now we want to get a list of cities with the number of distinct cat names in them and the cumulative number of cat lives (as each cat has up to 9 lives left).

First, create a configuration file (see Configuration Syntax). For example:

dataSources:
  - name: cats
    type: csv
    config:
      path: "~/Documents/cats.csv"
  - name: people
    type: redis
    config:
      address: "localhost:6379"
      password: ""
      databaseIndex: 0
      databaseKeyName: "id"

Then, set the OCTOSQL_CONFIG environment variable to point to the configuration file.

export OCTOSQL_CONFIG=~/octosql.yaml

You can also use the --config command line argument.
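
For example, assuming the configuration file above is saved at ~/octosql.yaml:

octosql --config ~/octosql.yaml "SELECT c.name, c.livesleft FROM cats c LIMIT 5"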

Finally, query to your heart's desire:

octosql "SELECT p.city, FIRST(c.name), COUNT(DISTINCT c.name) cats, SUM(c.livesleft) catlives
FROM cats c JOIN people p ON c.ownerid = p.id
GROUP BY p.city
ORDER BY catlives DESC
LIMIT 9"

Example output:

+---------+--------------+------+----------+
| p.city  | c.name_first | cats | catlives |
+---------+--------------+------+----------+
| Warren  | Zoey         |   68 |      570 |
| Gadsden | Snickers     |   52 |      388 |
| Staples | Harley       |   54 |      383 |
| Buxton  | Lucky        |   45 |      373 |
| Bethany | Princess     |   46 |      366 |
| Noxen   | Sheba        |   49 |      361 |
| Yorklyn | Scooter      |   45 |      359 |
| Tuttle  | Toby         |   57 |      356 |
| Ada     | Jasmine      |   49 |      351 |
+---------+--------------+------+----------+

You can choose between the live-table, batch-table, live-csv, batch-csv and stream-json output formats. (The live-* formats repeatedly update the terminal view every second, the batch-* formats write the output once before exiting, and the stream-* formats print records whenever they become available.)
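
For example, assuming the output format is selected with the --output command line argument (an assumption; check octosql --help for the exact flag), a one-off CSV export might look like:

octosql --output batch-csv "SELECT p.city, COUNT(*) cats FROM cats c JOIN people p ON c.ownerid = p.id GROUP BY p.city" > cats_per_city.csv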

Temporal SQL Features

OctoSQL features temporal SQL extensions inspired by the paper One SQL to Rule Them All.

Introduction

Often when you're working with streams of events, you'd like to use the time dimension somehow:

  • Calculate average values for a day sliced by hours.
  • Get unique user counts per day.
  • and others

All those examples have one thing in common: The time value of an event is crucial for correctness.

A naive system could just use the current clock time whenever it receives an event. The correctness of this approach, however, degrades quickly in the face of network problems, delivery delays and clock skew.

This can be solved by using a value from the event itself as its time value. A new problem arises though: how do I know that I've received all events up to time X and can publish results for a given hour? You never know whether there's still a delayed event somewhere which should be factored in.

This is where watermarks come into play.

Watermarks

Watermarks are a heuristic which tries to approximate the "current time" when processing events. Said differently: when I receive a watermark for 12:00, I can be reasonably sure I've received all events of interest up to 12:00.

To achieve this, they are generated at streaming sources and propagate downstream through the whole processing pipeline.

The generation of watermarks usually relies on heuristics which provide satisfactory results for our given use case. OctoSQL currently contains the following watermark generators:

  • Maximum difference watermark generator (with an offset argument)

    With an offset of 10 seconds, this generator says: When I've received an event for 12:00:00, then I'm sure I won't receive any event older than 11:59:50.

  • Percentile watermark generator (with a percentile argument)

    With a percentile of 99.5, it will look at a specified number of recent events, and generate a watermark so that 99.5% of those events are after the watermark (not yet triggered), and the remaining 0.5% are before it. This way we set the watermark so that only a fraction of the recently seen events is potentially ignored as being late.

Watermark generators are specified using table valued functions and are documented in the wiki.
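
For instance, mirroring the syntax used in the Example section below, the maximum difference watermark generator is invoked as a table valued function (the events table and its time field are assumed to be defined in your configuration):

SELECT *
FROM max_diff_watermark(source=>TABLE(events),
                        offset=>INTERVAL 10 SECONDS,
                        time_field=>DESCRIPTOR(time)) e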

Triggers

Another matter is the triggering of keys in aggregations. Sometimes you'd like to only see the value for a given key (hour) when you know it's done, but other times you'd like to see partial results (how is the unique user count doing this hour?).

That's where you can use triggers. Triggers allow you to specify when a given aggregate (or join window for that matter) is emitted or updated. OctoSQL contains multiple triggers:

  • Watermark Trigger

    This is the most straightforward trigger. It emits a value whenever the watermark for a given key (or the end of the stream) is reached. Basically: "show me when it's done".

  • Counting Trigger (with a count argument)

    This trigger will emit a value for a key every time it receives count records with this key. The count is reset whenever the key is triggered.

  • Delay Trigger (with a delay argument)

    This trigger will emit a value for a key whenever the key has been inactive for the delay period.

You can use multiple triggers simultaneously. (Show me the current sum every 10 received events, but also the final value after having received the watermark.)
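
Using the syntax from the Example section below, here is a sketch of a group by with both a counting trigger and a watermark trigger (with_tumble being the windowed stream defined in that example):

SELECT e.window_end, e.team, COUNT(*) goals
FROM with_tumble e
GROUP BY e.window_end, e.team TRIGGER COUNTING 10, ON WATERMARK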

Retractions

A key can be triggered multiple times with partial results. How do we know a given record is a retriggering of some key, and not a new unrelated record?

OctoSQL solves this problem using a dataflow-like architecture. This means that whenever a new value is sent for a key, a retraction is sent for the old value. In practice, every update is accompanied by the old record with an undo flag set.

This can be visible when using a stream-* output format with partial results.

Example

Now we can see how it all fits together. In this example we have an events file, which contains records about points being scored in a game by multiple teams.

WITH
      with_watermark AS (SELECT *
                         FROM max_diff_watermark(source=>TABLE(events),
                                                 offset=>INTERVAL 5 SECONDS,
                                                 time_field=>DESCRIPTOR(time)) e),
      with_tumble AS (SELECT *
                      FROM tumble(source=>TABLE(with_watermark),
                                  time_field=>DESCRIPTOR(e.time),
                                  window_length=> INTERVAL 1 MINUTE,
                                  offset => INTERVAL 0 SECONDS) e),
      counts_per_team AS (SELECT e.window_end, e.team, COUNT(*) as goals
                          FROM with_tumble e
                          GROUP BY e.window_end, e.team TRIGGER COUNTING 100, ON WATERMARK)
SELECT *
FROM counts_per_team cpt
ORDER BY cpt.window_end DESC, cpt.goals ASC, cpt.team DESC

We use common table expressions to break the query up into multiple stages.

First we create the with_watermark intermediate table/stream. Here we use the table valued function max_diff_watermark to add watermarks to the events table - with an offset of 5 seconds based on the time record field.

Then we use this intermediate table to create the with_tumble table, where we use the tumble table valued function to add a window_start and window_end field to each record, based on the record's time field. This assigns the records to 1 minute long windows.

Next we create the counts_per_team table, which groups the records by their window end and team.

Finally, we order those results by window end, goal count and team.

Durability

OctoSQL in its current design is based on on-disk transactional storage.

All state is saved this way. All interactions with datasources are designed so that no records get duplicated in the face of errors or application restarts.

You can also kill the OctoSQL process and start it again with the same query and storage-directory (command line argument); it will pick up where it left off.

By default, OctoSQL will create a temporary directory for the state and delete it after termination.
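
For example, to pin the state to a known location and be able to resume an interrupted run (assuming the storage directory is passed as the --storage-directory flag mentioned above):

octosql --storage-directory ./query_state "..."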

Configuration

The configuration file has the following form:

dataSources:
  - name: <table_name_in_octosql>
    type: <datasource_type>
    config:
      <datasource_specific_key>: <datasource_specific_value>
      <datasource_specific_key>: <datasource_specific_value>
      ...
  - name: <table_name_in_octosql>
    type: <datasource_type>
    config:
      <datasource_specific_key>: <datasource_specific_value>
      <datasource_specific_key>: <datasource_specific_value>
      ...
    ...
physical:
  physical_plan_option: <value>

Available OctoSQL-wide configuration options are:

  • physical
    • groupByParallelism: The parallelism of group by's and distinct queries. Will default to the CPU core count of your machine.
    • streamJoinParallelism: The parallelism of streaming joins. Will default to the CPU core count of your machine.
  • execution
    • lookupJoinPrefetchCount: The count of simultaneously processed records in a lookup join.
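
For example, a configuration file overriding these defaults might end with the following (a sketch; the values are placeholders and the top-level execution section is assumed to sit next to physical):

physical:
  groupByParallelism: 8
  streamJoinParallelism: 8
execution:
  lookupJoinPrefetchCount: 32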

Supported Datasources

JSON

JSON file in one of the following forms:

  • one record per line, no commas
  • JSON list of records

options:
  • path - path to file containing the data, required
  • arrayFormat - if the JSON list of records format should be used, optional: defaults to false
  • batchSize - number of records extracted from json file in one storage transaction, optional: defaults to 1000

CSV

CSV file separated using commas.
The file may or may not have column names as its first row.

options:
  • path - path to file containing the data, required
  • headerRow - whether the first row of the CSV file contains column names or not, optional: defaults to true
  • separator - columns separator, optional: defaults to ","
  • batchSize - number of records extracted from csv file in one storage transaction, optional: defaults to 1000

Excel

A single table in an Excel spreadsheet.
The table may or may not have column names as its first row.
The table can be in any sheet and start at any point, but it cannot contain empty columns or empty rows within it.

options:
  • path - path to file, required
  • headerRow - does the first row contain column names, optional: defaults to true
  • sheet - name of the sheet in which data is stored, optional: defaults to "Sheet1"
  • rootCell - name of the cell (e.g. "A3", "BA14") which is the leftmost cell of the first row, optional: defaults to "A1"
  • timeColumns - a list of columns to parse as datetime values with second precision, optional: defaults to []
  • batchSize - number of records extracted from excel file in one storage transaction, optional: defaults to 1000

Parquet

A single Parquet file.
Repeated or nested elements are supported; elements that are both nested and repeated are not.
The following logical types are currently unsupported and will be parsed as the underlying primitive type:
     - ENUM
     - TIME with NANOS precision
     - TIMESTAMP with NANOS precision (both UTC and non-UTC)
     - INTERVAL
     - MAP

options:
  • path - path to file, required
  • batchSize - number of records extracted from parquet file in one storage transaction, optional: defaults to 1000

PostgreSQL

Single PostgreSQL database table.

options:
  • address - address including port number, optional: defaults to localhost:5432
  • user - required
  • password - required
  • databaseName - required
  • tableName - required
  • batchSize - number of records extracted from PostgreSQL database in one storage transaction, optional: defaults to 1000
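
A PostgreSQL entry in the configuration file might look like this (a sketch analogous to the Quickstart example; the type string and credential values are assumptions):

dataSources:
  - name: orders
    # type and credential values below are assumptions - adjust to your setup
    type: postgres
    config:
      address: "localhost:5432"
      user: "octosql"
      password: "secret"
      databaseName: "shop"
      tableName: "orders"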

MySQL

Single MySQL database table.

options:
  • address - address including port number, optional: defaults to localhost:3306
  • user - required
  • password - required
  • databaseName - required
  • tableName - required
  • batchSize - number of records extracted from MySQL database in one storage transaction, optional: defaults to 1000

Redis

Redis database with the given index. Currently only hashes are supported.

options:
  • address - address including port number, optional: defaults to localhost:6379
  • password - optional: defaults to ""
  • databaseIndex - index number of Redis database, optional: defaults to 0
  • databaseKeyName - column name of Redis key in OctoSQL records, optional: defaults to "key"
  • batchSize - number of records extracted from Redis database in one storage transaction, optional: defaults to 1000

Kafka

A multi-partition Kafka topic.

options:
  • brokers - list of broker addresses (separately hosts and ports) used to connect to the kafka cluster, optional: defaults to ["localhost:9092"]
  • topic - name of topic to read messages from, required
  • partitions - topic partition count, optional: defaults to 1
  • startOffset - offset from which the first batch of messages will be read, optional: defaults to -1
  • batchSize - number of records extracted from Kafka in one storage transaction, optional: defaults to 1
  • json - should the messages be decoded as JSON, optional: defaults to false
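
A Kafka entry might look like this (a sketch; the type string and values are assumptions, and the events name matches the stream used in the temporal Example above):

dataSources:
  - name: events
    # type string and values below are assumptions - adjust to your setup
    type: kafka
    config:
      brokers:
        - "localhost:9092"
      topic: "game_events"
      partitions: 3
      json: true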

Documentation

Documentation for the available functions: https://github.com/cube2222/octosql/wiki/Function-Documentation

Documentation for the available aggregates: https://github.com/cube2222/octosql/wiki/Aggregate-Documentation

Documentation for the available triggers: https://github.com/cube2222/octosql/wiki/Trigger-Documentation

Documentation for the available table valued functions: https://github.com/cube2222/octosql/wiki/Table-Valued-Functions-Documentation

The SQL dialect documentation: TODO ;) in short though:

Available SQL constructs: Select, Where, Order By, Group By, Offset, Limit, Left Join, Right Join, Inner Join, Distinct, Union, Union All, Subqueries, Operators, Table Valued Functions, Trigger, Common Table Expressions.

Available SQL types: Int, Float, String, Bool, Time, Duration, Tuple (array), Object (e.g. JSON)

Describe

You can describe the current plan in graphviz format using the --describe flag, like this:

octosql "..." --describe | dot -Tpng > output.png

Architecture

An OctoSQL invocation gets processed in multiple phases.

SQL AST

First, the SQL query gets parsed into an abstract syntax tree. This phase only rules out syntax errors.

Logical Plan

The SQL AST gets converted into a logical query plan. This conversion is still mostly a syntactic validation. It's the most naive possible translation of the SQL query; however, this plan already has more of a map-filter-reduce form.

If you wanted to add a new query language to OctoSQL, the only problem you'd have to solve is translating it to this logical plan.

Physical Plan

The logical plan gets converted into a physical plan. This conversion catches any semantic errors in the query. If this phase is reached, then the input is correct and OctoSQL will be able to execute it.

This phase already understands the specifics of the underlying datasources. So it's here where the optimizer will iteratively transform the plan, pushing computation nodes down to the datasources, and deduplicating unnecessary parts.

The optimizer uses a pattern matching approach: it has rules that match parts of the physical plan tree and describe how those patterns can be restructured into a more efficient version. The rules are meant to be as simple as possible and make the smallest possible changes - for example, pushing filters below maps if they don't use any mapped variables. This way, the optimizer just keeps iterating on the whole tree until it can't change anything anymore (each iteration tries to apply each rule in each possible place in the tree). This ensures that the plan reaches a local optimum, and the rules should be structured so that this local optimum is equal - or close - to the global one (i.e. one optimization shouldn't make another, much more useful, one impossible).
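
As an illustration (a hypothetical query over the Quickstart tables), the outer filter below references only x.city and not the computed expression, so a rule can move it below the inner map; for datasources that support equality pushdown it can then travel further down into the datasource itself:

-- Hypothetical query: the WHERE clause doesn't use the mapped "optimistic_lives" expression,
-- so it can be pushed below the projection and towards the people datasource.
SELECT *
FROM (SELECT p.city, c.livesleft + 1 optimistic_lives
      FROM cats c JOIN people p ON c.ownerid = p.id) x
WHERE x.city = 'Warren'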

Here is an example diagram of an optimized physical plan: Physical Plan

Execution Plan

The physical plan gets materialized into an execution plan. This phase has to be able to connect to the actual datasources. It may initialize connections, open files, etc.

Stream

Starting the execution plan creates a stream, which underneath may hold more streams, or parts of the execution plan to create streams in the future. This stream works in a pull based model.

Datasource Pushdown Operations

Datasource   Equality    In          > < <= >=
MySQL        supported   supported   supported
PostgreSQL   supported   supported   supported
Redis        supported   supported   scan
Kafka        scan        scan        scan
Parquet      scan        scan        scan
JSON         scan        scan        scan
CSV          scan        scan        scan

Where scan means that the whole table needs to be scanned for each access.

Telemetry

OctoSQL sends application telemetry on each run to help us gauge user interest and feature use. This way we know somebody uses our software, feel that our work is actually useful, and can prioritize features based on actual usage.

You can turn it off (though please don't) by setting the OCTOSQL_TELEMETRY environment variable to 0. Telemetry is also fully printed in the output log of OctoSQL, if you want to see what precisely is being sent.
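
For example, to disable it for the current shell session:

export OCTOSQL_TELEMETRY=0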

Roadmap

  • Additional Datasources.
  • SQL Constructs:
    • JSON Query
    • HAVING, ALL, ANY
  • Push down functions, aggregates to databases that support them.
  • An in-memory index to save values of subqueries and save on rescanning tables which don't support a given operation, so as not to recalculate them each time.
  • Runtime statistics
  • Server mode
  • Querying a json or csv table from standard input.
  • Integration test suite
  • Tuple splitter, returning the row for each tuple element, with the given element instead of the tuple.

Owner

Jacob Martin
LinkedIn: www.linkedin.com/in/jakubmartin

Comments

  • mysql query error

    CREATE TABLE test_config ( id bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'auto-increment id', test_config_id bigint(20) NOT NULL COMMENT '', test_id varchar(64) NOT NULL COMMENT 'experiment id', test_name varchar(64) NOT NULL COMMENT '', flow_id bigint(20) NOT NULL COMMENT '', start_time datetime NOT NULL COMMENT 'start time', end_time datetime NOT NULL COMMENT 'end time', test_status bigint(20) NOT NULL DEFAULT '0', creater varchar(16) NOT NULL COMMENT 'creator', group_info varchar(1024) NOT NULL COMMENT '', sys_ctime datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time', sys_utime datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'modification time', is_del tinyint(4) DEFAULT '0' COMMENT 'delete flag 0: not deleted 1: deleted', PRIMARY KEY (id), UNIQUE KEY uniq_config_id (test_config_id), UNIQUE KEY uniq_test_id (test_status,flow_id) ) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COMMENT='experiment configuration table';

    sql: select * from mysql_test t where t.creater='will' sql2: select * from mysql_test t where t.test_name='test'

    When my query condition is a string, I also get an error: couldn't get record stream from execution plan: couldn't query statement: sql: converting argument $1 type: unsupported type octosql.String, a string

    but with sql: select * from mysql_test t where t.id=7 or select * from mysql_test t where t.flow_id=1111

    when my query condition is an int, it shows a successful result: {"t.creater":"williehu","t.end_time":{},"t.flow_id":10003,"t.group_info":"test","t.id":7,"t.is_del":0,"t.start_time":{},"t.sys_ctime":{},"t.sys_utime":{},"t.test_config_id":121321,"t.test_id":"mongo","t.test_name":"test","t.test_status":0}

    Is there any config that must be set?

  • Cannot use subquery

    Hi, I try to query using a subquery like this: SELECT pta2.*, tam1.created_at as ticket_agent_mapping_create FROM premier_ticket_aml pta2 LEFT JOIN ( SELECT * FROM ticket_agent_mapping tam WHERE tam.agent_role = 'aml_agent' ) as tam1 ON pta2.ticket_id = tam1.ticket_id ORDER BY pta2.id ASC; But it doesn't work. Did I do something wrong?

  • "Error: couldn't materialize physical plan" with mysql (mariadb)

    :; octosql 'select * from magento.eav_attribute limit 10'
    Usage:
      octosql <query> [flags]
      octosql [command]
    
    Examples:
    octosql "SELECT * FROM myfile.json"
    octosql "SELECT * FROM mydir/myfile.csv"
    octosql "SELECT * FROM plugins.plugins"
    
    Available Commands:
      completion  Generate the autocompletion script for the specified shell
      help        Help about any command
      plugin      
    
    Flags:
          --describe         Describe query output schema.
          --explain int      Describe query output schema.
      -h, --help             help for octosql
          --optimize         Whether OctoSQL should optimize the query. (default true)
          --output string    Output format to use. Available options are live_table, batch_table, csv, json and stream_native. (default "live_table")
          --profile string   Enable profiling of the given type: cpu, memory, trace.
      -v, --version          version for octosql
    
    Use "octosql [command] --help" for more information about a command.
    
    Error: couldn't materialize physical plan: couldn't materialize plugin datasource: rpc error: code = Unknown desc = couldn't materialize datasource: couldn't prepare statement 'SELECT , attribute_code, , attribute_model, , backend_model, , backend_type, , backend_table, , frontend_model, , frontend_input, , frontend_label, , frontend_class, , source_model, , default_value, , note FROM eav_attribute WHERE (TRUE)': Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ' attribute_code, , attribute_model, , backend_model, , backend_type, , backen...' at line 1
    

    Any thoughts? Looks like it's generating invalid SQL there

  • Cannot install from source due to replace directive

    Installing a module-aware binary without modules has been blocked since Go 1.13 if I remember correctly.

    ❯ go install -u github.com/cube2222/octosql
    flag provided but not defined: -u
    usage: go install [build flags] [packages]
    Run 'go help install' for details.
    ❯ go install github.com/cube2222/octosql    
    no required module provides package github.com/cube2222/octosql; to add it:
            go get github.com/cube2222/octosql
    

    The suggested command downloads the source but doesn't build it.

    Replacing that with the new command using module installation (with a version) also does not work because replace directives are not supported with the install command.

    ❯ go install github.com/cube2222/octosql@latest
    go: github.com/cube2222/octosql@latest (in github.com/cube2222/[email protected]):
            The go.mod file for the module providing named packages contains one or
            more replace directives. It must not contain directives that would cause
            it to be interpreted differently than if it were the main module.
    
  • Zero-config CSV queries

    I'm not sure how feasible this may be, but the old logparser program from Microsoft supported queries like

    Select * from test.csv
    

    And it would read from test.csv, without needing any config files.

  • Parquet datasource crashes on amd64 Ubuntu system without support for AVX2 instructions.

    On my Ubuntu 20 system with 16 GB of RAM and 2 TB of disk capacity, I installed GoLang:

    $ sudo apt update
    $ sudo apt install software-properties-common
    
    $ sudo add-apt-repository ppa:longsleep/golang-backports
    $ sudo apt update
    $ sudo apt install golang-go
    

    I then cloned the master branch of OctoSQL from today.

    $ git clone https://github.com/cube2222/octosql
    $ cd octosql
    

    I then dumped a 1.1B-row, 116 GB Snappy-compressed Parquet file from the latest version of ClickHouse.

    $ clickhouse client \
        -q "SELECT *
            FROM trips
            FORMAT Parquet" \
        > trips.parquet
    

    When I ran the SQL below I got the following error.

    $ OCTOSQL_NO_TELEMETRY=1 go run main.go "SELECT * FROM trips.parquet LIMIT 10"
    
    SIGILL: illegal instruction
    PC=0x8f0db5 m=7 sigcode=2
    instruction bytes: 0xc4 0xe2 0x7d 0x78 0x44 0x24 0x20 0xc5 0xfe 0x6f 0x8 0xc5 0xfe 0x6f 0x50 0x20
    
    goroutine 1 [running]:
    github.com/segmentio/parquet-go/internal/bits.countByte({0xc002a4a000, 0xffe1, 0xffe1}, 0x1)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/internal/bits/count_amd64.s:68 +0xf5 fp=0xc00020e0b0 sp=0xc00020e0a8 pc=0x8f0db5
    github.com/segmentio/parquet-go/internal/bits.CountByte(...)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/internal/bits/count.go:3
    github.com/segmentio/parquet-go.countLevelsEqual({0xc002a4a000?, 0x14315c0?, 0x54c286?}, 0x0?)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/page.go:244 +0x4f fp=0xc00020e100 sp=0xc00020e0b0 pc=0x9f2f2f
    github.com/segmentio/parquet-go.countLevelsNotEqual(...)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/page.go:248
    github.com/segmentio/parquet-go.(*Column).decodeDataPageV1(0xc000126dd0, {0xc000274840?}, 0xc0000c62a0)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/column.go:610 +0x414 fp=0xc00020e198 sp=0xc00020e100 pc=0x9ddf74
    github.com/segmentio/parquet-go.(*filePages).ReadPage(0xc0000c6230)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/file.go:517 +0x212 fp=0xc00020e2e0 sp=0xc00020e198 pc=0x9efa32
    github.com/segmentio/parquet-go.(*multiPages).ReadPage(0xc000032880)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/multi_row_group.go:223 +0x3f fp=0xc00020e328 sp=0xc00020e2e0 pc=0x9f133f
    github.com/segmentio/parquet-go.(*columnChunkReader).readPage(0xc00016cb20)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/column_chunk.go:102 +0xa2 fp=0xc00020e350 sp=0xc00020e328 pc=0x9e1542
    github.com/segmentio/parquet-go.(*columnChunkReader).readValues(0xdbfba0?)
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/column_chunk.go:119 +0x71 fp=0xc00020e390 sp=0xc00020e350 pc=0x9e1651
    github.com/segmentio/parquet-go.columnReadRowFuncOfLeaf.func1({0xc0002747e0?, 0x3, 0x4}, 0xff?, {0xc00016ca00, 0x7fd1b1757108?, 0x40?})
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/column_chunk.go:326 +0xc5 fp=0xc00020e428 sp=0xc00020e390 pc=0x9e2205
    github.com/segmentio/parquet-go.makeColumnReadRowFunc.func1({0x0?, 0xf32600?, 0xc00016e300?}, 0xc8?, {0xc00016ca00, 0x35, 0x35})
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/schema.go:163 +0xa3 fp=0xc00020e490 sp=0xc00020e428 pc=0x9fb943
    github.com/segmentio/parquet-go.(*rowGroupRowReader).ReadRow(0xc00016e340?, {0x0?, 0x0, 0x0?})
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/row_group.go:306 +0xb7 fp=0xc00020e4d8 sp=0xc00020e490 pc=0x9fa4b7
    github.com/segmentio/parquet-go.(*reader).ReadRow(0xc0000c00a0, {0x0?, 0x0, 0x0?})
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/reader.go:276 +0xb1 fp=0xc00020e508 sp=0xc00020e4d8 pc=0x9f9331
    github.com/segmentio/parquet-go.(*Reader).ReadRow(0xc0000c0090, {0x0, 0x0, 0x0})
            /home/mark/go/pkg/mod/github.com/cube2222/[email protected]/reader.go:221 +0x65 fp=0xc00020e540 sp=0xc00020e508 pc=0x9f9205
    github.com/cube2222/octosql/datasources/parquet.(*DatasourceExecuting).Run(0xc00021c9c0, {{0xf30588?, 0xc00016edc0?}, 0x0?}, 0xc00021cb10, 0x0?)
            /home/mark/octosql/datasources/parquet/execution.go:47 +0x512 fp=0xc00020e6e0 sp=0xc00020e540 pc=0xa3bfd2
    github.com/cube2222/octosql/execution/nodes.(*Limit).Run(0xc000033780, {{0xf30588?, 0xc00016edc0?}, 0x0?}, 0xc00021c9f0, 0xc0000a6700?)
            /home/mark/octosql/execution/nodes/limit.go:34 +0x3a6 fp=0xc00020e910 sp=0xc00020e6e0 pc=0x5a6546
    github.com/cube2222/octosql/outputs/batch.(*OutputPrinter).Run(0xc0000a6700, {{0xf30588?, 0xc00016edc0?}, 0x0?})
            /home/mark/octosql/outputs/batch/live_output.go:81 +0x396 fp=0xc00020ea20 sp=0xc00020e910 pc=0xa5f476
    github.com/cube2222/octosql/cmd.glob..func4(0x1537ce0, {0xc00022c6d0, 0x1, 0x1?})
            /home/mark/octosql/cmd/root.go:463 +0x3653 fp=0xc00020fd70 sp=0xc00020ea20 pc=0xc16633
    github.com/spf13/cobra.(*Command).execute(0x1537ce0, {0xc000032050, 0x1, 0x1})
            /home/mark/go/pkg/mod/github.com/spf13/[email protected]/command.go:856 +0x67c fp=0xc00020fe48 sp=0xc00020fd70 pc=0x638e5c
    github.com/spf13/cobra.(*Command).ExecuteC(0x1537ce0)
            /home/mark/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x3b4 fp=0xc00020ff00 sp=0xc00020fe48 pc=0x6394d4
    github.com/spf13/cobra.(*Command).Execute(...)
            /home/mark/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
    github.com/spf13/cobra.(*Command).ExecuteContext(...)
            /home/mark/go/pkg/mod/github.com/spf13/[email protected]/command.go:895
    github.com/cube2222/octosql/cmd.Execute({0xf30588?, 0xc00016edc0?})
            /home/mark/octosql/cmd/root.go:476 +0x53 fp=0xc00020ff20 sp=0xc00020ff00 pc=0xc18233
    main.main()
            /home/mark/octosql/main.go:24 +0xe8 fp=0xc00020ff80 sp=0xc00020ff20 pc=0xc18f48
    runtime.main()
            /usr/lib/go-1.18/src/runtime/proc.go:250 +0x212 fp=0xc00020ffe0 sp=0xc00020ff80 pc=0x43a172
    runtime.goexit()
            /usr/lib/go-1.18/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00020ffe8 sp=0xc00020ffe0 pc=0x46a981
    
    goroutine 7 [syscall]:
    os/signal.signal_recv()
            /usr/lib/go-1.18/src/runtime/sigqueue.go:151 +0x2f
    os/signal.loop()
            /usr/lib/go-1.18/src/os/signal/signal_unix.go:23 +0x19
    created by os/signal.Notify.func1.1
            /usr/lib/go-1.18/src/os/signal/signal.go:151 +0x2a
    
    goroutine 8 [chan receive]:
    main.main.func1()
            /home/mark/octosql/main.go:17 +0x31
    created by main.main
            /home/mark/octosql/main.go:16 +0xd9
    
    goroutine 9 [select]:
    github.com/dgraph-io/ristretto.(*defaultPolicy).processItems(0xc00021d380)
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/policy.go:96 +0x91
    created by github.com/dgraph-io/ristretto.newDefaultPolicy
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/policy.go:80 +0x156
    
    goroutine 10 [select]:
    github.com/dgraph-io/ristretto.(*Cache).processItems(0xc0000775c0)
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/cache.go:314 +0xa8
    created by github.com/dgraph-io/ristretto.NewCache
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/cache.go:162 +0x56a
    
    goroutine 11 [select]:
    github.com/dgraph-io/ristretto.(*defaultPolicy).processItems(0xc0002594a0)
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/policy.go:96 +0x91
    created by github.com/dgraph-io/ristretto.newDefaultPolicy
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/policy.go:80 +0x156
    
    goroutine 12 [select]:
    github.com/dgraph-io/ristretto.(*Cache).processItems(0xc000077740)
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/cache.go:314 +0xa8
    created by github.com/dgraph-io/ristretto.NewCache
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/cache.go:162 +0x56a
    
    goroutine 13 [select]:
    github.com/dgraph-io/ristretto.(*defaultPolicy).processItems(0xc0002635c0)
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/policy.go:96 +0x91
    created by github.com/dgraph-io/ristretto.newDefaultPolicy
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/policy.go:80 +0x156
    
    goroutine 14 [select]:
    github.com/dgraph-io/ristretto.(*Cache).processItems(0xc0000778c0)
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/cache.go:314 +0xa8
    created by github.com/dgraph-io/ristretto.NewCache
            /home/mark/go/pkg/mod/github.com/dgraph-io/[email protected]/cache.go:162 +0x56a
    
    rax    0xc002a4a000
    rbx    0xc00295c001
    rcx    0xc002a59fe1
    rdx    0xc002a59fc0
    rdi    0x1
    rsi    0x0
    rbp    0xc00020e0f0
    rsp    0xc00020e0a8
    r8     0x83ff8
    r9     0x0
    r10    0x0
    r11    0x1
    r12    0x0
    r13    0x0
    r14    0x0
    r15    0x0
    rip    0x8f0db5
    rflags 0x10206
    cs     0x33
    fs     0x0
    gs     0x0
    exit status 2
    
  • `extraneous or missing " in quoted-field` on CSV file

    ❯ octosql 'select * from ./free_company_dataset.csv limit 10'      
    +------------------+---------+----------------------------------------+--------------------------+-------------------------------------------------------------+------------------+--------------------------+--------------+----------+-----------------------------------+
    |     country      | founded |                   id                   |         industry         |                        linkedin_url                         |     locality     |           name           |    region    |   size   |              website              |
    +------------------+---------+----------------------------------------+--------------------------+-------------------------------------------------------------+------------------+--------------------------+--------------+----------+-----------------------------------+
    | <null>           | <null>  | 'arnold-busck'                         | 'retail'                 | 'linkedin.com/company/arnold-busck'                         | <null>           | 'arnold busck'           | <null>       | '51-200' | 'arnoldbusck.dk'                  |
    | <null>           | <null>  | 'pegasus-rental-cars-auckland-airport' | 'automotive'             | 'linkedin.com/company/pegasus-rental-cars-auckland-airport' | <null>           | 'pegasus rental cars     | <null>       | '11-50'  | 'carrentalsaucklandairport.co.nz' |
    
    
    ❯ octosql 'select count(*) from ./free_company_dataset.csv'         
    ...
    Error: couldn't run query: couldn't run source: couldn't run source: couldn't decode message: parse error on line 248, column 130: extraneous or missing " in quoted-field
    
    ❯ octosql -v                      
    octosql version 0.9.2
    
  • nats driver ?

    I just stumbled on this repo. It looks a lot like the Differential Dataflow pattern?

    I was wondering if NATS is being considered? I saw Kafka is there.

    NATS JetStream can do deduplication now, I read, by the way.

  • Crash while processing a file

    Hi, when processing the following, octosql fails with a "couldn't get record stream for execution" error.

    Here's a reproduction of the issue. The reason for the cut operation is that I tried bisecting to find in which cases the issue happens. The command below fails. When changing the 13 to 12, it works. Any value above 12 crashes the program.

    cat baseline-data.csv | cut -d ',' -f 1-13 > actual-data.csv ; octosql -c example-config.octosql "select count(*) from bmdata x"
    

    The matching config file (./example-config.octosql) is this:

    dataSources:
      - name: bmdata
        type: csv
        config:
          path: "actual-data.csv"
    

    The reason I suspected the column count is kind of incidental. I'm the writer of a command-line SQL querying tool called q (http://harelba.github.io/q/) and I'm doing some benchmarks on it and similar tools. The benchmark runs on files with different column counts. All the octosql runs above 12 columns fail the same way, so I suspected some limit on the column count. Obviously, this might not be the root cause of the failures.

    I'd be glad to provide more information or answer more questions if needed.

    For some reason github doesn't allow me to do an upload. I'll try to send it in the next comment.

    Harel

  • queries using filenames that start with non-alpha characters fail

    $ sha256sum a.json
    4518d5db3dedeaf5b8cf078d8090b5d3d8b2da6befe9cddbf0544d4f3e61cb45 a.json

    $ sha256sum 1.json
    4518d5db3dedeaf5b8cf078d8090b5d3d8b2da6befe9cddbf0544d4f3e61cb45 1.json

    $ ~/go/bin/octosql "select 'works' from a.json limit 1"
    +---------+
    |  col_0  |
    +---------+
    | 'works' |
    +---------+

    $ ~/go/bin/octosql "select 'sad face' from 1.json limit 1"
    Usage:
      octosql [flags]
      octosql [command]

    Examples:
    octosql "SELECT * FROM myfile.json"
    octosql "SELECT * FROM mydir/myfile.csv"
    octosql "SELECT * FROM plugins.plugins"

  • Support Excel (.xlsx) files as a Datasource

    Seeing how useful it is to be able to query Excel and similar spreadsheets with SQL, it would be useful to bring this functionality into a CLI tool.
