Create key value sqlite3 database from tabular data, fast.

SLIKV

Turn tabular data into a lookup table using sqlite3. This is a working PROTOTYPE with limitations, e.g. no customizations, the table definition is fixed, etc.

CREATE TABLE IF NOT EXISTS map (k TEXT, v TEXT)

As a performance data point, an example dataset with 1B+ rows can be inserted and indexed in less than two hours (on a recent CPU and an nvme drive; database file size: 400G).

How it works

Data is chopped up into smaller chunks (defaults to about 64MB) and imported with the .import command. Indexes are created only after all data has been imported.

Example

$ cat fixtures/sample-xs.tsv | column -t
10.1001/10-v4n2-hsf10003                    10.1177/003335490912400218
10.1001/10-v4n2-hsf10003                    10.1097/01.bcr.0000155527.76205.a2
10.1001/amaguidesnewsletters.1996.novdec01  10.1056/nejm199312303292707
10.1001/amaguidesnewsletters.1996.novdec01  10.1016/s0363-5023(05)80265-5
10.1001/amaguidesnewsletters.1996.novdec01  10.1001/jama.1994.03510440069036
10.1001/amaguidesnewsletters.1997.julaug01  10.1097/00007632-199612150-00003
10.1001/amaguidesnewsletters.1997.mayjun01  10.1164/ajrccm/147.4.1056
10.1001/amaguidesnewsletters.1997.mayjun01  10.1136/thx.38.10.760
10.1001/amaguidesnewsletters.1997.mayjun01  10.1056/nejm199507133330207
10.1001/amaguidesnewsletters.1997.mayjun01  10.1378/chest.88.3.376

$ slikv -o xs.db < fixtures/sample-xs.tsv
2021/10/04 16:13:06 [ok] initialized database · xs.db
2021/10/04 16:13:06 [io] written 679B · 361.3K/s
2021/10/04 16:13:06 [ok] 1/2 created index · xs.db
2021/10/04 16:13:06 [ok] 2/2 created index · xs.db

$ sqlite3 xs.db 'select * from map'
10.1001/10-v4n2-hsf10003|10.1177/003335490912400218
10.1001/10-v4n2-hsf10003|10.1097/01.bcr.0000155527.76205.a2
10.1001/amaguidesnewsletters.1996.novdec01|10.1056/nejm199312303292707
10.1001/amaguidesnewsletters.1996.novdec01|10.1016/s0363-5023(05)80265-5
10.1001/amaguidesnewsletters.1996.novdec01|10.1001/jama.1994.03510440069036
10.1001/amaguidesnewsletters.1997.julaug01|10.1097/00007632-199612150-00003
10.1001/amaguidesnewsletters.1997.mayjun01|10.1164/ajrccm/147.4.1056
10.1001/amaguidesnewsletters.1997.mayjun01|10.1136/thx.38.10.760
10.1001/amaguidesnewsletters.1997.mayjun01|10.1056/nejm199507133330207
10.1001/amaguidesnewsletters.1997.mayjun01|10.1378/chest.88.3.376

$ sqlite3 xs.db 'select * from map where k = "10.1001/amaguidesnewsletters.1997.mayjun01" '
10.1001/amaguidesnewsletters.1997.mayjun01|10.1164/ajrccm/147.4.1056
10.1001/amaguidesnewsletters.1997.mayjun01|10.1136/thx.38.10.760
10.1001/amaguidesnewsletters.1997.mayjun01|10.1056/nejm199507133330207
10.1001/amaguidesnewsletters.1997.mayjun01|10.1378/chest.88.3.376

Motivation

SQLite is likely used more than all other database engines combined. Billions and billions of copies of SQLite exist in the wild. -- https://www.sqlite.org/mostdeployed.html

Sometimes, programs need lookup tables to map values between two domains. A dictionary is a perfect data structure as long as the data fits in memory. For larger sets (hundreds of millions of entries), a dictionary will not work.

The slikv tool takes a two-column tabular file and turns it into an sqlite3 database, which you can query in your program. Depending on the size of the data, you can expect 1K-50K queries per second.

Usage

$ slikv -h
Usage of slikv:
  -B int
        buffer size (default 67108864)
  -C int
        sqlite3 cache size, needs memory = C x page size (default 1000000)
  -I int
        index mode: 0=none, 1=k, 2=v, 3=kv (default 3)
  -o string
        output filename (default "data.db")
  -version
        show version and exit

Performance

$ wc -l fixtures/sample-10m.tsv
10000000 fixtures/sample-10m.tsv

$ stat --format "%s" fixtures/sample-10m.tsv
548384897

$ time slikv < fixtures/sample-10m.tsv
2021/09/30 16:58:07 [ok] initialized database -- data.db
2021/09/30 16:58:17 [io] written 523M · 56.6M/s
2021/09/30 16:58:21 [ok] 1/2 created index -- data.db
2021/09/30 16:58:34 [ok] 2/2 created index -- data.db

real    0m26.267s
user    0m24.122s
sys     0m3.224s
  • 10M rows stored, with indexed keys and values in 27s, 370370 rows/s

TODO

  • allow tab-importing to be done programmatically, for any number of columns
  • a better name: mktabdb, mktabs, dbize
  • could write a tool for burst queries, e.g. split data into N shard, create N databases and distribute queries across files - e.g. dbize db.json with the same repl, etc. -- if we've seen 300K inserts per db, we may see 0.X x CPU x 300K, maybe millions/s.
Owner
Similar Resources

An observability database aims to ingest, analyze and store Metrics, Tracing and Logging data.

An observability database aims to ingest, analyze and store Metrics, Tracing and Logging data.

BanyanDB BanyanDB, as an observability database, aims to ingest, analyze and store Metrics, Tracing and Logging data. It's designed to handle observab

Dec 31, 2022

A database connection tool for sensitive data

A database connection tool for sensitive data

go-sql 用于快速统计数据库行数、敏感字段匹配、数据库连接情况。 usage ./go-sql_darwin_amd64 -h ./go-sql_darwin_amd64 -f db.yaml -k name,user ./go-sql_darwin_amd64 -f db.yaml --min

Apr 4, 2022

Lightweight SQL database written in Go for prototyping and playing with text (CSV, JSON) data

gopicosql Lightweight SQL database written in Go for prototyping and playing wit

Jul 27, 2022

A go package to add support for data at rest encryption if you are using the database/sql.

go-lockset A go package to add support for data at rest encryption if you are using the database/sql to access your database. Installation In your Gol

Jan 30, 2022

A fast data generator that's multi-table aware and supports multi-row DML.

A fast data generator that's multi-table aware and supports multi-row DML.

If you need to generate a lot of random data for your database tables but don't want to spend hours configuring a custom tool for the job, then datage

Dec 26, 2022

Typescript type declaration to PostgreSQL CREATE TABLE converter

ts2psql NOTE: This is WIP. Details in this readme are ideal state. Current usage: go build && ./ts2psql (or go build && ts2psql if on Windows OS). A s

Jan 13, 2022

🏋️ dbbench is a simple database benchmarking tool which supports several databases and own scripts

dbbench Table of Contents Description Example Installation Supported Databases Usage Custom Scripts Troubeshooting Development Acknowledgements Descri

Dec 30, 2022

Database wrapper that manage read write connections

rwdb Database wrapper that manage read write connections Install go get github.com/andizzle/rwdb Create connections package main import "github.com/

Dec 10, 2022

Vitess is a database clustering system for horizontal scaling of MySQL.

Vitess Vitess is a database clustering system for horizontal scaling of MySQL through generalized sharding. By encapsulating shard-routing logic, Vite

Jan 3, 2023
A tool I made to quickly store bug bounty program scopes in a local sqlite3 database

GoScope A tool I made to quickly store bug bounty program scopes in a local sqlite3 database. Download or copy a Burpsuite configuration file from the

Nov 18, 2021
Go sqlite3 http vfs: query sqlite databases over http with range headers

sqlite3vfshttp: a Go sqlite VFS for querying databases over http(s) sqlite3vfshttp is a sqlite3 VFS for querying remote databases over http(s). This a

Dec 27, 2022
Simple key-value store on top of SQLite or MySQL

KV Work in progress, not ready for prime time. A simple key/value store on top of SQLite or MySQL (Go port of GitHub's KV). Aims to be 100% compatible

Dec 3, 2022
Dumpling is a fast, easy-to-use tool written by Go for dumping data from the database(MySQL, TiDB...) to local/cloud(S3, GCP...) in multifarious formats(SQL, CSV...).

?? Dumpling Dumpling is a tool and a Go library for creating SQL dump from a MySQL-compatible database. It is intended to replace mysqldump and mydump

Nov 9, 2022
[mirror] the database client and tools for the Go vulnerability database

The Go Vulnerability Database golang.org/x/vulndb This repository is a prototype of the Go Vulnerability Database. Read the Draft Design. Neither the

Dec 29, 2022
Database - Example project of database realization using drivers and models

database Golang based database realization Description Example project of databa

Feb 10, 2022
Additions to Go's database/sql for super fast performance and convenience.

gocraft/dbr (database records) gocraft/dbr provides additions to Go's database/sql for super fast performance and convenience. $ go get -u github.com/

Jan 1, 2023
Convert data exports from various services to a single SQLite database
Convert data exports from various services to a single SQLite database

Bionic Bionic is a tool to convert data exports from web apps to a single SQLite database. Bionic currently supports data exports from Google, Apple H

Dec 9, 2022
Library for scanning data from a database into Go structs and more

scany Overview Go favors simplicity, and it's pretty common to work with a database via driver directly without any ORM. It provides great control and

Jan 9, 2023