tstorage is a lightweight local on-disk storage engine for time-series data


tstorage is a lightweight local on-disk storage engine for time-series data with a straightforward API. Ingestion in particular is heavily optimized: it provides goroutine-safe reads from and writes to a TSDB that partitions data points by time.

Motivation

I'm working on a couple of tools that handle a tremendous amount of time-series data, such as Ali and Gosivy. Ali in particular had a problem of ever-increasing heap consumption, as it's a load testing tool that aims to perform real-time analysis. I poked around for a fast TSDB library with simple APIs, but nothing worked as well as I'd like, so I settled on writing this package myself.

To see how much tstorage has helped improve Ali's performance, see the release notes here.

Usage

Currently, tstorage requires Go version 1.16 or greater.

By default, tstorage.Storage works as an in-memory database. The example below illustrates how to insert a row into memory and immediately select it.

package main

import (
	"fmt"

	"github.com/nakabonne/tstorage"
)

func main() {
	storage, _ := tstorage.NewStorage(
		tstorage.WithTimestampPrecision(tstorage.Seconds),
	)
	defer storage.Close()

	_ = storage.InsertRows([]tstorage.Row{
		{
			Metric: "metric1",
			DataPoint: tstorage.DataPoint{Timestamp: 1600000000, Value: 0.1},
		},
	})
	points, _ := storage.Select("metric1", nil, 1600000000, 1600000001)
	for _, p := range points {
		fmt.Printf("timestamp: %v, value: %v\n", p.Timestamp, p.Value)
		// => timestamp: 1600000000, value: 0.1
	}
}

Using disk

To make time-series data persistent on disk, specify the path to a directory for the data through the WithDataPath option.

storage, _ := tstorage.NewStorage(
	tstorage.WithDataPath("./data"),
)
defer storage.Close()

Labeled metrics

In tstorage, a metric is identified by the combination of a metric name and optional labels. Here is an example of inserting a labeled metric to the disk.

metric := "mem_alloc_bytes"
labels := []tstorage.Label{
	{Name: "host", Value: "host-1"},
}

_ = storage.InsertRows([]tstorage.Row{
	{
		Metric:    metric,
		Labels:    labels,
		DataPoint: tstorage.DataPoint{Timestamp: 1600000000, Value: 0.1},
	},
})
points, _ := storage.Select(metric, labels, 1600000000, 1600000001)

For more examples see the documentation.

Benchmarks

Benchmark tests were run on an Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz with 16GB of RAM on macOS 10.15.7.

$ go version
go version go1.16.2 darwin/amd64

$ go test -benchtime=4s -benchmem -bench=. .
goos: darwin
goarch: amd64
pkg: github.com/nakabonne/tstorage
cpu: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
BenchmarkStorage_InsertRows-8                  	14135685	       305.9 ns/op	     174 B/op	       2 allocs/op
BenchmarkStorage_SelectAmongThousandPoints-8   	20548806	       222.4 ns/op	      56 B/op	       2 allocs/op
BenchmarkStorage_SelectAmongMillionPoints-8    	16185709	       292.2 ns/op	      56 B/op	       1 allocs/op
PASS
ok  	github.com/nakabonne/tstorage	16.501s

Internal

A time-series database has a distinctive workload. On the write side, it has to ingest a tremendous number of data points ordered by time. Time-series data is immutable and the workload is mostly append-only, with deletes performed in batches on older data. On the read side, we usually retrieve multiple data points by specifying a time range, and most queries target the most recent data for real-time analysis. Also, time-series data is already indexed in time order.

Based on these characteristics, tstorage adopts a linear data model that partitions data points by time, quite different from storage engines based on B-trees or LSM trees. Each partition acts as a fully independent database containing all data points for its time range.


Key benefits:

  • We can easily ignore all data outside of a partition's time range when querying data points (see the sketch below).
  • Most read operations are fast because recent data is cached in heap.
  • When a partition gets full, we can persist the data from the in-memory database by sequentially writing just a handful of larger files. We avoid any write amplification and serve SSDs and HDDs equally well.
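
The first benefit follows directly from this data model: a query only has to touch partitions whose time range overlaps the requested range. Here is a minimal sketch of the idea; the partition type and helper below are illustrative, not tstorage's actual internals.

package main

import "fmt"

// partition is a hypothetical stand-in for one time-bounded partition.
type partition struct {
	minTimestamp int64
	maxTimestamp int64
}

// overlaps reports whether the partition may contain points in [start, end].
func (p *partition) overlaps(start, end int64) bool {
	return p.minTimestamp <= end && p.maxTimestamp >= start
}

func main() {
	partitions := []*partition{
		{minTimestamp: 1600000001, maxTimestamp: 1600003600},
		{minTimestamp: 1600003601, maxTimestamp: 1600007200},
		{minTimestamp: 1600007201, maxTimestamp: 1600010800},
	}
	start, end := int64(1600003700), int64(1600003800)
	for _, p := range partitions {
		if !p.overlaps(start, end) {
			continue // the whole partition is skipped without touching its points
		}
		fmt.Printf("scan partition [%d, %d]\n", p.minTimestamp, p.maxTimestamp)
	}
	// => scan partition [1600003601, 1600007200]
}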

Memory partition

The memory partition is writable and stores data points in heap. The head partition is always a memory partition, and the partition right behind it is also a memory partition so that out-of-order data points can still be accepted. Data points are stored in an ordered slice, which offers an excellent cache hit ratio compared to linked lists, as long as it isn't updated too often (e.g., deleting or inserting elements at random positions).
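
Because the slice is ordered by timestamp, range boundaries can be found with binary search. A minimal sketch with the standard library (the dataPoint type here is illustrative):

package main

import (
	"fmt"
	"sort"
)

type dataPoint struct {
	Timestamp int64
	Value     float64
}

// selectRange returns the points whose Timestamp falls in [start, end).
// It relies on points being sorted by Timestamp.
func selectRange(points []dataPoint, start, end int64) []dataPoint {
	startIdx := sort.Search(len(points), func(i int) bool { return points[i].Timestamp >= start })
	endIdx := sort.Search(len(points), func(i int) bool { return points[i].Timestamp >= end })
	return points[startIdx:endIdx]
}

func main() {
	points := []dataPoint{
		{Timestamp: 1600000000, Value: 0.1},
		{Timestamp: 1600000001, Value: 0.2},
		{Timestamp: 1600000002, Value: 0.3},
	}
	fmt.Println(selectRange(points, 1600000000, 1600000002))
	// => [{1600000000 0.1} {1600000001 0.2}]
}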

All incoming data is written to a write-ahead log (WAL) right before being inserted into a memory partition, to prevent data loss.

Disk partition

Old memory partitions get compacted and persisted to directories prefixed with p-, under the directory specified by the WithDataPath option. Here is the macro layout of disk partitions:

$ tree ./data
./data
├── p-1600000001-1600003600
│   ├── data
│   └── meta.json
├── p-1600003601-1600007200
│   ├── data
│   └── meta.json
└── p-1600007201-1600010800
    ├── data
    └── meta.json

As you can see, each partition holds two files: meta.json and data. The data file is compressed, read-only, and memory-mapped with mmap(2), which maps the file into the process's address space so it can be read without loading it into heap. Therefore, the only thing kept in heap is the partition's metadata. Just looking at meta.json gives us a good picture of what it stores:

$ cat ./data/p-1600000001-1600003600/meta.json
{
  "minTimestamp": 1600000001,
  "maxTimestamp": 1600003600,
  "numDataPoints": 7200,
  "metrics": {
    "metric-1": {
      "name": "metric-1",
      "offset": 0,
      "minTimestamp": 1600000001,
      "maxTimestamp": 1600003600,
      "numDataPoints": 3600
    },
    "metric-2": {
      "name": "metric-2",
      "offset": 36014,
      "minTimestamp": 1600000001,
      "maxTimestamp": 1600003600,
      "numDataPoints": 3600
    }
  }
}

Each metric entry records the offset where its data begins in the file. The data point slice for each metric is compressed separately, so reading a metric is just a matter of seeking to its offset and reading the points off.
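
To make the read path concrete, here is a minimal sketch, using only the standard library, of how a metric's block could be located with the metadata above. The struct and helper names are illustrative, and tstorage itself mmaps the data file rather than seeking through it.

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

type metricMeta struct {
	Name          string `json:"name"`
	Offset        int64  `json:"offset"`
	MinTimestamp  int64  `json:"minTimestamp"`
	MaxTimestamp  int64  `json:"maxTimestamp"`
	NumDataPoints int64  `json:"numDataPoints"`
}

type partitionMeta struct {
	MinTimestamp  int64                 `json:"minTimestamp"`
	MaxTimestamp  int64                 `json:"maxTimestamp"`
	NumDataPoints int64                 `json:"numDataPoints"`
	Metrics       map[string]metricMeta `json:"metrics"`
}

// seekToMetric opens a disk partition directory and positions the data file
// at the beginning of the given metric's compressed block.
func seekToMetric(dir, metric string) (*os.File, *metricMeta, error) {
	raw, err := os.ReadFile(dir + "/meta.json")
	if err != nil {
		return nil, nil, err
	}
	var meta partitionMeta
	if err := json.Unmarshal(raw, &meta); err != nil {
		return nil, nil, err
	}
	m, ok := meta.Metrics[metric]
	if !ok {
		return nil, nil, fmt.Errorf("metric %q not in partition", metric)
	}
	f, err := os.Open(dir + "/data")
	if err != nil {
		return nil, nil, err
	}
	if _, err := f.Seek(m.Offset, io.SeekStart); err != nil {
		f.Close()
		return nil, nil, err
	}
	return f, &m, nil
}

func main() {
	f, m, err := seekToMetric("./data/p-1600000001-1600003600", "metric-1")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()
	// ... decode m.NumDataPoints compressed points from f
	fmt.Printf("ready to read %d points\n", m.NumDataPoints)
}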

Out-of-order data points

Out-of-order data points are not uncommon in real-world applications because of network latency or clock-synchronization issues, and tstorage doesn't simply discard them. If an out-of-order data point falls within the range of the head memory partition, it is temporarily buffered and merged at flush time. Data points sometimes also cross a partition boundary, which is why tstorage keeps more than one partition writable.
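
A minimal sketch of that merge step (illustrative, not the exact implementation): the buffered out-of-order points are sorted, then merged with the already-ordered points before the partition is flushed.

package main

import (
	"fmt"
	"sort"
)

type dataPoint struct {
	Timestamp int64
	Value     float64
}

// mergePoints merges an ordered slice with a possibly unsorted out-of-order
// buffer, returning a single slice ordered by timestamp.
func mergePoints(ordered, outOfOrder []dataPoint) []dataPoint {
	sort.Slice(outOfOrder, func(i, j int) bool {
		return outOfOrder[i].Timestamp < outOfOrder[j].Timestamp
	})
	merged := make([]dataPoint, 0, len(ordered)+len(outOfOrder))
	i, j := 0, 0
	for i < len(ordered) && j < len(outOfOrder) {
		if ordered[i].Timestamp <= outOfOrder[j].Timestamp {
			merged = append(merged, ordered[i])
			i++
		} else {
			merged = append(merged, outOfOrder[j])
			j++
		}
	}
	merged = append(merged, ordered[i:]...)
	merged = append(merged, outOfOrder[j:]...)
	return merged
}

func main() {
	ordered := []dataPoint{{Timestamp: 1600000000, Value: 1}, {Timestamp: 1600000002, Value: 3}}
	buffered := []dataPoint{{Timestamp: 1600000001, Value: 2}}
	fmt.Println(mergePoints(ordered, buffered))
	// => [{1600000000 1} {1600000001 2} {1600000002 3}]
}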

Used by

  • ali - A load testing tool capable of performing real-time analysis
  • gosivy - Real-time visualization tool for Go process metrics

Acknowledgements

This package is implemented based on tons of existing ideas and prior work on time-series storage.

A big "thank you!" goes out to all of them.

Comments
  • Support for external TSDB storage

    Support for external TSDB storage

    Hey, I'd like to use tstorage, but with an external persistence layer (in my case Kafka). Do you think it would make sense to support this, so that we can use external systems like databases or even streams like Kafka for persistence?

    My use case is to utilize tstorage to collect some metrics for Kowl so that we can show some graphs (e.g. storage usage over time). Ideally our users wouldn't need to configure file storage; instead we could use something like a Kafka topic that can be shared by all instances.

  • WAL recovery issue

    WAL recovery issue

    Hey @nakabonne, I appreciate your quick work on the WAL, that's awesome! I just tried it and apparently ran into an issue with recovery from the WAL file. I let it run for ~2 minutes, during which it added a few metrics into the TSDB, and I noticed that the WAL file was also written. During recovery the following error popped up:

    failed to recover WAL: failed to read WAL: encounter an error while reading WAL segment file "0": failed to read the metric name: unexpected EOF

    My settings are:

    	tsdbOpts := []tstorage.Option{
    		tstorage.WithTimestampPrecision(tstorage.Seconds),
    		tstorage.WithPartitionDuration(cfg.CacheRetention / 2),
    		tstorage.WithDataPath(cfg.Persistence.Disk.DataPath),
    		tstorage.WithRetention(cfg.Persistence.Disk.Retention),
    	}

    	storage, err := tstorage.NewStorage(tsdbOpts...)

    I sent you the WAL file via email.

  • Support writes outside of head partition

    Support writes outside of head partition

    Currently we only write to the head partition; this change supports writing to any of the writable partitions (currently set to 2).

    ref https://github.com/nakabonne/tstorage/issues/20

  • Fix

    Fix "slice bounds out of range" error in memoryMetric.selectPoints()

    This will fix a bug which I have encountered while using nakabonne/ali (awesome! btw):

    panic: runtime error: slice bounds out of range [6:1]
    
    goroutine 53 [running]:
    
    github.com/nakabonne/tstorage.(*memoryMetric).selectPoints(0xc0001ba7e0, 0x16a06b537e57044c, 0x16a06b5a7a7ab04c, 0x0, 0x0, 0x0)
            /root/go/pkg/mod/github.com/nakabonne/[email protected]/memory_partition.go:232 +0x27b
    github.com/nakabonne/tstorage.(*memoryPartition).selectDataPoints(0xc000132200, 0x82dbf7, 0x7, 0x0, 0x0, 0x0, 0x16a06b537e57044c, 0x16a06b5a7a7ab04c, 0x449a4c, 0xc000092238, ...)
            /root/go/pkg/mod/github.com/nakabonne/[email protected]/memory_partition.go:124 +0x9e
    github.com/nakabonne/tstorage.(*storage).Select(0xc000132180, 0x82dbf7, 0x7, 0x0, 0x0, 0x0, 0x16a06b537e57044c, 0x16a06b5a7a7ab04c, 0x46a77b, 0x77b3455e6fa8, ...)
            /root/go/pkg/mod/github.com/nakabonne/[email protected]/storage.go:324 +0x262
    github.com/nakabonne/ali/storage.(*storage).Select(0xc00012e260, 0x82dbf7, 0x7, 0xc043adf3db12ae4c, 0x323d8539, 0xb456e0, 0xc043adfb5b12ae4c, 0x72e613139, 0xb456e0, 0x0, ...)
            /opt/ali/storage/storage.go:116 +0x105
    github.com/nakabonne/ali/gui.(*drawer).redrawCharts(0xc0001c6480, 0x8a5ab8, 0xc00007e000)
            /opt/ali/gui/drawer.go:53 +0x262
    created by github.com/nakabonne/ali/gui.attack
            /opt/ali/gui/keybinds.go:55 +0x9d
    

    In the original code the first (green) case is almost always entered; the bug occurs when the second (red) case is executed.


  • tstorage + gRPC = tstorage-server

    tstorage + gRPC = tstorage-server

    Hello @nakabonne,

    I saw your post about tstorage on HackerNews and I spent some time looking through the source code - and I love the project.

    The challenge I have is that I want to access tstorage from many applications simultaneously; as a result, I spent some time writing a gRPC service definition on top of tstorage to turn it into a server application. I've created the repo tstorage-server.

    Have a look and feel free to use any bit of the code if you like. If not, no worries! Just wanted to say thanks again for your great work!

    Take care, -Bart

  • Support WAL to offer durability guarantees

    Support WAL to offer durability guarantees

    A write-ahead log is a required capability to provide durability guarantees.

    https://github.com/nakabonne/tstorage/blob/2ed7de6461aa61704ea8be0eaba2369bba0831b5/file_wal.go#L18-L21

  • Merge out of order data points before flushing to disk

    Merge out of order data points before flushing to disk

    This change sorts the out of order points and then merges the two slices before flushing buffered data to disk. This is a partial implementation of https://github.com/nakabonne/tstorage/issues/7 but doesn't handle points outside of the head partition bounds.

  • Missing meta.json file

    Missing meta.json file

    When I abruptly stop the application, it apparently does not always write a meta.json file into the disk partition directories. On startup, a log message like this pops up:

    failed to open disk partition for config\\tsdb-data\\xy\\p-1635758430-1635769231: failed to read metadata: open config\\tsdb-data\\xy\\p-1635758430-1635769231\\meta.json: The system cannot find the file specified."
    

    I'm unsure when the meta.json file is supposed to be written, hence I can't tell whether it depends on the fact that I abruptly stop the application. It's not easy for me to test because I think it takes some time until these files are written. Let me know if I can provide further details; it has happened a couple of times already.

  • Add support for Write Ahead Log

    Add support for Write Ahead Log

    Fixes https://github.com/nakabonne/tstorage/issues/5

    This PR adds support for WAL only for insertOperation. The format is:

    	   +--------+---------------------+--------+--------------------+----------------+
    	   | op(1b) | len metric(varints) | metric | timestamp(varints) | value(varints) |
    	   +--------+---------------------+--------+--------------------+----------------+
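
    A minimal sketch of encoding one record in this layout with Go's encoding/binary varints (illustrative only; in particular, packing the float value as the varint of its IEEE-754 bits is an assumption, not necessarily what the PR does):

    package main

    import (
    	"bytes"
    	"encoding/binary"
    	"fmt"
    	"math"
    )

    const opInsert byte = 0 // hypothetical op code for the insert operation

    // encodeWALRecord packs one insert following
    // | op(1b) | len metric(varint) | metric | timestamp(varint) | value(varint) |.
    func encodeWALRecord(metric string, timestamp int64, value float64) []byte {
    	buf := &bytes.Buffer{}
    	buf.WriteByte(opInsert)

    	tmp := make([]byte, binary.MaxVarintLen64)
    	n := binary.PutUvarint(tmp, uint64(len(metric)))
    	buf.Write(tmp[:n])
    	buf.WriteString(metric)

    	n = binary.PutVarint(tmp, timestamp)
    	buf.Write(tmp[:n])

    	// Assumption: the value is stored as the varint form of its IEEE-754 bits.
    	n = binary.PutUvarint(tmp, math.Float64bits(value))
    	buf.Write(tmp[:n])
    	return buf.Bytes()
    }

    func main() {
    	fmt.Printf("% x\n", encodeWALRecord("metric1", 1600000000, 0.1))
    }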
    
  • Handle data points outside range of the head partition

    Handle data points outside range of the head partition

    Basically, all data points get ingested into the head partition. But sometimes data points fall across the partition boundary and should go into the next partition.

    https://github.com/nakabonne/tstorage/blob/b1bbbafe00c65da63ce4eb66161134da4e15b92f/storage.go#L251-L252

    Points older than the head's next partition get discarded.

  • Fix the way offsets are taken so that decoding works regardless of the number of metrics

    Fix the way offsets are taken so that decoding works regardless of the number of metrics

    This PR does:

    • write actually encoded data points for each metric
    • reset encoder params used for computation (like deltas) whenever flushing
    • also stop using gzip for now.

  • Wildcard in Select method

    Wildcard in Select method

    It would be very convenient to be able to select more than one metric using the following syntax (assume there are "metric-1" and "metric-2" metrics):

    storage.Select("metric?", nil, timestart, timeend)
    

    to select all metrics with any character in place of ?, or

    storage.Select("*", nil, timestart, timeend)
    

    to select all data.

    Similarly, it could be a regex.

  • GetRetention of the Tstorage instance added.

    GetRetention of the Tstorage instance added.

    Hi @nakabonne, this is a really great and useful repository, thanks for developing it. As a feature I would like to be able to read the configured and default values of the database, so I created this pull request. If you think it fits the conventions of the repository, I would like to add other getters (partition duration, WAL buffer) and so on.

    Kind Regards

  • Proposal: gzip partition data

    Proposal: gzip partition data

    Given a naive write of sequential data:

    for i := int64(1); i < 10000; i++ {
    	_ = storage.InsertRows([]tstorage.Row{{
    			Metric:    "metric1",
    			DataPoint: tstorage.DataPoint{Timestamp: 1600000000 + i, Value: float64(i)},
    		}})
    }
    

    The file contains many stretches of repeated values (mostly 0x00's). This is great for gzip, a byte-stream de-duplicator.

    If I gzip the file:

    $ gzip -k data && ls -alsh data*
     84K -rw-rw-r-- 1 wmeitzler wmeitzler  82K Sep 20 09:45 data
    4.0K -rw-rw-r-- 1 wmeitzler wmeitzler 1.3K Sep 20 09:45 data.gz
    

    I achieve file compression of 21x!

    Note that this only really provides value when long stretches of adjacent datapoints are similar. If I populate a file with truly random values, rather than ascending, I can only compress from 82kb to 80kb. And I'm paying CPU time to achieve these small files. I suspect, but have not validated, that a meaningful amount of use cases for a TSDB will generate these adjacent-similarity series that would benefit from gzip compression.

    Given that Go allows access to the streaming nature of gzip operations, I propose exploring the use of these functions in the reading and writing of data files.
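
    As a minimal sketch of the proposal (illustrative, not part of tstorage), the data file writer could be wrapped in a streaming gzip.Writer from the standard library:

    package main

    import (
    	"compress/gzip"
    	"log"
    	"os"
    )

    func main() {
    	f, err := os.Create("data.gz")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	// Bytes are compressed as they are written, so the whole
    	// partition never has to be held uncompressed in memory.
    	zw := gzip.NewWriter(f)
    	defer zw.Close()

    	if _, err := zw.Write([]byte("encoded data points ...")); err != nil {
    		log.Fatal(err)
    	}
    }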

  • inclusive end in Select

    inclusive end in Select

    During Select, this code uses >= instead of >, making end inclusive, which contradicts the documentation for Select:

    if end >= maxTimestamp {
    	endIdx = int(size)
    } else {
    	// Use binary search because points are in-order.
    	endIdx = sort.Search(int(size), func(i int) bool {
    		return m.points[i].Timestamp >= end
    	})
    }
    