A disk-backed key-value store.

Last update: Jan 1, 2023

What is diskv?

Diskv (disk-vee) is a simple, persistent key-value store written in the Go language. It starts with an incredibly simple API for storing arbitrary data on a filesystem by key, and builds several layers of performance-enhancing abstraction on top. The end result is a conceptually simple, but highly performant, disk-backed storage system.

Installing

Install Go 1, either from source or with a prepackaged binary. Then,

$ go get github.com/peterbourgon/diskv

Usage

package main

import (
	"fmt"
	"github.com/peterbourgon/diskv"
)

func main() {
	// Simplest transform function: put all the data files into the base dir.
	flatTransform := func(s string) []string { return []string{} }

	// Initialize a new diskv store, rooted at "my-data-dir", with a 1MB cache.
	d := diskv.New(diskv.Options{
		BasePath:     "my-data-dir",
		Transform:    flatTransform,
		CacheSizeMax: 1024 * 1024,
	})

	// Write three bytes to the key "alpha".
	key := "alpha"
	d.Write(key, []byte{'1', '2', '3'})

	// Read the value back out of the store.
	value, _ := d.Read(key)
	fmt.Printf("%v\n", value)

	// Erase the key+value from the store (and the disk).
	d.Erase(key)
}

More complex examples can be found in the "examples" subdirectory.

Theory

Basic idea

At its core, diskv is a map of a key (string) to arbitrary data ([]byte). The data is written to a single file on disk, with the same name as the key. The key determines where that file will be stored, via a user-provided TransformFunc, which takes a key and returns a slice ([]string) corresponding to a path list where the key file will be stored. The simplest TransformFunc,

func SimpleTransform(key string) []string {
	return []string{}
}

will place all keys in the same, base directory. The design is inspired by Redis diskstore; a TransformFunc which emulates the default diskstore behavior is available in the content-addressable-storage example.

Note that your TransformFunc should ensure that one valid key doesn't transform to a subset of another valid key. That is, it shouldn't be possible to construct valid keys that resolve to directory names. As a concrete example, if your TransformFunc splits on every 3 characters, then

d.Write("abcabc", val) // OK: written to <base>/abc/abc/abcabc
d.Write("abc", val)    // Error: attempted write to <base>/abc/abc, but it's a directory

This will be addressed in an upcoming version of diskv.

Probably the most important design principle behind diskv is that your data is always flatly available on the disk. diskv will never do anything that would prevent you from accessing, copying, backing up, or otherwise interacting with your data via common UNIX commandline tools.

Advanced path transformation

If you need more control over the file name written to disk or if you want to support slashes in your key name or special characters in the keys, you can use the AdvancedTransform property. You must supply a function that returns a special PathKey structure, which is a breakdown of a path and a file name. Strings returned must be clean of any slashes or special characters:

package main

import (
	"fmt"
	"strings"

	"github.com/peterbourgon/diskv"
)

func AdvancedTransformExample(key string) *diskv.PathKey {
	path := strings.Split(key, "/")
	last := len(path) - 1
	return &diskv.PathKey{
		Path:     path[:last],
		FileName: path[last] + ".txt",
	}
}

// If you provide an AdvancedTransform, you must also provide its
// inverse:

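func InverseTransformExample(pathKey *diskv.PathKey) (key string) {
	if !strings.HasSuffix(pathKey.FileName, ".txt") {
		panic("Invalid file found in storage folder!")
	}
	name := strings.TrimSuffix(pathKey.FileName, ".txt")
	// Rejoin the directory components and the file name with "/" so the
	// result is exactly the key that AdvancedTransformExample was given.
	parts := append(append([]string{}, pathKey.Path...), name)
	return strings.Join(parts, "/")
}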

func main() {
	d := diskv.New(diskv.Options{
		BasePath:          "my-data-dir",
		AdvancedTransform: AdvancedTransformExample,
		InverseTransform:  InverseTransformExample,
		CacheSizeMax:      1024 * 1024,
	})
	// Write some text to the key "alpha/beta/gamma".
	key := "alpha/beta/gamma"
	d.WriteString(key, "¡Hola!") // will be stored in "<basedir>/alpha/beta/gamma.txt"
	fmt.Println(d.ReadString("alpha/beta/gamma"))
}

Adding a cache

An in-memory caching layer is provided by combining the BasicStore functionality with a simple map structure, and keeping it up-to-date as appropriate. Since the map structure in Go is not threadsafe, it's combined with a RWMutex to provide safe concurrent access.
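
The map-plus-RWMutex idea can be sketched as follows. This is an illustration only, not diskv's internal code: byteCache and its methods are hypothetical names, and the policy of refusing oversized inserts instead of evicting is an assumption made to keep the sketch short.

```go
package main

import (
	"fmt"
	"sync"
)

// byteCache is a hypothetical size-bounded cache: a plain map guarded by a
// sync.RWMutex. It is an illustration, not diskv's internal type.
type byteCache struct {
	mu   sync.RWMutex
	data map[string][]byte
	size int // total bytes currently cached
	max  int // upper bound on size
}

func newByteCache(max int) *byteCache {
	return &byteCache{data: make(map[string][]byte), max: max}
}

// get takes only the read lock, so many readers can proceed concurrently.
func (c *byteCache) get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	val, ok := c.data[key]
	return val, ok
}

// set takes the write lock. It refuses values that would push the cache past
// max instead of evicting; a real cache would pick an eviction policy here.
func (c *byteCache) set(key string, val []byte) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	newSize := c.size - len(c.data[key]) + len(val)
	if newSize > c.max {
		return false
	}
	c.data[key] = val
	c.size = newSize
	return true
}

func main() {
	c := newByteCache(4)
	fmt.Println(c.set("alpha", []byte("123"))) // fits within 4 bytes
	fmt.Println(c.set("beta", []byte("45")))   // would exceed 4 bytes
	val, ok := c.get("alpha")
	fmt.Println(string(val), ok)
}
```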

Adding order

diskv is a key-value store and therefore inherently unordered. An ordering system can be injected into the store by passing something which satisfies the diskv.Index interface. (A default implementation, using Google's btree package, is provided.) Basically, diskv keeps an ordered (by a user-provided Less function) index of the keys, which can be queried.

Adding compression

Something which implements the diskv.Compression interface may be passed during store creation, so that all Writes and Reads are filtered through a compression/decompression pipeline. Several default implementations, using stdlib compression algorithms, are provided. Note that data is cached compressed; the cost of decompression is borne with each Read.

Streaming

diskv also now provides ReadStream and WriteStream methods, to allow very large data to be handled efficiently.

Future plans

Needs plenty of robust testing: huge datasets, etc...
More thorough benchmarking
Your suggestions for use-cases I haven't thought of

Credits and contributions

Original idea, design and implementation: Peter Bourgon
Other collaborations: Javier Peletier (Epic Labs)

Owner

Peter Bourgon

The official GitHub account of Dwayne 'The Rock' Johnson

https://github.com/peterbourgon/diskv
http://godoc.org/github.com/peterbourgon/diskv

Comments

Switch to actively maintained Red-Black tree implementation?

I've noticed that github.com/petar/GoLLRB has had no activity since 2013, while github.com/HuKeping/rbtree appears to be actively maintained.

I wonder if it would be worth switching to an actively maintained project?

completeFilename causes invalid memory address or nil pointer dereference

We're running a component in Kubernetes that uses diskv under the hood. The problem is that the process occasionally crashes when it attempts to remove the key from the store. Here is the relevant stack trace:

github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).Erase(0xc42041c2d0, 0x0, 0x1b, 0x0, 0x0)
	/home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:409 +0xe7
github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).completeFilename(0xc42041c2d0, 0x0, 0x1b, 0x1b, 0x27fdf00)
	/home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:525 +0x98
path/filepath.Join(0xc42110b970, 0x2, 0x2, 0xc4208e18c0, 0x11)
	/usr/lib/go/src/path/filepath/path.go:210
path/filepath.join(0xc42110b970, 0x2, 0x2, 0x0, 0x0)
	/usr/lib/go/src/path/filepath/path_unix.go:45 +0x96
strings.Join(0xc42110b970, 0x2, 0x2, 0x18d6236, 0x1, 0xc42110b918, 0x2)
	/usr/lib/go/src/strings/strings.go:424

The data directory is mounted as a regular host path (/opt/spm/agent) and file names are ksuid-compatible identifiers.

diskv is initialized with the following configuration:

d := diskv.New(diskv.Options{
	BasePath:     c.Dir,
	Transform:    func(s string) []string { return []string{} },
	CacheSizeMax: 1024 * 1024,
})

Do you have any pointers or ideas why this would happen?

(*File).Chmod on Windows

https://github.com/golang/go/blob/bba88467f86472764a656e61f5f3265ed6853692/src/syscall/syscall_windows.go#L1094

os.Chmod always returns the syscall.EWINDOWS error on Windows.

Suggestion: Improve performance by maintaining structs

Currently diskv maintains in-memory values in []byte format. It would be excellent from a performance standpoint if arbitrary structs could be maintained instead, as this would eliminate the overhead of deserialization. I suspect that the performance improvement would be 1-2 orders of magnitude depending on the application.

ensureCacheSpaceWithLock panics with variable sized keys

If variable sized keys are being used, then ensureCacheSpaceWithLock() can panic during the cache cleanup routine which loops through safe(). This occurs if the key being inserted is larger than the last key removed.

It would be better if either this key were inserted anyway (thus exceeding the cache threshold) or the routine were modified so that it cleared some percentage of the total cache size before proceeding.

Cache expiration

I've used this awesome library in a couple of projects now and I think it is a great tool. One thing I'm wondering about is a way to set a cache expiration policy. The policy would dictate when cache keys become stale and should expire and perhaps be erased. Have you considered any additional features to this library, or perhaps strategies to employ regarding this topic?

Thanks!

CAS feature request/offer?

I was looking for a CAS[1].

Seems like it could be a pretty small change to diskv. You would need a cryptographic hash function, like say skein or sha256. Then something like:

d := diskv.New(diskv.Options{
	BasePath:     "my-data-dir",
	Transform:    flatTransform,
	CryptoHash:   sha256,
	CacheSizeMax: 1024 * 1024,
})
key, err := d.put([]byte{'1', '2', '3'})
...
value, err := d.get(key)

I use put/get because they imply (to me anyways) atomic operations, that read/write do not.

Opinions? Alternatives?

If I did something like the above and send a pull request, would you consider it?

[1] http://en.wikipedia.org/wiki/Content-addressable_storage

Implement atomic write to disk

Implement atomic writes by writing into a temporary file (within the same directory, to guarantee that it won't move across disks) with the same permissions, and then renaming it at the last moment to make it available.

please tag formal releases

Please consider assigning version numbers and tagging releases. Tags/releases are useful for downstream package maintainers (in Debian and other distributions) to export source tarballs, automatically track new releases and to declare dependencies between packages. Read more in the Debian Upstream Guide.

Versioning also encourages vendoring a particular (e.g. latest stable) release rather than a random unreleased snapshot.

Versioning provides a safety margin for situations like "oops, we've made a mistake but reverted the problematic commit in master". Presumably a tagged version or release is better tested and reviewed than a random snapshot of the "master" branch.

Thank you.

See also

http://semver.org/

https://en.wikipedia.org/wiki/Software_versioning

Canceling a Keys() walk

We need a done channel on the Keys and KeysPrefix iterators. That would be a breaking API change to the library, however.

I am happy to send a PR but want to get agreement on the API breakage problem first. Based on discussion from https://github.com/coreos/rocket/pull/209#issuecomment-66217292
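
A done (cancel) channel on a channel-based iterator could work along these lines. This is a sketch of the general Go pattern only: walkKeys is a hypothetical function over an in-memory slice, and its signature is an assumption, not the one the library settled on.

```go
package main

import "fmt"

// walkKeys is a hypothetical iterator: it sends keys on the returned channel
// until they run out or cancel is closed, then closes the channel.
func walkKeys(keys []string, cancel <-chan struct{}) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, k := range keys {
			select {
			case out <- k:
			case <-cancel:
				return // the consumer gave up; stop without leaking the goroutine
			}
		}
	}()
	return out
}

func main() {
	cancel := make(chan struct{})
	keys := walkKeys([]string{"alpha", "beta", "gamma"}, cancel)
	fmt.Println(<-keys) // alpha
	close(cancel)       // stop the walk early
	for range keys {
		// Drain until the producer notices cancel and closes the channel.
	}
}
```

Without the cancel case, a consumer that stops reading early would leave the producing goroutine blocked on its send forever.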

concurrent access

I have a use case where two applications read from and write to the disk cache. When reading from the disk cache, an application will first read from the in-memory cache, which in some scenarios will be dirty. Is it possible to always read from the physical disk first?

While walking, InverseTransform is invoked for directories - is this a bug?

Hey @peterbourgon - loving this library and thank you for your stewardship!

Would love to have you take a look at the following lines: https://github.com/peterbourgon/diskv/blob/2566386005f64f58f34e1ff32907800a64537e6a/diskv.go#L597-L601

By my reading, it would be preferable not to call InverseTransform for directories. In my case, I'm using the AdvancedTransform and its inverse in the README, and this is tickling the "panic" line because directories do not carry the expected extension.

What about switching the logic to this? In other words, pass on the directory BEFORE calling InverseTransform.

if info.IsDir() {
	return nil // "pass"
}
key := d.InverseTransform(pathKey)
if !strings.HasPrefix(key, prefix) {
	return nil // "pass"
}

Implement a fix for #66, excessive memory use in siphon.

The siphon will now stop writing to its internal buffer once the size of the buffer exceeds the maximum cache size. Because we write until we exceed the max cache size, we're safe to attempt the cache update even if the buffer only contains partial data, because it's still over the limit & will be rejected.

ReadStream with a very large value results in excessive memory use when cache is enabled

If the cache is enabled, readWithRLock always reads the file using a siphon.

The siphon code copies every byte it reads into a bytes.Buffer. When the full file has been read, that bytes.Buffer is used to update the cache.

However, if the underlying file is e.g. a gigabyte in size, the siphon will end up with a bytes.Buffer containing that entire gigabyte. Unless you've set your cache size to over a gigabyte, this gets thrown away as soon as the ReadStream is done.

The main reason we use ReadStream is so we can deal with very large items without having to stick the entire thing in memory at once. Having discovered this, we'll probably disable the cache, but there are cases where people may wish to have a cache enabled without blowing up their memory!

ReadStream/WriteStream can lead to data races

If someone calls ReadStream, then proceeds to read from it slowly, then someone else calls WriteStream, the reader will start to get the new resource contents part-way through. This is because createKeyFileWithNoLock calls os.OpenFile with O_TRUNC set when updating a file, and the existing reader ends up pointing at the new data.

Related tags

Key-Value Store diskv

Simple Distributed key-value database (in-memory/disk) written with Golang.

Kallbaz DB Simple Distributed key-value store (in-memory/disk) written with Golang. Installation go get github.com/msam1r/kallbaz-db Usage API // Get

Jan 18, 2022

Distributed reliable key-value store for the most critical data of a distributed system

etcd Note: The master branch may be in an unstable or even broken state during development. Please use releases instead of the master branch in order

Dec 28, 2022

Distributed cache and in-memory key/value data store. It can be used both as an embedded Go library and as a language-independent service.

Olric Distributed cache and in-memory key/value data store. It can be used both as an embedded Go library and as a language-independent service. With

Jan 4, 2023

a persistent real-time key-value store, with the same redis protocol with powerful features

a fast NoSQL DB, that uses the same RESP protocol and capable to store terabytes of data, also it integrates with your mobile/web apps to add real-time features, soon you can use it as a document store cause it should become a multi-model db. Redix is used in production, you can use it in your apps with no worries.

Dec 25, 2022

GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

GhostDB is designed to speed up dynamic database or API driven websites by storing data in RAM in order to reduce the number of times an external data source such as a database or API must be read. GhostDB provides a very large hash table that is distributed across multiple machines and stores large numbers of key-value pairs within the hash table.

Jan 6, 2023

Pogreb is an embedded key-value store for read-heavy workloads written in Go.

Embedded key-value store for read-heavy workloads written in Go

Dec 29, 2022

CrankDB is an ultra fast and very lightweight Key Value based Document Store.

CrankDB is a ultra fast, extreme lightweight Key Value based Document Store.

Apr 12, 2022

yakv is a simple, in-memory, concurrency-safe key-value store for hobbyists.

yakv (yak-v. (originally intended to be "yet-another-key-value store")) is a simple, in-memory, concurrency-safe key-value store for hobbyists. yakv provides persistence by appending transactions to a transaction log and restoring data from the transaction log on startup.

Feb 24, 2022

Multithreaded key value pair store using thread safe locking mechanism allowing concurrent reads

ShockV is a simple key-value store with RESTful API

A rest-api that works with golang as an in-memory key value store

Distributed key-value store

Simple in memory key-value store.

A simple in-memory key-value store application

Biscuit is a multi-region HA key-value store for your AWS infrastructure secrets.

An in-memory key:value store/cache (similar to Memcached) library for Go, suitable for single-machine applications.

NutsDB a simple, fast, embeddable and persistent key/value store written in pure Go.

KV - a toy in-memory key value store built primarily in an effort to write more go and check out grpc

PrimeKV is a Secure, REST API driven Key/Value store.