High-performance, columnar, in-memory store with bitmap indexing in Go

kelindar/column

Columnar In-Memory Store with Bitmap Indexing

This package contains a high-performance, columnar, in-memory storage engine that supports fast querying, updates and iteration with zero allocations and bitmap indexing.

Features

  • Optimized, cache-friendly columnar data layout that minimizes cache-misses.
  • Optimized for zero heap allocation during querying (see benchmarks below).
  • Optimized batch updates/deletes; an update during a transaction takes around 12ns.
  • Support for SIMD-enabled filtering (i.e. "where" clause) by leveraging bitmap indexing.
  • Support for columnar projection (i.e. "select" clause) for fast retrieval.
  • Support for computed indexes that are dynamically calculated based on a provided predicate.
  • Support for concurrent updates per-column (e.g. 2 goroutines can update 2 columns concurrently).
  • Support for transaction isolation, allowing you to create transactions and commit/rollback.
  • Support for expiration of rows based on time-to-live or expiration column.
  • Support for atomic increment/decrement of numerical values, transactionally.
  • Support for change data stream that streams all commits consistently.

Documentation

The general idea is to leverage cache-friendly ways of organizing data in structures of arrays (SoA), otherwise known as "columnar" storage in database design. This, in turn, allows us to iterate and filter over columns very efficiently. On top of that, this package also adds bitmap indexing to the columnar storage, allowing you to build filter queries using binary and, and not, or and xor (see kelindar/bitmap with SIMD support).
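
For intuition, here is a minimal sketch contrasting the two layouts; the type names are illustrative and not part of the package:

// Array of structures (AoS): one struct per row, fields interleaved in memory.
type Player struct {
	Name    string
	Balance float64
}

// Structure of arrays (SoA), i.e. "columnar": each column is stored
// contiguously, so scanning a single column touches far fewer cache lines.
type PlayerColumns struct {
	Name    []string
	Balance []float64
}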

Collection and Columns

In order to get data into the store, you'll need to first create a Collection by calling the NewCollection() method. Each collection requires a schema, which can be either specified manually by calling CreateColumn() multiple times, or automatically inferred from an object by calling the CreateColumnsOf() function.

In the example below we're loading some JSON data by using json.Unmarshal() and auto-creating columns based on the first element of the loaded slice. After this is done, we can load our data by inserting the objects one by one into the collection. This is accomplished by repeatedly calling the Insert() method on the collection itself.

data := loadFromJson("players.json")

// Create a new columnar collection
players := column.NewCollection()
players.CreateColumnsOf(data[0])

// Insert every item from our loaded data
for _, v := range data {
	players.Insert(v)
}

Now, let's say we only want specific columns to be added. We can do this by manually calling the CreateColumn() method on the collection to create the required columns.

// Create a new columnar collection with pre-defined columns
players := column.NewCollection()
players.CreateColumn("name", column.ForString())
players.CreateColumn("class", column.ForString())
players.CreateColumn("balance", column.ForFloat64())
players.CreateColumn("age", column.ForInt16())

// Insert every item from our loaded data
for _, v := range loadFromJson("players.json") {
	players.Insert(v)
}

While the previous example demonstrated how to insert many objects, it was doing so one by one, which is rather inefficient. This is due to the fact that each Insert() call directly on the collection initiates a separate transaction, and there's a small performance cost associated with it. If you want to insert many values faster, in bulk, you can call Insert() on a transaction, as demonstrated in the example below. Note that the only difference is instantiating a transaction by calling the Query() method and calling the txn.Insert() method on the transaction instead of the one on the collection.

players.Query(func(txn *column.Txn) error {
	for _, v := range loadFromJson("players.json") {
		txn.Insert(v)
	}
	return nil // Commit
})

Querying and Indexing

The store allows you to query the data based on the presence of certain attributes or their values. In the example below we are querying our collection and applying a filtering operation by using the WithValue() method on the transaction. This method scans the values and checks whether a certain predicate evaluates to true. In this case, we're scanning through all of the players and looking up their class; if their class is equal to "rogue", we'll take it. At the end, we're calling the Count() method, which simply counts the result set.

// This query performs a full scan of "class" column
players.Query(func(txn *column.Txn) error {
	txn.WithValue("class", func(v interface{}) bool {
		return v == "rogue"
	}).Count()
	return nil
})

Now, what if we need to run this query very often? It is possible to simply create an index with the same predicate and have this computation applied every time (a) an object is inserted into the collection and (b) a value of the dependent column is updated. Let's look at the example below: we're first creating a rogue index which depends on the "class" column. This index applies the same predicate, which only returns true if a class is "rogue". We can then query this by simply calling the With() method and providing the index name.

An index is essentially akin to a boolean column, so you could technically also select its value when querying it. Now, in this example the query would be around 10-100x faster to execute, as behind the scenes it uses bitmap indexing for the "rogue" index and performs a simple logical AND operation on two bitmaps when querying. This avoids scanning and applying a predicate entirely during the query.

// Create the index "rogue" in advance
players.CreateIndex("rogue", "class", func(v interface{}) bool {
	return v == "rogue"
})

// This returns the same result as the query before, but much faster
players.Query(func(txn *column.Txn) error {
	txn.With("rogue").Count()
	return nil
})

The query can be further expanded as it allows indexed intersection, difference and union operations. This allows you to ask more complex questions of a collection. In the examples below let's assume we have a bunch of indexes on the class column and we want to ask different questions.

First, let's try to merge two selections by applying a union, using the aptly named Union() method. Here, we first select only rogues, but then merge them together with mages, resulting in a selection containing both rogues and mages.

// How many rogues and mages?
players.Query(func(txn *column.Txn) error {
	txn.With("rogue").Union("mage").Count()
	return nil
})

Next, let's count everyone who isn't a rogue. For that we can use the Without() method, which performs a difference (i.e. a binary AND NOT operation) on the selection. This results in a count of all players in the collection except the rogues.

// How many players are not rogues?
players.Query(func(txn *column.Txn) error {
	txn.Without("rogue").Count()
	return nil
})

Now, you can combine all of the methods and keep building more complex queries. When querying indexed and non-indexed fields together, it is important to know that every scan applies only to the current selection, speeding up the query. So if you have a filter on a specific index that selects 50% of the players and you then perform a scan on that selection (e.g. WithValue()), it will only scan those 50% and hence be roughly 2x faster.

// How many rogues that are over 30 years old?
players.Query(func(txn *column.Txn) error {
	txn.With("rogue").WithFloat("age", func(v float64) bool {
		return v >= 30
	}).Count()
	return nil
})

Iterating over Results

In all of the previous examples, we've only been doing the Count() operation, which counts the number of elements in the result set. In this section we'll look at how to iterate over the result set. In short, there are two main methods that allow us to do it:

  1. Range() method which takes in a column name as an argument and allows faster get/set of the values for that column.
  2. Select() method which doesn't pre-select any specific column, so it's usually a bit slower and it also does not allow any updates.

Let's first examine the Range() method. In the example below we select all of the rogues from our collection and print out their names by using the Range() method and providing the "name" column to it. The callback containing the Cursor allows us to quickly get the value of the column by calling the String() method to retrieve a string value. It also contains methods such as Int(), Uint(), Float() or the more generic Value() to pull data of different types.

players.Query(func(txn *column.Txn) error {
	txn.With("rogue").Range("name", func(v column.Cursor) bool {
		println("rogue name ", v.String()) // Prints the name
		return true
	})
	return nil
})

Now, what if you need two columns? Range() only allows you to quickly select a single column, but you can still retrieve other columns by their name during the iteration. This can be accomplished by the corresponding StringAt(), FloatAt(), IntAt(), UintAt() or ValueAt() methods, as shown below.

players.Query(func(txn *column.Txn) error {
	txn.With("rogue").Range("name", func(v column.Cursor) bool {
		println("rogue name ", v.String())    // Prints the name
		println("rogue age ", v.IntAt("age")) // Prints the age
		return true
	})
	return nil
})

On the other hand, Select() allows you to do a read-only selection which provides a Selector cursor. This cursor does not allow any updates, deletes or inserts, and also does not pre-select any particular column. In the example below we print out the names of all of the rogues using a selector.

players.Query(func(txn *column.Txn) error {
	txn.With("rogue").Select(func(v column.Selector) bool {
		println("rogue name ", v.StringAt("name")) // Prints the name
		return true
	})
	return nil
})

Now, what if you need to quickly delete some or all of the data in the collection? In this case the DeleteAll() or DeleteIf() methods come in handy. These methods are very fast (especially DeleteAll()) and allow you to quickly delete the appropriate results, transactionally. In the example below we delete all of the rogues from the collection by simply selecting them in the transaction and calling the DeleteAll() method; a DeleteIf() sketch follows it.

players.Query(func(txn *column.Txn) error {
	txn.With("rogue").DeleteAll()
	return nil
})
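
For conditional deletion, a minimal sketch using DeleteIf() is shown below; the predicate over a column.Selector is an assumption based on the library's other callback APIs, and the zero-balance condition is purely illustrative.

players.Query(func(txn *column.Txn) error {
	txn.With("rogue").DeleteIf(func(v column.Selector) bool {
		return v.FloatAt("balance") == 0 // Delete rogues with a zero balance
	})
	return nil
})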

Updating Values

In order to update certain items in the collection, you can simply call the Range() method and the corresponding Cursor's Update() or UpdateAt() methods, which allow you to update the value of a certain column atomically. The updates won't be reflected immediately: since the store supports transactions, the update is applied to the collection only once the transaction is committed. This allows for isolation and rollbacks.

In the example below we're selecting all of the rogues and updating both their balance and age to certain values. The transaction returns nil, hence it will be automatically committed when the Query() method returns.

players.Query(func(txn *column.Txn) error {
	txn.With("rogue").Range("balance", func(v column.Cursor) bool {
		v.Update(10.0)        // Update the "balance" to 10.0
		v.UpdateAt("age", 50) // Update the "age" to 50
		return true
	}) // Select the balance
	return nil
})

In certain cases, you might want to atomically increment or decrement numerical values. In order to accomplish this you can use the provided Add() or AddAt() operations of the Cursor or Selector. Note that the indexes will also be updated accordingly and the predicates re-evaluated with the most up-to-date values. In the example below we're incrementing the balance of all our rogues by 500, atomically.

players.Query(func(txn *column.Txn) error {
	txn.With("rogue").Range("balance", func(v column.Cursor) bool {
		v.Add(500.0) // Increment the "balance" by 500
		return true
	})
	return nil
})
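
Similarly, AddAt() lets you increment a column other than the one pre-selected by Range(); a small sketch, assuming it mirrors UpdateAt():

players.Query(func(txn *column.Txn) error {
	txn.With("rogue").Range("balance", func(v column.Cursor) bool {
		v.Add(500.0)      // Increment the "balance" by 500
		v.AddAt("age", 1) // Increment the "age" by 1
		return true
	})
	return nil
})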

Expiring Values

Sometimes, it is useful to automatically delete certain rows when you do not need them anymore. To support this, the library automatically adds an expire column to each new collection and asynchronously starts a cleanup goroutine that runs periodically and evicts expired objects. To set this up, you can simply use the InsertWithTTL() method on the collection, which allows you to insert an object with a defined time-to-live duration.

In the example below we are inserting an object to the collection and setting the time-to-live to 5 seconds from the current time. After this time, the object will be automatically evicted from the collection and its space can be reclaimed.

players.InsertWithTTL(map[string]interface{}{
	"name": "Merlin",
	"class": "mage",
	"age": 55,
	"balance": 500,
}, 5 * time.Second) // The time-to-live of 5 seconds

On an interesting note, since the expire column that is automatically added to each collection is an actual, normal column, you can query and even update it. In the example below we query and conditionally update the expiration column. The example loads the time, adds one hour and updates it, but in practice you should use the Add() method, which can perform this atomically (see the sketch after the example).

players.Query(func(txn *column.Txn) error {
	return txn.Range("expire", func(v column.Cursor) bool {
		oldExpire := time.Unix(0, v.Int())                   // Convert expiration to time.Time
		newExpire := oldExpire.Add(1 * time.Hour).UnixNano() // Add one hour
		v.Update(newExpire)
		return true
	})
})
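
The atomic variant mentioned above could look like the following sketch, assuming Add() accepts the nanosecond delta for the int64-backed expire column:

players.Query(func(txn *column.Txn) error {
	return txn.Range("expire", func(v column.Cursor) bool {
		v.Add(int64(time.Hour)) // Extend the expiration by 1 hour, atomically
		return true
	})
})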

Transaction Commit and Rollback

Transactions allow for isolation between two concurrent operations. In fact, all of the batch queries must go through a transaction in this library. The Query method requires a function which takes in a column.Txn pointer which contains various helper methods that support querying. In the example below we're trying to iterate over all of the players and update their balance by setting it to 10.0. The Query method automatically calls txn.Commit() if the function returns without any error. On the flip side, if the provided function returns an error, the query will automatically call txn.Rollback() so none of the changes will be applied.

// Range over all of the players and successfully update their balance
players.Query(func(txn *column.Txn) error {
	txn.Range("balance", func(v column.Cursor) bool {
		v.Update(10.0) // Update the "balance" to 10.0
		return true
	})

	// No error, txn.Commit() will be called
	return nil
})

Now, in this example, we try to update the balance, but the query callback returns an error, in which case none of the updates will actually be reflected in the underlying collection.

// Range over all of the players and attempt to update their balance
players.Query(func(txn *column.Txn) error {
	txn.Range("balance", func(v column.Cursor) bool {
		v.Update(10.0) // Update the "balance" to 10.0
		return true
	})

	// Returns an error, txn.Rollback() will be called
	return fmt.Errorf("bug") 
})

You can (but probably won't need to) call Commit() or Rollback() manually, as many times as required. This can be handy for partial updates, but calling them too often will have a performance hit on your application.

// Range over all of the players, update their balance and commit manually
players.Query(func(txn *column.Txn) error {
	txn.Range("balance", func(v column.Cursor) bool {
		v.Update(10.0) // Update the "balance" to 10.0
		return true
	})

	txn.Commit() // Manually commit all of the changes
	return nil   // This will call txn.Commit() again, but will be a no-op
})
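
Below is a sketch of a partial update using a manual Rollback(), assuming a rollback discards only the changes staged since the last commit:

players.Query(func(txn *column.Txn) error {
	txn.Range("balance", func(v column.Cursor) bool {
		v.Update(10.0)
		return true
	})
	txn.Rollback() // Discard the balance updates above

	txn.Range("age", func(v column.Cursor) bool {
		v.Update(50)
		return true
	})
	return nil // Commit only the age updates
})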

Streaming Changes

This library also supports streaming out all transaction commits consistently, as they happen. This allows you to implement your own change data capture (CDC) listeners, stream data into Kafka or into a remote database for durability. In order to enable it, you can simply provide an implementation of the commit.Writer interface during the creation of the collection.

In the example below we take advantage of the commit.Channel implementation of a commit.Writer which simply publishes the commits into a go channel. Here we create a buffered channel and keep consuming the commits with a separate goroutine, allowing us to view transactions as they happen in the store.

// Create a new commit writer (simple channel) and a new collection
writer  := make(commit.Channel, 1024)
players := column.NewCollection(column.Options{
	Writer: writer,
})

// Read the changes from the channel
go func(){
	for change := range writer {
		println("commit", change.Type.String())
	}
}()

// ... insert, update or delete

On a separate note, this change stream is guaranteed to be consistent and serialized. This means that you can also replay those changes on another database and keep both synchronized. In fact, this library also provides a Replay() method on the collection that allows you to do just that. In the example below we create two collections, primary and replica, and asynchronously replicate all of the commits from the primary to the replica using the Replay() method together with the change stream.

// Create a primary collection
writer  := make(commit.Channel, 1024)
primary := column.NewCollection(column.Options{
	Writer: &writer,
})
primary.CreateColumnsOf(object)

// Replica with the same schema
replica := column.NewCollection()
replica.CreateColumnsOf(object)

// Keep 2 collections in sync
go func() {
	for change := range writer {
		replica.Replay(change)
	}
}()

Complete Example

func main(){

	// Create a new columnar collection
	players := column.NewCollection()

	// index on humans
	players.CreateIndex("human", "race", func(v interface{}) bool {
		return v == "human"
	})

	// index for mages
	players.CreateIndex("mage", "class", func(v interface{}) bool {
		return v == "mage"
	})

	// index for old
	players.CreateIndex("old", "age", func(v interface{}) bool {
		return v.(float64) >= 30
	})

	// Load the items into the collection
	loaded := loadFixture("players.json")
	players.CreateColumnsOf(loaded[0])
	for _, v := range loaded {
		players.Insert(v)
	}

	// This performs a full scan on 3 different columns and compares them given the 
	// specified predicates. This is not indexed, but does a columnar scan which is
	// cache-friendly.
	players.Query(func(txn *column.Txn) error {
		println(txn.WithString("race", func(v string) bool {
			return v == "human"
		}).WithString("class", func(v string) bool {
			return v == "mage"
		}).WithFloat("age", func(v float64) bool {
			return v >= 30
		}).Count()) // prints the count
		return nil
	})

	// This performs a count, but instead of scanning through the entire dataset, it scans
	// over pre-built indexes and combines them using a logical AND operation. The result
	// will be the same as the query above but the performance of the query is 10x-100x
	// faster depending on the size of the underlying data.
	players.Query(func(txn *column.Txn) error {
		println(txn.With("human", "mage", "old").Count()) // prints the count
		return nil
	})

	// Same condition as above, but we also select the actual names of those 
	// players and iterate through them.
	players.Query(func(txn *column.Txn) error {
		txn.With("human", "mage", "old").Range("name", func(v column.Cursor) bool {
			println(v.String()) // prints the name
			return true
		}) // The column to select
		return nil
	})
}

Benchmarks

The benchmarks below were run on a collection of 500 items containing a dozen columns. Feel free to explore the benchmarks, but I strongly recommend testing them on your actual dataset.

cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkCollection/insert-8         5013795      239.9 ns/op    27 B/op     0 allocs/op
BenchmarkCollection/fetch-8         23730796       50.63 ns/op    0 B/op     0 allocs/op
BenchmarkCollection/scan-8            234990     4743 ns/op       0 B/op     0 allocs/op
BenchmarkCollection/count-8          7965873      152.7 ns/op     0 B/op     0 allocs/op
BenchmarkCollection/range-8          1512513      799.9 ns/op     0 B/op     0 allocs/op
BenchmarkCollection/update-at-8      5409420      224.7 ns/op     0 B/op     0 allocs/op
BenchmarkCollection/update-all-8      196626     6099 ns/op       0 B/op     0 allocs/op
BenchmarkCollection/delete-at-8      2006052      594.9 ns/op     0 B/op     0 allocs/op
BenchmarkCollection/delete-all-8     1889685      643.2 ns/op     0 B/op     0 allocs/op

When testing on larger collections, I added a small example (see the examples folder) and ran it with 20 million rows inserted; each entry has 12 columns and 4 indexes that need to be calculated, with a few queries and scans around them.

running insert of 20000000 rows...
-> insert took 52.8255618s

running full scan of age >= 30...
-> result = 10200000
-> full scan took 176.01008ms

running full scan of class == "rogue"...
-> result = 7160000
-> full scan took 196.153362ms

running indexed query of human mages...
-> result = 1360000
-> indexed query took 581.158µs

running indexed query of human female mages...
-> result = 640000
-> indexed query took 753.122µs

running update of balance of everyone...
-> updated 20000000 rows
-> update took 301.888912ms

running update of age of mages...
-> updated 6040000 rows
-> update took 93.835876ms

Contributing

We are open to contributions, feel free to submit a pull request and we'll review it as quickly as we can. This library is maintained by Roman Atachiants.

License

This library is licensed under the MIT License.

Comments
  • Sum functionality for Numeric columns in transactions

    Sum functionality for Numeric columns in transactions

    This PR introduces a simple sum aggregation for transactions. It adds 3 functions, txn.SumInt64(), txn.SumUint64(), & txn.SumFloat64(), that sum the given Numeric type column with respect to the txn's current index, via txn_lock's rangeRead().

    Since there are no prior built-in aggregation functions in the package, I understand if the goal is to let the user handle that functionality via a txn.Range() call. However, I believe that more work could be done to further parallelize this operation (read & add all chunks concurrently?). Due to the data access required for this, column itself seems like the best place to account for simple aggregations.
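
    A hypothetical usage of the proposed aggregation, assuming the txn.SumFloat64(column) signature described above:

    players.Query(func(txn *column.Txn) error {
    	total := txn.With("rogue").SumFloat64("balance")
    	println("total rogue balance", total)
    	return nil
    })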

  • Support 'AND (... OR ... OR ...)' queries

    Support 'AND (... OR ... OR ...)' queries

    Introduces WithUnion, which operates similarly to Union, but will allocate a separate bitmap (1) for calculating the multi-OR before applying it to the txn's current index via an AND. This could be used to solve #55.

    I'd still like to clean up the bitmap comparison logic and extend the testing to use players
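
    A hypothetical example of the proposed method, answering an 'AND (... OR ...)' question:

    // Humans that are either mages or rogues (sketch of the proposed API)
    players.Query(func(txn *column.Txn) error {
    	txn.With("human").WithUnion("mage", "rogue").Count()
    	return nil
    })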

  • SortedIndex baseline implementation

    SortedIndex baseline implementation

    This PR introduces a new Sorted Index that keeps an actively sorted b-tree (github.com/tidwall/btree) for a column of the user's choosing (currently limited to string-type only). The index holds one b-tree that is not copied between transactions (mutexed).

    Future work would consider other type columns being sorted (currently only string columns), PK sorting, and custom Less() functionality for users.

  • index out of range

    index out of range

    Hello. Found a bug. It doesn't appear on every execution.

    Stack trace:

    panic: runtime error: index out of range [3030211] with length 3030211
    
    goroutine 70 [running]:
    github.com/kelindar/column.(*columnfloat64).Apply(0xc0004b6570, 0xc0c332ecc0)
            /root/go/golibs/src/github.com/kelindar/column/column_numbers.go:205 +0x1ad
    github.com/kelindar/column.(*column).Apply(0xc0004b6660, 0xc0c332ecc0, 0x0)
            /root/go/golibs/src/github.com/kelindar/column/column.go:148 +0x5b
    github.com/kelindar/column.(*Txn).commitUpdates.func1(0xc0c332ecc0)
            /root/go/golibs/src/github.com/kelindar/column/txn.go:445 +0x7e
    github.com/kelindar/column/commit.(*Reader).Range(0xc0c332ecc0, 0xc000092780, 0xb8, 0xc0009077d8)
            /root/go/golibs/src/github.com/kelindar/column/commit/reader.go:219 +0x11e
    github.com/kelindar/column.(*Txn).commitUpdates(0xc0c30fc6e0, 0xb8, 0xc006cdc000)
            /root/go/golibs/src/github.com/kelindar/column/txn.go:439 +0x14b
    github.com/kelindar/column.(*Txn).commit.func2(0xc0000000b8, 0xc006cdc000, 0x100, 0x4800)
            /root/go/golibs/src/github.com/kelindar/column/txn.go:408 +0x1f0
    github.com/kelindar/column.(*Txn).rangeWrite.func1(0xc0000000b8)
            /root/go/golibs/src/github.com/kelindar/column/txn_lock.go:71 +0x13b
    github.com/kelindar/bitmap.Bitmap.Range(0xc0c3c2b2c0, 0x3, 0x4, 0xc000907a08)
            /root/go/golibs/src/github.com/kelindar/bitmap/range.go:28 +0xbc
    github.com/kelindar/column.(*Txn).rangeWrite(0xc0c30fc6e0, 0xc000907a78)
            /root/go/golibs/src/github.com/kelindar/column/txn_lock.go:62 +0x7f
    github.com/kelindar/column.(*Txn).commit(0xc0c30fc6e0)
            /root/go/golibs/src/github.com/kelindar/column/txn.go:402 +0x1a6
    github.com/kelindar/column.(*Collection).Query(0xc0000b2000, 0xc09ba1fc38, 0x1485b00, 0x28)
            /root/go/golibs/src/github.com/kelindar/column/collection.go:264 +0xff
    

    My code causing the issue:

    e = d.Query(func(txn *column.Txn) (e error) {
    	_ = txn.Range("col2", func(v column.Cursor) {
    		v.SetFloat64(0)
    	})
    	return txn.Range("col1", func(v column.Cursor) {
    		if a, h := someMap[v.String()]; h {
    			v.SetFloat64At("col2", a[1].(float64))
    			v.SetStringAt("col3", a[0].(string))
    		}
    	})
    })
    
  • Migrated to sharded latch & commit buffer

    Migrated to sharded latch & commit buffer

    This PR introduces a couple of major changes in this library.

    1. The in-memory commit buffer is now essentially []byte which can be potentially written to disk in future for disaster recovery, see #5. This commit buffer stores various operations and their offsets and has been optimized.
    2. While keeping the isolation level at read committed, it introduces a sharded mutex, as discussed in #6. This increases the overall concurrency of a single large collection. The collection is essentially divided into "chunks" of 16K elements and we're now using 128 latches (a sharded RWMutex) in order to control concurrent access.

    The performance hit of the sharded mutex is relatively small, and in the case of updates the new commit buffer actually reduces heap allocations, as it no longer requires allocating an interface{} on the heap.
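
    A minimal sketch of the chunk-to-latch mapping described above; the names and constants are illustrative, not the library's actual internals:

    import "sync"

    const (
    	chunkSize  = 16384 // each chunk covers 16K elements
    	latchCount = 128   // number of sharded RWMutexes
    )

    type latches [latchCount]sync.RWMutex

    // latchFor returns the latch guarding the chunk that contains the given
    // offset, so writers only block other writers touching the same chunk.
    func (l *latches) latchFor(offset uint32) *sync.RWMutex {
    	chunk := offset / chunkSize
    	return &l[chunk%latchCount]
    }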

    Benchmark go1.16

    I ran a small benchmark with various workloads (90% read / 10% write, etc) on a collection of 1 million elements with different goroutine pools. In this example we're combining two types of transactions:

    • Read transactions that query a random index and iterate over the results over a single column.
    • Write transactions that update a random element (point-write).

    Note that the goal of this benchmark is to validate concurrency, not throughput; this represents the current "best" case scenario, where the updates are random and less likely to incur contention. Reads, however, quite often hit the same chunks, as only the index itself is randomized.

    As expected, it scales quite well to a large number of goroutines, unless the workload is extremely write-heavy, in which case exclusive latches on chunks lead to contention across the board and decrease performance.

    90%-10%       1 procs      143,221,213 read/s         70 write/s
    90%-10%       8 procs    1,081,511,102 read/s        483 write/s
    90%-10%      16 procs    1,068,562,727 read/s        455 write/s
    90%-10%      32 procs    1,042,382,561 read/s        442 write/s
    90%-10%      64 procs    1,039,644,346 read/s        446 write/s
    90%-10%     128 procs    1,049,228,432 read/s        442 write/s
    90%-10%     256 procs    1,027,362,194 read/s        477 write/s
    90%-10%     512 procs    1,023,097,576 read/s        457 write/s
    90%-10%    1024 procs      996,585,722 read/s        436 write/s
    90%-10%    2048 procs      948,455,719 read/s        494 write/s
    90%-10%    4096 procs      930,094,338 read/s        540 write/s
    50%-50%       1 procs      142,015,047 read/s        598 write/s
    50%-50%       8 procs    1,066,028,881 read/s      4,300 write/s
    50%-50%      16 procs    1,039,210,987 read/s      4,191 write/s
    50%-50%      32 procs    1,042,789,993 read/s      4,123 write/s
    50%-50%      64 procs    1,040,410,050 read/s      4,102 write/s
    50%-50%     128 procs    1,006,464,963 read/s      4,008 write/s
    50%-50%     256 procs    1,008,663,071 read/s      4,170 write/s
    50%-50%     512 procs      989,864,228 read/s      4,146 write/s
    50%-50%    1024 procs      998,826,089 read/s      4,258 write/s
    50%-50%    2048 procs      939,110,917 read/s      4,515 write/s
    50%-50%    4096 procs      866,137,428 read/s      5,291 write/s
    10%-90%       1 procs      135,493,165 read/s      4,968 write/s
    10%-90%       8 procs    1,017,928,553 read/s     37,130 write/s
    10%-90%      16 procs    1,040,251,193 read/s     37,521 write/s
    10%-90%      32 procs      982,115,784 read/s     35,689 write/s
    10%-90%      64 procs      975,158,264 read/s     34,041 write/s
    10%-90%     128 procs      940,466,888 read/s     34,827 write/s
    10%-90%     256 procs      930,871,315 read/s     34,399 write/s
    10%-90%     512 procs      892,502,438 read/s     33,955 write/s
    10%-90%    1024 procs      834,594,229 read/s     32,953 write/s
    10%-90%    2048 procs      785,583,770 read/s     32,882 write/s
    10%-90%    4096 procs      688,402,474 read/s     34,646 write/s
    

    Benchmark go1.17beta1

    I also ran the exact same benchmark on go 1.17 beta1 and the improvement is quite impressive. I suspect this is because of the major change in https://github.com/golang/go/issues/40724 by the amazing Go team.

    90%-10%       1 procs      237,130,690 read/s        112 write/s
    90%-10%       8 procs    1,651,884,038 read/s        717 write/s
    90%-10%      16 procs    1,604,529,778 read/s        684 write/s
    90%-10%      32 procs    1,568,422,932 read/s        705 write/s
    90%-10%      64 procs    1,368,854,176 read/s        603 write/s
    90%-10%     128 procs    1,376,234,760 read/s        601 write/s
    90%-10%     256 procs    1,444,827,685 read/s        634 write/s
    90%-10%     512 procs    1,382,944,862 read/s        630 write/s
    90%-10%    1024 procs    1,385,708,505 read/s        641 write/s
    90%-10%    2048 procs    1,400,975,478 read/s        678 write/s
    90%-10%    4096 procs    1,272,429,528 read/s        645 write/s
    50%-50%       1 procs      240,843,311 read/s        949 write/s
    50%-50%       8 procs    1,658,665,375 read/s      6,591 write/s
    50%-50%      16 procs    1,653,341,392 read/s      6,674 write/s
    50%-50%      32 procs    1,558,058,949 read/s      6,176 write/s
    50%-50%      64 procs    1,430,884,504 read/s      5,751 write/s
    50%-50%     128 procs    1,451,153,699 read/s      5,661 write/s
    50%-50%     256 procs    1,443,416,127 read/s      5,726 write/s
    50%-50%     512 procs    1,355,457,178 read/s      5,645 write/s
    50%-50%    1024 procs    1,249,493,888 read/s      5,221 write/s
    50%-50%    2048 procs    1,162,011,258 read/s      5,484 write/s
    50%-50%    4096 procs    1,102,741,629 read/s      5,286 write/s
    10%-90%       1 procs      203,623,696 read/s      7,311 write/s
    10%-90%       8 procs    1,062,318,113 read/s     38,339 write/s
    10%-90%      16 procs    1,077,146,140 read/s     37,950 write/s
    10%-90%      32 procs    1,068,210,272 read/s     38,919 write/s
    10%-90%      64 procs    1,098,461,537 read/s     39,370 write/s
    10%-90%     128 procs    1,035,202,595 read/s     37,986 write/s
    10%-90%     256 procs    1,020,512,476 read/s     38,001 write/s
    10%-90%     512 procs    1,147,669,716 read/s     41,260 write/s
    10%-90%    1024 procs    1,103,880,028 read/s     42,898 write/s
    10%-90%    2048 procs      973,620,731 read/s     41,779 write/s
    10%-90%    4096 procs      887,719,199 read/s     41,167 write/s
    
  • Running in the background?

    Running in the background?

    Hi, thanks for this library, it's extremely helpful.

    This might be a more general Go question so I apologise in advance if it is, or a really silly question... how do I go about running this in-memory database on a webserver, so code executed from an inbound HTTP request can access it and run queries & insert data into an already established column database?

    For context, I'm new to GoLang and come from an Elixir background, where we'd spin up a lightweight process to keep the ets (in-memory) cache alive, and then each HTTP request is handled by a separate lightweight process and would be allowed access to that ets memory space or send messages to the process with the in-memory db. Is there something similar in GoLang to achieve this? 🤔

    Thanks so much 🙏🏼
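
    A minimal sketch of one common answer in Go: create the collection once at startup and let each HTTP handler close over it. The collection manages its own locking, so concurrent handlers can query it safely; the route and column below are made up for illustration.

    package main

    import (
    	"fmt"
    	"net/http"

    	"github.com/kelindar/column"
    )

    func main() {
    	players := column.NewCollection()
    	players.CreateColumn("name", column.ForString())

    	// Every request runs on its own goroutine; all of them share the
    	// same collection through the closure.
    	http.HandleFunc("/count", func(w http.ResponseWriter, r *http.Request) {
    		players.Query(func(txn *column.Txn) error {
    			fmt.Fprintf(w, "%d players\n", txn.Count())
    			return nil
    		})
    	})
    	http.ListenAndServe(":8080", nil)
    }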

  • Fix case when union is first called indexing function

    Fix case when union is first called indexing function

    When a transaction is first initialized, the internal index is set to the owner collection's fill list. When a Union call is applied to this fresh transaction, the index either remains unchanged or now marks unfilled bits as filled.

  • Union is buggy

    Union is buggy

    package main
    
    import (
    	"fmt"
    
    	"github.com/kelindar/column"
    )
    
    func main() {
    	statistics := column.NewCollection()
    
    	// schema
    	(statistics.CreateColumn("d_a", column.ForString()))
    
    	statistics.CreateIndex("d_a_1", "d_a", func(r column.Reader) bool { return r.String() == "1" })
    	statistics.CreateIndex("d_a_2", "d_a", func(r column.Reader) bool { return r.String() == "2" })
    	statistics.CreateIndex("d_a_3", "d_a", func(r column.Reader) bool { return r.String() == "3" })
    
    	// insert
    	statistics.InsertObject(map[string]interface{}{
    		"d_a": "1",
    	})
    	statistics.InsertObject(map[string]interface{}{
    		"d_a": "2",
    	})
    	statistics.InsertObject(map[string]interface{}{
    		"d_a": "3",
    	})
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_1").Count()) // 1, correct
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_2").Count()) // 1, correct
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_3").Count()) // 1, correct
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.Union("d_a_1", "d_a_2").Count()) // 3, incorrect, should be 2
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.Union("d_a_1", "d_a_3").Count()) // 3, incorrect, should be 2
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.Union("d_a_2", "d_a_3").Count()) // 3, incorrect, should be 2
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_1", "d_a_2").Count()) // 0, correct
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_1", "d_a_3").Count()) // 0, correct
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_2", "d_a_3").Count()) // 0, correct
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_1").Union("d_a_2").Count()) // 2, correct
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_1").Union("d_a_3").Count()) // 2, correct
    		return nil
    	}))
    
    	(statistics.Query(func(tx *column.Txn) error {
    		fmt.Println(tx.With("d_a_2").Union("d_a_3").Count()) // 2, correct
    		return nil
    	}))
    }
    
  • Different results from query

    Different results from query

    When I query with WithValue or WithString, I get different results.

    func TestQuery(t *testing.T) {
    	model := column.NewCollection()
    	model.CreateColumnsOf(map[string]interface{}{
    		"ID": "",
    	})
    
    	model.Query(func(txn *column.Txn) error {
    		for i := 0; i < 20000; i++ {
    
    			txt := fmt.Sprint(i)
    			txn.Insert(map[string]interface{}{
    				"ID": txt,
    			})
    		}
    		return nil
    	})
    
    	model.Query(func(txn *column.Txn) error {
    		fmt.Println(txn.WithValue("ID", func(v interface{}) bool {
    			return v.(string) == "5"
    		}).Count())
    		fmt.Println(txn.WithString("ID", func(v string) bool {
    			return v == "5"
    		}).Count())
    		return nil
    	})
    }
    

    The problem gets more pronounced the higher the loop count. For example, at 10,000 there are no issues, but at 1,000,000 there are 62. The first instance happens at 16,390 loops.

    Changing the txt format to txt := fmt.Sprintf("%2d", i) removes the problem (I tried up to 100,000,000 iterations).

    I noticed this problem when generating with UUIDs; the example above is just simpler to reproduce.

  • [Feature Request]: Simpler collection creation & data insertion & query

    [Feature Request]: Simpler collection creation & data insertion & query

    First, thank you very much for building such a cool project. I have used it in my project as a cache and it works well. After some time of usage, I found that collection creation, data insertion and querying could be simpler.

    In my case, I always use it to store a struct. Every time I need to follow these steps. To save: 1) create a collection, 2) create the columns, 3) insert the row and bind it. To query: 1) query the row & bind it, 2) construct a struct.

    Say I have such a struct:

    type Food struct {
    	Id       string `json:"id"`
    	Category string `json:"category"`
    }

    In the case where I save a Food, I hope it would create the Collection & Columns automatically by using the json tags, as in most cases even a simple key-value is still a struct. If it supplied such a simpler API, it would be much simpler and reduce a lot of coding.

    For data queries, I hope it can automatically unmarshal into a struct, by matching the queried properties against the struct's json tags.

  • Expose race case with ubuntu container

    Expose race case with ubuntu container

    Hello,

    I've managed to reliably replicate the race case found in multiple git workflow runs. This requires using a docker container that mocks the specifications of the Github Actions environment, while increasing the exposure area of the edge case within TestSnapshot.

  • how do I efficiently query for unique values of a field

    how do I efficiently query for unique values of a field

    Say I get a stream of data: {machineCode: "", lat: , lon: } and I want to display a count of such datums per machineCode.

    Is there a way to efficiently get all the unique machine codes? or should I just keep track of them while inserting data?
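
    A sketch of the second option, keeping track of unique machine codes at insert time, since the store has no built-in distinct aggregation; the collection and row shape here are assumptions:

    var (
    	mu   sync.Mutex
    	seen = make(map[string]struct{})
    )

    func insert(machines *column.Collection, code string, lat, lon float64) {
    	mu.Lock()
    	seen[code] = struct{}{} // len(seen) is the count of unique machine codes
    	mu.Unlock()

    	machines.Insert(map[string]interface{}{
    		"machineCode": code,
    		"lat":         lat,
    		"lon":         lon,
    	})
    }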

  • [Feature Request] set pk column after columns creation

    [Feature Request] set pk column after columns creation

    When creating columns with CreateColumnsOf as below, there is no way to set a PK field.

    obj := map[string]any{
    	"name":   "Roman",
    	"age":    35,
    	"wallet": 50.99,
    	"health": 100,
    	"mana":   200,
    }

    col := NewCollection()
    col.CreateColumnsOf(obj)
    

    If there were a new API such as collection.setPK(colName string), then we could set an existing column as the PK. Of course, this API would only be available when the collection is empty.

  • [Feature Request] more powerful index

    [Feature Request] more powerful index

    Right now this project supports bitmap indexes, which work well in many cases. If it supported a b-tree index, it would cover more usage scenarios.

    Thank you very much for building such a simple and powerful project!

  • Non deterministic filter results

    Non deterministic filter results

    Hey, it's me again

    I've stumbled across a case where my filtering logic sometimes returns the correct result, a count of 5, and sometimes returns a count of 0.

    I've created a gist to show the logic and the test I'm running https://gist.github.com/james-bowers/55722b1cf2cb60d88093a3051c7e0c20

    I'm wondering if there's something I'm doing wrong in the checkFilters function 🤔 as in the readme examples, the "where" clause is achieved by creating various indexes. However I'm confused as to why my approach sometimes works, and sometimes doesn't.

    Any light you can shed on the issue would be greatly appreciated.

    If the recommended approach is to always create an index for each filter, what overhead is there in doing this?

    Thanks so much! 🙏🏼

  • Really cool project!

    Really cool project!

    I stumbled onto this project from your other one, kelindar/bitmap, while evaluating bitmap libraries. I don't know if I'll use this library, but I wanted to let you know I thought it was really cool, and a great exhibition of how powerful thoughtful technology can be. :)
