metrics2.0 based, multi-tenant timeseries store for Graphite and friends.

Grafana Metrictank

Introduction

Grafana Metrictank is a multi-tenant timeseries platform that can be used as a backend or replacement for Graphite. It provides long-term storage, high availability, and efficient storage, retrieval, and processing for large-scale environments.

Grafana Labs has been running Metrictank in production since December 2015. It currently requires an external datastore like Cassandra or Bigtable, and we highly recommend using Kafka to support clustering, as well as a clustering manager like Kubernetes. This makes it non-trivial to operate, though Grafana Labs has an on-premise product that makes this process much easier.

Features

  • 100% open source
  • Heavily compressed chunks (inspired by the Facebook Gorilla paper) dramatically lower CPU, memory, and storage requirements and get much greater performance out of Cassandra than other solutions.
  • Writeback RAM buffers and chunk caches, serving most data out of memory.
  • Multiple rollup functions can be configured per series (or group of series), e.g. min/max/sum/count/average, and selected at query time via consolidateBy(). This lets us do consolidation (combined runtime + archived) accurately and correctly, unlike most other Graphite backends such as Whisper (see the example after this list).
  • Flexible tenancy: can be used as single tenant or multi tenant. Selected data can be shared across all tenants.
  • Input options: carbon, metrics2.0, kafka.
  • Guards against excessively large queries. (per-request series/points restrictions)
  • Data backfill/import from whisper
  • Speculative Execution means you can use replicas not only for High Availability but also to reduce query latency.
  • Write-Ahead buffer based on Kafka facilitates robust clustering and enables other analytics use cases.
  • Tags and Meta Tags support
  • Render response metadata: performance statistics, series lineage information and rollup indicator visible through Grafana
  • Index pruning (hide inactive/stale series)
  • Timeseries can change resolution (interval) over time; they will be merged seamlessly at read time. No need for any data migrations.
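
For example, a dashboard can pick the max rollup explicitly at query time through the render API (the series name below is just a placeholder):

    /render?target=consolidateBy(servers.web1.cpu.usage,'max')&from=-7d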

Relation to Graphite

The goal of Metrictank is to provide a more scalable, secure, resource efficient and performant version of Graphite that is backwards compatible, while also adding some novel functionality. (see Features, above)

There are two main ways to deploy Metrictank:

  • as a backend for Graphite-web, by setting the CLUSTER_SERVERS configuration value.
  • as an alternative to a Graphite stack. This enables most of the additional functionality. Note that Metrictank's API is not quite on par yet with Graphite-web: some less commonly used functions are not implemented natively yet, in which case Metrictank relies on a graphite-web process to handle those requests. See our graphite comparison page for more details.

Limitations

  • No performance/availability isolation between tenants per instance. (only data isolation)
  • Minimal computation locality: we move the data from storage to the processing code, which is both Metrictank and Graphite.
  • Can't overwrite old data. We support reordering within the most recent time window, but that's it (unless you restart MT).

Interesting design characteristics (feature or limitation... up to you)

  • Upgrades / process restarts require running multiple instances (potentially only for the duration of the maintenance) and possibly re-assigning the primary role; otherwise, data loss of current chunks will be incurred. See the operations guide.
  • Clustering works best with an orchestrator like Kubernetes. Metrictank itself does not automate master promotions. See clustering for more.
  • Only float64 values. Ints and bools are currently stored as floats (this works quite well thanks to the Gorilla compression).
  • Only uint32 Unix timestamps at second resolution. For higher resolution, consider streaming directly to Grafana.
  • We distribute data by hashing keys, like many similar systems. This means no data locality (data that is often used together may not live together); see the sketch below.
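
As a rough illustration of what key-based distribution looks like (a sketch only - the key format and partition count are made up, and Metrictank's real partitioner may differ):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // partitionFor hashes a metric key onto one of n partitions. Related series
    // can land on different partitions, which is the loss of data locality
    // mentioned above.
    func partitionFor(key string, numPartitions uint32) uint32 {
        h := fnv.New32a()
        h.Write([]byte(key))
        return h.Sum32() % numPartitions
    }

    func main() {
        fmt.Println(partitionFor("servers.web1.cpu.usage", 8))
        fmt.Println(partitionFor("servers.web1.cpu.idle", 8))
    }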

Docs

installation, configuration and operation.

features in-depth

Other

Releases and versioning

  • releases and changelog

  • we aim to keep master stable and vet code before merging to master

  • We're pre-1.0 but adopt semver for our 0.MAJOR.MINOR format. The rules are simple:

    • MAJOR version for incompatible API or functionality changes
    • MINOR version when you add functionality in a backwards-compatible manner

    We don't do patch level releases since minor releases are frequent enough.

License

Copyright 2016-2019 Grafana Labs

This software is distributed under the terms of the GNU Affero General Public License.

Some specific packages have a different license:

Owner

Grafana Labs is behind leading open source projects Grafana and Loki, and the creator of the first open & composable observability platform.
Comments
  • future of raintank-metric, use something else?

    please help me fill this in. we need to agree on what our requirements/desires are before talking about using other tools

    current requirements?

    • safely relay metrics from our queue into storage and ES without losing data in case we can't safely deliver
    • decode messages from our custom format used in rabbitmq (but I suppose we could also store them differently in rabbit?)
    • encode messages into our custom format, to be stored in ES

    possible future requirements

    • real time aggregation
    • real time processing/alerting (I personally don't think we need to be too concerned about this just yet. once we have high performance/scalability requirements we'll probably use a dedicated real time processing framework like spark/storm/heron/...)

    questions

    • can we write our own decode, encode, processor plugins in Go, in heka?
    • can somebody describe what we do with ES from the raintank-metric/rabbitmq perspective and how dependent this is on the main storage backend? like if kairosdb is down, can we or must we still update ES? if ES is down, can or must we still write to kairos?
    • does rabbitmq support multiple readers of the same data, and does it maintain what has been acked by which reader?
  • rollups.

    the time has come.

    • https://github.com/raintank/ops/issues/112 has some details
    • I know we wrote down some thoughts etc. at the summit - do we still have those notes? Or perhaps it's not that important.
    • implementation will probably be an nsq consumer that generates all lower-res streams and stores them (including for current data), as opposed to a design where lower-res only starts where higher-res ends.
    • we can generate spread data like librato/omniti/hostedgraphite, or store individual min/max/avg/.. series, or use an algo like LTTB (see https://github.com/sveinn-steinarsson/flot-downsample/); a sketch of the min/max/avg option follows this list.
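
    A minimal sketch of the "store individual min/max/avg/.. series" option (toy types, not Metrictank's actual rollup code):

    package main

    import "fmt"

    // Point is a raw datapoint; Rollup holds per-bucket aggregates. Storing
    // min/max/sum/count lets avg be derived as sum/count at read time.
    type Point struct {
        Ts  uint32
        Val float64
    }

    type Rollup struct {
        Ts       uint32 // bucket start
        Min, Max float64
        Sum      float64
        Count    uint32
    }

    // rollup buckets points (assumed sorted by Ts) into spans of `interval` seconds.
    func rollup(points []Point, interval uint32) []Rollup {
        var out []Rollup
        for _, p := range points {
            bucket := p.Ts - (p.Ts % interval)
            if len(out) == 0 || out[len(out)-1].Ts != bucket {
                out = append(out, Rollup{Ts: bucket, Min: p.Val, Max: p.Val})
            }
            cur := &out[len(out)-1]
            if p.Val < cur.Min {
                cur.Min = p.Val
            }
            if p.Val > cur.Max {
                cur.Max = p.Val
            }
            cur.Sum += p.Val
            cur.Count++
        }
        return out
    }

    func main() {
        pts := []Point{{10, 1}, {20, 5}, {70, 2}, {80, 4}}
        for _, r := range rollup(pts, 60) {
            fmt.Printf("ts=%d min=%v max=%v avg=%v\n", r.Ts, r.Min, r.Max, r.Sum/float64(r.Count))
        }
    }
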
  • Add asPercent function

    Native implementation of asPercent() Graphite function. (http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.asPercent)

    Added a new argument type ArgIn that allows multiple other argument types. This was necessary for the total argument. Some of the code is borrowed from an abandoned PR: https://github.com/grafana/metrictank/pull/672

    In terms of speed improvement:

    ---------- Native Implementation ----------
    Requests      [total, rate]            900, 5.01
    Duration      [total, attack, wait]    3m0.13756s, 2m59.799999s, 337.561ms
    Latencies     [mean, 50, 95, 99, max]  72.006704ms, 38.065ms, 342.887ms, 472.467ms, 765.657ms
    Bytes In      [total, mean]            130300948, 144778.83
    Bytes Out     [total, mean]            0, 0.00
    Success       [ratio]                  100.00%
    Status Codes  [code:count]             200:900
    Error Set:
    ---------- Graphite (Python) Implementation ----------
    Requests      [total, rate]            900, 5.01
    Duration      [total, attack, wait]    3m6.337648s, 2m59.799999s, 6.537649s
    Latencies     [mean, 50, 95, 99, max]  797.282367ms, 167.489ms, 4.789024s, 6.756318s, 8.006429s
    Bytes In      [total, mean]            144407224, 160452.47
    Bytes Out     [total, mean]            0, 0.00
    Success       [ratio]                  100.00%
    Status Codes  [code:count]             200:900
    Error Set:
    

    On average, the native implementation was 11x faster; the median was 4x faster, p95 was 14x faster, p99 was 14x faster, and the max was over 10x faster.

  • Optimize for a large number of new metrics getting added while still serving queries fast

    We're still experiencing serious issues with instances that have a large index and high metric churn; in the worst case, all queries time out when the index gets slammed with too many adds at a time. We should first add a benchmark which adds a large number of metrics to a large index while concurrently querying it. Then we can try to optimize based on that benchmark, so queries still get served fast while index adds happen eventually but with lower priority.
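
    A rough shape for such a benchmark, under the assumption of a toy index type (the real memory index and its API differ):

    package index

    import (
        "fmt"
        "sync"
        "testing"
    )

    // toyIndex only exists to show the shape of the benchmark (concurrent adds
    // competing with finds); it is not the real Metrictank memory index.
    type toyIndex struct {
        mu     sync.RWMutex
        series map[string]struct{}
    }

    func (ix *toyIndex) Add(name string) {
        ix.mu.Lock()
        ix.series[name] = struct{}{}
        ix.mu.Unlock()
    }

    func (ix *toyIndex) Find(name string) bool {
        ix.mu.RLock()
        _, ok := ix.series[name]
        ix.mu.RUnlock()
        return ok
    }

    func BenchmarkFindDuringHighChurn(b *testing.B) {
        ix := &toyIndex{series: make(map[string]struct{})}
        for i := 0; i < 100000; i++ { // pre-populate a large index
            ix.Add(fmt.Sprintf("some.metric.%d", i))
        }
        done := make(chan struct{})
        go func() { // churn: keep adding new series in the background
            for i := 0; ; i++ {
                select {
                case <-done:
                    return
                default:
                    ix.Add(fmt.Sprintf("new.metric.%d", i))
                }
            }
        }()
        b.ResetTimer()
        for i := 0; i < b.N; i++ { // measure query latency while adds are happening
            ix.Find(fmt.Sprintf("some.metric.%d", i%100000))
        }
        close(done)
    }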

  • Use Confluent and move the kafka-consumers into one consumer struct

    Replaces the sarama consumers with confluent ones. Also gets rid of the duplication between the kafka notifier and kafka input by moving all kafka consumer related stuff into a new struct that's used by both of them.

  • meta-tags (previously known as extrinsic tags)

    Metrics 2.0 supports adding metadata to metrics; however, this comes at the cost of network bandwidth. A lot of metadata can be very static (e.g. the data-center a machine is in). It would be very nice to have a means of bulk-loading / updating static metadata and having it merge in with tags.

    For example, every metric might have a tag host. Associated with the host is a collection of static data: cluster, data-center, os, os-version, etc. We would like to feed this in; from Grafana it would appear as tags on the metric.
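
    A minimal sketch of the kind of merge being asked for (the host lookup table and tag names here are made-up examples, not an existing API):

    package main

    import "fmt"

    // metaTagsByHost is a bulk-loaded lookup of static metadata per host.
    var metaTagsByHost = map[string]map[string]string{
        "web1": {"cluster": "frontend", "data-center": "us-east", "os": "linux"},
    }

    // withMetaTags merges the static metadata for the metric's host tag into
    // the metric's own tags, so the sender doesn't have to transmit it each time.
    func withMetaTags(tags map[string]string) map[string]string {
        out := make(map[string]string, len(tags))
        for k, v := range tags {
            out[k] = v
        }
        for k, v := range metaTagsByHost[tags["host"]] {
            if _, exists := out[k]; !exists {
                out[k] = v // the metric's own tags win over static metadata
            }
        }
        return out
    }

    func main() {
        fmt.Println(withMetaTags(map[string]string{"name": "cpu.usage", "host": "web1"}))
    }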

  • deadlock in SyncChunkSaveState

    We had Cassandra slowness which caused the write queues in MT to fill up, but we noticed that on many of our instances one worker never drained. Over the course of a couple of minutes this caused all ingest in MT to stop.

    Attaching a snapshot of the dashboard. It's a bit hard to tell, so I added an arrow showing where queue 5 just hangs as the other queues all start to drain. Also attached a stack from one of our hung instances.

    screen shot 2017-09-11 at 5 04 43 pm

    metrictank.20170911.stack.txt

  • Prune index in Cassandra

    We currently keep adding entries to the index in Cassandra and never prune them. At startup MT needs to load all of that data and filter it by the LastUpdated property to ignore the ones that have not been updated for a certain amount of time, but this makes the startup slower and slower because it needs to filter more data. We should delete index entries from Cassandra once they have reached a certain age. That pruning age should probably be higher than when we prune them from the memory index, because we want to keep the ability to just adjust the memory pruning settings and restart MT to restore index entries that have already been pruned from memory. If a user decides to send a metric again, and hence "activates" it again in the cassandra/memory indices, the historic data will still be available just like it is now.

    The simplest solution would probably be a goroutine that occasionally loads all the data from the Cassandra index and deletes all entries that haven't been updated for a certain time.
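
    In code, that occasional pruning goroutine could look roughly like this (the store interface is hypothetical; the real index code talks to Cassandra via gocql):

    package main

    import (
        "log"
        "time"
    )

    // indexEntry and indexStore are a hypothetical view of the Cassandra-backed
    // index, just to illustrate the pruning loop.
    type indexEntry struct {
        ID         string
        LastUpdate int64 // unix seconds
    }

    type indexStore interface {
        LoadAll() ([]indexEntry, error)
        Delete(id string) error
    }

    // pruneLoop periodically deletes entries whose LastUpdate is older than
    // maxStale. As argued above, maxStale should exceed the memory-index
    // pruning age so recently memory-pruned series can still be restored.
    func pruneLoop(store indexStore, every, maxStale time.Duration) {
        ticker := time.NewTicker(every)
        defer ticker.Stop()
        for range ticker.C {
            entries, err := store.LoadAll()
            if err != nil {
                log.Printf("prune: loading index failed: %v", err)
                continue
            }
            cutoff := time.Now().Add(-maxStale).Unix()
            for _, e := range entries {
                if e.LastUpdate < cutoff {
                    if err := store.Delete(e.ID); err != nil {
                        log.Printf("prune: deleting %s failed: %v", e.ID, err)
                    }
                }
            }
        }
    }

    func main() {
        _ = pruneLoop // wiring up a real store is out of scope for this sketch
    }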

  • Add support for `summarize`

    There's a question to be answered before this is ready to merge: should the input series have a matching QueryPatt and Target? And are QueryFrom and QueryTo equivalent to series.start and series.end in the Python code?

    Also, if #833 is merged there will be a conflict with docs/graphite, and I'll rebase to squash commits.

  • reorderBuffer question

    Hi,

    Can you elaborate on how to configure the reorderBuffer? What should the relation be between that number and the raw interval specified in the first defined retention?

    From my tests it seems reordering works, but the extra data it reorders is gone after every "gc-interval".

  • Add tags to exported Metrictank stats

    We need to add a duplicate set of exported stats to Metrictank in order to facilitate the transition to a mostly tag based system.

    For a yet to be determined amount of time we should export both the current stats and the new stats with tags to allow alerts and queries to be updated accordingly. This will temporarily increase memory usage.

    This will also be a great opportunity to gather data on the memory usage reductions provided by #1212

    Examples of proposed tagged stats:

    memory.gc.cpu_fraction

    Old Stat

    metrictank.stats.$environment.$instance.memory.gc.cpu_fraction.gauge32
    

    New Stat

    memory.gc.cpu-fraction;application=metrictank;environment=$environment;instance-id=$instance;metric-type=gauge32
    

    api.request.node.latency

    Old Stat

    metrictank.stats.$environment.$instance.api.request.node.latency.mean.gauge32
    

    New Stat

    api.request.latency;application=metrictank;environment=$environment;instance-id=$instance;http-path=node;metric-aggregation=mean;metric-type=gauge32
    

    idx.memory.find-cache.invalidation.drop

    Old Stat

    metrictank.stats.$environment.$instance.idx.memory.find-cache.invalidation.drop
    

    New Stat

    idx.memory.find-cache.invalidation;application=metrictank;environment=$environment;instance-id=$instance;metric=drop;metric-aggregation=mean;metric-type=gauge32
    

    idx.memory.find-cache.invalidation.exec

    Old Stat

    metrictank.stats.$environment.$instance.idx.memory.find-cache.invalidation.exec
    

    New Stat

    idx.memory.find-cache.invalidation;application=metrictank;environment=$environment;instance-id=$instance;metric=exec;metric-aggregation=mean;metric-type=gauge32
    

    You might be wondering why I am proposing such long tag keys and values. The length of the tag key and value doesn't matter much when using #1212. All of the terms will be stored once as byte slices in the object store and many times as uintptrs in Metrictank itself, so being more verbose is not an issue.

    Each series will need to be inspected individually to determine the exact series name / tag combinations.
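
    For reference, rendering one of the proposed names in the semicolon-delimited tag format is straightforward (sketch; the tag values below are placeholders):

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // taggedName renders a metric name plus tags in the semicolon-separated
    // format shown above, e.g. "memory.gc.cpu-fraction;application=metrictank;...".
    func taggedName(name string, tags map[string]string) string {
        keys := make([]string, 0, len(tags))
        for k := range tags {
            keys = append(keys, k)
        }
        sort.Strings(keys) // deterministic ordering
        var b strings.Builder
        b.WriteString(name)
        for _, k := range keys {
            fmt.Fprintf(&b, ";%s=%s", k, tags[k])
        }
        return b.String()
    }

    func main() {
        fmt.Println(taggedName("memory.gc.cpu-fraction", map[string]string{
            "application": "metrictank",
            "environment": "prod",
            "instance-id": "mt0",
            "metric-type": "gauge32",
        }))
    }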

  • Refactor metrictank repo to be compatible with go modules

    This PR refactors the entire metrictank repository to make it compatible with Go modules, as well as with vscode's static compilation. All of the Makefile commands have been fixed (as far as I know). I believe travis/circleci is also working.

    As part of the change, the vendor/ directory has been removed from the checked-in source; instead, users should run go mod vendor locally as needed.

  • Control partition size using Cassandra as backend

    Hello guys

    How can I control the partition size for the tables in Cassandra? I have a 3GB partition in the metrictank.metric_idx table and it is obviously hurting performance.

  • Add tool for reporting out of order and duplicate metrics

    This PR adds a new tool, cmd/mt-kafka-mdm-report-out-of-order, which consumes metrics from Kafka and discovers those which are out of order or duplicates. It then groups these metrics by name or a specific tag using an index built from Cassandra, and outputs the results.
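
    The core classification the tool performs can be sketched as tracking the newest timestamp seen per series key (a simplification of what cmd/mt-kafka-mdm-report-out-of-order actually does):

    package main

    import "fmt"

    // ooTracker remembers the newest timestamp seen per metric key and
    // classifies each incoming point as in-order, duplicate, or out-of-order.
    type ooTracker struct {
        last map[string]uint32
    }

    func (t *ooTracker) observe(key string, ts uint32) string {
        prev, seen := t.last[key]
        switch {
        case !seen || ts > prev:
            t.last[key] = ts
            return "in-order"
        case ts == prev:
            return "duplicate"
        default:
            return "out-of-order"
        }
    }

    func main() {
        t := &ooTracker{last: make(map[string]uint32)}
        for _, ts := range []uint32{10, 20, 20, 15} {
            fmt.Println(ts, t.observe("some.metric", ts))
        }
    }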

  • Panic in function processing (seriesaggregators.go)

    In one of our production instances we're seeing a panic occurring regularly:

    [Macaron] PANIC: runtime error: index out of range [179] with length 179
    /usr/local/go/src/runtime/panic.go:88 (0x434fa4)
    /go/src/github.com/grafana/metrictank/expr/seriesaggregators.go:208 (0xc0b96c)
    /go/src/github.com/grafana/metrictank/expr/func_aggregate.go:73 (0xbdcb8b)
    /go/src/github.com/grafana/metrictank/expr/func_aggregate.go:60 (0xbdc714)
    /go/src/github.com/grafana/metrictank/expr/plan.go:327 (0xc0a6a8)
    /go/src/github.com/grafana/metrictank/api/graphite.go:1016 (0xc852d6)
    /go/src/github.com/grafana/metrictank/api/graphite.go:318 (0xc7d424)
    /usr/local/go/src/reflect/value.go:475 (0x4c11a6)
    /usr/local/go/src/reflect/value.go:336 (0x4c0698)
    /go/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:177 (0xb01439)
    /go/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:137 (0xb00e0a)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xb1bc1c)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xc6a124)
    /go/src/github.com/grafana/metrictank/vendor/github.com/raintank/gziper/gzip.go:100 (0xc6a117)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:79 (0xb1ba92)
    /go/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:157 (0xb01154)
    /go/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:135 (0xb00ef9)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xb1bc1c)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xb2cda5)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/recovery.go:161 (0xb2cd98)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/logger.go:40 (0xb1f7b7)
    /go/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:157 (0xb01154)
    /go/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:135 (0xb00ef9)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xb1bc1c)
    /go/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xb83d54)
    /go/src/github.com/grafana/metrictank/api/middleware/logger.go:45 (0xb83d3d)
    /usr/local/go/src/reflect/value.go:475 (0x4c11a6)
    /usr/local/go/src/reflect/value.go:336 (0x4c0698)
    

    I think this is an indication of a bigger issue, because this "index out of range" shouldn't actually happen, but as a quick fix we could at least add a len() check on the relevant line to prevent the panic.
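
    The kind of len() guard meant here, sketched on plain float slices (the real code in expr/seriesaggregators.go works on series types with datapoints):

    package main

    import "fmt"

    // aggregate sums values index-by-index across input series, guarding
    // against inputs of unequal length instead of panicking with
    // "index out of range".
    func aggregate(in [][]float64) []float64 {
        maxLen := 0
        for _, s := range in {
            if len(s) > maxLen {
                maxLen = len(s)
            }
        }
        out := make([]float64, maxLen)
        for i := 0; i < maxLen; i++ {
            for _, s := range in {
                if i >= len(s) {
                    continue // the len() check suggested above
                }
                out[i] += s[i]
            }
        }
        return out
    }

    func main() {
        fmt.Println(aggregate([][]float64{{1, 2, 3}, {4, 5}}))
    }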

  • "limit exhausted" message due to max-series-per-req limit (for untagged requests)

    The new max-series limit for untagged requests (see #1926, #1929), specifically the limit fanned out to cluster peers (which is proportional to how much data the node has), has an issue. Because Metrictank has the limit enabled by default, deploying master can result in "limit exhausted" messages well before the number of series is actually hit.

    Here's why: first of all, note that UnpartitionedMemoryIdx.Find takes in the limit parameter, which is this proportional limit (if PartitionedMemoryIdx.Find is used, it further divides the limit by the number of partitions before it calls each UnpartitionedMemoryIdx.Find). This Find method relies on UnpartitionedMemoryIdx.findMaybeCached to use the find helper with the find cache in front of it. Note also that Find does "from filtering", meaning that e.g. for a query from now-1h to now, metric definitions with lastUpdate < from are known to have no data and are not included in the result set.

    But:

    • find() does a complete find (without 'from' filtering); these results can nicely be cached. If we from-filtered the results we'd need to add the from to the cache key and would probably cache similar data many times over, because on repeated queries the from would typically vary constantly. (On the other hand, if there are typically many definitions that haven't been updated in a while - e.g. heavy-churn situations - this approach means our cached results contain a large number of definitions we don't need.)
    • Find() is where we do the 'from' filtering.

    We implemented the "series limiter" in find() because we don't want to first assemble the entire result set only to then check the limit, as that would effectively make the limit moot. But implementing the limiter prior to from-filtering makes it impossible to apply correctly, as typically most entries will not be included in the final result set due to the 'from' filtering. This means a benign request that only asks for a reasonable amount of (from-filtered) series may hit the limiter if there are more entries with older lastUpdate timestamps.

    There's actually a second kind of bug, which seems more rare, but anyway. Because find() does a breadth-first search, as it progresses down the tree it may collect a lot of branches that match the expression so far. We have to keep those branches while traversing further down the tree (and the pattern) to see whether the branches (and ultimately the leaves) fully match the path. We precisely want to avoid applying the limit only after we've assembled the full response, which means we currently trigger a breach condition when the number of "candidate branches" exceeds the limit, but it may well be that those branches would have been dropped anyway.

    We want to avoid loading too much data into RAM (apply the limit as early as possible), while caching the find response body in a way that's agnostic wrt the 'from' filter, and in a correct way.

    My proposal:

    1. Change the find() algorithm to be depth-first rather than breadth-first. This means we can apply the limit as early as possible (without loading too much data into RAM), in a correct way, although the limit needs to be moved outside of the main find() algorithm so it can be applied on the from-filtered output (e.g. in Find).
    2. Rather than (assuming the uncached scenario) first finishing the complete find() - which may produce too much data - before doing the from filtering and applying the limit, change the calling convention to a lockstep approach with an iterator.

    find() becomes an iterator which feeds data to its caller (Find), which can do the from filtering and apply the limit while find() is running. The new find() can keep a full copy of its unfiltered output, so that - assuming the iteration is not aborted by its caller - it can add the full output to the cache after the iteration completes.

    To make the find iteration depth-first rather than breadth-first, we compute all matchers up front (we wouldn't want to recompute them every time we revisit a certain level of the tree). This implies:

    • con: we compute all matchers up front, in particular we may compute matchers needlessly if we would otherwise have found out "early" that a query had no matches
    • pro: we bail out early if a match expression is malformed; no need to traverse a section of the tree before detecting a bad matcher.
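
    A rough sketch of the proposed lockstep iterator, on a toy tree (the node type, matcher, and limit handling here are simplified stand-ins for the real memory index code):

    package main

    import (
        "errors"
        "fmt"
        "strings"
    )

    // node is a toy index tree node; the real memory index tree is richer.
    type node struct {
        name       string
        lastUpdate int64
        children   []*node
        leaf       bool
    }

    var errLimit = errors.New("limit exhausted")

    // find walks the tree depth-first and streams every matching leaf to emit,
    // without doing any 'from' filtering itself, so its full output could be
    // cached by the caller. If emit returns an error, the walk is aborted.
    func find(n *node, pattern []string, emit func(*node) error) error {
        if len(pattern) == 0 {
            if n.leaf {
                return emit(n)
            }
            return nil
        }
        for _, c := range n.children {
            if matches(pattern[0], c.name) {
                if err := find(c, pattern[1:], emit); err != nil {
                    return err
                }
            }
        }
        return nil
    }

    // matches stands in for the precomputed per-level matchers; only '*' and
    // exact names are handled in this sketch.
    func matches(pat, name string) bool { return pat == "*" || pat == name }

    // Find drives the iterator: it from-filters and applies the series limit in
    // lockstep with the walk, while keeping the unfiltered result for caching.
    func Find(root *node, query string, from int64, limit int) (filtered, unfiltered []*node, err error) {
        err = find(root, strings.Split(query, "."), func(n *node) error {
            unfiltered = append(unfiltered, n)
            if n.lastUpdate < from {
                return nil // from-filtered out; does not count against the limit
            }
            if len(filtered) >= limit {
                return errLimit
            }
            filtered = append(filtered, n)
            return nil
        })
        return filtered, unfiltered, err
    }

    func main() {
        root := &node{children: []*node{
            {name: "a", children: []*node{
                {name: "x", leaf: true, lastUpdate: 50},
                {name: "y", leaf: true, lastUpdate: 200},
            }},
        }}
        f, u, err := Find(root, "a.*", 100, 10)
        fmt.Println(len(f), len(u), err) // 1 2 <nil>
    }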