Service for firewalling graphite metrics

Kambi

Last update: Apr 28, 2022

Comments: 8

hadrianus

Block incoming graphite metrics if they come in too fast for downstream carbon-relay/carbon-cache to handle.

Building

Hadrianus is written in Go so all you need is: go build

As a convenience there's a small Makefile. To build:

make all <- Generates a Linux amd64 executable
make all-mac <- Generates a Darwin amd64 executable

Usage

Basic usage

hadrianus listeningport outport1...

listeningport is the port number for listening to incoming newline delimited graphite protocol messages.
outport1 denotes the first (out of possibly many) output ports for carbon-relay process instances. Metrics will be distributed to the destinations in a "round robin" fashion.
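
As a rough illustration of the round-robin distribution described above, here is a minimal sketch. The `roundRobin` type and its `next` method are hypothetical names chosen for this example and are not taken from the hadrianus source.

```go
package main

import "fmt"

// roundRobin cycles through a fixed list of destinations.
// Hypothetical helper for illustration; not safe for concurrent use.
type roundRobin struct {
	destinations []string
	index        int
}

// next returns the next destination and advances the cursor,
// wrapping around to the first destination after the last one.
func (r *roundRobin) next() string {
	d := r.destinations[r.index]
	r.index = (r.index + 1) % len(r.destinations)
	return d
}

func main() {
	rr := &roundRobin{destinations: []string{"127.0.0.1:2103", "127.0.0.1:2203"}}
	for i := 0; i < 4; i++ {
		fmt.Println(rr.next())
	}
}
```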

Options

-cleanupmaxage Maximum time in seconds since last message before metric path is removed from memory.
-cleanuptimegranularity Seconds between memory cleanup events (default 601)
-enablenewmetrics Initially enable new metrics and block them later if needed.
-maxdrymessages Maximum allowed consecutive identical values before marking metric as stale.
-minimumtimeinterval Minimum allowed time interval between incoming metrics in seconds. Lower values make hadrianus more "generous" in how often applications may send a specific metric.
-mirrordestination Secondary destination(s) to mirror traffic to.
-override Filename for per-path override file that allows allowlisting.
-staleresendinterval Time after which stale messages are resent in seconds.
-statstimegranularity Time between statistics messages in seconds.
-tertiarydestination Tertiary destination(s) to mirror traffic to.

What

A newline-delimited graphite message has the form metric_path value timestamp\n. Hadrianus can limit the number of messages per metric path within a time period. For example, a minimum time interval of 60 (see -minimumtimeinterval) means that a message for a given metric path is forwarded at most once per minute.
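
The idea can be sketched as follows. This is a simplified illustration, not the actual hadrianus implementation: `parseLine`, `limiter` and `allow` are hypothetical names, the sketch compares the timestamps carried in the messages (the real service may use arrival time instead), it always forwards the first message for a path, and the handling of the exact boundary (`<=`) is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// limiter remembers the timestamp of the last forwarded message per metric path.
type limiter struct {
	minInterval int64            // seconds, in the spirit of -minimumtimeinterval
	lastSent    map[string]int64 // metric path -> UNIX timestamp
}

// parseLine splits "metric_path value timestamp" into its three fields.
func parseLine(line string) (path string, value float64, ts int64, err error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return "", 0, 0, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	if value, err = strconv.ParseFloat(fields[1], 64); err != nil {
		return "", 0, 0, err
	}
	if ts, err = strconv.ParseInt(fields[2], 10, 64); err != nil {
		return "", 0, 0, err
	}
	return fields[0], value, ts, nil
}

// allow reports whether a message for path with timestamp ts should be
// forwarded. A message is dropped when it follows the last forwarded
// message for the same path by minInterval seconds or less (assumption).
func (l *limiter) allow(path string, ts int64) bool {
	last, seen := l.lastSent[path]
	if seen && ts-last <= l.minInterval {
		return false
	}
	l.lastSent[path] = ts
	return true
}

func main() {
	l := &limiter{minInterval: 60, lastSent: map[string]int64{}}
	for _, line := range []string{
		"servers.web01.cpu 0.42 1642598151",
		"servers.web01.cpu 0.43 1642598161",
		"servers.web01.cpu 0.44 1642598251",
	} {
		path, _, ts, err := parseLine(line)
		if err != nil {
			continue
		}
		fmt.Println(line, "forward:", l.allow(path, ts))
	}
}
```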

This can be useful to increase stability and reliability if you have applications producing more messages than the graphite/carbon system can handle.

Example commandline usage

Typical usage

hadrianus -minimumtimeinterval=14 2003 2103 2203 server01.iambk.com:2303

This tells hadrianus to discard a message for a metric path if it arrives 14 seconds or less after the previous one for the same path. It listens for plaintext graphite messages on port 2003 and distributes the incoming messages to 127.0.0.1:2103, 127.0.0.1:2203 and server01.iambk.com:2303 in a round-robin fashion.
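
The example implies that a bare port number refers to a destination on 127.0.0.1, while a host:port argument is used as given. A minimal sketch of that mapping, assuming a hypothetical helper named `normalizeDestination` (the real argument handling may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDestination turns a bare port such as "2103" into
// "127.0.0.1:2103" and leaves "host:port" arguments unchanged.
// Hypothetical helper mirroring the behaviour described above.
func normalizeDestination(arg string) string {
	if strings.Contains(arg, ":") {
		return arg
	}
	return "127.0.0.1:" + arg
}

func main() {
	for _, arg := range []string{"2103", "2203", "server01.iambk.com:2303"} {
		fmt.Println(normalizeDestination(arg))
	}
}
```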

Allowing a metric path to pass through unmodified

It's possible to allow a metric path to pass through without being touched by the blocking logic by adding a matching pattern in a separate configuration file. In this example the configuration file allowlist.conf is as follows:

[hadrianus]
pattern = ^server\.hadrianus\.
allowunmodified = true

The allowlist file is passed on the commandline with the -override flag:

hadrianus -override=allowlist.conf -minimumtimeinterval=14 2003 2103 2203 server01.iambk.com:2303
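
To show how such a pattern could be applied, here is a minimal sketch. It assumes the pattern uses Go's regexp syntax (plausible, since hadrianus is written in Go, but not confirmed here). The `override` struct, `mustOverride` and `bypassesBlocking` are hypothetical names, and parsing of the configuration file itself is left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// override mirrors one entry of the override file shown above.
// The struct and field names are assumptions for illustration.
type override struct {
	pattern         *regexp.Regexp
	allowUnmodified bool
}

// mustOverride compiles pattern and panics if it is not a valid regexp.
func mustOverride(pattern string, allow bool) override {
	return override{pattern: regexp.MustCompile(pattern), allowUnmodified: allow}
}

// bypassesBlocking reports whether path matches an override that lets
// it pass through without being touched by the blocking logic.
func bypassesBlocking(path string, overrides []override) bool {
	for _, o := range overrides {
		if o.allowUnmodified && o.pattern.MatchString(path) {
			return true
		}
	}
	return false
}

func main() {
	overrides := []override{mustOverride(`^server\.hadrianus\.`, true)}
	fmt.Println(bypassesBlocking("server.hadrianus.host01.clientConnectionsActive", overrides))
	fmt.Println(bypassesBlocking("servers.web01.cpu", overrides))
}
```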

Owner

Kambi

https://github.com/kambisports/hadrianus

Comments

Improve stability & observability

Add:

buffered channel writes

mechanism to recover from when all channel buffers are full

internal metrics to give additional insight into the stability of the service

documentation on internal metrics

enable "Nagle's algorithm" for output to consumers to improve network performance

This change will improve the service's stability, speed, and observability. However, the added buffering may increase latency for the metrics consumers.
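
One possible shape for buffered channel writes that do not stall when a buffer is full is a non-blocking send. The sketch below drops and counts messages in that case, which is only one of several possible recovery policies. The `forwarder` type and `trySend` are hypothetical names, and a single producer goroutine is assumed.

```go
package main

import "fmt"

// forwarder owns a buffered channel feeding one downstream consumer.
type forwarder struct {
	out     chan string
	dropped int64 // messages discarded because the buffer was full
}

// trySend enqueues msg without blocking. When the buffer is full it
// drops the message and counts it instead of stalling the producer.
func (f *forwarder) trySend(msg string) bool {
	select {
	case f.out <- msg:
		return true
	default:
		f.dropped++
		return false
	}
}

func main() {
	f := &forwarder{out: make(chan string, 2)}
	for i := 0; i < 3; i++ {
		fmt.Println("sent:", f.trySend(fmt.Sprintf("metric.path %d 1642598151", i)))
	}
	fmt.Println("dropped:", f.dropped)
}
```

The `dropped` counter is the kind of value that could be exposed as an internal metric.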

Specify internal Hadrianus metrics as Go template

Currently, internal Hadrianus metrics are published on a hard-coded graphite metrics path. However, organizations and individuals organize their metrics differently, so it should be possible to customize where the metrics end up. The most flexible way to accomplish this could be to specify the path on the command line (or in configuration) using Go templates.
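
A minimal sketch of what this could look like with the standard text/template package. The `pathData` fields (`Hostname`, `Metric`) and the `renderPath` helper are assumptions made for this example; which values a real template would expose is an open design question.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// pathData holds the values a path template may refer to.
// The field names are hypothetical.
type pathData struct {
	Hostname string
	Metric   string
}

// renderPath expands a Go template into a graphite metrics path.
func renderPath(tmpl string, data pathData) (string, error) {
	t, err := template.New("path").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	data := pathData{Hostname: "host01", Metric: "clientConnectionsActive"}
	p, err := renderPath("server.hadrianus.{{.Hostname}}.{{.Metric}}", data)
	fmt.Println(p, err)
}
```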

Improve "choppy" graphs

Many of our graphs become very choppy when the value rarely changes.

We can solve the problem like this:

Remember how long it took before Hadrianus removed the value from the "stale" state.

Use the stale time interval to decide how much additional time should pass until the metric is placed in a "stale" state next time.

It may also be a good idea to set a maximum "additional stale time" limit. When the limit is 0, the feature is disabled.

Add Grafana Cloud output support

Since one of the current uses of Hadrianus is to send metrics to Grafana Cloud, it might simplify the setup to add Grafana Cloud support to Hadrianus. That way, we'd be able to spin up Hadrianus as a scratch container in EKS with minimal fuss, without using carbon-relay-ng.

The integration of the code seems super simple: https://github.com/grafana/cloud-graphite-scripts/blob/master/send/main.go

The above example uses "plain json" (content-type "application/json"), which is pretty inefficient. Instead, it should use "binary protocol, snappy compressed" (content-type "rt-metric-binary-snappy"). Some work needs to be spent on understanding how this is done.

We still need to think about how the UI/config should look for it (config maps? env variables?).

One way to handle this is to define named output aliases in a configuration file. It overrides the standard resolve hostname/port mechanism when such a name is encountered. Each output alias needs to have:

name

type (for example "grafanaNet" or "plaintext")

optional configuration details (for example, "address URL", "API key", "schemas file", "aggregation file" for "grafanaNet")
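
The alias idea above could be sketched roughly as follows. Everything here is hypothetical: the `outputAlias` struct, its field names, the `resolveOutput` helper and the option keys are placeholders for illustration, and loading the aliases from a configuration file is not shown.

```go
package main

import "fmt"

// outputAlias describes one named output from a configuration file.
// All names are assumptions for illustration.
type outputAlias struct {
	Name    string
	Type    string            // for example "grafanaNet" or "plaintext"
	Options map[string]string // optional, type-specific settings
}

// resolveOutput returns the alias registered under arg. If there is
// none, arg is treated as an ordinary plaintext destination.
func resolveOutput(arg string, aliases map[string]outputAlias) outputAlias {
	if a, ok := aliases[arg]; ok {
		return a
	}
	return outputAlias{Name: arg, Type: "plaintext", Options: map[string]string{"address": arg}}
}

func main() {
	aliases := map[string]outputAlias{
		"cloud": {Name: "cloud", Type: "grafanaNet", Options: map[string]string{"address URL": "https://example.invalid/metrics"}},
	}
	fmt.Println(resolveOutput("cloud", aliases).Type)
	fmt.Println(resolveOutput("server01.iambk.com:2303", aliases).Type)
}
```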

Add Prometheus endpoint

Internal Hadrianus metrics should also (optionally?) be exposed as a Prometheus endpoint. This could have the additional benefit of being usable as a service health check in Kubernetes or similar.
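
A minimal, standard-library-only sketch of such an endpoint is shown below. A real implementation would more likely use the official Prometheus Go client. The metric name `hadrianus_client_connections_active` and the `snapshot` callback are hypothetical, and the output follows the Prometheus text exposition format for gauges.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
)

// renderMetrics formats gauges in the Prometheus text exposition
// format, sorted by name so the output is deterministic.
func renderMetrics(gauges map[string]float64) string {
	names := make([]string, 0, len(gauges))
	for n := range gauges {
		names = append(names, n)
	}
	sort.Strings(names)
	var sb strings.Builder
	for _, n := range names {
		fmt.Fprintf(&sb, "# TYPE %s gauge\n%s %g\n", n, n, gauges[n])
	}
	return sb.String()
}

// metricsHandler serves the current gauges over HTTP.
// snapshot is a hypothetical callback returning the internal metrics.
func metricsHandler(snapshot func() map[string]float64) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderMetrics(snapshot()))
	}
}

func main() {
	fmt.Print(renderMetrics(map[string]float64{"hadrianus_client_connections_active": 3}))
}
```

The handler could be registered with `http.Handle("/metrics", metricsHandler(...))`, and the same URL could then double as a simple health check.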

Fix clientConnectionsActive data

server.hadrianus.*.clientConnectionsActive will sometimes drift out of sync and give strange results.

Possibly, this could be fixed by replacing the existing naïve counters with "atomic counters" (https://gobyexample.com/atomic-counters).
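
A minimal sketch of a connection counter built on sync/atomic, which stays consistent when many goroutines update it at once. The `connCounter` type and the `simulate` helper are hypothetical and only illustrate the technique; they are not taken from the hadrianus source.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// connCounter tracks active client connections with atomic operations.
type connCounter struct{ active int64 }

func (c *connCounter) opened()     { atomic.AddInt64(&c.active, 1) }
func (c *connCounter) closed()     { atomic.AddInt64(&c.active, -1) }
func (c *connCounter) load() int64 { return atomic.LoadInt64(&c.active) }

// simulate opens n connections concurrently, closes every second one,
// and returns the number of connections still counted as active.
func simulate(n int) int64 {
	var c connCounter
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.opened()
			if i%2 == 0 {
				c.closed()
			}
		}(i)
	}
	wg.Wait()
	return c.load()
}

func main() {
	fmt.Println(simulate(1000))
}
```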

Implement "cardinality guard" functionality

It can be problematic for a graphite setup if metrics producers create new metrics paths at a high rate.

There should be a way to limit the creation of new metrics paths for a metrics producer to stop the downstream systems from being overwhelmed.

Suggestions:

The production of new metrics should be possible to throttle by producer IP address.

Once throttled, a producer should only be allowed to create a limited number of new metrics paths per measurement period.

A mock example of how to keep track of this is shown below. The first level is keyed by a 128-bit IPv6 address. The second level maps the 32-bit hash of a graphite metrics path to a 64-bit UNIX timestamp. Every N seconds (whatever your measurement time interval is), you count, for every IP address, how many paths have a timestamp greater than or equal to the current timestamp minus N. This gives the number of unique metric paths produced during the interval N for every IP address:

{
  "::ffff:90.16.154.34": {
    "c65a134b": 1642598151,
    "c296b298": 1642598151
  },
  "dc8b:334c:b60c:d107:47d3:ca23:2860:df1a": {
    "dd7ee3a6": 1642598153,
    "a3cf9f1b": 1642598153,
    "d4c8af8d": 1642598153
  }
}

clientConnectionsActive could be more helpful

The internal Hadrianus metric server.hadrianus.*.clientConnectionsActive isn't very useful when there are only a few connections, since it only samples the number of connections open at that instant.

Instead, it should (perhaps) show the average number of concurrent active network connections during the sampling interval.
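
One way to compute such an average is to accumulate "connection-seconds" and divide by the length of the sampling interval. The sketch below assumes whole-second UNIX timestamps and a single goroutine. The `connAverager` type and its methods are hypothetical names for illustration.

```go
package main

import "fmt"

// connAverager accumulates connection-seconds so that the average
// number of concurrent connections per sampling interval can be reported.
type connAverager struct {
	active        int64 // connections open right now
	lastChange    int64 // UNIX timestamp of the last change or sample
	intervalStart int64 // UNIX timestamp at which the current interval began
	connSeconds   int64 // sum of active * elapsed seconds in this interval
}

// advance accounts for the time passed since the last change.
func (a *connAverager) advance(now int64) {
	a.connSeconds += a.active * (now - a.lastChange)
	a.lastChange = now
}

// change records delta connections opening (+) or closing (-) at now.
func (a *connAverager) change(delta, now int64) {
	a.advance(now)
	a.active += delta
}

// sample returns the time-weighted average since the previous sample
// and starts a new interval.
func (a *connAverager) sample(now int64) float64 {
	a.advance(now)
	avg := float64(a.active)
	if elapsed := now - a.intervalStart; elapsed > 0 {
		avg = float64(a.connSeconds) / float64(elapsed)
	}
	a.connSeconds = 0
	a.intervalStart = now
	return avg
}

func main() {
	a := &connAverager{}
	a.change(+1, 0) // one connection opens at t=0
	a.change(+1, 5) // a second one opens at t=5
	fmt.Println(a.sample(10))
}
```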

