A tool based on eBPF, Prometheus and Grafana to monitor network connectivity.

Last update: Dec 8, 2022

Connectivity Monitor

Tracks the connectivity of a Kubernetes cluster to its api server and exposes meaningful connectivity metrics.

Uses eBPF to observe all the TCP connection establishments from the shoot cluster to the Kubernetes api server. Derives meaningful connectivity metrics (an upper bound for meaningful availability) for the Kubernetes api server that is running in the seed cluster.

Can be deployed in two different modes:

Deployed in a shoot cluster (or a normal Kubernetes cluster) to track the connectivity to the api server.
Deployed in a seed cluster to track the connectivity of all shoot clusters hosted on the seed.

The network path

The network path from the shoot cluster to the api server.

The shoot cluster's api server is hosted in the seed cluster and the network path involves several hops:

the NAT gateway in the shoot cluster,
the load balancer in the seed cluster,
a k8s service hop and
the envoy reverse proxy.

The reverse proxy terminates the TCP connection, starts the TLS negotiation and chooses the api server of the shoot cluster based on the server name extension in the TLS ClientHello message (SNI). The TLS negotiation is relayed to the chosen api server so that the client actually establishes a TLS session directly with the api server. (See SNI GEP for details.)
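To make the role of the SNI concrete, here is a minimal Go sketch that pulls the server name out of a raw TLS ClientHello record. It is an illustration only: the project itself parses packets in an eBPF program, and the names `readVec`, `sniFromClientHello` and `captureClientHello`, as well as the host name used, are hypothetical. The sketch assumes the ClientHello fits in a single TLS record.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"io"
	"net"
)

var errBadHello = errors.New("not a well-formed ClientHello record")
var errNoSNI = errors.New("no server_name extension")

// readVec splits off a vector prefixed by a big-endian length of lenBytes bytes.
func readVec(b []byte, lenBytes int) (vec, rest []byte, ok bool) {
	if len(b) < lenBytes {
		return nil, nil, false
	}
	n := 0
	for _, x := range b[:lenBytes] {
		n = n<<8 | int(x)
	}
	if b = b[lenBytes:]; len(b) < n {
		return nil, nil, false
	}
	return b[:n], b[n:], true
}

// sniFromClientHello returns the host name in the server_name extension.
func sniFromClientHello(rec []byte) (string, error) {
	// Record header (5) + handshake header (4) + version (2) + random (32).
	if len(rec) < 43 || rec[0] != 22 || rec[5] != 1 {
		return "", errBadHello
	}
	b, ok := rec[43:], true
	// Skip session_id, cipher_suites and compression_methods.
	for _, lenBytes := range []int{1, 2, 1} {
		if _, b, ok = readVec(b, lenBytes); !ok {
			return "", errBadHello
		}
	}
	exts, _, ok := readVec(b, 2)
	if !ok {
		return "", errBadHello
	}
	for len(exts) >= 4 {
		typ := int(exts[0])<<8 | int(exts[1])
		var data []byte
		if data, exts, ok = readVec(exts[2:], 2); !ok {
			return "", errBadHello
		}
		if typ != 0 { // 0 is the server_name extension
			continue
		}
		list, _, ok := readVec(data, 2) // server_name_list
		if !ok || len(list) < 1 || list[0] != 0 { // name_type 0 = host_name
			return "", errBadHello
		}
		name, _, ok := readVec(list[1:], 2)
		if !ok {
			return "", errBadHello
		}
		return string(name), nil
	}
	return "", errNoSNI
}

// captureClientHello lets crypto/tls emit a real ClientHello into an in-memory pipe.
func captureClientHello(serverName string) ([]byte, error) {
	client, server := net.Pipe()
	defer server.Close()
	go func() {
		defer client.Close()
		// The handshake never completes; it only has to send the ClientHello.
		_ = tls.Client(client, &tls.Config{ServerName: serverName}).Handshake()
	}()
	hdr := make([]byte, 5)
	if _, err := io.ReadFull(server, hdr); err != nil {
		return nil, err
	}
	body := make([]byte, int(hdr[3])<<8|int(hdr[4]))
	if _, err := io.ReadFull(server, body); err != nil {
		return nil, err
	}
	return append(hdr, body...), nil
}

func main() {
	hello, err := captureClientHello("api.shoot.example.internal")
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	sni, err := sniFromClientHello(hello)
	fmt.Printf("sni=%q err=%v\n", sni, err)
}
```

Because the session id, the cipher suite list and the extensions are all variable-length, the position of the server name differs from client to client, so it has to be found by walking the length prefixes.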

Possible failure types

We can distinguish multiple failure types:

There is no network connectivity to the api server.

The focus of this connectivity-monitor component.

New TCP connections to the Kubernetes api server are observed to confirm that all the components along the network path to the api server, and the api server itself, are working as expected. Many things can break along the network path: the DNS resolution of the load balancer's domain name can fail, packets can be dropped due to misconfigured connection tracking tables, or the reverse proxy might be too overloaded to accept any new connections. The mundane failure case that there are no running api server processes is also covered by the connectivity monitor.

The api server reports an internal server error.

Detecting this failure type is not feasible for the connectivity-monitor component; it can be achieved by processing the access logs of the api server.

The failure cases in which the connection is successfully established but the api server detects and returns an error (4xx - user error, 5xx - internal error) are considered successful connection attempts; hence the connectivity monitor yields an upper bound for meaningful availability. These situations can be detected on the server side by parsing the access logs: because the connections succeeded, we can expect to find matching access log entries.

The api server doesn't comply with the specification.

Detecting this failure type requires test cases with a known expected outcome.

The trickiest failure case is when the api server cannot itself detect the error and returns an incorrect answer as a success (2xx - ok). This failure case can only be detected by running test cases against the api server where the result is known ahead of time, so that it can be asserted that the expected and actual results are equivalent.

Observe all the connections from the shoot cluster to the api server

To capture all connection attempts by:

system components managed by Gardener: kubelet, kube-proxy, calico, ... and
any user workload that is talking to the api server

the connectivity-exporter must be deployed as a daemonset in the host network of the node, in the shoot cluster.

Deploying the connectivity-exporter directly in the shoot cluster is motivated by:

the connectivity-exporter is closer to the clients that initiate the connection and hence it can even capture failed attempts that don't reach the seed cluster at all (e.g. due to DNS misconfiguration),
by deploying the connectivity-exporter in the shoot cluster, the load is considerably smaller: it is tracking all the connections from a single shoot cluster (1-1k/s), and not all the connections from all the shoot clusters of a single seed cluster (300x).

Later, we plan to deploy the connectivity exporter in the seed cluster as well, to centrally monitor all the connections from all the shoot clusters that could at least reach the reverse proxy (envoy).

Annotate time based on state of connections

The connectivity-exporter assesses each connection attempt based on the packet sequence it observes in a certain time window:

unacknowledged connection: SYN (packet sent to the api server), no acknowledgment received
rejected connection: SYN packet sent, SYN+ACK packet received, but e.g. during the TLS negotiation the server responds with an RST+ACK packet to abort the connection
successful connection: SYN (packet sent to the api server), SYN+ACK (packet received from the api server)
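The three cases above can be sketched as a small Go function. This is not the exporter's implementation (the real assessment happens on packets captured via eBPF); the `flags` type and `classify` function are hypothetical, and two choices are assumptions of this sketch: an RST+ACK that follows a SYN directly is treated as rejected, and a sequence that does not start with a SYN is reported as orphan.

```go
package main

import "fmt"

// flags is the subset of TCP flag combinations mentioned in the text.
type flags string

const (
	syn    flags = "SYN"
	synAck flags = "SYN+ACK"
	rstAck flags = "RST+ACK"
)

// classify assesses one connection attempt from the packets observed
// within the time window, in the order they were seen.
func classify(seen []flags) string {
	sawSyn, sawSynAck := false, false
	for _, f := range seen {
		switch f {
		case syn:
			sawSyn = true
		case synAck:
			if !sawSyn {
				return "orphan" // reply without a preceding SYN
			}
			sawSynAck = true
		case rstAck:
			if !sawSyn {
				return "orphan"
			}
			return "rejected" // the server aborted the connection
		}
	}
	switch {
	case sawSynAck:
		return "successful"
	case sawSyn:
		return "unacknowledged"
	}
	return "orphan"
}

func main() {
	fmt.Println(classify([]flags{syn}))
	fmt.Println(classify([]flags{syn, synAck}))
	fmt.Println(classify([]flags{syn, synAck, rstAck}))
}
```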

The connectivity exporter annotates 1s long time buckets after a certain offset, to tolerate late arrivals and avoid issues at second boundaries:

active (/inactive) second: active if there were some new connection attempts, inactive if there were no new connection attempts,
failed (/successful) second: failed if there was at least one failed connection attempt (unacknowledged or rejected), or if there were no connection attempts and the preceding bucket was assessed as failed; successful otherwise.

If packets arrive too late (beyond a certain time window) or simply out of sequence (e.g. a SYN+ACK packet without a preceding SYN packet on the same connection), they are counted as orphan packets.

Prometheus metrics

The state of the connectivity exporter is exposed as Prometheus counter metrics, which can be scraped at a comfortable interval without losing the 1s granularity, because every assessed second is accumulated in the counters.

# HELP connectivity_exporter_connections_total Total number of new connections.
# TYPE connectivity_exporter_connections_total counter
connectivity_exporter_connections_total{kind="rejected"} 0
connectivity_exporter_connections_total{kind="successful"} 544
connectivity_exporter_connections_total{kind="unacknowledged"} 0

# HELP connectivity_exporter_packets_total Total number of new packets.
# TYPE connectivity_exporter_packets_total counter
connectivity_exporter_packets_total{kind="orphan"} 0

# HELP connectivity_exporter_seconds_total Total number of seconds.
# TYPE connectivity_exporter_seconds_total counter
connectivity_exporter_seconds_total{kind="active"} 337
connectivity_exporter_seconds_total{kind="active_failed"} 0
connectivity_exporter_seconds_total{kind="clock"} 2354
connectivity_exporter_seconds_total{kind="failed"} 0

When the connectivity exporter is deployed in the seed, an SNI label is added to the metrics above to differentiate the connections to the different api servers.

Inspiration

This work is motivated by the meaningful availability paper and the SRE books by Google.

The failed seconds counter metric is meaningful according to the definition of the paper: it captures what users experience. In every counted failed second, there was at least one failed connection attempt by a user or there weren't any successful connection attempts since the last failure. During the uptime of the monitoring stack itself, any failed connection attempt by a user (running in the shoot cluster) will be reported as a failed second.

Overview

The following sketch shows where the TCP connections are captured and how time is annotated based on the assessed connection states.

The big picture of meaningful availability also includes application level access logs on the server side. Connectivity monitoring is a first step on the path to meaningful availability that yields an upper bound: availability requires connectivity.

Note that this is a low-level and hence very generic approach with potential for widespread adoption. As long as the service is delivered via TCP/IP (i.e. all the services of our concern) and service instances can be differentiated by the SNI TLS extension, we can measure the connectivity with 1s resolution with this approach. The connectivity exporter can be deployed anywhere along the path between the clients and the servers. This choice is a tradeoff: if it is deployed close to the clients, it can cover more failure cases and needs to handle less load; if it is deployed closer to the server, it might cover all the clients but miss certain failure cases.

In the Gardener architecture, we have the unique situation that all the relevant clients of the api server are running in the shoot cluster and we can deploy the connectivity exporter next to some other Gardener managed system components in the shoot cluster as well.

Comments

build: container images and helm charts

This adds Makefile targets:

docker/build
docker/push
helm/generate
helm/install
helm/uninstall

The following environment variables can be redefined before running 'make':

REGISTRY
IMAGE_NAME
IMAGE_TAG

For example, I run

export REGISTRY=xxxxx.azurecr.io
export IMAGE_NAME=connectivity-monitor
export IMAGE_TAG=albantest

With this, I can test:

$ time make docker/build docker/push helm/install

And then, the pod is deployed:

$ kubectl logs -n connectivity-monitor connectivity-exporter-65rz4
2021/11/08 14:45:01 maxprocs: Updating GOMAXPROCS=2: determined from CPU quota
I1108 14:45:02.076150       9 metrics.go:24] Starting connectivity-exporter
I1108 14:45:24.076205       9 packet.go:245] sni: dc.services.visualstudio.com, connections: 1
...

There are some errors but that could be debugged later:

packet.go:159] Empty SNI
sni: , connections: 2

TODO:

[ ] Add missing helm charts

CI: initial GitHub Action

What this PR does / why we need it:

This builds, runs the unit tests, creates a docker image and pushes it to the GitHub Container Registry.

Fix Prometheus counters for each SNI

What this PR does / why we need it:

Rename {succeeded,failed}_seconds to {succeeded,failed}_connections in the BPF map sni_stats

Separate stats for each SNI

On inactive seconds, carry over failed second state

ebpf: fix integer overflow with offsets

Offsets should not be stored in __u8 because they can exceed 255, the largest value a __u8 can hold. Typically, when a client such as wget supports a large number of cipher suites, the offset of the SNI becomes larger than 255.
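The effect of the overflow can be reproduced outside of BPF. The Go sketch below uses `uint8` as a stand-in for `__u8`; the helper `truncatedOffset` and the sample offsets are hypothetical and only illustrate the arithmetic.

```go
package main

import "fmt"

// truncatedOffset mimics storing a byte offset in an 8-bit field:
// only the lowest 8 bits survive the conversion.
func truncatedOffset(offset int) uint8 {
	return uint8(offset)
}

func main() {
	for _, off := range []int{200, 255, 256, 300} {
		fmt.Printf("offset %d stored in 8 bits reads back as %d\n", off, truncatedOffset(off))
	}
}
```

An offset of 300 reads back as 44, so the parser would look for the SNI at the wrong position instead of failing loudly.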

connectivity-exporter: add CLI flag -metrics-addr

connectivity-exporter was previously listening on port 19100 on all network interfaces and this was not configurable.

This patch adds a CLI flag -metrics-addr to make this configurable. The default is still ":19100" to keep the previous behaviour unchanged.
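A flag like this could be declared with the standard `flag` package, as in the following Go sketch. The flag name and its default come from the description above; the function `parseMetricsAddr`, the help text and the example address are hypothetical and not taken from the patch.

```go
package main

import (
	"flag"
	"fmt"
)

// parseMetricsAddr parses the command line arguments and returns the
// address the metrics endpoint should listen on.
func parseMetricsAddr(args []string) (string, error) {
	fs := flag.NewFlagSet("connectivity-exporter", flag.ContinueOnError)
	addr := fs.String("metrics-addr", ":19100", "address to serve the Prometheus metrics on")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *addr, nil
}

func main() {
	for _, args := range [][]string{nil, {"-metrics-addr", "127.0.0.1:9200"}} {
		addr, err := parseMetricsAddr(args)
		if err != nil {
			fmt.Println("invalid arguments:", err)
			continue
		}
		fmt.Printf("args %v: would listen on %q\n", args, addr)
	}
}
```

Without the flag the listener keeps the previous address `:19100`; binding to `127.0.0.1:9200` would restrict it to the loopback interface on another port.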

This is useful when the Kubernetes cluster already has something listening on port 19100.

connectivity-exporter: monitoring all interfaces

The network interface name could be specified with the "-i" CLI flag but it was not possible to monitor all network interfaces.

With this patch, connectivity-exporter will monitor all network interfaces when the "-i" flag is empty or missing.

It works by setting sll_ifindex to zero: see "man 7 packet":

sll_ifindex is the interface index of the interface (see netdevice(7)); 0 matches any interface (only permitted for binding).
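As a rough illustration of the binding rule quoted above, the Go sketch below builds the link-layer address for an AF_PACKET socket and leaves the interface index at zero when no interface name is given. It is Linux-only, it does not open a socket, and the helpers `htons` and `bindAddr` are hypothetical; whether the exporter sets the field from Go or from C is not stated here.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// htons converts to network byte order; the byte swap assumes a little-endian host.
func htons(v uint16) uint16 { return v<<8 | v>>8 }

// bindAddr builds the sockaddr_ll to bind an AF_PACKET socket with.
// An empty interface name leaves Ifindex at 0, which matches any interface.
func bindAddr(ifname string) (*syscall.SockaddrLinklayer, error) {
	addr := &syscall.SockaddrLinklayer{Protocol: htons(syscall.ETH_P_ALL)}
	if ifname == "" {
		return addr, nil
	}
	iface, err := net.InterfaceByName(ifname)
	if err != nil {
		return nil, err
	}
	addr.Ifindex = iface.Index
	return addr, nil
}

func main() {
	addr, err := bindAddr("") // as if "-i" were empty or missing
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("bind with sll_ifindex=%d, sll_protocol=%#04x\n", addr.Ifindex, addr.Protocol)
	// A real capture would now call syscall.Bind(fd, addr) on an AF_PACKET
	// socket, which needs CAP_NET_RAW and is left out of this sketch.
}
```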

Simplify metric expiration

What this PR does / why we need it:

SNIs should be "expired" after 15 minutes of inactivity. Previously, each SNI received its own goroutine that would start a timer, and the timer was reset whenever there was activity. However, this added unnecessary complexity and made the code more difficult to test. With this PR, the SNIs are expired in a single goroutine that checks when each SNI was last updated. This should simplify the code and make it more readable and testable.

Reset the weekly metrics on Sundays at midnight

What this PR does / why we need it:

Previously, they were reset on Thursdays at midnight, because the start of the unix epoch time, January 1, 1970 was a Thursday.

Special notes for your reviewer:

May 9, 2022 is a Monday.

Add some panels to show the cluster downtimes in seconds

What this PR does / why we need it:

Adds panels to show downtime in seconds for specific SNIs. This can be useful if you want to see the duration of a downtime in seconds rather than as a percentage.

Rename instances of connectivity-monitor to connectivity-exporter

Clean up any remaining instances of connectivity-monitor and replace them with connectivity-exporter. Done after renaming the repository to gardener/connectivity-exporter.

Fix BPF verifier issue

What this PR does / why we need it:

On kernel 5.15, the BPF verifier rejects the current version after 354 iterations of the unrolled for loop. TLS_MAX_SERVER_NAME_LEN is less than that (128), and if the for loop is rewritten in this (equivalent) way, the BPF verifier accepts the program (on both 5.13 and 5.15).

Owner

Gardener
