A tool to dump and restore Prometheus data blocks.

promdump

promdump dumps the head and persistent blocks of Prometheus. It supports filtering the persistent blocks by time range.

Why This Tool

When debugging Kubernetes clusters with restrictive access, I often find it helpful to get access to the in-cluster Prometheus metrics. To reduce the back-and-forth with users (due to missing metrics, incorrect labels etc.), it makes sense to ask them to "get me everything around the time of the incident".

The most common way to achieve this is to use commands like kubectl exec and kubectl cp to compress and dump Prometheus' entire data directory. On non-trivial clusters, the resulting compressed file can be very large. To import the data into a local test instance, I will need at least the same amount of disk space.

promdump is a tool that can be used to dump Prometheus data blocks. It differs from the promtool tsdb dump command in that its output can be re-used in another Prometheus instance. See this issue for a discussion of the limitations of the promtool tsdb dump output. And unlike the Prometheus TSDB snapshot API, promdump doesn't require Prometheus to be started with the --web.enable-admin-api option. Instead of dumping the entire TSDB, promdump offers the flexibility to filter persistent blocks by time range.

How It Works

The promdump CLI downloads the promdump-$(VERSION).tar.gz file from a public storage bucket to your local /tmp folder. The download will be skipped if such a file already exists. The -f option can be used to force a re-download.
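
For example, to force a re-download of the bundle before dumping (a hedged example; the remaining flags follow the usage shown in Getting Started below):

kubectl promdump --context "${CONTEXT}" -p "${POD_NAME}" -f \
  --min-time "2021-04-15 03:19:10" \
  --max-time "2021-04-18 20:34:48" > dump.tar.gz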

Then the CLI uploads the decompressed promdump binary to the targeted Prometheus container, via the pod's exec subresource.

Within the Prometheus container, promdump queries the Prometheus TSDB using the tsdb package. It reads and streams the WAL files, head block and persistent blocks to stdout, which can be redirected to a file on your local file system. To regulate the size of the dump, persistent blocks can be filtered by time range.
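
For reference, the core binary can also be run directly inside the container. A hedged sketch, based on the flags reported by users of the standalone binary (-data-dir, plus -min-time and -max-time in nanoseconds); exact flags may vary between versions:

# run inside the Prometheus container; stdout carries the compressed dump
./promdump -data-dir /data \
  -min-time `date +%s%N --date "2021-04-15 03:19:10"` \
  -max-time `date +%s%N --date "2021-04-18 20:34:48"` > /tmp/dump.tar.gz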

promdump performs read-only operations on the TSDB.

When the data dump is completed, the promdump binary will be automatically deleted from your Prometheus container.

The restore subcommand can then be used to copy this dump file to another Prometheus container. When this container is restarted, it will reconstruct its in-memory index and chunks using the restored on-disk memory-mapped chunks and WAL.

The --debug option can be used to output more verbose logs for each command.
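
For example, to see verbose logs while checking the TSDB metadata (variables as defined in Getting Started below):

kubectl promdump meta --context "${CONTEXT}" -p "${POD_NAME}" --debug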

Getting Started

Install promdump as a kubectl plugin:

kubectl krew update

kubectl krew install promdump

kubectl promdump --version

For demonstration purposes, use kind to create two K8s clusters:

for i in {0..1}; do \
  kind create cluster --name dev-0$i ;\
done

Install Prometheus on both clusters using the community Helm chart:

for i in {0..1}; do \
  helm --kube-context=kind-dev-0$i install prometheus prometheus-community/prometheus ;\
done

Deploy a custom controller to cluster dev-00. This controller is annotated for metrics scraping:

kubectl --context=kind-dev-00 apply -f https://raw.githubusercontent.com/ihcsim/controllers/master/podlister/deployment.yaml

Port-forward to the Prometheus pod to find the custom demo_http_requests_total metric.

📝 Later, we will use promdump to copy the samples of this metric over to the dev-01 cluster.

CONTEXT="kind-dev-00"
POD_NAME=$(kubectl --context "${CONTEXT}" get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
kubectl --context="${CONTEXT}" port-forward "${POD_NAME}" 9090
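
With the port-forward running, one way to confirm the metric is present is to query the Prometheus HTTP API from another terminal (a hedged example; browsing the Prometheus console works too):

curl -s 'http://localhost:9090/api/v1/query?query=demo_http_requests_total'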

Demo controller metrics

📝 In subsequent commands, the -c and -d options can be used to change the container name and data directory.

Dump the data from the first cluster:

# check the tsdb metadata
kubectl promdump meta --context=$CONTEXT -p $POD_NAME
Head Block Metadata
------------------------
Minimum time (UTC): | 2021-04-18 18:00:03
Maximum time (UTC): | 2021-04-18 20:34:48
Number of series    | 18453

Persistent Blocks Metadata
----------------------------
Minimum time (UTC):     | 2021-04-15 03:19:10
Maximum time (UTC):     | 2021-04-18 18:00:00
Total number of blocks  | 9
Total number of samples | 92561234
Total number of series  | 181304
Total size              | 139272005

# capture the data dump
TARFILE="dump-`date +%s`.tar.gz"
kubectl promdump \
  --context "${CONTEXT}" \
  -p "${POD_NAME}" \
  --min-time "2021-04-15 03:19:10" \
  --max-time "2021-04-18 20:34:48"  > "${TARFILE}"

# view the content of the tar file. expect to see the 'chunk_heads', 'wal' and
# persistent blocks directories.
tar -tf "${TARFILE}"

Restore the data dump to the Prometheus pod on the dev-01 cluster, where we don't have the custom controller:

CONTEXT="kind-dev-01"
POD_NAME=$(kubectl --context "${CONTEXT}" get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")

# check the tsdb metadata
kubectl promdump meta --context "${CONTEXT}" -p "${POD_NAME}"
Head Block Metadata
------------------------
Minimum time (UTC): | 2021-04-18 20:39:21
Maximum time (UTC): | 2021-04-18 20:47:30
Number of series    | 20390

No persistent blocks found

# restore the data dump found at ${TARFILE}
kubectl promdump restore \
  --context="${CONTEXT}" \
  -p "${POD_NAME}" \
  -t "${TARFILE}"

# check the metadata again. it should match that of the dev-00 cluster
kubectl promdump meta --context "${CONTEXT}" -p "${POD_NAME}"
Head Block Metadata
------------------------
Minimum time (UTC): | 2021-04-18 18:00:03
Maximum time (UTC): | 2021-04-18 20:35:48
Number of series    | 18453

Persistent Blocks Metadata
----------------------------
Minimum time (UTC):     | 2021-04-15 03:19:10
Maximum time (UTC):     | 2021-04-18 18:00:00
Total number of blocks  | 9
Total number of samples | 92561234
Total number of series  | 181304
Total size              | 139272005

# confirm that the WAL, head and persistent blocks are copied to the targeted
# Prometheus server
kubectl --context="${CONTEXT}" exec "${POD_NAME}" -c prometheus-server -- ls -al /data

Restart the Prometheus pod:

kubectl --context="${CONTEXT}" delete po "${POD_NAME}"

Port-forward to the new pod to confirm that the samples of the demo_http_requests_total metric have been copied over (deleting the pod changes its name, so re-run the earlier kubectl get pods lookup to refresh ${POD_NAME} first):

kubectl --context="${CONTEXT}" port-forward "${POD_NAME}" 9091:9090

Make sure that the time frame of your query matches that of the restored data.
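
For example, querying the restored samples at an instant inside the restored range shown above (a hedged example; substitute a timestamp that falls within your own dump's range):

curl -s 'http://localhost:9091/api/v1/query?query=demo_http_requests_total&time=2021-04-18T20:30:00Z'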

Restored metrics

FAQ

Q: I am not seeing the restored data

A: There are a few things you can check:

  • When generating the dump, make sure the start and end date times are specified in the UTC time zone.
  • If using the Prometheus console, make sure the time filter falls within the time range of your data dump. You can confirm your restored data time range using the kubectl promdump meta subcommand.
  • Compare the TSDB metadata of the target Prometheus with that of the source Prometheus to see if their time ranges match, using the kubectl promdump meta subcommand. The head block metadata may deviate slightly depending on how old your data dump is.
  • Use the kubectl exec command to run commands like ls -al <data_dir> and cat <data_dir>/<data_block>/meta.json to confirm the time range of a particular data block (see the example after this list).
  • Try restarting the Prometheus pod after the restoration to let Prometheus replay the restored WAL. The restored data is persisted on disk and will survive the restart.
  • Check Prometheus logs to see if there are any errors due to corrupted data blocks.
  • Run the kubectl promdump restore subcommand with the --debug flag to see if it provides more hints.
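
A minimal sketch of the data-block inspection mentioned above, assuming the /data directory and prometheus-server container used earlier in this walkthrough (replace <data_block> with an actual block directory name):

kubectl --context "${CONTEXT}" exec "${POD_NAME}" -c prometheus-server -- ls -al /data
kubectl --context "${CONTEXT}" exec "${POD_NAME}" -c prometheus-server -- cat /data/<data_block>/meta.json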

Limitations

promdump is still in its experimental phase. SREs can use it to copy data blocks from one Prometheus instance to a development instance while debugging cluster issues.

Before restoring the data dump, promdump erases the content of the data folder in the target Prometheus instance, to avoid corrupting the data blocks with conflicting segment errors such as:

opening storage failed: get segment range: segments are not sequential

It's not suitable for production backup/restore operations.
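
Since the restore wipes the target's existing data, consider backing up the target data directory first. A hedged sketch, assuming the /data directory and prometheus-server container used in the demo above:

kubectl --context "${CONTEXT}" cp -c prometheus-server "${POD_NAME}":/data ./data-backup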

Like kubectl cp, promdump requires the tar binary to be installed in the Prometheus container.

Development

To run linters and unit tests:

make lint test

To produce local builds:

# the kubectl CLI plugin
make cli

# the promdump core
make core

To install Prometheus via Helm:

make hack/prometheus

To do a release:

git tag -a v$version

make dist release

Note that the GitHub Actions pipeline uses the same make release targets.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner
Ivan Sim
Principal Software Engineer (OpenShift)
Comments
  • Add CLI option to use local version of promdump

    Currently, the CLI downloads the promdump.tar.gz file from a remote bucket if it doesn't exist locally. This makes development cumbersome. Add a new CLI option so that the CLI can read a local version of the .tar.gz file.

  • Improve Development/Release Workflow

    Development:

    1. The CLI expects the promdump .tar.gz file to be found in the /tmp directory. This is inconvenient during development.
      • [ ] Update the core target to bundle up the promdump .tar.gz
      • [x] Add a CLI option to specify where to read the dev promdump .tar.gz file from (2b463c3f83ab7ddcfb1e214c6f6ba7661d6b09fa)

    Release:

    1. Auto-update the kubectl plugin manifest during a release. See the krew documentation.
  • Update CI To Build Image

    This PR:

    1. Adds a new Dockerfile
    2. Updates CI jobs to run on Ubuntu 22.04
    3. Upgrades golangci-lint to the latest version
    4. Updates CI to push the Docker image to ghcr.io
  • Embed promdump binary in CLI

    This PR uses the Go embed package to embed the promdump binary in the CLI, removing the code to download the binary from the remote s3 bucket. The go.mod is updated to use Go 1.18.

  • Promdump OOMs when using Large date range

    Is there a limit on the number of hours/days we should use? It OOMs when we use 2 days or more, for example:

    ~$ ./promdump -min-time `date +%s%N --date "2022-09-24 00:00:00"` -max-time `date +%s%N --date "2022-09-27 00:00:00"` -data-dir /opt/yugabyte/prometheusv2 > test_prom_dump2.tgz
    Killed
    

    We see it's stuck at this stage and then it gets killed:

    <Skipping>
    time=2022-09-28T17:04:48Z caller=level.go:63 level=debug message="checking block" path=01GE2GN42ZPGBTVXGX4BPPFVGK minTime(utc)=2022-09-28T14:00:01.387Z maxTime(utc)=2022-09-28T16:00:00Z
    time=2022-09-28T17:04:48Z caller=level.go:63 level=debug message="skipping block" path=01GE2GN42ZPGBTVXGX4BPPFVGK
    time=2022-09-28T17:04:48Z caller=level.go:63 level=debug message="finish parsing persistent blocks" numBlocksFound=6
    Killed
    
    

    /var/log/messages shows promdump is getting killed due to OOM:

    Sep 28 17:06:20 kachand-ybany kernel: [4826585.690969] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=promdump,pid=16712,uid=1016
    Sep 28 17:06:20 kachand-ybany kernel: [4826585.691001] Out of memory: Killed process 16712 (promdump) total-vm:34456068kB, anon-rss:12744864kB, file-rss:0kB, shmem-rss:0kB, UID:1016 pgtables:25368kB oom_score_adj:0
    Sep 28 17:06:20 kachand-ybany kernel: [4826585.889349] oom_reaper: reaped process 16712 (promdump), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    Sep 28 17:08:19 kachand-ybany systemd[1]: Started Session 1560 of user centos.
    Sep 28 17:09:14 kachand-ybany kernel: [4826760.104395] containerd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-999
    Sep 28 17:09:14 kachand-ybany kernel: [4826760.104400] CPU: 0 PID: 6143 Comm: containerd Not tainted 5.4.0-1083-gcp #91~18.04.1-Ubuntu
    Sep 28 17:09:14 kachand-ybany kernel: [4826760.104401] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/29/2022
    

    Any workarounds we can use, other than increasing the memory?

  • Found unsequential head chunk files

    I am running promdump in a test environment and occasionally I cannot run the meta command because I get the following error message:

    # kubectl promdump meta -n openshift-monitoring -p prometheus-k8s-0 -c prometheus -d /prometheus
    time=2022-05-19T15:49:05Z caller=level.go:63 level=error error="found unsequential head chunk files chunks_head/000010 (index: 10) and chunks_head/000012 (index: 12)"
    failed to exec command: command terminated with exit code 1
    

    Is it possible to make promdump tolerant of the reported condition of unsequential head chunk files?

  • Update Prometheus Go Module Dependency

    promdump depends on a very old version (v1.8.2-0.20201015110737-0a7fdd3b7696) of the Prometheus Go module. We should update go.mod to use a newer version. More info can be found in this issue. My recent attempt to upgrade shows that there are conflicts between the go-openapi version depended on by kustomize (an indirect dependency of cli-runtime) and the one depended on by Prometheus.
