Sloth

🦥 Easy and simple Prometheus SLO generator

Introduction

Use the easiest way to generate SLOs for Prometheus.

Sloth generates understandable, uniform and reliable Prometheus SLOs for any kind of service, using a simple SLO spec that results in multiple metrics and multi-window, multi-burn alerts.

At this moment Sloth is focused on Prometheus; however, depending on demand and complexity, we may support more backends.

Features

  • Simple, maintainable and understandable SLO spec.
  • Reliable SLO metrics and alerts.
  • Based on Google SLO implementation and multi window multi burn alerts framework.
  • Autogenerates Prometheus SLI recording rules in different time windows.
  • Autogenerates Prometheus SLO metadata rules.
  • Autogenerates Prometheus SLO multi window multi burn alert rules (Page and warning).
  • SLO spec validation.
  • Customization of labels, disabling different types of alerts...
  • A single way (uniform) of creating SLOs across all different services and teams.
  • Automatic Grafana dashboard to see all your SLOs state.
  • Single binary and easy to use CLI.
  • Kubernetes (Prometheus-operator) support.

Small Sloth SLO dashboard

Get Sloth

Getting started

Release the Sloth!

sloth generate -i ./examples/getting-started.yml
version: "prometheus/v1"
service: "myservice"
labels:
  owner: "myteam"
  repo: "myorg/myservice"
  tier: "2"
slos:
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
    alerting:
      name: MyServiceAvailabilitySLO
      labels:
        category: "availability"
      annotations:
        # Overwrite the default Sloth SLO alert summary on ticket and page alerts.
        summary: "High error rate on 'myservice' requests responses"
      page_alert:
        labels:
          severity: pageteam
          routing_key: myteam
      ticket_alert:
        labels:
          severity: "slack"
          slack_channel: "#alerts-myteam"

This would be the result you would obtain from the above spec example.

How does it work

At this moment Sloth uses Prometheus rules to implement SLOs. The generated recording and alert rules provide a reliable and uniform SLO implementation:

1 Sloth spec -> Sloth -> N Prometheus rules

The Prometheus rules that Sloth generates can be explained in 3 categories:

  • SLIs: These rules are the base. They use the queries provided by the user to compute the service's error level (e.g. availability). Sloth creates multiple rules for different time windows; these different results are used by the alerts.
  • Metadata: These are informative metrics, like the remaining error budget or the SLO objective percent... They are very handy for SLO visualization, e.g. a Grafana dashboard.
  • Alerts: These are the multi-window, multi-burn alerts that are based on the SLI rules.

Sloth takes the service level spec and, for each SLO in the spec, creates three rule groups with the above categories.

The generated rules share the same metric names across SLOs; the labels are the key to identifying the different services, SLOs, etc. This is how we obtain a uniform way of describing all the SLOs across different teams and services.

To get all the available metric names created by Sloth, use this query:

count({sloth_id!=""}) by (__name__)
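
For illustration, this is roughly what one of the generated SLI recording rules looks like for the getting-started spec above (trimmed sketch; the exact label set and rules depend on your spec and Sloth version):

- record: slo:sli_error:ratio_rate5m
  expr: |
    (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
    /
    (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
  labels:
    sloth_id: myservice-requests-availability
    sloth_service: myservice
    sloth_slo: requests-availability
    sloth_window: 5m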

Modes

Generator

generate will generate Prometheus rules in different formats based on the specs. This mode only needs the CLI, so it's very useful in GitOps flows, CI, scripts, or as a tool in your toolbox.

Currently there are two types of specs supported by the generate command. Sloth will detect the input spec type and generate the output accordingly:

Raw (Prometheus)

Check spec here: v1

Generates the Prometheus recording and alerting rules in standard Prometheus YAML format.

Kubernetes CRD (Prometheus-operator)

Check CRD here: v1

Generates Prometheus-operator CRD rules from a Sloth CRD spec, based on the Sloth CRD template.

The CRD doesn't need to be registered in any K8s cluster because everything happens offline through the CLI. A Kubernetes controller that makes this translation automatically inside the Kubernetes cluster is on the TODO list.
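
As a rough example, a hypothetical invocation that turns a Sloth CRD spec file into Prometheus-operator rules could look like this (the file paths here are placeholders; -i and -o work the same way as in the raw mode):

sloth generate -i ./my-slos/service-slos-crd.yml -o ./my-slos/_gen/service-slos-rules.yml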

Examples

  • Alerts disabled: Simple example that shows how to disable alerts.
  • K8s apiserver: Real example of SLOs for a Kubernetes Apiserver.
  • Home wifi: My home Ubiquiti WiFi SLOs.
  • K8s Home wifi: Same as home-wifi but shows how to generate Prometheus-operator CRD from a Sloth CRD.
  • Raw Home wifi: Example showing how to use raw SLIs instead of the common events using the home-wifi example.

The resulting generated SLOs are in examples/_gen.

F.A.Q

Why Sloth?

Creating Prometheus rules for an SLI/SLO framework is hard, error-prone, and pure toil.

Sloth abstracts this task, and we also gain:

  • Read friendliness: Easy to read and declare SLIs/SLOs.
  • GitOps: Easy to integrate with CI flows like validation, checks...
  • Reliability and testing: The generated Prometheus rules are already known to work; there is no need to create tests.
  • Centralized features and error fixes: An update to Sloth is applied to all the SLOs managed/generated with it.
  • Standardized metrics: Same conventions, automatic dashboards...
  • Future features rolled out for free with the same specs: e.g. automatic report creation.

SLI?

Service level indicator: a way of quantifying how well your service is responding to its users.

TL;DR: what counts as good/bad service for your users. E.g.:

  • Requests with a status code >= 500 are considered errors.
  • Requests taking longer than 200ms are considered errors.
  • Process executions with an exit code > 0 are considered errors.

It is normally measured using events: good or bad events divided by total events.
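
As a formula (using the error-based form that Sloth's error_query/total_query spec expresses):

sli_error_ratio = bad_events / total_events    (e.g. failed requests / total requests)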

SLO?

Service level objective: a percentage that tells how many SLI errors your service can afford in a specific period of time.

Error budget?

The error budget is the amount of errors (measured by the SLI) that you can have in a specific period of time; it is derived from the SLO.

Let's see an example:

  • SLI error: request status code >= 500
  • Period: 30 days
  • SLO: 99.9%
  • Error budget: 0.1 (100 - 99.9)
  • Total requests in 30 days: 10000
  • Available error requests: 10 (10000 * 0.1 / 100)

If we have more than 10 responses with a >= 500 status code, we would be burning more error budget than is available; if we have fewer errors, we would end the period without spending all the error budget.

Burn rate?

The speed at which you are consuming your error budget. This is key for SLO-based alerting (Sloth will create all these alerts), because depending on how fast you are consuming your error budget, different alerts will trigger.

Speed/rate examples (the relationship is spelled out after the list):

  • 1: You are consuming 100% of the error budget in the expected period (e.g. with a 30d period, in 30 days).
  • 2: You are consuming 200% of the error budget in the expected period (e.g. with a 30d period, in 15 days).
  • 60: You are consuming 6000% of the error budget in the expected period (e.g. with a 30d period, in 12 hours).
  • 1080: You are consuming 108000% of the error budget in the expected period (e.g. with a 30d period, in 40 minutes).
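
In other words, the time it takes to exhaust the error budget is the SLO period divided by the burn rate:

time to exhaust error budget = SLO period / burn rate
30d / 2    = 15 days
30d / 60   = 12 hours
30d / 1080 = 40 minutes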

SLO based alerting?

With SLO-based alerting you will get better alerting than with a regular alerting system, because:

  • Alerts on symptoms (SLIs), not causes.
  • Trigger at different levels (warning/ticket and critical/page).
  • It takes into account time and quantity, that is: the speed of errors and the number of errors in a specific time window.

The result of this is:

  • Correct time to trigger alerts (important == fast, not so important == slow).
  • Reduce alert fatigue.
  • Reduce false positives and negatives.

What are ticket and page alerts?

MWMB (multi-window, multi-burn) alerting is based on two kinds of alerts, ticket and page:

  • page: Critical alerts that are normally used to wake someone up, notify on important channels, trigger on-call...
  • ticket: Warning alerts that normally open tickets, post messages on non-critical Slack channels...

These are triggered in different ways: page alerts trigger faster but require a faster error budget burn rate, while ticket alerts trigger more slowly and require a lower and more constant error budget burn rate.

Can I disable alerts?

Yes, set disable: true on the page and ticket alerts (see the sketch below).
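
A minimal sketch of how that fits in the spec above (placement assumed from the alerting block in the getting-started example; see the "Alerts disabled" example for the exact form):

alerting:
  page_alert:
    disable: true
  ticket_alert:
    disable: true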

Grafana dashboard?

Check the grafana-dashboard; this dashboard will load the SLOs automatically.

Owner

Xabier Larrakoetxea Gallego (Platform tools at @newrelic)
Comments
  • Ignore sloth_window in prometheus alerts

    This prevents alerts resolving and re-firing when different windows fire.

    Implemented using the suggestion from @tokheim in #240. Fixes https://github.com/slok/sloth/issues/240

  • The value of the "Remaining error budget (30d window)" label is not properly shown

    As in the screenshots below (omitted), the value of the "Remaining error budget (30d window)" label is not properly shown: there are multiple "NaN" values.

    Why is this happening and how can I fix it? Thank you in advance!

  • feat: add securityContext for pod and container

    What: Deployments can have security settings in their manifest on two levels: pod and container. However, some capabilities are only configurable at one of the respective levels (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#securitycontext-v1-core). This PR sets a default configuration for the container securityContext, which drops all POSIX capabilities and denies privilege escalation, and for the pod securityContext adds user, group, fsGroup and supplementalGroups and also denies running as root. These are, and should be, standard settings in the context of Kubernetes. It also adds the possibility of running vault-injector in a Kubernetes environment without PSP (to be removed in v1.25, https://kubernetes.io/docs/concepts/security/pod-security-policy/), but with OpenPolicyAgent (possibly the PSP substitute) with the same capabilities as a restricted PSP instead.

    This PR sets the respective settings to the values.yaml and is defaulting them as well. With this they can be adopted if it is needed.

  • Quiet services and NaN

    I'm wondering what can be done about the scenario where the service has periods of receiving no (or few) requests so NaN values start to creep in.

    This is because, for example

    (sum(rate(http_server_requests_seconds_count{deployment="sloexample"}[1h])))
    

    evaluates to 0 so you get NaN when you divide by it to get error ratios.

    Normally this is fine - until you come to error budgets. My remaining error budget is not "undefined" or NaN just because my service went quiet for a period of time.

  • promtool complains about duplicate rules

    Hi,

    I created a PrometheusServiceLevel with two SLIs. Checking the generated PrometheusRule CR with promtool, it complains about a duplicate recording rule.

    > promtool check rules example.yaml
    Checking example.yaml
    1 duplicate rule(s) found.
    Metric: slo:sli_error:ratio_rate30d
    Label(s):
            sloth_window: 30d
    Might cause inconsistency while recording expressions.
      SUCCESS: 34 rules found
    

    Example of one generated recording rule:

        - expr: |
            sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="example-ingress-error-rate", sloth_service="example", sloth_slo="ingress-error-rate"}[30d])
            / ignoring (sloth_window)
            count_over_time(slo:sli_error:ratio_rate5m{sloth_id="example-ingress-error-rate", sloth_service="example", sloth_slo="ingress-error-rate"}[30d])
          labels:
            sloth_window: 30d
            # sloth_slo: ingress-error-rate  # <- adding this fixes the promtool complaint.
          record: slo:sli_error:ratio_rate30d
    

    The problem can be bypassed by adding the label sloth_slo: ingress-error-rate to the rule explicitly. Would you accept a PR for this change?

  • Support for different SLO time windows

    👋 Hi there!

    First and foremost thanks for open sourcing this, this is cool stuff that I might end up using at work.

    Do you have any plans for adding support to different time windows other than 30 days?

    I was taking a look at the code and I see it is hardcoded in https://github.com/slok/sloth/blob/main/internal/prometheus/spec.go#L63

    I'm not sure if this is just a matter of adding support for this in the api spec or if there's more to it than just that.

  • Add alerting windows spec and use these to customize the alerts for advanced users

    This PR adds support for customizing SLO period windows.

    It has a new spec that users can use to decide how the SLO period windows should be. An example of the most commonly used SLO period (30d) that Sloth has by default would be declared like this:

    apiVersion: "sloth.slok.dev/v1"
    kind: "AlertWindows"
    spec:
      sloPeriod: 30d
      page:
        quick:
          errorBudgetPercent: 2
          shortWindow: 5m
          longWindow: 1h
        slow:
          errorBudgetPercent: 5
          shortWindow: 30m
          longWindow: 6h
      ticket:
        quick:
          errorBudgetPercent: 10
          shortWindow: 2h
          longWindow: 1d
        slow:
          errorBudgetPercent: 10
          shortWindow: 6h
          longWindow: 3d
    
    

    By default, Sloth continues supporting 28d and 30d SLO periods.

    Also added --slo-period-windows-path flag to load custom SLO period windows from a directory.

    BREAKING

    • --window-days has been renamed to --default-slo-period.
    • Removed -w short flag for --window-days.
  • Other way to write SLO than ErrorQuery and TotalQuery

    Hi, Xabier

    I have the Prometheus plugin installed in my Jenkins, which uses DropWizard metrics. The plugin exports the metric jenkins_job_total_duration, which already includes a quantile label. E.g. jenkins_job_total_duration{quantile="0.999"} = 300ms, meaning that 99.9% of Jenkins jobs have a duration of 300ms or less. However, there's no way to know how many jobs have that 300ms duration. It's not like the bucket implementation of Prometheus, where we apply the histogram_quantile function and the le label to know the duration and how many jobs.

    In this case, I can't write the SLO "99.9% of jobs should have a duration of less than 300ms" in terms of ErrorQuery and TotalQuery, because milliseconds can't be divided by jobs; they're not in the same unit.

    So my question is: is there another way to write the SLO in this case? Does Sloth support another SLO declaration without ErrorQuery and TotalQuery? Something like the RED framework. I would love to know your opinion on defining an SLO for this case.

    I love your work. Thanks, Xabier.

  • Add extra labels to prometheusRule object

    Why?

    In our case, PrometheusRule requires labels to be picked up.

    Disclaimer

    I think I did it right? It works (tested on my clusters), but Golang development is a hobby for me xD; I'm a system administrator. Let me know if you want me to change something.

  • Make OptimizedSLIRecordGenerator optional

    In https://github.com/slok/sloth/blob/main/internal/prometheus/recording_rules.go#L53 the SLI for the SLO period is always calculated using an optimized calculation method. When there isn't uniform load on a service, the result of the optimized method can differ quite a lot from a calculation using errors divided by total. An example can be seen below, where traffic (green) is very uneven and the optimized calculation (yellow) underestimates the error rate many times over compared to the regular (blue) calculation.

    [graph omitted]

    This comes from the optimized calculation assuming that each 5m slice is equally important for the overall SLO. You can even see that while the blue line stays static in periods with no traffic, the yellow error rate slowly decreases.

    Granted, there isn't any broad consensus on how SLOs should be calculated; it is a topic of passionate debate. One example discussing the fairness of using ratio-based SLOs can be found in https://grafana.com/blog/2019/11/27/kubecon-recap-how-to-include-latency-in-slo-based-alerting/

    “ISPs love to tell you they have 99.9% uptime. What does that mean? This implies it’s time-based, but all night I’m asleep, I am not using my internet connection. And even if there’s a downtime, I won’t notice and my ISP will tell me they were up all the time. Not using the service is like free uptime for the ISP. Then I have a 10-minute, super important video conference, and my internet connection goes down. That’s for me like a full outage. And they say, yeah, 10 minutes per month, that’s three nines, it’s fine.”

    A better alternative: a request-based SLA that says, “During each month we’ll serve 99% of requests successfully.”

    Would there be any interest in making the OptimizedSLIRecordGenerator optional? With some input on how a user could control this, I'd be happy to try to create a pull request.

  • Helm: fix typo from extra-lables to extra-labels

    This is a very simple typo fix in the helm chart: the string extra-lables -> extra-labels.

    Updated the Chart version to 0.4.1 manually, but in case this is not needed please let me know and I will revert it.

  • sloth generate directory

    I'm trying to create a GitHub Action to generate the rules but I get an error from my generate command.

    The job part of my GitHub Action config:

      generate-slo-job-1:
        name: Generate the SLOs
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - name: download and setup generator binary
            run: |       
              wget https://github.com/slok/sloth/releases/download/v0.11.0/sloth-linux-amd64
              chmod +x sloth-linux-amd64
              ./sloth-linux-amd64 generate -i ./configuration/sloth/rules/ -o ./configuration/sloth/rules/_gen/
          - uses: EndBug/add-and-commit@v9
    

    error message: error: "generate" command failed: invalid spec, could not load with any of the supported spec types

    What would be the correct command?

  • Issues when generating SLOs with custom period window

    Issue

    When specifying a custom SLO period window, the sloth CLI returns some technical errors instead of the expected behaviour. Even when following the guide on the website, it doesn't work properly. In particular, without alerting enabled, the page & ticket windows are still required in the AlertWindows.

    Expected behaviour

    It should be easy to generate SLOs with custom period windows (e.g. 7d or 14d)

    Steps to reproduce

    Scenario 1: Just specifying the default-slo-period

    sloth generate --default-slo-period="14d" -i ./examples/getting-started.yml
    ...
    error: "generate" command failed: invalid default slo period: window period 336h0m0s missing%
    

    Very technical error

    Scenario 2: Specifying the default-slo-period and slo-period-windows-path

    sloth generate --default-slo-period="7d" --slo-period-windows-path=./examples/windows -i ./examples/getting-started.yml
    

    Works

    Scenario 3: Specifying the default-slo-period and slo-period-windows-path without alerting/paging

    Change 7d.yaml to not include alerting/paging

    apiVersion: sloth.slok.dev/v1
    kind: AlertWindows
    spec:
      sloPeriod: 7d
    
    sloth generate --default-slo-period="7d" --slo-period-windows-path=./examples/windows -i ./examples/no-alerts.yml
    (error: "generate" command failed: could not load SLO period windows repository: could not initialize custom windows: could not discover period windows: could not load "7d.yaml" alert windows: invalid alerting window: invalid page quick: long window is required%
    

    It seems like the page & ticket section in the AlertWindows is needed even though it is not used in the SLO spec.

    Ideas

    Improve error messages

    • Throw a proper error message when custom window can't be found
    • Throw a proper error message when the page/ticket section in the AlertWindows is missing (make it required)

    Don't throw an error if alerting is not needed

    • When the alerting is disabled in the SLO spec, it should not be required in the AlertWindows section
  • Metrics on the Sloth operator itself

    Hi folks.

    We were wondering if we can monitor and introduce alerts if something goes wrong with the sloth operator itself. We deployed it in Kubernetes, and with the PodMonitor we can fetch the metrics exposed on :8081/metrics. Some seem to be related to the /metrics interface, some to the kooper controller, to Go, etc.

    Long story short, we want to be alerted when Sloth could not expand the spec to a Prometheus rule. Is there a metric for this that I'm missing?

    Thank you.

  • build(deps): bump github.com/prometheus/prometheus from 0.40.3 to 0.41.0

    Bumps github.com/prometheus/prometheus from 0.40.3 to 0.41.0.

    Commits
    • c0d8a56 Merge pull request #11744 from roidelapluie/finalrelease
    • d7937d4 Release 2.41.0
    • 1bf03eb Merge pull request #11720 from roidelapluie/release-2-41-0-rc-0
    • 75af653 Release v2.41.0-rc.0
    • 8aae683 Update docker dependency
    • 4f35683 Merge pull request #11727 from prometheus/fix-error-unwrapping
    • 1a2c645 Correctly handle error unwrapping in rules and remote write receiver
    • 88ee72d Merge pull request #11712 from roidelapluie/update-deps-for-41
    • 9e26adf Add myself as release shepherd (#11693)
    • c396c3e Update go dependencies before 2.41
    • Additional commits viewable in compare view

  • Pull latest image from docker hub?

    Hi Team, @slok

    Is there a way we can pull the latest image from the Docker Hub registry rather than from ghcr? The last updated tag was a year ago:

    https://hub.docker.com/r/slok/sloth

  • Best strategy to manage > 400 SLOs

    Hi there! We will have to create a lot of SLOs (loading time, error rates for > 200 endpoints, etc...). Looking at what Sloth generated as rules - examples:

    - record: slo:sli_error:ratio_rate5m
        expr: |
          (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
          /
          (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
    
    - record: slo:sli_error:ratio_rate30d
        expr: |
          sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
          / ignoring (sloth_window)
          count_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
    

    I have a question regarding how to manage hundreds of records when we need to specify the service name and handler properties in each record. Is it correct, and is it worth it, to create a single record without specifying the job/service or any other property, like the following ones:

    - record: slo:sli_error:ratio_rate5m
        expr: |
          (sum(rate(http_request_duration_seconds_count{code=~"(5..|429)"}[5m])))
          /
          (sum(rate(http_request_duration_seconds_count{}[5m])))
    
    - record: slo:sli_error:ratio_rate30d
        expr: |
          sum_over_time(slo:sli_error:ratio_rate5m{}[30d])
          / ignoring (sloth_window)
          count_over_time(slo:sli_error:ratio_rate5m{}[30d])
    

    Thanks a lot for your help!
