Sloth

🦥 Easy and simple Prometheus SLO generator

Introduction

Use the easiest way to generate SLOs for Prometheus.

Sloth generates understandable, uniform and reliable Prometheus SLOs for any kind of service, using a simple SLO spec that results in multiple metrics and multi-window, multi-burn alerts.

At this moment Sloth is focused on Prometheus; however, depending on demand and complexity, we may support more backends.

Features

  • Simple, maintainable and understandable SLO spec.
  • Reliable SLO metrics and alerts.
  • Based on Google SLO implementation and multi window multi burn alerts framework.
  • Autogenerates Prometheus SLI recording rules in different time windows.
  • Autogenerates Prometheus SLO metadata rules.
  • Autogenerates Prometheus SLO multi window multi burn alert rules (Page and warning).
  • SLO spec validation.
  • Customization of labels, disabling different types of alerts...
  • A single way (uniform) of creating SLOs across all different services and teams.
  • Automatic Grafana dashboard to see all your SLOs state.
  • Single binary and easy to use CLI.
  • Kubernetes (Prometheus-operator) support.

Small Sloth SLO dashboard

Get Sloth

Getting started

Release the Sloth!

sloth generate -i ./examples/getting-started.yml
version: "prometheus/v1"
service: "myservice"
labels:
  owner: "myteam"
  repo: "myorg/myservice"
  tier: "2"
slos:
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
    alerting:
      name: MyServiceAvailabilitySLO
      labels:
        category: "availability"
      annotations:
        # Overwrite the default Sloth SLO alert summary on ticket and page alerts.
        summary: "High error rate on 'myservice' requests responses"
      page_alert:
        labels:
          severity: pageteam
          routing_key: myteam
      ticket_alert:
        labels:
          severity: "slack"
          slack_channel: "#alerts-myteam"

This would be the result you would obtain from the above spec example.

How does it work

At this moment Sloth uses Prometheus rules to implement SLOs. The generated recording and alert rules provide a reliable and uniform SLO implementation:

1 Sloth spec -> Sloth -> N Prometheus rules

The Prometheus rules that Sloth generates can be explained in 3 categories:

  • SLIs: These rules are the base. They use the queries provided by the user to compute the service's error level (e.g. availability). Sloth creates multiple rules for different time windows; these different results are used by the alerts.
  • Metadata: These are informative metrics, like the remaining error budget or the SLO objective percent... They are very handy for SLO visualization, e.g. a Grafana dashboard.
  • Alerts: These are the multi-window, multi-burn alerts that are based on the SLI rules.

Sloth takes the service level spec and, for each SLO in the spec, creates three rule groups with the above categories.

The generated rules share the same metric names across SLOs; the labels are the key to identifying the different services, SLOs, etc. This is how we obtain a uniform way of describing all the SLOs across different teams and services.

To get all the available metric names created by Sloth, use this query:

count({sloth_id!=""}) by (__name__)
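
For illustration, this is roughly what one of the generated SLI recording rules looks like for the getting-started spec above (trimmed sketch; the exact label set and rules depend on your spec and Sloth version):

- record: slo:sli_error:ratio_rate5m
  expr: |
    (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
    /
    (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
  labels:
    sloth_id: myservice-requests-availability
    sloth_service: myservice
    sloth_slo: requests-availability
    sloth_window: 5m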

Modes

Generator

generate will generate Prometheus rules in different formats based on the specs. This mode only needs the CLI, so it's very useful in GitOps flows, CI, scripts, or as a tool in your toolbox.

Currently there are two types of specs supported by the generate command. Sloth will detect the input spec type and generate the output accordingly:

Raw (Prometheus)

Check spec here: v1

Generates the Prometheus recording and alerting rules in standard Prometheus YAML format.

Kubernetes CRD (Prometheus-operator)

Check CRD here: v1

Generates Prometheus-operator CRD rules from a Sloth CRD spec, based on the Sloth CRD template.

The CRD doesn't need to be registered in any K8s cluster because everything happens offline through the CLI. A Kubernetes controller that makes this translation automatically inside the Kubernetes cluster is on the TODO list.
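
As a rough example, a hypothetical invocation that turns a Sloth CRD spec file into Prometheus-operator rules could look like this (the file paths here are placeholders; -i and -o work the same way as in the raw mode):

sloth generate -i ./my-slos/service-slos-crd.yml -o ./my-slos/_gen/service-slos-rules.yml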

Examples

  • Alerts disabled: Simple example that shows how to disable alerts.
  • K8s apiserver: Real example of SLOs for a Kubernetes Apiserver.
  • Home wifi: My home Ubiquiti WiFi SLOs.
  • K8s Home wifi: Same as home-wifi but shows how to generate Prometheus-operator CRD from a Sloth CRD.
  • Raw Home wifi: Example showing how to use raw SLIs instead of the common events using the home-wifi example.

The resulting generated SLOs are in examples/_gen.

F.A.Q

Why Sloth?

Creating Prometheus rules for an SLI/SLO framework is hard, error-prone, and pure toil.

Sloth abstracts this task, and we also gain:

  • Read friendliness: Easy to read and declare SLIs/SLOs.
  • GitOps: Easy to integrate with CI flows like validation, checks...
  • Reliability and testing: The generated Prometheus rules are already known to work; there is no need to create tests.
  • Centralized features and error fixes: An update to Sloth is applied to all the SLOs managed/generated with it.
  • Standardized metrics: Same conventions, automatic dashboards...
  • Future features rolled out for free with the same specs: e.g. automatic report creation.

SLI?

Service level indicator: a way of quantifying how well your service is responding to its users.

TL;DR: what counts as good/bad service for your users. E.g.:

  • Requests with a status code >= 500 are considered errors.
  • Requests taking longer than 200ms are considered errors.
  • Process executions with an exit code > 0 are considered errors.

It is normally measured using events: good or bad events divided by total events.
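
As a formula (using the error-based form that Sloth's error_query/total_query spec expresses):

sli_error_ratio = bad_events / total_events    (e.g. failed requests / total requests)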

SLO?

Service level objective: a percentage that tells how many SLI errors your service can afford in a specific period of time.

Error budget?

The error budget is the amount of errors (measured by the SLI) that you can have in a specific period of time; it is derived from the SLO.

Let's see an example:

  • SLI error: request status code >= 500
  • Period: 30 days
  • SLO: 99.9%
  • Error budget: 0.1 (100 - 99.9)
  • Total requests in 30 days: 10000
  • Available error requests: 10 (10000 * 0.1 / 100)

If we have more than 10 responses with a >= 500 status code, we would be burning more error budget than is available; if we have fewer errors, we would end the period without spending all the error budget.

Burn rate?

The speed at which you are consuming your error budget. This is key for SLO-based alerting (Sloth will create all these alerts), because depending on how fast you are consuming your error budget, different alerts will trigger.

Speed/rate examples (the relationship is spelled out after the list):

  • 1: You are consuming 100% of the error budget in the expected period (e.g. with a 30d period, in 30 days).
  • 2: You are consuming 200% of the error budget in the expected period (e.g. with a 30d period, in 15 days).
  • 60: You are consuming 6000% of the error budget in the expected period (e.g. with a 30d period, in 12 hours).
  • 1080: You are consuming 108000% of the error budget in the expected period (e.g. with a 30d period, in 40 minutes).
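
In other words, the time it takes to exhaust the error budget is the SLO period divided by the burn rate:

time to exhaust error budget = SLO period / burn rate
30d / 2    = 15 days
30d / 60   = 12 hours
30d / 1080 = 40 minutes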

SLO based alerting?

With SLO-based alerting you will get better alerting than with a regular alerting system, because:

  • Alerts on symptoms (SLIs), not causes.
  • Trigger at different levels (warning/ticket and critical/page).
  • It takes into account time and quantity, that is: the speed of errors and the number of errors in a specific time window.

The result of this is:

  • Correct time to trigger alerts (important == fast, not so important == slow).
  • Reduce alert fatigue.
  • Reduce false positives and negatives.

What are ticket and page alerts?

MWMB (multi-window, multi-burn) alerting is based on two kinds of alerts, ticket and page:

  • page: Critical alerts that are normally used to wake someone up, notify on important channels, trigger on-call...
  • ticket: Warning alerts that normally open tickets, post messages on non-critical Slack channels...

These are triggered in different ways: page alerts trigger faster but require a faster error budget burn rate, while ticket alerts trigger more slowly and require a lower and more constant error budget burn rate.

Can I disable alerts?

Yes, set disable: true on the page and ticket alerts (see the sketch below).
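
A minimal sketch of how that fits in the spec above (placement assumed from the alerting block in the getting-started example; see the "Alerts disabled" example for the exact form):

alerting:
  page_alert:
    disable: true
  ticket_alert:
    disable: true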

Grafana dashboard?

Check the grafana-dashboard; this dashboard will load the SLOs automatically.

Owner

Xabier Larrakoetxea Gallego (Platform tools at @newrelic)
Comments
  • Ignore sloth_window in prometheus alerts

    This prevents alerts resolving and re-firing when different windows fire.

    Implemented using the suggestion from @tokheim in #240. Fixes https://github.com/slok/sloth/issues/240

  • The value of the "Remaining error budget (30d window)" label is not properly shown

    As in the screenshots below (omitted), the value of the "Remaining error budget (30d window)" label is not properly shown: there are multiple "NaN" values.

    Why is this happening and how can I fix it? Thank you in advance!

  • feat: add securityContext for pod and container

    What: Deployments can have security settings in their manifest on two levels: pod and container. However, some capabilities are only configurable at one of the respective levels (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#securitycontext-v1-core). This PR sets a default configuration for the container securityContext, which drops all POSIX capabilities and denies privilege escalation, and for the pod securityContext adds user, group, fsGroup and supplementalGroups and also denies running as root. These are, and should be, standard settings in the context of Kubernetes. It also adds the possibility of running vault-injector in a Kubernetes environment without PSP (to be removed in v1.25, https://kubernetes.io/docs/concepts/security/pod-security-policy/), but with OpenPolicyAgent (possibly the PSP substitute) with the same capabilities as a restricted PSP instead.

    This PR sets the respective settings to the values.yaml and is defaulting them as well. With this they can be adopted if it is needed.

  • Quiet services and NaN

    I'm wondering what can be done about the scenario where the service has periods of receiving no (or few) requests so NaN values start to creep in.

    This is because, for example

    (sum(rate(http_server_requests_seconds_count{deployment="sloexample"}[1h])))
    

    evaluates to 0 so you get NaN when you divide by it to get error ratios.

    Normally this is fine - until you come to error budgets. My remaining error budget is not "undefined" or NaN just because my service went quiet for a period of time.

  • promtool complains about duplicate rules

    Hi,

    I created a PrometheusServiceLevel with two SLIs. Checking the generated PrometheusRule CR with promtool, it complains about a duplicate recording rule.

    > promtool check rules example.yaml
    Checking example.yaml
    1 duplicate rule(s) found.
    Metric: slo:sli_error:ratio_rate30d
    Label(s):
            sloth_window: 30d
    Might cause inconsistency while recording expressions.
      SUCCESS: 34 rules found
    

    Example of one generated recording rule:

        - expr: |
            sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="example-ingress-error-rate", sloth_service="example", sloth_slo="ingress-error-rate"}[30d])
            / ignoring (sloth_window)
            count_over_time(slo:sli_error:ratio_rate5m{sloth_id="example-ingress-error-rate", sloth_service="example", sloth_slo="ingress-error-rate"}[30d])
          labels:
            sloth_window: 30d
            # sloth_slo: ingress-error-rate  # <- adding this fixes the promtool complaint.
          record: slo:sli_error:ratio_rate30d
    

    The problem can be bypassed by adding the label sloth_slo: ingress-error-rate to the rule explicitly. Would you accept a PR for this change?

  • Support for different SLO time windows

    👋 Hi there!

    First and foremost thanks for open sourcing this, this is cool stuff that I might end up using at work.

    Do you have any plans for adding support to different time windows other than 30 days?

    I was taking a look at the code and I see it is hardcoded in https://github.com/slok/sloth/blob/main/internal/prometheus/spec.go#L63

    I'm not sure if this is just a matter of adding support for this in the api spec or if there's more to it than just that.

  • Add alerting windows spec and use these to customize the alerts for advanced users

    This PR adds support for customizing SLO period windows.

    It has a new spec that users can use to decide how the SLO period windows should be. An example of the most commonly used SLO period (30d) that Sloth has by default would be declared like this:

    apiVersion: "sloth.slok.dev/v1"
    kind: "AlertWindows"
    spec:
      sloPeriod: 30d
      page:
        quick:
          errorBudgetPercent: 2
          shortWindow: 5m
          longWindow: 1h
        slow:
          errorBudgetPercent: 5
          shortWindow: 30m
          longWindow: 6h
      ticket:
        quick:
          errorBudgetPercent: 10
          shortWindow: 2h
          longWindow: 1d
        slow:
          errorBudgetPercent: 10
          shortWindow: 6h
          longWindow: 3d
    
    

    By default, Sloth continues supporting 28d and 30d SLO periods.

    Also added --slo-period-windows-path flag to load custom SLO period windows from a directory.

    BREAKING

    • --window-days has been renamed to --default-slo-period.
    • Removed -w short flag for --window-days.
  • Other way to write SLO than ErrorQuery and TotalQuery

    Hi, Xabier

    I have the Prometheus plugin installed in my Jenkins, which uses DropWizard metrics. The plugin exports the metric jenkins_job_total_duration, which already includes a quantile label. E.g. jenkins_job_total_duration{quantile="0.999"} = 300ms, meaning that 99.9% of Jenkins jobs have a duration of 300ms or less. However, there's no way to know how many jobs have that 300ms duration. It's not like the bucket implementation of Prometheus, where we apply the histogram_quantile function and the le label to know the duration and how many jobs.

    In this case, I can't write the SLO "99.9% of jobs should have a duration of less than 300ms" in terms of ErrorQuery and TotalQuery, because milliseconds can't be divided by jobs; they're not in the same unit.

    So my question is: is there another way to write the SLO in this case? Does Sloth support another SLO declaration without ErrorQuery and TotalQuery? Something like the RED framework. I would love to know your opinion on defining an SLO for this case.

    I love your work. Thanks, Xabier.

  • Add extra labels to prometheusRule object

    Why?

    In our case, PrometheusRule requires labels to be picked up.

    Disclaimer

    I think I did it right? It works (tested on my clusters), but Golang development is a hobby for me xD; I'm a system administrator. Let me know if you want me to change something.

  • Make OptimizedSLIRecordGenerator optional

    In https://github.com/slok/sloth/blob/main/internal/prometheus/recording_rules.go#L53 the SLI for the SLO period is always calculated using an optimized calculation method. When there isn't uniform load on a service, the result of the optimized method can differ quite a lot from a calculation using errors divided by total. An example can be seen below, where traffic (green) is very uneven and the optimized calculation (yellow) underestimates the error rate many times over compared to the regular (blue) calculation.

    [graph omitted]

    This comes from the optimized calculation assuming that each 5m slice is equally important for the overall SLO. You can even see that while the blue line stays static in periods with no traffic, the yellow error rate slowly decreases.

    Granted, there isn't any broad consensus on how SLOs should be calculated; it is a topic of passionate debate. One example discussing the fairness of using ratio-based SLOs can be found in https://grafana.com/blog/2019/11/27/kubecon-recap-how-to-include-latency-in-slo-based-alerting/

    “ISPs love to tell you they have 99.9% uptime. What does that mean? This implies it’s time-based, but all night I’m asleep, I am not using my internet connection. And even if there’s a downtime, I won’t notice and my ISP will tell me they were up all the time. Not using the service is like free uptime for the ISP. Then I have a 10-minute, super important video conference, and my internet connection goes down. That’s for me like a full outage. And they say, yeah, 10 minutes per month, that’s three nines, it’s fine.”

    A better alternative: a request-based SLA that says, “During each month we’ll serve 99% of requests successfully.”

    Would there be any interest in making the OptimizedSLIRecordGenerator optional? With some input on how a user could control this, I'd be happy to try to create a pull request.

  • Helm: fix typo from extra-lables to extra-labels

    This is a very simple typo fix in the helm chart: the string extra-lables -> extra-labels.

    Updated the Chart version to 0.4.1 manually, but in case this is not needed please let me know and I will revert it.

  • sloth generate directory

    I'm trying to create a GitHub Action to generate the rules but I get an error from my generate command.

    The job part of my GitHub Action config:

      generate-slo-job-1:
        name: Generate the SLOs
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - name: download and setup generator binary
            run: |       
              wget https://github.com/slok/sloth/releases/download/v0.11.0/sloth-linux-amd64
              chmod +x sloth-linux-amd64
              ./sloth-linux-amd64 generate -i ./configuration/sloth/rules/ -o ./configuration/sloth/rules/_gen/
          - uses: EndBug/add-and-commit@v9
    

    error message: error: "generate" command failed: invalid spec, could not load with any of the supported spec types

    What would be the correct command?

  • Issues when generating SLOs with custom period window

    Issue

    When specifying a custom SLO period window, the sloth CLI returns some technical errors instead of the expected behaviour. Even when following the guide on the website, it doesn't work properly. In particular, without alerting enabled, the page & ticket windows are still required in the AlertWindows.

    Expected behaviour

    It should be easy to generate SLOs with custom period windows (e.g. 7d or 14d)

    Steps to reproduce

    Scenario 1: Just specifying the default-slo-period

    sloth generate --default-slo-period="14d" -i ./examples/getting-started.yml
    ...
    error: "generate" command failed: invalid default slo period: window period 336h0m0s missing%
    

    Very technical error

    Scenario 2: Specifying the default-slo-period and slo-period-windows-path

    sloth generate --default-slo-period="7d" --slo-period-windows-path=./examples/windows -i ./examples/getting-started.yml
    

    Works

    Scenario 3: Specifying the default-slo-period and slo-period-windows-path without alerting/paging

    Change 7d.yaml to not include alerting/paging

    apiVersion: sloth.slok.dev/v1
    kind: AlertWindows
    spec:
      sloPeriod: 7d
    
    sloth generate --default-slo-period="7d" --slo-period-windows-path=./examples/windows -i ./examples/no-alerts.yml
    (error: "generate" command failed: could not load SLO period windows repository: could not initialize custom windows: could not discover period windows: could not load "7d.yaml" alert windows: invalid alerting window: invalid page quick: long window is required%
    

    It seems like the page & ticket section in the AlertWindows is needed even though it is not used in the SLO spec.

    Ideas

    Improve error messages

    • Throw a proper error message when custom window can't be found
    • Throw a proper error message when the page/ticket section in the AlertWindows is missing (make it required)

    Don't throw an error if alerting is not needed

    • When the alerting is disabled in the SLO spec, it should not be required in the AlertWindows section
  • Metrics on the Sloth operator itself

    Hi folks.

    We were wondering if we can monitor and introduce alerts if something goes wrong with the sloth operator itself. We deployed it in Kubernetes, and with the PodMonitor we can fetch the metrics exposed on :8081/metrics. Some seem to be related to the /metrics interface, some to the kooper controller, to Go, etc.

    Long story short, we want to be alerted when Sloth could not expand the spec to a Prometheus rule. Is there a metric for this that I'm missing?

    Thank you.

  • build(deps): bump github.com/prometheus/prometheus from 0.40.3 to 0.41.0

    Bumps github.com/prometheus/prometheus from 0.40.3 to 0.41.0.

    Commits
    • c0d8a56 Merge pull request #11744 from roidelapluie/finalrelease
    • d7937d4 Release 2.41.0
    • 1bf03eb Merge pull request #11720 from roidelapluie/release-2-41-0-rc-0
    • 75af653 Release v2.41.0-rc.0
    • 8aae683 Update docker dependency
    • 4f35683 Merge pull request #11727 from prometheus/fix-error-unwrapping
    • 1a2c645 Correctly handle error unwrapping in rules and remote write receiver
    • 88ee72d Merge pull request #11712 from roidelapluie/update-deps-for-41
    • 9e26adf Add myself as release shepherd (#11693)
    • c396c3e Update go dependencies before 2.41
    • Additional commits viewable in compare view

  • Pull latest image from docker hub?

    Hi Team, @slok

    Is there a way we can pull the latest image from the Docker Hub registry rather than from ghcr? The last updated tag was a year ago:

    https://hub.docker.com/r/slok/sloth

  • Best strategy to manage > 400 SLOs

    Hi there! We will have to create a lot of SLOs (loading time, error rates for > 200 endpoints, etc...). Looking at what Sloth generated as rules - examples:

    - record: slo:sli_error:ratio_rate5m
        expr: |
          (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
          /
          (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
    
    - record: slo:sli_error:ratio_rate30d
        expr: |
          sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
          / ignoring (sloth_window)
          count_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
    

    I have a question regarding how to manage hundreds of records when we need to specify the service name and handler properties in each record. Is it correct, and is it worth it, to create a single record without specifying the job/service or any other property, like the following ones:

    - record: slo:sli_error:ratio_rate5m
        expr: |
          (sum(rate(http_request_duration_seconds_count{code=~"(5..|429)"}[5m])))
          /
          (sum(rate(http_request_duration_seconds_count{}[5m])))
    
    - record: slo:sli_error:ratio_rate30d
        expr: |
          sum_over_time(slo:sli_error:ratio_rate5m{}[30d])
          / ignoring (sloth_window)
          count_over_time(slo:sli_error:ratio_rate5m{}[30d])
    

    Thanks a lot for your help!
