🚨 Collection of Prometheus alerting rules

👋 Awesome Prometheus Alerts

Most alerting rules are common to every Prometheus setup. We need a place to find them all. 🤘 🚨 📊

Collection available here: https://awesome-prometheus-alerts.grep.to

✨ Contents

🚨 Rules

Basic resource monitoring

Databases and brokers

Reverse proxies and load balancers

Runtimes

Orchestrators

Network, security and storage

Other

🤝 Contributing

Contributions from the community (you!) are most welcome!

There are many ways to contribute: writing code, alerting rules, or documentation; reporting issues; discussing better error tracking...

Instructions here

πŸ‹οΈ Improvements

  • Create an alert rule builder in Jekyll for custom alerts (severity, thresholds, instances...)
  • Add resolution suggestions to rule descriptions, for faster incident resolution (#85).

💫 Show your support

Give a ⭐️ if this project helped you!


πŸ‘ Thanks

Thanks to the GitLab operations team, who provided 50+ rules. \o/

πŸ“ License


Licensed under the Creative Commons 4.0 License; see the LICENSE file for more detail.

Comments
  • Add ProveMonitoring rule

    A rule to prove to the team that the monitoring is working end to end. The alert is triggered every Monday morning for five minutes, which should be enough time for everyone to see the message appear across Slack / email / pages / etc.
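
    A minimal sketch of such a heartbeat rule. The rule name, severity, and the 08:00 UTC Monday window are illustrative assumptions, not from the original request:

    ```yaml
    - alert: ProveMonitoring
      # Fires every Monday between 08:00 and 08:05 UTC to exercise the full
      # alerting pipeline (Prometheus -> Alertmanager -> Slack/email/pages).
      # day_of_week() returns 0 for Sunday, so 1 is Monday.
      expr: (day_of_week() == 1) and (hour() == 8) and (minute() < 5)
      for: 0m
      labels:
        severity: info
      annotations:
        summary: Weekly end-to-end monitoring check
        description: "If you can read this, alert delivery is working."
    ```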

  • Context switch rate rule always alerts

    Hello,

    Thanks for your prometheus alerts which are, indeed, well and truly awesome! They've been very helpful in migrating a project away from icinga2 towards prometheus.

    There's one alert here that I find will always trigger: The node alert named ContextSwitching. If I plot the graph for the query, I find that idle servers will generally have around 2100 context switches, while a moderately busy one will have 50.5k.

    These are generally multi-core processors, does that factor into it at all? Whatever the case, I think there is an issue with the PromQL expression that you might want to be made aware of. Thanks again.
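
    One way to make the alert core-count-aware would be to normalize the rate by the number of CPUs on the host. This is a hedged sketch, not the project's published rule, and the 10k-per-core threshold is a guess that would need tuning per fleet:

    ```yaml
    - alert: HostContextSwitchingHigh
      # Context switches per second, divided by the host's CPU count, so
      # multi-core machines are judged against the same per-core threshold.
      expr: >
        rate(node_context_switches_total[5m])
          / on(instance) group_left
            count by (instance) (node_cpu_seconds_total{mode="idle"})
        > 10000
      for: 5m
      labels:
        severity: warning
    ```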

  • Execution Read and Write metric using mongodb-exporter for prometheus

    I have the query below for MongoDB execution time in Prometheus, but it does not seem to give the correct value. Can someone help here?

    ((avg(rate(mongodb_mongod_op_latencies_latency_total{type="write",mongo_cluster=~"$cluster_name"}[5m])) by (instance)) * on (instance) group_right mongodb_mongod_replset_my_name) * on (instance,name,service) group_right mongodb_mongod_replset_member_health{state=~"$type",set=~"$shardName",name=~"$instance"}

  • Rule KubernetesOutOfCapacity gives error

    Hi All,

    The following rule gives this error:

    many-to-many matching not allowed: matching labels must be unique on one side
    

    The code that has been published on the website:

      - alert: KubernetesOutOfCapacity
        expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(pod, namespace) group_left(node) (0 * kube_pod_info)) / sum(kube_node_status_allocatable_pods) by (node) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes out of capacity (instance {{ $labels.instance }})
          description: "{{ $labels.node }} is out of capacity\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    

    Any idea how to fix this?
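
    The many-to-many error usually means kube_pod_info has duplicate series per (pod, namespace), for example from two kube-state-metrics replicas scraped under different instance labels. One possible workaround (an untested sketch, not an official fix) is to deduplicate the right-hand side before the join:

    ```yaml
    - alert: KubernetesOutOfCapacity
      expr: >
        sum by (node) (
          (kube_pod_status_phase{phase="Running"} == 1)
          + on(pod, namespace) group_left(node)
            (0 * max by (pod, namespace, node) (kube_pod_info))
        )
        / sum by (node) (kube_node_status_allocatable_pods) * 100 > 90
      for: 2m
      labels:
        severity: warning
    ```

    The max by (pod, namespace, node) collapses duplicate kube_pod_info series into one per pod, which restores the many-to-one matching that group_left expects.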

  • KubernetesJobCompletion will always fire while job is running

    I could be misunderstanding the purpose of this, but I'm seeing some weird behavior with this rule. The rule is currently defined as:

      - alert: KubernetesJobCompletion
        expr: kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes job completion (instance {{ $labels.instance }})"
          description: "Kubernetes Job failed to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    

    kube_job_spec_completions will always be some number 1 or higher. It is the number of completions configured for the job. It isn't the number of completed pods for the job; that should be kube_job_complete. While the job is running the result of the first part of this expression will always be true (1 - 0 > 0). So if the job doesn't finish within 5 minutes, this alert will always fire and then get resolved later when the job does succeed.

    I think this should be changed to kube_job_complete, right? Otherwise the for: duration should be adjusted to be as long as a job could take to succeed (not very accurate or flexible, I think).
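
    Assuming the intent is to page only on genuinely failed jobs rather than long-running ones, one possible simplification (a sketch, not the project's published fix) is to alert on failed pods directly:

    ```yaml
    - alert: KubernetesJobFailed
      # kube_job_status_failed counts pods that reached the Failed phase for
      # the job, so this fires only on real failures, never on a job that is
      # simply still running toward its configured completions.
      expr: kube_job_status_failed > 0
      for: 0m
      labels:
        severity: warning
    ```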

  • KubernetesPodNotHealthy expr problem

    - alert: KubernetesPodNotHealthy
      expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h]) > 0
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Kubernetes Pod not healthy (instance {{ $labels.instance }})"
        description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    

    I want to use this, but the "expr" doesn't seem right. I get an error like:

    Error executing query: invalid parameter 'query': 1:107: parse error: ranges only allowed for vector selectors
    

    If I use min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]), the result is OK.
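
    Putting that fix into the full rule, the expression becomes a subquery via [1h:] (a sketch of the corrected rule, not the published version; min_over_time needs a subquery because its argument is an expression rather than a plain vector selector):

    ```yaml
    - alert: KubernetesPodNotHealthy
      expr: >
        min_over_time(
          sum by (namespace, pod, env, stage)
            (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]
        ) > 0
      for: 5m
      labels:
        severity: error
    ```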

  • MDRaid Alert

    It would be handy if we could add an MDRaid alert for md RAID array degradation. Here's what I have:

    - alert: MDRaidDegrade
      expr: (node_md_disk - node_md_disk_active) != 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID."
        description: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID {{ $labels.device }}. VALUE - {{ $value }}."
    
  • CHANGE syntax Istio alert

    Hi,

    This alert is not correct, because it is necessary to combine rate with sum.

    (A screenshot from 2022-07-05 showing the correct use of quantiles was attached here.)

    I can create PR and easily fix that if you want.

  • Alert KubernetesPodNotHealthy reporting incorrect alerts

    The way the following alert works (from my understanding) is that any Pod in the "Pending|Unknown|Failed" state for longer than the default resolution within the last hour will trigger the alert. At least that's how the alert is firing for me. The alert description says something else: the pod should be down for longer than an hour.

      - alert: KubernetesPodNotHealthy
        expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
          description: Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
    

    I'm no expert on PromQL, but maybe the range/resolution has to be changed, like this: [1h:1h]?
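
    One hedged way to stop the rule firing before a full hour of samples exists: min_over_time only considers samples that are present, so a pod that has just entered Pending can already satisfy the condition. Requiring that the series has been present for (roughly) the whole window closes that gap. The explicit 1m resolution and the 60-sample floor below are illustrative assumptions, not a tested fix:

    ```yaml
    - alert: KubernetesPodNotHealthy
      expr: >
        min_over_time(sum by (namespace, pod)
          (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:1m]) > 0
        and
        count_over_time(sum by (namespace, pod)
          (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:1m]) >= 60
      for: 0m
      labels:
        severity: critical
    ```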

  • mongotop — json equivalent metrics in prometheus using mongodb-exporter

    I am running mongodb-exporter and pushing data to Prometheus. By default the mongotop metric is disabled in mongodb-exporter; I enabled it using the --collect.topmetrics flag, but that gives fewer metrics, and not the ones I need. When I run mongotop --json on a Mongo node, I get a couple of metrics that are pretty useful to me (read, write, total time and so on, per DB, collection, etc.). Is there any flag in mongodb-exporter that lets me get the mongotop --json equivalent metrics in Prometheus? (https://github.com/percona/mongodb_exporter is the exporter.)

  • rabbitmq rules don't work

    Hello, do you maybe have updated RabbitMQ rules? Most of these are not working: the metrics specified in the expressions do not exist, so I have made similar (kinda) ones. Any feedback appreciated, and thanks.

  • Add under-utilized HPA alert

    This alert should inform when HPAs are scaled more than half the time at their minReplicas, which is an indication of possible cost savings. In addition, it is assumed that a minimum number of replicas should still be running for redundancy.
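
    A rough sketch of such a rule using kube-state-metrics HPA metrics. The one-day window, the use of the median ("scaled at minReplicas more than half the time" = median equals minReplicas), and the redundancy floor of 3 replicas are all assumptions made to illustrate the idea:

    ```yaml
    - alert: KubernetesHpaUnderUtilized
      # If the median replica count over a day equals minReplicas, the HPA sat
      # at its floor at least half the time, hinting at possible cost savings.
      expr: >
        quantile_over_time(0.5,
          kube_horizontalpodautoscaler_status_current_replicas[1d])
        == on(horizontalpodautoscaler, namespace)
          kube_horizontalpodautoscaler_spec_min_replicas
        and on(horizontalpodautoscaler, namespace)
          kube_horizontalpodautoscaler_spec_min_replicas > 3
      for: 0m
      labels:
        severity: info
    ```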

  • ContainerVolumeUsage always alarms

    Hi, trying to understand the rule for ContainerVolumeUsage:

    This is always zero, which causes the calculation to always be 100:

      sum by (instance) (container_fs_inodes_free{name!=""}) = 0
      sum by (instance) (container_fs_inodes_total) = 95469464
      1 - (sum by (instance) (container_fs_inodes_free{name!=""}) / sum by (instance) (container_fs_inodes_total)) = 1
      (1 - (sum by (instance) (container_fs_inodes_free{name!=""}) / sum by (instance) (container_fs_inodes_total))) * 100 > 80 = 100
      (1 - 0 / 95469464) * 100 = 100

    At least in my environment (Docker) it seems not to work, because container_fs_inodes_free is reported as 0 for every container. I suspect the alarm is fine and something is off with cAdvisor, or it is working as designed? Looking for feedback; disabling it for now.
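
    One thing worth trying is applying the same name!="" filter to both sides, so the ratio compares like with like; as written, the free side is filtered while the total side is not. This is an untested sketch, not a confirmed fix for the cAdvisor behavior above:

    ```yaml
    - alert: ContainerVolumeUsage
      expr: >
        (1 - (sum by (instance) (container_fs_inodes_free{name!=""})
              / sum by (instance) (container_fs_inodes_total{name!=""})))
        * 100 > 80
      for: 2m
      labels:
        severity: warning
    ```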

  • Add runbook for MongoDBVirtualMemoryUsage

    Hello,

    first of all, thanks for this collection of alerts! I've tested the set of Percona MongoDB alerts in my current setup and could see that "MongoDBVirtualMemoryUsage" is always firing. This means that virtual memory is taking up more than three times the space of the resident memory -> high memory usage.

    Do you have any idea how to handle this? Currently it is super noisy for me.

  • Some postgres alerts are not related to postgres-exporter

    Hello,

    We wanted to add this alert rule:

    Postgresql high rate statement timeout

    https://awesome-prometheus-alerts.grep.to/rules#rule-postgresql-1-12

    So we installed prometheus-community/postgres_exporter (as mentioned in the title of the section).

    But we quickly realized that the metric postgresql_errors_total (used in this rule) was not available with postgres_exporter.

    After some research we found this pdf: https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2166/slides/147/monitoring.pdf

    Which presents prometheus monitoring of postgres with postgres-exporter AND using postgres logs via mtail. We can see that postgresql_errors_total is created with mtail (the mtail config file is somewhere in the middle of the pdf, search for postgresql_errors_total).

    So the following alerts do not require postgres-exporter, but rather mtail with a custom configuration:

    • Postgresql high rate statement timeout
    • Postgresql high rate deadlock (both relying on postgresql_errors_total)

    This should be mentioned next to these rules, or the rules should be moved to a subsection.

    I looked up the original commit adding these rules: https://github.com/samber/awesome-prometheus-alerts/commit/0b89a764eed0e65863cad503a47a7a7695563f0c

    It seems like it was added at the same time as the other postgres rules (which rely on postgres-exporter).

    Thanks

  • Add ZFS metrics

    The first and most useful alert would be node_zfs_zpool_state{state!="online"} > 0, I suppose.

    Note that I created an upstream issue for adding some more metrics: https://github.com/prometheus/node_exporter/issues/2423
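
    Wrapped into a full rule, that suggestion might look like the following sketch; the rule name, for duration, and severity are assumptions in the style of the collection's other rules:

    ```yaml
    - alert: ZfsPoolUnhealthy
      # node_zfs_zpool_state exposes one series per (zpool, state) pair with
      # value 1 for the pool's current state; anything but "online" is bad
      # (degraded, faulted, offline, unavail, removed, suspended).
      expr: node_zfs_zpool_state{state!="online"} > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: ZFS pool {{ $labels.zpool }} is {{ $labels.state }} (instance {{ $labels.instance }})
    ```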
