Prometheus Remote Write Compliance Test

This repo contains a set of tests to check compliance with the Prometheus Remote Write specification.

The test suite works by forking an instance of the sender with some config to scrape the test suite itself and send remote write requests back to it for a fixed period of time. The test suite then examines the received requests for compliance.
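
A minimal sketch of this mechanism, assuming a hypothetical ./sender binary and config file (the real suite generates a per-target config; see the targets directory):

    package main

    import (
        "context"
        "fmt"
        "io"
        "net/http"
        "os/exec"
        "sync"
        "time"
    )

    func main() {
        var (
            mu   sync.Mutex
            reqs [][]byte // raw (snappy-compressed protobuf) write request bodies
        )
        mux := http.NewServeMux()
        mux.HandleFunc("/api/v1/write", func(w http.ResponseWriter, r *http.Request) {
            body, _ := io.ReadAll(r.Body)
            mu.Lock()
            reqs = append(reqs, body)
            mu.Unlock()
        })
        go http.ListenAndServe(":8080", mux)

        // Fork the sender with a config that scrapes this process and remote
        // writes back to it, and let it run for a fixed period.
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        exec.CommandContext(ctx, "./sender", "--config.file=config.yml").Run()

        mu.Lock()
        defer mu.Unlock()
        fmt.Printf("received %d write requests; examine them for compliance\n", len(reqs))
    }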

Running the test

The test is a vanilla Golang unit test, and can be run as such. To run all the tests:

$ go test -v ./

To run all the tests for a single target:

$ go test -run "TestRemoteWrite/prometheus/.+" -v ./

To run a single test across all the targets:

$ go test -run "TestRemoteWrite/.+/Counter" -v ./
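
The two -run patterns compose; for example, to run only the Counter test against only the Prometheus target:

$ go test -run "TestRemoteWrite/prometheus/Counter" -v ./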

Remote Write Senders

The repo contains tests for a number of remote write senders; see the targets directory for the current list.

If you want to add another sender, see the examples in the targets directory and recreate that pattern in a PR.

Owner
Tom Wilkie
@grafana VP Product, @prometheus & @cortexproject maintainer. Previously @kausalco, @weaveworks, @google, @acunu
Comments
  • promql+alert_generator: Dockerized the toolset, updated docs.

    Additionally updated docs.

    The motivation for this is that we can now set up automated tests, and in the future we might want to provide Docker images with an exact set of test cases in a versioned image.

    Signed-off-by: Bartlomiej Plotka [email protected]

  • Epic: Alert Generator Compliance Test Suite

    Based on the specification, here is the list of all the high-level cases that need to be covered by the test suite. In all the cases, the contents of the alerts, APIs, and time series are checked for correctness.

    • [x] Presence of all the template variables and functions as described in the specification (across all the rules, not all in a single rule); an example rule exercising a few of these appears at the end of this comment.
      • Data
        • [x] $labels.something .Labels.something
        • [x] $value .Value
      • Queries
        • [x] query
        • [x] first
        • [x] label
        • [x] value
        • [x] sortByLabel
      • Numbers
        • [x] humanize
        • [x] humanize1024
        • [x] humanizeDuration
        • [x] humanizePercentage
        • [x] humanizeTimestamp
      • Strings
        • [x] title
        • [x] toUpper
        • [x] toLower
        • [x] stripPort
        • [x] match
        • [x] reReplaceAll
        • [x] parseDuration
      • Others
        • [x] args
      • Undocumented and/or not needed:
        • strvalue (undocumented, not needed)
        • pathPrefix (not needed, only in consoles, and also undocumented)
        • .ExternalLabels $externalLabels (not needed)
        • .ExternalURL $externalURL (not needed)
        • graphLink (not needed)
        • tableLink (not needed)
        • tmpl (not needed, only in consoles)
        • safeHtml (not needed, only in consoles)
    • [x] Alert that goes from pending->firing->inactive.
    • [x] Alert that goes from pending->inactive.
    • [x] Rule that never becomes active (i.e. never has alerts in pending or firing).
    • [x] pending alerts having changing annotation values (checked via API)
    • [x] firing and inactive alerts being sent when they first went into those states.
    • [x] firing alert being re-sent at expected intervals when the alert is active with changing annotation contents.
    • [x] inactive alert being re-sent at expected intervals up to a certain time and not after that.
    • [x] Alert that goes directly to firing state (skipping the pending state) because of zero for duration.
    • [x] Alert that becomes active after having fired already and gone into the inactive state, for both the cases where the for duration is zero and non-zero. Here we should test two cases: one where the inactive alert was still being sent, hence we should stop sending it; and one where the inactive alert was no longer being sent.
    • [x] Rule that produces new alerts that go from pending->firing->inactive while already having active alerts.
    • [x] When the for duration is non-zero and less than the evaluation interval, firing alert must be sent after the second evaluation of the rule and not before.
    • [x] A rule group having rules which are dependent on the ALERTS series from the rules above it in the same group.
    • [x] Expansion of templates in annotations must only use the labels from the query result as source data, even if those labels get overridden by the rules; they do not use the rules' additional labels.
    • [x] Alert goes inactive when there is no more data, both when firing and when pending.

    All time comparisons will be done within a certain acceptable delta and need not be exact.
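
    For reference, a hypothetical alerting rule exercising a few of the variables and functions above (rule name, expression, and threshold invented for illustration):

    groups:
      - name: example
        rules:
          - alert: HighRequestLatency
            expr: job:request_latency_seconds:mean5m > 0.5
            for: 10m
            annotations:
              summary: '{{ $labels.job | title }}: high latency on {{ $labels.instance | stripPort }}'
              description: 'Mean latency is {{ humanize $value }}s.'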

  • [alert generator] question about `alertname` label

    The spec says the following:

    The alert name from the alerting rule (HighRequestLatency from the example above) MUST be added to the labels of the alert with the label name as alertname. It MUST override any existing alertname label.
    

    The statement above says that the rule's name should override any existing alertname label. Does it mean that in templates the $labels.alertname and .Labels.alertname values should behave in the same way? One of the test cases expects the template value to be equal to the existing alertname label: https://github.com/prometheus/compliance/blob/c7c726de89973d77cb491faa1b32cfddf7dcde8a/alert_generator/cases/case_new_alerts_and_order_check.go#L254 But this looks contradictory to what the spec says.
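
    To make the question concrete, here is a hypothetical rule whose query result already carries an alertname label (rule name and query invented for illustration):

    - alert: MyAlert
      # The ALERTS series carries alertname labels from other rules, so the
      # query result already has one before the rule-name override is applied.
      expr: ALERTS{alertstate="firing"} > 0
      annotations:
        # Per the spec, the output alert's alertname label MUST be "MyAlert";
        # the open question is whether this template sees "MyAlert" or the
        # original alertname from the query result.
        description: 'alertname is {{ $labels.alertname }}'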

  • Import the PromLabs PromQL compliance tester

    This is a pretty minimal import without too many changes, to get it moved before doing anything more.

    The first commit imports it completely unchanged, the second one just fixes minimal things to adjust to the new repo location.

  • Set offset for the query end time to -12 minutes

    When running the PromQL compliance test, we can sometimes observe the queries with negative offsets failing. The reason is that by default the query end time is now() - 2m, and we have queries with negative offsets of -5m and -10m. That means we are querying 3m into the future for the first query (offset -5m) and 8m into the future for the second (offset -10m). When running the compliance test, a race condition in data ingestion can sometimes be observed, where one storage system still hasn't ingested a sample that the other system already has. As a result, queries with negative offsets into the future can fail, since one resulting time series will have additional samples at the end (the ones that the other storage system hasn't ingested yet at that point in time). This issue is raised in #94.

    This PR offsets the query end time to -12m (previously it was -2m). With end = now() - 12m, even the query with offset -10m only needs samples up to now() - 2m, so we never query into the future and thus avoid the ingestion race condition.

  • Allow config file splitting, update test cases

    Sorry, this is in one commit because I didn't do a separate update of test cases in the old, non-split config file.

    You can now pass -config-file multiple times, causing the mentioned files to be concatenated before YAML parsing happens.
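
    For example, keeping the standard test queries separate from a per-target config (file names illustrative; the issue reports in this thread use the same pattern):

    ./promql-compliance-tester -config-file promql-test-queries.yml -config-file my-target.yml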

    Signed-off-by: Julius Volz [email protected]

  • Alert Generator Compliance Specification 1.0

    This PR adds a specification for alert-generator compliance which was open for public review at https://docs.google.com/document/d/1QyGA3c0Eys9rZRMEbSSXcd1C0yuph8wCQ6B3Dc90rk0/edit

  • Add test for retry behaviour: should retry 5xx, should not retry 4xx.

    --- FAIL: TestRemoteWrite (50.30s)
        --- PASS: TestRemoteWrite/grafana (0.00s)
            --- PASS: TestRemoteWrite/grafana/Retries400 (10.07s)
            --- PASS: TestRemoteWrite/grafana/Retries500 (10.08s)
        --- FAIL: TestRemoteWrite/otelcollector (0.01s)
            --- PASS: TestRemoteWrite/otelcollector/Retries400 (10.02s)
            --- FAIL: TestRemoteWrite/otelcollector/Retries500 (10.02s)
        --- PASS: TestRemoteWrite/prometheus (0.01s)
            --- PASS: TestRemoteWrite/prometheus/Retries400 (10.05s)
            --- PASS: TestRemoteWrite/prometheus/Retries500 (10.10s)
        --- FAIL: TestRemoteWrite/telegraf (0.01s)
            --- PASS: TestRemoteWrite/telegraf/Retries400 (10.02s)
            --- FAIL: TestRemoteWrite/telegraf/Retries500 (10.02s)
        --- FAIL: TestRemoteWrite/vector (0.01s)
            --- PASS: TestRemoteWrite/vector/Retries400 (10.03s)
            --- FAIL: TestRemoteWrite/vector/Retries500 (10.03s)
    

    Signed-off-by: Tom Wilkie [email protected]

  • [alert_generator] "mismatch in EndsAt" error question

    The alert_generator test suite checks the received alerts for the correctness of their properties. One of those checks compares whether the EndsAt param is within the time range between now (when the alert was received by alert_generator) and now+delta, where delta is usually 4*resendDelay; see https://github.com/prometheus/compliance/blob/main/alert_generator/cases/expected_alert.go#L80-L96

    However, the time when an alert is received isn't always the time when the alert was triggered. Since Prometheus aligns the time slots at which a rule is evaluated, the real time and the timestamp of the rule evaluation can differ; see https://github.com/prometheus/prometheus/blob/580e852f1028ecbcaa67836f2da5230ac7c35fd0/rules/manager.go#L411-L419

    Does this mean that alert_generator should calculate the expected EndsAt based on the alert's ActiveAt param instead of the time when the alert was actually received?
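
    A simplified sketch of the check being discussed (the real logic is in expected_alert.go; names here are illustrative, not the actual code):

    package cases

    import (
        "fmt"
        "time"
    )

    // checkEndsAt validates that an alert's EndsAt falls within
    // [received, received+4*resendDelay].
    func checkEndsAt(endsAt, received time.Time, resendDelay time.Duration) error {
        lo, hi := received, received.Add(4*resendDelay)
        if endsAt.Before(lo) || endsAt.After(hi) {
            return fmt.Errorf("mismatch in EndsAt: got %v, want within [%v, %v]", endsAt, lo, hi)
        }
        return nil
    }

    The question, then, is whether received here should instead be derived from the alert's ActiveAt.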

  • Don't require the up metric for non-up-metric tests.

    It's a bit harsh to fail agents for not providing the up metric in unrelated tests. So remove that check, and use other signals. Makes the NameLabel test pass for Telegraf and Otel.

    Signed-off-by: Tom Wilkie [email protected]

  • alert_generator: add `vmalert` config

    Add a testing configuration for the VictoriaMetrics vmalert component, which can be used as an alert generator. vmalert can be configured to use Prometheus as remote storage for querying alerts and writing back results.
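
    For example, a hypothetical invocation pointing vmalert at a Prometheus instance for both querying and writing back (URLs illustrative):

    ./vmalert -rule=alerts.yaml -datasource.url=http://localhost:9090 -remoteWrite.url=http://localhost:9090 -remoteRead.url=http://localhost:9090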

  • promql-compliance-tester fails on victoriametrics

    I've just tried to run the PromQL compatibility test with VictoriaMetrics 1.83.0 but ran into the following error.

    Tried Thanos and Prometheus as the reference config. Both fail.

    The result is also non-deterministic, as the run always fails at a different step. See:

    / # ./promql-compliance-tester -config-file vm.yaml -config-file ./promql-test-queries.yml -output-passing
    403 / 548 [--------------->_____] 73.54% 307 p/s FATA[0001] Error running comparison: expected reference API query "label_replace(demo_num_cpus, \"instance\", \"\", \"\", \"\")" to fail, but succeeded  source="main.go:137"

    / # ./promql-compliance-tester -config-file vm.yaml -config-file ./promql-test-queries.yml -output-passing
    385 / 548 [-------------->______] 70.26% 363 p/s FATA[0001] Error running comparison: expected reference API query "label_replace(demo_num_cpus, \"instance\", \"\", \"\", \"\")" to fail, but succeeded  source="main.go:137"

    / # ./promql-compliance-tester -config-file vm.yaml -config-file ./promql-test-queries.yml -output-passing
    445 / 548 [----------------->___] 81.20% 267 p/s FATA[0001] Error running comparison: expected reference API query "label_replace(demo_num_cpus, \"instance\", \"\", \"\", \"\")" to fail, but succeeded  source="main.go:137"

    / # ./promql-compliance-tester -config-file vm.yaml -config-file ./promql-test-queries.yml -output-passing
    431 / 548 [---------------->____] 78.65% 348 p/s FATA[0001] Error running comparison: expected reference API query "label_replace(demo_num_cpus, \"instance\", \"\", \"\", \"\")" to fail, but succeeded  source="main.go:137"
    
    
  • Conformance test queries with negative offset fail due to latency

    Hi

    PromQL conformance tests with negative offset fail because of latency issues.

    Latencies might differ between the storage for native Prometheus and a vendor's storage (due to persistence, geo-redundancy, etc.). Negative-offset queries try to fetch data for a future timestamp, and the availability of that data makes a difference in the results of the queries. Because of this, the compliance tool is not testing the correctness of the queries but the latency of the storages.

    Suppose a query with negative offset -5m is made at 10:32:00.000. According to the compliance tool, the timestamps for the queries are:

    end_time: time.Now() - 2m => 10:32:00.000 - 2m => 10:30:00.000
    start_time: end_time - 10m => 10:30:00.000 - 10m => 10:20:00.000

    The query needs values up to 10:35:00 (because of the -5m offset from the end time).

    At 10:32:00.000, due to latency, a value may not yet be available for timestamp 10:31:55.000 in the vendor storage, whereas it may be available in native Prometheus (this can happen vice-versa as well). Here the value returned by native Prometheus will be 101, whereas the vendor implementation returns 100 (the previous available value).

    I think we can overcome this by providing a "query_time_parameters" end_time value that is more than 10 minutes before the current timestamp. But I'd just like to know other thoughts on this issue and how to overcome it.
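
    That is, roughly (the field names are taken from the comment above; the exact schema and value format are defined by the tester, so treat this only as a sketch):

    query_time_parameters:
      end_time: "2021-11-03T10:20:00Z"  # hypothetical fixed end time, more than 10m in the past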

  • [alert_generator] Add docs on how to use the test suite

    Similar to the PromQL tests, I don't plan on building binaries and Docker images in the CI; rather, the docs will contain a bunch of Go CLI commands that can be run to use the test suite.

  • Prometheus is not fully compatible with OpenMetrics tests

    What did you do?

    We want to ensure OpenMetrics / Prometheus compatibility in the OpenTelemetry Collector. We have been building compatibility tests to verify the OpenMetrics spec is fully supported on both the OpenTelemetry Collector Prometheus receiver and PRW exporter as well as in Prometheus itself.

    We used the OpenMetrics metrics test data available at https://github.com/OpenObservability/OpenMetrics/tree/main/tests/testdata/parsers

    Out of a total of 161 negative tests in OpenMetrics, 94 tests pass (the bad exposition is dropped, with an 'up' value of 0); 67 tests are not dropped and have an 'up' value of 1, and 22 of those have incorrectly ingested metrics.

    In order to test Prometheus itself, we set up a metrics HTTP endpoint that exposes invalid/bad metrics from the OpenMetrics tests. We then configured Prometheus 2.31.0 to scrape the metrics endpoint.
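
    A minimal sketch of such an endpoint, serving the bad_counter_values_1 payload on the localhost:3000 target used in the configuration below:

    package main

    import (
        "log"
        "net/http"
    )

    func main() {
        // Serve one of the OpenMetrics "bad" test files verbatim so that
        // Prometheus can scrape it.
        http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "application/openmetrics-text; version=1.0.0; charset=utf-8")
            w.Write([]byte("# TYPE a counter\na_total -1\n# EOF\n"))
        })
        log.Fatal(http.ListenAndServe(":3000", nil))
    }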

    What did you expect to see?

    Expected result: the scrape should fail, since the target has invalid metrics, and the appropriate error should be reported.

    For example, with the following metric data from bad_counter_values_1 (https://raw.githubusercontent.com/OpenObservability/OpenMetrics/main/tests/testdata/parsers/bad_counter_values_1/metrics):

    # TYPE a counter
    a_total -1
    # EOF
    

    What did you see instead? Under which circumstances?

    Current behavior: Scrape is successful. There are multiple bad test cases that are scraped successfully by Prometheus.

    For example, using bad_counter_values_1 (#5 in the list below) does not show an error even though it has a negative counter value. According to the OpenMetrics tests, this metric should not be parsed.

    No error has been reported and the scrape is successful (screenshots of the Prometheus UI omitted).

    Similar to the bad_counter_values_1 test case, there are multiple bad test cases where the scrape is successful and the metrics are ingested by Prometheus:

    1. bad_missing_or_extra_commas_0
    2. bad_metadata_in_wrong_place_1
    3. bad_counter_values_18
    4. bad_grouping_or_ordering_9
    5. bad_counter_values_1
    6. bad_histograms_2
    7. bad_counter_values_16
    8. bad_value_1
    9. bad_missing_or_extra_commas_2
    10. bad_invalid_labels_6
    11. bad_grouping_or_ordering_8
    12. bad_metadata_in_wrong_place_0
    13. bad_grouping_or_ordering_10
    14. bad_grouping_or_ordering_0
    15. bad_value_2
    16. bad_metadata_in_wrong_place_2
    17. bad_text_after_eof_1
    18. bad_value_3
    19. bad_counter_values_0
    20. bad_grouping_or_ordering_3
    21. bad_histograms_3
    22. bad_blank_line

    Environment

    • System information:

    Darwin 20.6.0 x86_64

    • Prometheus version:

    version=2.31.0

    • Prometheus configuration file:
    global:
      scrape_interval: 5s
    
    scrape_configs:
      - job_name: "open-metrics-scrape"
        static_configs:
          - targets: ["localhost:3000"]
    
    

    cc: @PaurushGarg @mustafain117
