The Prometheus monitoring system and time series database.

Prometheus

Visit prometheus.io for the full documentation, examples and guides.

Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

The features that distinguish Prometheus from other metrics and monitoring systems are:

  • A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
  • PromQL, a powerful and flexible query language to leverage this dimensionality (see the example below)
  • No dependency on distributed storage; single server nodes are autonomous
  • An HTTP pull model for time series collection
  • Pushing time series is supported via an intermediary gateway for batch jobs
  • Targets are discovered via service discovery or static configuration
  • Multiple modes of graphing and dashboarding support
  • Support for hierarchical and horizontal federation
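
As an illustrative sketch (the metric and label names here are hypothetical examples, not taken from this README), the data model and PromQL combine like this:

sum by (job, instance) (rate(http_requests_total{status="500"}[5m]))

This aggregates the per-second rate of HTTP 500 responses over the last five minutes, grouped by the job and instance labels, using nothing but the metric name and its key/value dimensions.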

Architecture overview

Install

There are various ways of installing Prometheus.

Precompiled binaries

Precompiled binaries for released versions are available in the download section on prometheus.io. Using the latest production release binary is the recommended way of installing Prometheus. See the Installing chapter in the documentation for all the details.

Docker images

Docker images are available on Quay.io or Docker Hub.

You can launch a Prometheus container for trying it out with

$ docker run --name prometheus -d -p 127.0.0.1:9090:9090 prom/prometheus

Prometheus will now be reachable at http://localhost:9090/.
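
To run with your own configuration instead of the bundled default, you can mount it into the container (a sketch assuming a prometheus.yml in your current directory; the official image reads its configuration from /etc/prometheus/prometheus.yml):

$ docker run --name prometheus -d -p 127.0.0.1:9090:9090 \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus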

Building from source

To build Prometheus from source code, first ensure that you have a working Go environment with version 1.14 or greater installed. You also need Node.js and Yarn installed in order to build the frontend assets.

You can directly use the go tool to download and install the prometheus and promtool binaries into your GOPATH:

$ go get github.com/prometheus/prometheus/cmd/...
$ prometheus --config.file=your_config.yml

However, when using go get to build Prometheus, Prometheus will expect to be able to read its web assets from local filesystem directories under web/ui/static and web/ui/templates. In order for these assets to be found, you will have to run Prometheus from the root of the cloned repository. Note also that these directories do not include the new experimental React UI unless it has been built explicitly using make assets or make build.

An example of the above configuration file can be found here.
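
For orientation, a minimal configuration along these lines (an illustrative sketch, not the linked example) makes Prometheus scrape its own metrics endpoint:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']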

You can also clone the repository yourself and build using make build, which will compile in the web assets so that Prometheus can be run from anywhere:

$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/prometheus.git
$ cd prometheus
$ make build
$ ./prometheus --config.file=your_config.yml
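
Once built, you can sanity-check a configuration file with the bundled promtool before starting the server (a usage sketch; your_config.yml is the same placeholder as above):

$ ./promtool check config your_config.yml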

The Makefile provides several targets:

  • build: build the prometheus and promtool binaries (includes building and compiling in web assets)
  • test: run the tests
  • test-short: run the short tests
  • format: format the source code
  • vet: check the source code for common errors
  • docker: build a docker container for the current HEAD
  • assets: build the new experimental React UI

React UI Development

For more information on building, running, and developing on the new React-based UI, see the React app's README.md.

More information

Contributing

Refer to CONTRIBUTING.md

License

Apache License 2.0, see LICENSE.

Comments
  • TSDB data import tool for OpenMetrics format.

    Created a tool to import data formatted according to the Prometheus exposition format. The tool can be accessed via the TSDB CLI.

    closes prometheus/prometheus#535

    Signed-off-by: Dipack P Panjabi [email protected]

    (Port of https://github.com/prometheus/tsdb/pull/671)

  • Add mechanism to perform bulk imports

    Currently the only way to bulk-import data is a hacky one involving client-side timestamps and scrapes with multiple samples per time series. We should offer an API for bulk import. This relies on https://github.com/prometheus/prometheus/issues/481.

    EDIT: It probably won't be a web-based API in Prometheus, but a command-line tool.

  • Create a section ANNOTATIONS with user-defined payload and generalize RUNBOOK, DESCRIPTION, SUMMARY into fields therein.

    RUNBOOK was added in a hurry in #843 for an internal demo of one of our users, which didn't give it enough time to be fully discussed. The demo has been done, so we can reconsider this.

    I think we should revert this change, and remove RUNBOOK:

    • Our general policy is that if it can be done with labels, do it with labels
    • All notification methods in the alertmanager will need extra code to deal with this
    • In future, all alertmanager notification templates will need extra code to deal with this
    • In general, all user code touching the alertmanager will need extra code to deal with this
    • This presumes a certain workflow in that you have something called a "runbook" (and not any other name - playbook is also common) and that you have exactly one of them

    Runbooks are not a fundamental aspect of an alert, are not in use by all of our users and thus I don't believe they meet the bar for first-class support within prometheus. This is especially true considering that they don't add anything that isn't already possible with labels.

  • Implement strategies to limit memory usage.

    Currently, Prometheus simply limits the chunks in memory to a fixed number.

    However, this number doesn't directly imply the total memory usage as many other things take memory as well.

    Prometheus could measure its own memory consumption and (optionally) evict chunks early if it needs too much memory.

    It's non-trivial to measure "actual" memory consumption in a platform independent way.

  • '@ <timestamp>' modifier

    This PR implements @ <timestamp> modifier as per this design doc.

    An example query:

    rate(process_cpu_seconds_total[1m]) 
      and
    topk(7, rate(process_cpu_seconds_total[1h] @ 1234))
    

    which ranks series by their 1h rate evaluated at Unix timestamp 1234, but actually plots the 1m rate.

    Closes #7903

    This PR is to be followed up with an easier way to represent the start, end, range of a query in PromQL so that we could do @ <end>, metric[<range>] easily.

  • Port isolation from old TSDB PR

    The original PR was https://github.com/prometheus/tsdb/pull/306 .

    I tried to carefully adjust to the new world order, but please give this a very careful review, especially around iterator reuse (marked with a TODO).

    On the bright side, I definitely found and fixed a bug in txRing.

  • 2.3.0 significant memory usage increase.

    Bug Report

    What did you do? Upgraded to 2.3.0

    What did you expect to see? General improvements.

    What did you see instead? Under which circumstances? Memory usage, possibly driven by queries, has increased considerably. The upgrade was at 09:27; the drops in memory usage on the graph after that point are from container restarts due to OOM.

    container_memory_usage_bytes

    Environment

    Prometheus in kubernetes 1.9

    • System information: Standard Docker containers, run by a Docker-based kubelet on Linux.

    • Prometheus version: 2.3.0

  • Support for environment variable substitution in configuration file

    I think it would be a good idea to substitute environment variables in the configuration file.

    That could be done quite easily by calling os.ExpandEnv on the configuration string when it is loaded.

    It would be nicer to substitute environment variables only in configuration values; go-ini provides a valueMapper for that, but yaml.v2 doesn't have such a mechanism.
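
    A minimal sketch of the whole-string approach (not Prometheus code; the config snippet and the REMOTE_WRITE_URL variable are hypothetical):

    // Expand environment variables in a raw YAML configuration string
    // before handing it to the YAML parser.
    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        raw := "remote_write:\n  - url: ${REMOTE_WRITE_URL}\n"
        // os.ExpandEnv replaces ${VAR} and $VAR references with values from
        // the process environment; unset variables expand to empty strings.
        fmt.Print(os.ExpandEnv(raw))
    }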

  • React UI: Implement more sophisticated autocomplete

    It would be great to have more sophisticated expression field autocompletion in the new React UI.

    Currently it only autocompletes metric names, and only when the expression field doesn't contain any other sub-expressions yet.

    Things that would be nice to autocomplete:

    • metric names anywhere within an expression
    • label names
    • label values
    • function names
    • etc.

    For autocomplete functionality not to annoy users, it needs to be as performant, correct, and unobtrusive as possible. Grafana does many things right here already, but they also have a few really annoying bugs, like inserting closing parentheses at incorrect locations in an expression.

    Currently @slrtbtfs has indicated interest in building a language-server-based autocomplete implementation.

  • Benchmark tsdb master

    DO NOT MERGE

    Benchmark 1

    Benchmark the following PRs against 2.11.1

    1. For queries: https://github.com/prometheus/tsdb/pull/642
    2. For compaction: https://github.com/prometheus/tsdb/pull/643 https://github.com/prometheus/tsdb/pull/654 https://github.com/prometheus/tsdb/pull/653
    3. Opening block: https://github.com/prometheus/tsdb/pull/645

    Results

    Did not test compaction from on-disk blocks. Could not really see the allocation optimizations in compaction; that might be because the savings are mostly in the number of allocations and not in the size of the allocations (size is what is shown in the dashboards). That would still mean CPU is saved, but it didn't make a huge difference, only a slight increase in the gap during compaction.

    The gains looked good in

    1. Allocations
    2. CPU (because of allocations?)
    3. RSS was also lower (up to 10 GiB lower! ~60 vs ~70).
    4. Also a small but noticeable improvement in query inner_eval times.
    5. Compaction time (this should help the increase in compaction time that https://github.com/prometheus/tsdb/pull/627 is going to bring).
    6. System load.

    And bad in

    1. result_sort for the queries. Not sure why.

    Benchmark 2

    Benchmark https://github.com/prometheus/tsdb/pull/627 (which includes all the PRs from above Benchmark 1) against 2.11.1

  • M-map full chunks of Head from disk

    TL;DR description of the PR, from @krasi-georgiev:


    When appending to the head and a chunk is full, it is flushed to disk and m-mapped (memory-mapped) to free up memory.

    Prometheus startup now happens in these stages:

    • Iterate the m-mapped chunks from disk and keep a map from series reference to its slice of m-mapped chunks.
    • Iterate the WAL as usual. Whenever we create a new series, look for its m-mapped chunks in the map created before and add them to that series.

    If a head chunk is corrupted, the corrupted chunk and all chunks after it are deleted, and the data after the corruption is recovered from the existing WAL, which means that a corruption in the m-mapped files results in NO data loss.

    M-mapped chunk format - the main difference is that a chunk written for m-mapping also includes its series reference, because there is no index mapping series to these chunks. Block chunks are accessed via the index, which stores the offsets of the chunks in the chunk files (for example, the chunks of a given series ID have offsets 200, 500, etc. in the chunk files). For m-mapped chunks, the offsets are kept in memory and accessed from there. During WAL replay, these offsets are restored by iterating over all m-mapped chunks as stated above, matching the series ID present in the chunk header with the offset of that chunk in its file.
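
    A conceptual sketch of what each such chunk record has to carry (field names are illustrative only, not the actual on-disk layout):

    // Conceptual only: an m-mapped head chunk identifies its own series,
    // since there is no index mapping series to head chunks.
    package headsketch

    type mmappedHeadChunk struct {
        seriesRef uint64 // reference of the series this chunk belongs to
        minTime   int64  // timestamp of the first sample in the chunk
        maxTime   int64  // timestamp of the last sample in the chunk
        data      []byte // encoded samples
    }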

    Prombench results

    WAL Replay

    • 1h WAL: 30% less replay time (4m31 vs 3m36)
    • 2h WAL: 20% less replay time (8m16 vs 7m)

    Memory During WAL Replay

    • High churn: 10-15% less RAM (32GB vs 28GB); 20% less RAM after compaction (34GB vs 27GB)
    • No churn: 20-30% less RAM (23GB vs 18GB); 40% less RAM after compaction (32.5GB vs 20GB)

    Screenshots are in this comment


    Prerequisite: https://github.com/prometheus/prometheus/pull/6830 (Merged)

    Closes https://github.com/prometheus/prometheus/issues/6377. More info in the linked issue and the doc in that issue and the doc inside that doc inside that issue :)

    • [x] Add tests
    • [x] Explore possible ways to get rid of new globals added in head.go
    • [x] Wait for https://github.com/prometheus/prometheus/pull/6830 to be merged
    • [x] Fix windows tests
  • Deleting data points older than TSDB min date with 'delete_series' API endpoint fails

    What did you do?

    1. Set out_of_order_time_window in storage configuration to allow the storage of data points that are older than the TSDB's minimum timestamp by a certain amount of time.
    2. Generate and store some data points with timestamps older than the TSDB's minimum timestamp.
    3. Use the /api/v1/admin/tsdb/delete_series API endpoint to delete the data points, specifying a match[] parameter that matches the data points you created (see the request sketch after this list).
    4. Observe that the data points are not deleted.
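
    For reference, such a deletion request can be issued with curl (a sketch; the selector is a placeholder and the admin API must be enabled with --web.enable-admin-api):

    $ curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=my_old_metric'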

    What did you expect to see?

    When I use the /api/v1/admin/tsdb/delete_series API endpoint to delete the data points I created, I expect Prometheus to delete all data points matching the specified match parameters, regardless of their timestamps.

    What did you see instead? Under which circumstances?

    The delete_series API does not delete data points with timestamps older than the TSDB's minimum timestamp.

    System information

    Linux 6.0.11-300.fc37.x86_64 x86_64

    Prometheus version

    prometheus, version 2.40.7 (branch: HEAD, revision: ab239ac5d43f6c1068f0d05283a0544576aaecf8)
      build user:       root@afba4a8bd7cc
      build date:       20221214-08:49:43
      go version:       go1.19.4
      platform:         linux/amd64
    

    Prometheus configuration file

    storage:
      tsdb:
        out_of_order_time_window: 1w
    

    Alertmanager version

    No response

    Alertmanager configuration file

    No response

    Logs

    ts=2022-12-31T02:45:19.628Z caller=main.go:512 level=info msg="No time or size retention was set so using the default time retention" duration=15d
    ts=2022-12-31T02:45:19.628Z caller=main.go:556 level=info msg="Starting Prometheus Server" mode=server version="(version=2.40.7, branch=HEAD, revision=ab239ac5d43f6c1068f0d05283a0544576aaecf8)"
    ts=2022-12-31T02:45:19.628Z caller=main.go:561 level=info build_context="(go=go1.19.4, user=root@afba4a8bd7cc, date=20221214-08:49:43)"
    ts=2022-12-31T02:45:19.628Z caller=main.go:562 level=info host_details="(Linux 6.0.11-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Dec 2 20:47:45 UTC 2022 x86_64 localhost-live.Home (none))"
    ts=2022-12-31T02:45:19.628Z caller=main.go:563 level=info fd_limits="(soft=524288, hard=524288)"
    ts=2022-12-31T02:45:19.628Z caller=main.go:564 level=info vm_limits="(soft=unlimited, hard=unlimited)"
    ts=2022-12-31T02:45:19.629Z caller=web.go:559 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
    ts=2022-12-31T02:45:19.630Z caller=main.go:993 level=info msg="Starting TSDB ..."
    ts=2022-12-31T02:45:19.630Z caller=tls_config.go:232 level=info component=web msg="Listening on" address=[::]:9090
    ts=2022-12-31T02:45:19.630Z caller=tls_config.go:235 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
    ts=2022-12-31T02:45:19.633Z caller=head.go:562 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
    ts=2022-12-31T02:45:19.633Z caller=head.go:606 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=1.581µs
    ts=2022-12-31T02:45:19.633Z caller=head.go:612 level=info component=tsdb msg="Replaying WAL, this may take a while"
    ts=2022-12-31T02:45:19.633Z caller=head.go:683 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
    ts=2022-12-31T02:45:19.634Z caller=head.go:711 level=info component=tsdb msg="WBL segment loaded" segment=0 maxSegment=0
    ts=2022-12-31T02:45:19.634Z caller=head.go:720 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=25.653µs wal_replay_duration=290.747µs wbl_replay_duration=123.648µs total_replay_duration=461.634µs
    ts=2022-12-31T02:45:19.635Z caller=main.go:1014 level=info fs_type=9123683e
    ts=2022-12-31T02:45:19.635Z caller=main.go:1017 level=info msg="TSDB started"
    ts=2022-12-31T02:45:19.635Z caller=main.go:1197 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
    ts=2022-12-31T02:45:21.019Z caller=main.go:1234 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=1.384764925s db_storage=746ns remote_storage=1.026µs web_handler=285ns query_engine=536ns scrape=1.384541556s scrape_sd=28.161µs notify=3.212µs notify_sd=9.203µs rules=5.028µs tracing=20.258µs
    ts=2022-12-31T02:45:21.020Z caller=main.go:978 level=info msg="Server is ready to receive web requests."
    ts=2022-12-31T02:45:21.020Z caller=manager.go:944 level=info component="rule manager" msg="Starting rule manager..."
    ts=2022-12-31T02:45:26.561Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1672436696000 maxt=1672437600000 ulid=01GNK13RWKHAQGK0TCB7FXNG8J duration=13.847104ms
    ts=2022-12-31T02:45:26.562Z caller=head.go:1213 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=663.29µs
    
  • Retain series selection when toggling graph features

    When toggling certain features in the web UI graph page, such as local time or stacked graphs, the selection of series is reset to show all of them. This PR takes the current selection into account when updating the view, preserving it whenever possible. This is only possible when the graphed data set remains unaltered; feature toggles that change the graph data will still reset the selection.

    Fixes #10970

  • Remove Nomad `datacenter` field in configuration docs

    There is a datacenter field documented in the Nomad SD configuration section even though no such field is implemented in the actual discovery mechanism. Looking at the Nomad API client, there doesn't seem to be a place to specify the datacenter in either the client config or the query options, so I'm going to assume the error is in documenting a datacenter field that doesn't exist, rather than in failing to implement it.

    Fixes #11776

    cc @attachmentgenie

  • Remote write stop sending samples

    What did you do?

    I use Prometheus to collect metrics and write them to an OpenTSDB remote storage. At some point all of the Prometheus remote-write metrics stopped growing. It looks like #10264, but I am quite sure that patch is merged.

    I captured the goroutine stack dump: goroutine.txt

    What did you expect to see?

    No response

    What did you see instead? Under which circumstances?

    Logs like "Skipping resharding, last successful send was beyond threshold" repeated for a while, then no more remote write logs.

    The Prometheus remote-write metrics stopped growing; apparently it's blocked somewhere.

    System information

    Linux 5.4.143.bsk.3-amd64 x86_64

    Prometheus version

    Version	2.33.5
    Revision	abe39ffbb022aed5606e15fd46d9541c822e354f
    Branch	bytedance-dev
    BuildUser	root@leafboat-73428968
    BuildDate	20220609-13:02:08
    GoVersion	go1.17.8
    

    Prometheus configuration file

    global:
      scrape_interval: 30s
      scrape_timeout: 10s
      evaluation_interval: 30s
      external_labels:
        cluster: erica
        collector: ksm-collector-1
        physical_cluster: erica-hl
        vdc: hl
    remote_write:
    - url: http://telecaster.byted.org/api/prometheus/align
      remote_timeout: 30s
      write_relabel_configs:
      - source_labels: [metric_prefix, __name__]
        separator: ;
        regex: (.+);(.+)
        target_label: __name__
        replacement: $1.$2
        action: replace
      - separator: ;
        regex: (metric_prefix|bke_prometheus_stack_scrape)
        replacement: $1
        action: labeldrop
      follow_redirects: true
      queue_config:
        capacity: 2500
        max_shards: 200
        min_shards: 1
        max_samples_per_send: 500
        batch_send_deadline: 5s
        min_backoff: 30ms
        max_backoff: 5s
      metadata_config:
        send: true
        send_interval: 1m
        max_samples_per_send: 500
    

    Alertmanager version

    No response

    Alertmanager configuration file

    No response

    Logs

    ts=2022-12-19T10:20:30.668Z caller=dedupe.go:112 component=remote level=warn remote_name=f9aa52 url=http://telecaster.byted.org/api/prometheus/align msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671414921 minSendTimestamp=1671445220
