Next Generation Monitoring Server With Golang

Next Generation Monitoring Server

Build

make

Arguments

$ bin/ng-monitoring-server --help
  Usage of bin/ng-monitoring-server:
        --address string             TCP address to listen for http connections
        --advertise-address string   tidb server advertise IP
        --config string              config file path
        --log.path string            Log path of ng monitoring server
        --pd.endpoints strings       Addresses of PD instances within the TiDB cluster. Multiple addresses are separated by commas, e.g. --pd.endpoints 10.0.0.1:2379,10.0.0.2:2379
        --retention-period string    Data with timestamps outside the retentionPeriod is automatically deleted
                                     The following optional suffixes are supported: h (hour), d (day), w (week), y (year). If suffix isn't set, then the duration is counted in months (default "1")
        --storage.path string        Storage path of ng monitoring server
pflag: help requested
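
The suffix rules in the `--retention-period` help text can be illustrated with a small parser. This is only a sketch, not the server's own code: `parseRetentionPeriod` is a hypothetical helper, and the lengths used for a month (31 days) and a year (365 days) are assumptions, since the help text does not say how those units are converted.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRetentionPeriod is a hypothetical helper that follows the help text:
// suffixes h, d, w, y are accepted, and a bare number is counted in months.
func parseRetentionPeriod(s string) (time.Duration, error) {
	const day = 24 * time.Hour
	units := map[byte]time.Duration{
		'h': time.Hour,
		'd': day,
		'w': 7 * day,
		'y': 365 * day, // assumption: a year is 365 days
	}
	s = strings.TrimSpace(s)
	if s == "" {
		return 0, fmt.Errorf("empty retention period")
	}
	unit := 31 * day // assumption: a month is 31 days
	if u, ok := units[s[len(s)-1]]; ok {
		unit = u
		s = s[:len(s)-1]
	}
	n, err := strconv.Atoi(s)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid retention period value %q", s)
	}
	return time.Duration(n) * unit, nil
}

func main() {
	for _, in := range []string{"1", "12h", "2w"} {
		d, err := parseRetentionPeriod(in)
		fmt.Println(in, d, err)
	}
}
```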

Config Example

$ cat config/config.toml.example
  # NG Monitoring Server Configuration.
  
  # Server address.
  address = "0.0.0.0:12020"
  
  advertise-address = "0.0.0.0:12020"
  
  [log]
  # Log path
  path = "log"
  
  # Log level: DEBUG, INFO, WARN, ERROR
  level = "INFO"
  
  [pd]
  # Addresses of PD instances within the TiDB cluster. Multiple addresses are separated by commas, e.g. ["10.0.0.1:2379","10.0.0.2:2379"]
  endpoints = ["0.0.0.0:2379"]
  
  [storage]
  # Storage path of ng monitoring server
  path = "data"
  
  [security]
  ca-path = ""
  cert-path = ""
  key-path = ""

Reload Config

$ bin/ng-monitoring-server --config config/config.toml.example

# Another shell session
$ pkill -SIGHUP ng-monitoring-server
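
For readers curious how a SIGHUP-triggered reload is typically wired up in Go, here is a minimal sketch using the standard `os/signal` package. It is not taken from the server's source: `watchReload` and `signalSelf` are hypothetical names, and the reload callback only prints a message where a real server would re-read its config file. `syscall.Kill` is Unix-only.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// watchReload calls reload every time the process receives SIGHUP.
// The returned function stops watching. (Hypothetical helper.)
func watchReload(reload func()) (stop func()) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP)
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-sigs:
				reload()
			case <-done:
				return
			}
		}
	}()
	return func() {
		signal.Stop(sigs)
		close(done)
	}
}

// signalSelf sends SIGHUP to the current process, standing in for
// `pkill -SIGHUP ng-monitoring-server` run from another shell.
func signalSelf() error {
	return syscall.Kill(os.Getpid(), syscall.SIGHUP)
}

func main() {
	reloaded := make(chan struct{}, 1)
	stop := watchReload(func() {
		fmt.Println("SIGHUP received: re-reading config file")
		reloaded <- struct{}{}
	})
	defer stop()

	if err := signalSelf(); err != nil {
		fmt.Println("sending signal failed:", err)
		return
	}
	<-reloaded
}
```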
Owner
PingCAP
The team behind TiDB TiKV, an open source MySQL compatible NewSQL HTAP database
Comments
  • fix bug of running state instead of finished state

    Signed-off-by: crazycs520

    close https://github.com/pingcap/ng-monitoring/issues/35

    ▶ curl "http://192.168.1.3:12020/continuous_profiling/group_profiles?begin_time=1639725850&end_time=1644836910"
    [{"ts":1639725960,"profile_duration_secs":10,"state":"running","component_num":{"tidb":1,"pd":3,"tikv":0,"tiflash":0}}]
    
    

    This PR also requires a change to tidb-dashboard.
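
A client consuming the `group_profiles` response shown above could decode it as follows. This is a sketch based only on the fields visible in the sample output: the `GroupProfile` type and `decodeGroupProfiles` function are illustrative names, not the project's own definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GroupProfile mirrors the fields visible in the sample response.
// The type name and Go field names are assumptions for illustration.
type GroupProfile struct {
	Ts                  int64          `json:"ts"`
	ProfileDurationSecs int            `json:"profile_duration_secs"`
	State               string         `json:"state"`
	ComponentNum        map[string]int `json:"component_num"`
}

// decodeGroupProfiles parses the JSON array returned by the endpoint.
func decodeGroupProfiles(body []byte) ([]GroupProfile, error) {
	var groups []GroupProfile
	if err := json.Unmarshal(body, &groups); err != nil {
		return nil, err
	}
	return groups, nil
}

func main() {
	body := []byte(`[{"ts":1639725960,"profile_duration_secs":10,"state":"running","component_num":{"tidb":1,"pd":3,"tikv":0,"tiflash":0}}]`)
	groups, err := decodeGroupProfiles(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, g := range groups {
		fmt.Println(g.Ts, g.State, g.ComponentNum)
	}
}
```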

  • config: fix update conflict causing by http API and file reload

    Signed-off-by: Zhenchi

    What problem does this PR solve?

    Issue Number: close #136

    What is changed and how it works?

    1. Add function UpdateGlobalConfig
    2. Use mutex to protect global config
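
The two changes listed above can be sketched as follows. Only the name `UpdateGlobalConfig` comes from the PR description; its signature, the `Config` fields, and `GetGlobalConfig` are assumptions made for illustration. The idea is that every writer, whether the HTTP API or the file reload, goes through one function that modifies the config while holding the lock, so neither update overwrites the other.

```go
package main

import (
	"fmt"
	"sync"
)

// Config is a cut-down, hypothetical stand-in for the server config.
type Config struct {
	Address  string
	LogLevel string
}

var (
	globalMu   sync.Mutex
	globalConf = Config{Address: "0.0.0.0:12020", LogLevel: "INFO"}
)

// GetGlobalConfig returns a copy of the current config.
func GetGlobalConfig() Config {
	globalMu.Lock()
	defer globalMu.Unlock()
	return globalConf
}

// UpdateGlobalConfig applies update to the current config while holding
// the mutex, so concurrent writers are serialized. (Assumed signature.)
func UpdateGlobalConfig(update func(old Config) Config) Config {
	globalMu.Lock()
	defer globalMu.Unlock()
	globalConf = update(globalConf)
	return globalConf
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // stands in for an update through the HTTP API
		defer wg.Done()
		UpdateGlobalConfig(func(c Config) Config { c.LogLevel = "DEBUG"; return c })
	}()
	go func() { // stands in for an update from a config file reload
		defer wg.Done()
		UpdateGlobalConfig(func(c Config) Config { c.Address = "0.0.0.0:12021"; return c })
	}()
	wg.Wait()
	fmt.Printf("%+v\n", GetGlobalConfig())
}
```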
  • topsql: fix instances output more than given time range

    Signed-off-by: Zhenchi

    What problem does this PR solve?

    In master, if we want to query instances between ts 400 and ts 500, we query vm via /api/v1/query_range with arguments

    query = last_over_time(instance[100s])
    start = 400
    end = 500
    step = 100s
    

    However, it will fetch instances from (300, 500].

    What is changed and how it works?

    In this patch, we query vm via /api/v1/query with arguments

    query = last_over_time(instance[101s])
    time = 500
    

    As a result, it fetches instances from [400, 500].

    Besides, fetchSumFromTSDB can be changed in the same way.
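
The arithmetic behind the 101s window can be made explicit. The sketch below assumes whole-second timestamps and a left-open lookbehind window, which is what the (300, 500] result above implies; `instantQueryParams` is a hypothetical helper, not a function from the patch.

```go
package main

import (
	"fmt"
	"net/url"
)

// instantQueryParams builds the arguments for /api/v1/query so that the
// lookbehind window covers the closed range [start, end] in seconds.
// A window of d evaluated at time t covers (t-d, t], so one extra second
// is added to pull start itself into the range.
func instantQueryParams(start, end int64) url.Values {
	window := end - start + 1
	v := url.Values{}
	v.Set("query", fmt.Sprintf("last_over_time(instance[%ds])", window))
	v.Set("time", fmt.Sprintf("%d", end))
	return v
}

func main() {
	v := instantQueryParams(400, 500)
	fmt.Println("query =", v.Get("query"))
	fmt.Println("time  =", v.Get("time"))
}
```

With start 400 and end 500 this yields the same `query` and `time` values as in the description above.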

  • topsql: fix heavy loop in backoff

    Signed-off-by: Zhenchi

    What problem does this PR solve?

    Issue Number: close #71

    What is changed and how it works?

    The gRPC client only sends the request when Recv is called, and only then returns rpc error: code = Unimplemented.

    Previously, I expected to catch Unimplemented when calling Subscribe, but that is not the case.

    This Patch

    [2021/12/22 15:38:30.713 +08:00] [INFO] [manager.go:72] ["Top SQL is enabled"]
    [2021/12/22 15:38:30.713 +08:00] [INFO] [scraper.go:60] ["starting to scrape top SQL from the component"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"]
    [2021/12/22 15:38:30.713 +08:00] [INFO] [scraper.go:60] ["starting to scrape top SQL from the component"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"]
    [2021/12/22 15:38:30.715 +08:00] [WARN] [scraper.go:296] ["failed to call Subscribe"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [error="rpc error: code = Unimplemented desc = "]
    [2021/12/22 15:38:30.716 +08:00] [WARN] [scraper.go:276] ["failed to call Subscribe"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [error="rpc error: code = Unimplemented desc = unknown service tipb.TopSQLPubSub"]
    [2021/12/22 15:38:34.716 +08:00] [WARN] [scraper.go:252] ["retry to scrape component"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [retried=1]
    [2021/12/22 15:38:34.716 +08:00] [WARN] [scraper.go:252] ["retry to scrape component"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [retried=1]
    [2021/12/22 15:38:34.717 +08:00] [WARN] [scraper.go:296] ["failed to call Subscribe"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [error="rpc error: code = Unimplemented desc = "]
    [2021/12/22 15:38:34.717 +08:00] [WARN] [scraper.go:276] ["failed to call Subscribe"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [error="rpc error: code = Unimplemented desc = unknown service tipb.TopSQLPubSub"]
    [2021/12/22 15:38:42.718 +08:00] [WARN] [scraper.go:252] ["retry to scrape component"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [retried=2]
    [2021/12/22 15:38:42.718 +08:00] [WARN] [scraper.go:252] ["retry to scrape component"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [retried=2]
    [2021/12/22 15:38:42.719 +08:00] [WARN] [scraper.go:296] ["failed to call Subscribe"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [error="rpc error: code = Unimplemented desc = "]
    [2021/12/22 15:38:42.719 +08:00] [WARN] [scraper.go:276] ["failed to call Subscribe"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [error="rpc error: code = Unimplemented desc = unknown service tipb.TopSQLPubSub"]
    [2021/12/22 15:38:48.634 +08:00] [INFO] [manager.go:118] ["update profile target info finished"] [update-count=0]
    [2021/12/22 15:38:58.720 +08:00] [WARN] [scraper.go:252] ["retry to scrape component"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [retried=3]
    [2021/12/22 15:38:58.720 +08:00] [WARN] [scraper.go:252] ["retry to scrape component"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [retried=3]
    [2021/12/22 15:38:58.721 +08:00] [WARN] [scraper.go:296] ["failed to call Subscribe"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [error="rpc error: code = Unimplemented desc = "]
    [2021/12/22 15:38:58.722 +08:00] [WARN] [scraper.go:276] ["failed to call Subscribe"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [error="rpc error: code = Unimplemented desc = unknown service tipb.TopSQLPubSub"]
    [2021/12/22 15:39:30.722 +08:00] [WARN] [scraper.go:252] ["retry to scrape component"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [retried=4]
    [2021/12/22 15:39:30.722 +08:00] [WARN] [scraper.go:252] ["retry to scrape component"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [retried=4]
    [2021/12/22 15:39:30.723 +08:00] [WARN] [scraper.go:296] ["failed to call Subscribe"] [component="{\"name\":\"tikv\",\"ip\":\"127.0.0.1\",\"port\":20160,\"status_port\":20180}"] [error="rpc error: code = Unimplemented desc = "]
    [2021/12/22 15:39:30.723 +08:00] [WARN] [scraper.go:276] ["failed to call Subscribe"] [component="{\"name\":\"tidb\",\"ip\":\"127.0.0.1\",\"port\":4000,\"status_port\":10080}"] [error="rpc error: code = Unimplemented desc = unknown service tipb.TopSQLPubSub"]
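
The retries in the log above are spaced 4s, 8s, 16s and 32s apart, which is the pattern of a doubling backoff. A minimal sketch of such a delay function follows. The 4s base is read off the log; the one-minute cap and the name `backoffDelay` are assumptions, not values taken from the patch.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 4 * time.Second // matches the first gap in the log
	maxDelay  = time.Minute     // assumption: the real cap is not shown
)

// backoffDelay returns how long to wait before the given retry (1-based).
// The delay doubles on every retry and never exceeds maxDelay.
func backoffDelay(retry int) time.Duration {
	d := baseDelay
	for i := 1; i < retry; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for retry := 1; retry <= 6; retry++ {
		fmt.Printf("retried=%d wait=%s\n", retry, backoffDelay(retry))
	}
}
```

The important point from the description is where the delay is applied: since Unimplemented only surfaces from Recv, the wait has to follow a failed Recv as well, otherwise the scraper reconnects in a tight loop.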
    
  • conprof: fix genjidb(badger) gc doesn't work problem(#115)(#119)

    What problem does this PR solve?

    Issue Number: close #xxx

    What is changed and how it works?

    cherry-pick https://github.com/pingcap/ng-monitoring/pull/115

    https://github.com/pingcap/ng-monitoring/pull/119

  • config: refine advertise address config

    Signed-off-by: crazycs520

    What problem does this PR solve?

    Issue Number: close #57
    Issue Number: close #56
    Issue Number: close #90

    What is changed and how it works?

  • topsql: introduce sql exec count and sql duration

    Signed-off-by: mornyx

    What is changed and how it works?

    • Add two metrics: sql_exec_count and sql_duration.
    • Implement write and query for new metrics.
    • Enhance integration tests of TopSQL.

    Related PR

    • https://github.com/pingcap/tipb/pull/250
  • config: fix update conflict causing by http API and file reload (#137)

    This is an automated cherry-pick of #137

    Signed-off-by: Zhenchi

    What problem does this PR solve?

    Issue Number: close #136

    What is changed and how it works?

    1. Add function UpdateGlobalConfig
    2. Use mutex to protect global config
  • conprof: fix genjidb(badger) gc doesn't work problem

    Signed-off-by: crazycs520

    What problem does this PR solve?

    close: #120

    Related issue: https://github.com/genjidb/genji/issues/454

    https://discuss.dgraph.io/t/badgerdb-consume-too-much-disk-space/17070

    Related repo: https://github.com/crazycs520/genji/tree/v0.13.0.bugfix

    What is changed and how it works?

    The bug is in genjidb, but genjidb no longer supports badger, so I forked genjidb, fixed the bug in my fork, and replaced the dependency in go.mod.

  • conprof: fix unstable test

    Signed-off-by: crazycs520

    What problem does this PR solve?

    --- FAIL: TestProfileStorage (0.54s)
        store_test.go:111:
                    Error Trace:    store_test.go:111
                                                            store_test.go:29
                    Error:          Not equal:
                                    expected: 1644219544
                                    actual  : 1644219543
                    Test:           TestProfileStorage
    

    What is changed and how it works?

  • turn off badger's prefetch and tuning for memory usage

    Signed-off-by: Zhenchi

    What problem does this PR solve?

    ngm uses a lot of memory when it collects large amounts of conprof data, mainly because of conprof retention. When executing DELETE FROM conprof_table WHERE ts <= ?, genji performs an index scan through an iterator, which prefetches values under the default options. The prefetch loads conprof data into memory through mmap, but the data is never read. We benefit little from prefetching, yet it increases memory usage.

    What is changed and how it works?

    • Disable prefetch.
    • Tweak options of badger.
  • Bring Continuous Profiling to TiCDC

    We hope to bring Continuous Profiling to TiCDC. The work is divided into two steps:

    • [x] Add TiCDC to the ng-monitoring backend. Although TiCDC is not yet shown on the web frontend, clicking the "Download" button fetches all profiles (including TiCDC) at a given point in time.

    • [ ] Let the web frontend of TiDB Dashboard display TiCDC related profiles.

  • Continuous Profiling should be disabled by default until tiflash can be safely profiled

    See: https://github.com/pingcap/tiflash/issues/5687

    Continuous Profiling has been enabled by default since v6.1.0, so TiDB/TiKV/TiFlash have been CPU-profiled continuously by default since then. The CPU profiler of TiKV/TiFlash is pprof-rs. During CPU profiling, pprof-rs registers a signal handler and periodically triggers the SIGPROF signal. Each time SIGPROF fires, the signal handler runs on a business thread to sample that thread's call stack. The signal handler also makes system calls through glibc, so it can modify errno, and the business thread may then see a wrong errno after the handler returns. However, the current version of pprof-rs already protects errno, so there is still no key evidence that pprof-rs modified it. The current judgment is based on two facts:

    1. The problem only occurs while profiling is on
    2. pprof-rs has a very similar issue

    So, Continuous Profiling should be disabled by default until tiflash can be safely profiled.

  • TopSQL's retention time does not work

    Bug Report

    What did you do?

    Enabled the Top SQL feature on March 21.

    What did you expect to see?

    Top SQL's retention time is 1 month. If retention worked, data older than the retention time would be deleted.

    What did you see instead?

    Data from March 21 is still available.

    What version of TiDB Dashboard are you using (./tidb-dashboard --version)?

    6.0

    This issue is from https://github.com/pingcap/tidb-dashboard/issues/1218

  • genjidb vs sqlite?

    The storage engine evolved as follows:

    Use a KV that supports ZSTD to achieve max compression → Use genjidb for easier access over that KV engine

    However, we no longer use ZSTD for block compression and instead compress at the per-profile level, so genjidb + badger is no longer the only choice. For example, SQLite, as the most widely deployed database engine, may be a better choice.

  • Closing ng-monitoring does not actively remove the topology item

    I used Ctrl+C to gracefully quit ng-monitoring and then started a new instance:

    [GIN] 2022/02/08 - 14:43:48 | 404 |         792ns | 192.168.126.218 | GET      "/"
    ^C[2022/02/08 14:46:06.297 +08:00] [INFO] [main.go:108] ["received signal"] [sig=interrupt]
    [2022/02/08 14:46:06.297 +08:00] [INFO] [http.go:79] ["shutting down http server"]
    [2022/02/08 14:46:06.297 +08:00] [INFO] [http.go:81] ["http server is down"]
    [2022/02/08 14:46:06.298 +08:00] [INFO] [default_subscriber.go:48] ["stopping scrapers"]
    [2022/02/08 14:46:06.298 +08:00] [INFO] [default_subscriber.go:51] ["stop scrapers successfully"]
    [2022/02/08 14:46:06.298 +08:00] [INFO] [database.go:20] ["Stopping timeseries database"]
    [2022/02/08 14:46:06.313 +08:00] [INFO] [database.go:22] ["Stop timeseries database successfully"]
    [2022/02/08 14:46:06.313 +08:00] [INFO] [database.go:24] ["Stopping document database"]
    [2022/02/08 14:46:06.313 +08:00] [INFO] [document.go:51] ["badger stop running value log gc loop"]
    [2022/02/08 14:46:06.384 +08:00] [INFO] [database.go:26] ["Stop document database successfully"]
    

    However, the topology is not cleaned up in time, so TiDB Dashboard keeps connecting to the ng-monitoring server that no longer exists:

    $ etcdctl get /topology --prefix
    /topology/ng-monitoring/192.168.126.218:12020/info   <-- new
    {"git_hash":"1afcaa990af5c65b222e0ab59171867248645f4a","ip":"192.168.126.218","listening_port":12020,"start_timestamp":1644302767}
    /topology/ng-monitoring/192.168.126.218:12020/ttl   <-- new
    1644302767649395000
    /topology/ng-monitoring/192.168.3.105:12020/info   <-- old, gracefully exited
    {"git_hash":"1afcaa990af5c65b222e0ab59171867248645f4a","ip":"192.168.3.105","listening_port":12020,"start_timestamp":1644225105}
    /topology/ng-monitoring/192.168.3.105:12020/ttl   <-- old, gracefully exited
    1644302745923627000
    /topology/tidb/127.0.0.1:4000/info
    {"version":"v5.3.0","git_hash":"4a1b2e9fe5b5afb1068c56de47adb07098d768d6","ip":"127.0.0.1","status_port":10080,"deploy_path":"/Users/breezewish/.tiup/components/tidb/v5.3.0","start_timestamp":1644224846,"labels":{}}
    /topology/tidb/127.0.0.1:4000/ttl
    1644302786725919000
    

    This will cause problems when a user scales in and then scales out (switches) the ngm node.
