
Cloudprober is monitoring software that makes it super easy to monitor the availability and performance of various components of your system. Cloudprober employs the "active" monitoring model: it runs probes against (or on) your components to verify that they are working as expected. For example, it can run a probe to verify that your frontends can reach your backends. Similarly, it can run a probe to verify that your in-cloud VMs can actually reach your on-premise systems. This kind of monitoring makes it possible to monitor your systems' interfaces regardless of the implementation and helps you quickly pin down what's broken in your system.

Cloudprober Use Case

Features

  • Automated target discovery for Cloud targets. GCE and Kubernetes are supported out of the box; other Cloud providers can be added easily.
  • Integration with the open-source monitoring stack of Prometheus and Grafana. Cloudprober exports probe results as counter-based metrics that work well with Prometheus and Grafana.
  • Out-of-the-box, config-based integration with popular monitoring systems: Prometheus, DataDog, PostgreSQL, StackDriver, CloudWatch.
  • Fast and efficient built-in implementations for the most common types of checks: PING (ICMP), HTTP, UDP, DNS. PING and UDP probes in particular are implemented so that thousands of hosts can be probed with minimal resources.
  • Arbitrary, complex probes can be run through the external probe type. For example, you could write a simple script to insert and delete a row in your database, and execute this script through the 'EXTERNAL' probe type.
  • Standard metrics: total, success, latency. Latency can be configured to be a distribution (histogram) metric, allowing calculation of percentiles.
  • Strong focus on ease of deployment. Cloudprober is written entirely in Go and compiles into a static binary. It can be deployed easily, either as a standalone binary or through Docker containers. Thanks to automated, continuous target discovery, there is usually no need to re-deploy or re-configure Cloudprober in response to most changes.
  • Low footprint. The Cloudprober Docker image is small, containing just the statically compiled binary, and it takes very little CPU and RAM to run even a large number of probes.
  • Extensible architecture. Cloudprober can be extended easily along most dimensions. Adding support for other Cloud targets, monitoring systems, and even new probe types is straightforward.

Getting Started

Visit the Getting Started page to get started with Cloudprober.

Feedback

We'd love to hear your feedback. If you're using Cloudprober, please consider sharing how you use it by adding a comment here. It will be a great help in planning Cloudprober's future direction.

Join Cloudprober Slack or GitHub discussions for questions and discussion about Cloudprober.

Comments
  • Support resolving IP Range in RDS client

    If an RDS resource's IP is an IP range, e.g. a GCP IPv6 forwarding rule, we want the RDS client to parse the CIDR and return an IP address.

    https://github.com/cloudprober/cloudprober/blob/master/rds/client/client.go#L152
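
    A minimal sketch, in Go, of the kind of parsing being requested; the helper name and its call site are hypothetical, not the actual client.go change:

    ```go
    package main

    import (
        "fmt"
        "net"
        "strings"
    )

    // resourceIP returns a concrete IP for an RDS resource. If the server
    // reported an IP range (CIDR) rather than a single address, the base
    // address of the range is returned so the probe has something to target.
    func resourceIP(ipOrCIDR string) (net.IP, error) {
        if strings.Contains(ipOrCIDR, "/") {
            ip, _, err := net.ParseCIDR(ipOrCIDR)
            if err != nil {
                return nil, fmt.Errorf("bad CIDR %q: %v", ipOrCIDR, err)
            }
            return ip, nil
        }
        if ip := net.ParseIP(ipOrCIDR); ip != nil {
            return ip, nil
        }
        return nil, fmt.Errorf("bad IP address %q", ipOrCIDR)
    }

    func main() {
        // e.g. an IPv6 forwarding rule reported as a range (hypothetical value).
        ip, err := resourceIP("2600:1901:0:1234::/96")
        fmt.Println(ip, err)
    }
    ```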

  • IPv4 ping is broken on Flatcar Container Linux

    Describe the bug: When migrating our Kubernetes nodes from CoreOS to Flatcar Container Linux, the Cloudprober ping probes stopped working with the error message:

    W0329 14:01:01.380354       1 ping.go:375] [cloudprober] Not a valid ICMP echo reply packet from: xxx
    

    ICMP reply looks like:

    00000000  45 00 00 54 fc e2 00 00  40 01 95 ab 0a 2e 65 d5  |[email protected].|
    00000010  64 5a 13 be 00 00 ca e1  06 cc 06 01 16 e0 de 89  |dZ..............|
    00000020  fe 35 a4 6b 16 e0 de 89  fe 35 a4 6b 16 e0 de 89  |.5.k.....5.k....|
    00000030  fe 35 a4 6b 16 e0 de 89  fe 35 a4 6b 16 e0 de 89  |.5.k.....5.k....|
    00000040  fe 35 a4 6b 16 e0 de 89  fe 35 a4 6b 16 e0 de 89  |.5.k.....5.k....|
    00000050  fe 35 a4 6b 00 00 00 00  00 00 00 00 00 00 00 00  |.5.k............|
    

    Apparently the same issue as reported in #80.

    Cloudprober Version v0.11.6

    Additional context: I have temporarily fixed the issue by removing runtime.GOOS == "darwin" && at https://github.com/cloudprober/cloudprober/blob/master/probes/ping/ping.go#L379 and building my own custom Docker image.
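
    For context, a minimal sketch of the header stripping that this workaround makes unconditional. It assumes the reply buffer starts with an IPv4 header (as the hexdump above suggests: the first byte is 0x45) whose IHL field gives the header length; this is an illustration, not the actual ping.go code:

    ```go
    package main

    import "fmt"

    // stripIPv4Header returns the ICMP payload of a raw IPv4 packet. On some
    // platform/socket combinations the kernel hands back the full IP packet,
    // so the ICMP echo reply only parses correctly after the IP header is
    // removed.
    func stripIPv4Header(pkt []byte) []byte {
        if len(pkt) < 20 || pkt[0]>>4 != 4 {
            return pkt // not an IPv4 header, leave the buffer as-is
        }
        hdrLen := int(pkt[0]&0x0f) * 4 // IHL is in 32-bit words
        if hdrLen < 20 || len(pkt) < hdrLen {
            return pkt
        }
        return pkt[hdrLen:]
    }

    func main() {
        // First bytes of the reply shown above: an IPv4 header followed by ICMP.
        pkt := []byte{
            0x45, 0x00, 0x00, 0x54, 0xfc, 0xe2, 0x00, 0x00, 0x40, 0x01, 0x95, 0xab,
            0x0a, 0x2e, 0x65, 0xd5, 0x64, 0x5a, 0x13, 0xbe, 0x00, 0x00, 0xca, 0xe1,
        }
        fmt.Printf("% x\n", stripIPv4Header(pkt)) // 00 00 ca e1 ... (echo reply, type 0)
    }
    ```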

  • podman macos environment causes ping probe to be false positive

    Describe the bug: The host is not up, but the success rate is 100%.

    Cloudprober Version 0.11.4

    To Reproduce: I am trying to run the container on podman, which was installed via brew.

    1. Create the Dockerfile and cloudprober.cfg:

    ```
    $ cat Dockerfile
    FROM docker.io/cloudprober/cloudprober
    COPY cloudprober.cfg /etc/cloudprober.cfg

    $ cat cloudprober.cfg
    probe {
      name: "cn01"
      type: PING
      targets {
        host_names: "172.16.10.108"
      }
      interval_msec: 5000  # 5s
      timeout_msec: 1000   # 1s
    }
    ```

    2. podman build -t hncr.io/cloudprober .

    3. podman run --name observer --network host hncr.io/cloudprober:latest

    4. Cloudprober reports 100% success:

    ```
    cloudprober 1646355191063709290 1646357755 labels=ptype=ping,probe=cn01,dst=172.16.10.108 total=1024 success=1024 latency=1765228.603 validation_failure=map:validator,data-integrity:0
    ```

    5. But pinging the same host directly fails:

    ```
    $ ping 172.16.10.108
    PING 172.16.10.108 (172.16.10.108): 56 data bytes
    Request timeout for icmp_seq 0
    Request timeout for icmp_seq 1
    Request timeout for icmp_seq 2
    ```

    Additional context:

    ❯ podman info host: arch: amd64 buildahVersion: 1.23.1 cgroupControllers:

    • memory
    • pids cgroupManager: systemd cgroupVersion: v2 conmon: package: conmon-2.1.0-2.fc35.x86_64 path: /usr/bin/conmon version: 'conmon version 2.1.0, commit: ' cpus: 1 distribution: distribution: fedora variant: coreos version: "35" eventLogger: journald hostname: localhost.localdomain idMappings: gidmap:
      • container_id: 0 host_id: 1000 size: 1
      • container_id: 1 host_id: 100000 size: 65536 uidmap:
      • container_id: 0 host_id: 1000 size: 1
      • container_id: 1 host_id: 100000 size: 65536 kernel: 5.15.18-200.fc35.x86_64 linkmode: dynamic logDriver: journald memFree: 1034878976 memTotal: 2061381632 ociRuntime: name: crun package: crun-1.4.2-1.fc35.x86_64 path: /usr/bin/crun version: |- crun version 1.4.2 commit: f6fbc8f840df1a414f31a60953ae514fa497c748 spec: 1.0.0 +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL os: linux remoteSocket: exists: true path: /run/user/1000/podman/podman.sock security: apparmorEnabled: false capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT rootless: true seccompEnabled: true seccompProfilePath: /usr/share/containers/seccomp.json selinuxEnabled: true serviceIsRemote: true slirp4netns: executable: /usr/bin/slirp4netns package: slirp4netns-1.1.12-2.fc35.x86_64 version: |- slirp4netns version 1.1.12 commit: 7a104a101aa3278a2152351a082a6df71f57c9a3 libslirp: 4.6.1 SLIRP_CONFIG_VERSION_MAX: 3 libseccomp: 2.5.3 swapFree: 0 swapTotal: 0 uptime: 5h 54m 19.4s (Approximately 0.21 days) plugins: log:
    • k8s-file
    • none
    • journald network:
    • bridge
    • macvlan volume:
    • local registries: search:
    • docker.io store: configFile: /var/home/core/.config/containers/storage.conf containerStore: number: 1 paused: 0 running: 1 stopped: 0 graphDriverName: overlay graphOptions: {} graphRoot: /var/home/core/.local/share/containers/storage graphStatus: Backing Filesystem: xfs Native Overlay Diff: "true" Supports d_type: "true" Using metacopy: "false" imageStore: number: 5 runRoot: /run/user/1000/containers volumePath: /var/home/core/.local/share/containers/storage/volumes version: APIVersion: 3.4.4 Built: 1638999907 BuiltTime: Wed Dec 8 21:45:07 2021 GitCommit: "" GoVersion: go1.16.8 OsArch: linux/amd64 Version: 3.4.4

    ➜ podman inspect hncr.io/cloudprober:latest [ { "Id": "fe71b1a63c8e16cdfe0780377a493cf14197ca8b3272be514782f60b1bbaa92d", "Digest": "sha256:0ddfb4018e10acf4ab84a59a9cf630ec4f79ba31ff505562237cc54a220c46d6", "RepoTags": [ "hncr.io/cloudprober:latest" ], "RepoDigests": [ "hncr.io/cloudprober@sha256:0ddfb4018e10acf4ab84a59a9cf630ec4f79ba31ff505562237cc54a220c46d6" ], "Parent": "e93c1a15f4fe0cabdece87fbad954ad7a382ccedeb24334e118e3e7dbe4b5332", "Comment": "", "Created": "2022-03-04T00:53:05.30462115Z", "Config": { "Env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" ], "Entrypoint": [ "/cloudprober", "--logtostderr" ], "Labels": { "com.microscaling.license": "Apache-2.0", "io.buildah.version": "1.23.1", "org.label-schema.build-date": "2022-02-03T17:44:39Z", "org.label-schema.name": "Cloudprober", "org.label-schema.vcs-ref": "87c4d42", "org.label-schema.vcs-url": "https://github.com/cloudprober/cloudprober", "org.label-schema.version": "v0.11.4" } }, "Version": "", "Author": "", "Architecture": "amd64", "Os": "linux", "Size": 37597201, "VirtualSize": 37597201, "GraphDriver": { "Name": "overlay", "Data": { "LowerDir": "/var/home/core/.local/share/containers/storage/overlay/263bbc5a92eefaa97313514e6b529bb0ea2ce41a77a7a072032bcc048f64eb13/diff:/var/home/core/.local/share/containers/storage/overlay/bd8a0b959097fa3cd5d3f32861a0a64eaa2c70203ba997c0f1b9d78e329e59a9/diff:/var/home/core/.local/share/containers/storage/overlay/01fd6df81c8ec7dd24bbbd72342671f41813f992999a3471b9d9cbc44ad88374/diff", "UpperDir": "/var/home/core/.local/share/containers/storage/overlay/b0e1fe57c82b6e6e751edf401e9022a5dbbf1dd547a009e2c7d6befa2eee1e1f/diff", "WorkDir": "/var/home/core/.local/share/containers/storage/overlay/b0e1fe57c82b6e6e751edf401e9022a5dbbf1dd547a009e2c7d6befa2eee1e1f/work" } }, "RootFS": { "Type": "layers", "Layers": [ "sha256:01fd6df81c8ec7dd24bbbd72342671f41813f992999a3471b9d9cbc44ad88374", "sha256:503aacf61ed5d42a83f53b799a8a9db04d97ec8bcaaf0e4b6fd444996e871e50", "sha256:6dcfbd8fd01b2caee7d30f68704959d00919395b1e6838cb17ab9c8ced22f798", "sha256:6a0241365673036d387cc582cff0c836f3a11c7cc00b197a4af460d48d24c6af" ] }, "Labels": { "com.microscaling.license": "Apache-2.0", "io.buildah.version": "1.23.1", "org.label-schema.build-date": "2022-02-03T17:44:39Z", "org.label-schema.name": "Cloudprober", "org.label-schema.vcs-ref": "87c4d42", "org.label-schema.vcs-url": "https://github.com/cloudprober/cloudprober", "org.label-schema.version": "v0.11.4" }, "Annotations": { "org.opencontainers.image.base.digest": "sha256:d3121c10aeca683132fb61a06b533f71629fdeb26503671e89531978453ff9ec", "org.opencontainers.image.base.name": "docker.io/cloudprober/cloudprober:latest" }, "ManifestType": "application/vnd.oci.image.manifest.v1+json", "User": "", "History": [ { "created": "2021-12-30T19:19:40.833034683Z", "created_by": "/bin/sh -c #(nop) ADD file:6db446a57cbd2b7f4cfde1f280177b458390ed5a6d1b54c6169522bc2c4d838e in / " }, { "created": "2021-12-30T19:19:41.006954958Z", "created_by": "/bin/sh -c #(nop) CMD ["sh"]", "empty_layer": true }, { "created": "2022-02-03T17:44:51.485420889Z", "created_by": "COPY ca-certificates.crt /etc/ssl/certs/ca-certificates.crt # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "COPY /stage-0-workdir/cloudprober / # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "ARG BUILD_DATE", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { 
"created": "2022-02-03T17:44:54.125768865Z", "created_by": "ARG VERSION", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "ARG VCS_REF", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "LABEL org.label-schema.build-date=2022-02-03T17:44:39Z org.label-schema.name=Cloudprober org.label-schema.vcs-url=https://github.com/cloudprober/cloudprober org.label-schema.vcs-ref=87c4d42 org.label-schema.version=v0.11.4 com.microscaling.license=Apache-2.0", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "ENTRYPOINT ["/cloudprober" "--logtostderr"]", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2022-03-04T00:53:05.305970995Z", "created_by": "/bin/sh -c #(nop) COPY file:e5d838cb326e5be33db8cb94bd29366f592a4f928a93fc1d878bfe288d662cbf in /etc/cloudprober.cfg ", "comment": "FROM docker.io/cloudprober/cloudprober:latest" } ], "NamesHistory": [ "hncr.io/cloudprober:latest" ] } ]

  • [k8s rds] Resources discovered on k8s use made up names and not IPs for probing

    Describe the bug

    Discovering endpoint resources on k8s for HTTP probes results in errors like this:

    W1021 20:59:13.816516 1 http.go:307] [cloudprober.sdc-service-edge] Target:edge_10.10.88.28_service-discovery, URL:https://edge_10.10.88.28_service-discovery:9090/ready, http.doHTTPRequest: Get "https://edge_10.10.88.28_service-discovery:9090/ready": dial tcp: lookup edge_10.10.88.28_service-discovery on 169.254.20.10:53: no such host

    This name isn't resolvable, and appears to be made up by Cloudprober. There don't appear to be any options to just use the discovered IP, rather than the resource name. ("resource" in this case refers to the internal Cloudprober concept of "endpoint resources", and not anything directly k8s related, as far as I can tell.)

    Cloudprober Version

    v0.11.9 I think.

    To Reproduce (slightly simplified config; I dropped some filters and some of the templates being expanded here.)

    probe {
        type: HTTP
        name: "my_service"
        targets {
            rds_targets {
                resource_path: "k8s://endpoints"
            }
        }
        interval: "30s"
        timeout: "5s"
        {{template "standard_latency_options" .}}
    
        http_probe {
            protocol: HTTPS
            relative_url: "/ready"
            port: 9090
        }
    
        validator {
            name: "status_code_2xx"
            http_validator {
                success_status_codes: "200-299"
            }
        }
    
        validator {
            name: "OK?"
            regex: "LIVE"
        }
    
    }
    
    rds_server {
      provider {
        kubernetes_config {
          endpoints {}
        }
      }
    }
    
    
  • errors when running ping tests on raspberry pi3. timestamp control message data size (8) is less than timestamp size (16 bytes)

    Describe the bug: Errors when running ping tests on a Raspberry Pi 3: "timestamp control message data size (8) is less than timestamp size (16 bytes)".

    The latest armv7 builds of Cloudprober have problems with ICMP tests. What could this be related to?

    v0.11.9

    To Reproduce

    test.cfg 
    probe {
      name: "icmp_dns_test"
      type: PING
      targets {
        host_names: "1.1.1.1,9.9.9.9"
      }
      interval_msec: 5000  # 5s
      timeout_msec: 1000   # 1s
    }
    
    cloudprober# ./cloudprober -config_file test.cfg  -logtostderr
    I0808 08:34:39.174278   22922 prober.go:111] [cloudprober.global] Creating a PING probe: icmp_dns_test
    I0808 08:34:39.175383   22922 prometheus.go:186] [cloudprober.prometheus] Initialized prometheus exporter at the URL: /metrics
    I0808 08:34:39.175919   22922 probestatus.go:165] [cloudprober.probestatus] Initialized status surfacer at the URL: probesstatus
    I0808 08:34:39.177631   22922 sysvars.go:186] [cloudprober.sysvars] 1659936879 labels=ptype=sysvars,probe=sysvars hostname="access" start_timestamp="1659936878" version="v0.11.9"
    I0808 08:34:43.918322   22922 prober.go:295] [cloudprober.global] Starting probe: icmp_dns_test
    W0808 08:34:48.960235   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:48.964727   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:48.985619   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:48.989977   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    cloudprober 1659936879175480089 1659936889 labels=ptype=sysvars,probe=sysvars hostname="access" start_timestamp="1659936878" version="v0.11.9"
    I0808 08:34:49.179533   22922 prometheus.go:261] [cloudprober.prometheus] Checking validity of new label: ptype
    cloudprober 1659936879175480090 1659936889 labels=ptype=sysvars,probe=sysvars cpu_usage_msec=175.860
    I0808 08:34:49.179723   22922 prometheus.go:261] [cloudprober.prometheus] Checking validity of new label: probe
    cloudprober 1659936879175480091 1659936889 labels=ptype=sysvars,probe=sysvars uptime_msec=10214.543 gc_time_msec=0.473 mallocs=22045 frees=10253
    I0808 08:34:49.179870   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: hostname
    cloudprober 1659936879175480092 1659936889 labels=ptype=sysvars,probe=sysvars goroutines=15 mem_stats_sys_bytes=11877372
    I0808 08:34:49.180017   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: start_timestamp
    I0808 08:34:49.180152   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: version
    I0808 08:34:49.180306   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: cpu_usage_msec
    I0808 08:34:49.180469   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: uptime_msec
    I0808 08:34:49.180607   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: gc_time_msec
    I0808 08:34:49.180738   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: mallocs
    I0808 08:34:49.180864   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: frees
    I0808 08:34:49.181004   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: goroutines
    I0808 08:34:49.181164   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: mem_stats_sys_bytes
    W0808 08:34:49.921089   22922 ping.go:342] [cloudprober.icmp_dns_test] read udp 0.0.0.0:3: i/o timeout
    W0808 08:34:53.960428   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:53.964440   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:53.985605   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:54.022854   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:54.921206   22922 ping.go:342] [cloudprober.icmp_dns_test] read udp 0.0.0.0:3: i/o timeout
    I0808 08:34:54.921741   22922 prometheus.go:261] [cloudprober.prometheus] Checking validity of new label: dst
    I0808 08:34:54.921911   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: total
    cloudprober 1659936879175480093 1659936893 labels=ptype=ping,probe=icmp_dns_test,dst=1.1.1.1 total=4 success=0 latency=0.000 validation_failure=map:validator,data-integrity:0
    I0808 08:34:54.922043   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: success
    cloudprober 1659936879175480094 1659936893 labels=ptype=ping,probe=icmp_dns_test,dst=9.9.9.9 total=4 success=0 latency=0.000 validation_failure=map:validator,data-integrity:0
    I0808 08:34:54.922162   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: latency
    I0808 08:34:54.922285   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: validation_failure
    I0808 08:34:54.922428   22922 prometheus.go:261] [cloudprober.prometheus] Checking validity of new label: validator
    
    cloudprober# cat /etc/os-release 
    PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
    NAME="Raspbian GNU/Linux"
    VERSION_ID="10"
    VERSION="10 (buster)"
    VERSION_CODENAME=buster
    ID=raspbian
    ID_LIKE=debian
    
    model name      : ARMv7 Processor rev 4 (v7l)
    
    Hardware        : BCM2835
    Revision        : a02082
    Serial          : 00000000a9eb2a7f
    Model           : Raspberry Pi 3 Model B Rev 1.2
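
    For context, a sketch of the likely mismatch, assuming the warning comes from decoding the SO_TIMESTAMP control message: on 32-bit ARM the kernel's struct timeval is two 32-bit fields (8 bytes), while code written against 64-bit platforms expects two 64-bit fields (16 bytes). This is an illustration, not Cloudprober's actual ping code:

    ```go
    package main

    import (
        "encoding/binary"
        "fmt"
        "time"
    )

    // timeFromTimeval decodes a SO_TIMESTAMP control-message payload into a
    // time.Time, accepting both the 8-byte (32-bit ABI, e.g. armv7) and the
    // 16-byte (64-bit ABI) layouts instead of rejecting the short one.
    func timeFromTimeval(data []byte) (time.Time, error) {
        switch len(data) {
        case 8: // struct timeval { int32 tv_sec; int32 tv_usec; }
            sec := int64(int32(binary.LittleEndian.Uint32(data[0:4])))
            usec := int64(int32(binary.LittleEndian.Uint32(data[4:8])))
            return time.Unix(sec, usec*1000), nil
        case 16: // struct timeval { int64 tv_sec; int64 tv_usec; }
            sec := int64(binary.LittleEndian.Uint64(data[0:8]))
            usec := int64(binary.LittleEndian.Uint64(data[8:16]))
            return time.Unix(sec, usec*1000), nil
        default:
            return time.Time{}, fmt.Errorf("unexpected timeval size: %d bytes", len(data))
        }
    }

    func main() {
        // An 8-byte timeval as a 32-bit ARM kernel would produce it.
        buf := make([]byte, 8)
        binary.LittleEndian.PutUint32(buf[0:4], 1659936888) // tv_sec
        binary.LittleEndian.PutUint32(buf[4:8], 960235)     // tv_usec
        fmt.Println(timeFromTimeval(buf))
    }
    ```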
    
    
  • External probe: Defunct processes may remain behind after probe timeout

    Describe the bug

    When we use an external probe that forks a child process, the defunct child process may remain behind after probe timeout.

    Since Cloudprober is PID 1 in a container environment, I think Cloudprober should reap defunct processes like init does. Alternatively, we could use another entrypoint that reaps defunct processes, such as BusyBox or Tini.

    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         137  4.0  0.0   5892  2996 ?        Rs   03:00   0:00 ps auxfw
    root           1  0.3  0.7 740840 30028 ?        Ssl  02:59   0:00 /cloudprober --logtostderr -config_file /opt/cloudprober.cfg
    root          14  0.0  0.0      0     0 ?        Z    02:59   0:00 [sleep] <defunct>
    root          40  0.0  0.0      0     0 ?        Z    02:59   0:00 [sleep] <defunct>
    root          74  0.0  0.0      0     0 ?        Z    03:00   0:00 [sleep] <defunct>
    root         108  0.0  0.0      0     0 ?        Z    03:00   0:00 [sleep] <defunct>
    root         135  0.0  0.0   1316     4 ?        S    03:00   0:00 /bin/sh /opt/probe.sh
    root         136  0.0  0.0   1308     4 ?        S    03:00   0:00  \_ /bin/sleep 3
    

    Cloudprober Version

    v0.11.8

    To Reproduce Steps to reproduce the behavior:

    1. Place the config and the probe script

    probe.sh

    #!/bin/sh
    /bin/sleep 3
    echo done
    

    cloudprober.cfg

    probe {
      name: "dummy"
      type: EXTERNAL
      interval_msec: 5000
      timeout_msec: 1000
      targets { dummy_targets {} }
      external_probe {
        mode: ONCE
        command: "/opt/probe.sh"
      }
    }
    
    2. Run Cloudprober in Docker:
    docker run --name cp-sleep --rm -v $PWD:/opt/ cloudprober/cloudprober:v0.11.8 -config_file /opt/cloudprober.cfg

    3. Defunct sleep processes remain behind:
    docker run --pid container:cp-sleep ubuntu:20.04 ps auxfw
    
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         137  4.0  0.0   5892  2996 ?        Rs   03:00   0:00 ps auxfw
    root           1  0.3  0.7 740840 30028 ?        Ssl  02:59   0:00 /cloudprober --logtostderr -config_file /opt/cloudprober.cfg
    root          14  0.0  0.0      0     0 ?        Z    02:59   0:00 [sleep] <defunct>
    root          40  0.0  0.0      0     0 ?        Z    02:59   0:00 [sleep] <defunct>
    root          74  0.0  0.0      0     0 ?        Z    03:00   0:00 [sleep] <defunct>
    root         108  0.0  0.0      0     0 ?        Z    03:00   0:00 [sleep] <defunct>
    root         135  0.0  0.0   1316     4 ?        S    03:00   0:00 /bin/sh /opt/probe.sh
    root         136  0.0  0.0   1308     4 ?        S    03:00   0:00  \_ /bin/sleep 3
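
    As a workaround, a PID 1 process can reap orphaned children itself. A minimal sketch of such a reaper (generic Go, not Cloudprober's actual code), assuming Linux and the standard syscall package:

    ```go
    package main

    import (
        "log"
        "os"
        "os/signal"
        "syscall"
    )

    // reapChildren waits for SIGCHLD and collects the exit status of any
    // terminated child processes, so that no <defunct> entries linger when
    // this process runs as PID 1 in a container.
    func reapChildren() {
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGCHLD)
        for range sigs {
            for {
                var status syscall.WaitStatus
                pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
                if pid <= 0 || err != nil {
                    break // no more exited children right now
                }
                log.Printf("reaped child pid=%d status=%d", pid, status.ExitStatus())
            }
        }
    }

    func main() {
        go reapChildren()
        select {} // stand-in for the rest of the process (e.g. running probes)
    }
    ```

    Note that reaping inside the process can race with the exit-status collection that os/exec does for children the prober started itself, which is one reason a dedicated init such as Tini is often the simpler fix.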
    
  • HTTP probes don't refresh bearer tokens

    Describe the bug: My probes against a Google API were initially succeeding, but started failing with a 401 after an hour. Upon further debugging, I realized:

    • The HTTP request is created once and used repeatedly for each probe request.
    • The bearer token header is set at the time of HTTP request creation.
    • So in effect, the same bearer token is used over and over. If the bearer token is short-lived, like an access token, the probe will fail when the token expires.

    Practically speaking, cloudprober can't be used against many web APIs until this bug is fixed.

    Note that cloudprober is getting new access tokens but just isn't using them in the HTTP request. I was able to fix this by simply creating a new HTTP request instance for each probe request.

    Cloudprober Version 0.11.4 but I suspect it affects the past few versions.

    To Reproduce Steps to reproduce the behavior:

    1. Decide on a Google API to test.
    2. Open Google Cloud's Cloud Shell (shouldn't cost any).
    3. git clone and compile cloudprober.
    4. Create a cloudprober.cfg similar to the following:
    probe {
      name: "foo"
      type: HTTP
      targets {
        host_names: "foo.googleapis.com"
      }
      http_probe {
        relative_url: "/path/to/api/endpoint"
        protocol: HTTPS
        method: POST
        headers {
          name: "content-type"
          value: "application/json"
        }
        # Will auth with the Cloud Shell user's access creds.
        oauth_config {
          google_credentials {
          }
        }
      }
      interval_msec: 30000
      timeout_msec: 1000
      validator {
        name: "status_code_200"
        http_validator {
          success_status_codes: "200"
        }
      }
    }
    
    5. Run cloudprober in Cloud Shell.

    Expected: 200. Actual: 200 at first, then 401s once the token expires.
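
    A minimal sketch of the fix described above: build the request and its Authorization header fresh for each probe cycle, pulling the current token from an oauth2.TokenSource so a refreshed access token is picked up automatically. Names and the probe loop are illustrative, not Cloudprober's internals:

    ```go
    package main

    import (
        "context"
        "fmt"
        "io"
        "net/http"
        "time"

        "golang.org/x/oauth2"
        "golang.org/x/oauth2/google"
    )

    // probeOnce creates a brand-new request and re-reads the token source, so
    // an expired access token is transparently replaced by a refreshed one.
    func probeOnce(ctx context.Context, url string, ts oauth2.TokenSource) (int, error) {
        tok, err := ts.Token() // cached token, refreshed here if it has expired
        if err != nil {
            return 0, err
        }
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return 0, err
        }
        tok.SetAuthHeader(req) // sets "Authorization: Bearer <access token>"
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()
        io.Copy(io.Discard, resp.Body)
        return resp.StatusCode, nil
    }

    func main() {
        ctx := context.Background()
        // Application-default credentials, as with google_credentials {} above.
        ts, err := google.DefaultTokenSource(ctx, "https://www.googleapis.com/auth/cloud-platform")
        if err != nil {
            panic(err)
        }
        for {
            code, err := probeOnce(ctx, "https://foo.googleapis.com/path/to/api/endpoint", ts)
            fmt.Println("status:", code, "err:", err)
            time.Sleep(30 * time.Second)
        }
    }
    ```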

  • Migrate ingresses from v1beta1 to v1 API

    Hello all, as you know, the networking.k8s.io/v1beta1 API version of Ingress is no longer served as of Kubernetes v1.22. Therefore Cloudprober fails to list the available ingress targets.

    client.go:72] [cloudprober.rds-server] kubernetes.client: getting URL: https://100.64.0.1:443/apis/networking.k8s.io/v1beta1/ingresses
    ingresses.go:196] [cloudprober.rds-server] ingressesLister.expand(): error while getting ingresses list from API: HTTP response status code: 404, status: 404 Not Found
    

    Can we please merge this change and cut a release to fix the issue? We are not able to use Cloudprober after our upgrade to v1.22.

  • https probe with resolve_first failing due to certificate mismatch

    Describe the bug: Using http_probe in HTTPS mode with the resolve_first: true option causes certificate-mismatch errors, because the actual request is made to the resolved IP address without setting the TLS server_name to the original host name. This can be mitigated by manually configuring tls_config.server_name in http_probe, but that falls short when there are multiple targets with different target hosts.

    Cloudprober Version v0.11.8

    To Reproduce Steps to reproduce the behavior:

    1. Minimal configuration (I replaced target fqdn and ip address with dummy values):
      probe {
        name: "reproducer"
        type: HTTP
        targets {
          file_targets {
            file_path: "resources.json"
          }
        }
        http_probe {
          protocol: HTTPS
          resolve_first: true
          #tls_config {
          #  server_name: "www.example.org"
          #}
        }
      }
      

      resources.json:

      {
        "resources": [
          {
            "name": "www.example.org",
            "ip": "127.0.0.1",
            "port": 443,
            "labels": {
              "fqdn": "www.example.org"
            }
          }
        ]
      }
      
    2. Run cloudprober:

       W0608 15:04:12.985724 1 http.go:321] [cloudprober.reproducer] Target:www.example.org, URL:https://127.0.0.1:443, http.doHTTPRequest: Get "https://127.0.0.1:443": x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs

    Additional context: Explicitly configuring tls_config.server_name resolves the issue but requires writing a dedicated probe for every target. Unfortunately, substitutions do not work here: being able to use server_name: "@target.label.fqdn@" would have been nice, but the value gets passed verbatim. Setting resolve_first: true is important for my use case because I want to override the target host with the IP address configured in the target's "ip" field.
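
    For reference, a minimal Go sketch of what the requested behaviour amounts to: connect to the resolved IP, but keep SNI and certificate validation pinned to the original host name via tls.Config.ServerName. The host and IP values mirror the example above:

    ```go
    package main

    import (
        "crypto/tls"
        "fmt"
        "net/http"
    )

    func main() {
        // Connect to the resolved IP from the targets file...
        const ipURL = "https://127.0.0.1:443/"
        // ...but validate the certificate (and send SNI) for the real FQDN.
        client := &http.Client{
            Transport: &http.Transport{
                TLSClientConfig: &tls.Config{ServerName: "www.example.org"},
            },
        }
        resp, err := client.Get(ipURL)
        if err != nil {
            fmt.Println("probe failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.StatusCode)
    }
    ```

    A complete fix would likely also set the HTTP Host header to the FQDN so that virtually-hosted servers route the request correctly.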

  • Probes status in Cloudprober UI

    Currently Cloudprober runs probes and surfaces the generated data (success, failure, latency, etc.) to other metrics systems like Prometheus, CloudWatch, etc. It doesn't itself expose the data directly to users in a way that is easy to interpret. It would be nice if it did.

  • Add resolved IP of http-probes' targets into a label

    In short, I am proposing to add a label that tells the user which IP a hostname was resolved to. This helps debug unstable DNS configs or problems related to DNS load balancing.

    Use case:

    Imagine you use DNS load balancing and, thanks to your Cloudprober metrics, you get a warning that something is wrong. However, it just tells you that https://example.net/awesome-api is down. That URL resolves to different deployments depending on the origin of the request and/or the load situation, so you only know that at least one deployment is broken, but not which one. This already helps, but it would be even better to know which deployment (in other words, which IP) the failed request actually went to.

    Details/additional notes:

    In request.go line 112 that information is already present. If my understanding of the source code is correct, the url_host variable contains the target's IP as retrieved from DNS. It is, however, only available internally, and I found no option to configure Cloudprober to save that data as a label value. I would recommend adding an option to HTTP probes that allows this. Note that enabling this feature might lead to a potentially huge number of metric series if the tested hostname resolves to a new IP on each request; from my experience this is very unlikely, but I still want to point it out.
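
    A sketch of how the resolved address can be captured per request with net/http/httptrace; this is generic Go, not a description of Cloudprober's request.go:

    ```go
    package main

    import (
        "fmt"
        "net/http"
        "net/http/httptrace"
    )

    func main() {
        req, _ := http.NewRequest(http.MethodGet, "https://example.net/awesome-api", nil)

        var resolvedAddr string
        trace := &httptrace.ClientTrace{
            // GotConn fires once a connection is chosen for this request; its
            // remote address is the IP the probe actually talked to, and could
            // be exported as a metric label such as resolved_ip.
            GotConn: func(info httptrace.GotConnInfo) {
                resolvedAddr = info.Conn.RemoteAddr().String()
            },
        }
        req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        resp.Body.Close()
        fmt.Println("status:", resp.StatusCode, "resolved_ip:", resolvedAddr)
    }
    ```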

  • Add a schedule window option to define when the probes should run

    Currently the probes run 24x7 at the intervals defined in the config file. It would be good to have a way to schedule the time window in which a probe should run. If some workloads are unavailable or shut down at specific times (e.g., weekends), I would not want Cloudprober to reach out to those URLs.
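
    A sketch of the kind of gate this would need internally, assuming a simple weekday/hour window; this is an illustration, not an existing Cloudprober option:

    ```go
    package main

    import (
        "fmt"
        "time"
    )

    // inProbeWindow reports whether probes should run right now: weekdays
    // between 08:00 and 20:00 local time, skipping weekends entirely.
    func inProbeWindow(now time.Time) bool {
        switch now.Weekday() {
        case time.Saturday, time.Sunday:
            return false
        }
        return now.Hour() >= 8 && now.Hour() < 20
    }

    func main() {
        for range time.Tick(30 * time.Second) {
            if !inProbeWindow(time.Now()) {
                continue // outside the schedule window: leave the targets alone
            }
            fmt.Println("running probe at", time.Now().Format(time.RFC3339))
        }
    }
    ```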

  • README.md uses outdated StackDriver term

    Hi, I immediately get the feeling that something is outdated when I read StackDriver. I propose changing the text in the picture and the text in the README.md to Operations Suite. I would proceed with a PR if you give me a thumbs up?

  • add support for setting environment variables for external probes

    Adds the ability to configure environment variables for external probers.

    This is useful for Go binaries where configuration is done via environment variables during init() functions (an anti-pattern, I know, but we can't fix every upstream project to be configurable after init).
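
    Mechanically, this amounts to putting extra variables into the child's environment when the external probe command is started. A generic Go sketch (the variable names are examples, not a defined Cloudprober config option):

    ```go
    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    func main() {
        cmd := exec.Command("/opt/probe.sh")
        // Inherit the prober's environment and append probe-specific variables.
        cmd.Env = append(os.Environ(),
            "PROBE_TARGET=example.org",
            "HTTP_PROXY=http://proxy.internal:3128",
        )
        out, err := cmd.CombinedOutput()
        fmt.Printf("output: %s err: %v\n", out, err)
    }
    ```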

  • [common.oauth] Support JSON format from Bearer token sources

    See this: https://github.com/cloudprober/cloudprober/blob/master/common/oauth/proto/config.proto

    Currently the bearer token type expects file and command sources to return only the access token string. We should support reading JSON as well, so that we can get additional information about the token, e.g. the expiration time (typically expires_in).
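
    A sketch of the parsing this would involve, assuming the file or command source returns a standard OAuth2 token response (field names as in RFC 6749); the struct and fallback behaviour here are illustrative, not the proposed config:

    ```go
    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
        "time"
    )

    // tokenJSON models the usual OAuth2 token response. If unmarshalling
    // fails, the source falls back to treating the payload as a plain token
    // string, preserving today's behaviour.
    type tokenJSON struct {
        AccessToken string `json:"access_token"`
        TokenType   string `json:"token_type"`
        ExpiresIn   int    `json:"expires_in"` // seconds until expiry
    }

    func parseTokenOutput(out string) (token string, expiry time.Time) {
        var tj tokenJSON
        if err := json.Unmarshal([]byte(out), &tj); err == nil && tj.AccessToken != "" {
            return tj.AccessToken, time.Now().Add(time.Duration(tj.ExpiresIn) * time.Second)
        }
        return strings.TrimSpace(out), time.Time{} // plain token string, no expiry info
    }

    func main() {
        tok, exp := parseTokenOutput(`{"access_token":"ya29.example","token_type":"Bearer","expires_in":3599}`)
        fmt.Println(tok, "expires:", exp.Format(time.RFC3339))
    }
    ```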
