M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform

M3

Distributed TSDB and Query Engine, Prometheus Sidecar, Metrics Aggregator, and more, including a Graphite-compatible storage and query engine.

More Information

Community Meetings

You can find recordings of past meetups here: https://vimeo.com/user/120001164/folder/2290331.

Install

Dependencies

The simplest and quickest way to try M3 is to use Docker; read the M3 quickstart section for other options.

This example uses jq to format the output of API calls. It is not essential for using M3DB.

Usage

The following is a simplified version of the M3 quickstart guide; we suggest reading it for more detail.

  1. Start a Container
docker run -p 7201:7201 -p 7203:7203 --name m3db -v $(pwd)/m3db_data:/var/lib/m3db quay.io/m3db/m3dbnode:v1.0.0
  2. Create a Placement and Namespace
#!/bin/bash
curl -X POST http://localhost:7201/api/v1/database/create -d '{
  "type": "local",
  "namespaceName": "default",
  "retentionTime": "12h"
}' | jq .
  3. Ready a Namespace
curl -X POST http://localhost:7201/api/v1/services/m3db/namespace/ready -d '{
  "name": "default"
}' | jq .
  4. Write Metrics
#!/bin/bash
curl -X POST http://localhost:7201/api/v1/json/write -d '{
  "tags": {
    "__name__": "third_avenue",
    "city": "new_york",
    "checkout": "1"
  },
  "timestamp": '\"$(date "+%s")\"',
  "value": 3347.26
}'
  5. Query Results

Linux

curl -X "POST" -G "http://localhost:7201/api/v1/query_range" \
  -d "query=third_avenue" \
  -d "start=$(date "+%s" -d "45 seconds ago")" \
  -d "end=$( date +%s )" \
  -d "step=5s" | jq .  

macOS/BSD

curl -X "POST" -G "http://localhost:7201/api/v1/query_range" \
  -d "query=third_avenue" \
  -d "start=$(date -v -45S "+%s")" \
  -d "end=$( date +%s )" \
  -d "step=5s" | jq .

Contributing

You can ask questions and give feedback through the project's community channels and GitHub issues.

M3 welcomes pull requests; read the contributing guide to get set up for building and contributing to M3.


This project is released under the Apache License, Version 2.0.

Comments
  • Parallelize commitlog reader

    Parallelize commitlog reader

    This is PR 2 of three PRs that together will improve the performance of the commitlog bootstrapper. This PR parallelizes the commitlog bootstrapper (specifically, it parallelizes the msgpack decoding).
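
    Below, as a hypothetical Go sketch (types and names are illustrative, not the actual m3db commitlog package), is the kind of fan-out this describes: raw commitlog entries are handed to a fixed pool of workers, each decoding independently, since a streaming msgpack decoder is not safe for concurrent use.

    package sketch

    import "sync"

    // rawEntry is an undecoded commitlog entry; decodedEntry is its decoded form.
    type rawEntry []byte
    type decodedEntry struct{ seriesID string }

    // decodeParallel fans raw entries out to a fixed number of decode workers
    // and merges the decoded results onto a single output channel.
    func decodeParallel(raw <-chan rawEntry, workers int, decode func(rawEntry) decodedEntry) <-chan decodedEntry {
        out := make(chan decodedEntry, workers)
        var wg sync.WaitGroup
        wg.Add(workers)
        for i := 0; i < workers; i++ {
            go func() {
                defer wg.Done()
                for e := range raw {
                    out <- decode(e) // each worker owns its own decoder state
                }
            }()
        }
        go func() {
            wg.Wait() // close the output once every worker has drained the input
            close(out)
        }()
        return out
    }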

  • coordinator cpu contention when running 1m aggregation of samples

    coordinator cpu contention when running 1m aggregation of samples

    I'm running coordinator on a dedicated 32 core box and running about 90k samples/s through it (after replication, with RF=3). I'm using 0.4.6, have 1 Prometheus, and 6 db instances. It generally uses about 3 cores, and I've got it sending to an unaggregated namespace and a namespace aggregating at the 1h level. Here's what the baseline looks like: m3coordinator.only_hourly.pb.gz

    I turned on a 1m aggregation namespace, and CPU usage jumped to 28 cores, mostly in system time. Here's my CPU graph for the box (image attached), and here are 3 profiles grabbed in that window, 2 for 15s and one for 60s: m3coordinator.slow_15s_1.pb.gz m3coordinator.slow_15s_2.pb.gz m3coordinator.slow_60s.pb.gz

    When I disabled the 1m aggregation level and restarted the coordinator (15:37 in the graph), system time was still high, but recovered. Here's a profile about 5m after the restart: m3coordinator.in_recovery.pb.gz

    Let me know what else I can provide. Thanks, Bert

  • Parallelize commitlog bootstrapper

    Parallelize commitlog bootstrapper

    This is PR 3 of three PRs that together will improve the performance of the commitlog bootstrapper. This PR parallelizes the commitlog bootstrapper itself.
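
    A rough sketch of the bootstrapper-side parallelism (an assumed shape, not the actual code): once entries are decoded they are grouped by shard, and each shard is applied concurrently under a bounded worker count so shards do not contend on shared state.

    package sketch

    import "sync"

    type entry struct{ seriesID string }

    // bootstrapShards applies each shard's entries concurrently, bounded by a
    // small semaphore so an unbounded number of goroutines is never spawned.
    func bootstrapShards(entriesByShard map[uint32][]entry, apply func(uint32, []entry)) {
        var wg sync.WaitGroup
        sem := make(chan struct{}, 8) // cap concurrency at 8 workers
        for shard, entries := range entriesByShard {
            wg.Add(1)
            sem <- struct{}{}
            go func(shard uint32, entries []entry) {
                defer wg.Done()
                defer func() { <-sem }()
                apply(shard, entries) // per-shard application avoids cross-shard locking
            }(shard, entries)
        }
        wg.Wait()
    }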

  • Add license scan report and status

    Add license scan report and status

    Your FOSSA integration was successful! Attached in this PR is a badge and license report to track scan status in your README.

    Below are docs for integrating FOSSA license checks into your CI:

  • PromQL query returns unrelated series

    PromQL query returns unrelated series

    As mentioned on Gitter, I ran into a strange issue with queries against M3DB returning unrelated series. This results in Grafana throwing errors like "many-to-many matching not allowed" or "Multiple Series Error" on, for example, the default Prometheus node-exporter dashboard.


    Prometheus: 2.3.2
    M3Coordinator: 0.4.1 (running on the same host as Prometheus)
    M3DB: 0.4.1 (4-node cluster, 2 replica sets of 2 nodes each, with 3 seed nodes)
    Prometheus config: read_recent: true

    PromQL (latest value):

    up{job="node-exporter", instance="m3db-node04:9100"}
    

    Result (omitted the values):

    up{instance="m3db-node04:9100",job="node-exporter"}
    

    PromQL (1 day window):

    up{job="node-exporter", instance="m3db-node04:9100"}[1d]
    

    Result:

    up{instance="m3db-node04:9100",job="node-exporter"}
    node_memory_CmaFree_bytes{instance="m3db-node03:9100",job="node-exporter"}
    

    I didn't expect to get a 'node_memory_CmaFree_bytes' series; it doesn't even match the instance...


    PromQL (2 day window):

    up{job="node-exporter", instance="m3db-node04:9100"}[2d]
    

    Result:

        up{instance="m3db-node04:9100",job="node-exporter"}
        node_network_transmit_packets_total{device="lo",instance="m3db-node04:9100",job="node-exporter"}
        node_memory_CmaFree_bytes{instance="m3db-node03:9100",job="node-exporter"}
        go_memstats_alloc_bytes{instance="m3db-node03:9100",job="node-exporter"}
    

    These extra series have values up to 'now', so they are not old series or something.


    PromQL (1 week window): A total of 6 series show up.


    PromQL (2 week window):

    up
    

    Result: All the 'up' series show up (with different instance tags, of course), but no other series.


    I have two Prometheus instances scraping the same nodes. On the 2nd node I changed the config to read_recent: false. The 2nd node only shows the requested series, as expected.

  • Strip fields in /labels endpoints

    Strip fields in /labels endpoints

    Currently, /labels endpoint does not strip the labels passed in the Restrict-By-Tags-JSON header.

    This PR fixes it by stripping the labels as part of the JSON rendering (RenderListTagResultsJSON). The labels to strip were added to the RenderSeriesMetadataOptions struct, since that struct is used by other response writers which should support stripping anyway; that can be done in a future PR.

    Also contains a test.
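
    For illustration, a minimal Go sketch of the stripping step, assuming the restricted label names arrive as a set; the function and parameter names are invented, not the actual RenderListTagResultsJSON code.

    package sketch

    // stripLabels drops any label names that were passed in the
    // Restrict-By-Tags-JSON header before the result is rendered.
    func stripLabels(labels []string, restricted map[string]struct{}) []string {
        out := labels[:0] // filter in place, reusing the backing array
        for _, l := range labels {
            if _, ok := restricted[l]; ok {
                continue // restricted label: omit it from the rendered response
            }
            out = append(out, l)
        }
        return out
    }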

  • Metrics Migration from Graphite

    Metrics Migration from Graphite

    Do we have best practices around migrating a large volume of metrics from Graphite to M3, or any provided tools? We can build tools for that, but I'm curious whether anything already exists.

  • Support Go 1.9.X

    Support Go 1.9.X

    1. Add a linter to automatically detect usage of map[time.Time]
    2. Fix remaining instances of map[time.Time]
    3. Fix tests that do the equivalent of time.Time == time.Time
    4. Add Go 1.9.2 to travis.yml
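
    The reason items 1-3 are needed (a short illustrative example, not code from the repo): Go 1.9 added a monotonic clock reading to time.Time, so == (and therefore map keys of type time.Time) can treat two values of the same instant as unequal.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        now := time.Now()            // carries a monotonic reading in Go 1.9+
        same := now.Round(0)         // Round(0) strips the monotonic reading
        fmt.Println(now == same)     // false: == also compares the monotonic part
        fmt.Println(now.Equal(same)) // true: Equal compares the instant only

        // Keying a map by int64 nanoseconds avoids the problem entirely.
        byTime := map[int64]string{now.UnixNano(): "ok"}
        fmt.Println(byTime[same.UnixNano()]) // "ok"
    }
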
  • [dbnode] Use ref to segment data for index results instead of alloc each

    [dbnode] Use ref to segment data for index results instead of alloc each

    What this PR does / why we need it:

    This greatly reduces the memory allocation usage when querying the index as it ensures the lifetime of the segment survives the query and returns just references to the index data.
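
    Roughly the allocation difference being described, in hypothetical form (not the real dbnode types): returning a sub-slice of the segment's backing data costs nothing per result, but the segment must outlive the query, which is why the lifetime guarantee above matters.

    package sketch

    // segment wraps the (for example, mmap'd) backing bytes of an index segment.
    type segment struct {
        data []byte
    }

    // fieldCopy allocates and copies a new buffer for every result.
    func (s *segment) fieldCopy(start, end int) []byte {
        out := make([]byte, end-start)
        copy(out, s.data[start:end])
        return out
    }

    // fieldRef returns a reference into the segment's data: zero allocations,
    // but only valid while the segment is still open.
    func (s *segment) fieldRef(start, end int) []byte {
        return s.data[start:end]
    }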

    Special notes for your reviewer:

    Does this PR introduce a user-facing and/or backwards incompatible change?:

    NONE
    

    Does this PR require updating code package or user-facing documentation?:

    NONE
    
  • Read Bloom Filters into memory inside of seeker

    Read Bloom Filters into memory inside of seeker

    • [x] Add a function for reading out the bloom filter from disk using an anonymous mmap region + validate the digest + write appropriate unit tests
    • [x] Add container object for BloomFilter for lifecycle management (freeing the mmap'd region when we're done with it)
    • [x] Read bloom filter in seeker with lifecycle management ( construct in Open() and release resources in Close() )
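
    A condensed sketch of that lifecycle, assuming the digest is an Adler-32 checksum and that the container simply munmaps on Close; the type and helper names are illustrative, not the m3db fs package.

    package sketch

    import (
        "errors"
        "hash/adler32"
        "os"

        "golang.org/x/sys/unix"
    )

    var errDigestMismatch = errors.New("bloom filter digest mismatch")

    // bloomFilterContainer owns an anonymously mmap'd copy of the bloom filter
    // bytes and releases the mapping when Close is called.
    type bloomFilterContainer struct {
        data []byte
    }

    // readBloomFilter reads the bloom filter file, validates its digest, and
    // copies it into an anonymous mmap region so it can be freed explicitly.
    func readBloomFilter(path string, expectedDigest uint32) (*bloomFilterContainer, error) {
        raw, err := os.ReadFile(path)
        if err != nil {
            return nil, err
        }
        if adler32.Checksum(raw) != expectedDigest {
            return nil, errDigestMismatch
        }
        buf, err := unix.Mmap(-1, 0, len(raw), unix.PROT_READ|unix.PROT_WRITE,
            unix.MAP_ANON|unix.MAP_PRIVATE)
        if err != nil {
            return nil, err
        }
        copy(buf, raw)
        return &bloomFilterContainer{data: buf}, nil
    }

    // Close frees the mmap'd region; the container must not be used afterwards.
    func (b *bloomFilterContainer) Close() error { return unix.Munmap(b.data) }
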
  • Update fileset index format to sorted string order with summaries and bloom filter

    Update fileset index format to sorted string order with summaries and bloom filter

    This is the beginning of a set of changes to remove all block references from memory, except those that are being written to and those that have recently been loaded as cached blocks due to reads.

    This will write out the new file format and support reading it.

    For now the seeker still loads the IDs and offsets into memory; the next change will update it to load the summaries file and resolve IDs by reading the index entries on disk after searching the summaries for a jumping-off point.
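
    As a hypothetical illustration of that lookup (names invented, not the real seeker): the summaries form a sorted, sparse set of (ID, offset) pairs, a binary search picks the jumping-off point, and the index file is then scanned forward from that offset on disk.

    package sketch

    import "sort"

    // summaryEntry is a sparse entry mapping a series ID to its offset in the
    // full index file on disk.
    type summaryEntry struct {
        id          string
        indexOffset int64
    }

    // jumpOffOffset binary-searches the in-memory summaries (sorted by ID) and
    // returns the offset at which to start scanning index entries on disk.
    func jumpOffOffset(summaries []summaryEntry, id string) int64 {
        if len(summaries) == 0 {
            return 0 // no summaries: scan from the start of the index file
        }
        i := sort.Search(len(summaries), func(i int) bool {
            return summaries[i].id > id
        })
        if i == 0 {
            return summaries[0].indexOffset // target sorts before the first summary
        }
        return summaries[i-1].indexOffset // closest summary at or before the target ID
    }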

  • Investigate Sudden Surge in "too far in past" errors in M3 Aggregator

    Investigate Sudden Surge in "too far in past" errors in M3 Aggregator

    We are using the fleet-of-m3-coordinators-and-m3-aggregators topology to aggregate metrics before sending them to downstream Grafana remote storage: https://m3db.io/docs/how_to/any_remote_storage/#fleet-of-m3-coordinators-and-m3-aggregators

    Apps ---> otel collector Prometheus Receiver -> M3 Coordinator --> M3 Aggregators --> M3 Coordinator (Aggregated metrics )-> Prometheus Remote Write to Grafana

    We are observing a sudden surge in M3 Aggregation errors of the type "too far in past" after around ~30 hours of traffic (1.5 full days of prod traffic). I am not sure if it's just caused by load, since 1) the cluster looks very stable for the first full day of prod traffic and 2) the increase in these errors is not gradual. It really is a sudden increase, and it eventually causes the M3 Coordinators to OOM and crash.

    The surge in the above errors is accompanied by an increase in the ingest latency metric (Screenshot 2023-01-05 at 11 25 03 AM).

    Topology Details:

    25 Otel Collector Pods -> 25 M3 Coordinator Nodes -> 24 M3 Agg Nodes - Shards 512 - RF 2

    M3 Agg End to End details dashboard snapshot.pdf M3 Coordinator Dashboard snapshot.pdf

    I suspect something might be wrong with my configs

    M3 Agg Log example:

    {"level":"error","ts":1672883758.452722,"msg":"could not process message","error":"datapoint for aggregation too far in past: off_by=28m19.550704118s, timestamp=2023-01-05T01:22:23Z, past_limit=2023-01-05T01:50:   43Z, timestamp_unix_nanos=1672881743902000000, past_limit_unix_nanos=1672883443452704118","errorCauses":[{"error":"datapoint for aggregation too far in past: off_by=28m19.550704118s, timestamp=2023-01-05T01:22:    23Z, past_limit=2023-01-05T01:50:43Z, timestamp_unix_nanos=1672881743902000000, past_limit_unix_nanos=1672883443452704118"}],"shard":854,"proto":"type:TIMED_METRIC_WITH_METADATAS timed_metric_with_metadatas:       <metric:<type:GAUGE id:\"u'\\n\\000\\010\\000__name__\\032\\000varz_disconnect_end_bucket\\n\\000__rollup__\\004\\000true\\007\\000appname\\n\\000uber-                                                               voice\\013\\000client_type\\007\\000dialpad\\007\\000country\\002\\000mx\\010\\000instance\\014\\0000.0.0.0:8080\\003\\000job\\r\\000log-processor\\002\\000le\\005\\0000.                                            025\\006\\000module\\006\\000answer\\007\\000version\\n\\0002212-02-49\" time_nanos:1672881743902000000 > metadatas:<metadatas:<cutover_nanos:1672771058627771735 metadata:<pipelines:<aggregation_id:<id:128 >       storage_policies:<resolution:<window_size:15000000000 precision:1000000000 > retention:<period:7200000000000 > > pipeline:<> > > > > > "} 
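
    For reference, a minimal sketch of the check that this error message implies, driven by the bufferDurationForPastTimedMetric value in the aggregator config below (the function is illustrative, not the actual m3aggregator code). With bufferPast = 5m and the timestamps in the log above, off_by works out to roughly 28m19s.

    package sketch

    import (
        "fmt"
        "time"
    )

    // checkPastLimit rejects datapoints older than now minus the configured
    // bufferDurationForPastTimedMetric (5m in the config below).
    func checkPastLimit(now, ts time.Time, bufferPast time.Duration) error {
        pastLimit := now.Add(-bufferPast)
        if ts.Before(pastLimit) {
            return fmt.Errorf("datapoint for aggregation too far in past: off_by=%s, timestamp=%s, past_limit=%s",
                pastLimit.Sub(ts), ts.UTC().Format(time.RFC3339), pastLimit.UTC().Format(time.RFC3339))
        }
        return nil
    }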
    
    

    Configs: M3 Coordinator

    listenAddress: 0.0.0.0:7201
    
    logging:
      level: info
    
    metrics:
      scope:
        prefix: "m3coordinator"
      prometheus:
        handlerPath: /metrics
        listenAddress: 0.0.0.0:3030
      sanitization: prometheus
      samplingRate: 1.0
    
    backend: prom-remote
    
    prometheusRemoteBackend: 
      endpoints:
        # There must be one endpoint for unaggregated metrics (retention=0, resolution=0) so that m3 does not throw
        # an error.
        # We have a mapping rule to drop unaggregated metrics (see downsample -> mappingRules below), so the endpoint
        # here will NOT actually receive unaggregated metrics.
        - name: unaggregated 
          address: http://nginx-reverse-proxy.monitoring.svc.cluster.local:9092/
              
        # Use the following endpoint for directing pre-aggregated metric to a self-host prometheus instance for testing.
        # address: http://prometheus-m3-pre-agg.m3aggregation.svc.cluster.local:9090/api/v1/write
            
        - name: nginx-sidecar
          address: http://nginx-reverse-proxy.monitoring.svc.cluster.local:9092/
         # Use the following endpoint for directing aggregated metric to a self-host prometheus instance for testing.
         # address: http://prometheus-agg.m3aggregation.svc.cluster.local:9090/api/v1/write
              
          storagePolicy:
            retention: 2h
            resolution: 15s
            downsample:
              all: false
              # all: false means not all metrics go through downsampling, only those that pass the filter
      connectTimeout: 15s # Default is 5s, increase to 15s
      maxIdleConns: 500 # Default is 100, increase to 500
    
    clusterManagement:
      etcd:
        env: m3aggregation
        zone: embedded
        service: m3db
        cacheDir: /var/lib/m3kv
        etcdClusters:
          - zone: embedded
            endpoints:
              - etcd-0.etcd:2379
              - etcd-1.etcd:2379
              - etcd-2.etcd:2379
    
    downsample:
      rules:
        # mapping rule to drop unaggregated metrics.
        mappingRules:
          - name: "Drop unaggregate metric"
            filter: "__name__:*"
            drop: True
    
        rollupRules:
          # Exclude instance label for non VARZ metrics.
          # eg. web client metric includes a device id in its 'instance' label.
          # eg. K8s metrics includes a pod id in its 'instance' label.
          - name: "Exclude instance for _count"
            filter: "__name__:[!v]??[!z]*_count instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["instance"]
                aggregations: ["Sum"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
          
          - name: "Exclude instance for _sum"
            filter: "__name__:[!v]??[!z]*_sum instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["instance"]
                aggregations: ["Sum"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
        
          - name: "Exclude instance for _bucket"
            filter: "__name__:[!v]??[!z]*_bucket instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["instance"]
                aggregations: ["Sum"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
        
          - name: "Exclude instance for _total"
            filter: "__name__:[!v]??[!z]*_total instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["instance"]
                aggregations: ["Sum"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
            
          # todo: figure out how to filter out varz metrics
          # We still need this for self monitoring gauge metrics. 
          - name: "Exclude instance for gauge"
            filter: "__name__:!*{_count,_sum,_bucket,_total} instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["instance"]
                aggregations: ["Last"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
          
          # VARZ's target id is log_processor_instance.
          # Here we use another set of rollup rules to exclude two labels.
          - name: "Exclude log_processor_instance for _count"
            filter: "__name__:varz_*_count log_processor_instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["log_processor_instance"]
                aggregations: ["Sum"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
          
          - name: "Exclude log_processor_instance for _sum"
            filter: "__name__:varz_*_sum log_processor_instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["log_processor_instance"]
                aggregations: ["Sum"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
        
          - name: "Exclude log_processor_instance for _bucket"
            filter: "__name__:varz_*_bucket log_processor_instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["log_processor_instance"]
                aggregations: ["Sum"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
        
          - name: "Exclude log_processor_instance for _total"
            filter: "__name__:varz_*_total log_processor_instance:*"
            transforms:
            - rollup:
                metricName: "{{ .MetricName }}"
                excludeBy: ["log_processor_instance"]
                aggregations: ["Sum"]
            storagePolicies:
            - resolution: 15s
              retention: 2h
          
          # todo: figure out how to apply this for varz metrics
          # - name: "Exclude instance & log_processor_instance for gauge"
          #   filter: "__name__:!*{_count,_sum,_bucket,_total} instance:* log_processor_instance:*"
          #   transforms:
          #   - rollup:
          #       metricName: "{{ .MetricName }}"
          #       excludeBy: ["instance", "log_processor_instance"]
          #       aggregations: ["Last"]
          #   storagePolicies:
          #   - resolution: 15s
          #     retention: 2h
    
    
      matcher:
        requireNamespaceWatchOnInit: false
    
      remoteAggregator:
        client:
          type: m3msg
          m3msg:
            producer:
              writer:
                topicName: aggregator_ingest
                topicServiceOverride:
                  zone: embedded
                  environment: m3aggregation
                placement:
                  isStaged: true
                placementServiceOverride:
                  namespaces:
                    placement: /placement
                connection:
                  numConnections: 64
                messagePool:
                  size: 16384
                  watermark:
                    low: 0.7
                    high: 1.0
    
    # This is for configuring the ingestion server that will receive metrics from the m3aggregators on port 7507
    ingest:
      ingester:
        workerPoolSize: 10000
        opPool:
          size: 10000
        retry:
          maxRetries: 1
          jitter: true
        logSampleRate: 0.01
      m3msg:
        server:
          listenAddress: "0.0.0.0:7507"
          retry:
            maxBackoff: 10s
            jitter: true
    

    M3 Agg

    logging:
      level: info
    
    metrics:
      scope:
        prefix: m3aggregator
      prometheus:
        onError: none
        handlerPath: /metrics
        listenAddress: 0.0.0.0:6002
        timerType: histogram
      sanitization: prometheus
      samplingRate: 1.0
      extended: none
    
    http:
      listenAddress: 0.0.0.0:6001
      readTimeout: 60s
      writeTimeout: 60s
    
    m3msg:
      server:
        listenAddress: 0.0.0.0:6000
        retry:
          maxBackoff: 30s
      consumer:
        messagePool:
          size: 16384
    
    kvClient:
      etcd:
        env: m3aggregation
        zone: embedded
        service: m3aggregator
        cacheDir: /var/lib/m3kv
        etcdClusters:
          - zone: embedded
            endpoints:
              - etcd-0.etcd:2379
              - etcd-1.etcd:2379
              - etcd-2.etcd:2379
    
    runtimeOptions:
      kvConfig:
        environment: m3aggregation
        zone: embedded
      writeValuesPerMetricLimitPerSecondKey: write-values-per-metric-limit-per-second
      writeValuesPerMetricLimitPerSecond: 0
      writeNewMetricLimitClusterPerSecondKey: write-new-metric-limit-cluster-per-second
      writeNewMetricLimitClusterPerSecond: 0
      writeNewMetricNoLimitWarmupDuration: 0
    
    aggregator:
      hostID:
        resolver: environment
        envVarName: M3AGGREGATOR_HOST_ID
      instanceID:
        type: host_id
      verboseErrors: true
      metricPrefix: ""
      counterPrefix: ""
      timerPrefix: ""
      gaugePrefix: ""
      aggregationTypes:
        counterTransformFnType: empty
        timerTransformFnType: suffix
        gaugeTransformFnType: empty
        aggregationTypesPool:
          size: 1024
        quantilesPool:
          buckets:
            - count: 256
              capacity: 4
            - count: 128
              capacity: 8
      stream:
        eps: 0.001
        capacity: 32
        streamPool:
          size: 4096
        samplePool:
          size: 4096
        floatsPool:
          buckets:
            - count: 4096
              capacity: 16
            - count: 2048
              capacity: 32
            - count: 1024
              capacity: 64
      client:
        type: m3msg
        m3msg:
          producer:
            writer:
              topicName: aggregator_ingest
              topicServiceOverride:
                zone: embedded
                environment: m3aggregation
              placement:
                isStaged: true
              placementServiceOverride:
                namespaces:
                  placement: /placement
              messagePool:
                size: 16384
                watermark:
                  low: 0.7
                  high: 1
              messageRetry:
                initialBackoff: 5s # Chronosphere setting.
                maxBackoff: 5s # Chronosphere setting.
      placementManager:
        kvConfig:
          namespace: /placement
          environment: m3aggregation
          zone: embedded
        placementWatcher:
          key: m3aggregator
          initWatchTimeout: 10s
      hashType: murmur32
      bufferDurationBeforeShardCutover: 10m
      bufferDurationAfterShardCutoff: 10m
      bufferDurationForFutureTimedMetric: 10m # Allow test to write into future.
      bufferDurationForPastTimedMetric: 5m # Don't wait too long for timed metrics to flush.
      resignTimeout: 1m
      flushTimesManager:
        kvConfig:
          environment: m3aggregation
          zone: embedded
        flushTimesKeyFmt: shardset/%d/flush
        flushTimesPersistRetrier:
          initialBackoff: 100ms
          backoffFactor: 2.0
          maxBackoff: 30s
          maxRetries: 0
      electionManager:
        election:
          leaderTimeout: 10s
          resignTimeout: 10s
          ttlSeconds: 10
        serviceID:
          name: m3aggregator
          environment: m3aggregation
          zone: embedded
        electionKeyFmt: shardset/%d/lock
        campaignRetrier:
          initialBackoff: 100ms
          backoffFactor: 2.0
          maxBackoff: 2s
          forever: true
          jitter: true
        changeRetrier:
          initialBackoff: 100ms
          backoffFactor: 2.0
          maxBackoff: 5s
          forever: true
          jitter: true
        resignRetrier:
          initialBackoff: 100ms
          backoffFactor: 2.0
          maxBackoff: 5s
          forever: true
          jitter: true
        campaignStateCheckInterval: 1s
        shardCutoffCheckOffset: 30s
      flushManager:
        checkEvery: 1s
        jitterEnabled: true
        maxJitters:
          - flushInterval: 5s
            maxJitterPercent: 1.0
          - flushInterval: 10s
            maxJitterPercent: 0.5
          - flushInterval: 1m
            maxJitterPercent: 0.5
          - flushInterval: 10m
            maxJitterPercent: 0.5
          - flushInterval: 1h
            maxJitterPercent: 0.25
        numWorkersPerCPU: 0.5
        flushTimesPersistEvery: 10s
        maxBufferSize: 5m
        forcedFlushWindowSize: 10s
      flush:
        handlers:
          - dynamicBackend:
              name: m3msg
              hashType: murmur32
              producer:
                buffer:
                  maxBufferSize: 1000000000 # max buffer before m3msg start dropping data.
                writer:
                  topicName: aggregated_metrics
                  topicServiceOverride:
                    zone: embedded
                    environment: m3aggregation
                  messagePool:
                    size: 16384
                    watermark:
                      low: 0.7
                      high: 1
      passthrough:
        enabled: true
      forwarding:
        maxSingleDelay: 30s # Chronosphere setting.
        maxConstDelay: 5m # Need to add some buffer window, since timed metrics by default are delayed by 1min.
      entryTTL: 11h
      entryCheckInterval: 5m
      maxTimerBatchSizePerWrite: 140
      defaultStoragePolicies: []
      maxNumCachedSourceSets: 2
      discardNaNAggregatedValues: true
      entryPool:
        size: 4096
      counterElemPool:
        size: 4096
      timerElemPool:
        size: 4096
      gaugeElemPool:
        size: 0
    

    Performance issues

    If the issue is performance related, please provide the following information along with a description of the issue that you're experiencing:

    1. What service is experiencing the performance issue? (M3Coordinator, M3DB, M3Aggregator, etc)
    2. Approximately how many datapoints per second is the service handling?
    3. What is the approximate series cardinality that the service is handling in a given time window? I.e., how many unique time series are being measured?
    4. What is the hardware configuration (number CPU cores, amount of RAM, disk size and types, etc) that the service is running on? Is the service the only process running on the host or is it colocated with other software?
    5. What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).
    6. How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script?

    In addition to the above information, CPU and heap profiles are always greatly appreciated.

    CPU / Heap Profiles

    CPU and heap profiles are critical to helping us debug performance issues. All our services run with the net/http/pprof server enabled by default.
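
    For context, this is the standard way a Go service exposes those endpoints (a generic example, not lifted from the M3 source): importing net/http/pprof registers the /debug/pprof/* handlers that the curl commands below hit.

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        // For example, the coordinator serves these handlers on its listen port (default 7201).
        log.Fatal(http.ListenAndServe("0.0.0.0:7201", nil))
    }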

    Instructions for obtaining CPU / heap profiles for various services are below; please attach these profiles to the issue whenever possible.

    M3Coordinator

    CPU: curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/profile?seconds=5 > m3coord_cpu.out

    Heap: curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/heap > m3coord_heap.out

    M3DB

    CPU: curl <HOST_NAME>:<PORT(default 9004)>/debug/pprof/profile?seconds=5 > m3db_cpu.out

    Heap: curl <HOST_NAME>:<PORT(default 9004)>/debug/pprof/heap > m3db_heap.out

    M3DB Grafana Dashboard Screenshots

    If the service experiencing performance issues is M3DB and you're monitoring it using Prometheus, any screenshots you could provide using this dashboard would be helpful.

  • [aggregator] Aggregation Entry option handling and code cleanup

    [aggregator] Aggregation Entry option handling and code cleanup

    What this PR does / why we need it:

    Fixes #

    Special notes for your reviewer:

    Does this PR introduce a user-facing and/or backwards incompatible change?:

    
    

    Does this PR require updating code package or user-facing documentation?:

    
    
  • Bump express from 4.17.1 to 4.18.2 in /src/ctl/ui

    Bump express from 4.17.1 to 4.18.2 in /src/ctl/ui

    Bumps express from 4.17.1 to 4.18.2.

    Release notes

    Sourced from express's releases.

    4.18.2

    4.18.1

    • Fix hanging on large stack of sync routes

    4.18.0

    ... (truncated)

    Changelog

    Sourced from express's changelog.

    4.18.2 / 2022-10-08

    4.18.1 / 2022-04-29

    • Fix hanging on large stack of sync routes

    4.18.0 / 2022-04-25

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

  • Bump qs from 6.7.0 to 6.7.3 in /src/ctl/ui

    Bump qs from 6.7.0 to 6.7.3 in /src/ctl/ui

    Bumps qs from 6.7.0 to 6.7.3.

    Changelog

    Sourced from qs's changelog.

    6.7.3

    • [Fix] parse: ignore __proto__ keys (#428)
    • [Fix] stringify: avoid encoding arrayformat comma when encodeValuesOnly = true (#424)
    • [Robustness] stringify: avoid relying on a global undefined (#427)
    • [readme] remove travis badge; add github actions/codecov badges; update URLs
    • [Docs] add note and links for coercing primitive values (#408)
    • [meta] fix README.md (#399)
    • [meta] do not publish workflow files
    • [actions] backport actions from main
    • [Dev Deps] backport updates from main
    • [Tests] use nyc for coverage
    • [Tests] clean up stringify tests slightly

    6.7.2

    • [Fix] proper comma parsing of URL-encoded commas (#361)
    • [Fix] parses comma delimited array while having percent-encoded comma treated as normal text (#336)

    6.7.1

    • [Fix] parse: Fix parsing array from object with comma true (#359)
    • [Fix] parse: with comma true, handle field that holds an array of arrays (#335)
    • [fix] parse: with comma true, do not split non-string values (#334)
    • [Fix] parse: throw a TypeError instead of an Error for bad charset (#349)
    • [Fix] fix for an impossible situation: when the formatter is called with a non-string value
    • [Refactor] formats: tiny bit of cleanup.
    • readme: add security note
    • [meta] add tidelift marketing copy
    • [meta] add funding field
    • [meta] add FUNDING.yml
    • [meta] Clean up license text so it’s properly detected as BSD-3-Clause
    • [Dev Deps] update eslint, @ljharb/eslint-config, tape, safe-publish-latest, evalmd, iconv-lite, mkdirp, object-inspect, browserify
    • [Tests] parse: add passing arrayFormat tests
    • [Tests] use shared travis-ci configs
    • [Tests] Buffer.from in node v5.0-v5.9 and v4.0-v4.4 requires a TypedArray
    • [Tests] add tests for depth=0 and depth=false behavior, both current and intuitive/intended
    • [Tests] use eclint instead of editorconfig-tools
    • [actions] add automatic rebasing / merge commit blocking
    Commits
    • 834389a v6.7.3
    • 45143b6 [Tests] use nyc for coverage
    • 5d55ddc [meta] do not publish workflow files
    • f945393 [Fix] parse: ignore __proto__ keys (#428)
    • a8d5286 [Robustness] stringify: avoid relying on a global undefined (#427)
    • 04eac8d [Fix] stringify: avoid encoding arrayformat comma when `encodeValuesOnly = ...
    • 9dab77e [readme] remove travis badge; add github actions/codecov badges; update URLs
    • b9a039d [Tests] clean up stringify tests slightly
    • 29c8f3c [Docs] add note and links for coercing primitive values (#408)
    • c87c8c9 [meta] fix README.md (#399)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

  • [coordinator] Add write host error and fetch host error sampled logging and config

    [coordinator] Add write host error and fetch host error sampled logging and config

    Adds the ability to log host write errors and host fetch errors in a sampled way, configurable via config. Useful for investigating whether certain hosts encounter flaky errors more frequently than other nodes in a cluster (per replica/rack/etc.).
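
    As a sketch of what sampled error logging can look like in Go (M3 uses zap; the exact wiring and thresholds here are assumptions, not the coordinator's real configuration surface), a sampling core logs the first N occurrences per interval and then every Mth one:

    package sketch

    import (
        "time"

        "go.uber.org/zap"
        "go.uber.org/zap/zapcore"
    )

    // newSampledLogger wraps a logger so that repeated host write/fetch error
    // entries are sampled rather than logged on every occurrence.
    func newSampledLogger(base *zap.Logger) *zap.Logger {
        return base.WithOptions(zap.WrapCore(func(core zapcore.Core) zapcore.Core {
            // Keep the first 10 identical entries per second, then 1 in every 100.
            return zapcore.NewSamplerWithOptions(core, time.Second, 10, 100)
        }))
    }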

    Special notes for your reviewer:

    Does this PR introduce a user-facing and/or backwards incompatible change?:

    NONE
    

    Does this PR require updating code package or user-facing documentation?:

    NONE
    
  • Bump decode-uri-component from 0.2.0 to 0.2.2 in /src/ctl/ui

    Bump decode-uri-component from 0.2.0 to 0.2.2 in /src/ctl/ui

    Bumps decode-uri-component from 0.2.0 to 0.2.2.

    Release notes

    Sourced from decode-uri-component's releases.

    v0.2.2

    • Prevent overwriting previously decoded tokens 980e0bf

    https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.1...v0.2.2

    v0.2.1

    • Switch to GitHub workflows 76abc93
    • Fix issue where decode throws - fixes #6 746ca5d
    • Update license (#1) 486d7e2
    • Tidelift tasks a650457
    • Meta tweaks 66e1c28

    https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.1

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.
