NVIDIA GPU metrics exporter for Prometheus leveraging DCGM

DCGM-Exporter

This repository contains the DCGM-Exporter project. It exposes GPU metrics for Prometheus consumption, leveraging NVIDIA DCGM.

Documentation

Official documentation for DCGM-Exporter can be found on docs.nvidia.com.

Quickstart

To gather metrics on a GPU node, simply start the dcgm-exporter container:

$ docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
...
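
dcgm-exporter itself warns that it needs extra privileges to expose the DCGM_FI_PROF_* profiling metrics; they require the SYS_ADMIN capability. A minimal variant of the command above that grants it (same image tag, adjust to the tag you actually run):

$ docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 \
    nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04
$ curl localhost:9400/metrics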

Quickstart on Kubernetes

Note: Consider using the NVIDIA GPU Operator rather than DCGM-Exporter directly.

Ensure you have already set up your cluster with NVIDIA as the default container runtime.

The recommended way to install DCGM-Exporter is to use the Helm chart:

$ helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts

Update the repo:

$ helm repo update

And install the chart:

$ helm install \
    --generate-name \
    gpu-helm-charts/dcgm-exporter
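
Once the chart is installed, you can check that the exporter pods are running; the label selector below assumes the chart's default labels (the same selector is used for port forwarding further down):

$ kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter"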

Once the dcgm-exporter pod is deployed, you can use port forwarding to obtain metrics quickly:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

# Let's get the output of a random pod:
$ NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" \
                         -o "jsonpath={ .items[0].metadata.name}")

$ kubectl port-forward $NAME 8080:9400 &
$ curl -sL http://127.0.0.1:8080/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 9223372036854775794
...

To integrate DCGM-Exporter with Prometheus and Grafana, see the full instructions in the user guide. dcgm-exporter is also deployed as part of the GPU Operator; to get started with Prometheus integration there, check the GPU Operator user guide.
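
If you only want to point an existing Prometheus server at the exporter for a quick test, a static scrape job is enough. The snippet below is a sketch: the job name, file name, and target address are placeholders, and for a production cluster you should follow the user guide instead.

$ cat <<'EOF' > dcgm-scrape-job.yml   # merge this job into the scrape_configs of your prometheus.yml
  - job_name: 'dcgm-exporter'         # placeholder job name
    static_configs:
      - targets: ['<gpu-node>:9400']  # replace with your GPU node or exporter service address
EOF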

Building from Source

To build dcgm-exporter from source, ensure you have Go installed and DCGM available, then run:

$ git clone https://github.com/NVIDIA/dcgm-exporter.git
$ cd dcgm-exporter
$ make binary
$ sudo make install
...
$ dcgm-exporter &
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
...

Changing Metrics

With dcgm-exporter you can configure which fields are collected by specifying a custom CSV file. You will find the default CSV file under etc/default-counters.csv in the repository; it is copied to /etc/dcgm-exporter/default-counters.csv on your system or in the container.

The layout and format of this file are as follows:

# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

A custom CSV file can be specified using the -f (or --collectors) option as follows:

$ dcgm-exporter -f /tmp/custom-collectors.csv
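
The same flag works when the exporter runs in a container: one possible sketch is to bind-mount the custom file and pass -f, assuming the container entrypoint forwards extra arguments to dcgm-exporter (the file path and image tag below are only examples):

$ docker run -d --gpus all --rm -p 9400:9400 \
    -v /tmp/custom-collectors.csv:/etc/dcgm-exporter/custom-collectors.csv \
    nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04 \
    -f /etc/dcgm-exporter/custom-collectors.csv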


What about a Grafana Dashboard?

You can find the official NVIDIA DCGM-Exporter dashboard here: https://grafana.com/grafana/dashboards/12239

You can also find the JSON file in this repository under grafana/dcgm-exporter-dashboard.json

Pull requests are accepted!

Issues and Contributing

Check out the Contributing document!

Comments
  • Pod metrics displays Daemonset name of dcgm-exporter rather than the pod with GPU

    Expected Behavior: I'm trying to get GPU metrics working for my workloads and would expect to be able to see my pod name show up in the Prometheus metrics, as per this guide in the section "Per-pod GPU metrics in a Kubernetes cluster".

    Existing Behavior: The metrics show up, but the "pod" tag is "somename-gpu-dcgm-exporter", which is unhelpful as it does not map back to my pods.

    example metric: DCGM_FI_DEV_GPU_TEMP{UUID="GPU-<UUID>", container="exporter", device="nvidia0", endpoint="metrics", gpu="0", instance="<Instance>", job="somename-gpu-dcgm-exporter", namespace="some-namespace", pod="somename-gpu-dcgm-exporter-vfbhl", service="somename-gpu-dcgm-exporter"}

    K8s cluster: GKE clusters with a nodepool running 2 V100 GPUs per node. Setup: I used helm template to generate the YAML to apply to my GKE cluster. I ran into the issue described here, so I needed to add privileged: true, downgrade to nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04, and add the nvidia-install-dir-host volume.

    Things I've tried:

    • Verified DCGM_EXPORTER_KUBERNETES is set to true
    • Went through https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L126 to see if I misunderstood the functionality or could find any easy resolution
    • I see there has been a code change since my downgrade, but it seemed to enable MIG, which didn't seem to apply to me. Even if it did, the issue I encountered that forced the downgrade would still exist.

    The DaemonSet looked like this:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: somename-gpu-dcgm-exporter
      namespace: some-namespace
      labels:
        helm.sh/chart: dcgm-exporter-2.4.0
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/instance: somename-gpu
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          app.kubernetes.io/name: dcgm-exporter
          app.kubernetes.io/instance: somename-gpu
          app.kubernetes.io/component: "dcgm-exporter"
      template:
        metadata:
          labels:
            app.kubernetes.io/name: dcgm-exporter
            app.kubernetes.io/instance: somename-gpu
            app.kubernetes.io/component: "dcgm-exporter"
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: cloud.google.com/gke-accelerator
                        operator: Exists
          serviceAccountName: gpu-dcgm-exporter
          volumes:
          - name: "pod-gpu-resources"
            hostPath:
              path: "/var/lib/kubelet/pod-resources"
          - name: nvidia-install-dir-host
            hostPath:
              path: /home/kubernetes/bin/nvidia
          tolerations:
            - effect: NoSchedule
              key: nvidia.com/gpu
              operator: "Exists"
            - effect: NoSchedule
              key: nodeSize
              operator: Equal
              value: my-special-nodepool-taint
          containers:
          - name: exporter
            securityContext:
              capabilities:
                add:
                - SYS_ADMIN
              runAsNonRoot: false
              runAsUser: 0
              privileged: true
            image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
            imagePullPolicy: "IfNotPresent"
            args:
            - -f
            - /etc/dcgm-exporter/dcp-metrics-included.csv
            env:
            - name: "DCGM_EXPORTER_KUBERNETES"
              value: "true"
            - name: "DCGM_EXPORTER_LISTEN"
              value: ":9400"
            ports:
            - name: "metrics"
              containerPort: 9400
            volumeMounts:
            - name: "pod-gpu-resources"
              readOnly: true
              mountPath: "/var/lib/kubelet/pod-resources"
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia
            livenessProbe:
              httpGet:
                path: /health
                port: 9400
              initialDelaySeconds: 5
              periodSeconds: 5
            readinessProbe:
              httpGet:
                path: /health
                port: 9400
              initialDelaySeconds: 5
    
  • Confirm DCP GPU family

    Hi.

    I have two questions.

    1. I would like to know about the DCP GPU family. Which GPUs are included?

    2. How should I build one standard dashboard to show GPU utilization with servers from different GPU families (T4, RTX A6000, A100, GeForce RTX 3080, and so on) in a K8s environment?

    As you know, if a GPU is not included in the DCP GPU family, the DCGM_FI_PROF_* metrics will be disabled. If GPU families are mixed in our cluster, the dashboard will not work well... Or should I use the older "DCGM_FI_DEV_GPU_UTIL" metric instead?

    Best regards. Kaka

  • Error starting nv-hostengine: DCGM initialization error

    Run this command on a server with NVIDIA A100 GPUs, where one of them has MIG turned on: docker run --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.3.1-2.6.1-ubuntu20.04 and this is the output I got:

    Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
    time="2021-12-14T17:28:56Z" level=info msg="Starting dcgm-exporter"
    CacheManager Init Failed. Error: -17
    time="2021-12-14T17:28:56Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
    

    Docker version: 20.10.11, build dea9396
    Ubuntu: VERSION="20.04.3 LTS (Focal Fossa)" x86_64
    CPU: AMD

    user@host~$ nvidia-smi
    Tue Dec 14 17:30:37 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-PCI...  Off  | 00000000:0B:00.0 Off |                    0 |
    | N/A   34C    P0    33W / 250W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-PCI...  Off  | 00000000:14:00.0 Off |                   On |
    | N/A   30C    P0    32W / 250W |     20MiB / 40536MiB |     N/A      Default |
    |                               |                      |              Enabled |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | MIG devices:                                                                |
    +------------------+----------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
    |      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
    |                  |                      |        ECC|                       |
    |==================+======================+===========+=======================|
    |  1    1   0   0  |     10MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
    |                  |      0MiB / 32767MiB |           |                       |
    +------------------+----------------------+-----------+-----------------------+
    |  1    2   0   1  |     10MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
    |                  |      0MiB / 32767MiB |           |                       |
    +------------------+----------------------+-----------+-----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
  • GPU freezes when dcgm-exporter is used

    We have a problem with dcgm-exporter while running it in a Kubernetes cluster. All nodes are bootstrapped from the same code and are identical. On some nodes (randomly) it fails to start and causes the GPU to freeze, and that GPU becomes useless. When we un-deploy the DaemonSet with dcgm-exporter, everything works just fine. The GPUs used are the same. On the staging environment, where we have the same config but with nvidia-device-plugin version 1.11, it works without issues.

    Working one:

    ❯ ssh ip-10-123-218-199.eu-west-1.compute.internal sudo nvidia-smi
    Warning: Permanently added 'ip-10-123-218-199.eu-west-1.compute.internal' (ED25519) to the list of known hosts.
    Fri Aug 19 16:27:48 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
    | N/A   81C    P0    70W /  70W |    317MiB / 15360MiB |     83%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A     11149      C   /usr/bin/dcgmproftester11         315MiB |
    +-----------------------------------------------------------------------------+
    

    Faulty one:

    ❯ ssh ip-10-123-216-188.eu-west-1.compute.internal sudo nvidia-smi
    Warning: Permanently added 'ip-10-123-216-188.eu-west-1.compute.internal' (ED25519) to the list of known hosts.
    No devices were found
    

    In the system logs on the faulty nodes we can observe the following messages:

    ❯ ssh ip-10-123-216-6.eu-west-1.compute.internal sudo grep NVRM /var/log/messages
    Warning: Permanently added 'ip-10-123-216-6.eu-west-1.compute.internal' (ED25519) to the list of known hosts.
    Jul 27 11:22:22 ip-172-31-43-161.eu-west-1.compute.internal kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.73.08  Wed May 18 20:34:14 UTC 2022
    Aug 18 18:21:05 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.73.08  Wed May 18 20:34:14 UTC 2022
    Aug 18 18:25:58 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU at PCI:0000:00:1e: GPU-59d33ac2-81e7-c6bf-de3d-8ca92a29f45a
    Aug 18 18:25:58 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: Xid (PCI:0000:00:1e): 119, pid=9345, Timeout waiting for RPC from GSP! Expected function 63.
    Aug 18 18:25:58 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: Xid (PCI:0000:00:1e): 119, pid=9345, Timeout waiting for RPC from GSP! Expected function 63.
    Aug 18 18:25:58 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: Xid (PCI:0000:00:1e): 119, pid=9345, Timeout waiting for RPC from GSP! Expected function 76.
    Aug 18 18:25:58 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x23:0x65:1401)
    Aug 18 18:25:58 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
    Aug 18 18:26:05 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU at PCI:0000:00:1e: GPU-00000000-0000-0000-0000-000000000000
    Aug 18 18:26:05 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: Xid (PCI:0000:00:1e): 119, pid=9345, Timeout waiting for RPC from GSP! Expected function 4097.
    Aug 18 18:26:05 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x63:0x65:2344)
    Aug 18 18:26:05 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
    Aug 18 18:26:05 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: request_irq() failed (-4)
    Aug 18 18:26:09 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU at PCI:0000:00:1e: GPU-00000000-0000-0000-0000-000000000000
    Aug 18 18:26:09 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: Xid (PCI:0000:00:1e): 119, pid=10728, Timeout waiting for RPC from GSP! Expected function 4097.
    Aug 18 18:26:09 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x63:0x65:2344)
    Aug 18 18:26:09 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
    Aug 18 18:26:10 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x62:0x0:2288)
    Aug 18 18:26:10 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
    Aug 18 18:26:14 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU at PCI:0000:00:1e: GPU-00000000-0000-0000-0000-000000000000
    Aug 18 18:26:14 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: Xid (PCI:0000:00:1e): 119, pid=10841, Timeout waiting for RPC from GSP! Expected function 4097.
    Aug 18 18:26:14 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x63:0x65:2344)
    Aug 18 18:26:14 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
    Aug 18 18:26:18 ip-10-123-216-6.eu-west-1.compute.internal kernel: NVRM: GPU at PCI:0000:00:1e: GPU-00000000-0000-0000-0000-000000000000
    

    Component versions:
    OS: Rocky Linux 8.6 (RHEL bug-to-bug compatible)
    Kernel: Linux ip-10-88-66-41.eu-west-1.compute.internal 4.18.0-372.16.1.el8_6.x86_64 #1 SMP Wed Jul 13 15:36:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    Kubernetes: v1.22.11
    Container runtime: crio 1.22.5 (default runtime: nvidia)
    RunC version: 1.0.3
    SELinux status: disabled (for troubleshooting)
    Nvidia drivers: 3:510.73.08-1.el8 (installed on the node)
    Nvidia toolkit: 1.10.0-1
    Nvidia-device-plugin: v0.12.2
    Dcgm-exporter: 2.4.6-2.6.10-ubuntu20.04 (same with 2.4.6-2.6.8-ubuntu20.04)

    Nvidia-device-plugin config (fails on both gpu and shared-gpu nodes):

    apiVersion: v1
    data:
      _default: |-
        version: v1
        flags:
          migStrategy: none
      gpu: |-
        version: v1
        flags:
          migStrategy: none
      gpu-shared: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          timeSlicing:
            renameByDefault: True
            failRequestsGreaterThanOne: False
            resources:
            - name: nvidia.com/gpu
              replicas: 2
    

    We have already seen issue https://github.com/NVIDIA/dcgm-exporter/issues/84 and deployed dcgm-exporter with a minimal config:

    apiVersion: v1
    data:
      metrics: |
        DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
        DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
        DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
        DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
        DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
        DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
        DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
        DCGM_FI_DEV_DEC_UTIL,     gauge, Decoder utilization (in %).
        DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
        DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
        DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
  • Add support for string fields as labels

    Any entry in the config file with type "label" will become a label on all metrics.

    Closes https://github.com/NVIDIA/dcgm-exporter/issues/72.

  • Issue running 2.4.6-2.6.8

    @glowkey were there any breaking changes in the latest release?

    I just tested the release by swapping the docker images out and I get the following error:

    setting up csv
    /etc/dcgm-exporter/dcp-metrics-bolt.csv
    done
    time="2022-07-19T14:22:40Z" level=info msg="Starting dcgm-exporter"
    time="2022-07-19T14:22:41Z" level=info msg="DCGM successfully initialized!"
    time="2022-07-19T14:22:41Z" level=info msg="Collecting DCP Metrics"
    time="2022-07-19T14:22:41Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-bolt.csv"
    time="2022-07-19T14:22:41Z" level=fatal msg="Error getting device busid: API version mismatch"
    

    Rolling back to 2.3.5-2.6.5 removed the issue. I didn't see this issue on 2.4.5-2.6.7, but that release had other metric issues.

  • Enable some commented by default metrics

    I have the DCGM Exporter up and running in a Kubernetes Cluster.

    I would like to enable two metrics (DCGM_FI_PROF_SM_ACTIVE, and DCGM_FI_PROF_SM_OCCUPANCY).

    I tried to edit the ConfigMap, and even the dcp-metrics-included.csv file inside the pod, but that doesn't seem to work; Prometheus still doesn't publish those metrics.

    What should I do? Do I need to deploy the DCGM Exporter from scratch? Thanks.

  • Applying the latest dcgm-exporter some issues with the exporter container

    After applying the latest dcgm-exporter, I am getting the following for the exporter container:

    gpu-operator   nvidia-container-toolkit-daemonset-gc4bl                          1/1     Running            0             43s
    gpu-operator   nvidia-cuda-validator-5g9k2                                       0/1     Completed          0             39s
    gpu-operator   nvidia-dcgm-exporter-pxl49                                        0/1     CrashLoopBackOff   2 (20s ago)   43s
    gpu-operator   nvidia-device-plugin-daemonset-ss8fj                              1/1     Running            0             43s
    gpu-operator   nvidia-device-plugin-validator-ddssg                              0/1     Completed          0             23s
    

    When I look at the logs I see:

    kubectl logs --previous --tail 100 nvidia-dcgm-exporter-pxl49 -n gpu-operator
    Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
    time="2022-08-18T23:45:25Z" level=info msg="Starting dcgm-exporter"
    time="2022-08-18T23:45:25Z" level=info msg="DCGM successfully initialized!"
    time="2022-08-18T23:45:25Z" level=info msg="Collecting DCP Metrics"
    time="2022-08-18T23:45:25Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
    time="2022-08-18T23:45:25Z" level=fatal msg="Could not find Prometheus metry type label"
    

    nvcr.io/nvidia/k8s/dcgm-exporter:2.4.5-2.6.7-ubuntu20.04

  • Latest Release bugs 2.4.5-2.6.7 - metrics missing

    When upgrading to 2.4.5-2.6.7 we lose access to: DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_FB_FREE

    Going back to 2.3.5-2.6.5 resolves the issue

    Also, it looks like DCGM 2.4.5 should now support DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE and DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE; however, it appears the new release, which uses 2.4.5, does not yet. This is more a question than an issue.

    Servers are running 470.57.02

  • Allow disabling service

    We scrape the pod port directly. Scraping the service doesn't really work with a DaemonSet, since the service will point to only one of the DaemonSet pods; when more than one node with GPUs exists, this ends up scraping only one of them.

  • the tests fail

    Try to run the tests under pkg/dcgmexporter; they fail. Here are the steps:

    cd pkg/dcgmexporter
    go test
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ListPodResourcesRequest
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ListPodResourcesResponse
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.PodResources
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ContainerResources
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ContainerDevices
    --- FAIL: TestDCGMCollector (0.00s)
        gpu_collector_test.go:35: Error Trace: gpu_collector_test.go:35
            Error:    Received unexpected error: libdcgm.so not Found
            Test:     TestDCGMCollector
    /tmp/go-build21440241/b001/dcgmexporter.test: symbol lookup error: /tmp/go-build21440241/b001/dcgmexporter.test: undefined symbol: dcgmGetAllDevices
    exit status 127
    FAIL    github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter    0.016s

    Are there any settings needed to run the tests?

  • Error with chart install

    Hi,

    I have been struggling to install the dcgm-exporter chart on my k8s cluster for a while. I've followed the provided documentation to install it with Helm, but the pods are endlessly restarting with this error message: "Failed to initialize NVML / Error starting nv-hostengine: DCGM initialization error".

    Chart version: 3.1.3-3.1.2-ubuntu20.04
    Kube cluster version: 1.23.11-gke.300
    CUDA drivers version: 510.47.03
    GPU type: NVIDIA T4

    Has anyone faced the same issue?

  • Exporting processes with DCGM

    I would love to export the different processes that can be seen using nvidia-smi. I have got dcgm-exporter working in a Docker container running on a bare-metal host, with metrics being stored in a central Prometheus server.

    I do not understand how I would be able to export the running process names, their PID, and respective gpu memory usage, all of which can be seen from nvidia-smi.

    I can see there is an available field identifier, DCGM_FI_PROCESS_NAME; however, when I try to add this to my /etc/dcgm-exporter/dcp-metrics-included.csv, the exporter crashes and exports nothing.

    If anyone could provide any insight I would greatly appreciate it!!

    Thanks :pray:

  • process's SM Utilization is always lower than the gpu's SM Utilization

    Hi, when only one process runs on the GPU, why is the process's SM utilization always lower than the GPU's SM utilization? [screenshot] This is the result of nvidia-smi: [screenshot] I want to know what uses the remaining 1% of the GPU's SM utilization.

  • DCGM_FI_DEV_GPU_UTIL doesn't show up with A100 GPU in MIG mode

    Hey there,

    I want to make sure: can I use dcgm-exporter to monitor GPU utilization in A100 MIG mode?

    I ran a test on those MIG GPU instances, but I find that I cannot see the DCGM_FI_DEV_*_UTIL metrics from dcgm-exporter, even if I enable them in /etc/dcgm-exporter/default-counters.csv.

    My driver: 450.203.03
    My dcgm-exporter: nvcr.io/nvidia/k8s/dcgm-exporter:2.4.6-2.6.10-ubuntu20.04
    My nvidia-smi -L output:

    GPU 0: A100-SXM4-40GB (UUID: GPU-a810a1fc-9f46-0e6f-5cef-bf34c2248e12)
      MIG 3g.20gb Device 0: (UUID: MIG-GPU-a810a1fc-9f46-0e6f-5cef-bf34c2248e12/1/0)
      MIG 3g.20gb Device 1: (UUID: MIG-GPU-a810a1fc-9f46-0e6f-5cef-bf34c2248e12/2/0)
    GPU 1: A100-SXM4-40GB (UUID: GPU-604c7a18-e7d0-02e0-723a-7578be840651)
      MIG 3g.20gb Device 0: (UUID: MIG-GPU-604c7a18-e7d0-02e0-723a-7578be840651/1/0)
      MIG 3g.20gb Device 1: (UUID: MIG-GPU-604c7a18-e7d0-02e0-723a-7578be840651/2/0)
    GPU 2: A100-SXM4-40GB (UUID: GPU-213c841d-7181-629b-9a20-5c80fbd75ad9)
      MIG 3g.20gb Device 0: (UUID: MIG-GPU-213c841d-7181-629b-9a20-5c80fbd75ad9/1/0)
      MIG 3g.20gb Device 1: (UUID: MIG-GPU-213c841d-7181-629b-9a20-5c80fbd75ad9/2/0)
    GPU 3: A100-SXM4-40GB (UUID: GPU-8345bdf0-6a9c-e1c1-5994-9e656aed3abc)
      MIG 3g.20gb Device 0: (UUID: MIG-GPU-8345bdf0-6a9c-e1c1-5994-9e656aed3abc/1/0)
      MIG 3g.20gb Device 1: (UUID: MIG-GPU-8345bdf0-6a9c-e1c1-5994-9e656aed3abc/2/0)
    GPU 4: A100-SXM4-40GB (UUID: GPU-a583e23a-4957-5154-902b-a347482e0937)
      MIG 3g.20gb Device 0: (UUID: MIG-GPU-a583e23a-4957-5154-902b-a347482e0937/1/0)
      MIG 3g.20gb Device 1: (UUID: MIG-GPU-a583e23a-4957-5154-902b-a347482e0937/2/0)
    GPU 5: A100-SXM4-40GB (UUID: GPU-730b3e2c-4b49-106c-9b0b-1ce13a10d125)
      MIG 3g.20gb Device 0: (UUID: MIG-GPU-730b3e2c-4b49-106c-9b0b-1ce13a10d125/1/0)
      MIG 3g.20gb Device 1: (UUID: MIG-GPU-730b3e2c-4b49-106c-9b0b-1ce13a10d125/2/0)
    GPU 6: A100-SXM4-40GB (UUID: GPU-cc59bb72-3e96-09b6-02aa-c8c72fe191bf)
      MIG 3g.20gb Device 0: (UUID: MIG-GPU-cc59bb72-3e96-09b6-02aa-c8c72fe191bf/1/0)
      MIG 3g.20gb Device 1: (UUID: MIG-GPU-cc59bb72-3e96-09b6-02aa-c8c72fe191bf/2/0)
    GPU 7: A100-SXM4-40GB (UUID: GPU-f57f1f74-7c11-6b82-c417-3544ef4c5d7b)
      MIG 3g.20gb Device 0: (UUID: MIG-GPU-f57f1f74-7c11-6b82-c417-3544ef4c5d7b/1/0)
      MIG 3g.20gb Device 1: (UUID: MIG-GPU-f57f1f74-7c11-6b82-c417-3544ef4c5d7b/2/0)
    

    My docker run cmd: docker run -d --gpus all -p 9400:9400 --cap-add SYS_ADMIN nvcr.io/nvidia/k8s/dcgm-exporter:2.4.6-2.6.10-ubuntu20.04
