NVIDIA device plugin for Kubernetes

About

The NVIDIA device plugin for Kubernetes is a Daemonset that allows you to automatically:

  • Expose the number of GPUs on each node of your cluster
  • Keep track of the health of your GPUs
  • Run GPU-enabled containers in your Kubernetes cluster.

This repository contains NVIDIA's official implementation of the Kubernetes device plugin.

Please note that:

  • The NVIDIA device plugin API is beta as of Kubernetes v1.10.
  • The NVIDIA device plugin is still considered beta and is missing
    • More comprehensive GPU health checking features
    • GPU cleanup features
    • ...
  • Support will only be provided for the official NVIDIA device plugin (and not for forks or other variants of this plugin).

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

  • NVIDIA drivers ~= 384.81
  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)
  • docker configured with nvidia as the default runtime.
  • Kubernetes version >= 1.10

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and nvidia-docker have been installed.

Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit. This is because the new --gpus option hasn't reached Kubernetes yet. Example:

# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker

You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

If runtimes is not already present in this file, head to the nvidia-docker install page.
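
As a quick sanity check (not part of the original instructions), you can confirm that docker picked up nvidia as the default runtime after restarting the daemon; the exact output formatting may vary with your docker version:

$ sudo systemctl restart docker
$ docker info | grep -i 'default runtime'
 Default Runtime: nvidia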

Enabling GPU Support in Kubernetes

Once you have configured the options above on all the GPU nodes in your cluster, you can enable GPU support by deploying the following Daemonset:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.10.0/nvidia-device-plugin.yml

Note: This is a simple static daemonset meant to demonstrate the basic features of the nvidia-device-plugin. Please see the instructions below for Deployment via helm when deploying the plugin in a production setting.
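
As an optional check, you can confirm that the plugin has registered GPUs with the kubelet by inspecting the node's capacity and allocatable resources; <your-gpu-node> below is a placeholder for one of your GPU node names:

$ kubectl describe node <your-gpu-node> | grep -i 'nvidia.com/gpu'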

Running GPU Jobs

With the daemonset deployed, NVIDIA GPUs can now be requested by a container using the nvidia.com/gpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvcr.io/nvidia/digits:20.12-tensorflow-py3
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

WARNING: if you don't request GPUs when using the device plugin with NVIDIA images, all the GPUs on the machine will be exposed inside your container.
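
If you have containers that should not see any GPUs at all, one common workaround is to set NVIDIA_VISIBLE_DEVICES explicitly in the container spec so that the NVIDIA container runtime does not inject any devices. This is a sketch based on the documented behavior of the NVIDIA container runtime, not something enforced by the device plugin itself:

apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-pod
spec:
  containers:
    - name: cpu-only-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none" # hide all GPUs from this container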

Deployment via helm

The preferred method to deploy the device plugin is as a daemonset using helm. Instructions for installing helm can be found here.

The helm chart for the latest release of the plugin (v0.10.0) includes a number of customizable values. The most commonly overridden ones are:

  failOnInitError:
      fail the plugin if an error is encountered during initialization, otherwise block indefinitely
      (default 'true')
  compatWithCPUManager:
      run with escalated privileges to be compatible with the static CPUManager policy
      (default 'false')
  legacyDaemonsetAPI:
      use the legacy daemonset API version 'extensions/v1beta1'
      (default 'false')
  migStrategy:
      the desired strategy for exposing MIG devices on GPUs that support it
      [none | single | mixed] (default "none")
  deviceListStrategy:
      the desired strategy for passing the device list to the underlying runtime
      [envvar | volume-mounts] (default "envvar")
  deviceIDStrategy:
      the desired strategy for passing device IDs to the underlying runtime
      [uuid | index] (default "uuid")
  nvidiaDriverRoot:
      the root path for the NVIDIA driver installation (typical values are '/' or '/run/nvidia/driver')

When set to true, the failOnInitError flag fails the plugin if an error is encountered during initialization. When set to false, it prints an error message and blocks indefinitely instead of failing. Blocking indefinitely follows legacy semantics that allow the plugin to deploy successfully on nodes that don't have GPUs on them (and aren't supposed to have GPUs on them) without throwing an error. In this way, you can blindly deploy a daemonset with the plugin on all nodes in your cluster, whether they have GPUs on them or not, without encountering an error. However, doing so means that there is no way to detect an actual error on nodes that are supposed to have GPUs on them. Failing if an initialization error is encountered is now the default and should be adopted by all new deployments.

The compatWithCPUManager flag configures the daemonset to be able to interoperate with the static CPUManager of the kubelet. Setting this flag requires one to deploy the daemonset with elevated privileges, so only do so if you know you need to interoperate with the CPUManager.

The legacyDaemonsetAPI flag configures the daemonset to use version extensions/v1beta1 of the DaemonSet API. This API version was removed in Kubernetes v1.16, so is only intended to allow newer plugins to run on older versions of Kubernetes.

The migStrategy flag configures the daemonset to be able to expose Multi-Instance GPUs (MIG) on GPUs that support them. More information on what these strategies are and how they should be used can be found in Supporting Multi-Instance GPUs (MIG) in Kubernetes.

Note: With a migStrategy of mixed, you will have additional resources available to you of the form nvidia.com/mig-<slice_count>g.<memory_size>gb that you can set in your pod spec to get access to a specific MIG device.
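
For example, a pod requesting a specific MIG device might look like the sketch below; the 1g.5gb profile and the CUDA image tag are purely illustrative, since the actual resource names depend on how MIG is configured on your GPUs:

apiVersion: v1
kind: Pod
metadata:
  name: mig-example-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:11.0-base
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1 # requesting 1 MIG device with the 1g.5gb profile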

The deviceListStrategy flag allows one to choose which strategy the plugin will use to advertise the list of GPUs allocated to a container. This is traditionally done by setting the NVIDIA_VISIBLE_DEVICES environment variable as described here. This strategy can be selected via the (default) envvar option. Support was recently added to the nvidia-container-toolkit to also allow passing the list of devices as a set of volume mounts instead of as an environment variable. This strategy can be selected via the volume-mounts option. Details for the rationale behind this strategy can be found here.
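
With the default envvar strategy, you can inspect what the plugin passed to the runtime by reading the environment of an allocated container. Using the gpu-pod example from above (the placeholder output stands in for the UUIDs of whatever GPUs were actually allocated):

$ kubectl exec gpu-pod -c cuda-container -- printenv NVIDIA_VISIBLE_DEVICES
GPU-<uuid-of-first-gpu>,GPU-<uuid-of-second-gpu>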

The deviceIDStrategy flag allows one to choose which strategy the plugin will use to pass the device ID of the GPUs allocated to a container. The device ID has traditionally been passed as the UUID of the GPU. This flag lets a user decide if they would like to use the UUID or the index of the GPU (as seen in the output of nvidia-smi) as the identifier passed to the underlying runtime. Passing the index may be desirable in situations where pods that have been allocated GPUs by the plugin get restarted with different physical GPUs attached to them.

Please take a look in the following values.yaml file to see the full set of overridable parameters for the device plugin.
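
If you have the helm repository configured (see the next section), the same set of values can also be listed directly with helm show values:

$ helm show values nvdp/nvidia-device-plugin --version 0.10.0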

Installing via helm install from the nvidia-device-plugin helm repository

The preferred method of deployment is with helm install via the nvidia-device-plugin helm repository.

This repository can be installed as follows:

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update

Once this repo is updated, you can begin installing packages from it to deploy the nvidia-device-plugin daemonset. Below are some examples of deploying the plugin with the various flags from above.

Note: Since this is a pre-release version, you will need to pass the --devel flag to helm search repo in order to see this release listed.
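
For example, to confirm the chart is visible from the repository you just added:

$ helm search repo nvdp --devel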

Using the default values for the flags:

$ helm install \
    --version=0.10.0 \
    --generate-name \
    nvdp/nvidia-device-plugin

Enabling compatibility with the CPUManager and running with a CPU request of 100m and a memory limit of 512Mi:

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set compatWithCPUManager=true \
    --set resources.requests.cpu=100m \
    --set resources.limits.memory=512Mi \
    nvdp/nvidia-device-plugin

Use the legacy Daemonset API (only available on Kubernetes < v1.16):

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set legacyDaemonsetAPI=true \
    nvdp/nvidia-device-plugin

Enabling compatibility with the CPUManager and the mixed migStrategy:

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set compatWithCPUManager=true \
    --set migStrategy=mixed \
    nvdp/nvidia-device-plugin
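
The remaining chart values described earlier can be set the same way. For example, to restore the legacy behavior of blocking instead of failing on an initialization error (shown here only as an illustration):

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set failOnInitError=false \
    nvdp/nvidia-device-plugin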

Deploying via helm install with a direct URL to the helm package

If you prefer not to install from the nvidia-device-plugin helm repo, you can run helm install directly against the tarball of the plugin's helm package. The examples below install the same daemonsets as the method above, except that they use direct URLs to the helm package instead of the helm repo.

Using the default values for the flags:

$ helm install \
    --generate-name \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz

Enabling compatibility with the CPUManager and running with a CPU request of 100m and a memory limit of 512Mi:

$ helm install \
    --generate-name \
    --set compatWithCPUManager=true \
    --set resources.requests.cpu=100m \
    --set resources.limits.memory=512Mi \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz

Use the legacy Daemonset API (only available on Kubernetes < v1.16):

$ helm install \
    --generate-name \
    --set legacyDaemonsetAPI=true \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz

Enabling compatibility with the CPUManager and the mixed migStrategy:

$ helm install \
    --generate-name \
    --set compatWithCPUManager=true \
    --set migStrategy=mixed \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz
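
Other flags, such as deviceIDStrategy, can be combined with the direct URL in the same way (again, purely illustrative):

$ helm install \
    --generate-name \
    --set deviceIDStrategy=index \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz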

Building and Running Locally

The next sections focus on building the device plugin locally and running it. This is intended purely for development and testing and is not required by most users. The instructions assume you are pinning to the latest release tag (i.e. v0.10.0), but they can easily be modified to work with any available tag or branch.

With Docker

Build

Option 1, pull the prebuilt image from NGC (nvcr.io):

$ docker pull nvcr.io/nvidia/k8s-device-plugin:v0.10.0
$ docker tag nvcr.io/nvidia/k8s-device-plugin:v0.10.0 nvcr.io/nvidia/k8s-device-plugin:devel

Option 2, build without cloning the repository:

$ docker build \
    -t nvcr.io/nvidia/k8s-device-plugin:devel \
    -f docker/Dockerfile \
    https://github.com/NVIDIA/k8s-device-plugin.git#v0.10.0

Option 3, if you want to modify the code:

$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
$ docker build \
    -t nvcr.io/nvidia/k8s-device-plugin:devel \
    -f docker/Dockerfile \
    .

Run

Without compatibility for the CPUManager static policy:

$ docker run \
    -it \
    --security-opt=no-new-privileges \
    --cap-drop=ALL \
    --network=none \
    -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
    nvcr.io/nvidia/k8s-device-plugin:devel

With compatibility for the CPUManager static policy:

$ docker run \
    -it \
    --privileged \
    --network=none \
    -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
    nvcr.io/nvidia/k8s-device-plugin:devel --pass-device-specs

Without Docker

Build

$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build

Run

Without compatibility for the CPUManager static policy:

$ ./k8s-device-plugin

With compatibility for the CPUManager static policy:

$ ./k8s-device-plugin --pass-device-specs

Changelog

Version v0.10.0

  • Update CUDA base images to 11.4.2
  • Ignore Xid=13 (Graphics Engine Exception) critical errors in device healthcheck
  • Ignore Xid=64 (Video processor exception) critical errors in device healthcheck
  • Build multiarch container images for linux/amd64 and linux/arm64
  • Use Ubuntu 20.04 for Ubuntu-based container images
  • Remove Centos7 images

Version v0.9.0

  • Fix bug when using CPUManager and the device plugin MIG mode not set to "none"
  • Allow passing list of GPUs by device index instead of uuid
  • Move to urfave/cli to build the CLI
  • Support setting command line flags via environment variables

Version v0.8.2

  • Update all dockerhub references to nvcr.io

Version v0.8.1

  • Fix permission error when using NewDevice instead of NewDeviceLite when constructing MIG device map

Version v0.8.0

  • Raise an error if a device has migEnabled=true but has no MIG devices
  • Allow mig.strategy=single on nodes with non-MIG gpus

Version v0.7.3

  • Update vendoring to include bug fix for nvmlEventSetWait_v2

Version v0.7.2

  • Fix bug in Dockerfiles for ubi8 and centos using CMD not ENTRYPOINT

Version v0.7.1

  • Update all Dockerfiles to point to latest cuda-base on nvcr.io

Version v0.7.0

  • Promote v0.7.0-rc.8 to v0.7.0

Version v0.7.0-rc.8

  • Permit configuration of alternative container registry through environment variables.
  • Add an alternate set of gitlab-ci directives under .nvidia-ci.yml
  • Update all k8s dependencies to v1.19.1
  • Update vendoring for NVML Go bindings
  • Move restart loop to force recreate of plugins on SIGHUP

Version v0.7.0-rc.7

  • Fix bug which only allowed running the plugin on machines with CUDA 10.2+ installed

Version v0.7.0-rc.6

  • Add logic to skip / error out when unsupported MIG device encountered
  • Fix bug treating memory as multiple of 1000 instead of 1024
  • Switch to using CUDA base images
  • Add a set of standard tests to the .gitlab-ci.yml file

Version v0.7.0-rc.5

  • Add deviceListStrategyFlag to allow device list passing as volume mounts

Version v0.7.0-rc.4

  • Allow one to override selector.matchLabels in the helm chart
  • Allow one to override the updateStrategy in the helm chart

Version v0.7.0-rc.3

  • Fail the plugin if NVML cannot be loaded
  • Update logging to print to stderr on error
  • Add best effort removal of socket file before serving
  • Add logic to implement GetPreferredAllocation() call from kubelet

Version v0.7.0-rc.2

  • Add the ability to set 'resources' as part of a helm install
  • Add overrides for name and fullname in helm chart
  • Add ability to override image-related parameters in the helm chart
  • Add conditional support for overriding securityContext in helm chart

Version v0.7.0-rc.1

  • Added migStrategy as a parameter to select the MIG strategy to the helm chart
  • Add support for MIG with different strategies {none, single, mixed}
  • Update vendored NVML bindings to latest (to include MIG APIs)
  • Add license in UBI image
  • Update UBI image with certification requirements

Version v0.6.0

  • Update CI, build system, and vendoring mechanism
  • Change versioning scheme to v0.x.x instead of v1.0.0-betax
  • Introduced helm charts as a mechanism to deploy the plugin

Version v0.5.0

  • Add a new plugin.yml variant that is compatible with the CPUManager
  • Change CMD in Dockerfile to ENTRYPOINT
  • Add flag to optionally return list of device nodes in Allocate() call
  • Refactor device plugin to eventually handle multiple resource types
  • Move plugin error retry to event loop so we can exit with a signal
  • Update all vendored dependencies to their latest versions
  • Fix bug that was inadvertently always disabling health checks
  • Update minimal driver version to 384.81

Version v0.4.0

  • Fixes a bug with a nil pointer dereference around getDevices:CPUAffinity

Version v0.3.0

  • Manifest is updated for Kubernetes 1.16+ (apps/v1)
  • Adds more logging information

Version v0.2.0

  • Adds the Topology field for Kubernetes 1.16+

Version v0.1.0

  • If gRPC throws an error, the device plugin no longer ends up in a non-responsive state.

Version v0.0.0

  • Reversioned to SEMVER as device plugins aren't tied to a specific version of Kubernetes anymore.

Version v1.11

  • No change.

Version v1.10

  • The device Plugin API is now v1beta1

Version v1.9

  • The device Plugin API changed and is no longer compatible with 1.8
  • Error messages were added

Issues and Contributing

Checkout the Contributing document!

Versioning

Before v1.10, the versioning scheme of the device plugin had to match the Kubernetes version exactly. After the promotion of device plugins to beta, this condition was no longer required. We quickly noticed that this versioning scheme was very confusing for users, as they still expected to see a version of the device plugin for each version of Kubernetes.

This versioning scheme applies to the tags v1.8, v1.9, v1.10, v1.11, v1.12.

We have now changed the versioning to follow SEMVER. The first version following this scheme has been tagged v0.0.0.

Going forward, the major version of the device plugin will only change following a change in the device plugin API itself. For example, version v1beta1 of the device plugin API corresponds to version v0.x.x of the device plugin. If a new v2beta2 version of the device plugin API comes out, then the device plugin will increase its major version to 1.x.x.

As of now, the device plugin API for Kubernetes >= v1.10 is v1beta1. If you have a version of Kubernetes >= 1.10 you can deploy any device plugin version > v0.0.0.

Upgrading Kubernetes with the Device Plugin

Upgrading Kubernetes when you have a device plugin deployed doesn't require any particular changes to your workflow. The API is versioned and is pretty stable (though it is not guaranteed to be non-breaking). Starting with Kubernetes version 1.10, you can use v0.3.0 of the device plugin to perform upgrades, and Kubernetes won't require you to deploy a different version of the device plugin. Once a node comes back online after the upgrade, you will see GPUs re-registering themselves automatically.

Upgrading the device plugin itself is a more complex task. It is recommended to drain GPU tasks, as we cannot guarantee that GPU tasks will survive a rolling upgrade. However, we make a best effort to preserve GPU tasks during an upgrade.
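
A typical way to drain GPU workloads from a node before upgrading the plugin is the standard kubectl drain/uncordon flow sketched below; <gpu-node> is a placeholder and the drain flags may need adjusting for your cluster and kubectl version:

$ kubectl drain <gpu-node> --ignore-daemonsets --delete-emptydir-data
# upgrade the device plugin on this node, then bring it back
$ kubectl uncordon <gpu-node>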

Comments
  • k8s-device-plugin v1.9 deployment CrashLoopBackOff

    I tried to deploy device-plugin v1.9 on k8s.

    I have a similar problem to the nvidia-device-plugin container CrashLoopBackOff error reported against v1.8, and the container ends up in a CrashLoopBackOff error:

    NAME                                   READY     STATUS             RESTARTS   AGE
    nvidia-device-plugin-daemonset-2h9rh   0/1       CrashLoopBackOff   11          33m
    

    Building and running the plugin locally with docker also shows the problem:

    docker build -t nvidia/k8s-device-plugin:1.9 .
    
    Successfully built d12ed13b386a
    Successfully tagged nvidia/k8s-device-plugin:1.9
    
    14:25:40 Loading NVML
    14:25:40 Failed to start nvml with error: could not load NVML library.
    

    Environment:

    $ cat /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf 
    /usr/lib/nvidia-384
    /usr/lib32/nvidia-384
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
    | 38%   29C    P8     6W / 120W |      0MiB /  6069MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
    
    

    And I used docker run --runtime=nvidia --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9

    It shows this error:

    2017/12/27 14:38:22 Loading NVML
    2017/12/27 14:38:22 Fetching devices.
    2017/12/27 14:38:22 Starting FS watcher.
    2017/12/27 14:38:22 Starting OS watcher.
    2017/12/27 14:38:22 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2017/12/27 14:38:27 Could not register device plugin: context deadline exceeded
    2017/12/27 14:38:27 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
    2017/12/27 14:38:27 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2017/12/27 14:38:32 Could not register device plugin: context deadline exceeded
    2017/12/27 14:38:32 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
    2017/12/27 14:38:32 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2017/12/27 14:38:37 Could not register device plugin: context deadline exceeded
    .
    .
    .
    
    
  • Pods are not scheduled in all GPUs of a physical server.

    Description: In the hardware configuration below, while trying to deploy the NVIDIA Triton service with 4 replicas (1 GPU each) on this server, 3 pods were running but the 4th pod would not spin up, and the following error was displayed.

    FailedScheduling    pod/model-2-79d7d6786c-bprm8    0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 1 node(s) didn't match Pod's node affinity/selector.
    

    Information about the environment

    • [x] Hardware - RTX A5000 x 4 GPUs
    • [x] Hardware - Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
    • [x] Operating System - Linux agentnode 5.4.0-124-generic 140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    • [x] GPU-Operator helm version: gpu-operator-v1.11.1

    Deployment file used:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: model-infr
    spec:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: model-infr
        spec:
          containers:
          - args:
            - pip3 install opencv-python-headless && tritonserver --model-store=s3://model-infr/
            command:
            - /bin/sh
            - -c
            image: nvcr.io/nvidia/tritonserver:22.06-py3
            imagePullPolicy: IfNotPresent
            name: tritonserver
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            - containerPort: 8001
              name: grpc
              protocol: TCP
            - containerPort: 8002
              name: metrics
              protocol: TCP
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-RTX-A5000
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
            name: dshm
    

    Checking nvidia-smi on the actual system gave the output below, which clearly shows that GPU 0 was free to schedule on.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA RTX A5000    Off  | 00000000:31:00.0 Off |                  Off |
    | 30%   46C    P8    22W / 230W |     10MiB / 24564MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA RTX A5000    Off  | 00000000:4B:00.0 Off |                  Off |
    | 48%   76C    P2   149W / 230W |  13222MiB / 24564MiB |     29%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA RTX A5000    Off  | 00000000:B1:00.0 Off |                  Off |
    | 52%   81C    P2   166W / 230W |  13222MiB / 24564MiB |     57%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA RTX A5000    Off  | 00000000:CA:00.0 Off |                  Off |
    | 52%   80C    P2   176W / 230W |  13222MiB / 24564MiB |     66%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
    |    0   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
    |    1   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
    |    1   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
    |    1   N/A  N/A    911402      C   tritonserver                    13209MiB |
    |    2   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
    |    2   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
    |    2   N/A  N/A    911401      C   tritonserver                    13209MiB |
    |    3   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
    |    3   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
    |    3   N/A  N/A    911223      C   tritonserver                    13209MiB |
    +-----------------------------------------------------------------------------+
    

    Expected behavior

    Expecting Triton to schedule pods in all 4 GPUs.

    Common error checking:

    • [x] The output of nvidia-smi -a on your host
    GPU 00000000:31:00.0
        Product Name                          : NVIDIA RTX A5000
        Product Brand                         : NVIDIA RTX
        Product Architecture                  : Ampere
        Display Mode                          : Disabled
        Display Active                        : Disabled
        Persistence Mode                      : Disabled
        MIG Mode
            Current                           : N/A
            Pending                           : N/A
        Accounting Mode                       : Disabled
        Accounting Mode Buffer Size           : 4000
        Driver Model
            Current                           : N/A
            Pending                           : N/A
        Serial Number                         : 1321721010340
        GPU UUID                              : GPU-94fc63c0-00d3-0076-199f-f67cd9e4c4d2
        Minor Number                          : 0
        VBIOS Version                         : 94.02.6D.00.05
        MultiGPU Board                        : No
        Board ID                              : 0x3100
        GPU Part Number                       : 900-5G132-2200-000
        Module ID                             : 0
        Inforom Version
            Image Version                     : G132.0500.00.01
            OEM Object                        : 2.0
            ECC Object                        : 6.16
            Power Management Object           : N/A
        GPU Operation Mode
            Current                           : N/A
            Pending                           : N/A
        GSP Firmware Version                  : N/A
        GPU Virtualization Mode
            Virtualization Mode               : None
            Host VGPU Mode                    : N/A
        IBMNPU
            Relaxed Ordering Mode             : N/A
        PCI
            Bus                               : 0x31
            Device                            : 0x00
            Domain                            : 0x0000
            Device Id                         : 0x223110DE
            Bus Id                            : 00000000:31:00.0
            Sub System Id                     : 0x147E10DE
            GPU Link Info
                PCIe Generation
                    Max                       : 4
                    Current                   : 1
                Link Width
                    Max                       : 16x
                    Current                   : 16x
            Bridge Chip
                Type                          : N/A
                Firmware                      : N/A
            Replays Since Reset               : 0
            Replay Number Rollovers           : 0
            Tx Throughput                     : 0 KB/s
            Rx Throughput                     : 0 KB/s
        Fan Speed                             : 30 %
        Performance State                     : P8
        Clocks Throttle Reasons
            Idle                              : Active
            Applications Clocks Setting       : Not Active
            SW Power Cap                      : Not Active
            HW Slowdown                       : Not Active
                HW Thermal Slowdown           : Not Active
                HW Power Brake Slowdown       : Not Active
            Sync Boost                        : Not Active
            SW Thermal Slowdown               : Not Active
            Display Clock Setting             : Not Active
        FB Memory Usage
            Total                             : 24564 MiB
            Reserved                          : 307 MiB
            Used                              : 10 MiB
            Free                              : 24245 MiB
        BAR1 Memory Usage
            Total                             : 256 MiB
            Used                              : 3 MiB
            Free                              : 253 MiB
        Compute Mode                          : Default
        Utilization
            Gpu                               : 0 %
            Memory                            : 0 %
            Encoder                           : 0 %
            Decoder                           : 0 %
        Encoder Stats
            Active Sessions                   : 0
            Average FPS                       : 0
            Average Latency                   : 0
        FBC Stats
            Active Sessions                   : 0
            Average FPS                       : 0
            Average Latency                   : 0
        Ecc Mode
            Current                           : Disabled
            Pending                           : Disabled
        ECC Errors
            Volatile
                SRAM Correctable              : N/A
                SRAM Uncorrectable            : N/A
                DRAM Correctable              : N/A
                DRAM Uncorrectable            : N/A
            Aggregate
                SRAM Correctable              : N/A
                SRAM Uncorrectable            : N/A
                DRAM Correctable              : N/A
                DRAM Uncorrectable            : N/A
        Retired Pages
            Single Bit ECC                    : N/A
            Double Bit ECC                    : N/A
            Pending Page Blacklist            : N/A
        Remapped Rows
            Correctable Error                 : 0
            Uncorrectable Error               : 0
            Pending                           : No
            Remapping Failure Occurred        : No
            Bank Remap Availability Histogram
                Max                           : 192 bank(s)
                High                          : 0 bank(s)
                Partial                       : 0 bank(s)
                Low                           : 0 bank(s)
                None                          : 0 bank(s)
        Temperature
            GPU Current Temp                  : 40 C
            GPU Shutdown Temp                 : 98 C
            GPU Slowdown Temp                 : 95 C
            GPU Max Operating Temp            : 90 C
            GPU Target Temperature            : 84 C
            Memory Current Temp               : N/A
            Memory Max Operating Temp         : N/A
        Power Readings
            Power Management                  : Supported
            Power Draw                        : 20.57 W
            Power Limit                       : 230.00 W
            Default Power Limit               : 230.00 W
            Enforced Power Limit              : 230.00 W
            Min Power Limit                   : 100.00 W
            Max Power Limit                   : 230.00 W
        Clocks
            Graphics                          : 0 MHz
            SM                                : 0 MHz
            Memory                            : 405 MHz
            Video                             : 555 MHz
        Applications Clocks
            Graphics                          : 1695 MHz
            Memory                            : 8001 MHz
        Default Applications Clocks
            Graphics                          : 1695 MHz
            Memory                            : 8001 MHz
        Max Clocks
            Graphics                          : 2100 MHz
            SM                                : 2100 MHz
            Memory                            : 8001 MHz
            Video                             : 1950 MHz
        Max Customer Boost Clocks
            Graphics                          : N/A
        Clock Policy
            Auto Boost                        : N/A
            Auto Boost Default                : N/A
        Voltage
            Graphics                          : 0.000 mV
        Processes
            GPU instance ID                   : N/A
            Compute instance ID               : N/A
            Process ID                        : 1580
                Type                          : G
                Name                          : /usr/lib/xorg/Xorg
                Used GPU Memory               : 4 MiB
            GPU instance ID                   : N/A
            Compute instance ID               : N/A
            Process ID                        : 2231
                Type                          : G
                Name                          : /usr/lib/xorg/Xorg
                Used GPU Memory               : 4 MiB
    
    
    • [x] The k8s-device-plugin container logs
    2022/08/11 06:57:10 Starting Plugins.
    2022/08/11 06:57:10 Loading configuration.
    2022/08/11 06:57:10 Initializing NVML.
    2022/08/11 06:57:10 Updating config with default resource matching patterns.
    2022/08/11 06:57:10
    Running with config:
    {
      "version": "v1",
      "flags": {
        "migStrategy": "single",
        "failOnInitError": true,
        "nvidiaDriverRoot": "/",
        "plugin": {
          "passDeviceSpecs": true,
          "deviceListStrategy": "envvar",
          "deviceIDStrategy": "uuid"
        }
      },
      "resources": {
        "gpus": [
          {
            "pattern": "*",
            "name": "nvidia.com/gpu"
          }
        ],
        "mig": [
          {
            "pattern": "*",
            "name": "nvidia.com/gpu"
          }
        ]
      },
      "sharing": {
        "timeSlicing": {}
      }
    }
    2022/08/11 06:57:10 Retreiving plugins.
    2022/08/11 06:57:10 No MIG devices found. Falling back to mig.strategy=none
    2022/08/11 06:57:10 Starting GRPC server for 'nvidia.com/gpu'
    2022/08/11 06:57:10 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
    2022/08/11 06:57:10 Registered device plugin for 'nvidia.com/gpu' with Kubelet
    
    • [x] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet): No specific errors are logged with respect to this pod scheduling or NVIDIA.

    Additional information that might help better understand your environment and reproduce the bug:

    • [x] Kubelet version: Kubernetes v1.21.14+rke2r1
    • [x] Containerd version: containerd github.com/k3s-io/containerd v1.4.13-k3s1 04203d2174f8b8d05bcec98000fba67c0aa69223
  • OpenShift 3.9/Docker-CE, Could not register device plugin: context deadline exceeded

    I followed the blog post "How to use GPUs with Device Plugin in OpenShift 3.9 (Now Tech Preview!)" on blog.openshift.com.

    In my case, nvidia-device-plugin shows errors like below:

    # oc logs -f nvidia-device-plugin-daemonset-nj9p8
    2018/06/06 12:40:11 Loading NVML
    2018/06/06 12:40:11 Fetching devices.
    2018/06/06 12:40:11 Starting FS watcher.
    2018/06/06 12:40:11 Starting OS watcher.
    2018/06/06 12:40:11 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2018/06/06 12:40:16 Could not register device plugin: context deadline exceeded
    2018/06/06 12:40:16 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
    2018/06/06 12:40:16 You can check the prerequisites at: https://github.com/NVIDIA/k...
    2018/06/06 12:40:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k...
    2018/06/06 12:40:16 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    ...
    
    • The description of one of the device-plugin-daemonset pods is:
    # oc describe pod nvidia-device-plugin-daemonset-2
    Name:           nvidia-device-plugin-daemonset-2jqgk
    Namespace:      nvidia
    Node:           node02/192.168.5.102
    Start Time:     Wed, 06 Jun 2018 22:59:32 +0900
    Labels:         controller-revision-hash=4102904998
                    name=nvidia-device-plugin-ds
                    pod-template-generation=1
    Annotations:    openshift.io/scc=nvidia-deviceplugin
    Status:         Running
    IP:             192.168.5.102
    Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
    Containers:
      nvidia-device-plugin-ctr:
        Container ID:   docker://b92280bd124df9fd46fe08ab4bbda76e2458cf5572f5ffc651661580bcd9126d
        Image:          nvidia/k8s-device-plugin:1.9
        Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:7ba244bce75da00edd907209fe4cf7ea8edd0def5d4de71939899534134aea31
        Port:           <none>
        State:          Running
          Started:      Wed, 06 Jun 2018 22:59:34 +0900
        Ready:          True
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /var/lib/kubelet/device-plugins from device-plugin (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from nvidia-deviceplugin-token-cv7p5 (ro)
    Conditions:
      Type           Status
      Initialized    True 
      Ready          True 
      PodScheduled   True 
    Volumes:
      device-plugin:
        Type:          HostPath (bare host directory volume)
        Path:          /var/lib/kubelet/device-plugins
        HostPathType:  
      nvidia-deviceplugin-token-cv7p5:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  nvidia-deviceplugin-token-cv7p5
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                     node.kubernetes.io/memory-pressure:NoSchedule
                     node.kubernetes.io/not-ready:NoExecute
                     node.kubernetes.io/unreachable:NoExecute
    Events:
      Type    Reason                 Age   From             Message
      ----    ------                 ----  ----             -------
      Normal  SuccessfulMountVolume  1h    kubelet, node02  MountVolume.SetUp succeeded for volume "device-plugin"
      Normal  SuccessfulMountVolume  1h    kubelet, node02  MountVolume.SetUp succeeded for volume "nvidia-deviceplugin-token-cv7p5"
      Normal  Pulled                 1h    kubelet, node02  Container image "nvidia/k8s-device-plugin:1.9" already present on machine
      Normal  Created                1h    kubelet, node02  Created container
      Normal  Started                1h    kubelet, node02  Started container
    
    • And running "docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9" shows the log messages just like above.

    • On each origin node, a docker run test shows the following (this is normal, right?):

    # docker run --rm nvidia/cuda nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
    Tesla-P40
    
    # docker run -it --rm docker.io/mirrorgoogleconta...
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    

    [Test Env.]

    • 1 Master with OpenShift v3.9(Origin)
    • 2 GPU nodes with Tesla-P40*2
    • Docker-CE, nvidia-docker2 on GPU nodes

    [Master]

    # oc version
    oc v3.9.0+46ff3a0-18
    kubernetes v1.9.1+a0ce1bc657
    features: Basic-Auth GSSAPI Kerberos SPNEGO
    
    Server https://MYDOMAIN.local:8443
    openshift v3.9.0+46ff3a0-18
    kubernetes v1.9.1+a0ce1bc657
    
    # uname -r
    3.10.0-862.3.2.el7.x86_64
    
    # cat /etc/redhat-release 
    CentOS Linux release 7.5.1804 (Core)
    

    [GPU nodes]

    # docker version
    Client:
    Version: 18.03.1-ce
    API version: 1.37
    Go version: go1.9.5
    Git commit: 9ee9f40
    Built: Thu Apr 26 07:20:16 2018
    OS/Arch: linux/amd64
    Experimental: false
    Orchestrator: swarm
    
    Server:
    Engine:
    Version: 18.03.1-ce
    API version: 1.37 (minimum version 1.12)
    Go version: go1.9.5
    Git commit: 9ee9f40
    Built: Thu Apr 26 07:23:58 2018
    OS/Arch: linux/amd64
    Experimental: false
    
    # uname -r
    3.10.0-862.3.2.el7.x86_64
    
    # cat /etc/redhat-release 
    CentOS Linux release 7.5.1804 (Core)
    
    # docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    4b1a37d31cb9 openshift/node:v3.9.0 "/usr/local/bin/orig…" 22 minutes ago Up 21 minutes origin-node
    efbedeeb88f0 fe3e6b0d95b5 "nvidia-device-plugin" About an hour ago Up About an hour k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
    36aa988447b8 openshift/origin-pod:v3.9.0 "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
    6e6b598fa144 openshift/openvswitch:v3.9.0 "/usr/local/bin/ovs-…" 2 hours ago Up 2 hours openvswitch
    
    # cat /etc/docker/daemon.json 
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    

    Please help me with this problem. TIA!

  • 0/1 nodes are available: 1 Insufficient nvidia.com/gpu

    Deploying any pods with an nvidia.com/gpu resource limit results in "0/1 nodes are available: 1 Insufficient nvidia.com/gpu."

    I also see this error in the Daemonset POD logs: 2018/02/27 16:43:50 Warning: GPU with UUID GPU-edae6d5d-6698-fb8d-2c6b-2a791224f089 is too old to support healtchecking with error: %!s(MISSING). Marking it unhealthy

    I am running nvidia-docker2 and have deployed the nvidia device plugin as a daemonset.

    On the worker node, uname -a gives:

    Linux gpu 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

    docker run --rm nvidia/cuda nvidia-smi

    Wed Feb 28 18:07:07 2018
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 760     Off  | 00000000:0B:00.0 N/A |                  N/A |
    | 34%   43C    P8    N/A /  N/A |      0MiB /  1999MiB |     N/A      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 760     Off  | 00000000:90:00.0 N/A |                  N/A |
    | 34%   42C    P8    N/A /  N/A |      0MiB /  1999MiB |     N/A      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0                    Not Supported                                       |
    |    1                    Not Supported                                       |
    +-----------------------------------------------------------------------------+

  • Crio integration?

    Hi

    I am trying to use crio with nvidia-runtime-hook, as explained in (1). However, after creating this daemonset, I run 'kubectl describe nodes' and I don't see any mention of nvidia gpus, plus the pods that require them are stuck in a pending state.

    Have you tried this with crio? Do you have instructions on how to make it work? And how can I debug it and get more info?

    Thanks

  • Multiple pods share one GPU

    Issue or feature description

    Nvidia GeForce GTX 1050 Ti is ready on my host, the nvidia k8s-device-plugin is running well, and I can see that nvidia.com/gpu is ready:

    # kubectl describe node k8s
    ...
    Capacity:
     cpu:                 8
     ephemeral-storage:   75881276Ki
     hugepages-1Gi:       0
     hugepages-2Mi:       0
     memory:              16362632Ki
     nvidia.com/gpu:      1
    Allocatable:
     cpu:                 8
     ephemeral-storage:   69932183846
     hugepages-1Gi:       0
     hugepages-2Mi:       0
     memory:              16260232Ki
     nvidia.com/gpu:      1
    

    However, the nvidia.com/gpu resource value is only 1, so pod-1 holds the whole Nvidia GeForce GTX 1050 Ti GPU resource and pod-2 cannot be deployed because there is no free nvidia.com/gpu resource.

    So, can GPU resource be shared with multiple pods?

    Thanks

  •  0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.

    I deployed the device-plugin container on k8s via the guide. But when I run tensorflow-notebook (by executing kubectl create -f tensorflow-notebook.yml), the pod was still pending:

    [root@mlssdi010001 k8s]# kubectl describe pod tf-notebook-747db6987b-86zts
    Name:     tf-notebook-747db6987b-86zts
    ....
    Events:
      Type     Reason            Age                From               Message
      Warning  FailedScheduling  47s (x15 over 3m)  default-scheduler  0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.

    Pod info:

    [root@mlssdi010001 k8s]# kubectl get pod --all-namespaces -o wide
    NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE
    default       tf-notebook-747db6987b-86zts           0/1     Pending   0          5s
    ....
    kube-system   nvidia-device-plugin-daemonset-ljrwc   1/1     Running   0          34s   10.244.1.11   mlssdi010003
    kube-system   nvidia-device-plugin-daemonset-m7h2r   1/1     Running   0          34s   10.244.2.12   mlssdi010002

    Nodes info:

    NAME           STATUS   ROLES    AGE   VERSION
    mlssdi010001   Ready    master   1d    v1.9.0
    mlssdi010002   Ready             1d    v1.9.0   (GPU Node, 1 * Tesla M40)
    mlssdi010003   Ready             1d    v1.9.0   (GPU Node, 1 * Tesla M40)

  • Device Plugin is not returning with an error, Pod not restarted

    1. Issue or feature description

    The device plugin is not returning an error if it fails.

    2020/05/14 02:11:19 Loading NVML
    2020/05/14 02:11:19 Failed to initialize NVML: could not load NVML library.
    2020/05/14 02:11:19 If this is a GPU node, did you set the docker default runtime to `nvidia`?
    2020/05/14 02:11:19 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    2020/05/14 02:11:19 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
    

    The Pod shows Running and is not restarted. During scale-up the DevicePlugin can start before the driver and hook are deployed.

    2. Steps to reproduce the issue

    Deploy the GPU operator on a single node and scale up to two nodes (OpenShift).

  • k8s-device-plugin fails with k8s static CPU policy

    1. Issue or feature description

    Kubelet configured with a static CPU policy (e.g. --cpu-manager-policy=static --kube-reserved cpu=0.1) will cause nvidia-smi to fail after a short delay.

    Configure a test pod to request an nvidia.com/gpu resource, then run a simple nvidia-smi command such as "sleep 30; nvidia-smi"; this always fails with "Failed to initialize NVML: Unknown Error".

    Running the same command without the sleep works, and nvidia-smi returns the expected info.

    2. Steps to reproduce the issue

    Kubernetes 1.14:

    $ kubelet --version
    Kubernetes v1.14.8

    Device plugin: nvidia/k8s-device-plugin:1.11 (also with 1.0.0-beta4)

    Apply the daemonset for the nvidia plugin, then apply a pod yaml for a pod requesting one device:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gputest
    spec:
      containers:
      - command:
        - /bin/bash
        args:
        - -c
        - "sleep 30; nvidia-smi"
        image: nvidia/cuda:8.0-runtime-ubuntu16.04
        name: app
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: "1"
      restartPolicy: Never
      tolerations:
      - effect: NoSchedule
        operator: Exists
      nodeSelector:
        beta.kubernetes.io/arch: amd64
    

    then follow the pod logs:

    Failed to initialize NVML: Unknown Error
    

    The pod persists in this state.

    3. Information to attach (optional if deemed irrelevant)

    Common error checking:

    • [ ] The output of nvidia-smi -a on your host
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Tue Nov 12 12:22:08 2019
    Driver Version                      : 390.30
    
    Attached GPUs                       : 1
    GPU 00000000:03:00.0
        Product Name                    : Tesla M2090
        Product Brand                   : Tesla
        Display Mode                    : Disabled
        Display Active                  : Disabled
        Persistence Mode                : Disabled
        Accounting Mode                 : N/A
        Accounting Mode Buffer Size     : N/A
        Driver Model
            Current                     : N/A
            Pending                     : N/A
        Serial Number                   : 0320512020115
        GPU UUID                        : GPU-f473d23b-0a01-034e-933b-58d52ca40425
        Minor Number                    : 0
        VBIOS Version                   : 70.10.46.00.01
        MultiGPU Board                  : No
        Board ID                        : 0x300
        GPU Part Number                 : N/A
        Inforom Version
            Image Version               : N/A
            OEM Object                  : 1.1
            ECC Object                  : 2.0
            Power Management Object     : 4.0
        GPU Operation Mode
            Current                     : N/A
            Pending                     : N/A
        GPU Virtualization Mode
            Virtualization mode         : None
        PCI
            Bus                         : 0x03
            Device                      : 0x00
            Domain                      : 0x0000
            Device Id                   : 0x109110DE
            Bus Id                      : 00000000:03:00.0
            Sub System Id               : 0x088710DE
            GPU Link Info
                PCIe Generation
                    Max                 : 2
                    Current             : 1
                Link Width
                    Max                 : 16x
                    Current             : 16x
            Bridge Chip
                Type                    : N/A
                Firmware                : N/A
            Replays since reset         : N/A
            Tx Throughput               : N/A
            Rx Throughput               : N/A
        Fan Speed                       : N/A
        Performance State               : P12
        Clocks Throttle Reasons         : N/A
        FB Memory Usage
            Total                       : 6067 MiB
            Used                        : 0 MiB
            Free                        : 6067 MiB
        BAR1 Memory Usage
            Total                       : N/A
            Used                        : N/A
            Free                        : N/A
        Compute Mode                    : Default
        Utilization
            Gpu                         : 0 %
            Memory                      : 0 %
            Encoder                     : N/A
            Decoder                     : N/A
        Encoder Stats
            Active Sessions             : 0
            Average FPS                 : 0
            Average Latency             : 0
        Ecc Mode
            Current                     : Disabled
            Pending                     : Disabled
        ECC Errors
            Volatile
                Single Bit
                    Device Memory       : N/A
                    Register File       : N/A
                    L1 Cache            : N/A
                    L2 Cache            : N/A
                    Texture Memory      : N/A
                    Texture Shared      : N/A
                    CBU                 : N/A
                    Total               : N/A
                Double Bit
                    Device Memory       : N/A
                    Register File       : N/A
                    L1 Cache            : N/A
                    L2 Cache            : N/A
                    Texture Memory      : N/A
                    Texture Shared      : N/A
                    CBU                 : N/A
                    Total               : N/A
            Aggregate
                Single Bit
                    Device Memory       : N/A
                    Register File       : N/A
                    L1 Cache            : N/A
                    L2 Cache            : N/A
                    Texture Memory      : N/A
                    Texture Shared      : N/A
                    CBU                 : N/A
                    Total               : N/A
                Double Bit
                    Device Memory       : N/A
                    Register File       : N/A
                    L1 Cache            : N/A
                    L2 Cache            : N/A
                    Texture Memory      : N/A
                    Texture Shared      : N/A
                    CBU                 : N/A
                    Total               : N/A
        Retired Pages
            Single Bit ECC              : N/A
            Double Bit ECC              : N/A
            Pending                     : N/A
        Temperature
            GPU Current Temp            : N/A
            GPU Shutdown Temp           : N/A
            GPU Slowdown Temp           : N/A
            GPU Max Operating Temp      : N/A
            Memory Current Temp         : N/A
            Memory Max Operating Temp   : N/A
        Power Readings
            Power Management            : Supported
            Power Draw                  : 29.81 W
            Power Limit                 : 225.00 W
            Default Power Limit         : N/A
            Enforced Power Limit        : N/A
            Min Power Limit             : N/A
            Max Power Limit             : N/A
        Clocks
            Graphics                    : 50 MHz
            SM                          : 101 MHz
            Memory                      : 135 MHz
            Video                       : 135 MHz
        Applications Clocks
            Graphics                    : N/A
            Memory                      : N/A
        Default Applications Clocks
            Graphics                    : N/A
            Memory                      : N/A
        Max Clocks
            Graphics                    : 650 MHz
            SM                          : 1301 MHz
            Memory                      : 1848 MHz
            Video                       : 540 MHz
        Max Customer Boost Clocks
            Graphics                    : N/A
        Clock Policy
            Auto Boost                  : N/A
            Auto Boost Default          : N/A
        Processes                       : None
    
    • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
    {
        "experimental": true,
        "storage-driver": "overlay2",
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
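
    For reference, one way to confirm that the daemon actually picked up a configuration like this is to check the default runtime it reports (a quick sketch; the exact docker info wording may vary slightly between Docker versions):

    # Reload the daemon config, then check which runtime is the default
    sudo systemctl restart docker
    docker info | grep -i 'default runtime'
    # Expected to show something like: Default Runtime: nvidia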
    
    • [ ] The k8s-device-plugin container logs
    2019/11/11 19:10:56 Loading NVML
    2019/11/11 19:10:56 Fetching devices.
    2019/11/11 19:10:56 Starting FS watcher.
    2019/11/11 19:10:56 Starting OS watcher.
    2019/11/11 19:10:56 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2019/11/11 19:10:56 Registered device plugin with Kubelet
    
    • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet), which repeatedly show:
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: E1112 12:32:21.880196    8053 cpu_manager.go:252] [cpumanager] reconcileState: failed to add container (pod: kube-proxy-bm82q, container: kube-proxy, container id: 92273ce7687ead38fb1c59b18934179183ea1b9e4f59107e92eec2f987bb91be, error: rpc error: code = Unknown desc
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: I1112 12:32:21.880175    8053 policy_static.go:195] [cpumanager] static policy: RemoveContainer (container id: 92273ce7687ead38fb1c59b18934179183ea1b9e4f59107e92eec2f987bb91be)
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: : unknown
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: E1112 12:32:21.880153    8053 cpu_manager.go:183] [cpumanager] AddContainer error: rpc error: code = Unknown desc = failed to update container "92273ce7687ead38fb1c59b18934179183ea1b9e4f59107e92eec2f987bb91be": Error response from daemon: Cannot update container 92273
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: : unknown
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: E1112 12:32:21.880081    8053 remote_runtime.go:350] UpdateContainerResources "92273ce7687ead38fb1c59b18934179183ea1b9e4f59107e92eec2f987bb91be" from runtime service failed: rpc error: code = Unknown desc = failed to update container "92273ce7687ead38fb1c59b1893417918
    

    Additional information that might help better understand your environment and reproduce the bug:

    • [ ] Docker version from docker version: 18.09.1

    • [ ] Docker command, image and tag used

    • [ ] Kernel version from uname -a

    Linux dal1k8s-worker-06 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
    
    • [ ] Any relevant kernel output lines from dmesg
    [    2.840610] nvidia: module license 'NVIDIA' taints kernel.
    [    2.879301] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
    [    2.911779] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.30  Wed Jan 31 21:32:48 PST 2018
    [    2.912960] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
    [   13.893608] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 242
    
    • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                                                                      Version                                   Architecture                              Description
    +++-=========================================================================-=========================================-=========================================-=======================================================================================================================================================
    ii  libnvidia-container-tools                                                 1.0.1-1                                   amd64                                     NVIDIA container runtime library (command-line tools)
    ii  libnvidia-container1:amd64                                                1.0.1-1                                   amd64                                     NVIDIA container runtime library
    ii  nvidia-390                                                                390.30-0ubuntu1                           amd64                                     NVIDIA binary driver - version 390.30
    ii  nvidia-container-runtime                                                  2.0.0+docker18.09.1-1                     amd64                                     NVIDIA container runtime
    ii  nvidia-container-runtime-hook                                             1.4.0-1                                   amd64                                     NVIDIA container runtime hook
    un  nvidia-current                                                            <none>                                    <none>                                    (no description available)
    un  nvidia-docker                                                             <none>                                    <none>                                    (no description available)
    ii  nvidia-docker2                                                            2.0.3+docker18.09.1-1                     all                                       nvidia-docker CLI wrapper
    un  nvidia-driver-binary                                                      <none>                                    <none>                                    (no description available)
    un  nvidia-legacy-340xx-vdpau-driver                                          <none>                                    <none>                                    (no description available)
    un  nvidia-libopencl1-390                                                     <none>                                    <none>                                    (no description available)
    un  nvidia-libopencl1-dev                                                     <none>                                    <none>                                    (no description available)
    un  nvidia-opencl-icd                                                         <none>                                    <none>                                    (no description available)
    ii  nvidia-opencl-icd-390                                                     390.30-0ubuntu1                           amd64                                     NVIDIA OpenCL ICD
    un  nvidia-persistenced                                                       <none>                                    <none>                                    (no description available)
    ii  nvidia-prime                                                              0.8.2                                     amd64                                     Tools to enable NVIDIA's Prime
    ii  nvidia-settings                                                           410.79-0ubuntu1                           amd64                                     Tool for configuring the NVIDIA graphics driver
    un  nvidia-settings-binary                                                    <none>                                    <none>                                    (no description available)
    un  nvidia-smi                                                                <none>                                    <none>                                    (no description available)
    un  nvidia-vdpau-driver                                                       <none>                                    <none>                                    (no description available)
    
    • [ ] NVIDIA container library version from nvidia-container-cli -V
    version: 1.0.1
    build date: 2019-01-15T23:24+00:00
    build revision: 038fb92d00c94f97d61492d4ed1f82e981129b74
    build compiler: gcc-5 5.4.0 20160609
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
    
    
    • [ ] NVIDIA container library logs (see troubleshooting: https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting)
    
  • nvidia-device-plugin container CrashLoopBackOff error

    nvidia-device-plugin container CrashLoopBackOff error

    I deployed the device-plugin container on k8s following the guide. However, the container goes into CrashLoopBackOff:

    NAME                                   READY     STATUS             RESTARTS   AGE
    nvidia-device-plugin-daemonset-zb8xn   0/1       CrashLoopBackOff   6          9m
    

    And when I run

    docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.8

    I get an error like this:

    2017/11/29 01:54:30 Loading NVML
    2017/11/29 01:54:30 could not load NVML library
    

    But I am pretty sure that I have installed the NVML library. Did I miss anything here? How can I check whether the NVML library is installed correctly?
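
    For what it's worth, a rough way to check this (a sketch, assuming the driver was installed from distribution packages and nvidia-docker2 is configured as the default runtime) is to verify that the NVML shared library is visible on the host and injected into GPU containers:

    # NVML ships with the driver, not with CUDA; check that it is registered on the host
    ldconfig -p | grep libnvidia-ml

    # Check that the nvidia runtime injects it into containers
    docker run --rm --runtime=nvidia nvcr.io/nvidia/cuda:9.0-devel nvidia-smi

    If the second command fails with the same "could not load NVML" message, the container runtime is not mounting the driver libraries, which usually points at the nvidia-docker / default-runtime configuration rather than at the plugin itself.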

  • pod fail to find gpu some time after created

    pod fail to find gpu some time after created

    The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

    1. Issue or feature description

    On version v0.10.0: at first, the pod was able to get the GPU resource, but some time later it can no longer find the GPU and fails with the error below. I did not modify cpu_manager_policy, and I set compatWithCPUManager to true.

    root@instance-81:/# nvidia-smi
    Failed to initialize NVML: Unknown Error
    

    2. Steps to reproduce the issue

    Install nvidia-device-plugin with helm using the following values (an example install command is sketched after the values):

    compatWithCPUManager: true
    resources:
        limits:
          cpu: 10m
          memory: 50Mi
        requests:
          cpu: 5m
          memory: 30Mi
    image:
      repository: nvcr.io/nvidia/k8s-device-plugin
      pullPolicy: IfNotPresent
      # Overrides the image tag whose default is the chart appVersion.
      tag: "v0.10.0"
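
    For reference, a minimal sketch of how such a values file can be applied (assuming the chart repository is registered under the name nvdp and the block above is saved as values.yaml):

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    # compatWithCPUManager is the documented option for nodes running the kubelet CPU manager static policy
    helm upgrade -i nvdp nvdp/nvidia-device-plugin \
      --namespace nvidia-device-plugin --create-namespace \
      --version v0.10.0 \
      -f values.yaml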
    

    3. Information to attach (optional if deemed irrelevant)

    Common error checking:

    • [ ] The output of nvidia-smi -a on your host
    • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
    • [ ] The k8s-device-plugin container logs
    • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

    Additional information that might help better understand your environment and reproduce the bug:

    • [ ] Docker version from docker version: docker://20.10.7
    • [ ] Docker command, image and tag used
    • [ ] Kernel version from uname -a
    • [ ] Any relevant kernel output lines from dmesg
    • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    ||/ Name                                                      Version                           Architecture                      Description
    +++-=========================================================-=================================-=================================-=======================================================================================================================
    un  libgldispatch0-nvidia                                     <none>                            <none>                            (no description available)
    ii  libnvidia-cfg1-465:amd64                                  465.19.01-0ubuntu1                amd64                             NVIDIA binary OpenGL/GLX configuration library
    un  libnvidia-cfg1-any                                        <none>                            <none>                            (no description available)
    un  libnvidia-common                                          <none>                            <none>                            (no description available)
    ii  libnvidia-common-465                                      465.19.01-0ubuntu1                all                               Shared files used by the NVIDIA libraries
    un  libnvidia-compute                                         <none>                            <none>                            (no description available)
    rc  libnvidia-compute-460:amd64                               460.91.03-0ubuntu0.18.04.1        amd64                             NVIDIA libcompute package
    ii  libnvidia-compute-465:amd64                               465.19.01-0ubuntu1                amd64                             NVIDIA libcompute package
    ii  libnvidia-container-tools                                 1.7.0-1                           amd64                             NVIDIA container runtime library (command-line tools)
    ii  libnvidia-container1:amd64                                1.7.0-1                           amd64                             NVIDIA container runtime library
    un  libnvidia-decode                                          <none>                            <none>                            (no description available)
    ii  libnvidia-decode-465:amd64                                465.19.01-0ubuntu1                amd64                             NVIDIA Video Decoding runtime libraries
    un  libnvidia-encode                                          <none>                            <none>                            (no description available)
    ii  libnvidia-encode-465:amd64                                465.19.01-0ubuntu1                amd64                             NVENC Video Encoding runtime library
    un  libnvidia-extra                                           <none>                            <none>                            (no description available)
    ii  libnvidia-extra-465:amd64                                 465.19.01-0ubuntu1                amd64                             Extra libraries for the NVIDIA driver
    un  libnvidia-fbc1                                            <none>                            <none>                            (no description available)
    ii  libnvidia-fbc1-465:amd64                                  465.19.01-0ubuntu1                amd64                             NVIDIA OpenGL-based Framebuffer Capture runtime library
    un  libnvidia-gl                                              <none>                            <none>                            (no description available)
    ii  libnvidia-gl-465:amd64                                    465.19.01-0ubuntu1                amd64                             NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
    un  libnvidia-ifr1                                            <none>                            <none>                            (no description available)
    ii  libnvidia-ifr1-465:amd64                                  465.19.01-0ubuntu1                amd64                             NVIDIA OpenGL-based Inband Frame Readback runtime library
    un  libnvidia-ml1                                             <none>                            <none>                            (no description available)
    un  nvidia-304                                                <none>                            <none>                            (no description available)
    un  nvidia-340                                                <none>                            <none>                            (no description available)
    un  nvidia-384                                                <none>                            <none>                            (no description available)
    un  nvidia-390                                                <none>                            <none>                            (no description available)
    un  nvidia-common                                             <none>                            <none>                            (no description available)
    un  nvidia-compute-utils                                      <none>                            <none>                            (no description available)
    rc  nvidia-compute-utils-460                                  460.91.03-0ubuntu0.18.04.1        amd64                             NVIDIA compute utilities
    ii  nvidia-compute-utils-465                                  465.19.01-0ubuntu1                amd64                             NVIDIA compute utilities
    un  nvidia-container-runtime                                  <none>                            <none>                            (no description available)
    un  nvidia-container-runtime-hook                             <none>                            <none>                            (no description available)
    ii  nvidia-container-toolkit                                  1.7.0-1                           amd64                             NVIDIA container runtime hook
    rc  nvidia-dkms-460                                           460.91.03-0ubuntu0.18.04.1        amd64                             NVIDIA DKMS package
    ii  nvidia-dkms-465                                           465.19.01-0ubuntu1                amd64                             NVIDIA DKMS package
    un  nvidia-dkms-kernel                                        <none>                            <none>                            (no description available)
    un  nvidia-docker                                             <none>                            <none>                            (no description available)
    ii  nvidia-docker2                                            2.8.0-1                           all                               nvidia-docker CLI wrapper
    ii  nvidia-driver-465                                         465.19.01-0ubuntu1                amd64                             NVIDIA driver metapackage
    un  nvidia-driver-binary                                      <none>                            <none>                            (no description available)
    un  nvidia-kernel-common                                      <none>                            <none>                            (no description available)
    rc  nvidia-kernel-common-460                                  460.91.03-0ubuntu0.18.04.1        amd64                             Shared files used with the kernel module
    ii  nvidia-kernel-common-465                                  465.19.01-0ubuntu1                amd64                             Shared files used with the kernel module
    un  nvidia-kernel-source                                      <none>                            <none>                            (no description available)
    un  nvidia-kernel-source-460                                  <none>                            <none>                            (no description available)
    ii  nvidia-kernel-source-465                                  465.19.01-0ubuntu1                amd64                             NVIDIA kernel source package
    un  nvidia-legacy-340xx-vdpau-driver                          <none>                            <none>                            (no description available)
    ii  nvidia-modprobe                                           510.39.01-0ubuntu1                amd64                             Load the NVIDIA kernel driver and create device files
    un  nvidia-opencl-icd                                         <none>                            <none>                            (no description available)
    un  nvidia-persistenced                                       <none>                            <none>                            (no description available)
    ii  nvidia-prime                                              0.8.16~0.18.04.1                  all                               Tools to enable NVIDIA's Prime
    ii  nvidia-settings                                           510.39.01-0ubuntu1                amd64                             Tool for configuring the NVIDIA graphics driver
    un  nvidia-settings-binary                                    <none>                            <none>                            (no description available)
    un  nvidia-smi                                                <none>                            <none>                            (no description available)
    un  nvidia-utils                                              <none>                            <none>                            (no description available)
    ii  nvidia-utils-465                                          465.19.01-0ubuntu1                amd64                             NVIDIA driver support binaries
    un  nvidia-vdpau-driver                                       <none>                            <none>                            (no description available)
    ii  xserver-xorg-video-nvidia-465                             465.19.01-0ubuntu1                amd64                             NVIDIA binary Xorg driver
    
    • [ ] NVIDIA container library version from nvidia-container-cli -V
    cli-version: 1.7.0
    lib-version: 1.7.0
    build date: 2021-11-30T19:53+00:00
    build revision: f37bb387ad05f6e501069d99e4135a97289faf1f
    build compiler: x86_64-linux-gnu-gcc-7 7.5.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
    
  • container failed to start after the VM node migrated to another host

    container failed to start after the VM node migrated to another host

    1. Issue or feature description

    When there is a problem with a host, the VM node running on it is migrated to another healthy host. After the migration, the containers of that node's GPU pods fail to start with the following error (a quick check for this situation is sketched after the reproduction steps below):

    OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: GPU-12345678-1234-1234-1234-1234567890ab: unknown device: unknown
    

    2. Steps to reproduce the issue

    1. Create a GPU pod with an nvidia.com/gpu request
    2. Migrate the node where the pod is running to another host
    3. After the migration, the pod fails to start with the above error.
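
    A rough way to confirm what is happening (a sketch, not an official procedure; pod names and labels depend on how the plugin was deployed) is to compare the UUID in the error with the UUIDs actually present on the new host, then let the plugin re-register its devices:

    # List the GPU UUIDs the new host actually exposes
    nvidia-smi -L

    # If the UUID from the error is not in that list, restart the device plugin pod so it
    # re-advertises the devices of this host, then recreate the affected GPU pod
    kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds
    kubectl delete pod <your-gpu-pod>   # placeholder name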

    3. Information to attach (optional if deemed irrelevant)

    Common error checking:

    • [ ] The output of nvidia-smi -a on your host
    • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
    • [ ] The k8s-device-plugin container logs
    • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

    Additional information that might help better understand your environment and reproduce the bug:

    • [ ] Docker version from docker version
    • [ ] Docker command, image and tag used
    • [ ] Kernel version from uname -a
    • [ ] Any relevant kernel output lines from dmesg
    • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    • [ ] NVIDIA container library version from nvidia-container-cli -V
    • [ ] NVIDIA container library logs (see troubleshooting)
  • nvidia-device-plugin getting CrashLoopBackOff while installing using helm

    nvidia-device-plugin getting CrashLoopBackOff while installing using helm

    1. Issue or feature description

    I have created a multi-node k0s Kubernetes cluster using this blog: https://www.padok.fr/en/blog/k0s-kubernetes-gpu. However, while deploying nvidia-device-plugin via helm (https://github.com/NVIDIA/k8s-device-plugin#deployment-via-helm), the pods end up in CrashLoopBackOff:

    kube-system            calico-kube-controllers-555bc4b957-q99v4   1/1   Running            34 (3d6h ago)   25d
    kube-system            calico-node-btnm7                          1/1   Running            6 (3d6h ago)    25d
    kube-system            calico-node-hqtr4                          1/1   Running            3 (7d2h ago)    25d
    kube-system            coredns-ddddfbd5c-jnxwm                    1/1   Running            6 (3d6h ago)    25d
    kube-system            coredns-ddddfbd5c-pwgqd                    1/1   Running            6 (3d6h ago)    25d
    kube-system            konnectivity-agent-bckg8                   1/1   Running            2 (7d2h ago)    19d
    kube-system            konnectivity-agent-kvml7                   1/1   Running            1 (3d6h ago)    7d2h
    kube-system            kube-proxy-5mbz6                           1/1   Running            6 (3d6h ago)    25d
    kube-system            kube-proxy-rts4r                           1/1   Running            3 (7d2h ago)    25d
    kube-system            metrics-server-7d7c4887f4-8tlt7            1/1   Running            8 (3d6h ago)    25d
    nvidia-device-plugin   nvdp-nvidia-device-plugin-2plxw            1/2   CrashLoopBackOff   2800 ( ago)     12d
    nvidia-device-plugin   nvdp-nvidia-device-plugin-qprsf            1/2   CrashLoopBackOff   2784 (94s ago)  12d
    nvidia-device-plugin   nvidia-device-plugin-788xx                 0/1   CrashLoopBackOff   3183 ( ago)     13d
    nvidia-device-plugin   nvidia-device-plugin-pwj4k                 0/1   CrashLoopBackOff   3168 (24s ago)  13d

    2. Steps to reproduce the issue

    I have followed this blog https://www.padok.fr/en/blog/k0s-kubernetes-gpu

    # Download k0s binary
    curl -L "https://github.com/k0sproject/k0s/releases/download/v1.24.4%2Bk0s.0/k0s-v1.24.4+k0s.0-amd64" -o /tmp/k0s
    chmod +x /tmp/k0s

    # Download k0sctl binary
    curl -L "https://github.com/k0sproject/k0sctl/releases/download/v0.13.2/k0sctl-linux-x64" -o /usr/local/bin/k0sctl
    chmod +x /usr/local/bin/k0sctl

    Then you need to create a k0sctl.yaml config file. For a multi-node Kubernetes cluster:

    k0sctl.yaml file

    apiVersion: k0sctl.k0sproject.io/v1beta1
    kind: Cluster
    metadata:
      name: my-cluster
    spec:
      hosts:
        - role: controller
          localhost:
            enabled: true
          files:
            - name: containerd-config
              src: /tmp/containerd.toml
              dstDir: /etc/k0s/
              perm: "0755"
              dirPerm: null
        - role: worker
          ssh:
            address: 43.88.62.134
            user: user
            keyPath: .ssh/id_rsa
          files:
            - name: containerd-config
              src: /tmp/containerd.toml
              dstDir: /etc/k0s/
              perm: "0755"
              dirPerm: null
        - role: worker
          ssh:
            address: 43.88.62.133
            user: user
            keyPath: .ssh/id_rsa
          files:
            - name: containerd-config
              src: /tmp/containerd.toml
              dstDir: /etc/k0s/
              perm: "0755"
              dirPerm: null
      k0s:
        version: 1.24.4+k0s.0
        config:
          spec:
            network:
              provider: calico

    /tmp/containerd.toml file

    version = 2

    [plugins]
      [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
              runtime_type = "io.containerd.runc.v2"

    Then run the command: k0sctl apply --config /path/to/k0sctl.yaml

    Deploy NVIDIA GPU Operator

    values.yaml file

    operator:
      defaultRuntime: containerd

    toolkit:
      version: v1.10.0-ubuntu20.04
      env:
        - name: CONTAINERD_CONFIG
          value: /etc/k0s/containerd.toml
        - name: CONTAINERD_SOCKET
          value: /run/k0s/containerd.sock
        - name: CONTAINERD_RUNTIME_CLASS
          value: nvidia
        - name: CONTAINERD_SET_AS_DEFAULT
          value: "true"

    driver:
      manager:
        image: k8s-driver-manager
        repository: nvcr.io/nvidia/cloud-native
        version: v0.4.0
        imagePullPolicy: IfNotPresent
        env:
          - name: ENABLE_AUTO_DRAIN
            value: "true"
          - name: DRAIN_USE_FORCE
            value: "true"
          - name: DRAIN_POD_SELECTOR_LABEL
            value: ""
          - name: DRAIN_TIMEOUT_SECONDS
            value: "0s"
          - name: DRAIN_DELETE_EMPTYDIR_DATA
            value: "true"
      repoConfig:
        configMapName: repo-config
      version: "495.29.05"

    validator:
      version: "v1.11.0"

    Install Helm

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh

    Now, add the NVIDIA Helm repository:

    helm repo add nvidia https://nvidia.github.io/gpu-operator \
      && helm repo update

    helm install --wait --generate-name \
      nvidia/gpu-operator

    helm upgrade --install --namespace=gpu-operator --create-namespace --wait --values=values.yaml gpu-operator nvidia/gpu-operator
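
    Once the operator (or the standalone plugin) pods come up, a quick sanity check that the GPUs are actually advertised to Kubernetes (a sketch, assuming kubectl access; namespace and node name are placeholders) looks like:

    # Plugin / operator pods should be Running, not CrashLoopBackOff
    kubectl get pods -n gpu-operator
    kubectl get pods -n nvidia-device-plugin

    # The GPU node should now report the nvidia.com/gpu resource
    kubectl describe node <gpu-node> | grep -i nvidia.com/gpu

    # When a pod keeps crashing, its logs are usually the most useful starting point
    kubectl logs -n nvidia-device-plugin <crashing-pod> --previous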

  • Update README.md

    Update README.md

    /etc/apt/trusted.gpg.d/libnvidia-container.gpg is not really the right place for this, but the upstream .list file does not specify a key path and relies on the built-in apt database.

  • apt-key is deprecated

    apt-key is deprecated

    curl ... | apt-key add - is deprecated. The suggested workflow is to add the gpg key directly to the file system.

    tl;dr: store it in /etc/apt/keyrings/libnvidia-container.gpg (the recommended location for keys placed there by the OS admin) and reference it in the .list file with [signed-by=/etc/apt/keyrings/libnvidia-container.gpg], as sketched in the example below.

    The apt-key manual states:

    Then you can directly replace this with (though note the recommendation below):
       wget -qO- https://myrepo.example/myrepo.asc | sudo tee /etc/apt/trusted.gpg.d/myrepo.asc
    

    Make sure to use the "asc" extension for ASCII armored keys and the "gpg" extension for the binary OpenPGP format (also known as "GPG key public ring"). The binary OpenPGP format works for all apt versions, while the ASCII armored format works for apt version >= 1.4.

    Recommended: Instead of placing keys into the /etc/apt/trusted.gpg.d directory, you can place them anywhere on your filesystem by using the Signed-By option in your sources.list and pointing to the filename of the key. See sources.list(5) for details. Since APT 2.4, /etc/apt/keyrings is provided as the recommended location for keys not managed by packages. When using a deb822-style sources.list, and with apt version >= 2.4, the Signed-By option can also be used to include the full ASCII armored keyring directly in the sources.list without an additional file.
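
    Applied to the libnvidia-container repository, that recommendation can be sketched roughly like this (the key URL and list file follow the upstream naming, but treat the exact paths as an assumption rather than an official procedure):

    # Store the dearmored key under /etc/apt/keyrings (recommended location since APT 2.4)
    sudo install -d -m 0755 /etc/apt/keyrings
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
      | sudo gpg --dearmor -o /etc/apt/keyrings/libnvidia-container.gpg

    # Reference the key explicitly from the .list file via signed-by
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
      | sed 's|deb |deb [signed-by=/etc/apt/keyrings/libnvidia-container.gpg] |g' \
      | sudo tee /etc/apt/sources.list.d/libnvidia-container.list

    sudo apt-get update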

  • #364 Check config symlink instead of file existence in config-manager

    #364 Check config symlink instead of file existence in config-manager

    As described in issue #364, when changing the value of the configuration label, the config-manager tries to create the symlink to the new config file without first deleting the symlink pointing to the previous (now deleted) configuration file, resulting in the "file exists" error mentioned above when creating the symlink to the new configuration.

    The PR makes the config-manager check for symlink existence instead of checking if the linked file exists.
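
    The difference is easy to reproduce outside the plugin. A minimal shell illustration of the failure mode (hypothetical paths; the plugin itself is written in Go, this is only the same idea expressed with ln and test):

    cd "$(mktemp -d)"
    echo old > old.yaml
    ln -s old.yaml current        # current -> old.yaml
    rm old.yaml                   # target removed: the symlink is now dangling

    [ -e current ] && echo "target exists" || echo "no target"   # prints "no target" (follows the link)
    [ -L current ] && echo "link exists"   || echo "no link"     # prints "link exists" (checks the link itself)

    ln -s new.yaml current        # fails with "File exists": the dangling link is still there
    ln -sfn new.yaml current      # removes the old link and creates the new one

    Checking with -L (lstat) rather than -e (stat), or force-replacing the link, avoids the error the issue describes.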
