The OpenAIOS vGPU scheduler for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory.

OpenAIOS vGPU scheduler for Kubernetes


English version|中文版

Introduction

The 4paradigm k8s vGPU scheduler is an "all-in-one" chart to manage GPUs in your Kubernetes cluster. It has everything you would expect from a k8s GPU manager, including:

GPU sharing: Each task can allocate a portion of a GPU instead of a whole card, so a GPU can be shared among multiple tasks.

Device Memory Control: Each vGPU can be allocated a specific amount of device memory, and the plugin ensures that usage does not exceed that limit.

Virtual Device Memory: You can oversubscribe GPU device memory by using host memory as swap.

Easy to use: You don't need to modify your task YAML to use our scheduler. All your GPU jobs will be automatically supported after installation.

The k8s vGPU scheduler retains the features of the 4paradigm k8s-device-plugin (4paradigm/k8s-device-plugin), such as splitting a physical GPU and limiting device memory and compute units, and adds a scheduling module to balance GPU usage across GPU nodes. In addition, it allows users to allocate GPUs by specifying device memory and device core usage. Furthermore, the vGPU scheduler can virtualize device memory (the device memory in use can exceed the physical device memory), making it possible to run tasks with large device memory requirements or to increase the number of shared tasks. You can refer to the benchmarks report.

When to use

  1. Scenarios where pods need to be allocated a certain amount of device memory or a share of device cores.
  2. The need to balance GPU usage across a cluster with multiple GPU nodes.
  3. Low utilization of device memory and compute units, for example running 10 tf-serving instances on one GPU.
  4. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is shared by many students, or cloud platforms offering small GPU instances.
  5. Cases where physical device memory is insufficient, so virtual device memory can be turned on, for example training with large batches or large models.

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

  • NVIDIA drivers ~= 384.81
  • nvidia-docker version > 2.0
  • Kubernetes version >= 1.16
  • glibc >= 2.17
  • kernel version >= 3.10
  • helm

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that both the NVIDIA drivers and nvidia-docker have been installed.

Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit.

You will need to enable the NVIDIA runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

If runtimes is not already present, head to the install page of nvidia-docker.
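
After editing /etc/docker/daemon.json, Docker normally has to be restarted before the new default runtime takes effect. A minimal sketch, assuming a systemd-based host:

sudo systemctl daemon-reload
sudo systemctl restart docker
# optionally confirm that nvidia is now the default runtime
docker info | grep -i 'default runtime'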

Then, you need to label the GPU nodes that should be scheduled by the 4pd-k8s-scheduler by adding the label "gpu=on"; otherwise, they cannot be managed by our scheduler.

kubectl label nodes {nodeid} gpu=on
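
You can check which nodes carry the label with, for example:

kubectl get nodes -l gpu=on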

Download

Once you have configured the options above on all the GPU nodes in your cluster, remove the existing NVIDIA device plugin for Kubernetes if it is already deployed. Then clone our project and enter the deployments folder:

$ git clone https://github.com/4paradigm/k8s-vgpu-scheduler.git
$ cd k8s-vgpu-scheduler/deployments

Set scheduler image version

Check your Kubernetes version by using the following command:

kubectl version

Then set the Kubernetes scheduler image version according to your Kubernetes server version via the key scheduler.kubeScheduler.image in the deployments/values.yaml file. For example, if your cluster server version is 1.16.8, you should change the image version to 1.16.8:

scheduler:
  kubeScheduler:
    image: "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.16.8"

Enabling vGPU Support in Kubernetes

You can customize your installation by adjusting configs.

After checking those config arguments, you can enable the vGPU support by the following command:

$ helm install vgpu vgpu -n kube-system
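
If you prefer not to edit values.yaml, the chart values that appear elsewhere in this document (such as devicePlugin.deviceSplitCount, devicePlugin.deviceMemoryScaling and scheduler.kubeScheduler.imageTag) can also be passed on the command line with --set; the values below are purely illustrative:

$ helm install vgpu vgpu -n kube-system \
    --set scheduler.kubeScheduler.imageTag=v1.16.8 \
    --set devicePlugin.deviceSplitCount=10 \
    --set devicePlugin.deviceMemoryScaling=1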

You can verify your installation by the following command:

$ kubectl get pods -n kube-system

If both the vgpu-device-plugin and vgpu-scheduler pods are in the Running state, your installation is successful.
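
For example, to list only the vGPU components:

$ kubectl get pods -n kube-system | grep vgpu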

Running GPU Jobs

NVIDIA vGPUs can now be requested by a container using the nvidia.com/gpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional,Integer)
          nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)

Be aware that if the task can't fit on any GPU node (i.e. the number of nvidia.com/gpu you request exceeds the number of GPUs on every node), the task will get stuck in the Pending state.

You can now execute the nvidia-smi command in the container and see the difference in GPU memory between the vGPU and the real GPU.
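
For example, assuming the manifest above was saved as gpu-pod.yaml, a quick check could look like this:

$ kubectl apply -f gpu-pod.yaml
$ kubectl exec -it gpu-pod -- nvidia-smi   # the memory reported inside the container should reflect the vGPU allocation rather than the whole card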

WARNING: if you don't request vGPUs when using the device plugin with NVIDIA images, all the vGPUs on the machine will be exposed inside your container.

Upgrade

To upgrade k8s-vGPU to the latest version, all you need to do is reinstall the chart. The latest version will be downloaded automatically.

$ helm uninstall vgpu -n kube-system
$ helm install vgpu vgpu -n kube-system

Uninstall

helm uninstall vgpu -n kube-system

Scheduling

The current scheduling strategy is to select the GPU with the fewest running tasks, thus balancing the load across multiple GPUs.

Benchmarks

Three instances from ai-benchmark have been used to evaluate vGPU-device-plugin performance as follows

Test Environment:
  • Kubernetes version: v1.12.9
  • Docker version: 18.09.1
  • GPU Type: Tesla V100
  • GPU Num: 2

Test instances:
  • nvidia-device-plugin: k8s + nvidia k8s-device-plugin
  • vGPU-device-plugin: k8s + vGPU k8s-device-plugin, without virtual device memory
  • vGPU-device-plugin (virtual device memory): k8s + vGPU k8s-device-plugin, with virtual device memory

Test Cases:

test id   case            type        params
1.1       Resnet-V2-50    inference   batch=50, size=346*346
1.2       Resnet-V2-50    training    batch=20, size=346*346
2.1       Resnet-V2-152   inference   batch=10, size=256*256
2.2       Resnet-V2-152   training    batch=10, size=256*256
3.1       VGG-16          inference   batch=20, size=224*224
3.2       VGG-16          training    batch=2, size=224*224
4.1       DeepLab         inference   batch=2, size=512*512
4.2       DeepLab         training    batch=1, size=384*384
5.1       LSTM            inference   batch=100, size=1024*300
5.2       LSTM            training    batch=10, size=1024*300

Test Results: (benchmark result charts omitted)

To reproduce:

  1. Install k8s-vGPU-scheduler and configure it properly.
  2. Run the benchmark job:
$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
  3. View the result by using kubectl logs:
$ kubectl logs [pod id]
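
For example, assuming the benchmark Pod's name contains ai-benchmark, you can look it up and read its logs like this:

$ kubectl get pods | grep ai-benchmark
$ kubectl logs [pod id]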

Features

  • Specify the number of vGPUs divided by each physical GPU.
  • Limits vGPU's Device Memory.
  • Allows vGPU allocation by specifying device memory
  • Limits vGPU's Streaming Multiprocessor.
  • Allows vGPU allocation by specifying device core usage
  • Zero changes to existing programs

Experimental Features

  • Virtual Device Memory

    The device memory of a vGPU can exceed the physical device memory of the GPU. In that case, the excess portion is placed in host RAM, which has some impact on performance.
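
    A minimal sketch of enabling this at install time, assuming that the devicePlugin.deviceMemoryScaling chart value (which appears in several issues below) controls the oversubscription ratio, so that 2 would let each physical GPU advertise twice its physical device memory:

    $ helm install vgpu vgpu -n kube-system \
        --set devicePlugin.deviceMemoryScaling=2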

Known Issues

  • Currently, A100 MIG is not supported
  • Currently, only computing tasks are supported, and video codec processing is not supported.

TODO

  • Support video codec processing
  • Support Multi-Instance GPUs (MIG)

Tests

  • TensorFlow 1.14.0/2.4.1
  • torch 1.1.0
  • mxnet 1.4.0
  • mindspore 1.1.1

The above frameworks have passed the test.

Issues and Contributing

Authors

Owner: 4Paradigm Open Source Community
Comments
  • [4pdvGPU ERROR (pid:167 thread=140191321859904 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4

    hi,

    Three GPUs are allocated inside the container and six worker processes are started. They start normally, but after running for a while the following error is reported:

    [4pdvGPU ERROR (pid:167 thread=140191321859904 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4 python3.7: /home/limengxuan/work/libcuda_override/src/multiprocess/multiprocess_memory_limit.c:455: lock_shrreg: Assertion 0' failed. python3.7: /home/limengxuan/work/libcuda_override/src/multiprocess/multiprocess_memory_limit.c:455: lock_shrreg: Assertion0' failed. [2022-06-16 18:30:58 +0800] [12] [ERROR] Exception in worker process Traceback (most recent call last): File "/opt/python37/lib/python3.7/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker worker.init_process() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/ggevent.py", line 146, in init_process super().init_process() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/base.py", line 134, in init_process self.load_wsgi() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi self.wsgi = self.app.wsgi() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi self.callable = self.load() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in load return self.load_wsgiapp() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp return util.import_app(self.app_uri) File "/opt/python37/lib/python3.7/site-packages/gunicorn/util.py", line 359, in import_app mod = importlib.import_module(module) File "/opt/python37/lib/python3.7/importlib/init.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1006, in _gcd_import File "", line 983, in _find_and_load File "", line 967, in _find_and_load_unlocked File "", line 677, in _load_unlocked File "", line 728, in exec_module File "", line 219, in _call_with_frames_removed File "/translation/server.py", line 40, in initModels(device) File "/translation/Opus.py", line 12, in initModels model = OPUSModel(device, OPUS_PATH+m) File "/translation/OpusMT.py", line 16, in init self.model.to(self.device) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 673, in to return self._apply(convert) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) [Previous line repeated 2 more times] File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply param_applied = fn(param) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 671, in convert return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking) RuntimeError: CUDA error: invalid argument [2022-06-16 18:30:58 +0800] [168] [ERROR] Exception in worker process Traceback (most recent call last): File "/opt/python37/lib/python3.7/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker worker.init_process() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/ggevent.py", line 146, in init_process super().init_process() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/base.py", line 134, in init_process self.load_wsgi() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi self.wsgi = self.app.wsgi() File 
"/opt/python37/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi self.callable = self.load() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in load return self.load_wsgiapp() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp return util.import_app(self.app_uri) File "/opt/python37/lib/python3.7/site-packages/gunicorn/util.py", line 359, in import_app mod = importlib.import_module(module) File "/opt/python37/lib/python3.7/importlib/init.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1006, in _gcd_import File "", line 983, in _find_and_load File "", line 967, in _find_and_load_unlocked File "", line 677, in _load_unlocked File "", line 728, in exec_module File "", line 219, in _call_with_frames_removed File "/translation/server.py", line 40, in initModels(device) File "/translation/Opus.py", line 12, in initModels model = OPUSModel(device, OPUS_PATH+m) File "/translation/OpusMT.py", line 16, in init self.model.to(self.device) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 673, in to return self._apply(convert) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) [Previous line repeated 3 more times] File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply param_applied = fn(param) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 671, in convert return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking) RuntimeError: CUDA error: invalid argument [2022-06-16 18:30:58 +0800] [12] [INFO] Worker exiting (pid: 12) [2022-06-16 18:30:58 +0800] [168] [INFO] Worker exiting (pid: 168) [2022-06-16 18:30:59 +0800] [163] [WARNING] Worker with pid 167 was terminated due to signal 6 [2022-06-16 18:30:59 +0800] [688] [INFO] Booting worker with pid: 688 [2022-06-16 18:31:00 +0800] [7] [WARNING] Worker with pid 11 was terminated due to signal 6 [2022-06-16 18:31:00 +0800] [752] [INFO] Booting worker with pid: 752 merge pid=45194merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge 
pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pi d=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49905merge pid=49905merge pid=49905merge pid=49905merge pid=49905merge pid=49905merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069[2022-06-16 18:31:36 +0800] [163] [INFO] Shutting down: Master [2022-06-16 18:31:36 +0800] [163] [INFO] Reason: Worker failed to boot. [2022-06-16 18:31:37 +0800] [7] [INFO] Shutting down: Master [2022-06-16 18:31:37 +0800] [7] [INFO] Reason: Worker failed to boot. [4pdvGPU ERROR (pid:460 thread=140489760016192 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4 python3.7: /home/limengxuan/work/libcuda_override/src/multiprocess/multiprocess_memory_limit.c:455: lock_shrreg: Assertion `0' failed. 
[2022-06-17 04:42:32 +0800] [454] [WARNING] Worker with pid 460 was terminated due to signal 6 [2022-06-17 04:42:32 +0800] [4395] [INFO] Booting worker with pid: 4395 merge pid=47068merge pid=59211merge pid=59211merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=11907merge pid=11907merge pid=59211merge pid=59211merge pid=59211merge pid=47068merge pid=47068merge pid=47068merge pid=47068[2022-06-17 04:45:49 +0800] [454] [CRITICAL] WORKER TIMEOUT (pid:459) [2022-06-17 04:45:54 +0800] [454] [WARNING] Worker with pid 459 was terminated due to signal 9 [2022-06-17 04:45:54 +0800] [4504] [INFO] Booting worker with pid: 4504 merge pid=11907merge pid=59211merge pid=59211merge pid=11907merge pid=11907merge pid=11907merge pid=11907merge pid=11907merge pid=19026merge pid=19026merge pid=59211merge pid=59211merge pid=59211merge pid=11907merge pid=11907merge pid=11907merge pid=11907[2022-06-17 04:49:46 +0800] [454] [CRITICAL] WORKER TIMEOUT (pid:4395) [2022-06-17 04:49:48 +0800] [454] [WARNING] Worker with pid 4395 was terminated due to signal 9 [2022-06-17 04:49:48 +0800] [4611] [INFO] Booting worker with pid: 4611

  • Unable to schedule blue/gpu-pod1

    After installing according to the instructions:

    [root@host-172-21-9-35 gpu]# cat vgpu-test.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod1
      namespace: blue
    spec:
      nodeSelector:
        gpu: "on"
      containers:
      - name: gpu-pod
        image: nvidia/cuda:9.0-base
        command: ["/bin/sh", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 3
                   nvidia.com/gpumem: 1000
                   nvidia.com/gpucores: 30
    [root@host-172-21-9-35 gpu]# kubectl create -f vgpu-test.yaml
    error: error parsing vgpu-test.yaml: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
    

    After deleting

                   nvidia.com/gpumem: 1000
                   nvidia.com/gpucores: 30
    

    the Pod can be created, but it stays in Pending:

    [root@host-172-21-9-35 gpu]# kubectl get pods -n blue
    NAME       READY   STATUS    RESTARTS   AGE
    gpu-pod1   0/1     Pending   0          8s
    

    Scheduler log output:

    [root@host-172-21-9-35 gpu]# kubectl logs -n kube-system vgpu-scheduler-b4f756599-qrh9j kube-scheduler --tail=10
    I0124 08:49:33.368916       1 scheduling_queue.go:841] About to try and schedule pod blue/gpu-pod1
    I0124 08:49:33.368952       1 scheduler.go:606] Attempting to schedule pod: blue/gpu-pod1
    I0124 08:49:33.398379       1 factory.go:453] Unable to schedule blue/gpu-pod1: no fit: 0/5 nodes are available: 4 node(s) didn't match node selector.; waiting
    I0124 08:49:33.398456       1 scheduler.go:773] Updating pod condition for blue/gpu-pod1 to (PodScheduled==False, Reason=Unschedulable)
    I0124 08:49:33.412627       1 generic_scheduler.go:1212] Node host-172-18-199-14 is a potential node for preemption.
    I0124 08:50:58.830278       1 scheduling_queue.go:841] About to try and schedule pod blue/gpu-pod1
    I0124 08:50:58.830322       1 scheduler.go:606] Attempting to schedule pod: blue/gpu-pod1
    I0124 08:50:58.832585       1 factory.go:453] **Unable to schedule blue/gpu-pod1: no fit: 0/5 nodes are available: 4 node(s) didn't match node selector.; waiting**
    I0124 08:50:58.832671       1 scheduler.go:773] Updating pod condition for blue/gpu-pod1 to (PodScheduled==False, Reason=Unschedulable)
    I0124 08:50:58.838431       1 generic_scheduler.go:1212] Node host-172-18-199-14 is a potential node for preemption.
    

    Cluster plugin Pods:

    [root@host-172-21-9-35 gpu]# kubectl get pods -n kube-system | grep gpu
    vgpu-device-plugin-km2r6                         2/2     Running            0          17m
    vgpu-scheduler-b4f756599-qrh9j                   2/2     Running            0          17m
    

    GPU host information:

    [host-172-18-199-14@root]/root$ rpm -qa | grep nvidia
    libnvidia-container1-1.6.0-1.x86_64
    nvidia-container-toolkit-1.6.0-1.x86_64
    nvidia-container-runtime-3.6.0-1.noarch
    nvidia-docker2-2.7.0-1.noarch
    libnvidia-container-tools-1.6.0-1.x86_64
    

    NVIDIA driver version: NVIDIA-Linux-x86_64-470.82.01.run

    Docker information:

    [host-172-18-199-14@root]/root$ docker info
    Client:
     Debug Mode: false
    
    Server:
     Containers: 48
      Running: 28
      Paused: 0
      Stopped: 20
     Images: 25
     Server Version: 19.03.12
     Storage Driver: overlay2
      Backing Filesystem: xfs
      Supports d_type: true
      Native Overlay Diff: true
     Logging Driver: json-file
     Cgroup Driver: systemd
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
     Swarm: inactive
     Runtimes: nvidia runc
     Default Runtime: nvidia
    
    
  • Invalid device memory limit: CUDA_DEVICE_SM_LIMIT=0

    When I use it inside a container, many warnings are printed, although the GPU works normally. The install command is as follows:

     helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.20.10  --set devicePlugin.deviceMemoryScaling=1 --set devicePlugin.deviceSplitCount=2  -n kube-system
    

    What could be the cause?

  • Help: enabling vGPU appears to have no effect

    nvidia-smi inside the container still shows the full device memory.

    Install command: helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.21.5 --set devicePlugin.deviceSplitCount=24 --set scheduler.defaultMem=1000 -n gpu-operator-resources

    NVIDIA environment:

    k8s@node-gpu1:~$ dpkg --get-selections | grep nvidia libnvidia-cfg1-510:amd64 install libnvidia-common-510 install libnvidia-compute-510:amd64 install libnvidia-compute-510:i386 install libnvidia-container-tools install libnvidia-container1:amd64 install libnvidia-decode-510:amd64 install libnvidia-decode-510:i386 install libnvidia-encode-510:amd64 install libnvidia-encode-510:i386 install libnvidia-extra-510:amd64 install libnvidia-fbc1-510:amd64 install libnvidia-fbc1-510:i386 install libnvidia-gl-510:amd64 install libnvidia-gl-510:i386 install nvidia-compute-utils-510 install nvidia-container-toolkit install nvidia-dkms-510 install nvidia-docker2 install nvidia-driver-510 install nvidia-kernel-common-510 install nvidia-kernel-source-510 install nvidia-modprobe install nvidia-prime install nvidia-settings install nvidia-utils-510 install xserver-xorg-video-nvidia-510 install

  • Unable to schedule when the cluster has GPUs of different models

    The cluster has three GPU nodes: gpu1 and gpu2 each have two Tesla V100 cards with driver version 515.86.01, and gpu3 has four RTX A6000 cards with driver version 525.60.11.

    I deployed the latest vgpu as instructed in the README, with devicePlugin.deviceSplitCount=2. kubectl describe nodes then shows that gpu1 and gpu2 each expose 4 nvidia.com/gpu resources and gpu3 exposes 8.

    After setting limits.nvidia.com/gpu: 1 in the application Pod, the Pod is stuck in the Pending phase, and kubectl describe shows the following warning:

      Type     Reason            Age   From           Message
      ----     ------            ----  ----           -------
      Warning  FailedScheduling  19s   4pd-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node unregisterd.
    

    After evicting the gpu3 node, Pods start normally on gpu1 and gpu2.

  • vgpu-device-plugin is not installed when running helm install vgpu vgpu -n kube-system

    Command executed: helm install vgpu vgpu-charts/vgpu --set devicePlugin.deviceSplitCount=8 --set devicePlugin.deviceMemoryScaling=4 --set scheduler.kubeScheduler.imageTag=v1.20.0 -n kube-system

  • Demo fails with: bash: symbol lookup error: /usr/local/vgpu/libvgpu.so: undefined symbol: cuMemAllocAsync

    Hi, I need help. When I use the demo:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 3 # requesting 2 vGPUs
              nvidia.com/gpumem: 3000 # Each vGPU requests 3000m device memory (Optional, Integer)
              nvidia.com/gpucores: 20 # Each vGPU uses 30% of the real GPU's compute (Optional, Integer)
    

    The error is:

     bash: symbol lookup error: /usr/local/vgpu/libvgpu.so: undefined symbol: cuMemAllocAsync
    
  • After reinstalling, the devicePlugin cannot be created correctly

    Hello. Because a system update changed the graphics driver, I redeployed k8s-vgpu-scheduler. The scheduler deploys without problems, but it fails when it gets to the devicePlugin.

    The following error is reported:

    七 28 17:30:34 workernode kubelet[108628]: E0728 17:30:34.498027 108628 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"device-plugin\" with PostStartHookError: \"Exec lifecycle hook ([/bin/sh -c mv /usrbin/nvidia-container-runtime /usrbin/nvidia-container-runtime-4pdbackup;cp /k8s-vgpu/bin/nvidia-container-runtime /usrbin/;cp -f /k8s-vgpu/lib/* /usr/local/vgpu/]) for Container \\\"device-plugin\\\" in Pod \\\"vgpu-device-plugin-2sx47_kube-system(0a6c9800-2873-4368-9dcf-be0659f94b7f)\\\" failed - error: command '/bin/sh -c mv /usrbin/nvidia-container-runtime /usrbin/nvidia-container-runtime-4pdbackup;cp /k8s-vgpu/bin/nvidia-container-runtime /usrbin/;cp -f /k8s-vgpu/lib/* /usr/local/vgpu/' exited with 126: , message: \\\"OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown\\\\r\\\\n\\\"\"" pod="kube-system/vgpu-device-plugin-2sx47" podUID=0a6c9800-2873-4368-9dcf-be0659f94b7f

    It looks like the container stops immediately after starting, which causes the subsequent commands to fail. Has this problem been seen before, and how can it be resolved? Thanks.

    Addendum: nvidia-docker has been redeployed and the official test program runs without problems (docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi). All other Kubernetes services also deploy normally.

    System information:

    • system os: ubuntu 20.04
    • cluster version: 1.23.4
    • docker version: 20.10.7
    • nvidia docker2 version: 2.11.0
    • k8s-vgpu-scheduler version: latest
    • nvidia driver version: 515.57
    • gpu card: RTX 2060 Super
  • Can the GPU metrics from the vgpu-device-plugin-monitor service be integrated into Prometheus?

    Hi,

    Following https://zhuanlan.zhihu.com/p/125692899 to deploy GPU cluster monitoring, it looks like the vGPU metrics can be imported into Prometheus. I tried creating a ServiceMonitor, but the new vGPU target does not appear in Prometheus's target list. How should this be configured so that Prometheus can monitor vGPU resources? My ServiceMonitor configuration file is below.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      creationTimestamp: "2022-06-01T16:57:54Z"
      generation: 5
      labels:
        app: vgpu-metrics
      name: vgpu-metrics
      namespace: monitoring
      resourceVersion: "674182"
      selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors/vgpu-metrics
      uid: 06a166be-1142-4153-b06e-fe2691fd858a
    spec:
      endpoints:
      - path: /metrics
        port: monitorport
      jobLabel: jobLabel
      namespaceSelector:
        matchNames:
        - kube-system
      selector:
        matchLabels:
          app.kubernetes.io/component: 4pd-scheduler
  • Question about GPU node configuration

  • Installation failure after update

    Hello. I am trying to deploy vgpu-scheduler and found that after the update the helm installation fails. The logs show a kube-scheduler argument error, as follows.

    kubectl get pods -A
    kube-system   vgpu-device-plugin-k4bjg          2/2   Running            0               12m
    kube-system   vgpu-scheduler-5bc998b64f-bssbm   1/2   CrashLoopBackOff   6 (4m38s ago)   12m

    kubectl logs -n kube-system vgpu-scheduler-5bc998b64f-bssbm -c kube-scheduler
    Error: unknown flag: --policy-config-file

    How can this be resolved? Alternatively, how can I install an older version?

  • python + gunicorn with vGPU does not work properly: Worker with pid 284 was terminated due to signal 9

    [2023-01-04 13:37:05,017] backend.py [serve] [line:95] - INFO: === Running command 'gunicorn --timeout=60 -b 0.0.0.0:8080 -w 4 ${GUNICORN_CMD_ARGS} --max-requests 10 -- mlflow.pyfunc.scoring_server.wsgi:app' predict-DO5THCMQ 10.244.154.140:8080 [2023-01-04 13:37:05,047] nacosUtil.py [server_regist_and_send_beat] [line:183] - INFO: 第1次心跳日志:predict-DO5THCMQ 10.244.154.140:8080 [2023-01-04 13:37:05 +0800] [83] [INFO] Starting gunicorn 20.1.0 [2023-01-04 13:37:05 +0800] [83] [INFO] Listening at: http://0.0.0.0:8080 (83) [2023-01-04 13:37:05 +0800] [83] [INFO] Using worker: sync [2023-01-04 13:37:05 +0800] [85] [INFO] Booting worker with pid: 85 [2023-01-04 13:37:05 +0800] [109] [INFO] Booting worker with pid: 109 [2023-01-04 13:37:05 +0800] [110] [INFO] Booting worker with pid: 110 [2023-01-04 13:37:05 +0800] [116] [INFO] Booting worker with pid: 116 [4pdvGPU Msg(85:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(85:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(85:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(85:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(85:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(85:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(109:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(109:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(109:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(109:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(109:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(116:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(116:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(116:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(116:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(116:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(110:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(110:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(110:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(110:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(110:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... 
[4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... 
[4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [2023-01-04 13:38:05 +0800] [83] [CRITICAL] WORKER TIMEOUT (pid:85) [2023-01-04 13:38:05 +0800] [83] [CRITICAL] WORKER TIMEOUT (pid:109) [2023-01-04 13:38:05 +0800] [83] [CRITICAL] WORKER TIMEOUT (pid:110) [2023-01-04 13:38:05 +0800] [83] [CRITICAL] WORKER TIMEOUT (pid:116) [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [2023-01-04 13:38:06 +0800] [83] [WARNING] Worker with pid 85 was terminated due to signal 9 [2023-01-04 13:38:06 +0800] [83] [WARNING] Worker with pid 109 was terminated due to signal 9 [2023-01-04 13:38:06 +0800] [284] [INFO] Booting worker with pid: 284 [2023-01-04 13:38:06 +0800] [83] [WARNING] Worker with pid 116 was terminated due to signal 9 [2023-01-04 13:38:06 +0800] [285] [INFO] Booting worker with pid: 285 [2023-01-04 13:38:07 +0800] [318] [INFO] Booting worker with pid: 318 [2023-01-04 13:38:07 +0800] [83] [WARNING] Worker with pid 110 was terminated due to signal 9 [2023-01-04 13:38:07 +0800] [333] [INFO] Booting worker with pid: 333 [4pdvGPU Msg(318:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(318:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(318:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(318:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(318:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(318:140143647065664:multiprocess_memory_limit.c:545)]: Kick dead proc 85 [4pdvGPU Warn(318:140143647065664:multiprocess_memory_limit.c:545)]: Kick dead proc 109 [4pdvGPU Warn(318:140143647065664:multiprocess_memory_limit.c:545)]: Kick dead proc 110 [4pdvGPU Warn(318:140143647065664:multiprocess_memory_limit.c:545)]: Kick dead proc 116 [4pdvGPU Warn(318:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(318:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(284:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(284:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(284:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(284:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(285:140143647065664:libvgpu.c:871)]: Initializing... 
[4pdvGPU Msg(285:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(285:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(285:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(284:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Msg(285:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(284:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(284:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Warn(285:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(285:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(333:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(333:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(333:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(333:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(333:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0

  • Is this vGPU sharing technology more like Tencent's vCUDA or Alibaba's cGPU?

    Here is an article introducing GPU sharing techniques. It describes two approaches: 1) CUDA-layer interception (Tencent Cloud's vCUDA, now discontinued?) and 2) GPU driver-layer interception (Alibaba Cloud's cGPU). The drawback of the first approach is its dependency on CUDA: when a new CUDA version adds features or changes interfaces, the first approach may no longer work. Overall, the second approach looks superior.

    Tencent Cloud's newer GPU sharing solution, qGPU, appears to use the second approach.

    Which one does this vGPU use? From earlier issues it looks like the first.

    If vGPU used the second approach, it would be worth praising and trying.

  • Device memory accounting problem causes a device OOM error that terminates inference

    We use three RTX A4000 cards (16 GB of device memory each) for model inference. Each task's resource request is shown below. Since our GPU utilization is not high, we do not limit gpucores and only limit device memory, so each A4000 can run two tasks.

    resources:
            limits:
              nvidia.com/gpu: '1'
              nvidia.com/gpucores: '0'
              nvidia.com/gpumem: 8k
    

    As expected, six tasks started successfully. After running for about 30 minutes, one task failed with the error usage=9502195712 limit=8388608000. However, several monitors show that each task uses less than 6 GB of device memory, and our inference service is a production-grade build that has never exceeded 6 GB under load testing. The usage reported in this log is 9.5 GB for a single task, which triggers the over-limit error and stops inference. Our engineering team believes the usage accounting is at fault; please help look into it.

     [4pdvGPU ERROR (pid:99 thread=139987068696320 multiprocess_memory_limit.c:277)]: device OOM encountered: usage=9502195712 limit=8388608000
     python: /libcuda_override/src/multiprocess/multiprocess_memory_limit.c:277: set_gpu_device_memory_monitor: Assertion `0' failed.
     /usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 40 leaked semaphores to clean up at shutdown
    
       len(cache))
    

    The server metrics show that none of the six tasks uses more than 6 GB of device memory:

    # HELP Device_memory_desc_of_container Container device meory description
    # TYPE Device_memory_desc_of_container counter
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2244253696",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0005-dgj8p",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.601089752e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2376374272",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0003-nnhhx",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.733210328e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2386860032",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0004-c8tnp",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.743696088e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2693044224",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0002-29d6z",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.04988028e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2718210048",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0001-xtcn5",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.075046104e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2745473024",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0006-bbdvh",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.10230908e+09
    # HELP HostCoreUtilization GPU core utilization
    # TYPE HostCoreUtilization gauge
    HostCoreUtilization{deviceid="0",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",zone="vGPU"} 0
    HostCoreUtilization{deviceid="1",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",zone="vGPU"} 44
    HostCoreUtilization{deviceid="2",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",zone="vGPU"} 0
    # HELP HostGPUMemoryUsage GPU device memory usage
    # TYPE HostGPUMemoryUsage gauge
    HostGPUMemoryUsage{deviceid="0",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",zone="vGPU"} 4648
    HostGPUMemoryUsage{deviceid="1",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",zone="vGPU"} 8573
    HostGPUMemoryUsage{deviceid="2",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",zone="vGPU"} 9441
    # HELP vGPU_device_memory_limit_in_bytes vGPU device limit
    # TYPE vGPU_device_memory_limit_in_bytes gauge
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",podname="ct-urinary-d3f914dc-0004-c8tnp",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",podname="ct-urinary-d3f914dc-0005-dgj8p",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",podname="ct-urinary-d3f914dc-0002-29d6z",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",podname="ct-urinary-d3f914dc-0006-bbdvh",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",podname="ct-urinary-d3f914dc-0001-xtcn5",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",podname="ct-urinary-d3f914dc-0003-nnhhx",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    # HELP vGPU_device_memory_usage_in_bytes vGPU device usage
    # TYPE vGPU_device_memory_usage_in_bytes gauge
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",podname="ct-urinary-d3f914dc-0004-c8tnp",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.743696088e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",podname="ct-urinary-d3f914dc-0005-dgj8p",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.601089752e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",podname="ct-urinary-d3f914dc-0002-29d6z",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.04988028e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",podname="ct-urinary-d3f914dc-0006-bbdvh",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.10230908e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",podname="ct-urinary-d3f914dc-0001-xtcn5",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.075046104e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",podname="ct-urinary-d3f914dc-0003-nnhhx",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.733210328e+09
    

    The chart below shows the GPU monitoring on the server. Each card runs two tasks, each requesting 8 GB of device memory, and the cumulative device memory per card stays below 12 GB. Starting at 11:56, however, one task on card 0 stopped inference because of the device OOM error, even though its actual memory usage never exceeded the limit.


  • Specifying a node in K8s causes an error

    Description: Hello, I am using the example from the official documentation. When I add the line nodeName: ip, the Pod fails to start; when I remove that line, the Pod starts. I am using k8s 1.16.8. What is the reason?

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      nodeName: master
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 1000"]
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 2 vGPUs
              nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional,Integer)
              nvidia.com/gpucores: 60 # Each vGPU uses 30% of the entire GPU (Optional,Integer)

  • UnexpectedAdmissionError

    The errors are as follows:

    Events:
      Type     Reason                    Age               From           Message
      ----     ------                    ----              ----           -------
      Warning  FailedScheduling          73s               4pd-scheduler  AssumePod failed: pod c4c9c2a8-83d3-4e9b-b6fa-ed47719d3d19 is in the cache, so can't be assumed
      Warning  FailedScheduling          73s               4pd-scheduler  AssumePod failed: pod c4c9c2a8-83d3-4e9b-b6fa-ed47719d3d19 is in the cache, so can't be assumed
      Normal   Scheduled                 73s               4pd-scheduler  Successfully assigned default/gpu-pod to pkm-05
      Warning  FailedScheduling          73s               4pd-scheduler  AssumePod failed: pod c4c9c2a8-83d3-4e9b-b6fa-ed47719d3d19 is in the cache, so can't be assumed
      Warning  UnexpectedAdmissionError  73s               kubelet        Allocate failed due to rpc error: code = Unavailable desc = transport is closing, which is unexpected
      Warning  FailedMount               9s (x8 over 73s)  kubelet        MountVolume.SetUp failed for volume "kube-api-access-8hcmj" : object "default"/"kube-root-ca.crt" not registered
    
    

    Example file as follows:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 vGPUs
              nvidia.com/gpumem-percentage: 50 # Each vGPU contains 50% device memory of that GPU (Optional,Integer)
              nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)
    