The OpenAIOS vGPU scheduler for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory.

OpenAIOS vGPU scheduler for Kubernetes


English version|中文版

Introduction

The 4paradigm k8s vGPU scheduler is an "all-in-one" chart to manage GPUs in your Kubernetes cluster. It has everything you would expect from a k8s GPU manager, including:

GPU sharing: Each task can allocate a portion of a GPU instead of a whole card, so a GPU can be shared among multiple tasks.

Device Memory Control: Each vGPU can be allocated a specific amount of device memory, and the plugin ensures that usage does not exceed that limit.

Virtual Device Memory: You can oversubscribe GPU device memory by using host memory as swap.

Easy to use: You don't need to modify your task YAML to use our scheduler. All your GPU jobs will be automatically supported after installation.

The k8s vGPU scheduler retains the features of the 4paradigm k8s-device-plugin (4paradigm/k8s-device-plugin), such as splitting a physical GPU and limiting device memory and compute units, and adds a scheduling module to balance GPU usage across GPU nodes. In addition, it allows users to allocate GPUs by specifying device memory and device core usage. Furthermore, the vGPU scheduler can virtualize device memory (the device memory in use can exceed the physical device memory), making it possible to run tasks with large device memory requirements or to increase the number of shared tasks. You can refer to the benchmarks report.

When to use

  1. Scenarios where pods need to be allocated a certain amount of device memory or a share of device cores.
  2. The need to balance GPU usage across a cluster with multiple GPU nodes.
  3. Low utilization of device memory and compute units, for example running 10 tf-serving instances on one GPU.
  4. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is shared by many students, or cloud platforms offering small GPU instances.
  5. Cases where physical device memory is insufficient, so virtual device memory can be turned on, for example training with large batches or large models.

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

  • NVIDIA drivers ~= 384.81
  • nvidia-docker version > 2.0
  • Kubernetes version >= 1.16
  • glibc >= 2.17
  • kernel version >= 3.10
  • helm

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that both the NVIDIA drivers and nvidia-docker have been installed.

Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit.

You will need to enable the NVIDIA runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

If runtimes is not already present, head to the install page of nvidia-docker.
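
After editing /etc/docker/daemon.json, Docker normally has to be restarted before the new default runtime takes effect. A minimal sketch, assuming a systemd-based host:

sudo systemctl daemon-reload
sudo systemctl restart docker
# optionally confirm that nvidia is now the default runtime
docker info | grep -i 'default runtime'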

Then, you need to label the GPU nodes that should be scheduled by the 4pd-k8s-scheduler by adding the label "gpu=on"; otherwise, they cannot be managed by our scheduler.

kubectl label nodes {nodeid} gpu=on
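
You can check which nodes carry the label with, for example:

kubectl get nodes -l gpu=on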

Download

Once you have configured the options above on all the GPU nodes in your cluster, remove the existing NVIDIA device plugin for Kubernetes if it is already deployed. Then clone our project and enter the deployments folder:

$ git clone https://github.com/4paradigm/k8s-vgpu-scheduler.git
$ cd k8s-vgpu-scheduler/deployments

Set scheduler image version

Check your Kubernetes version by using the following command:

kubectl version

Then set the Kubernetes scheduler image version according to your Kubernetes server version via the key scheduler.kubeScheduler.image in the deployments/values.yaml file. For example, if your cluster server version is 1.16.8, you should change the image version to 1.16.8:

scheduler:
  kubeScheduler:
    image: "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.16.8"

Enabling vGPU Support in Kubernetes

You can customize your installation by adjusting configs.

After checking those config arguments, you can enable the vGPU support by the following command:

$ helm install vgpu vgpu -n kube-system
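
If you prefer not to edit values.yaml, the chart values that appear elsewhere in this document (such as devicePlugin.deviceSplitCount, devicePlugin.deviceMemoryScaling and scheduler.kubeScheduler.imageTag) can also be passed on the command line with --set; the values below are purely illustrative:

$ helm install vgpu vgpu -n kube-system \
    --set scheduler.kubeScheduler.imageTag=v1.16.8 \
    --set devicePlugin.deviceSplitCount=10 \
    --set devicePlugin.deviceMemoryScaling=1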

You can verify your installation by the following command:

$ kubectl get pods -n kube-system

If both the vgpu-device-plugin and vgpu-scheduler pods are in the Running state, your installation is successful.
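
For example, to list only the vGPU components:

$ kubectl get pods -n kube-system | grep vgpu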

Running GPU Jobs

NVIDIA vGPUs can now be requested by a container using the nvidia.com/gpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional,Integer)
          nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)

Be aware that if the task can't fit on any GPU node (i.e. the number of nvidia.com/gpu you request exceeds the number of GPUs on every node), the task will get stuck in the Pending state.

You can now execute the nvidia-smi command in the container and see the difference in GPU memory between the vGPU and the real GPU.
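
For example, assuming the manifest above was saved as gpu-pod.yaml, a quick check could look like this:

$ kubectl apply -f gpu-pod.yaml
$ kubectl exec -it gpu-pod -- nvidia-smi   # the memory reported inside the container should reflect the vGPU allocation rather than the whole card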

WARNING: if you don't request vGPUs when using the device plugin with NVIDIA images, all the vGPUs on the machine will be exposed inside your container.

Upgrade

To upgrade k8s-vGPU to the latest version, all you need to do is reinstall the chart. The latest version will be downloaded automatically.

$ helm uninstall vgpu -n kube-system
$ helm install vgpu vgpu -n kube-system

Uninstall

helm uninstall vgpu -n kube-system

Scheduling

The current scheduling strategy is to select the GPU with the fewest running tasks, thus balancing the load across multiple GPUs.

Benchmarks

Three instances from ai-benchmark have been used to evaluate vGPU-device-plugin performance as follows

Test Environment:
  • Kubernetes version: v1.12.9
  • Docker version: 18.09.1
  • GPU Type: Tesla V100
  • GPU Num: 2

Test instances:
  • nvidia-device-plugin: k8s + nvidia k8s-device-plugin
  • vGPU-device-plugin: k8s + vGPU k8s-device-plugin, without virtual device memory
  • vGPU-device-plugin (virtual device memory): k8s + vGPU k8s-device-plugin, with virtual device memory

Test Cases:

test id   case            type        params
1.1       Resnet-V2-50    inference   batch=50, size=346*346
1.2       Resnet-V2-50    training    batch=20, size=346*346
2.1       Resnet-V2-152   inference   batch=10, size=256*256
2.2       Resnet-V2-152   training    batch=10, size=256*256
3.1       VGG-16          inference   batch=20, size=224*224
3.2       VGG-16          training    batch=2, size=224*224
4.1       DeepLab         inference   batch=2, size=512*512
4.2       DeepLab         training    batch=1, size=384*384
5.1       LSTM            inference   batch=100, size=1024*300
5.2       LSTM            training    batch=10, size=1024*300

Test Results: (benchmark result charts omitted)

To reproduce:

  1. Install k8s-vGPU-scheduler and configure it properly.
  2. Run the benchmark job:
$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
  3. View the result by using kubectl logs:
$ kubectl logs [pod id]
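
For example, assuming the benchmark Pod's name contains ai-benchmark, you can look it up and read its logs like this:

$ kubectl get pods | grep ai-benchmark
$ kubectl logs [pod id]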

Features

  • Specify the number of vGPUs divided by each physical GPU.
  • Limits vGPU's Device Memory.
  • Allows vGPU allocation by specifying device memory
  • Limits vGPU's Streaming Multiprocessor.
  • Allows vGPU allocation by specifying device core usage
  • Zero changes to existing programs

Experimental Features

  • Virtual Device Memory

    The device memory of a vGPU can exceed the physical device memory of the GPU. In that case, the excess portion is placed in host RAM, which has some impact on performance.
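
    A minimal sketch of enabling this at install time, assuming that the devicePlugin.deviceMemoryScaling chart value (which appears in several issues below) controls the oversubscription ratio, so that 2 would let each physical GPU advertise twice its physical device memory:

    $ helm install vgpu vgpu -n kube-system \
        --set devicePlugin.deviceMemoryScaling=2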

Known Issues

  • Currently, A100 MIG is not supported
  • Currently, only computing tasks are supported, and video codec processing is not supported.

TODO

  • Support video codec processing
  • Support Multi-Instance GPUs (MIG)

Tests

  • TensorFlow 1.14.0/2.4.1
  • torch 1.1.0
  • mxnet 1.4.0
  • mindspore 1.1.1

The above frameworks have passed the test.

Issues and Contributing

Authors

Owner: 4Paradigm Open Source Community
Comments
  • [4pdvGPU ERROR (pid:167 thread=140191321859904 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4

    hi,

    Three GPUs are allocated inside the container and six worker processes are started. They start normally, but after running for a while the following error is reported:

    [4pdvGPU ERROR (pid:167 thread=140191321859904 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4 python3.7: /home/limengxuan/work/libcuda_override/src/multiprocess/multiprocess_memory_limit.c:455: lock_shrreg: Assertion 0' failed. python3.7: /home/limengxuan/work/libcuda_override/src/multiprocess/multiprocess_memory_limit.c:455: lock_shrreg: Assertion0' failed. [2022-06-16 18:30:58 +0800] [12] [ERROR] Exception in worker process Traceback (most recent call last): File "/opt/python37/lib/python3.7/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker worker.init_process() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/ggevent.py", line 146, in init_process super().init_process() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/base.py", line 134, in init_process self.load_wsgi() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi self.wsgi = self.app.wsgi() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi self.callable = self.load() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in load return self.load_wsgiapp() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp return util.import_app(self.app_uri) File "/opt/python37/lib/python3.7/site-packages/gunicorn/util.py", line 359, in import_app mod = importlib.import_module(module) File "/opt/python37/lib/python3.7/importlib/init.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1006, in _gcd_import File "", line 983, in _find_and_load File "", line 967, in _find_and_load_unlocked File "", line 677, in _load_unlocked File "", line 728, in exec_module File "", line 219, in _call_with_frames_removed File "/translation/server.py", line 40, in initModels(device) File "/translation/Opus.py", line 12, in initModels model = OPUSModel(device, OPUS_PATH+m) File "/translation/OpusMT.py", line 16, in init self.model.to(self.device) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 673, in to return self._apply(convert) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) [Previous line repeated 2 more times] File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply param_applied = fn(param) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 671, in convert return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking) RuntimeError: CUDA error: invalid argument [2022-06-16 18:30:58 +0800] [168] [ERROR] Exception in worker process Traceback (most recent call last): File "/opt/python37/lib/python3.7/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker worker.init_process() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/ggevent.py", line 146, in init_process super().init_process() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/base.py", line 134, in init_process self.load_wsgi() File "/opt/python37/lib/python3.7/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi self.wsgi = self.app.wsgi() File 
"/opt/python37/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi self.callable = self.load() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in load return self.load_wsgiapp() File "/opt/python37/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp return util.import_app(self.app_uri) File "/opt/python37/lib/python3.7/site-packages/gunicorn/util.py", line 359, in import_app mod = importlib.import_module(module) File "/opt/python37/lib/python3.7/importlib/init.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1006, in _gcd_import File "", line 983, in _find_and_load File "", line 967, in _find_and_load_unlocked File "", line 677, in _load_unlocked File "", line 728, in exec_module File "", line 219, in _call_with_frames_removed File "/translation/server.py", line 40, in initModels(device) File "/translation/Opus.py", line 12, in initModels model = OPUSModel(device, OPUS_PATH+m) File "/translation/OpusMT.py", line 16, in init self.model.to(self.device) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 673, in to return self._apply(convert) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply module._apply(fn) [Previous line repeated 3 more times] File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply param_applied = fn(param) File "/opt/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 671, in convert return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking) RuntimeError: CUDA error: invalid argument [2022-06-16 18:30:58 +0800] [12] [INFO] Worker exiting (pid: 12) [2022-06-16 18:30:58 +0800] [168] [INFO] Worker exiting (pid: 168) [2022-06-16 18:30:59 +0800] [163] [WARNING] Worker with pid 167 was terminated due to signal 6 [2022-06-16 18:30:59 +0800] [688] [INFO] Booting worker with pid: 688 [2022-06-16 18:31:00 +0800] [7] [WARNING] Worker with pid 11 was terminated due to signal 6 [2022-06-16 18:31:00 +0800] [752] [INFO] Booting worker with pid: 752 merge pid=45194merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge 
pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pi d=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=45194merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49905merge pid=49905merge pid=49905merge pid=49905merge pid=49905merge pid=49905merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=59211merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=49804merge pid=45194merge pid=45194merge pid=45847merge pid=45847merge pid=45847merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47069merge pid=47069merge pid=47069merge pid=47069merge pid=47069[2022-06-16 18:31:36 +0800] [163] [INFO] Shutting down: Master [2022-06-16 18:31:36 +0800] [163] [INFO] Reason: Worker failed to boot. [2022-06-16 18:31:37 +0800] [7] [INFO] Shutting down: Master [2022-06-16 18:31:37 +0800] [7] [INFO] Reason: Worker failed to boot. [4pdvGPU ERROR (pid:460 thread=140489760016192 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4 python3.7: /home/limengxuan/work/libcuda_override/src/multiprocess/multiprocess_memory_limit.c:455: lock_shrreg: Assertion `0' failed. 
[2022-06-17 04:42:32 +0800] [454] [WARNING] Worker with pid 460 was terminated due to signal 6 [2022-06-17 04:42:32 +0800] [4395] [INFO] Booting worker with pid: 4395 merge pid=47068merge pid=59211merge pid=59211merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=47068merge pid=11907merge pid=11907merge pid=59211merge pid=59211merge pid=59211merge pid=47068merge pid=47068merge pid=47068merge pid=47068[2022-06-17 04:45:49 +0800] [454] [CRITICAL] WORKER TIMEOUT (pid:459) [2022-06-17 04:45:54 +0800] [454] [WARNING] Worker with pid 459 was terminated due to signal 9 [2022-06-17 04:45:54 +0800] [4504] [INFO] Booting worker with pid: 4504 merge pid=11907merge pid=59211merge pid=59211merge pid=11907merge pid=11907merge pid=11907merge pid=11907merge pid=11907merge pid=19026merge pid=19026merge pid=59211merge pid=59211merge pid=59211merge pid=11907merge pid=11907merge pid=11907merge pid=11907[2022-06-17 04:49:46 +0800] [454] [CRITICAL] WORKER TIMEOUT (pid:4395) [2022-06-17 04:49:48 +0800] [454] [WARNING] Worker with pid 4395 was terminated due to signal 9 [2022-06-17 04:49:48 +0800] [4611] [INFO] Booting worker with pid: 4611

  • Unable to schedule blue/gpu-pod1

    After installing according to the instructions:

    [root@host-172-21-9-35 gpu]# cat vgpu-test.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod1
      namespace: blue
    spec:
      nodeSelector:
        gpu: "on"
      containers:
      - name: gpu-pod
        image: nvidia/cuda:9.0-base
        command: ["/bin/sh", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 3
                   nvidia.com/gpumem: 1000
                   nvidia.com/gpucores: 30
    [root@host-172-21-9-35 gpu]# kubectl create -f vgpu-test.yaml
    error: error parsing vgpu-test.yaml: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
    

    After deleting

                   nvidia.com/gpumem: 1000
                   nvidia.com/gpucores: 30
    

    the Pod can be created, but it stays in Pending:

    [root@host-172-21-9-35 gpu]# kubectl get pods -n blue
    NAME       READY   STATUS    RESTARTS   AGE
    gpu-pod1   0/1     Pending   0          8s
    

    Scheduler log output:

    [root@host-172-21-9-35 gpu]# kubectl logs -n kube-system vgpu-scheduler-b4f756599-qrh9j kube-scheduler --tail=10
    I0124 08:49:33.368916       1 scheduling_queue.go:841] About to try and schedule pod blue/gpu-pod1
    I0124 08:49:33.368952       1 scheduler.go:606] Attempting to schedule pod: blue/gpu-pod1
    I0124 08:49:33.398379       1 factory.go:453] Unable to schedule blue/gpu-pod1: no fit: 0/5 nodes are available: 4 node(s) didn't match node selector.; waiting
    I0124 08:49:33.398456       1 scheduler.go:773] Updating pod condition for blue/gpu-pod1 to (PodScheduled==False, Reason=Unschedulable)
    I0124 08:49:33.412627       1 generic_scheduler.go:1212] Node host-172-18-199-14 is a potential node for preemption.
    I0124 08:50:58.830278       1 scheduling_queue.go:841] About to try and schedule pod blue/gpu-pod1
    I0124 08:50:58.830322       1 scheduler.go:606] Attempting to schedule pod: blue/gpu-pod1
    I0124 08:50:58.832585       1 factory.go:453] **Unable to schedule blue/gpu-pod1: no fit: 0/5 nodes are available: 4 node(s) didn't match node selector.; waiting**
    I0124 08:50:58.832671       1 scheduler.go:773] Updating pod condition for blue/gpu-pod1 to (PodScheduled==False, Reason=Unschedulable)
    I0124 08:50:58.838431       1 generic_scheduler.go:1212] Node host-172-18-199-14 is a potential node for preemption.
    

    Cluster plugin Pods:

    [root@host-172-21-9-35 gpu]# kubectl get pods -n kube-system | grep gpu
    vgpu-device-plugin-km2r6                         2/2     Running            0          17m
    vgpu-scheduler-b4f756599-qrh9j                   2/2     Running            0          17m
    

    GPU host information:

    [host-172-18-199-14@root]/root$ rpm -qa | grep nvidia
    libnvidia-container1-1.6.0-1.x86_64
    nvidia-container-toolkit-1.6.0-1.x86_64
    nvidia-container-runtime-3.6.0-1.noarch
    nvidia-docker2-2.7.0-1.noarch
    libnvidia-container-tools-1.6.0-1.x86_64
    

    NVIDIA driver version: NVIDIA-Linux-x86_64-470.82.01.run

    Docker information:

    [host-172-18-199-14@root]/root$ docker info
    Client:
     Debug Mode: false
    
    Server:
     Containers: 48
      Running: 28
      Paused: 0
      Stopped: 20
     Images: 25
     Server Version: 19.03.12
     Storage Driver: overlay2
      Backing Filesystem: xfs
      Supports d_type: true
      Native Overlay Diff: true
     Logging Driver: json-file
     Cgroup Driver: systemd
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
     Swarm: inactive
     Runtimes: nvidia runc
     Default Runtime: nvidia
    
    
  • Invalid device memory limit: CUDA_DEVICE_SM_LIMIT=0

    When I use it inside a container, many warnings are printed, although the GPU works normally. The install command is as follows:

     helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.20.10  --set devicePlugin.deviceMemoryScaling=1 --set devicePlugin.deviceSplitCount=2  -n kube-system
    

    What could be the cause?

  • Help: enabling vGPU appears to have no effect

    nvidia-smi inside the container still shows the full device memory.

    Install command: helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.21.5 --set devicePlugin.deviceSplitCount=24 --set scheduler.defaultMem=1000 -n gpu-operator-resources

    NVIDIA environment:

    k8s@node-gpu1:~$ dpkg --get-selections | grep nvidia libnvidia-cfg1-510:amd64 install libnvidia-common-510 install libnvidia-compute-510:amd64 install libnvidia-compute-510:i386 install libnvidia-container-tools install libnvidia-container1:amd64 install libnvidia-decode-510:amd64 install libnvidia-decode-510:i386 install libnvidia-encode-510:amd64 install libnvidia-encode-510:i386 install libnvidia-extra-510:amd64 install libnvidia-fbc1-510:amd64 install libnvidia-fbc1-510:i386 install libnvidia-gl-510:amd64 install libnvidia-gl-510:i386 install nvidia-compute-utils-510 install nvidia-container-toolkit install nvidia-dkms-510 install nvidia-docker2 install nvidia-driver-510 install nvidia-kernel-common-510 install nvidia-kernel-source-510 install nvidia-modprobe install nvidia-prime install nvidia-settings install nvidia-utils-510 install xserver-xorg-video-nvidia-510 install

  • Unable to schedule when the cluster has GPUs of different models

    The cluster has three GPU nodes: gpu1 and gpu2 each have two Tesla V100 cards with driver version 515.86.01, and gpu3 has four RTX A6000 cards with driver version 525.60.11.

    I deployed the latest vgpu as instructed in the README, with devicePlugin.deviceSplitCount=2. kubectl describe nodes then shows that gpu1 and gpu2 each expose 4 nvidia.com/gpu resources and gpu3 exposes 8.

    After setting limits.nvidia.com/gpu: 1 in the application Pod, the Pod is stuck in the Pending phase, and kubectl describe shows the following warning:

      Type     Reason            Age   From           Message
      ----     ------            ----  ----           -------
      Warning  FailedScheduling  19s   4pd-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node unregisterd.
    

    After evicting the gpu3 node, Pods start normally on gpu1 and gpu2.

  • vgpu-device-plugin is not installed when running helm install vgpu vgpu -n kube-system

    Command executed: helm install vgpu vgpu-charts/vgpu --set devicePlugin.deviceSplitCount=8 --set devicePlugin.deviceMemoryScaling=4 --set scheduler.kubeScheduler.imageTag=v1.20.0 -n kube-system

  • Demo fails with: bash: symbol lookup error: /usr/local/vgpu/libvgpu.so: undefined symbol: cuMemAllocAsync

    Hi, I need help. When I use the demo:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 3 # requesting 2 vGPUs
              nvidia.com/gpumem: 3000 # Each vGPU requests 3000m device memory (Optional, Integer)
              nvidia.com/gpucores: 20 # Each vGPU uses 30% of the real GPU's compute (Optional, Integer)
    

    The error is:

     bash: symbol lookup error: /usr/local/vgpu/libvgpu.so: undefined symbol: cuMemAllocAsync
    
  • After reinstalling, the devicePlugin cannot be created correctly

    Hello. Because a system update changed the graphics driver, I redeployed k8s-vgpu-scheduler. The scheduler deploys without problems, but it fails when it gets to the devicePlugin.

    The following error is reported:

    七 28 17:30:34 workernode kubelet[108628]: E0728 17:30:34.498027 108628 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"device-plugin\" with PostStartHookError: \"Exec lifecycle hook ([/bin/sh -c mv /usrbin/nvidia-container-runtime /usrbin/nvidia-container-runtime-4pdbackup;cp /k8s-vgpu/bin/nvidia-container-runtime /usrbin/;cp -f /k8s-vgpu/lib/* /usr/local/vgpu/]) for Container \\\"device-plugin\\\" in Pod \\\"vgpu-device-plugin-2sx47_kube-system(0a6c9800-2873-4368-9dcf-be0659f94b7f)\\\" failed - error: command '/bin/sh -c mv /usrbin/nvidia-container-runtime /usrbin/nvidia-container-runtime-4pdbackup;cp /k8s-vgpu/bin/nvidia-container-runtime /usrbin/;cp -f /k8s-vgpu/lib/* /usr/local/vgpu/' exited with 126: , message: \\\"OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown\\\\r\\\\n\\\"\"" pod="kube-system/vgpu-device-plugin-2sx47" podUID=0a6c9800-2873-4368-9dcf-be0659f94b7f

    It looks like the container stops immediately after starting, which causes the subsequent commands to fail. Has this problem been seen before, and how can it be resolved? Thanks.

    Addendum: nvidia-docker has been redeployed and the official test program runs without problems (docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi). All other Kubernetes services also deploy normally.

    System information:

    • system os: ubuntu 20.04
    • cluster version: 1.23.4
    • docker version: 20.10.7
    • nvidia docker2 version: 2.11.0
    • k8s-vgpu-scheduler version: latest
    • nvidia driver version: 515.57
    • gpu card: RTX 2060 Super
  • Can the GPU metrics from the vgpu-device-plugin-monitor service be integrated into Prometheus?

    Hi,

    Following https://zhuanlan.zhihu.com/p/125692899 to deploy GPU cluster monitoring, it looks like the vGPU metrics can be imported into Prometheus. I tried creating a ServiceMonitor, but the new vGPU target does not appear in Prometheus's target list. How should this be configured so that Prometheus can monitor vGPU resources? My ServiceMonitor configuration file is below.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      creationTimestamp: "2022-06-01T16:57:54Z"
      generation: 5
      labels:
        app: vgpu-metrics
      name: vgpu-metrics
      namespace: monitoring
      resourceVersion: "674182"
      selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors/vgpu-metrics
      uid: 06a166be-1142-4153-b06e-fe2691fd858a
    spec:
      endpoints:
      - path: /metrics
        port: monitorport
      jobLabel: jobLabel
      namespaceSelector:
        matchNames:
        - kube-system
      selector:
        matchLabels:
          app.kubernetes.io/component: 4pd-scheduler
  • Question about GPU node configuration

  • Installation failure after update

    Hello. I am trying to deploy vgpu-scheduler and found that after the update the helm installation fails. The logs show a kube-scheduler argument error, as follows.

    kubectl get pods -A
    kube-system   vgpu-device-plugin-k4bjg          2/2   Running            0               12m
    kube-system   vgpu-scheduler-5bc998b64f-bssbm   1/2   CrashLoopBackOff   6 (4m38s ago)   12m

    kubectl logs -n kube-system vgpu-scheduler-5bc998b64f-bssbm -c kube-scheduler
    Error: unknown flag: --policy-config-file

    How can this be resolved? Alternatively, how can I install an older version?

  • python + gunicorn with vGPU does not work properly: Worker with pid 284 was terminated due to signal 9

    [2023-01-04 13:37:05,017] backend.py [serve] [line:95] - INFO: === Running command 'gunicorn --timeout=60 -b 0.0.0.0:8080 -w 4 ${GUNICORN_CMD_ARGS} --max-requests 10 -- mlflow.pyfunc.scoring_server.wsgi:app' predict-DO5THCMQ 10.244.154.140:8080 [2023-01-04 13:37:05,047] nacosUtil.py [server_regist_and_send_beat] [line:183] - INFO: 第1次心跳日志:predict-DO5THCMQ 10.244.154.140:8080 [2023-01-04 13:37:05 +0800] [83] [INFO] Starting gunicorn 20.1.0 [2023-01-04 13:37:05 +0800] [83] [INFO] Listening at: http://0.0.0.0:8080 (83) [2023-01-04 13:37:05 +0800] [83] [INFO] Using worker: sync [2023-01-04 13:37:05 +0800] [85] [INFO] Booting worker with pid: 85 [2023-01-04 13:37:05 +0800] [109] [INFO] Booting worker with pid: 109 [2023-01-04 13:37:05 +0800] [110] [INFO] Booting worker with pid: 110 [2023-01-04 13:37:05 +0800] [116] [INFO] Booting worker with pid: 116 [4pdvGPU Msg(85:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(85:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(85:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(85:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(85:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(85:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(109:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(109:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(109:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(109:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(109:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(116:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(116:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(116:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(116:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(116:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(110:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(110:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(110:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(110:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(110:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... 
[4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... 
[4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [2023-01-04 13:38:05 +0800] [83] [CRITICAL] WORKER TIMEOUT (pid:85) [2023-01-04 13:38:05 +0800] [83] [CRITICAL] WORKER TIMEOUT (pid:109) [2023-01-04 13:38:05 +0800] [83] [CRITICAL] WORKER TIMEOUT (pid:110) [2023-01-04 13:38:05 +0800] [83] [CRITICAL] WORKER TIMEOUT (pid:116) [4pdvGPU Msg(85:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(110:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(116:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(109:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [2023-01-04 13:38:06 +0800] [83] [WARNING] Worker with pid 85 was terminated due to signal 9 [2023-01-04 13:38:06 +0800] [83] [WARNING] Worker with pid 109 was terminated due to signal 9 [2023-01-04 13:38:06 +0800] [284] [INFO] Booting worker with pid: 284 [2023-01-04 13:38:06 +0800] [83] [WARNING] Worker with pid 116 was terminated due to signal 9 [2023-01-04 13:38:06 +0800] [285] [INFO] Booting worker with pid: 285 [2023-01-04 13:38:07 +0800] [318] [INFO] Booting worker with pid: 318 [2023-01-04 13:38:07 +0800] [83] [WARNING] Worker with pid 110 was terminated due to signal 9 [2023-01-04 13:38:07 +0800] [333] [INFO] Booting worker with pid: 333 [4pdvGPU Msg(318:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(318:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(318:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(318:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(318:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(318:140143647065664:multiprocess_memory_limit.c:545)]: Kick dead proc 85 [4pdvGPU Warn(318:140143647065664:multiprocess_memory_limit.c:545)]: Kick dead proc 109 [4pdvGPU Warn(318:140143647065664:multiprocess_memory_limit.c:545)]: Kick dead proc 110 [4pdvGPU Warn(318:140143647065664:multiprocess_memory_limit.c:545)]: Kick dead proc 116 [4pdvGPU Warn(318:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(318:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(284:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(284:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(284:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(284:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(285:140143647065664:libvgpu.c:871)]: Initializing... 
[4pdvGPU Msg(285:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(285:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(285:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(284:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Msg(285:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0 [4pdvGPU Warn(284:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(284:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Warn(285:140143647065664:export_table.c:158)]: Internal function call: 6b initiated=0 exportTable=0x7f75bd59b6c0 [4pdvGPU Msg(285:140143647065664:utils.c:34)]: unified_lock locked, waiting 1 second... [4pdvGPU Msg(333:140143647065664:libvgpu.c:871)]: Initializing... [4pdvGPU Msg(333:140143647065664:device.c:248)]: driver version=10020 [4pdvGPU Msg(333:140143647065664:hook.c:400)]: loaded nvml libraries [4pdvGPU Msg(333:140143647065664:hook.c:408)]: initial_virtual_map [4pdvGPU Msg(333:140143647065664:multiprocess_memory_limit.c:101)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0

  • Is this vGPU sharing technology more like Tencent's vCUDA or Alibaba's cGPU?

    Here is an article introducing GPU sharing techniques. It describes two approaches: 1) CUDA-layer interception (Tencent Cloud's vCUDA, now discontinued?) and 2) GPU driver-layer interception (Alibaba Cloud's cGPU). The drawback of the first approach is its dependency on CUDA: when a new CUDA version adds features or changes interfaces, the first approach may no longer work. Overall, the second approach looks superior.

    Tencent Cloud's newer GPU sharing solution, qGPU, appears to use the second approach.

    Which one does this vGPU use? From earlier issues it looks like the first.

    If vGPU used the second approach, it would be worth praising and trying.

  • Device memory accounting problem causes a device OOM error that terminates inference

    We use three RTX A4000 cards (16 GB of device memory each) for model inference. Each task's resource request is shown below. Since our GPU utilization is not high, we do not limit gpucores and only limit device memory, so each A4000 can run two tasks.

    resources:
            limits:
              nvidia.com/gpu: '1'
              nvidia.com/gpucores: '0'
              nvidia.com/gpumem: 8k
    

    As expected, six tasks started successfully. After running for about 30 minutes, one task failed with the error usage=9502195712 limit=8388608000. However, several monitors show that each task uses less than 6 GB of device memory, and our inference service is a production-grade build that has never exceeded 6 GB under load testing. The usage reported in this log is 9.5 GB for a single task, which triggers the over-limit error and stops inference. Our engineering team believes the usage accounting is at fault; please help look into it.

     [4pdvGPU ERROR (pid:99 thread=139987068696320 multiprocess_memory_limit.c:277)]: device OOM encountered: usage=9502195712 limit=8388608000
     python: /libcuda_override/src/multiprocess/multiprocess_memory_limit.c:277: set_gpu_device_memory_monitor: Assertion `0' failed.
     /usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 40 leaked semaphores to clean up at shutdown
    
       len(cache))
    

    The server metrics show that none of the six tasks uses more than 6 GB of device memory:

    # HELP Device_memory_desc_of_container Container device meory description
    # TYPE Device_memory_desc_of_container counter
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2244253696",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0005-dgj8p",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.601089752e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2376374272",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0003-nnhhx",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.733210328e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2386860032",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0004-c8tnp",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.743696088e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2693044224",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0002-29d6z",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.04988028e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2718210048",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0001-xtcn5",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.075046104e+09
    Device_memory_desc_of_container{context="162529280",ctrname="model",data="2745473024",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",module="2194306776",offset="0",podname="ct-urinary-d3f914dc-0006-bbdvh",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.10230908e+09
    # HELP HostCoreUtilization GPU core utilization
    # TYPE HostCoreUtilization gauge
    HostCoreUtilization{deviceid="0",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",zone="vGPU"} 0
    HostCoreUtilization{deviceid="1",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",zone="vGPU"} 44
    HostCoreUtilization{deviceid="2",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",zone="vGPU"} 0
    # HELP HostGPUMemoryUsage GPU device memory usage
    # TYPE HostGPUMemoryUsage gauge
    HostGPUMemoryUsage{deviceid="0",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",zone="vGPU"} 4648
    HostGPUMemoryUsage{deviceid="1",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",zone="vGPU"} 8573
    HostGPUMemoryUsage{deviceid="2",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",zone="vGPU"} 9441
    # HELP vGPU_device_memory_limit_in_bytes vGPU device limit
    # TYPE vGPU_device_memory_limit_in_bytes gauge
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",podname="ct-urinary-d3f914dc-0004-c8tnp",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",podname="ct-urinary-d3f914dc-0005-dgj8p",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",podname="ct-urinary-d3f914dc-0002-29d6z",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",podname="ct-urinary-d3f914dc-0006-bbdvh",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",podname="ct-urinary-d3f914dc-0001-xtcn5",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    vGPU_device_memory_limit_in_bytes{ctrname="model",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",podname="ct-urinary-d3f914dc-0003-nnhhx",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 8.388608e+09
    # HELP vGPU_device_memory_usage_in_bytes vGPU device usage
    # TYPE vGPU_device_memory_usage_in_bytes gauge
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",podname="ct-urinary-d3f914dc-0004-c8tnp",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.743696088e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-3430d85b-1c05-0ac5-f70e-59124f911cd1",podname="ct-urinary-d3f914dc-0005-dgj8p",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.601089752e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",podname="ct-urinary-d3f914dc-0002-29d6z",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.04988028e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-975b108d-6a83-9569-5d0d-30c897b07063",podname="ct-urinary-d3f914dc-0006-bbdvh",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.10230908e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",podname="ct-urinary-d3f914dc-0001-xtcn5",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 5.075046104e+09
    vGPU_device_memory_usage_in_bytes{ctrname="model",deviceuuid="GPU-b3c5c445-4fe2-c012-3397-a2ad3d7515cc",podname="ct-urinary-d3f914dc-0003-nnhhx",podnamespace="distribution",vdeviceid="0",zone="vGPU"} 4.733210328e+09
    

    The chart below shows the GPU monitoring on the server. Each card runs two tasks, each requesting 8 GB of device memory, and the cumulative device memory per card stays below 12 GB. Starting at 11:56, however, one task on card 0 stopped inference because of the device OOM error, even though its actual memory usage never exceeded the limit.


  • Specifying a node in K8s causes an error

    Description: Hello, I am using the example from the official documentation. When I add the line nodeName: ip, the Pod fails to start; when I remove that line, the Pod starts. I am using k8s 1.16.8. What is the reason?

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      nodeName: master
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 1000"]
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 2 vGPUs
              nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional,Integer)
              nvidia.com/gpucores: 60 # Each vGPU uses 30% of the entire GPU (Optional,Integer)

  • UnexpectedAdmissionError

    The errors are as follows:

    Events:
      Type     Reason                    Age               From           Message
      ----     ------                    ----              ----           -------
      Warning  FailedScheduling          73s               4pd-scheduler  AssumePod failed: pod c4c9c2a8-83d3-4e9b-b6fa-ed47719d3d19 is in the cache, so can't be assumed
      Warning  FailedScheduling          73s               4pd-scheduler  AssumePod failed: pod c4c9c2a8-83d3-4e9b-b6fa-ed47719d3d19 is in the cache, so can't be assumed
      Normal   Scheduled                 73s               4pd-scheduler  Successfully assigned default/gpu-pod to pkm-05
      Warning  FailedScheduling          73s               4pd-scheduler  AssumePod failed: pod c4c9c2a8-83d3-4e9b-b6fa-ed47719d3d19 is in the cache, so can't be assumed
      Warning  UnexpectedAdmissionError  73s               kubelet        Allocate failed due to rpc error: code = Unavailable desc = transport is closing, which is unexpected
      Warning  FailedMount               9s (x8 over 73s)  kubelet        MountVolume.SetUp failed for volume "kube-api-access-8hcmj" : object "default"/"kube-root-ca.crt" not registered
    
    

    Example file as follows:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 vGPUs
              nvidia.com/gpumem-percentage: 50 # Each vGPU contains 50% device memory of that GPU (Optional,Integer)
              nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)
    