GPU Sharing Scheduler for Kubernetes Cluster

GPU Sharing Scheduler Extender in Kubernetes


Overview

More and more data scientists run their Nvidia GPU based inference tasks on Kubernetes. Many of these tasks can run on the same Nvidia GPU device, increasing GPU utilization, so one important challenge is how to share GPUs between pods. The community is also very interested in this topic.

Now there is a GPU sharing solution for native Kubernetes: it is based on the scheduler extender and device plugin mechanisms, so you can easily reuse it in your own Kubernetes cluster.
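
For example, instead of requesting a whole GPU through nvidia.com/gpu, a pod asks for a slice of GPU memory through the aliyun.com/gpu-mem extended resource. Below is a minimal sketch of such a pod (the name, image and amount are placeholders; the memory unit, GiB or MiB, depends on how the device plugin is deployed):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo              # placeholder name
spec:
  containers:
  - name: app
    image: nvidia/cuda:11.0-base    # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        aliyun.com/gpu-mem: 2       # GPU memory to reserve on a shared GPU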

Prerequisites

  • Kubernetes 1.11+
  • golang 1.10+
  • NVIDIA drivers ~= 361.93
  • Nvidia-docker version > 2.0 (see how to install it and its prerequisites)
  • Docker configured with Nvidia as the default runtime (a sample daemon.json is shown below)
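
For the last prerequisite, Docker is usually pointed at the NVIDIA runtime through /etc/docker/daemon.json. A typical configuration (the runtime path may differ on your system, and Docker must be restarted afterwards) looks roughly like this:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}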

Design

For more details about the design of this project, please read this Design document.

Setup

You can follow this Installation Guide. If you are using Alibaba Cloud Kubernetes, please follow this doc to install with Helm Charts.

User Guide

You can check this User Guide.

Developing

Scheduler Extender

git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git && cd gpushare-scheduler-extender
docker build -t cheyang/gpushare-scheduler-extender .

Device Plugin

git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git && cd gpushare-device-plugin
docker build -t cheyang/gpushare-device-plugin .

Kubectl Extension

  • golang > 1.10
mkdir -p $GOPATH/src/github.com/AliyunContainerService
cd $GOPATH/src/github.com/AliyunContainerService
git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git
cd gpushare-device-plugin
go build -o $GOPATH/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go
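
After the build, kubectl discovers the binary as a plugin as long as it is on your PATH. One possible way to install and invoke it (the install path is an assumption; renaming the binary to kubectl-inspect-gpushare lets it be called in the form used throughout the docs):

# copy the plugin binary somewhere on PATH
cp $GOPATH/bin/kubectl-inspect-gpushare-v2 /usr/local/bin/kubectl-inspect-gpushare
# show per-GPU memory allocation across the cluster
kubectl inspect gpushare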

Demo

- Demo 1: Deploy multiple GPU-shared pods and schedule them onto the same GPU device in a binpack manner

- Demo 2: Avoid GPU memory requests that fit at the node level but not at the GPU device level (for example, a node's GPUs may have enough free memory in total, yet no single GPU can satisfy the request)

Related Project

Roadmap

  • Integrate Nvidia MPS as an option for isolation
  • Automated deployment for Kubernetes clusters set up by kubeadm
  • Scheduler Extender High Availability
  • Generic solution for GPU, RDMA and other devices

Adopters

If you are interested in GPUShare and would like to share your experiences with others, you are warmly welcome to add your information to the ADOPTERS.md page. We will continuously discuss new requirements and feature designs with you in advance.

Acknowledgments

  • The GPU sharing solution is based on Nvidia Docker2, and its GPU sharing design served as our reference. The Nvidia community has been very supportive, and we are very grateful.
Owner

Aliyun (Alibaba Cloud) Container Service
Alibaba Cloud Container Service: ACS (Container Service), ACK (Container Service for Kubernetes), ASK (Serverless Kubernetes), etc.
Comments
  • Modify scheduler configuration in minikube


    I use minikube to start a local k8s cluster and I am trying to share GPUs in my local minikube. In this link, https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md#2-modify-scheduler-configuration, I don't know how to modify the scheduler configuration, so I skipped this step. Finally, I executed the command kubectl inspect gpushare, but I got an empty result.

    [root@localhost gpushare]# kubectl inspect gpushare
    NAME  IPADDRESS  GPU Memory()
    
    Allocated/Total GPU Memory In Cluster:
    0/0 (0%) 
    

    Do I have to do the "modify scheduler configuration" step? If so, how can I specify the scheduler configuration in minikube? Thanks for your time.

  • policy-config-file is no longer supported by Kubernetes starting with v1.23


    from the installation instructions:

    Add Policy config file parameter in scheduler arguments - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json

    but this option is no longer supported starting with v1.23. Per the Kubernetes documentation (https://kubernetes.io/docs/reference/scheduling/policies/), we should instead pass --config pointing at a KubeSchedulerConfiguration file,

    and the config file has an apiVersion such as kubescheduler.config.k8s.io/v1beta2, for example:

    # /etc/kubernetes/scheduler-policy-config.yaml
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    clientConnection:
      kubeconfig: /etc/kubernetes/scheduler.conf
    extenders:
    - urlPrefix: "http://127.0.0.1:32766/gpushare-scheduler"
      filterVerb: filter
      bindVerb: bind
      enableHTTPS: false
      nodeCacheCapable: false
      managedResources:
      - name: aliyun.com/gpu-mem
        ignoredByScheduler: false
      ignorable: false
    

    and

    - --config=/etc/kubernetes/scheduler-policy-config.yaml
    

    tested on my machine and works well
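
    For reference, on a kubeadm-managed control plane this flag usually goes into the kube-scheduler static pod manifest, together with a mount that makes the config file visible inside the scheduler container. A minimal sketch of the relevant parts of /etc/kubernetes/manifests/kube-scheduler.yaml (paths are assumptions; all other existing flags and volumes stay unchanged):

    spec:
      containers:
      - command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-policy-config.yaml
        # ...keep the other existing kube-scheduler flags here...
        volumeMounts:
        - name: scheduler-policy-config
          mountPath: /etc/kubernetes/scheduler-policy-config.yaml
          readOnly: true
      volumes:
      - name: scheduler-policy-config
        hostPath:
          path: /etc/kubernetes/scheduler-policy-config.yaml
          type: FileOrCreate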

  • gpushare-device plugin daemonset is not working


    Hi there, I really appreciate your solution for GPU sharing in Kubernetes. I followed all the steps to set up this scheduler extender, but I am having an issue with the GPU share device plugin daemonset: it is not working, its desired number is zero, I am unable to describe that daemonset, and no pod is created by the daemonset. (Screenshot from 2021-05-24 13-42-09 attached.)

  • how to change the scheduler config when starting k8s with rke?


    I have gpushare-schd-extender-886d94bf6-fl5mf running, but as you can see, when using rke to start up k8s there is no scheduler config I can change for the scheduler container. Can you suggest how to include the JSON?

    [root@k8s-demo-slave1 kubernetes]# pwd
    /etc/kubernetes
    [root@k8s-demo-slave1 kubernetes]# ls
    scheduler-policy-config.json  ssl
    [root@k8s-demo-slave1 kubernetes]# cd ssl
    [root@k8s-demo-slave1 ssl]# ls
    kube-apiserver-key.pem                   kube-apiserver-requestheader-ca.pem           kubecfg-kube-controller-manager.yaml  kube-controller-manager-key.pem  kube-etcd-192-168-2-229.pem  kube-scheduler-key.pem
    kube-apiserver.pem                       kube-ca-key.pem                               kubecfg-kube-node.yaml                kube-controller-manager.pem      kube-node-key.pem            kube-scheduler.pem
    kube-apiserver-proxy-client-key.pem      kube-ca.pem                                   kubecfg-kube-proxy.yaml               kube-etcd-192-168-2-140-key.pem  kube-node.pem                kube-service-account-token-key.pem
    kube-apiserver-proxy-client.pem          kubecfg-kube-apiserver-proxy-client.yaml      kubecfg-kube-scheduler.yaml           kube-etcd-192-168-2-140.pem      kube-proxy-key.pem           kube-service-account-token.pem
    kube-apiserver-requestheader-ca-key.pem  kubecfg-kube-apiserver-requestheader-ca.yaml  kubecfg-kube-scheduler.yaml.bak       kube-etcd-192-168-2-229-key.pem  kube-proxy.pem
    
    [root@k8s-demo-slave1 ssl]# cat kubecfg-kube-scheduler.yaml
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        api-version: v1
        certificate-authority: /etc/kubernetes/ssl/kube-ca.pem
        server: "https://127.0.0.1:6443"
      name: "local"
    contexts:
    - context:
        cluster: "local"
        user: "kube-scheduler-local"
      name: "local"
    current-context: "local"
    users:
    - name: "kube-scheduler-local"
      user:
        client-certificate: /etc/kubernetes/ssl/kube-scheduler.pem
        client-key: /etc/kubernetes/ssl/kube-scheduler-key.pem
    
    
    [root@k8s-demo-slave1 ssl]# kubectl get pods -A
    
    NAMESPACE       NAME                                      READY   STATUS      RESTARTS   AGE
    ingress-nginx   default-http-backend-5bcc9fd598-8ggs8     0/1     Evicted     0          17d
    ingress-nginx   default-http-backend-5bcc9fd598-ch87f     1/1     Running     0          17d
    ingress-nginx   default-http-backend-5bcc9fd598-jbw26     0/1     Evicted     0          21d
    ingress-nginx   nginx-ingress-controller-df7sh            1/1     Running     0          21d
    ingress-nginx   nginx-ingress-controller-mr89d            1/1     Running     0          17d
    kube-system     canal-2bflt                               2/2     Running     0          16d
    kube-system     canal-h5sjc                               2/2     Running     0          16d
    kube-system     coredns-799dffd9c4-vzvrw                  1/1     Running     0          21d
    kube-system     coredns-autoscaler-84766fbb4-5xpk8        1/1     Running     0          21d
    kube-system     gpushare-schd-extender-886d94bf6-fl5mf    1/1     Running     0          28m
    kube-system     metrics-server-59c6fd6767-2ct2h           1/1     Running     0          17d
    kube-system     metrics-server-59c6fd6767-tphk6           0/1     Evicted     0          17d
    kube-system     metrics-server-59c6fd6767-vlbgp           0/1     Evicted     0          21d
    kube-system     rke-coredns-addon-deploy-job-8dbml        0/1     Completed   0          21d
    kube-system     rke-ingress-controller-deploy-job-7h6vd   0/1     Completed   0          21d
    kube-system     rke-metrics-addon-deploy-job-sbrlp        0/1     Completed   0          21d
    kube-system     rke-network-plugin-deploy-job-5r7d6       0/1     Completed   0          21d
    
    
  • I have a master and a GPU node; after I create a GPU pod, I get a problem.


    Error: failed to start container "binpack-1": Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "process_linux.go:339: running prestart hook 0 caused "error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-2MiB-to-run --compute --utility --require=cuda>=8.0 --pid=260909 /var/lib/docker/overlay2/64f498e224fd0a93b0e15b8769699a97527a3acab3c6288c4c8d939bbe4ca82c/merged]\nnvidia-container-cli: device error: unknown device id: no-gpu-has-2MiB-to-run\n"

  • Question: how does the GPU device number count in the container with "gpu-count"?


    Given a server with 8 GPUs, if we start a pod with "aliyun.com/gpu-count: 2" and the scheduler assigns GPU3 and GPU7 to this pod, what are the GPU numbers for these two GPU cards inside the pod? 0 and 1?

  • gpushare-device-plugin pod fails to start


    Hey everyone, I'm trying out the gpu-share-scheduler-extender on an RKE2 cluster. I've gone through all the steps:

    • Deploy GPU share scheduler extender ✅
    • Modify scheduler configuration ✅
    • Add gpushare node labels to the nodes requiring GPU sharing ✅

    But it fails on the last step: getting the device-plugin pod to start.

    My setup:

    • Kubernetes: RKE2 v1.21.6+rke2r1
    • Host OS: Ubuntu 20.04
    • CRI: containerd.
    • nvidia-container-runtime 3.8.1-1
    • nvidia-headless-460-server:amd64 470.103.01-0ubuntu0.20.04.1
    • nvidia-utils-460-server:amd64 470.103.01-0ubuntu0.20.04.1

    My containerd config.toml:

    version = 2
    
    [plugins]
    
      [plugins."io.containerd.grpc.v1.cri"]
        enable_selinux = false
        sandbox_image = "index.docker.io/rancher/pause:3.2"
        stream_server_address = "127.0.0.1"
        stream_server_port = "10010"
    
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"
          disable_snapshot_annotations = true
          snapshotter = "overlayfs"
    
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
              runtime_type = "io.containerd.runc.v2"
    
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
              runtime_type = "io.containerd.runc.v2"
    
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
                BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
    
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
              runtime_type = "io.containerd.runc.v2"
    
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    
      [plugins."io.containerd.internal.v1.opt"]
        path = "/data/rancher/rke2/agent/containerd"
    

    I can successfully start a container directly in containerd using the ctr command, and run nvidia-smi.

    # ctr -a /run/k3s/containerd/containerd.sock run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi
    Mon Mar 14 07:33:43 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-SXM...  Off  | 00000000:0B:00.0 Off |                    0 |
    | N/A   29C    P0    58W / 400W |      0MiB / 81251MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    But trying to start the gpushare-device-plugin pod, a cuda pod, or a tensor pod through Kubernetes all fail with the error below, no matter what command I try to run inside the pod.

    Warning Failed 13s (x2 over 14s) kubelet Error: failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init
    caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown
    

    I can start an ordinary ubuntu pod on the gpu node without issue, though. Does anyone have any ideas on what the problem might be, or where one should start troubleshooting?

  • error with gpushare


    I follow the Installation Guide, when i apply a yaml file with gpu resource, the status is always RunContainerError, and extender scheduler's log is "pod gpushare in ns default is not assigned to any node, skip", and this is pod imformation: " &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:gpushare,GenerateName:,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/gpushare,UID:1679f323-4474-11e9-bb2f-246e96b68028,ResourceVersion:788724,Generation:0,CreationTimestamp:2019-03-12 03:08:19 +0000 UTC,DeletionTimestamp:,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{kubectl.kubernetes.io/last-applied-configuration: {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"gpushare","namespace":"default"},"spec":{"containers":[{"image":"cr.d.xiaomi.net/jishaomin/pause:2.0","name":"test","resources":{"limits":{"aliyun.com/gpu-mem":"10G","cpu":2,"memory":"4G"}}}],"restartPolicy":"Always"}} ,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-76r29 {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-76r29,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{test cr.d.xiaomi.net/jishaomin/pause:2.0 [] [] [] [] [] {map[cpu:{{2 0} {} 2 DecimalSI} memory:{{4 9} {} 4G DecimalSI} aliyun.com/gpu-mem:{{10 9} {} 10G DecimalSI}] map[memory:{{4 9} {} 4G DecimalSI} aliyun.com/gpu-mem:{{10 9} {} 10G DecimalSI} cpu:{{2 0} {} 2 DecimalSI}]} [{default-token-76r29 true /var/run/secrets/kubernetes.io/serviceaccount }] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[],HostAliases:[],PriorityClassName:,Priority:nil,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:,ContainerStatuses:[],QOSClass:Guaranteed,InitContainerStatuses:[],NominatedNodeName:,},} "

  • unknown device error when deploying k8s+helm+jupyterhub


    After installing aliyun gpushare according to the GitHub instructions, I ran the demo from the installation manual in k8s and gpushare worked fine, as shown below:

    kubectl inspect gpushare
    NAME     IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU Memory(GiB)
    ubuntu1  192.168.1.178  4/10                   0/10                   0/10                   0/10                   4/40
    ubuntu2  192.168.1.196  0/10                   0/10                   0/10                   0/10                   0/40
    -------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    4/80 (5%)  
    

    However, when requesting a GPU through jupyterhub, an error occurs: [Warning] Error: failed to start container "notebook": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-3MiB-to-run: unknown device\\n\""": unknown. The installation followed Zero to JupyterHub with Kubernetes, i.e. k8s + helm + jupyterhub.

    #  kubectl inspect gpushare  
    # the jupyterhub GPU pod is in Pending state
    NAME     IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  PENDING(Allocated)  GPU Memory(GiB)
    ubuntu1  192.168.1.178  4/10                   0/10                   0/10                   0/10                   3                   7/40
    ubuntu2  192.168.1.196  0/10                   0/10                   0/10                   0/10                                       0/40
    ---------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    7/80 (8%)  
    

    The relevant versions: Kubernetes v1.15.2, jupyterhub: 0.9.0, helm: 2.11.0

    • config.yaml
        - display_name: "jupyter/AI-notebook-gpu"
          description: "1 GPU(noly available): tensorflow:2.2."
          kubespawner_override:
            image: registry.cn-shenzhen.aliyuncs.com/joe-jupyter/tensorflow-notebook-gpu:0.1.1
            extra_resource_limits:
          # nvidia.com/gpu: "1" # this configuration works fine
          aliyun.com/gpu-mem: 3 # this configuration causes the error
    
    • gpushare-device-plugin-ds-t8s89 logs:
    I0725 04:12:37.933892       1 podmanager.go:123] list pod jupyter-joseph516 in ns jhub in node ubuntu1 and status is Pending
    I0725 04:12:37.933913       1 podutils.go:81] No assume timestamp for pod jupyter-joseph516 in namespace jhub, so it's not GPUSharedAssumed assumed pod.
    W0725 04:12:37.933932       1 allocate.go:152] invalid allocation requst: request GPU memory 3 can't be satisfied.
    
    • gpushare-schd-extender-978bd945b-fs2td logs:
    [ debug ] 2020/07/25 04:20:43 controller.go:176: begin to sync gpushare pod jupyter-joseph516 in ns jhub
    [ debug ] 2020/07/25 04:20:43 cache.go:90: Add or update pod info: &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:jupyter-joseph516,GenerateName:,Namespace:jhub,SelfLink:/api/v1/namespaces/jhub/pods/jupyter-joseph516,UID:84a24b9f-9625-4d1b-b769-70bfa3267ae8,ResourceVersion:604744,Generation:0,CreationTimestamp:2020-07-25 04:20:43 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: jupyterhub,chart: jupyterhub-0.9.0,component: singleuser-server,heritage: jupyterhub,hub.jupyter.org/network-access-hub: true,release: jhub,},Annotations:map[string]string{hub.jupyter.org/username: joseph516,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{volume-joseph516 {nil nil nil nil nil nil nil nil nil PersistentVolumeClaimVolumeSource{ClaimName:claim-joseph516,ReadOnly:false,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}} {jupyterhub-shared {nil nil nil nil nil nil nil nil nil &PersistentVolumeClaimVolumeSource{ClaimName:jupyterhub-shared-volume,ReadOnly:false,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{notebook registry.cn-shenzhen.aliyuncs.com/joe-jupyter/tensorflow-notebook-gpu:0.1.1 [] [jupyterhub-singleuser --ip=0.0.0.0 --port=8888 --NotebookApp.default_url=/lab]  [{notebook-port 0 8888 TCP }] [] [{JUPYTERHUB_API_TOKEN a44aba33aa5e4f3d918f966c93c06fcb nil} {JPY_API_TOKEN a44aba33aa5e4f3d918f966c93c06fcb nil} {JUPYTERHUB_ADMIN_ACCESS 1 nil} {JUPYTERHUB_CLIENT_ID jupyterhub-user-joseph516 nil} {JUPYTERHUB_HOST  nil} {JUPYTERHUB_OAUTH_CALLBACK_URL /user/joseph516/oauth_callback nil} {JUPYTERHUB_USER joseph516 nil} {JUPYTERHUB_SERVER_NAME  nil} {JUPYTERHUB_API_URL http://10.109.96.76:8081/hub/api nil} {JUPYTERHUB_ACTIVITY_URL http://10.109.96.76:8081/hub/api/users/joseph516/activity nil} {JUPYTERHUB_BASE_URL / nil} {JUPYTERHUB_SERVICE_PREFIX /user/joseph516/ nil} {MEM_LIMIT 2147483648 nil} {MEM_GUARANTEE 536870912 nil} {CPU_LIMIT 1.0 nil} {CPU_GUARANTEE 0.05 nil} {JUPYTER_IMAGE_SPEC registry.cn-shenzhen.aliyuncs.com/joe-jupyter/tensorflow-notebook-gpu:0.1.1 nil} {JUPYTER_IMAGE registry.cn-shenzhen.aliyuncs.com/joe-jupyter/tensorflow-notebook-gpu:0.1.1 nil}] {map[cpu:{{1 0} {<nil>} 1 DecimalSI} memory:{{2147483648 0} {<nil>} 2147483648 DecimalSI} aliyun.com/gpu-mem:{{3 0} {<nil>} 3 DecimalSI}] map[memory:{{536870912 0} {<nil>} 536870912 DecimalSI} aliyun.com/gpu-mem:{{3 0} {<nil>} 3 DecimalSI} cpu:{{50 -3} {<nil>} 50m DecimalSI}]} [{volume-joseph516 false /home/jovyan  <nil>} {jupyterhub-shared false /home/jovyan/shared  <nil>}] [] nil nil Lifecycle{PostStart:nil,PreStop:nil,} /dev/termination-log File IfNotPresent &SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:*1000,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:*0,} false false false}],RestartPolicy:OnFailure,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:*100,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:&Affinity{NodeAffinity:&NodeAffinity{RequiredDuringSchedulingIgnoredDuringExecution:nil,PreferredDuringSchedulingIgnoredDuringExecution:[{100 {[{hub.jupyter.org/node-purpose In [user]}] 
[]}}],},PodAffinity:nil,PodAntiAffinity:nil,},SchedulerName:jhub-user-scheduler,InitContainers:[{block-cloud-metadata jupyterhub/k8s-network-tools:0.9.0 [iptables -A OUTPUT -d 169.254.169.254 -j DROP] []  [] [] [] {map[] map[]} [] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[NET_ADMIN],Drop:[],},Privileged:*true,SELinuxOptions:nil,RunAsUser:*0,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],AutomountServiceAccountToken:*false,Tolerations:[{hub.jupyter.org/dedicated Equal user NoSchedule <nil>} {hub.jupyter.org_dedicated Equal user NoSchedule <nil>} {node.kubernetes.io/not-ready Exists  NoExecute 0xc420837f70} {node.kubernetes.io/unreachable Exists  NoExecute 0xc420837f90}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:Burstable,InitContainerStatuses:[],NominatedNodeName:,},}
    [ debug ] 2020/07/25 04:20:43 cache.go:91: Node map[ubuntu1:0xc4200de380 ubuntu2:0xc4205eb000]
    [ debug ] 2020/07/25 04:20:43 cache.go:93: pod jupyter-joseph516 in ns jhub is not assigned to any node, skip
    [  info ] 2020/07/25 04:20:43 controller.go:223: end processNextWorkItem()
    [ debug ] 2020/07/25 04:20:43 controller.go:295: No need to update pod name jupyter-joseph516 in ns jhub and old status is Pending, new status is Pending; its old annotation map[hub.jupyter.org/username:joseph516] and new annotation map[hub.jupyter.org/username:joseph516]
    [ debug ] 2020/07/25 04:20:43 controller.go:295: No need to update pod name jupyter-joseph516 in ns jhub and old status is Pending, new status is Pending; its old annotation map[hub.jupyter.org/username:joseph516] and new annotation map[hub.jupyter.org/username:joseph516]
    [ debug ] 2020/07/25 04:20:43 controller.go:295: No need to update pod name tf-notebook-64b47cf64d-2bnbg in ns default and old status is Running, new status is Running; its old annotation map[ALIYUN_COM_GPU_MEM_ASSIGNED:true ALIYUN_COM_GPU_MEM_ASSUME_TIME:1595649235908083002 ALIYUN_COM_GPU_MEM_DEV:10 ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:4 cni.projectcalico.org/podIP:172.16.25.157/32] and new annotation map[ALIYUN_COM_GPU_MEM_POD:4 cni.projectcalico.org/podIP:172.16.25.157/32 ALIYUN_COM_GPU_MEM_ASSIGNED:true ALIYUN_COM_GPU_MEM_ASSUME_TIME:1595649235908083002 ALIYUN_COM_GPU_MEM_DEV:10 ALIYUN_COM_GPU_MEM_IDX:0]
    [ debug ] 2020/07/25 04:20:43 controller.go:295: No need to update pod name jupyter-joseph516 in ns jhub and old status is Pending, new status is Pending; its old annotation map[hub.jupyter.org/username:joseph516] and new annotation map[hub.jupyter.org/username:joseph516]
    [  info ] 2020/07/25 04:20:44 controller.go:210: begin processNextWorkItem()
    [ debug ] 2020/07/25 04:20:45 controller.go:295: No need to update pod name jupyter-joseph516 in ns jhub and old status is Pending, new status is Pending; its old annotation map[hub.jupyter.org/username:joseph516] and new annotation map[cni.projectcalico.org/podIP:172.16.152.109/32 hub.jupyter.org/username:joseph516]
    [ debug ] 2020/07/25 04:20:46 controller.go:295: No need to update pod name jupyter-joseph516 in ns jhub and old status is Pending, new status is Pending; its old annotation map[cni.projectcalico.org/podIP:172.16.152.109/32 hub.jupyter.org/username:joseph516] and new annotation map[cni.projectcalico.org/podIP:172.16.152.109/32 hub.jupyter.org/username:joseph516]
    [ debug ] 2020/07/25 04:20:47 controller.go:295: No need to update pod name jupyter-joseph516 in ns jhub and old status is Pending, new status is Running; its old annotation map[cni.projectcalico.org/podIP:172.16.152.109/32 hub.jupyter.org/username:joseph516] and new annotation map[hub.jupyter.org/username:joseph516 cni.projectcalico.org/podIP:172.16.152.109/32]
    [ debug ] 2020/07/25 04:21:01 controller.go:295: No need to update pod name jupyter-joseph516 in ns jhub and old status is Running, new status is Running; its old annotation map[cni.projectcalico.org/podIP:172.16.152.109/32 hub.jupyter.org/username:joseph516] and new annotation map[cni.projectcalico.org/podIP:172.16.152.109/32 hub.jupyter.org/username:joseph516]
    [ debug ] 2020/07/25 04:21:13 controller.go:295: No need to update pod name jupyter-joseph516 in ns jhub and old status is Running, new status is Running; its old annotation map[cni.projectcalico.org/podIP:172.16.152.109/32 hub.jupyter.org/username:joseph516] and new annotation map[cni.projectcalico.org/podIP:172.16.152.109/32 hub.jupyter.org/username:joseph516]
    [ debug ] 2020/07/25 04:25:46 cache.go:118: Node map[ubuntu1:0xc4200de380 ubuntu2:0xc4205eb000]
    [ debug ] 2020/07/25 04:25:46 cache.go:155: GetNodeInfo() uses the existing nodeInfo for ubuntu2
    [  warn ] 2020/07/25 04:25:46 nodeinfo.go:84: Pod jupyter-joseph516 in ns jhub is not set the GPU ID -1 in node ubuntu2
    [  info ] 2020/07/25 04:25:46 controller.go:223: end processNextWorkItem()
    [  info ] 2020/07/25 04:25:47 controller.go:210: begin processNextWorkItem()
    
  • how to get GPU metric data of pod deployed by this plugin


    https://github.com/NVIDIA/k8s-device-plugin https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm/k8s/pod-gpu-metrics-exporter

    Using these tools I succeeded in getting GPU metric data, but this repo conflicts with nvidia-device-plugin, so I can't get metric data any more.

    Is there a monitoring tool that works with this repo?

  • gpushare scheduler extender bind code 500


    Envs & versions

    kubernetes: 1.17
    scheduler extender: k8s-gpushare-schd-extender:1.11-d170d8a

    Error logs

    [ debug ] 2020/04/28 11:20:22 gpushare-predicate.go:17: check if the pod name gpu-demo-gpushare-6cfbbdfb66-szs7m can be scheduled on node g1-med-dev1-100
    [ debug ] 2020/04/28 11:20:22 cache.go:155: GetNodeInfo() uses the existing nodeInfo for g1-med-dev1-100
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:282: getAllGPUs: map[0:11019 1:11019] in node g1-med-dev1-100, and dev map[1:0xc422a6ab20 0:0xc422a6ab00]
    [ debug ] 2020/04/28 11:20:22 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc422a6ab00
    [ debug ] 2020/04/28 11:20:22 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc422a6ab20
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:272: getUsedGPUs: map[0:0 1:0] in node g1-med-dev1-100, and devs map[1:0xc422a6ab20 0:0xc422a6ab00]
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:121: AvailableGPUs: map[0:11019 1:11019] in node g1-med-dev1-100
    [ debug ] 2020/04/28 11:20:22 gpushare-predicate.go:31: The pod gpu-demo-gpushare-6cfbbdfb66-szs7m in the namespace zhaogaolong can be scheduled on g1-med-dev1-100
    [  info ] 2020/04/28 11:20:22 routes.go:93: gpusharingfilter extenderFilterResult = {"Nodes":null,"NodeNames":["g1-med-dev1-100"],"FailedNodes":{},"Error":""}
    [ debug ] 2020/04/28 11:20:22 routes.go:162: /gpushare-scheduler/filter response=&{0xc4200bcbe0 0xc421bda600 0xc421f29d00 0x565b70 true false false false 0xc421f29e80 {0xc420354540 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 74 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc422f560e0 0}
    [ debug ] 2020/04/28 11:20:22 routes.go:160: /gpushare-scheduler/bind request body = &{0xc421e92bc0 <nil> <nil> false true {0 0} false false false 0x69bfd0}
    [ debug ] 2020/04/28 11:20:22 routes.go:121: gpusharingBind ExtenderArgs ={gpu-demo-gpushare-6cfbbdfb66-szs7m zhaogaolong 320c6174-95d6-44d1-ac48-62414d49fe13 g1-med-dev1-100}
    [ debug ] 2020/04/28 11:20:22 cache.go:155: GetNodeInfo() uses the existing nodeInfo for g1-med-dev1-100
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:143: Allocate() ----Begin to allocate GPU for gpu mem for pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong----
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:282: getAllGPUs: map[0:11019 1:11019] in node g1-med-dev1-100, and dev map[0:0xc422a6ab00 1:0xc422a6ab20]
    [ debug ] 2020/04/28 11:20:22 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc422a6ab00
    [ debug ] 2020/04/28 11:20:22 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc422a6ab20
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:272: getUsedGPUs: map[0:0 1:0] in node g1-med-dev1-100, and devs map[0:0xc422a6ab00 1:0xc422a6ab20]
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:220: reqGPU for pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong: 256
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:221: AvailableGPUs: map[0:11019 1:11019] in node g1-med-dev1-100
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:239: Find candidate dev id 0 for pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong successfully.
    [ debug ] 2020/04/28 11:20:22 nodeinfo.go:147: Allocate() 1. Allocate GPU ID 0 to pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong.----
    [  warn ] 2020/04/28 11:20:22 gpushare-bind.go:36: Failed to handle pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong due to error Pod "gpu-demo-gpushare-6cfbbdfb66-szs7m" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)
      core.PodSpec{
            Volumes:        []core.Volume{{Name: "cpuinfo", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/lib/lxcfs/proc/cpuinfo", Type: &""}}}, {Name: "meminfo", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/lib/lxcfs/proc/meminfo", Type: &""}}}, {Name: "diskstats", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/lib/lxcfs/proc/diskstats", Type: &""}}}, {Name: "stat", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/lib/lxcfs/proc/stat", Type: &""}}}, {Name: "med-log", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/log/k8s/zhaogaolong/gpu-demo", Type: &"DirectoryOrCreate"}}}, {Name: "default-token-k4hm4", VolumeSource: core.VolumeSource{Secret: &core.SecretVolumeSource{SecretName: "default-token-k4hm4", DefaultMode: &420}}}},
            InitContainers: nil,
            Containers: []core.Container{
                    {
                            ... // 7 identical fields
                            Env:       []core.EnvVar{{Name: "TZ", Value: "Asia/Shanghai"}, {Name: "LANG", Value: "en_US.UTF-8"}, {Name: "LC_ALL", Value: "en_US.UTF-8"}, {Name: "GUAZI_ENV", Value: "dev"}, {Name: "MED_CLUSTER", Value: "dev"}, {Name: "MED_RUNNING_CLUSTER_NAME", Value: "dev1"}, {Name: "MED_ENV", Value: "dev"}, {Name: "CLOUD_ENV", Value: "dev"}, {Name: "MED_REFERENCE", Value: "gpu-demo-gpushare"}, {Name: "MED_GROUP", Value: "zhaogaolong"}, {Name: "MED_APPNAME", Value: "gpu-demo"}, {Name: "MED_DEPLOY", Value: "gpushare"}, {Name: "MED_DUMP", Value: "false"}, {Name: "POD_NAME", ValueFrom: &core.EnvVarSource{FieldRef: &core.ObjectFieldSelector{APIVersion: "v1", FieldPath: "metadata.name"}}}, {Name: "POD_IP", ValueFrom: &core.EnvVarSource{FieldRef: &core.ObjectFieldSelector{APIVersion: "v1", FieldPath: "status.podIP"}}}, {Name: "NODE_NAME", ValueFrom: &core.EnvVarSource{FieldRef: &core.ObjectFieldSelector{APIVersion: "v1", FieldPath: "spec.nodeName"}}}, {Name: "MED_CPU", Value: "1.0"}, {Name: "MED_MEMORY", Value: "2.0"}, {Name: "MED_GPU_SHARE_MEMORY", Value: "0.25"}},
                            Resources: core.ResourceRequirements{Limits: core.ResourceList{"aliyun.com/gpu-mem": {i: resource.int64Amount{value: 256}, s: "256", Format: "DecimalSI"}, "cpu": {i: resource.int64Amount{value: 1}, s: "1", Format: "DecimalSI"}, "memory": {i: resource.int64Amount{value: 2, scale: 9}, s: "2G", Format: "DecimalSI"}}, Requests: core.ResourceList{"aliyun.com/gpu-mem": {i: resource.int64Amount{value: 256}, s: "256", Format: "DecimalSI"}, "cpu": {i: resource.int64Amount{value: 100, scale: -3}, s: "100m", Format: "DecimalSI"}, "memory": {i: resource.int64Amount{value: 400, scale: 6}, s: "400M", Format: "DecimalSI"}}},
                            VolumeMounts: []core.VolumeMount{
                                    ... // 2 identical elements
                                    {Name: "diskstats", MountPath: "/proc/diskstats"},
                                    {Name: "stat", MountPath: "/proc/stat"},
                                    {
                                            ... // 3 identical fields
                                            SubPath:          "",
                                            MountPropagation: nil,
    -                                       SubPathExpr:      "",
    +                                       SubPathExpr:      "$(POD_NAME)/gpu",
                                    },
                                    {Name: "default-token-k4hm4", ReadOnly: true, MountPath: "/var/run/secrets/kubernetes.io/serviceaccount"},
                            },
                            VolumeDevices: nil,
                            LivenessProbe: nil,
                            ... // 10 identical fields
                    },
            },
            EphemeralContainers: nil,
            RestartPolicy:       "Always",
            ... // 24 identical fields
      }
    [  info ] 2020/04/28 11:20:22 routes.go:137: extenderBindingResult = {"Error":"Pod \"gpu-demo-gpushare-6cfbbdfb66-szs7m\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n  core.PodSpec{\n  \tVolumes:        []core.Volume{{Name: \"cpuinfo\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/lib/lxcfs/proc/cpuinfo\", Type: \u0026\"\"}}}, {Name: \"meminfo\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/lib/lxcfs/proc/meminfo\", Type: \u0026\"\"}}}, {Name: \"diskstats\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/lib/lxcfs/proc/diskstats\", Type: \u0026\"\"}}}, {Name: \"stat\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/lib/lxcfs/proc/stat\", Type: \u0026\"\"}}}, {Name: \"med-log\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/log/k8s/zhaogaolong/gpu-demo\", Type: \u0026\"DirectoryOrCreate\"}}}, {Name: \"default-token-k4hm4\", VolumeSource: core.VolumeSource{Secret: \u0026core.SecretVolumeSource{SecretName: \"default-token-k4hm4\", DefaultMode: \u0026420}}}},\n  \tInitContainers: nil,\n  \tContainers: []core.Container{\n  \t\t{\n  \t\t\t... // 7 identical fields\n  \t\t\tEnv:       []core.EnvVar{{Name: \"TZ\", Value: \"Asia/Shanghai\"}, {Name: \"LANG\", Value: \"en_US.UTF-8\"}, {Name: \"LC_ALL\", Value: \"en_US.UTF-8\"}, {Name: \"GUAZI_ENV\", Value: \"dev\"}, {Name: \"MED_CLUSTER\", Value: \"dev\"}, {Name: \"MED_RUNNING_CLUSTER_NAME\", Value: \"dev1\"}, {Name: \"MED_ENV\", Value: \"dev\"}, {Name: \"CLOUD_ENV\", Value: \"dev\"}, {Name: \"MED_REFERENCE\", Value: \"gpu-demo-gpushare\"}, {Name: \"MED_GROUP\", Value: \"zhaogaolong\"}, {Name: \"MED_APPNAME\", Value: \"gpu-demo\"}, {Name: \"MED_DEPLOY\", Value: \"gpushare\"}, {Name: \"MED_DUMP\", Value: \"false\"}, {Name: \"POD_NAME\", ValueFrom: \u0026core.EnvVarSource{FieldRef: \u0026core.ObjectFieldSelector{APIVersion: \"v1\", FieldPath: \"metadata.name\"}}}, {Name: \"POD_IP\", ValueFrom: \u0026core.EnvVarSource{FieldRef: \u0026core.ObjectFieldSelector{APIVersion: \"v1\", FieldPath: \"status.podIP\"}}}, {Name: \"NODE_NAME\", ValueFrom: \u0026core.EnvVarSource{FieldRef: \u0026core.ObjectFieldSelector{APIVersion: \"v1\", FieldPath: \"spec.nodeName\"}}}, {Name: \"MED_CPU\", Value: \"1.0\"}, {Name: \"MED_MEMORY\", Value: \"2.0\"}, {Name: \"MED_GPU_SHARE_MEMORY\", Value: \"0.25\"}},\n  \t\t\tResources: core.ResourceRequirements{Limits: core.ResourceList{\"aliyun.com/gpu-mem\": {i: resource.int64Amount{value: 256}, s: \"256\", Format: \"DecimalSI\"}, \"cpu\": {i: resource.int64Amount{value: 1}, s: \"1\", Format: \"DecimalSI\"}, \"memory\": {i: resource.int64Amount{value: 2, scale: 9}, s: \"2G\", Format: \"DecimalSI\"}}, Requests: core.ResourceList{\"aliyun.com/gpu-mem\": {i: resource.int64Amount{value: 256}, s: \"256\", Format: \"DecimalSI\"}, \"cpu\": {i: resource.int64Amount{value: 100, scale: -3}, s: \"100m\", Format: \"DecimalSI\"}, \"memory\": {i: resource.int64Amount{value: 400, scale: 6}, s: \"400M\", Format: \"DecimalSI\"}}},\n  \t\t\tVolumeMounts: []core.VolumeMount{\n  \t\t\t\t... // 2 identical elements\n  \t\t\t\t{Name: \"diskstats\", MountPath: \"/proc/diskstats\"},\n  \t\t\t\t{Name: \"stat\", MountPath: \"/proc/stat\"},\n  \t\t\t\t{\n  \t\t\t\t\t... 
// 3 identical fields\n  \t\t\t\t\tSubPath:          \"\",\n  \t\t\t\t\tMountPropagation: nil,\n- \t\t\t\t\tSubPathExpr:      \"\",\n+ \t\t\t\t\tSubPathExpr:      \"$(POD_NAME)/gpu\",\n  \t\t\t\t},\n  \t\t\t\t{Name: \"default-token-k4hm4\", ReadOnly: true, MountPath: \"/var/run/secrets/kubernetes.io/serviceaccount\"},\n  \t\t\t},\n  \t\t\tVolumeDevices: nil,\n  \t\t\tLivenessProbe: nil,\n  \t\t\t... // 10 identical fields\n  \t\t},\n  \t},\n  \tEphemeralContainers: nil,\n  \tRestartPolicy:       \"Always\",\n  \t... // 24 identical fields\n  }\n"}
    [ debug ] 2020/04/28 11:20:22 routes.go:162: /gpushare-scheduler/bind response=&{0xc4200bcbe0 0xc4222c2c00 0xc421ea7280 0x565b70 true false false false 0xc421ea7300 {0xc421f46380 map[Content-Type:[application/json]] true true} map[Content-Type:[application/json]] true 4029 -1 500 false false [] 0 [84 117 101 44 32 50 56 32 65 112 114 32 50 48 50 48 32 49 49 58 50 48 58 50 50 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [53 48 48] 0xc422bb82a0 0}
    

    I suspect the version of the k8s.io/client-go dependency is too old; in Gopkg.toml:

    [[constraint]]
      name = "k8s.io/client-go"
      version = "~v8.0.0"
    

    but this version does not support Kubernetes 1.17.

  • k3s services not started scheduler exited: stat /etc/kubernetes/scheduler.conf: no such file or directory


    Hi,

    I'm using k3s version v1.25.4+k3s1.

    When I try to run the new configuration that you made for version 1.23+,

    I used this commit - https://github.com/AliyunContainerService/gpushare-scheduler-extender/commit/ab2a0ef3f36b6ebb9e021e483ed9ede655fd93d9

    and I get the error: k3s services not started scheduler exited: stat /etc/kubernetes/scheduler.conf: no such file or directory.

    This comes from https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/config/scheduler-policy-config.yaml

    This is what is missing in k3s:

    clientConnection:
      kubeconfig: /etc/kubernetes/scheduler.conf

    Thanks

  • Wrong GPU ID


    I just updated the GPU cards in my servers as below; I now have 3 servers with different GPU cards (screenshot attached).

    When I try to create a new pod, the pod gets assigned to Server 2 with GPU ID = 1, even though GPU 1 (0/0 allocated) does not exist on Server 2, so my deployment fails with this error:

     Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: no-gpu-has-9MiB-to-run: unknown device: unknown
    

    Why wasn't the pod assigned to Server 2 with GPU ID = 0?

    How can I resolve this issue?

    Thanks so much

  • trivy image scan lists critical and high vulnerability against latest image k8s-gpushare-schd-extender:1.11-d170d8a


    What happened: trivy image scan lists critical and high vulnerability against latest image k8s-gpushare-schd-extender:1.11-d170d8a

    What you expected to happen: No critical or high vulnerability issues.

    How to reproduce it: trivy image --ignore-unfixed --severity HIGH,CRITICAL --format template --template "@/usr/local/share/trivy/templates/html.tpl" -o report.html k8s-gpushare-schd-extender:1.11-d170d8a

    report: k8s-gpushare-schd-extender_1.11-d170d8a.pdf

  • After a pod finishes, the plugin does not update the GPU pool in time; when multiple pending pods are queued for resources, the last pod has to wait until flushUnschedulablePodsLeftover before it is scheduled again


    • Symptom: when several pods are created at the same time, they queue waiting for k8s to allocate resources because there are not enough resources. On newer k8s versions, the last pending pod waits an extra 5 minutes before it can obtain resources.
    • Cause
      • the gpushare-scheduler-extender plugin updates its own GPU resource pool by watching k8s pod events
      • when the last pod hits the plugin's filter, the plugin has not yet processed the completion event of the previous pod to release its resources, so the filter fails at that point.
      • pods in the unschedulablePods queue get no chance to trigger a retry; they can only wait for the 5-minute flushUnschedulablePodsLeftover cycle (1 minute on older k8s versions)
    • Notes
      • in many cases k8s tries to move unschedulablePods into the backoffQ or activeQ; the most important one is that removing a Pod from the node cache triggers movePodsToActiveOrBackoffQueue, so a pending pod normally retries allocation right after the previous pod finishes.
      • k8s also has a mechanism that triggers flushUnschedulablePodsLeftover every 5 minutes, retrying all unschedulablePods
  • Two GPUs are detected, but after requests to /gpushare-scheduler/filter some containers are always scheduled onto only one of them


    调度日志: begin to sync gpushare pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in ns ai-model [ debug ] 2022/09/05 13:42:11 cache.go:90: Add or update pod info: &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx,GenerateName:p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-,Namespace:ai-model,SelfLink:,UID:4c24f578-1a0c-417b-adb0-76b0b2604c06,ResourceVersion:60857261,Generation:0,CreationTimestamp:2022-09-05 13:42:11 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 00c54b7a-d015-11ec-8db7-12cc16fb82ca,pod-template-hash: 6f8f5c45dc,},Annotations:map[string]string{cattle.io/timestamp: 2022-09-02T09:28:07Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"p-00c54b7a-d015-11ec-8db7-12cc16fb82ca","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],},OwnerReferences:[{apps/v1 ReplicaSet p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc 56dd7312-1c8e-4531-8961-331905556825 0xc420719ada 0xc420719adb}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-00c54b7a-d015-11ec-8db7-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/air_switch:amd-2.4 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{C apabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42090e510} {node.kubernetes.io/unreachable Exists NoExecute 0xc42090e530}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} [ debug ] 2022/09/05 13:42:11 cache.go:91: Node map[worker1:0xc420fd43c0 worker2:0xc4206b8cc0] [ debug ] 2022/09/05 13:42:11 cache.go:93: pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in ns ai-model is not assigned to any node, skip [ debug ] 2022/09/05 13:42:11 controller.go:234: end processNextWorkItem() [ 
debug ] 2022/09/05 13:42:11 routes.go:160: /gpushare-scheduler/filter request body = &{0xc420575160 <nil> <nil> false true {0 0} false false false 0x69c120} [ debug ] 2022/09/05 13:42:11 routes.go:81: gpusharingfilter ExtenderArgs ={&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx,GenerateName:p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-,Namespace:ai-model,SelfLink:,UID:4c24f578-1a0c-417b-adb0-76b0b2604c06,ResourceVersion:60857261,Generation:0,CreationTimestamp:2022-09-05 13:42:11 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 00c54b7a-d015-11ec-8db7-12cc16fb82ca,pod-template-hash: 6f8f5c45dc,},Annotations:map[string]string{cattle.io/timestamp: 2022-09-02T09:28:07Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"p-00c54b7a-d015-11ec-8db7-12cc16fb82ca","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],},OwnerReferences:[{apps/v1 ReplicaSet p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc 56dd7312-1c8e-4531-8961-331905556825 0xc420e88737 0xc420e88738}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-00c54b7a-d015-11ec-8db7-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/air_switch:amd-2.4 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent Security Context{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420e88830} {node.kubernetes.io/unreachable Exists NoExecute 0xc420e88850}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} nil 0xc4205754a0} [ info ] 2022/09/05 13:42:11 gpushare-predicate.go:17: check if the pod name p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx can be scheduled on node worker1 [ info ] 2022/09/05 13:42:11 cache.go:160: GetNodeInfo() uses the existing nodeInfo 
for worker1 [ debug ] 2022/09/05 13:42:11 cache.go:162: node worker1 with devices map[0:0xc4203babc0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:423: getAllGPUs: map[0:12288] in node worker1, and dev map[0:0xc4203babc0] [ debug ] 2022/09/05 13:42:11 deviceinfo.go:42: GetUsedGPUMemory() podMap map[e0aa6077-113c-4a86-b4f4-6a93d2754747:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189-784bbcf688-mmb56,GenerateName:p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189-784bbcf688-,Namespace:ai-model,SelfLink:,UID:e0aa6077-113c-4a86-b4f4-6a93d2754747,ResourceVersion:60854784,Generation:0,CreationTimestamp:2022-09-05 13:34:53 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189,pod-template-hash: 784bbcf688,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384893715201857,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cattle.io/timestamp: 2022-09-02T09:09:47Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],workload.cattle.io/state: {"d29ya2VyMQ==":"local:machine-95hdb"},},OwnerReferences:[{apps/v1 ReplicaSet p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189-784bbcf688 310dad46-6da8-4414-8d92-be6190f43f5c 0xc4208e8258 0xc4208e8259}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189 registry.kk.com/ai/offline_function_tensorrt/sign:amd-2.3 [] [] [{port-7070 0 7070 TCP }] [] [{DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_SECRET_KEY nil nil} {MINIO_SECURE False nil} {NVIDIA_VISIBLE_D EVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4208e8288} {node.kubernetes.io/unreachable Exists NoExecute 0xc4208e82b0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:53 
+0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:53 +0000 UTC ContainersNotReady containers with unready status: [c-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:53 +0000 UTC ContainersNotReady containers with unready status: [c-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:53 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:34:53 +0000 UTC,ContainerStatuses:[{c-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} fal se 0 registry.kk.com/ai/offline_function_tensorrt/sign:amd-2.3 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 9a5d93f9-9c35-4633-b04d-1110035ebffd:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-56c0cd7c-d026-11ec-9141-12cc16fb82ca-85f94fd96c-qb99p,GenerateName:p-56c0cd7c-d026-11ec-9141-12cc16fb82ca-85f94fd96c-,Namespace:ai-model,SelfLink:,UID:9a5d93f9-9c35-4633-b04d-1110035ebffd,ResourceVersion:60853181,Generation:0,CreationTimestamp:2022-09-05 13:31:19 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 56c0cd7c-d026-11ec-9141-12cc16fb82ca,pod-template-hash: 85f94fd96c,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384679444463638,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-56c0cd7c-d026-11ec-9141-12cc16fb82ca-85f94fd96c ced90aa2-65c4-4170-882a-67f1fff26ff4 0xc420eab908 0xc420eab909}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-56c0cd7c-d026-11ec-9141-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/oil_leak:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false fals e}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420eab918} {node.kubernetes.io/unreachable Exists NoExecute 0xc420eab920}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 
2022-09-05 13:31:19 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:19 +0000 UTC ContainersNotReady containers with unready status: [c-56c0cd7c-d026-11ec-9141-12cc16fb82ca]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:19 +0000 UTC ContainersNotReady containers with unready status: [c-56c0cd7c-d026-11ec-9141-12cc16fb82ca]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:19 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:31:19 +0000 UTC,ContainerStatuses:[{c-56c0cd7c-d026-11ec-9141-12cc16fb82ca {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/oil_leak:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 7036996a-83c0-450d-bffb-54a18866e28c:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189-c64b6cbb6-gwztm,GenerateName:p-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189-c64b6cbb6-,Namespace:ai-model,SelfLink:,UID:7036996a-83c0-450d-bffb-54a18866e28c,ResourceVersion:59250704,Generation:0,Cr 09:32:43 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 026d8cd8-e0b6-11ec-b0ec-4a6aa9346189,pod-template-hash: c64b6cbb6,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662111163298589414,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189-c64b6cbb6 e5d757bf-9098-43f1-99fd-e49a3146b064 0xc42075d248 0xc42075d249}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189 registry.kk.com/ai/offline_function_tensorrt/break:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerNam e:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42075d268} {node.kubernetes.io/unreachable Exists NoExecute 0xc42075d300}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 
2022-09-02 09:32:43 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-02 09:32:43 +0000 UTC ContainersNotReady containers with unready status: [c-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-02 09:32:43 +0000 UTC ContainersNotReady containers with unready status: [c-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-02 09:32:43 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-02 09:32:43 +0000 UTC,ContainerStatuses:[{c-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/break:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 1c170222-116a-47a8-aa2f-a3d2abf3f39e:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-ca735e4a-00fc-11ed-9340-667f53977afd-5fc78ccb9d-tnbt5,GenerateName:p-ca735e4a-00fc-11ed-9340-667f53977afd-5fc78ccb9d-,Namespace:ai-model,SelfLink:,UID:1c170222-116a-47a8-aa2f-a3d2abf3f39e,ResourceVersion:60854436,Generation:0,CreationTimestamp:2022-09-05 13:34:05 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: ca735e4a-00fc-11ed-9340-667f53977afd,pod-template-hash: 5fc78ccb9d,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384845685953755,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-ca735e4a-00fc-11ed-9340-667f53977afd- 5fc78ccb9d e5afde70-d544-40a8-9172-1551da18a995 0xc420f664c8 0xc420f664c9}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-ca735e4a-00fc-11ed-9340-667f53977afd registry.kk.com/ai/offline_function_tensorrt/meter_sf6:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420f664d8} {node.kubernetes.io/unreachable Exists NoExecute 0xc420f664e0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 
0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:05 +0000 UTC } {Ready False 0001-01-01 00:00:00 UTC 2022-09-05 13:34:05 +0000 UTC ContainersNotReady containers with unready status: [c-ca735e4a-00fc-11ed-9340-667f53977afd]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:05 +0000 UTC ContainersNotReady containers with unready status: [c-ca735e4a-00fc-11ed-9340-667f53977afd]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:05 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:34:05 +0000 UTC,ContainerStatuses:[{c-ca735e4a-00fc-11ed-9340-667f53977afd {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/meter_sf6:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 3f0a633e-3aa4-44b8-95ff-708dd07104a8:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-a2ad1890-04b4-11ed-a69a-da2f2c420d44-675bd85754-8pl5z,GenerateName:p-a2ad1890-04b4-11ed-a69a-da2f2c420d44-675bd85754-,Namespace:ai-model,SelfLink:,UID:3f0a633e-3aa4-44b8-95ff-708dd07104a8,ResourceVersion:60854674,Generation:0,CreationTimestamp:2022-09-05 13:34:37 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: a2ad1890-04b4-11ed-a69a-da2f2c420d44,pod-template-hash: 675bd85754,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384877769111575,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-a2ad1890-04b4-11ed-a69a-da2f2c420d44-675bd85754 5df24a79-2516-4960-9123-75ff0da93c1f 0xc42047e8c8 0xc42047e8c9}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-a2ad1890-04b4-11ed-a69a-da2f2c420d44 registry.kk.com/ai/offline_function_tensorrt/arrest [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42047e8d8} {node.kubernetes.io/unreachable Exists NoExecute 
0xc42047e8e0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:37 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:37 +0000 UTC ContainersNotReady containers with unready status: [c-a2ad1890-04b4-11ed-a69a-da2f2c420d44]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:37 +0000 UTC ContainersNotReady containers with unready status: [c-a2ad1890-04b4-11ed-a69a-da2f2c420d44]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:37 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:34:37 +0 UTC,ContainerStatuses:[{c-a2ad1890-04b4-11ed-a69a-da2f2c420d44 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/arrester:amd-2.3 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 9344fe44-a270-4dc9-b7fb-73e5faa04297:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-ad4a9c10-d5c4-11ec-a679-664243c08fda-64bc47bf7f-9h4sd,GenerateName:p-ad4a9c10-d5c4-11ec-a679-664243c08fda-64bc47bf7f-,Namespace:ai-model,SelfLink:,UID:9344fe44-a270-4dc9-b7fb-73e5faa04297,ResourceVersion:60854558,Generation:0,CreationTimestamp:2022-09-05 13:34:21 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: ad4a9c10-d5c4-11ec-a679-664243c08fda,pod-template-hash: 64bc47bf7f,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384861859045536,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-ad4a9c10-d5c4-11ec-a679-664243c08fda-64bc47bf7f bc501ac1-4d2f-4981-ba98-1c2a2c2f10bc 0xc42083b7a8 0xc42083b7a9}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-ad4a9c10-d5c4-11ec-a679-664243c08fda registry.kk.com/ai/offline_function_tensorrt/opening_closing:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k Decima lSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42083b7b8} 
{node.kubernetes.io/unreachable Exists NoExecute 0xc42083b7d0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:21 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:21 +0000 UTC ContainersNotReady containers with unready status: [c-ad4a9c10-d5c4-11ec-a679-664243c08fda]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:21 +0000 UTC ContainersNotReady containers with unready status: [c-ad4a9c10-d5c4-11ec-a679-664243c08fda]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:21 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:34:21 +0000 UTC,ContainerStatuses:[{c-ad4a9c10-d5c4-11ec-a679-664243c08fda {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/opening_closing:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} aaf48e5b-8d2b-47d2-ac82-0e189e58b82c:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-b521d0ac-d4c0-11ec-8214-4a2f8bc05280-8767b6d44-jmdkr,Gen erateName:p-b521d0ac-d4c0-11ec-8214-4a2f8bc05280-8767b6d44-,Namespace:ai-model,SelfLink:,UID:aaf48e5b-8d2b-47d2-ac82-0e189e58b82c,ResourceVersion:60853402,Generation:0,CreationTimestamp:2022-09-05 13:31:35 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: b521d0ac-d4c0-11ec-8214-4a2f8bc05280,pod-template-hash: 8767b6d44,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384695714386603,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cni.projectcalico.org/podIP: 10.42.3.193/32,cni.projectcalico.org/podIPs: 10.42.3.193/32,},OwnerReferences:[{apps/v1 ReplicaSet p-b521d0ac-d4c0-11ec-8214-4a2f8bc05280-8767b6d44 f85c8ab5-e6c2-4db2-886c-66c705d9a75d 0xc4211c4f80 0xc4211c4f81}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-b521d0ac-d4c0-11ec-8214-4a2f8bc05280 registry.kk.com/ai/offline_function_tensorrt/oil_temperature:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName 
:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4211c4f90} {node.kubernetes.io/unreachable Exists NoExecute 0xc4211c4f98}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Running,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:35 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:37 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:37 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:35 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:10.42.3.193,StartTime:2022-09-05 13:31:35 +0000 UTC,ContainerStatuses:[{c-b521d0ac-d4c0-11ec-8214-4a2f8bc05280 {nil ContainerStateRunning{StartedAt:2022-09-05 13:31:37 +0000 UTC,} nil} {nil nil nil} true 0 registry.kk.com/ai/offline_function_tensorrt/oil_temperature:amd-2.2 docker-pullable://registry.kk.com/ai/offline_function_tensorrt/oil_temperature@sha256:9a3c5f598e91895cd8d64f75ec937e0e97678de97f29795d186576c937415d27 docker://822e435aa8b7bad3b7bc2117bc084d28a597b84b1ca94301a223f66ce0d14276}],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 59e08cc0-f4a5-46f7-b464-8a93a8949689:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-d049b5a6-d025-11ec-9bf1-12cc16fb82ca-85557f5c56-knqln,GenerateName:p-d049b5a6-d025-11ec-9bf1-12cc16fb82ca-85557f5c56-,Namespace:ai-model,SelfLink:,UID:59e08cc0-f4a5-46f7-b464-8a93a8949689,ResourceVersion:60853539,Generation:0,CreationTimestamp:2022-09-05 13:31:48 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: d049b5a6-d 025-11ec-9bf1-12cc16fb82ca,pod-template-hash: 85557f5c56,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384708195011096,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cni.projectcalico.org/podIP: 10.42.3.194/32,cni.projectcalico.org/podIPs: 10.42.3.194/32,},OwnerReferences:[{apps/v1 ReplicaSet p-d049b5a6-d025-11ec-9bf1-12cc16fb82ca-85557f5c56 78613520-9fe5-4eed-aedb-477bb311fc9f 0xc420edb1f0 0xc420edb1f1}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-d049b5a6-d025-11ec-9bf1-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/clamp:amd-2.4 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountSer viceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420edb200} {node.kubernetes.io/unreachable Exists NoExecute 0xc420edb208}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Running,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:48 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:50 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:50 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:48 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:10.42.3.194,StartTime:2022-09-05 13:31:48 +0000 UTC,ContainerStatuses:[{c-d049b5a6-d025-11ec-9bf1-12cc16fb82ca {nil ContainerStateRunning{StartedAt:2022-09-05 13:31:49 +0000 UTC,} nil} {nil nil nil} true 0 registry.kk.com/ai/offline_function_tensorrt/clamp:amd-2.4 docker-pullable://registry.kk.com/ai/offline_function_tensorrt/clamp@sha256:f6c4bec5083a5f06d5f3caad1829a4b60edd9aa1497e740dd875f223f305dc8e docker://6bc4196f459adc286a8d3b7f501b2fda1cc398d8a524212905a5abb02ef0030a}],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 545fe3a8-1ce9-43c8-8bd3-63fa246daceb:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-27fef2d4-d026-11ec-9e71-12cc16fb82ca-6cb96fbb47-n2njd,GenerateName:p-27fef2d4-d026-11ec-9e71-12cc16fb82ca-6cb96fbb47-,Namespace:ai-model,SelfLink:,UID:545fe3a8-1ce9-43c8-8bd3-63fa246daceb,ResourceVersion:60854186,Generation:0,CreationTimestamp:2022-09-05 13:33:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 27fef2d4-d026-11ec-9e71-12cc16fb82ca,pod-template-hash: 6cb96fbb47,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384812800850122,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cattle.io/timestamp: 2022-09-02T09:09:18Z,field.cattle.io/ports: [[{"containerP ort":7070,"dnsName":"p-27fef2d4-d026-11ec-9e71-12cc16fb82ca","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],workload.cattle.io/state: {"d29ya2VyMQ==":"local:machine-95hdb"},},OwnerReferences:[{apps/v1 ReplicaSet p-27fef2d4-d026-11ec-9e71-12cc16fb82ca-6cb96fbb47 f57acef1-56aa-4a7c-81d6-3120f705d370 0xc42207af18 0xc42207af19}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-27fef2d4-d026-11ec-9e71-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/respirator:amd-2.3 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} 
{DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil, SchedulerName:default-scheduler ,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42207af28} {node.kubernetes.io/unreachable Exists NoExecute 0xc42207af30}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:32 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:32 +0000 UTC ContainersNotReady containers with unready status: [c-27fef2d4-d026-11ec-9e71-12cc16fb82ca]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:32 +0000 UTC ContainersNotReady containers with unready status: [c-27fef2d4-d026-11ec-9e71-12cc16fb82ca]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:32 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:33:32 +0000 UTC,ContainerStatuses:[{c-27fef2d4-d026-11ec-9e71-12cc16fb82ca {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/respirator:amd-2.3 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 4f68c68e-df2a-4846-8ce7-2dd1b06fd60a:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-fa4f5964-d025-11ec-b57d-12cc16fb82ca-6fbff6899f-kvdft,GenerateName:p-fa4f5964-d025-11ec-b57d-12cc16fb82ca-6fbff6899f-,Namespace:ai-model,SelfLink:,UID:4f68c68e-df2a-4846-8ce7-2dd1b06fd60a,ResourceVersion:60853654,Generation:0,CreationTimestamp:2022-09-05 13:32:05 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: fa4f5964-d025-11ec-b57d-12cc16fb82ca,pod-template-hash: 6fbff6899f,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384725280351691,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-fa4f5964-d025-11ec-b 57d-12cc16fb82ca-6fbff6899f 610e2e85-1ea0-462d-8c59-fb6ad3497653 0xc420e89b98 0xc420e89b99}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-fa4f5964-d025-11ec-b57d-12cc16fb82ca 
registry.kk.com/ai/offline_function_tensorrt/toggle_switch:amd-2.5 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420e89bb8} {node.kubernetes.io/unreachable Exists NoExecute 0xc420e89bc0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:32:05 +0000 UTC } {Ready Fals e 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:32:05 +0000 UTC ContainersNotReady containers with unready status: [c-fa4f5964-d025-11ec-b57d-12cc16fb82ca]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:32:05 +0000 UTC ContainersNotReady containers with unready status: [c-fa4f5964-d025-11ec-b57d-12cc16fb82ca]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:32:05 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:32:05 +0000 UTC,ContainerStatuses:[{c-fa4f5964-d025-11ec-b57d-12cc16fb82ca {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/toggle_switch:amd-2.5 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc4203babc0 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-56c0cd7c-d026-11ec-9141-12cc16fb82ca-85f94fd96c-qb99p in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189-c64b6cbb6-gwztm in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-ca735e4a-00fc-11ed-9340-667f53977afd-5fc78ccb9d-tnbt5 in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-a2ad1890-04b4-11ed-a69a-da2f2c420d44-675bd85754-8pl5z in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189-784bbcf688-mmb56 in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-b521d0ac-d4c0-11ec-8214-4a2f8bc05280-8767b6d44-jmdkr in ns ai-model with status Running has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-d049b5a6-d025-11ec-9bf1-12cc16fb82ca-85557f5c56-knqln in ns ai-model with status Running has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod 
p-27fef2d4-d026-11ec-9e71-12cc16fb82ca-6cb96fbb47-n2njd in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-fa4f5964-d025-11ec-b57d-12cc16fb82ca-6fbff6899f-kvdft in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-ad4a9c10-d5c4-11ec-a679-664243c08fda-64bc47bf7f-9h4sd in ns ai-model with status Pending has GPU Mem 1000 [ info ] 2022/09/05 13:42:11 nodeinfo.go:413: getUsedGPUs: map[0:10000] in node worker1, and devs map[0:0xc4203babc0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:431: try to find unhealthy node unhealthy-gpu-worker1 [ info ] 2022/09/05 13:42:11 nodeinfo.go:397: available GPU list map[0:2288] before removing unhealty GPUs [ info ] 2022/09/05 13:42:11 nodeinfo.go:402: available GPU list map[0:2288] after removing unhealty GPUs [ debug ] 2022/09/05 13:42:11 nodeinfo.go:162: AvailableGPUs: map[0:2288] in node worker1 [ info ] 2022/09/05 13:42:11 gpushare-predicate.go:31: The pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in the namespace ai-model can be scheduled on worker1 [ info ] 2022/09/05 13:42:11 gpushare-predicate.go:17: check if the pod name p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx can be scheduled on node worker2 [ info ] 2022/09/05 13:42:11 cache.go:160: GetNodeInfo() uses the existing nodeInfo for worker2 [ debug ] 2022/09/05 13:42:11 cache.go:162: node worker2 with devices map[0:0xc42215caa0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:423: getAllGPUs: map[0:12288] in node worker2, and dev map[0:0xc42215caa0] [ debug ] 2022/09/05 13:42:11 deviceinfo.go:42: GetUsedGPUMemory() podMap map[1cbc9c36-73f5-4e2e-8ecf-9206a409d69b:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-c62f1b76-0408-11ed-9340-667f53977afd-6d45556b4d-bddqz,GenerateName:p-c62f1b76-0408-11ed-9340-667f53977afd-6d45556b4d-,Namespace:ai-model,SelfLink:,UID:1cbc9c36-73f5-4e2e-8ecf-9206a409d69b,ResourceVersion:60853218,Generation:0,CreationTimestamp:2022-09-05 13:31:21 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: c62f1b76-0408-11ed-9340-667f53977afd,pod-template-hash: 6d45556b4d,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384681501299370,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-c62f1b76-0408-11ed-9340-667f53977afd-6d45556b4d 40ab5179-2c75-44f4-bd6c-0a84fdd29c93 0xc4208a6c18 0xc4208a6c19}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-c62f1b76-0408-11ed-9340-667f53977afd registry.kk.com/ai/offline_function_tensorrt/volt_meter:amd-2.8 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}], 
RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4208a6c28} {node.kubernetes.io/unreachable Exists NoExecute 0xc4208a6c30}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:21 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:21 +0000 UTC ContainersNotReady containers with unready status: [c-c62f1b76-0408-11ed-9340-667f53977afd]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:21 +0000 UTC ContainersNotReady containers with unready status: [c-c62f1b76-0408-11ed-9340-667f53977afd]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:21 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.15,PodIP:,StartTime:2022-09-05 13:31:21 +0000 UTC,ContainerStatuses:[{c-c62f1b76-0408-11ed-9340-667f53977afd {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/volt_meter:amd-2.8 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 72301f05-54d4-454b-96ea-96d1180620b2:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-775af1d0-d025-11ec-b79a-12cc16fb82ca-655b867d44-zzdbr,GenerateName:p-775af1d0-d025-11ec-b79a-12cc16fb82ca-655b867d44-,Namespace:ai-model,SelfLink:,UID:72301f05-54d4-454b-96ea-96d1180620b2,ResourceVersion:60854311,Generation:0,Cr 13:33:47 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 775af1d0-d025-11ec-b79a-12cc16fb82ca,pod-template-hash: 655b867d44,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384828135383512,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cattle.io/timestamp: 2022-06-15T08:43:32Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"p-775af1d0-d025-11ec-b79a-12cc16fb82ca","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],},OwnerReferences:[{apps/v1 ReplicaSet p-775af1d0-d025-11ec-b79a-12cc16fb82ca-655b867d44 71570ef2-b354-4ff5-b41a-a4f70dbf8b1a 0xc42207a808 0xc42207a809}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-775af1d0-d025-11ec-b79a-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/oil_level:amd-2.4 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k 
DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSecon ds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42207a818} {node.kubernetes.io/unreachable Exists NoExecute 0xc42207a820}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:48 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:48 +0000 UTC ContainersNotReady containers with unready status: [c-775af1d0-d025-11ec-b79a-12cc16fb82ca]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:48 +0000 UTC ContainersNotReady containers with unready status: [c-775af1d0-d025-11ec-b79a-12cc16fb82ca]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:48 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.15,PodIP:,StartTime:2022-09-05 13:33:48 +0000 UTC,ContainerStatuses:[{c-775af1d0-d025-11ec-b79a-12cc16fb82ca {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/oil_level:amd-2.4 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc42215caa0 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-775af1d0-d025-11ec-b79a-12cc16fb82ca-655b867d44-zzdbr in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-c62f1b76-0408-11ed-9340-667f53977afd-6d45556b4d-bddqz in ns ai-model with status Pending has GPU Mem 1000 [ info ] 2022/09/05 13:42:11 nodeinfo.go:413: getUsedGPUs: map[0:2000] in node worker2, and devs map[0:0xc42215caa0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:431: try to find unhealthy node unhealthy-gpu-worker2 [ info ] 2022/09/05 13:42:11 nodeinfo.go:397: available GPU list map[0:10288] before removing unhealty GPUs [ info ] 2022/09/05 13:42:11 nodeinfo.go:402: available GPU list map[0:10288] after removing unhealty GPUs [ debug ] 2022/09/05 13:42:11 nodeinfo.go:162: AvailableGPUs: map[0:10288] in node worker2 [ info ] 2022/09/05 13:42:11 gpushare-predicate.go:31: The pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in the namespace ai-model can be scheduled on worker2 [ info ] 2022/09/05 13:42:11 routes.go:93: gpusharingfilter extenderFilterResult = {"Nodes":null,"NodeNames":["worker1","worker2"],"FailedNodes":{},"Error":""} [ debug ] 2022/09/05 13:42:11 routes.go:162: /gpushare-scheduler/filter response=&{0xc420fcc140 0xc420fd2000 0xc420fc2480 0x565cc0 true false false false 0xc420fc2580 {0xc4203d2a80 map[Content-Type:[application/json]] false false} 
map[Content-Type:[application/json]] true 76 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc421faed90 0} [ debug ] 2022/09/05 13:42:11 routes.go:160: /gpushare-scheduler/bind request body = &{0xc421206c80 <nil> <nil> false true {0 0} false false false 0x69c120} [ debug ] 2022/09/05 13:42:11 routes.go:121: gpusharingBind ExtenderArgs ={p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx ai-model 4c24f578-1a0c-417b-adb0-76b0b2604c06 worker1} [ info ] 2022/09/05 13:42:11 cache.go:160: GetNodeInfo() uses the existing nodeInfo for worker1 [ debug ] 2022/09/05 13:42:11 cache.go:162: node worker1 with devices map[0:0xc4203babc0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:184: Allocate() ----Begin to allocate GPU for gpu mem for pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in ns ai-model----

    From the output printed here after /gpushare-scheduler/filter finishes, you can see the pod is scheduled straight to worker1, even though worker2 actually has more free GPU resources than worker1 (screenshot attached). Is there any way to solve this? Is it a deployment problem or a bug in the code?
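
    To make the log above easier to follow, here is a minimal Go sketch of the per-device availability check the filter is reporting. It is not the project's actual code; the `device` struct, the `fitsOnNode` helper, and the hard-coded numbers are illustrative, taken from the values visible in the log (total 12288 on both workers, 10000 used on worker1, 2000 used on worker2, and a pending pod requesting 1000 of aliyun.com/gpu-mem).

```go
// Minimal sketch, assuming the filter simply checks whether any GPU on the
// node still has enough free memory for the pod's aliyun.com/gpu-mem request.
package main

import "fmt"

// device holds per-GPU memory figures in MiB, mirroring the log values.
type device struct {
	totalMem int // getAllGPUs value for this device
	usedMem  int // getUsedGPUs value for this device
}

// fitsOnNode reports whether a pod requesting reqMem MiB fits on any GPU of the node.
func fitsOnNode(devs map[int]device, reqMem int) bool {
	for _, d := range devs {
		if d.totalMem-d.usedMem >= reqMem {
			return true
		}
	}
	return false
}

func main() {
	worker1 := map[int]device{0: {totalMem: 12288, usedMem: 10000}} // available 2288
	worker2 := map[int]device{0: {totalMem: 12288, usedMem: 2000}}  // available 10288

	req := 1000 // aliyun.com/gpu-mem requested by the pending pod
	fmt.Println("worker1 fits:", fitsOnNode(worker1, req)) // true
	fmt.Println("worker2 fits:", fitsOnNode(worker2, req)) // true
}
```

    Under that reading, both nodes pass the predicate, which matches the logged extenderFilterResult of ["worker1","worker2"]; the filter only prunes nodes, and the choice between the remaining nodes is made in the scoring/binding step that follows, which is why the bind request for worker1 arrives next in the log.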
