A Kubernetes Native Batch System (Project under CNCF)


Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms commonly required by many classes of batch & elastic workloads, including machine learning/deep learning, bioinformatics/genomics, and other "big data" applications. These applications typically run on general-purpose domain frameworks such as TensorFlow, Spark, PyTorch, and MPI, which Volcano integrates with.

Volcano builds upon a decade and a half of experience running a wide variety of high performance workloads at scale using several systems and platforms, combined with best-of-breed ideas and practices from the open source community.

NOTE: the scheduler is built on kube-batch; refer to #241 and #288 for more detail.


Volcano is a sandbox project of the Cloud Native Computing Foundation (CNCF). Please consider joining the CNCF if you are an organization that wants to take an active role in supporting the growth and evolution of the cloud native ecosystem.

Overall Architecture


Talks

Ecosystem

Quick Start Guide

Prerequisites

  • Kubernetes 1.12+ with CRD support

You can try Volcano in one of the following two ways.

Note:

  • For Kubernetes v1.16+ use CRDs under config/crd/bases (recommended)
  • For Kubernetes versions < v1.16 use CRDs under config/crd/v1beta1 (deprecated)

Install with YAML files

Install Volcano on an existing Kubernetes cluster. This method is available for both the x86_64 and arm64 architectures.

For x86_64:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

For arm64:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml

Enjoy! Volcano will create the following resources in the volcano-system namespace.

NAME                                       READY   STATUS      RESTARTS   AGE
pod/volcano-admission-5bd5756f79-dnr4l     1/1     Running     0          96s
pod/volcano-admission-init-4hjpx           0/1     Completed   0          96s
pod/volcano-controllers-687948d9c8-nw4b4   1/1     Running     0          96s
pod/volcano-scheduler-94998fc64-4z8kh      1/1     Running     0          96s

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/volcano-admission-service   ClusterIP   10.98.152.108   <none>        443/TCP   96s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           96s
deployment.apps/volcano-controllers   1/1     1            1           96s
deployment.apps/volcano-scheduler     1/1     1            1           96s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-5bd5756f79     1         1         1       96s
replicaset.apps/volcano-controllers-687948d9c8   1         1         1       96s
replicaset.apps/volcano-scheduler-94998fc64      1         1         1       96s

NAME                               COMPLETIONS   DURATION   AGE
job.batch/volcano-admission-init   1/1           48s        96s
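
If you'd like to verify the installation end to end, you can submit a small job through the Volcano scheduler. Below is a minimal sketch based on the batch.volcano.sh/v1alpha1 Job examples shown later on this page; the job name and image are illustrative.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: smoke-test            # illustrative name
spec:
  schedulerName: volcano      # hand the pods to the Volcano scheduler
  minAvailable: 1             # gang scheduling: run only when 1 pod can be placed
  queue: default              # submit to the default queue
  tasks:
    - replicas: 1
      name: main
      template:
        spec:
          containers:
            - name: main
              image: busybox                          # illustrative image
              command: ["/bin/sh", "-c", "echo hello from volcano"]
          restartPolicy: Never

Save it to a file and apply it with kubectl apply -f; the pod should be placed by Volcano rather than the default scheduler.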

Install from code

If you don't have a Kubernetes cluster, try the one-click install from the code base:

./hack/local-up-volcano.sh

This method is currently only available for x86_64.

Install monitoring system

If you want a Prometheus and Grafana Volcano dashboard after Volcano is installed, try the following commands:

make TAG=latest generate-yaml
kubectl create -f _output/release/volcano-monitoring-latest.yaml

Meeting

Regular Community Meeting:

The Volcano team meets once per week on Friday, alternating between 10am and 3pm Beijing Time.

Resources:

Contact

If you have any questions, feel free to reach out to us in the following ways:

CNCF Slack Channel

Mailing List

Comments
  • Failed to launch mpijob after installing volcano

    Hi everyone, I am trying to use gang scheduling in my k8s/kubeflow cluster and installed Volcano following the tutorials here and here.

    $ kubectl get all -n volcano-system 
    NAME                                       READY   STATUS      RESTARTS   AGE
    pod/volcano-admission-5bd5756f79-5rxkh     1/1     Running     0          24h
    pod/volcano-admission-init-nf2mc           0/1     Completed   0          24h
    pod/volcano-controllers-687948d9c8-xclv7   1/1     Running     0          24h
    pod/volcano-scheduler-79f569766f-bxgnf     1/1     Running     0          24h
    
    
    NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
    service/volcano-admission-service   ClusterIP   10.107.67.206   <none>        443/TCP   24h
    
    
    NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/volcano-admission     1/1     1            1           24h
    deployment.apps/volcano-controllers   1/1     1            1           24h
    deployment.apps/volcano-scheduler     1/1     1            1           24h
    
    NAME                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/volcano-admission-5bd5756f79     1         1         1       24h
    replicaset.apps/volcano-controllers-687948d9c8   1         1         1       24h
    replicaset.apps/volcano-scheduler-79f569766f     1         1         1       24h
    
    
    
    NAME                               COMPLETIONS   DURATION   AGE
    job.batch/volcano-admission-init   1/1           24s        24h
    

    However, some error messages came up when I launched the mpijob. It seems the job queue is not working properly.

    $ kubectl logs -n volcano-system volcano-controllers-687948d9c8-xclv7 --tail 10                                                                                             
    I0917 02:26:57.418937       1 queue_controller.go:158] Begin sync queue default
    I0917 02:26:57.418960       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 02:43:37.419076       1 queue_controller.go:158] Begin sync queue default
    I0917 02:43:37.419106       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:00:17.419234       1 queue_controller.go:158] Begin sync queue default
    I0917 03:00:17.419268       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:16:57.419408       1 queue_controller.go:158] Begin sync queue default
    I0917 03:16:57.419431       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:33:37.419563       1 queue_controller.go:158] Begin sync queue default
    I0917 03:33:37.419590       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    

    The pods are all in the "Pending" state:

    $ kubectl get pods                 
    NAME                                      READY   STATUS    RESTARTS   AGE
    mxnet-horovod-job-launcher-7pncv          0/1     Pending   0          159m
    mxnet-horovod-job-worker-0                0/1     Pending   0          159m
    mxnet-horovod-job-worker-1                0/1     Pending   0          159m
    mxnet-horovod-job-worker-2                0/1     Pending   0          159m
    mxnet-horovod-job-worker-3                0/1     Pending   0          159m
    

    The output of the volcano-scheduler is as follows:

    $ kubectl logs -n volcano-system volcano-scheduler-79f569766f-bxgnf --tail 20
    I0917 03:38:21.543470       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
    I0917 03:38:21.543496       1 enqueue.go:122] Leaving Enqueue ...
    I0917 03:38:21.543509       1 allocate.go:43] Enter Allocate ...
    I0917 03:38:21.543523       1 allocate.go:94] Try to allocate resource to 0 Namespaces
    I0917 03:38:21.543544       1 allocate.go:247] Leaving Allocate ...
    I0917 03:38:21.543552       1 backfill.go:42] Enter Backfill ...
    I0917 03:38:21.543562       1 backfill.go:91] Leaving Backfill ...
    I0917 03:38:21.547705       1 session.go:154] Close Session 989f0526-d8fc-11e9-af2b-46b0d5a5c4cd
    I0917 03:38:22.548180       1 cache.go:771] There are <1> Jobs, <1> Queues and <7> Nodes in total for scheduling.
    I0917 03:38:22.548205       1 session.go:135] Open Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd with <1> Job and <1> Queues
    I0917 03:38:22.548540       1 enqueue.go:43] Enter Enqueue ...
    I0917 03:38:22.548553       1 enqueue.go:58] Added Queue <default> for Job <default/mxnet-horovod-job>
    I0917 03:38:22.548564       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
    I0917 03:38:22.548593       1 enqueue.go:122] Leaving Enqueue ...
    I0917 03:38:22.548606       1 allocate.go:43] Enter Allocate ...
    I0917 03:38:22.548621       1 allocate.go:94] Try to allocate resource to 0 Namespaces
    I0917 03:38:22.548642       1 allocate.go:247] Leaving Allocate ...
    I0917 03:38:22.548651       1 backfill.go:42] Enter Backfill ...
    I0917 03:38:22.548662       1 backfill.go:91] Leaving Backfill ...
    I0917 03:38:22.552921       1 session.go:154] Close Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd
    

    I'd really appreciate it if someone could offer some help!
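
    For reference, the "queue default has not been seen or deleted" errors suggest the scheduler and controllers cannot find a default Queue object. A minimal sketch for checking and recreating it, assuming the scheduling.volcano.sh/v1beta1 Queue API used elsewhere on this page (the weight value is illustrative):

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: default        # the queue the controller logs complain about
    spec:
      weight: 1            # illustrative weight

    You can first check whether the queue exists at all with kubectl get queue default.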

  • Large memory is used by volcano-scheduler

    What happened: The volcano-scheduler uses a large amount of memory, and scheduling seems to misbehave: the job's pods stay pending forever, yet there are no error logs in the pod. After restarting the scheduler, everything is okay.

    The image link of the result of "kubectl top pods -n volcano-system": https://l4x826wg3c.feishu.cn/file/boxcnLnrbgq6CmvjAQWfOlWX9gc

    What you expected to happen: The scheduler works well. Alternatively, I'd like to know how to check what happened, so that I could restart it when monitoring detects certain events.

    How to reproduce it (as minimally and precisely as possible): Not sure how to reproduce it, but it happened many times after running for some days.

    Anything else we need to know?:

    1. The etcd in our cluster uses a normal disk, not an SSD.
    2. The nodes where the Volcano pods are deployed can also be scheduled with other compute tasks such as PyTorchJobs, and the Volcano pods have no resource requests set.

    Environment:

    • Volcano Version: https://github.com/volcano-sh/volcano/commit/1b96bdf4de821e1e4af2b4c056f67be7559a880d
    • Kubernetes version (use kubectl version): 1.22.2
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):
    • Install tools:
    • Others:
  • Add GPU Numbers Predicates

    Support specifying GPU numbers for pod resource requests (issue #1440).

    Currently, Volcano only supports specifying GPU shared memory; specifying a GPU number is not supported. This PR adds support for defining GPU numbers in pod resource requests. You can check the design doc https://github.com/peiniliu/volcano/blob/dev/docs/user-guide/how_to_use_gpu_number.md for more details.
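
    For illustration, a pod requesting whole GPUs under this proposal might look like the sketch below; the volcano.sh/gpu-number resource name is an assumption taken from the linked design doc, by analogy with the existing GPU-memory resource:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-number-demo                # illustrative name
    spec:
      schedulerName: volcano
      containers:
        - name: cuda
          image: nvidia/cuda:10.1-base     # illustrative image
          resources:
            limits:
              volcano.sh/gpu-number: 2     # assumed resource key: request 2 whole GPUs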

  • Switch to cross-compiled docker containers & container build

    This allows a single container name to support multiple architectures; once the images are pushed, we can remove more of the arm64-specific installation material, since the same container names will support both.

    Docker does "the right thing" and pulls the image that matches the host architecture.

    You can take a look at the containers I built with this change in my own dockerhub at https://hub.docker.com/repository/docker/holdenk/volcanosh-scheduler , https://hub.docker.com/repository/docker/holdenk/volcanosh-controller-manager , https://hub.docker.com/repository/docker/holdenk/volcanosh-webhook-manager-base , etc.

    This is in response to https://github.com/volcano-sh/volcano/issues/1570 (although it could also solve https://github.com/volcano-sh/volcano/issues/1568 since we wouldn't need volcano-development-arm64.yaml anymore).

    To preserve backward compatibility with users who might be developing locally in single-arch mode, I have made that the default. If there are release docs I should update as well, let me know.

    Signed-off-by: Holden Karau [email protected]

  • Distinguish different pod-delete scenario

    Tries to address issue #791. It's a draft solution and needs further discussion.

    In my environment it seems to work, but the PodGroup status is not correct: after the delete (after success), the original pod is gone and not recreated, but the status of the PodGroup is:

    status:
      phase: Running
      running: 2
    

    and not what is expected:

    status:
      phase: Running
      running: 2
      success: 1
    
  • plugin ssh and mpi for HPC calculation for engine on earthquake

    /kind feature

    Environment:

    • Volcano Version: 1.12
    • Kubernetes version (use kubectl version): Kind installation for testing: kubectl version Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-18T09:04:15Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}

    I want to use Volcano as the scheduler for our earthquake calculation engine. When we use VMs or bare-metal hosts, communication within the engine cluster is done over SSH.

    I see that there are an mpi plugin and an ssh plugin, but unfortunately I can't find any docs on how to use these plugins in a deployment YAML. What I need is to understand how the plugin enables communication from master to worker; see the following example:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: lm-mpi-job
    spec:
      minAvailable: 3
      schedulerName: volcano
      plugins:
        ssh: []
        svc: []
      tasks:
        - replicas: 1
          name: mpimaster
          policies:
            - event: TaskCompleted
              action: CompleteJob
          template:
            spec:
              containers:
                - command:
                    - /bin/sh
                    - -c
                    - |
                      sleep 10;
                      cat /etc/volcano/mpiworker.host | tr "\n" ","
                      MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
                      mkdir -p /var/run/sshd; /usr/sbin/sshd;
                      mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world ;
                      sleep 100;
                  image: volcanosh/example-mpi:0.0.1
                  name: mpimaster
                  ports:
                    - containerPort: 22
                      name: mpijob-port
                  workingDir: /home
              restartPolicy: OnFailure
        - replicas: 2
          name: mpiworker
          template:
            spec:
              containers:
                - command:
                    - /bin/sh
                    - -c
                    - |
                      mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
                  image: volcanosh/example-mpi:0.0.1
                  name: mpiworker
                  ports:
                    - containerPort: 22
                      name: mpijob-port
                  workingDir: /home
              restartPolicy: OnFailure
    

    In this example the user is root, but is it possible to use a different user with the ssh plugin, so the master can SSH to the workers? In our container image we don't use the root user, but we need an SSH connection from master to worker, as Open MPI does. Does the mpi plugin work the same way? I found only a PR but no documentation available on volcano.sh or GitHub.

    Thanks

  • dynamically set tasks' replicas, within a range of [min, max]

    Is this a BUG REPORT or FEATURE REQUEST?:

    /kind feature

    What happened: In the TensorFlow domain, when a user submits a distributed TensorFlow job, they must decide the number of workers up front, and usually set minAvailable := the sum of the replicas of all tasks.

    What you expected to happen: If we could set replicas to a range, e.g. [min, max], we could enhance our scheduling ability; a purely hypothetical sketch follows the list below.

    1. If there are enough resources, we can start as many workers as possible (<= max).
    2. If there are fewer resources, we can still start the job as fast as possible (>= min).
    3. If the TensorFlow workload (or any other workload) allows dynamic workers (e.g. auto-scaling), it gives Volcano more room to schedule.
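
    A purely hypothetical sketch of what such a spec could look like; minReplicas/maxReplicas are not existing Volcano fields, only an illustration of this request:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: dist-tf                    # illustrative name
    spec:
      schedulerName: volcano
      tasks:
        - name: worker
          replicas: 4                  # today: a single fixed count
          # proposed instead: a range the scheduler may choose from
          # minReplicas: 2
          # maxReplicas: 8
          template:
            spec:
              containers:
                - name: tensorflow
                  image: tensorflow/tensorflow   # illustrative image
              restartPolicy: OnFailure
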
  • SLA plugin doesn't work on `batch/v1` `Job` objects; `sla-waiting-time` from `volcano-scheduler.conf` is ignored

    What happened:

    As mentioned in #1869 I am using Volcano to schedule Kubernetes Job objects, to try and prevent smaller jobs submitted later from immediately filling any available space and starving larger jobs submitted earlier.

    My cluster has a 96-core node with hostname "k1.kube".

    I installed Volcano from the Helm chart in tag v1.4.0, using this values.yaml:

    basic:
      image_tag_version: "v1.4.0"
      controller_image_name: "volcanosh/vc-controller-manager"
      scheduler_image_name: "volcanosh/vc-scheduler"
      admission_image_name: "volcanosh/vc-webhook-manager"
      admission_secret_name: "volcano-admission-secret"
      admission_config_file: "config/volcano-admission.conf"
      scheduler_config_file: "config/volcano-scheduler.conf"
    
      image_pull_secret: ""
      admission_port: 8443
      crd_version: "v1"
    custom:
      metrics_enable: "false"
    

    And then overriding the scheduler configmap with this and restarting the scheduler pod:

    apiVersion: v1
    data:
      volcano-scheduler.conf: |
        actions: "enqueue, allocate, backfill"
        tiers:
        - plugins:
          - name: priority
          - name: gang
          - name: conformance
          - name: sla
            arguments:
              # Stop letting little jobs pass big jobs after the big jobs have been
              # waiting this long
              sla-waiting-time: 5m
        - plugins:
          - name: overcommit
          - name: drf
          - name: predicates
          - name: proportion
          - name: nodeorder
            arguments:
              # Maybe this will try to fill already full nodes first?
              leastrequested.weight: 0
              mostrequested.weight: 2
              nodeaffinity.weight: 3
              podaffinity.weight: 3
              balancedresource.weight: 1
              tainttoleration.weight: 1
              imagelocality.weight: 1
          - name: binpack
    kind: ConfigMap
    metadata:
      annotations:
        meta.helm.sh/release-name: volcano
        meta.helm.sh/release-namespace: volcano-system
      labels:
        app.kubernetes.io/managed-by: Helm
      name: volcano-scheduler-configmap
      namespace: volcano-system
    

    So I should be using a global SLA of 5 minutes.
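
    For reference, the sla plugin is also supposed to accept a per-job waiting time via a job annotation, which I have not tested here. A sketch, assuming the sla-waiting-time annotation key matches the scheduler argument above:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: per-job-sla-demo           # illustrative name
      annotations:
        sla-waiting-time: 10m          # assumed per-job override of the global SLA
    spec:
      schedulerName: volcano
      minAvailable: 1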

    Then, I prepared a test: fill up the node with some jobs, then queue a big job, then queue a bunch of smaller jobs after it:

    # Clean up
    kubectl delete job -l app=volcanotest
    
    # Make 10 10 core jobs that will block out our test job for at least 2 minutes
    # Make sure they don't all finish at once.
    rm -f jobs_before.yml
    for NUM in {1..10} ; do
    cat >>jobs_before.yml <<EOF
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: presleep${NUM}
      labels:
        app: volcanotest
    spec:
      template:
        spec:
          schedulerName: volcano
          nodeSelector:
            kubernetes.io/hostname: k1.kube
          containers:
          - name: main
            image: ubuntu:20.04
            command: ["sleep",  "$(( $RANDOM % 20 + 120 ))"]
            resources:
              limits:
                memory: 300M
                cpu: 10000m
                ephemeral-storage: 1G
              requests:
                memory: 300M
                cpu: 10000m
                ephemeral-storage: 1G
          restartPolicy: Never
      backoffLimit: 4
      ttlSecondsAfterFinished: 1000
    ---
    EOF
    done
    
    # And 200 10 core jobs that, if they all pass it, will keep it blocked out for 20 minutes
    # We expect it really to be blocked like 5-7-10 minutes if the SLA plugin is working.
    rm -f jobs_after.yml
    for NUM in {1..200} ; do
    cat >>jobs_after.yml <<EOF
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: postsleep${NUM}
      labels:
        app: volcanotest
    spec:
      template:
        spec:
          schedulerName: volcano
          nodeSelector:
            kubernetes.io/hostname: k1.kube
          containers:
          - name: main
            image: ubuntu:20.04
            command: ["sleep",  "$(( $RANDOM % 20 + 60 ))"]
            resources:
              limits:
                memory: 300M
                cpu: 10000m
                ephemeral-storage: 1G
              requests:
                memory: 300M
                cpu: 10000m
                ephemeral-storage: 1G
          restartPolicy: Never
      backoffLimit: 4
      ttlSecondsAfterFinished: 1000
    ---
    EOF
    done
    
    # And the test job itself between them.
    rm -f job_middle.yml
    cat >job_middle.yml <<EOF
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: middle
      labels:
        app: volcanotest
    spec:
      template:
        spec:
          schedulerName: volcano
          nodeSelector:
            kubernetes.io/hostname: k1.kube
          containers:
          - name: main
            image: ubuntu:20.04
            command: ["sleep", "1"]
            resources:
              limits:
                memory: 300M
                cpu: 50000m
                ephemeral-storage: 1G
              requests:
                memory: 300M
                cpu: 50000m
                ephemeral-storage: 1G
          restartPolicy: Never
      backoffLimit: 4
      ttlSecondsAfterFinished: 1000
    EOF
    
    kubectl apply -f jobs_before.yml
    sleep 10
    kubectl apply -f job_middle.yml
    sleep 10
    CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
    kubectl apply -f jobs_after.yml
    # Wait for it to finish
    COMPLETION_TIME=""
    while [[ -z "${COMPLETION_TIME}" ]] ; do
        sleep 10
        COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}')"
    done
    echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"
    

    I observed jobs from jobs_after.yml being scheduled even when the job from job_middle.yml had had its pod pending for 10 minutes, which is double the global SLA time that should be enforced.

    What you expected to happen:

    There shouldn't be much more than 5 minutes between the creation and completion times for the large middle job. Once the job pod from job_middle.yml has been pending for 5 minutes, no more job pods from jobs_after.yml should be scheduled by Volcano until job_middle.yml has been scheduled.

    How to reproduce it (as minimally and precisely as possible): Use the Volcano Helm chart and the above configmap override, run kubectl -n volcano-system delete pod "$(kubectl get pod -n volcano-system | grep volcano-scheduler | cut -f1 -d' ')" to bounce the scheduler pod after reconfiguring it, and use the above Bash code to generate test jobs. Adjust the hostname label selectors and job sizes as needed to fill the test cluster node you are using.

    Anything else we need to know?:

    Is the SLA plugin maybe not smart enough to clear out space for a job to meet the SLA on a node that matches its selectors? Are other plugins in the config maybe scheduling things that the SLA plugin has decided shouldn't be scheduled yet?

    The scheduler pod logs don't seem to include the string "sla", but they log a bunch for every pod that's waiting every second, so I might not be able to see the startup logs or every single line ever logged.

    The jobs are definitely getting PodGroups created for them. Here's the PodGroup description for the middle job when it should have been run according to the SLA but has not yet been:

    Name:         podgroup-31600c19-2282-47f1-934b-94026d88db1e
    Namespace:    vg
    Labels:       <none>
    Annotations:  <none>
    API Version:  scheduling.volcano.sh/v1beta1
    Kind:         PodGroup
    Metadata:
      Creation Timestamp:  2021-12-13T22:06:25Z
      Generation:          2
      Managed Fields:
        API Version:  scheduling.volcano.sh/v1beta1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:ownerReferences:
          f:spec:
            .:
            f:minMember:
            f:minResources:
              .:
              f:cpu:
              f:ephemeral-storage:
              f:memory:
            f:priorityClassName:
          f:status:
        Manager:      vc-controller-manager
        Operation:    Update
        Time:         2021-12-13T22:06:25Z
        API Version:  scheduling.volcano.sh/v1beta1
        Fields Type:  FieldsV1
        fieldsV1:
          f:status:
            f:conditions:
            f:phase:
        Manager:    vc-scheduler
        Operation:  Update
        Time:       2021-12-13T22:06:26Z
      Owner References:
        API Version:           batch/v1
        Block Owner Deletion:  true
        Controller:            true
        Kind:                  Job
        Name:                  middle
        UID:                   31600c19-2282-47f1-934b-94026d88db1e
      Resource Version:        122332555
      Self Link:               /apis/scheduling.volcano.sh/v1beta1/namespaces/vg/podgroups/podgroup-31600c19-2282-47f1-934b-94026d88db1e
      UID:                     8bee9cca-40d5-47b5-90e7-ebb1bc70059a
    Spec:
      Min Member:  1
      Min Resources:
        Cpu:                  50
        Ephemeral - Storage:  1G
        Memory:               300M
      Priority Class Name:    medium-priority
      Queue:                  default
    Status:
      Conditions:
        Last Transition Time:  2021-12-13T22:06:26Z
        Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
        Reason:                NotEnoughResources
        Status:                True
        Transition ID:         86f1b151-92dd-4893-bcd3-c2573b3029fc
        Type:                  Unschedulable
      Phase:                   Inqueue
    Events:
      Type     Reason         Age                   From     Message
      ----     ------         ----                  ----     -------
      Warning  Unschedulable  64s (x1174 over 21m)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    

    Environment:

    • Volcano Version: v1.4.0
    • Kubernetes version (use kubectl version):
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
    
    • Cloud provider or hardware configuration: Nodes are hosted on AWS instances.
    • OS (e.g. from /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"
    
    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"
    
    • Kernel (e.g. uname -a):
    Linux master.kube 5.8.7-1.el7.elrepo.x86_64 #1 SMP Fri Sep 4 13:11:18 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
    
    • Install tools:
    helm version
    version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.16.10"}
    
    • Others:
  • Pass conformance test

    Is this a BUG REPORT or FEATURE REQUEST?:

    /kind feature

    Description:

    Cherry-pick the related kube-batch PR to volcano-sh/kube-batch for the conformance test.

    /cc @asifdxtreme

  • add admitPod and PGController

    Which issue(s) this PR fixes: Fixes #135, #134

    Special notes for your reviewer:

    1. New func AdmitPod in the admission controller

    2. New PGController in the controller

    3. Delete the Inqueue job phase

    4. Fix the UTs

    Release note:

    
    1. Add ValidatingWebhookConfiguration volcano-validate-pod, which only restricts CREATE on pods; a pod is allowed to be created when:
    - pod.spec.schedulerName is default-scheduler
    - the podgroup phase isn't Pending
    - it is a normal job with no podgroup
    
    2. Add a new PGController that creates a podgroup for normal jobs when kube-batch is used.
    
    3. When a job is created, the job phase will go Pending -> Running -> ..., so fix the UTs.
    
    
  • Fair sharing not working

    What happened: My cluster has 11 CPUs in total. I'm creating 2 queues (excluding the default queue) with a weight of 5 for each queue. Queue manifest:

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: test
    spec:
      weight: 5
    
    ---
    
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: test1
    spec:
      weight: 5
    

    Queue List,

    Name                     Weight  State   Inqueue Pending Running Unknown
    default                  1       Open    0       0       0       0
    test                     5       Open    0       0       0       0
    test1                    5       Open    0       0       0       0
    

    Created 3 jobs in the test queue with the following CPU requests: job1 -> 5 CPUs, job2 -> 5 CPUs, job3 -> 1 CPU.

    Now all 3 jobs are running and utilizing the full cluster.

    Now I'm creating a new job in the test1 queue requesting 2 CPUs. I'm expecting 1 job to be evicted from the test queue so that the job in the test1 queue can run. But the job in the test1 queue stays in the Inqueue state.

    Name                     Weight  State   Inqueue Pending Running Unknown
    default                  1       Open    0       0       0       0
    test                     5       Open    0       0       3       0
    test1                    5       Open    1       0       0       0
    

    Configuration,

    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
    

    What you expected to happen: I'm expecting 1 job to be evicted from the test queue and the job in the test1 queue to start running; instead, the job in the test1 queue stays in the Inqueue state. How to reproduce it (as minimally and precisely as possible):

    Anything else we need to know?:
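
    One thing worth checking: evicting running jobs so another queue can take back its share is driven by the reclaim action, which is absent from the actions line above. A sketch of the configuration with it enabled (the action order is illustrative, and whether this resolves the behavior here is untested):

    actions: "enqueue, allocate, backfill, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion        # proportion computes each queue's deserved share from its weight
      - name: nodeorder
      - name: binpack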

    Environment:

    • Volcano Version: v1.3.0
    • Kubernetes version (use kubectl version):
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):
    • Install tools:
    • Others:
  • Error occur when execute the same task many times

    What happened: When testing the performance of the Spark-native integration with Volcano by executing the same task 50 times at a 3-second interval, half of the tasks failed.

    What you expected to happen: All tasks should execute one by one, and all of them should succeed.

    How to reproduce it (as minimally and precisely as possible): Execute the task with the following script.

    #!/bin/bash
    s=0
    for ((i=1;i<=50;i=i+1))
    do
         nohup ${SPARK_HOME1}/bin/spark-submit \
         --master k8s://https://10.32.226.132:6443 \
         --deploy-mode cluster \
         --class cn.cestc.test.JavaSparkReadHiveInHahadoopForComponent \
         --driver-cores 1 \
         --driver-memory 2G \
         --num-executors 1 \
         --executor-cores 1 \
         --executor-memory 2G \
         --name native_modetask2 \
         --jars hdfs://dev-host-03:8082/dolphinscheduler_arm/supportarm/resources/performance1/mysql-connector-java-8.0.29.jar \
         --conf spark.executor.instances=1 \
         --conf spark.kubernetes.namespace=support132x86 \
         --conf spark.kubernetes.authenticate.driver.serviceAccountName=support132x86 \
         --conf spark.kubernetes.container.image=10.32.226.224:85/public-release/spark-volcano:3.3.0  \
         --conf spark.kubernetes.scheduler.name=volcano  \
         --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/tmp/podgroup-template.yaml \
         --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
         --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
      hdfs://dev-host-03:8082/dolphinscheduler_x86/support132x86/resources/performance1/SparkJupiterTest-2.0.jar\
     datacluster hdfs://datacluster/user/hive/warehouse \
    thrift://dev-host-01:9083,thrift://dev-host-02:9083,thrift://dev-host-03:9083\
     dev-host-01:8082 dev-host-03:8082 testdata\
     hdfs://datacluster/dolphinscheduler_arm/supportarm/resources/performance1/yuanj2.sql \
    10.32.226.69 30306 testdata root CESTC1Dhr7El67KD3jG@ out_test_task2 2000 > log-$i.log &
         sleep 3s
    done
    
    

    podgroup-template.yaml

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    spec:
      # Specify minMember to 1 to make a driver pod
      minMember: 1
      # Specify minResources to support resource reservation (the driver pod resource and executors pod resource should be considered)
      # It is useful for ensuring the available resources meet the minimum requirements of the Spark job, avoiding the
      # situation where drivers are scheduled and then unable to schedule sufficient executors to progress.
      minResources:
        cpu: "10"
        memory: "20G"
      # Specify the priority, help users to specify job priority in the queue during scheduling.
      priorityClassName: system-node-critical
      # Specify the queue, indicates the resource queue which the job should be submitted to
      queue: support132x86 
    

    Anything else we need to know?: The error task log:

    [INFO] 2022-12-29 16:46:15.060 [TaskLogInfo- - [taskAppId=TASK-7947514526080_14-1674-6086]-getOutputLogService]  -  -> 22/12/29 16:46:14 ERROR Client: Please check "kubectl auth can-i create pod" first. It should be yes.
    	Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.32.226.132:6443/api/v1/namespaces/support132x86/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to create pod <support132x86/spkj-6086-driver> as the podgroup phase is Pending. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to create pod <support132x86/spkj-6086-driver> as the podgroup phase is Pending, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:305)
    		at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
    		at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
    		at io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
    		at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:152)
    		at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)
    		at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
    		at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2764)
    		at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)
    		at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)
    		at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
    		at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    		at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    		at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    		at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    		at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    		at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    	22/12/29 16:46:14 INFO ShutdownHookManager: Shutdown hook called
    	22/12/29 16:46:15 INFO ShutdownHookManager: Deleting directory /tmp/spark-8da822e9-7f18-433a-886e-eb8b56529792
    [INFO] 2022-12-29 16:46:16.152 [TaskLogInfo- - [taskAppId=TASK-7947514526080_14-1674-6086]-getOutputLogService]  - FINALIZE_SESSION
    [INFO] 2022-12-29 16:46:16.162 [TaskLogInfo- - [taskAppId=TASK-7947514526080_14-1674-6086]]  - process has exited, execute path:/tmp/dolphinschedu
    

    Environment:

    • Volcano Version: latest
    • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.10", GitCommit:"98d5dc5d36d34a7ee13368a7893dcb400ec4e566", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.10", GitCommit:"98d5dc5d36d34a7ee13368a7893dcb400ec4e566", GitTreeState:"clean", BuildDate:"2021-04-15T03:20:25Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
    
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):Linux master 5.4.121-1.el7.elrepo.x86_64 #1 SMP Thu May 20 19:22:37 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
    • Install tools:
    • Others:
  • remove gox

    Signed-off-by: hwdef [email protected]

    Removing the gox tool has the following benefits:

    • Simplifies the project build
    • Reduces the packages the project depends on
    • Makes compilation smoother in China's network environment, where installing gox is likely to fail
  • Preempt does not follow minAvailable.

    What happened:

    When I tried the preempt action with the priority plugin, I set minAvailable: 1 in a high-priority job that has 4 tasks, and expected that Volcano would evict only 1 task of the low-priority job. But the result was that all 4 tasks from the high-priority job started and evicted 4 low-priority tasks, as below:

    NAME               READY   STATUS        RESTARTS   AGE
    job-high-task0-0   0/1     Pending       0          3s
    job-high-task0-1   0/1     Pending       0          3s
    job-high-task1-0   0/1     Pending       0          3s
    job-high-task1-1   0/1     Pending       0          3s
    job-low-task0-0    1/1     Terminating   0          13m
    job-low-task0-1    1/1     Terminating   0          13m
    job-low-task1-0    1/1     Terminating   0          13m
    job-low-task1-1    1/1     Terminating   0          14m
    

    What you expected to happen:

    How to reproduce it (as minimally and precisely as possible):

    volcano-scheduler-configmap was set as below:

        actions: "enqueue, allocate, preempt, backfill"
        tiers:
        - plugins:
          - name: priority
          - name: gang
            enableJobStarving: false
            enablePreemptable: false
          - name: conformance
        - plugins:
          - name: drf
            enablePreemptable: false
          - name: predicates
          - name: nodeorder
          - name: binpack
    
    1. Set up the priority classes:
    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    high                      1000         false            49d
    low                       10           false            31h
    
    2. Apply a low-priority job with 4 tasks, filling the cluster:
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: job-low
    spec:
      schedulerName: volcano
      minAvailable: 4
      queue: default
      priorityClassName: low
      tasks:
        - replicas: 2
          name: "task0"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 99999"]
                  imagePullPolicy: IfNotPresent
                  name: task
                  resources:
                    limits:
                      cpu: 3750m
                    requests:
                      cpu: 3750m
              restartPolicy: OnFailure
        - replicas: 2
          name: "task1"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 99999"]
                  imagePullPolicy: IfNotPresent
                  name: task
                  resources:
                    limits:
                      cpu: 3750m
                    requests:
                      cpu: 3750m
              restartPolicy: OnFailure
    
    3. Apply a high-priority job with 4 tasks but minAvailable: 1:
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: job-high
    spec:
      schedulerName: volcano
      minAvailable: 1
      queue: default
      priorityClassName: high
      tasks:
        - replicas: 2
          name: "task0"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 99999"]
                  imagePullPolicy: IfNotPresent
                  name: task
                  resources:
                    limits:
                      cpu: 3750m
                    requests:
                      cpu: 3750m
              restartPolicy: OnFailure
        - replicas: 2
          name: "task1"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 99999"]
                  imagePullPolicy: IfNotPresent
                  name: task
                  resources:
                    limits:
                      cpu: 3750m
                    requests:
                      cpu: 3750m
              restartPolicy: OnFailure
    
    4. Wait for preemption.

    Anything else we need to know?:

    Environment:

    • Volcano Version: latest
    • Kubernetes version (use kubectl version): 1.21
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):
    • Install tools:
    • Others:
  • allocateIdleResource Method Calculate Node Idle Bugs

    What happened: When using Volcano to allocate tasks to a node, UnexpectedAdmissionError events appear on the pods. The Volcano scheduler keeps assigning pods to the same node despite its lack of GPU resources. As a result, many jobs failed in a short period of time.

    Here is the error message:

    Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 4, Available: 3, which is unexpected
    

    What you expected to happen: Node info in the vc-scheduler should be calculated precisely and correctly, so that tasks are not scheduled onto the node.

    How to reproduce it (as minimally and precisely as possible): At a particular moment, there was a node with 4 GPUs, 3 healthy and 1 unhealthy, in use by an inference task. In the scheduler log, the node's idle resources showed 3 GPUs while its used resources showed 4 GPUs, causing the bug.

    Anything else we need to know?:

    // allocateIdleResource subtracts a task's resource request from the node's idle resources.
    func (ni *NodeInfo) allocateIdleResource(ti *TaskInfo) error {
    	// Only subtract when the request fits entirely within the remaining idle resources.
    	if ti.Resreq.LessEqual(ni.Idle, Zero) {
    		ni.Idle.Sub(ti.Resreq)
    		return nil
    	}
    
    	// Otherwise fail without touching ni.Idle, leaving it at its previous value.
    	return &AllocateFailError{Reason: fmt.Sprintf(
    		"cannot allocate resource, <%s> idle: %s <%s/%s> req: %s",
    		ni.Name, ni.Idle.String(), ti.Namespace, ti.Name, ti.Resreq.String(),
    	)}
    }
    

    In pkg/scheduler/api/node_info.go, when Resreq is larger than the node's idle resources, the node's idle value is not set to the correct value.
