A Kubernetes Native Batch System (Project under CNCF)


Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms commonly required by many classes of batch & elastic workloads, including machine learning/deep learning, bioinformatics/genomics, and other "big data" applications. These applications typically run on general-purpose domain frameworks such as TensorFlow, Spark, PyTorch, and MPI, which Volcano integrates with.

Volcano builds upon a decade and a half of experience running a wide variety of high performance workloads at scale using several systems and platforms, combined with best-of-breed ideas and practices from the open source community.

NOTE: the scheduler is built on kube-batch; refer to #241 and #288 for more detail.


Volcano is a sandbox project of the Cloud Native Computing Foundation (CNCF). Please consider joining the CNCF if you are an organization that wants to take an active role in supporting the growth and evolution of the cloud native ecosystem.

Overall Architecture


Talks

Ecosystem

Quick Start Guide

Prerequisites

  • Kubernetes 1.12+ with CRD support

You can try Volcano in one of the following two ways.

Note:

  • For Kubernetes v1.16+ use CRDs under config/crd/bases (recommended)
  • For Kubernetes versions < v1.16 use CRDs under config/crd/v1beta1 (deprecated)

Install with YAML files

Install Volcano on an existing Kubernetes cluster. This method is available for both the x86_64 and arm64 architectures.

For x86_64:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

For arm64:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml

Enjoy! Volcano will create the following resources in the volcano-system namespace.

NAME                                       READY   STATUS      RESTARTS   AGE
pod/volcano-admission-5bd5756f79-dnr4l     1/1     Running     0          96s
pod/volcano-admission-init-4hjpx           0/1     Completed   0          96s
pod/volcano-controllers-687948d9c8-nw4b4   1/1     Running     0          96s
pod/volcano-scheduler-94998fc64-4z8kh      1/1     Running     0          96s

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/volcano-admission-service   ClusterIP   10.98.152.108   <none>        443/TCP   96s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           96s
deployment.apps/volcano-controllers   1/1     1            1           96s
deployment.apps/volcano-scheduler     1/1     1            1           96s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-5bd5756f79     1         1         1       96s
replicaset.apps/volcano-controllers-687948d9c8   1         1         1       96s
replicaset.apps/volcano-scheduler-94998fc64      1         1         1       96s

NAME                               COMPLETIONS   DURATION   AGE
job.batch/volcano-admission-init   1/1           48s        96s
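
If you'd like to verify the installation end to end, you can submit a small job through the Volcano scheduler. Below is a minimal sketch based on the batch.volcano.sh/v1alpha1 Job examples shown later on this page; the job name and image are illustrative.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: smoke-test            # illustrative name
spec:
  schedulerName: volcano      # hand the pods to the Volcano scheduler
  minAvailable: 1             # gang scheduling: run only when 1 pod can be placed
  queue: default              # submit to the default queue
  tasks:
    - replicas: 1
      name: main
      template:
        spec:
          containers:
            - name: main
              image: busybox                          # illustrative image
              command: ["/bin/sh", "-c", "echo hello from volcano"]
          restartPolicy: Never

Save it to a file and apply it with kubectl apply -f; the pod should be placed by Volcano rather than the default scheduler.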

Install from code

If you don't have a Kubernetes cluster, try the one-click install from the code base:

./hack/local-up-volcano.sh

This method is currently only available for x86_64.

Install monitoring system

If you want a Prometheus and Grafana Volcano dashboard after Volcano is installed, try the following commands:

make TAG=latest generate-yaml
kubectl create -f _output/release/volcano-monitoring-latest.yaml

Meeting

Regular Community Meeting:

The Volcano team meets once per week on Friday, alternating between 10am and 3pm Beijing Time.

Resources:

Contact

If you have any questions, feel free to reach out to us in the following ways:

CNCF Slack Channel

Mailing List

Comments
  • Failed to launch mpijob after installing volcano

    Hi everyone, I am trying to use gang scheduling in my k8s/kubeflow cluster and installed Volcano following the tutorials here and here.

    $ kubectl get all -n volcano-system 
    NAME                                       READY   STATUS      RESTARTS   AGE
    pod/volcano-admission-5bd5756f79-5rxkh     1/1     Running     0          24h
    pod/volcano-admission-init-nf2mc           0/1     Completed   0          24h
    pod/volcano-controllers-687948d9c8-xclv7   1/1     Running     0          24h
    pod/volcano-scheduler-79f569766f-bxgnf     1/1     Running     0          24h
    
    
    NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
    service/volcano-admission-service   ClusterIP   10.107.67.206   <none>        443/TCP   24h
    
    
    NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/volcano-admission     1/1     1            1           24h
    deployment.apps/volcano-controllers   1/1     1            1           24h
    deployment.apps/volcano-scheduler     1/1     1            1           24h
    
    NAME                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/volcano-admission-5bd5756f79     1         1         1       24h
    replicaset.apps/volcano-controllers-687948d9c8   1         1         1       24h
    replicaset.apps/volcano-scheduler-79f569766f     1         1         1       24h
    
    
    
    NAME                               COMPLETIONS   DURATION   AGE
    job.batch/volcano-admission-init   1/1           24s        24h
    

    However, some error messages came up when I launched the mpijob. It seems the job queue is not working properly.

    $ kubectl logs -n volcano-system volcano-controllers-687948d9c8-xclv7 --tail 10                                                                                             
    I0917 02:26:57.418937       1 queue_controller.go:158] Begin sync queue default
    I0917 02:26:57.418960       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 02:43:37.419076       1 queue_controller.go:158] Begin sync queue default
    I0917 02:43:37.419106       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:00:17.419234       1 queue_controller.go:158] Begin sync queue default
    I0917 03:00:17.419268       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:16:57.419408       1 queue_controller.go:158] Begin sync queue default
    I0917 03:16:57.419431       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:33:37.419563       1 queue_controller.go:158] Begin sync queue default
    I0917 03:33:37.419590       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    

    The pods are all in the "Pending" state:

    $ kubectl get pods                 
    NAME                                      READY   STATUS    RESTARTS   AGE
    mxnet-horovod-job-launcher-7pncv          0/1     Pending   0          159m
    mxnet-horovod-job-worker-0                0/1     Pending   0          159m
    mxnet-horovod-job-worker-1                0/1     Pending   0          159m
    mxnet-horovod-job-worker-2                0/1     Pending   0          159m
    mxnet-horovod-job-worker-3                0/1     Pending   0          159m
    

    The output of the volcano-scheduler is as follows:

    $ kubectl logs -n volcano-system volcano-scheduler-79f569766f-bxgnf --tail 20
    I0917 03:38:21.543470       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
    I0917 03:38:21.543496       1 enqueue.go:122] Leaving Enqueue ...
    I0917 03:38:21.543509       1 allocate.go:43] Enter Allocate ...
    I0917 03:38:21.543523       1 allocate.go:94] Try to allocate resource to 0 Namespaces
    I0917 03:38:21.543544       1 allocate.go:247] Leaving Allocate ...
    I0917 03:38:21.543552       1 backfill.go:42] Enter Backfill ...
    I0917 03:38:21.543562       1 backfill.go:91] Leaving Backfill ...
    I0917 03:38:21.547705       1 session.go:154] Close Session 989f0526-d8fc-11e9-af2b-46b0d5a5c4cd
    I0917 03:38:22.548180       1 cache.go:771] There are <1> Jobs, <1> Queues and <7> Nodes in total for scheduling.
    I0917 03:38:22.548205       1 session.go:135] Open Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd with <1> Job and <1> Queues
    I0917 03:38:22.548540       1 enqueue.go:43] Enter Enqueue ...
    I0917 03:38:22.548553       1 enqueue.go:58] Added Queue <default> for Job <default/mxnet-horovod-job>
    I0917 03:38:22.548564       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
    I0917 03:38:22.548593       1 enqueue.go:122] Leaving Enqueue ...
    I0917 03:38:22.548606       1 allocate.go:43] Enter Allocate ...
    I0917 03:38:22.548621       1 allocate.go:94] Try to allocate resource to 0 Namespaces
    I0917 03:38:22.548642       1 allocate.go:247] Leaving Allocate ...
    I0917 03:38:22.548651       1 backfill.go:42] Enter Backfill ...
    I0917 03:38:22.548662       1 backfill.go:91] Leaving Backfill ...
    I0917 03:38:22.552921       1 session.go:154] Close Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd
    

    I'd really appreciate it if someone could offer some help!
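
    For reference, the "queue default has not been seen or deleted" errors suggest the scheduler and controllers cannot find a default Queue object. A minimal sketch for checking and recreating it, assuming the scheduling.volcano.sh/v1beta1 Queue API used elsewhere on this page (the weight value is illustrative):

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: default        # the queue the controller logs complain about
    spec:
      weight: 1            # illustrative weight

    You can first check whether the queue exists at all with kubectl get queue default.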

  • Large memory is used by volcano-scheduler

    What happened: The volcano-scheduler uses a large amount of memory, and scheduling seems to misbehave: the job's pods stay pending forever, yet there are no error logs in the pod. After restarting the scheduler, everything is okay.

    The image link of the result of "kubectl top pods -n volcano-system": https://l4x826wg3c.feishu.cn/file/boxcnLnrbgq6CmvjAQWfOlWX9gc

    What you expected to happen: The scheduler works well. Alternatively, I'd like to know how to check what happened, so that I could restart it when monitoring detects certain events.

    How to reproduce it (as minimally and precisely as possible): Not sure how to reproduce it, but it happened many times after running for some days.

    Anything else we need to know?:

    1. The etcd in our cluster uses a normal disk, not an SSD.
    2. The nodes where the Volcano pods are deployed can also be scheduled with other compute tasks such as PyTorchJobs, and the Volcano pods have no resource requests set.

    Environment:

    • Volcano Version: https://github.com/volcano-sh/volcano/commit/1b96bdf4de821e1e4af2b4c056f67be7559a880d
    • Kubernetes version (use kubectl version): 1.22.2
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):
    • Install tools:
    • Others:
  • Add GPU Numbers Predicates

    Support specifying GPU numbers for pod resource requests (issue #1440).

    Currently, Volcano only supports specifying GPU shared memory; specifying a GPU number is not supported. This PR adds support for defining GPU numbers in pod resource requests. You can check the design doc https://github.com/peiniliu/volcano/blob/dev/docs/user-guide/how_to_use_gpu_number.md for more details.
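
    For illustration, a pod requesting whole GPUs under this proposal might look like the sketch below; the volcano.sh/gpu-number resource name is an assumption taken from the linked design doc, by analogy with the existing GPU-memory resource:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-number-demo                # illustrative name
    spec:
      schedulerName: volcano
      containers:
        - name: cuda
          image: nvidia/cuda:10.1-base     # illustrative image
          resources:
            limits:
              volcano.sh/gpu-number: 2     # assumed resource key: request 2 whole GPUs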

  • Switch to cross-compiled docker containers & container build

    This allows a single container name to support multiple architectures; once the images are pushed, we can remove more of the arm64-specific installation material, since the same container names will support both.

    Docker does "the right thing" and pulls the image that matches the host architecture.

    You can take a look at the containers I built with this change in my own dockerhub at https://hub.docker.com/repository/docker/holdenk/volcanosh-scheduler , https://hub.docker.com/repository/docker/holdenk/volcanosh-controller-manager , https://hub.docker.com/repository/docker/holdenk/volcanosh-webhook-manager-base , etc.

    This is in response to https://github.com/volcano-sh/volcano/issues/1570 (although it could also solve https://github.com/volcano-sh/volcano/issues/1568 since we wouldn't need volcano-development-arm64.yaml anymore).

    To preserve backward compatibility with users who might be developing locally in single-arch mode, I have made that the default. If there are release docs I should update as well, let me know.

    Signed-off-by: Holden Karau [email protected]

  • Distinguish different pod-delete scenario

    Tries to address issue #791. It's a draft solution and needs further discussion.

    In my environment it seems to work, but the PodGroup status is not correct: after the delete (after success), the original pod is gone and not recreated, but the status of the PodGroup is:

    status:
      phase: Running
      running: 2
    

    and not what is expected:

    status:
      phase: Running
      running: 2
      success: 1
    
  • plugin ssh and mpi for HPC calculation for engine on earthquake

    /kind feature

    Environment:

    • Volcano Version: 1.12
    • Kubernetes version (use kubectl version): Kind installation for testing: kubectl version Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-18T09:04:15Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}

    I want to use Volcano as the scheduler for our earthquake calculation engine. When we use VMs or bare-metal hosts, communication within the engine cluster is done over SSH.

    I see that there are an mpi plugin and an ssh plugin, but unfortunately I can't find any docs on how to use these plugins in a deployment YAML. What I need is to understand how the plugin enables communication from master to worker; see the following example:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: lm-mpi-job
    spec:
      minAvailable: 3
      schedulerName: volcano
      plugins:
        ssh: []
        svc: []
      tasks:
        - replicas: 1
          name: mpimaster
          policies:
            - event: TaskCompleted
              action: CompleteJob
          template:
            spec:
              containers:
                - command:
                    - /bin/sh
                    - -c
                    - |
                      sleep 10;
                      cat /etc/volcano/mpiworker.host | tr "\n" ","
                      MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
                      mkdir -p /var/run/sshd; /usr/sbin/sshd;
                      mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world ;
                      sleep 100;
                  image: volcanosh/example-mpi:0.0.1
                  name: mpimaster
                  ports:
                    - containerPort: 22
                      name: mpijob-port
                  workingDir: /home
              restartPolicy: OnFailure
        - replicas: 2
          name: mpiworker
          template:
            spec:
              containers:
                - command:
                    - /bin/sh
                    - -c
                    - |
                      mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
                  image: volcanosh/example-mpi:0.0.1
                  name: mpiworker
                  ports:
                    - containerPort: 22
                      name: mpijob-port
                  workingDir: /home
              restartPolicy: OnFailure
    

    In this example the user is root, but is it possible to use a different user with the ssh plugin, so the master can SSH to the workers? In our container image we don't use the root user, but we need an SSH connection from master to worker, as Open MPI does. Does the mpi plugin work the same way? I found only a PR but no documentation available on volcano.sh or GitHub.

    Thanks

  • dynamically set tasks' replicas, within a range of [min, max]

    Is this a BUG REPORT or FEATURE REQUEST?:

    /kind feature

    What happened: In the TensorFlow domain, when a user submits a distributed TensorFlow job, they must decide the number of workers up front, and usually set minAvailable := the sum of the replicas of all tasks.

    What you expected to happen: If we could set replicas to a range, e.g. [min, max], we could enhance our scheduling ability; a purely hypothetical sketch follows the list below.

    1. If there are enough resources, we can start as many workers as possible (<= max).
    2. If there are fewer resources, we can still start the job as fast as possible (>= min).
    3. If the TensorFlow workload (or any other workload) allows dynamic workers (e.g. auto-scaling), it gives Volcano more room to schedule.
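
    A purely hypothetical sketch of what such a spec could look like; minReplicas/maxReplicas are not existing Volcano fields, only an illustration of this request:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: dist-tf                    # illustrative name
    spec:
      schedulerName: volcano
      tasks:
        - name: worker
          replicas: 4                  # today: a single fixed count
          # proposed instead: a range the scheduler may choose from
          # minReplicas: 2
          # maxReplicas: 8
          template:
            spec:
              containers:
                - name: tensorflow
                  image: tensorflow/tensorflow   # illustrative image
              restartPolicy: OnFailure
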
  • SLA plugin doesn't work on `batch/v1` `Job` objects; `sla-waiting-time` from `volcano-scheduler.conf` is ignored

    What happened:

    As mentioned in #1869 I am using Volcano to schedule Kubernetes Job objects, to try and prevent smaller jobs submitted later from immediately filling any available space and starving larger jobs submitted earlier.

    My cluster has a 96-core node with hostname "k1.kube".

    I installed Volcano from the Helm chart in tag v1.4.0, using this values.yaml:

    basic:
      image_tag_version: "v1.4.0"
      controller_image_name: "volcanosh/vc-controller-manager"
      scheduler_image_name: "volcanosh/vc-scheduler"
      admission_image_name: "volcanosh/vc-webhook-manager"
      admission_secret_name: "volcano-admission-secret"
      admission_config_file: "config/volcano-admission.conf"
      scheduler_config_file: "config/volcano-scheduler.conf"
    
      image_pull_secret: ""
      admission_port: 8443
      crd_version: "v1"
    custom:
      metrics_enable: "false"
    

    And then overriding the scheduler configmap with this and restarting the scheduler pod:

    apiVersion: v1
    data:
      volcano-scheduler.conf: |
        actions: "enqueue, allocate, backfill"
        tiers:
        - plugins:
          - name: priority
          - name: gang
          - name: conformance
          - name: sla
            arguments:
              # Stop letting little jobs pass big jobs after the big jobs have been
              # waiting this long
              sla-waiting-time: 5m
        - plugins:
          - name: overcommit
          - name: drf
          - name: predicates
          - name: proportion
          - name: nodeorder
            arguments:
              # Maybe this will try to fill already full nodes first?
              leastrequested.weight: 0
              mostrequested.weight: 2
              nodeaffinity.weight: 3
              podaffinity.weight: 3
              balancedresource.weight: 1
              tainttoleration.weight: 1
              imagelocality.weight: 1
          - name: binpack
    kind: ConfigMap
    metadata:
      annotations:
        meta.helm.sh/release-name: volcano
        meta.helm.sh/release-namespace: volcano-system
      labels:
        app.kubernetes.io/managed-by: Helm
      name: volcano-scheduler-configmap
      namespace: volcano-system
    

    So I should be using a global SLA of 5 minutes.
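
    For reference, the sla plugin is also supposed to accept a per-job waiting time via a job annotation, which I have not tested here. A sketch, assuming the sla-waiting-time annotation key matches the scheduler argument above:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: per-job-sla-demo           # illustrative name
      annotations:
        sla-waiting-time: 10m          # assumed per-job override of the global SLA
    spec:
      schedulerName: volcano
      minAvailable: 1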

    Then, I prepared a test: fill up the node with some jobs, then queue a big job, then queue a bunch of smaller jobs after it:

    # Clean up
    kubectl delete job -l app=volcanotest
    
    # Make 10 10 core jobs that will block out our test job for at least 2 minutes
    # Make sure they don't all finish at once.
    rm -f jobs_before.yml
    for NUM in {1..10} ; do
    cat >>jobs_before.yml <<EOF
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: presleep${NUM}
      labels:
        app: volcanotest
    spec:
      template:
        spec:
          schedulerName: volcano
          nodeSelector:
            kubernetes.io/hostname: k1.kube
          containers:
          - name: main
            image: ubuntu:20.04
            command: ["sleep",  "$(( $RANDOM % 20 + 120 ))"]
            resources:
              limits:
                memory: 300M
                cpu: 10000m
                ephemeral-storage: 1G
              requests:
                memory: 300M
                cpu: 10000m
                ephemeral-storage: 1G
          restartPolicy: Never
      backoffLimit: 4
      ttlSecondsAfterFinished: 1000
    ---
    EOF
    done
    
    # And 200 10 core jobs that, if they all pass it, will keep it blocked out for 20 minutes
    # We expect it really to be blocked like 5-7-10 minutes if the SLA plugin is working.
    rm -f jobs_after.yml
    for NUM in {1..200} ; do
    cat >>jobs_after.yml <<EOF
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: postsleep${NUM}
      labels:
        app: volcanotest
    spec:
      template:
        spec:
          schedulerName: volcano
          nodeSelector:
            kubernetes.io/hostname: k1.kube
          containers:
          - name: main
            image: ubuntu:20.04
            command: ["sleep",  "$(( $RANDOM % 20 + 60 ))"]
            resources:
              limits:
                memory: 300M
                cpu: 10000m
                ephemeral-storage: 1G
              requests:
                memory: 300M
                cpu: 10000m
                ephemeral-storage: 1G
          restartPolicy: Never
      backoffLimit: 4
      ttlSecondsAfterFinished: 1000
    ---
    EOF
    done
    
    # And the test job itself between them.
    rm -f job_middle.yml
    cat >job_middle.yml <<EOF
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: middle
      labels:
        app: volcanotest
    spec:
      template:
        spec:
          schedulerName: volcano
          nodeSelector:
            kubernetes.io/hostname: k1.kube
          containers:
          - name: main
            image: ubuntu:20.04
            command: ["sleep", "1"]
            resources:
              limits:
                memory: 300M
                cpu: 50000m
                ephemeral-storage: 1G
              requests:
                memory: 300M
                cpu: 50000m
                ephemeral-storage: 1G
          restartPolicy: Never
      backoffLimit: 4
      ttlSecondsAfterFinished: 1000
    EOF
    
    kubectl apply -f jobs_before.yml
    sleep 10
    kubectl apply -f job_middle.yml
    sleep 10
    CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
    kubectl apply -f jobs_after.yml
    # Wait for it to finish
    COMPLETION_TIME=""
    while [[ -z "${COMPLETION_TIME}" ]] ; do
        sleep 10
        COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}')"
    done
    echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"
    

    I observed jobs from jobs_after.yml being scheduled even when the job from job_middle.yml had had its pod pending for 10 minutes, which is double the global SLA time that should be enforced.

    What you expected to happen:

    There shouldn't be much more than 5 minutes between the creation and completion times for the large middle job. Once the job pod from job_middle.yml has been pending for 5 minutes, no more job pods from jobs_after.yml should be scheduled by Volcano until job_middle.yml has been scheduled.

    How to reproduce it (as minimally and precisely as possible): Use the Volcano Helm chart and the above configmap override, run kubectl -n volcano-system delete pod "$(kubectl get pod -n volcano-system | grep volcano-scheduler | cut -f1 -d' ')" to bounce the scheduler pod after reconfiguring it, and use the above Bash code to generate test jobs. Adjust the hostname label selectors and job sizes as needed to fill the test cluster node you are using.

    Anything else we need to know?:

    Is the SLA plugin maybe not smart enough to clear out space for a job to meet the SLA on a node that matches its selectors? Are other plugins in the config maybe scheduling things that the SLA plugin has decided shouldn't be scheduled yet?

    The scheduler pod logs don't seem to include the string "sla", but they log a bunch for every pod that's waiting every second, so I might not be able to see the startup logs or every single line ever logged.

    The jobs are definitely getting PodGroups created for them. Here's the PodGroup description for the middle job when it should have been run according to the SLA but has not yet been:

    Name:         podgroup-31600c19-2282-47f1-934b-94026d88db1e
    Namespace:    vg
    Labels:       <none>
    Annotations:  <none>
    API Version:  scheduling.volcano.sh/v1beta1
    Kind:         PodGroup
    Metadata:
      Creation Timestamp:  2021-12-13T22:06:25Z
      Generation:          2
      Managed Fields:
        API Version:  scheduling.volcano.sh/v1beta1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:ownerReferences:
          f:spec:
            .:
            f:minMember:
            f:minResources:
              .:
              f:cpu:
              f:ephemeral-storage:
              f:memory:
            f:priorityClassName:
          f:status:
        Manager:      vc-controller-manager
        Operation:    Update
        Time:         2021-12-13T22:06:25Z
        API Version:  scheduling.volcano.sh/v1beta1
        Fields Type:  FieldsV1
        fieldsV1:
          f:status:
            f:conditions:
            f:phase:
        Manager:    vc-scheduler
        Operation:  Update
        Time:       2021-12-13T22:06:26Z
      Owner References:
        API Version:           batch/v1
        Block Owner Deletion:  true
        Controller:            true
        Kind:                  Job
        Name:                  middle
        UID:                   31600c19-2282-47f1-934b-94026d88db1e
      Resource Version:        122332555
      Self Link:               /apis/scheduling.volcano.sh/v1beta1/namespaces/vg/podgroups/podgroup-31600c19-2282-47f1-934b-94026d88db1e
      UID:                     8bee9cca-40d5-47b5-90e7-ebb1bc70059a
    Spec:
      Min Member:  1
      Min Resources:
        Cpu:                  50
        Ephemeral - Storage:  1G
        Memory:               300M
      Priority Class Name:    medium-priority
      Queue:                  default
    Status:
      Conditions:
        Last Transition Time:  2021-12-13T22:06:26Z
        Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
        Reason:                NotEnoughResources
        Status:                True
        Transition ID:         86f1b151-92dd-4893-bcd3-c2573b3029fc
        Type:                  Unschedulable
      Phase:                   Inqueue
    Events:
      Type     Reason         Age                   From     Message
      ----     ------         ----                  ----     -------
      Warning  Unschedulable  64s (x1174 over 21m)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    

    Environment:

    • Volcano Version: v1.4.0
    • Kubernetes version (use kubectl version):
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
    
    • Cloud provider or hardware configuration: Nodes are hosted on AWS instances.
    • OS (e.g. from /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"
    
    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"
    
    • Kernel (e.g. uname -a):
    Linux master.kube 5.8.7-1.el7.elrepo.x86_64 #1 SMP Fri Sep 4 13:11:18 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
    
    • Install tools:
    helm version
    version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.16.10"}
    
    • Others:
  • Pass conformance test

    Is this a BUG REPORT or FEATURE REQUEST?:

    /kind feature

    Description:

    Cherry-pick the related kube-batch PR to volcano-sh/kube-batch for the conformance test.

    /cc @asifdxtreme

  • add admitPod and PGController

    Which issue(s) this PR fixes: Fixes #135, #134

    Special notes for your reviewer:

    1. New func AdmitPod in the admission controller

    2. New PGController in the controller

    3. Delete the Inqueue job phase

    4. Fix the UTs

    Release note:

    
    1. Add ValidatingWebhookConfiguration volcano-validate-pod, which only restricts CREATE on pods; a pod is allowed to be created when:
    - pod.spec.schedulerName is default-scheduler
    - the podgroup phase isn't Pending
    - it is a normal job with no podgroup
    
    2. Add a new PGController that creates a podgroup for normal jobs when kube-batch is used.
    
    3. When a job is created, the job phase will go Pending -> Running -> ..., so fix the UTs.
    
    
  • Fair sharing not working

    What happened: My cluster has 11 CPUs in total. I'm creating 2 queues (excluding the default queue) with a weight of 5 for each queue. Queue manifest:

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: test
    spec:
      weight: 5
    
    ---
    
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: test1
    spec:
      weight: 5
    

    Queue List,

    Name                     Weight  State   Inqueue Pending Running Unknown
    default                  1       Open    0       0       0       0
    test                     5       Open    0       0       0       0
    test1                    5       Open    0       0       0       0
    

    Created 3 jobs in the test queue with the following CPU requests: job1 -> 5 CPUs, job2 -> 5 CPUs, job3 -> 1 CPU.

    Now all 3 jobs are running and utilizing the full cluster.

    Now I'm creating a new job in the test1 queue requesting 2 CPUs. I'm expecting 1 job to be evicted from the test queue so that the job in the test1 queue can run. But the job in the test1 queue stays in the Inqueue state.

    Name                     Weight  State   Inqueue Pending Running Unknown
    default                  1       Open    0       0       0       0
    test                     5       Open    0       0       3       0
    test1                    5       Open    1       0       0       0
    

    Configuration,

    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
    

    What you expected to happen: I'm expecting 1 job to be evicted from the test queue and the job in the test1 queue to start running; instead, the job in the test1 queue stays in the Inqueue state. How to reproduce it (as minimally and precisely as possible):

    Anything else we need to know?:
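
    One thing worth checking: evicting running jobs so another queue can take back its share is driven by the reclaim action, which is absent from the actions line above. A sketch of the configuration with it enabled (the action order is illustrative, and whether this resolves the behavior here is untested):

    actions: "enqueue, allocate, backfill, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion        # proportion computes each queue's deserved share from its weight
      - name: nodeorder
      - name: binpack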

    Environment:

    • Volcano Version: v1.3.0
    • Kubernetes version (use kubectl version):
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):
    • Install tools:
    • Others:
  • Error occur when execute the same task many times

    What happened: When testing the performance of the Spark-native integration with Volcano by executing the same task 50 times at a 3-second interval, half of the tasks failed.

    What you expected to happen: All tasks should execute one by one, and all of them should succeed.

    How to reproduce it (as minimally and precisely as possible): Execute the task with the following script.

    #!/bin/bash
    s=0
    for ((i=1;i<=50;i=i+1))
    do
         nohup ${SPARK_HOME1}/bin/spark-submit \
         --master k8s://https://10.32.226.132:6443 \
         --deploy-mode cluster \
         --class cn.cestc.test.JavaSparkReadHiveInHahadoopForComponent \
         --driver-cores 1 \
         --driver-memory 2G \
         --num-executors 1 \
         --executor-cores 1 \
         --executor-memory 2G \
         --name native_modetask2 \
         --jars hdfs://dev-host-03:8082/dolphinscheduler_arm/supportarm/resources/performance1/mysql-connector-java-8.0.29.jar \
         --conf spark.executor.instances=1 \
         --conf spark.kubernetes.namespace=support132x86 \
         --conf spark.kubernetes.authenticate.driver.serviceAccountName=support132x86 \
         --conf spark.kubernetes.container.image=10.32.226.224:85/public-release/spark-volcano:3.3.0  \
         --conf spark.kubernetes.scheduler.name=volcano  \
         --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/tmp/podgroup-template.yaml \
         --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
         --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
      hdfs://dev-host-03:8082/dolphinscheduler_x86/support132x86/resources/performance1/SparkJupiterTest-2.0.jar\
     datacluster hdfs://datacluster/user/hive/warehouse \
    thrift://dev-host-01:9083,thrift://dev-host-02:9083,thrift://dev-host-03:9083\
     dev-host-01:8082 dev-host-03:8082 testdata\
     hdfs://datacluster/dolphinscheduler_arm/supportarm/resources/performance1/yuanj2.sql \
    10.32.226.69 30306 testdata root CESTC1Dhr7El67KD3jG@ out_test_task2 2000 > log-$i.log &
         sleep 3s
    done
    
    

    podgroup-template.yaml

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    spec:
      # Specify minMember to 1 to make a driver pod
      minMember: 1
      # Specify minResources to support resource reservation (the driver pod resource and executors pod resource should be considered)
      # It is useful for ensuring the available resources meet the minimum requirements of the Spark job, avoiding the
      # situation where drivers are scheduled and then unable to schedule sufficient executors to progress.
      minResources:
        cpu: "10"
        memory: "20G"
      # Specify the priority, help users to specify job priority in the queue during scheduling.
      priorityClassName: system-node-critical
      # Specify the queue, indicates the resource queue which the job should be submitted to
      queue: support132x86 
    

    Anything else we need to know?: The error task log:

    [INFO] 2022-12-29 16:46:15.060 [TaskLogInfo- - [taskAppId=TASK-7947514526080_14-1674-6086]-getOutputLogService]  -  -> 22/12/29 16:46:14 ERROR Client: Please check "kubectl auth can-i create pod" first. It should be yes.
    	Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.32.226.132:6443/api/v1/namespaces/support132x86/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to create pod <support132x86/spkj-6086-driver> as the podgroup phase is Pending. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to create pod <support132x86/spkj-6086-driver> as the podgroup phase is Pending, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)
    		at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:305)
    		at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
    		at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
    		at io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
    		at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:152)
    		at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)
    		at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
    		at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2764)
    		at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)
    		at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)
    		at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
    		at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    		at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    		at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    		at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    		at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    		at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    	22/12/29 16:46:14 INFO ShutdownHookManager: Shutdown hook called
    	22/12/29 16:46:15 INFO ShutdownHookManager: Deleting directory /tmp/spark-8da822e9-7f18-433a-886e-eb8b56529792
    [INFO] 2022-12-29 16:46:16.152 [TaskLogInfo- - [taskAppId=TASK-7947514526080_14-1674-6086]-getOutputLogService]  - FINALIZE_SESSION
    [INFO] 2022-12-29 16:46:16.162 [TaskLogInfo- - [taskAppId=TASK-7947514526080_14-1674-6086]]  - process has exited, execute path:/tmp/dolphinschedu
    

    Environment:

    • Volcano Version: latest
    • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.10", GitCommit:"98d5dc5d36d34a7ee13368a7893dcb400ec4e566", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.10", GitCommit:"98d5dc5d36d34a7ee13368a7893dcb400ec4e566", GitTreeState:"clean", BuildDate:"2021-04-15T03:20:25Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
    
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):Linux master 5.4.121-1.el7.elrepo.x86_64 #1 SMP Thu May 20 19:22:37 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
    • Install tools:
    • Others:
  • remove gox

    Signed-off-by: hwdef [email protected]

    Removing the gox tool has the following benefits:

    • Simplifies the project build
    • Reduces the packages the project depends on
    • Makes compilation smoother in China's network environment, where installing gox is likely to fail
  • Preempt does not follow minAvailable.

    What happened:

    When I tried the preempt action with the priority plugin, I set minAvailable: 1 in a high-priority job that has 4 tasks, and expected that Volcano would evict only 1 task of the low-priority job. But the result was that all 4 tasks from the high-priority job started and evicted 4 low-priority tasks, as below:

    NAME               READY   STATUS        RESTARTS   AGE
    job-high-task0-0   0/1     Pending       0          3s
    job-high-task0-1   0/1     Pending       0          3s
    job-high-task1-0   0/1     Pending       0          3s
    job-high-task1-1   0/1     Pending       0          3s
    job-low-task0-0    1/1     Terminating   0          13m
    job-low-task0-1    1/1     Terminating   0          13m
    job-low-task1-0    1/1     Terminating   0          13m
    job-low-task1-1    1/1     Terminating   0          14m
    

    What you expected to happen:

    How to reproduce it (as minimally and precisely as possible):

    volcano-scheduler-configmap was set as below:

        actions: "enqueue, allocate, preempt, backfill"
        tiers:
        - plugins:
          - name: priority
          - name: gang
            enableJobStarving: false
            enablePreemptable: false
          - name: conformance
        - plugins:
          - name: drf
            enablePreemptable: false
          - name: predicates
          - name: nodeorder
          - name: binpack
    
    1. Set up the priority classes:
    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    high                      1000         false            49d
    low                       10           false            31h
    
    2. Apply a low-priority job with 4 tasks, filling the cluster:
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: job-low
    spec:
      schedulerName: volcano
      minAvailable: 4
      queue: default
      priorityClassName: low
      tasks:
        - replicas: 2
          name: "task0"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 99999"]
                  imagePullPolicy: IfNotPresent
                  name: task
                  resources:
                    limits:
                      cpu: 3750m
                    requests:
                      cpu: 3750m
              restartPolicy: OnFailure
        - replicas: 2
          name: "task1"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 99999"]
                  imagePullPolicy: IfNotPresent
                  name: task
                  resources:
                    limits:
                      cpu: 3750m
                    requests:
                      cpu: 3750m
              restartPolicy: OnFailure
    
    3. Apply a high-priority job with 4 tasks but minAvailable: 1:
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: job-high
    spec:
      schedulerName: volcano
      minAvailable: 1
      queue: default
      priorityClassName: high
      tasks:
        - replicas: 2
          name: "task0"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 99999"]
                  imagePullPolicy: IfNotPresent
                  name: task
                  resources:
                    limits:
                      cpu: 3750m
                    requests:
                      cpu: 3750m
              restartPolicy: OnFailure
        - replicas: 2
          name: "task1"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 99999"]
                  imagePullPolicy: IfNotPresent
                  name: task
                  resources:
                    limits:
                      cpu: 3750m
                    requests:
                      cpu: 3750m
              restartPolicy: OnFailure
    
    4. Wait for preemption.

    Anything else we need to know?:

    Environment:

    • Volcano Version: latest
    • Kubernetes version (use kubectl version): 1.21
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):
    • Install tools:
    • Others:
  • allocateIdleResource Method Calculate Node Idle Bugs

    What happened: When using Volcano to allocate tasks to a node, UnexpectedAdmissionError events appear on the pods. The Volcano scheduler keeps assigning pods to the same node despite its lack of GPU resources. As a result, many jobs failed in a short period of time.

    Here is the error message:

    Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 4, Available: 3, which is unexpected
    

    What you expected to happen: Node info in the vc-scheduler should be calculated precisely and correctly, so that tasks are not scheduled onto the node.

    How to reproduce it (as minimally and precisely as possible): At a particular moment, there was a node with 4 GPUs, 3 healthy and 1 unhealthy, in use by an inference task. In the scheduler log, the node's idle resources showed 3 GPUs while its used resources showed 4 GPUs, causing the bug.

    Anything else we need to know?:

    // allocateIdleResource subtracts a task's resource request from the node's idle resources.
    func (ni *NodeInfo) allocateIdleResource(ti *TaskInfo) error {
    	// Only subtract when the request fits entirely within the remaining idle resources.
    	if ti.Resreq.LessEqual(ni.Idle, Zero) {
    		ni.Idle.Sub(ti.Resreq)
    		return nil
    	}
    
    	// Otherwise fail without touching ni.Idle, leaving it at its previous value.
    	return &AllocateFailError{Reason: fmt.Sprintf(
    		"cannot allocate resource, <%s> idle: %s <%s/%s> req: %s",
    		ni.Name, ni.Idle.String(), ti.Namespace, ti.Name, ti.Resreq.String(),
    	)}
    }
    

    In pkg/scheduler/api/node_info.go, when Resreq is larger than the node's idle resources, the node's idle value is not set to the correct value.
