The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes

DGL Operator

The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes, in either distributed or non-distributed mode. For an introduction to DGL and the philosophy behind DGL distributed training, see the DGL documentation.

🛠 Prerequisites

  • Kubernetes >= 1.16

🚀 Installation

You can deploy the operator with default settings by running the following commands:

git clone https://github.com/Qihoo360/dgl-operator
cd dgl-operator
kubectl create -f deploy/v1alpha1/dgl-operator.yaml

You can check whether the DGLJob custom resource definition (CRD) is installed via:

kubectl get crd

The output should include dgljobs.qihoo.net like the following:

NAME                                       AGE
...
dgljobs.qihoo.net                          1m
...
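
To check for the CRD by name instead of scanning the full list, something like the following may help. It is a minimal sketch that assumes `kubectl` is pointed at the cluster where the operator was deployed; the function name `check_dgl_crd` is only illustrative.

```shell
# Look up the DGLJob CRD by name; kubectl exits non-zero if it is absent.
check_dgl_crd() {
  if kubectl get crd dgljobs.qihoo.net >/dev/null 2>&1; then
    echo "DGLJob CRD is installed"
  else
    echo "DGLJob CRD not found" >&2
    return 1
  fi
}
```

Run `check_dgl_crd` after deploying the operator; a non-zero return suggests the `kubectl create` step did not complete.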

🔬 Creating a DGL Job

You can create a DGL job by defining a DGLJob config file. See the example config files GraphSAGE.yaml (single-node) and GraphSAGE_dist.yaml (multi-node) for launching a GraphSAGE training job. You may change the config file based on your requirements.

# standalone GraphSAGE
cat examples/v1alpha1/GraphSAGE.yaml
# or a distributed version
cat examples/v1alpha1/GraphSAGE_dist.yaml

Deploy the DGLJob resource to start training:

# standalone GraphSAGE
kubectl create -f examples/v1alpha1/GraphSAGE.yaml
# or a distributed version
kubectl create -f examples/v1alpha1/GraphSAGE_dist.yaml
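
Once the job is created, its progress can be followed with standard `kubectl` commands. The sketch below assumes the job is named `dgl-graphsage`, runs in the `dgl-operator` namespace, and has a launcher pod called `dgl-graphsage-launcher`; these names are taken from user-reported logs, not from the example YAML files, so adjust them to your deployment. The function name `dgl_job_status` is only illustrative.

```shell
# Assumed names; adjust them to match your deployment.
NAMESPACE="dgl-operator"
JOB_NAME="dgl-graphsage"

# List DGLJob resources and pods, then print the launcher pod's logs.
dgl_job_status() {
  kubectl get dgljobs -n "$NAMESPACE"
  kubectl get pods -n "$NAMESPACE"
  kubectl logs "${JOB_NAME}-launcher" -n "$NAMESPACE"
}
```

The launcher logs are where the "Phase N/5" progress messages appear, so they are usually the first place to look when a job fails.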

💭 Reference

Please check out these previous works that helped inspire the creation of DGL Operator

Owner
Qihoo 360
360 official github
Comments
  • No such file or directory: '/etc/dgl/hostfile'


    Deploying examples/v1alpha1/GraphSAGE_dist.yaml fails.

    Error from the dgl-graphsage-launcher pod:

    Phase 3/5: dispatch partitions
    ----------
    Traceback (most recent call last):
    File "tools/dispatch.py", line 102, in <module>
    main()
    File "tools/dispatch.py", line 44, in main
    with open(args.ip_config) as f:
    FileNotFoundError: [Errno 2] No such file or directory: '/etc/dgl/hostfile'
    ----------
    Phase 3/5 error raised
    

    Error from the dgl-operator pod:

    2021-08-14T09:45:48.716Z	INFO	controllers.DGLJob	Finished reconciling job	{"dgljob": "dgl-operator/dgl-graphsage", "dgl-operator/dgl-graphsage": "80.81µs"}
    2021-08-14T09:45:48.722Z	ERROR	controllers.DGLJob	unable to fetch DGLJob	{"dgljob": "dgl-operator/dgl-graphsage", "error": "DGLJob.qihoo.net \"dgl-graphsage\" not found"}
    github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132
    github.com/Qihoo360/dgl-operator/controllers.(*DGLJobReconciler).Reconcile
    /workspace/controllers/dgljob_controller.go:115
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:297
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:252
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:215
    k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185
    k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
    k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
    k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
    k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185
    k8s.io/apimachinery/pkg/util/wait.UntilWithContext
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99
    
  • Support launcher workload


    • Support workload on the launcher to do small scale job
    • Can skip the partitioning
    • Fix naming of PartitionModeSkip
    • Add launcher-workload support on dglrun
    • Add "node_classification" example
  • Migrate to kubebuilder v3


    • Migrate to kubebuilder v3
    • Apply CRD apiversion on v1
    • Update dgl-operator.yaml
    • Update GraphSAGE example code
    • Update dglrun
    • Add "Running" as cleanPodPolicy default value
  • This will download 1.38GB. Will you proceed? (y/N)


    Phase 1/5: load and partition graph

    Using backend: pytorch
    Partition arguments: Namespace(balance_edges=True, balance_train=True, dataset_url='http://192.168.12.218:8000/ogbn_products.zip', graph_name='graphsage', num_parts=2, output='/dgl_workspace/dataset', part_method='metis', rel_data_path='dataset', undirected=False, workspace='/dgl_workspace')
    Download http://192.168.12.218:8000/ogbn_products.zip
    Extract /dgl_workspace/dataset/ogbn_products.zip
    load ogbn-products
    This will download 1.38GB. Will you proceed? (y/N)
    Traceback (most recent call last):
      File "code/load_and_partition_graph.py", line 107, in <module>
        g, _ = load_dataset('ogbn-products', args.output, args.dataset_url)
      File "code/load_and_partition_graph.py", line 33, in load_dataset
        data = DglNodePropPredDataset(name=name, root=work_dir)
      File "/usr/local/lib/python3.6/site-packages/ogb/nodeproppred/dataset_dgl.py", line 69, in __init__
        self.pre_process()
      File "/usr/local/lib/python3.6/site-packages/ogb/nodeproppred/dataset_dgl.py", line 98, in pre_process
        if decide_download(url):
      File "/usr/local/lib/python3.6/site-packages/ogb/utils/url.py", line 17, in decide_download
        return input("This will download %.2fGB. Will you proceed? (y/N)\n" % (size)).lower() == "y"
    EOFError: EOF when reading a line
    WARNING:root:The OGB package is out of date. Your version is 1.3.0, while the latest version is 1.3.5.

    Phase 1/5 error raised

  • Running kubectl create -f examples/v1alpha1/GraphSAGE_dist.yaml reports an error

    Phase 3/5: dispatch partitions

    Traceback (most recent call last):
      File "tools/dispatch.py", line 102, in <module>
        main()
      File "tools/dispatch.py", line 52, in main
        with open(args.part_config) as conf_f:
    FileNotFoundError: [Errno 2] No such file or directory: '/dgl_workspace/dataset/graphsage.json'

    Phase 3/5 error raised

  • Large dataset (products) can't be delivered

    Phase 2/5: deliver partitions

    • POD_NAME=dgl-sample-cuda-test-214-launcher -c watcher-loop-partitioner
    • shift
    • /opt/kube/kubectl exec dgl-sample-cuda-test-214-launcher -c watcher-loop-partitioner -- /bin/sh -c mkdir -p /dgl_workspace
    • /opt/kube/kubectl cp /dgl_workspace/dataset dgl-sample-cuda-test-214-launcher:/dgl_workspace -c watcher-loop-partitioner

    E0214 08:27:30.469178 152 v2.go:105] write tcp 12.100.12.173:49690->12.96.0.1:443: use of closed network connection
    error: Internal error occurred: error executing command in container: read unix @->/var/run/docker.sock: read: connection reset by peer
    sleep
    wake
    Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
    12.100.0.117 30050 dgl-sample-cuda-test-214-launcher
    ['12.100.0.117', '30050', 'dgl-sample-cuda-test-214-launcher']
    copy /dgl_workspace/dataset to dgl-sample-cuda-test-214-launcher:/dgl_workspace
    Traceback (most recent call last):
      File "/dgl_workspace/tools/launch.py", line 284, in <module>
        main()
      File "/dgl_workspace/tools/launch.py", line 253, in main
        run_cp_container(args)
      File "/dgl_workspace/tools/launch.py", line 105, in run_cp_container
        kubecp_container(source_file_path, pod_name, args.target_dir, args.container)
      File "/dgl_workspace/tools/launch.py", line 46, in kubecp_container
        subprocess.check_call(cmd, shell = True)
      File "/home/gnn/conda/envs/gnn/lib/python3.7/subprocess.py", line 363, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'set -x; /opt/kube/kubectl cp /dgl_workspace/dataset dgl-sample-cuda-test-214-launcher:/dgl_workspace -c watcher-loop-partitioner' returned non-zero exit status 1.

    Phase 2/5 error raised

  • deliver partitions error


    Phase 2/5: deliver partitions

    • POD_NAME=dgl-graphsage-launcher -c watcher-loop-partitioner
    • shift
    • /opt/kube/kubectl exec dgl-graphsage-launcher -c watcher-loop-partitioner -- /bin/sh -c mkdir -p /dgl_workspace

    error: unable to upgrade connection: container not found ("watcher-loop-partitioner")
    Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
    12.100.10.0 30050 dgl-graphsage-launcher
    ['12.100.10.0', '30050', 'dgl-graphsage-launcher']
    Traceback (most recent call last):
      File "tools/launch.py", line 280, in <module>
        main()
      File "tools/launch.py", line 252, in main
        run_cp_container(args)
      File "tools/launch.py", line 103, in run_cp_container
        kubexec_container(f'mkdir -p {args.target_dir}', pod_name, args.container)
      File "tools/launch.py", line 31, in kubexec_container
        subprocess.check_call(cmd, shell = True)
      File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'sh /etc/dgl/kubexec.sh 'dgl-graphsage-launcher -c watcher-loop-partitioner' 'mkdir -p /dgl_workspace'' returned non-zero exit status 1.

    Phase 2/5 error raised

    Sometimes a different error occurs instead:

    Phase 2/5: deliver partitions

    Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
    30050 dgl-graphsage-launcher
    ['30050', 'dgl-graphsage-launcher']
    Traceback (most recent call last):
      File "tools/launch.py", line 280, in <module>
        main()
      File "tools/launch.py", line 252, in main
        run_cp_container(args)
      File "tools/launch.py", line 100, in run_cp_container
        for pod_info in get_ip_host_pairs(args.ip_config):
      File "tools/launch.py", line 64, in get_ip_host_pairs
        raise RuntimeError("Format error of ip_config.")
    RuntimeError: Format error of ip_config.

    The entry in /etc/dgl/leadfile appears to be missing its IP address.

    On another cluster:

    Phase 2/5: deliver partitions

    • POD_NAME=dgl-graphsage-launcher -c watcher-loop-partitioner
    • shift
    • /opt/kube/kubectl exec dgl-graphsage-launcher -c watcher-loop-partitioner -- /bin/sh -c mkdir -p /dgl_workspace

    error: unable to upgrade connection: error dialing backend: dial tcp 127.0.0.1:34248: connect: connection timed out
    connection error
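
For the `Format error of ip_config` failure reported above, a quick sanity check of the file can help narrow down the cause. The sketch below assumes each non-empty line of the file holds exactly three whitespace-separated fields (IP, port, pod name), which is inferred from the log output in these reports rather than from the operator's source; the function name `check_ip_config` is only illustrative.

```shell
# Report lines of an ip_config-style file that do not have exactly three
# whitespace-separated fields (assumed layout: IP PORT POD_NAME).
# Returns non-zero if any malformed line is found.
check_ip_config() {
  awk 'NF > 0 && NF != 3 {
         printf "line %d: expected 3 fields, got %d: %s\n", NR, NF, $0
         bad = 1
       }
       END { exit bad + 0 }' "$1"
}
```

For example, `check_ip_config /etc/dgl/leadfile` could be run inside the launcher pod before the deliver phase; a line such as `30050 dgl-graphsage-launcher` would be flagged because the IP field is missing.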