Compute cluster (HPC) job submission library for Go (#golang) based on the open DRMAA standard.

go-drmaa

This is a job submission library for Go (#golang) that is compatible with the DRMAA standard. The Go library is a wrapper around the DRMAA C library implementation provided by many distributed resource managers (cluster schedulers).

The library was developed using Univa Grid Engine's libdrmaa.so. It was tested with Grid Engine, Torque, and SLURM, but it should also work with other resource managers / cluster schedulers that provide libdrmaa.so.

The "gestatus" subpackage only works with Grid Engine (some values are only available on Univa Grid Engine).

The DRMAA (Distributed Resource Management Application API) standard is now also available in version 2. DRMAA2 provides more functionality for cluster monitoring and job session management. DRMAA and DRMAA2 are not compatible, so both libraries are expected to coexist for a while. The Go DRMAA2 can be found here.

Note: Univa Grid Engine 8.3.0 and later added new functions that allow you to submit a job on behalf of another user. This helps when building a DRMAA service (such as a web portal) that submits jobs. This functionality is available in the UGE83_sudo branch: https://github.com/dgruber/drmaa/tree/UGE83_sudo The functions are RunJobsAs(), RunBulkJobsAs(), and ControlAs().

Compilation

First download the package:

   export GOPATH=${GOPATH:-~/src/go}
   mkdir -p $GOPATH
   go get -d github.com/dgruber/drmaa
   cd $GOPATH/src/github.com/dgruber/drmaa

Next, we need to compile the code.

For Univa Grid Engine and original SGE:

   source /path/to/grid/engine/installation/default/settings.sh
   ./build.sh
   cd examples/simplesubmit
   go build
   export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
   ./simplesubmit

For Son of Grid Engine ("loveshack"):

   source /path/to/grid/engine/installation/default/settings.sh
   ./build.sh --sog
   cd examples/simplesubmit
   go build
   export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
   ./simplesubmit

For Torque:

If your Torque drmaa.h header file is not located under /usr/include/torque, you will have to modify the build.sh script before running it.

   ./build.sh --torque
   cd examples/simplesubmit
   go build
   ./simplesubmit

For SLURM and the updated SLURM C drmaa binding:

   ./build.sh --slurm /usr/local

The example program submits a sleep job into the system and prints out detailed job information as soon as the job is started.

Short Introduction to Go DRMAA

Go DRMAA applications must open a DRMAA session before any DRMAA calls can be made. Opening a session usually establishes a connection to the cluster scheduler (distributed resource manager). Once no more DRMAA calls are needed, the session's Exit() method must be called to tear down the connection. If an application does not call Exit(), a communication handle can remain open on the cluster scheduler side, and it can take a while until it is removed automatically. Always call Exit(). The Go defer statement is convenient for this, but remember that deferred functions are not executed when os.Exit() is called.
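
Because deferred calls are skipped by os.Exit(), one common pattern is to keep the session handling in a helper function that returns an exit code, and to call os.Exit() only after that helper has returned. The following is a minimal sketch of that pattern, not code from this library: the session type is a hypothetical stand-in for a DRMAA session so that the example runs without libdrmaa.so, and the names run and errSubmit are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errSubmit is an illustrative error used to simulate a failed submission.
var errSubmit = errors.New("submission failed")

// session is a hypothetical stand-in for a DRMAA session. It only records
// whether Exit() was called, so the pattern can be shown without libdrmaa.so.
type session struct {
	closed bool
}

// Exit mimics tearing down the connection to the cluster scheduler.
func (s *session) Exit() error {
	s.closed = true
	return nil
}

// run performs the work and returns a process exit code instead of calling
// os.Exit() itself, so its deferred Exit() call is always executed.
func run(s *session, work func() error) int {
	defer s.Exit()
	if err := work(); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return 1
	}
	return 0
}

func main() {
	s := &session{}
	code := run(s, func() error { return nil })
	// The deferred Exit() has already run by the time os.Exit() is reached.
	fmt.Println("session closed:", s.closed)
	os.Exit(code)
}
```

With a real session the same shape applies: the function that opens the session defers Exit(), and only main() translates the returned code into os.Exit().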

Creating a DRMAA session:

s, err := drmaa.MakeSession()

Usually jobs and job workflows are submitted within DRMAA applications. In order to submit a job first a job template needs to be allocated:

jt, errJT := s.AllocateJobTemplate()
if errJT != nil {
   fmt.Printf("Error during allocating a new job template: %s\n", errJT)
   return
}

Under the hood, a C job template is allocated, which is outside the control of the Go runtime and its garbage collector. Hence the job template must be deleted explicitly when it is no longer needed. The Go defer statement is useful here as well.

// prevent memory leaks by freeing the allocated C job template at the end
defer s.DeleteJobTemplate(&jt)

The job template contains the specification of the job, such as the command to be executed and its parameters. These can be set with the setter methods of the job template.

// set the application to submit
jt.SetRemoteCommand("sleep")
// set the parameter (use SetArgs() when having more parameters)
jt.SetArg("1")

A job is submitted with the session's RunJob() method. If the same command should be executed many times, it makes sense to run it as a job array. In Grid Engine each instance is assigned a task ID, which the job can read from the SGE_TASK_ID environment variable (set to undefined for regular jobs). The task ID can be used to select the data set that the array job task needs to process. An array job is submitted with the RunBulkJobs() method.

jobID, errSubmit := s.RunJob(&jt)

// submitting 1000 instances of the same job
jobIDs, errBulkSubmit := s.RunBulkJobs(&jt, 1, 1000, 1)
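
On the job's side, selecting a data set by task ID might look like the sketch below. Only the SGE_TASK_ID variable comes from Grid Engine. The file naming scheme (data/input-0001.dat and so on) and the helper names taskID and inputFileFor are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// taskID parses the value of SGE_TASK_ID. It returns false for anything
// that is not a positive integer, which covers jobs that are not array tasks.
func taskID(value string) (int, bool) {
	id, err := strconv.Atoi(value)
	if err != nil || id < 1 {
		return 0, false
	}
	return id, true
}

// inputFileFor maps a task ID to an input file.
// The naming scheme is hypothetical.
func inputFileFor(id int) string {
	return fmt.Sprintf("data/input-%04d.dat", id)
}

func main() {
	id, ok := taskID(os.Getenv("SGE_TASK_ID"))
	if !ok {
		fmt.Println("not running as an array job task")
		return
	}
	fmt.Println("task", id, "processes", inputFileFor(id))
}
```

Submitted as the remote command of a bulk job with the range 1 to 1000, each task would then pick its own input file.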

A job state can also be changed (suspended / resumed / put in hold / deleted):

errTerm := s.TerminateJob(jobID)

The JobInfo data structure contains the runtime information of the job, such as the exit status or the amount of resources used (memory / IO / etc.). It can be obtained with the Wait() method.

jinfo, errWait := s.Wait(jobID, drmaa.TimeoutWaitForever)

For more details please consult the documentation and the DRMAA standard specifications.

More examples can be found on my blog at http://www.gridengine.eu.

Comments
  • Request idiomatic use of error interface

    The choice to return _drmaa.Error instead of error is non-standard. It results in the following strange behaviour when a (_drmaa.Error)(nil) is assigned to an error interface.

    http://play.golang.org/p/hdLuS-gra8

  • imports should use fully qualified repo

    We've done this in our fork (can't really do a pull request for this kind of change). Suggest you replace all the import "drmaa" in your subdirectories with the fully qualified repo URL, e.g., import "github.com/dgruber/drmaa".

  • Support for Slurm

    I did the following modification in order to support SLURM. It is based on the last stable release of PSNC DRMAA.

    Apparently the TORQUE binding has a similar issue: drmaa_get_num_attr_values(drmaa_attr_values_t* values, int *size) versus drmaa_get_num_attr_values(drmaa_attr_values_t* values, size_t *size). I have the same issue with drmaa_get_num_attr_names and applied the same treatment, but I do not have Torque available.

  • Updates to support building on darwin

    This commit adds support for using the arch script from UGE to determine library paths and adds unistd.h on MacOS builds in order to provide the gid_t definition.

  • Wait() does not retrieve exit status

    I found that Wait() was always returning a JobInfo with ExitStatus() giving 1, even when the job had exited with status 0. This seems to be fixed by changing line 651 in drmaa.go from drmaa_wifexited to drmaa_wexitstatus. The function drmaa_wifexited is already called on line 641.

  • Support alternate DRMAA location

    There was no way to specify an alternate drmaa location independently from the drmaa implementation selection.

    Note that there is room for improvement. For one, this option could be provided in addition to the implementation selection.

    As in:

    ./build.sh --drmaa <drmaa dir> --sge
    

    Also, if short options are acceptable, the bash builtin getopts could be used.
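
The behaviour described in the first comment above, about returning a concrete error type instead of error, is the general Go "typed nil" pitfall. The sketch below reproduces it with a hypothetical Error type standing in for the package's own error type. It is not the library's code, and the function names are illustrative. An interface value that holds a nil pointer is itself not nil.

```go
package main

import "fmt"

// Error is a hypothetical stand-in for a library-specific error type.
type Error struct {
	Message string
}

// Error implements the built-in error interface.
func (e *Error) Error() string { return e.Message }

// concrete returns the concrete pointer type; nil means "no error".
func concrete() *Error { return nil }

// idiomatic returns the error interface; nil is a true nil interface.
func idiomatic() error { return nil }

// wrap passes the concrete result on as an error interface, as callers
// that propagate errors typically do.
func wrap() error {
	return concrete() // the interface now holds (*Error)(nil)
}

func main() {
	fmt.Println(concrete() == nil)  // true
	fmt.Println(wrap() == nil)      // false: the interface has a type but a nil value
	fmt.Println(idiomatic() == nil) // true
}
```

Returning the plain error interface from exported functions avoids the surprise, because callers can then compare against nil in the usual way.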
