SciPipe

Robust, flexible and resource-efficient pipelines using Go and the commandline

Project links: Documentation & Main Website | Issue Tracker | Chat

Why SciPipe?

  • Intuitive: SciPipe works by flowing data through a network of channels and processes
  • Flexible: Wrapped command-line programs can be combined with processes in Go
  • Convenient: Full control over how your files are named
  • Efficient: Workflows are compiled to native binary code that runs fast
  • Parallel: Pipeline parallelism between processes as well as task parallelism for multiple inputs, making efficient use of multiple CPU cores
  • Supports streaming: Stream data between programs to avoid wasting disk space
  • Easy to debug: Use available Go debugging tools or just println()
  • Portable: Distribute workflows as Go code or as self-contained executable files

Project updates

Introduction

SciPipe is a library for writing Scientific Workflows, sometimes also called "pipelines", in the Go programming language.

When you need to run many commandline programs that depend on each other in complex ways, SciPipe helps by making the process of running these programs flexible, robust and reproducible. SciPipe also lets you restart an interrupted run without overwriting already produced output, and produces an audit report of what was run, among many other things.

SciPipe is built on the proven principles of Flow-Based Programming (FBP) to achieve maximum flexibility, productivity and agility when designing workflows. Compared to plain dataflow, FBP provides the benefit that processes are fully self-contained, so that a library of re-usable components can be created and plugged into new workflows ad-hoc.

Similar to other FBP systems, SciPipe workflows can be likened to a network of assembly lines in a factory, where items (files) flow through a network of conveyor belts, stopping at different independently running stations (processes) for processing.

SciPipe was initially created for problems in bioinformatics and cheminformatics, but works equally well for any problem involving pipelines of commandline applications.

Project status: SciPipe is pretty stable now, and only very minor API changes might still occur. We have successfully used SciPipe in a handful of both real and experimental projects, and it has seen occasional use outside the research group as well.

Known limitations

Hello World example

Let's look at an example workflow to get a feel for what writing workflows in SciPipe looks like:

package main

import (
    // Import SciPipe, aliased to sp
    sp "github.com/scipipe/scipipe"
)

func main() {
    // Init workflow and max concurrent tasks
    wf := sp.NewWorkflow("hello_world", 4)

    // Initialize processes and their output file extensions
    hello := wf.NewProc("hello", "echo 'Hello ' > {o:out|.txt}")
    world := wf.NewProc("world", "echo $(cat {i:in}) World > {o:out|.txt}")

    // Define data flow
    world.In("in").From(hello.Out("out"))

    // Run workflow
    wf.Run()
}

Running the example

Let's put the code in a file named hello_world.go and run it:

$ go run hello_world.go
AUDIT   2018/07/17 21:42:26 | workflow:hello_world             | Starting workflow (Writing log to log/scipipe-20180717-214226-hello_world.log)
AUDIT   2018/07/17 21:42:26 | hello                            | Executing: echo 'Hello ' > hello.out.txt
AUDIT   2018/07/17 21:42:26 | hello                            | Finished: echo 'Hello ' > hello.out.txt
AUDIT   2018/07/17 21:42:26 | world                            | Executing: echo $(cat ../hello.out.txt) World > hello.out.txt.world.out.txt
AUDIT   2018/07/17 21:42:26 | world                            | Finished: echo $(cat ../hello.out.txt) World > hello.out.txt.world.out.txt
AUDIT   2018/07/17 21:42:26 | workflow:hello_world             | Finished workflow (Log written to log/scipipe-20180717-214226-hello_world.log)

Let's check what files SciPipe has generated:

$ ls -1 hello*
hello.out.txt
hello.out.txt.audit.json
hello.out.txt.world.out.txt
hello.out.txt.world.out.txt.audit.json

As you can see, it has created the files hello.out.txt and hello.out.txt.world.out.txt, along with an accompanying .audit.json file for each of them.

Now, let's check the output of the final resulting file:

$ cat hello.out.txt.world.out.txt
Hello World

Now we can rejoice that it contains the text "Hello World", exactly as a proper Hello World example should :)

Now, those file names were a little long and cumbersome, weren't they? SciPipe gives you very good control over how your files are named, if you don't want to rely on the automatic file naming. For example, we could give the first file a static name, and then use that name as the basis for the file name of the second process, like so:

package main

import (
    // Import the SciPipe package, aliased to 'sp'
    sp "github.com/scipipe/scipipe"
)

func main() {
    // Init workflow with a name, and max concurrent tasks
    wf := sp.NewWorkflow("hello_world", 4)

    // Initialize processes and set output file paths
    hello := wf.NewProc("hello", "echo 'Hello ' > {o:out}")
    hello.SetOut("out", "hello.txt")

    world := wf.NewProc("world", "echo $(cat {i:in}) World >> {o:out}")
    // The modifier 's/.txt//' will replace '.txt' in the input path with ''
    world.SetOut("out", "{i:in|s/.txt//}_world.txt")

    // Connect network
    world.In("in").From(hello.Out("out"))

    // Run workflow
    wf.Run()
}

Now, if we run this, the file names get a little cleaner:

$ ls -1 hello*
hello.txt
hello.txt.audit.json
hello.txt.world.go
hello.txt.world.txt
hello.txt.world.txt.audit.json

The audit logs

Finally, let's have a look at one of the audit files that were created:

$ cat hello.txt.world.txt.audit.json
{
    "ID": "99i5vxhtd41pmaewc8pr",
    "ProcessName": "world",
    "Command": "echo $(cat hello.txt) World \u003e\u003e hello.txt.world.txt.tmp/hello.txt.world.txt",
    "Params": {},
    "Tags": {},
    "StartTime": "2018-06-15T19:10:37.955602979+02:00",
    "FinishTime": "2018-06-15T19:10:37.959410102+02:00",
    "ExecTimeNS": 3000000,
    "Upstream": {
        "hello.txt": {
            "ID": "w4oeiii9h5j7sckq7aqq",
            "ProcessName": "hello",
            "Command": "echo 'Hello ' \u003e hello.txt.tmp/hello.txt",
            "Params": {},
            "Tags": {},
            "StartTime": "2018-06-15T19:10:37.950032676+02:00",
            "FinishTime": "2018-06-15T19:10:37.95468214+02:00",
            "ExecTimeNS": 4000000,
            "Upstream": {}
        }
    }
}

Each such audit file contains a hierarchical JSON representation of the full workflow path that was executed in order to produce this file. On the first level is the command that directly produced the corresponding file, and then, indexed by their filenames under "Upstream", there are similar entries describing how each of its input files was generated. This repeats recursively for larger workflows, so that for each file generated by the workflow there is always a full, hierarchical history of all the commands run - with their associated metadata - to produce that file.
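
Since each audit file is plain, nested JSON, it is also easy to process programmatically. The following is a minimal, hypothetical sketch (not part of SciPipe's API) that reads one of the audit files above and recursively prints the chain of commands, using only the fields shown in the example:

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// auditInfo mirrors a subset of the fields seen in the .audit.json files above.
type auditInfo struct {
    ID          string
    ProcessName string
    Command     string
    Upstream    map[string]auditInfo
}

// printChain prints the command for one audit entry, then recurses into the
// entries describing how each of its input files was produced.
func printChain(a auditInfo, indent string) {
    fmt.Printf("%s%s: %s\n", indent, a.ProcessName, a.Command)
    for _, up := range a.Upstream {
        printChain(up, indent+"  ")
    }
}

func main() {
    data, err := os.ReadFile("hello.txt.world.txt.audit.json")
    if err != nil {
        panic(err)
    }
    var a auditInfo
    if err := json.Unmarshal(data, &a); err != nil {
        panic(err)
    }
    printChain(a, "")
}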

You can find many more examples in the examples folder in the GitHub repo.

For more information about how to write workflows using SciPipe, and much more, see the SciPipe website (scipipe.org)!

More material on SciPipe

Citing SciPipe

If you use SciPipe in academic or scholarly work, please cite the following paper as source:

Lampa S, Dahlö M, Alvarsson J, Spjuth O. SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines. GigaScience. 8, 5 (2019). DOI: 10.1093/gigascience/giz044

Acknowledgements

Related tools

Find below a few tools that are more or less similar to SciPipe and that are worth checking out before deciding on which tool fits you best (in approximate order of similarity to SciPipe):

Comments
  • Implement audit logging

    Implement some kind of structured data keeper for task info (for provenance etc):

    • Parameters
    • The command run
    • Previous tasks / parameters used to generate input files?
    • Execution time
    • SLURM execution time
    • ...
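
    A rough sketch of what such a structured task-info record could look like as a Go struct, based on the fields that later ended up in the .audit.json files shown further up (illustrative only, not necessarily SciPipe's actual AuditInfo type):

    package audit

    import "time"

    // TaskAudit is an illustrative sketch of a structured task-info record,
    // mirroring the fields seen in the .audit.json files further up.
    type TaskAudit struct {
    	ID          string
    	ProcessName string
    	Command     string                // the exact command that was run
    	Params      map[string]string     // parameters passed to the program
    	Tags        map[string]string     // other metadata, e.g. scheduler info
    	StartTime   time.Time
    	FinishTime  time.Time
    	ExecTimeNS  time.Duration         // execution time in nanoseconds
    	Upstream    map[string]*TaskAudit // how each input file was produced
    }
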
  • Can not use a variable in an absolute path

    Hello, I am rather new to both Go and SciPipe, but I ran into an issue I am not sure how to solve. I am trying to create a pipeline where I want to import all the files that are in a folder and use those files in other procedures.

    My thought was to first create a list with the filenames of the files in the specific folder, and then assign each of those to a variable and use that variable both for "targeting" the correct file (using an absolute path, for example) and for giving each output file a different name depending on the initial filename. My code goes like this:

    package main
    
    import (
    	"io/ioutil"
    	"log"
    
    	sp "github.com/scipipe/scipipe"
    )
    
    func main() {
    	files, err := ioutil.ReadDir(".")
    	if err != nil {
    		log.Fatal(err)
    	}
    
    	var td []string
    
    	for _, f := range files {
    		td = append(td, f.Name()) // Creating a list with the filenames
    	}
    	td = append(td[:0], td[1], td[2], td[3]) // I removed the .DS_Store file I was getting in the list
    
    	wf := sp.NewWorkflow("DB", 1)
    
    	for _, target := range td {
    		train_proc := wf.NewProc(target+"_train", `echo "$(cat ~/Desktop/Project/'$target')" > {o:out}`)
    		train_proc.SetOut("out", target+"_file.txt")
    	}
    
    	wf.Run()
    }
    

    I actually want to take the content of a file and copy it to an output file, but I can't find a way to point to the file with the specific name while using a variable in the absolute path of that file. In the SetOut stage, the target variable is replaced by the different values as expected.

    Thank you in advance
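
    One possible workaround (a sketch based only on the code above, not an official answer from the SciPipe docs) is to splice the filename into the command string on the Go side, for example with fmt.Sprintf, so that the shell never needs to expand a $target variable. The loop above would then become something like:

    // Requires adding "fmt" to the imports of the program above.
    for _, target := range td {
    	cmd := fmt.Sprintf(`cat ~/Desktop/Project/%s > {o:out}`, target)
    	train_proc := wf.NewProc(target+"_train", cmd)
    	train_proc.SetOut("out", target+"_file.txt")
    }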

  • Contributing to scipipe - working with github+golang

    I have tried out scipipe and hope to contribute to it.

    I am relatively new to golang and have a hard time testing out local changes I made, since my test main program keeps pulling in the zip-archived, versioned copy of scipipe. Hence it is not picking up the changes I made to the locally cloned copy of scipipe.

    I am familiar with the traditional way of working in C++ and Python, but Golang is quite challenging from a contribution perspective.

    Any advice/workflow/pointer/best-practices ?

    Cheers

  • Filename "" does not match expression [A-Za-z\/\.-_]+

    Hello, I am trying to use the StreamToSubStream functionality, but I am getting the error below:

    ERROR 2019/06/12 18:09:20 Filename "" does not match expression [A-Za-z\/\.-_]+

    I also get this error when I try to run your example workflows https://github.com/pharmbio/scipipe-demo/tree/fdb98884edb98a693c2892930c088cd723070691/dnacanceranalysis

    I believe the issue is in

    func (p *StreamToSubStream) Run() {
    	defer p.CloseAllOutPorts()
    
    	scipipe.Debug.Println("Creating new information packet for the substream...")
    	subStreamIP := scipipe.NewFileIP("")
    	scipipe.Debug.Printf("Setting in-port of process %s to IP substream field\n", p.Name())
    	subStreamIP.SubStream = p.In()
    
    	scipipe.Debug.Printf("Sending sub-stream IP in process %s...\n", p.Name())
    	p.OutSubStream().Send(subStreamIP)
    	scipipe.Debug.Printf("Done sending sub-stream IP in process %s.\n", p.Name())
    }
    
    Where `subStreamIP := scipipe.NewFileIP("")` is called.
    This triggers the error, due to the `checkFilename` call in `NewFileIP` in ip.go.

    Is there anything I am doing wrong? Thanks for any help you can provide.
  • Use temp folders instead of temp filename extension, for running jobs?

    This is sometimes needed when you cannot control the file name that is created but still need to check whether it exists, such as when unpacking a tarball with a known folder name in it.

    EDIT: Old title: Add option to turn off .tmp path usage

  • Make number of simultaneous tasks per process configurable

    Currently, a process will spawn as many tasks as there are incoming sets of data packets on in-ports. If running stuff locally, this might overbook the CPU.

    Probably the best option is to have a global pool of "run leases", that are handed out to processes as they ask for them, and then handed back when they are finished.
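
    A minimal sketch of such a pool of run leases, implemented as a counting semaphore on top of a buffered channel (illustrative only; not SciPipe's actual implementation, and the package name is made up):

    package scheduler

    // LeasePool hands out a fixed number of "run leases", so that at most
    // maxTasks tasks run at the same time across all processes.
    type LeasePool struct {
    	leases chan struct{}
    }

    func NewLeasePool(maxTasks int) *LeasePool {
    	return &LeasePool{leases: make(chan struct{}, maxTasks)}
    }

    // Acquire blocks until a lease is available.
    func (p *LeasePool) Acquire() { p.leases <- struct{}{} }

    // Release hands the lease back to the pool when the task has finished.
    func (p *LeasePool) Release() { <-p.leases }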

  • Cloud execution

    Hi. I've looked through the docs and there doesn't seem to be any particular reference on whether this is possible. What I'm envisaging is having each process run via AWS Batch or the Google Cloud Life Sciences API, which are common targets for bioinformatics pipelines.

  • Better way of connecting components, to allow sanity checks and more

    If we create special InPort and OutPort structs, with some convenience functionality, we can move from:

    task2.InPorts["bar"] = task1.OutPorts["foo"]
    

    ... to something like:

    task2.InPorts["bar"].connectFrom(task1.OutPorts["foo"])
    

    ... and one could allow going the other direction too:

    task1.OutPorts["foo"].connectTo(task2.InPorts["bar"])
    

    This should also work well with static port fields, such as:

    task2.InBar.connectFrom(task1.OutFoo)
    task1.OutFoo.connectTo(task2.InBar)
    

    This would allow us to make sure that there are no unconnected ports and other sanity checks, as well as to enable traversing the workflow dependency graph to produce a textual or graphical representation of the workflow.

    An alternative approach would be to create a Channel component, that the "port" maps are initialized with, so that the assignment syntax still works (just that it is a Channel struct (with a real channel inside) that is assigned rather than a plain channel), but that wouldn't allow us the benefits stated above.
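
    For illustration, a rough sketch of what such port structs could look like (hypothetical and heavily simplified; SciPipe's real port types differ):

    package ports

    // FileIP stands in for whatever data packet type flows between processes.
    type FileIP struct{ Path string }

    // InPort and OutPort wrap a shared channel, so that connections can be made
    // in either direction and unconnected ports can be detected before a run.
    type InPort struct {
    	Chan      chan *FileIP
    	Connected bool
    }

    type OutPort struct {
    	Chan      chan *FileIP
    	Connected bool
    }

    func (ip *InPort) ConnectFrom(op *OutPort) {
    	ch := make(chan *FileIP, 16)
    	ip.Chan, op.Chan = ch, ch
    	ip.Connected, op.Connected = true, true
    }

    func (op *OutPort) ConnectTo(ip *InPort) { ip.ConnectFrom(op) }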

  • Add streaming support

    It should probably be configurable on each output whether it should stream its output or not!

    ... either in the commandline pattern, or as a struct map field.

  • Serialized workflow description in JSON ?

    Hi,

    I am looking into the possibility of using SciPipe for use outside of bioinformatics, namely film/visual-effects and possibly AEC.

    Most of my work experience lately is in the film/vfx industry, so I am looking to hook up SciPipe to a studio facility running what we call a farm, with software like Tractor (from Pixar) and maybe Deadline (Thinkbox).

    I have also recently spent time at CSL (Pharmaceutical) setting up an HPC cluster running SLURM and integrating with CWL for the bioinformatics R&D.

    I would like to know if SciPipe has a serialised description of the FBP network, in a form like JSON, from which I can write a translator to generate industry-specific job management files.

    In my short time running some of the demo code, I see *.audit.json files which are generated after a workflow has completed. I would like to generate the workflow description without running the workflow.

    I am looking for something like the Dot output but in a JSON format, with enough detail for me to recreate everything necessary to submit jobs to SLURM, Tractor or Deadline, or to generate CWL, WDL or other workflow files for submission via their respective runtimes like Cromwell.

    Cheers

  • SciPipe doesn't fail on missing output files

    It appeared through some of @jonalv's workflows that scipipe does not properly fail when some declared outputs of a process are not created. These will AFAIK fail when downstream processes try to read from the non-existing files, but it would be much more helpful when debugging to get the error where it happens.
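
    A sketch of the kind of check that could run right after a task's command finishes, so the error surfaces where it happens (hypothetical helper, not the actual fix):

    package output

    import (
    	"fmt"
    	"os"
    )

    // checkOutputsExist returns an error for the first declared output path
    // that was never created, instead of letting downstream processes fail
    // when they try to read the missing file.
    func checkOutputsExist(paths []string) error {
    	for _, p := range paths {
    		if _, err := os.Stat(p); os.IsNotExist(err) {
    			return fmt.Errorf("declared output file %s was not created", p)
    		}
    	}
    	return nil
    }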

  • Idea for a flexible component to generate parameters dynamically from shell code

    A current weak spot in SciPipe is when you need to generate many sets of parameter values or file names to feed a downstream pipeline. This can be done to some extent using e.g. a globber component, but that is limited to a very specific use case.

    Below is a sketch of an idea for how to implement a type of component that can generate this based on shell scripts.

    The idea is that you write a shell script that produces a set of JSON objects, one per line, with the parameter and output filename fields populated.

    The API could look like this:

    looper := wf.NewLooper("looper", "for f in data/*.csv; do echo \"{ 'outfile': '{o:outfile:$f}' }\"; done;")
    
    otherProc := wf.NewProc("other-proc", "some-command -in {i:infile} ... Etc etc")
    
    otherProc.In("infile").From(looper.Out("outfile"))
    
    // ... etc etc ...
    

    The example above is basically just a globber, but the same method could be used for populating parameters as well. I will update the example shortly to illustrate the combined generation of filenames and parameters.
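
    As a rough illustration of the mechanics (not a SciPipe API), such a component could run the shell snippet and read one JSON object per line from its output, roughly like this:

    package main

    import (
    	"bufio"
    	"bytes"
    	"encoding/json"
    	"fmt"
    	"log"
    	"os/exec"
    )

    func main() {
    	// Run the user-supplied shell snippet, which prints one JSON object per line.
    	cmd := exec.Command("bash", "-c",
    		`for f in data/*.csv; do printf '{"outfile": "%s"}\n' "$f"; done`)
    	out, err := cmd.Output()
    	if err != nil {
    		log.Fatal(err)
    	}

    	// Parse each line; the fields would become parameters and/or output
    	// file names for tasks further down the pipeline.
    	scanner := bufio.NewScanner(bytes.NewReader(out))
    	for scanner.Scan() {
    		var fields map[string]string
    		if err := json.Unmarshal(scanner.Bytes(), &fields); err != nil {
    			log.Fatal(err)
    		}
    		fmt.Println("outfile:", fields["outfile"])
    	}
    }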

  • Requests for a tool that monitors the execution of scipipe workflows.

    panoptes is such a tool for snakemake workflows. nf-tower is such a tool for nextflow workflows. Does scipipe provide such a tool, or some sort of API that I can use to monitor the execution of a scipipe workflow? If not, could you please give me some hints about how to do that?

    Regards, Zhen

  • Merge tags and params?

    Right now, params and tags associated with IPs via AuditInfo serve quite similar roles. The main difference is that params are meant to be parameters sent to whatever program is being executed, while tags might be other metadata that is not sent to a program, but might be extracted from the filename or the file itself and used for further filtering, grouping etc.

    Thus, it seems worth considering whether these could be stored in the same map of "tags".

  • Order globs on file size

    When batch-processing a large number of files of different sizes, with multiple gather points, it becomes important to start the long-running things early. Long run times often correlate well with file size, so it would be nice to be able to sort a file glob by file size, in order to get the long-running things fired off early (see the sketch below).
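
    A minimal sketch of sorting a glob by file size, largest first (plain standard-library Go, independent of SciPipe; the glob pattern is just an example):

    package main

    import (
    	"fmt"
    	"os"
    	"path/filepath"
    	"sort"
    )

    func main() {
    	paths, err := filepath.Glob("data/*.csv")
    	if err != nil {
    		panic(err)
    	}
    	// Sort the matches by file size, largest first, so the likely
    	// long-running inputs can be fed into the workflow early.
    	sort.Slice(paths, func(i, j int) bool {
    		fi, errI := os.Stat(paths[i])
    		fj, errJ := os.Stat(paths[j])
    		if errI != nil || errJ != nil {
    			return false
    		}
    		return fi.Size() > fj.Size()
    	})
    	for _, p := range paths {
    		fmt.Println(p)
    	}
    }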

  • Enable renaming paths to final paths on different partition

    Currently, if writing to paths that are on a different partition - for example /tmp/foo, when / is on a different hard drive partition than the /home/ folder where you execute the workflow - the os.Rename() call in FinalizePaths() will fail with "invalid cross-device link".

    To get around this, we could check for that specific error and, if it occurs, do a proper copy and remove instead, as sketched below.
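
    A sketch of what such a fallback could look like (assuming a Unix-like system where the failed os.Rename returns EXDEV; the helper and package names are made up):

    package fileutil

    import (
    	"errors"
    	"io"
    	"os"
    	"syscall"
    )

    // MoveFile renames src to dst, and falls back to copy-and-remove when the
    // two paths are on different partitions ("invalid cross-device link").
    func MoveFile(src, dst string) error {
    	err := os.Rename(src, dst)
    	if err == nil || !errors.Is(err, syscall.EXDEV) {
    		return err
    	}
    	in, err := os.Open(src)
    	if err != nil {
    		return err
    	}
    	defer in.Close()
    	out, err := os.Create(dst)
    	if err != nil {
    		return err
    	}
    	if _, err := io.Copy(out, in); err != nil {
    		out.Close()
    		return err
    	}
    	if err := out.Close(); err != nil {
    		return err
    	}
    	return os.Remove(src)
    }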

    Some links with pointers and/or ideas:

  • Ability to depend on task completion

    For some components which return multiple outputs, such as the globber component, it would be useful to be able to depend on the process' full completion in downstream tasks.

    Reporter: @jonalv
