Fast, efficient, and scalable distributed map/reduce system with DAG execution, in memory or on disk, written in pure Go, running standalone or distributed.

Gleam


Gleam is a high-performance, efficient distributed execution system that is also simple, generic, flexible, and easy to customize.

Gleam is built in Go, and user-defined computations can be written in Go, Unix pipe tools, or any streaming program.

High Performance

  • Pure Go mappers and reducers have high performance and concurrency.
  • Data flows through memory, optionally to disk.
  • Multiple map reduce steps are merged together for better performance.

Memory Efficient

  • Gleam avoids the GC problems that plague other languages: each executor runs in a separate OS process, and its memory is managed by the OS, so one machine can host many more executors.
  • Gleam master and agent servers are memory efficient, consuming about 10 MB of memory.
  • Gleam tries to automatically adjust the required memory size based on data size hints, avoiding trial-and-error manual memory tuning.

Flexible

  • The Gleam flow can run standalone or distributed.
  • Adjustable between in-memory mode and OnDisk mode.

Easy to Customize

  • The Go code is much simpler to read than Scala, Java, or C++.

One Flow, Multiple ways to execute

Gleam code defines the flow, specifying each dataset (vertex) and computation step (edge), and builds up a directed acyclic graph (DAG). There are multiple ways to execute the DAG.

The default way is to run locally. This works in most cases.
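
As a minimal sketch, here is a complete flow that runs locally, assembled only from calls that appear in the examples later in this README (the flow name and the single-shard read are chosen here just for illustration):

package main

import (
	"github.com/chrislusf/gleam/flow"
	"github.com/chrislusf/gleam/gio"
	"github.com/chrislusf/gleam/gio/mapper"
	"github.com/chrislusf/gleam/plugins/file"
)

func main() {
	gio.Init() // if this process was invoked as a mapper or reducer, run it and exit

	// Each dataset is a vertex and each computation step is an edge of the DAG.
	flow.New("minimal local flow").
		Read(file.Txt("/etc/passwd", 1)). // read a txt file as a single shard
		Map("tokenize", mapper.Tokenize). // predefined mapper, also used in the pipe example below
		Printlnf("%s").
		Run() // default: execute the whole DAG locally
}

The same flow definition can instead be handed distributed options, as shown in the Distributed Mode and Change Execution Mode sections below.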

Here we mostly talk about the distributed mode.

Distributed Mode

The distributed mode involves several components: Master, Agent, Executor, and Driver.

Gleam Driver

  • The Driver is the program users write; it defines the flow and talks to the Master, Agents, and Executors.

Gleam Master

  • The Master is a single server that collects resource information from the Agents.
  • It stores transient resource information and can be restarted.
  • When the Driver program starts, it asks the Master for available Executors on Agents.

Gleam Agent

  • Agents run on any machine that can run computations.
  • Agents periodically send resource usage updates to the Master.
  • When the Driver program has executors assigned, it talks to the Agents to start Executors.
  • Agents also manage the datasets generated by each Executor.

Gleam Executor

  • Executors are started by Agents. They read inputs from external sources or previous datasets, process them, and output to a new dataset.

Dataset

  • The datasets are managed by Agents. By default, data flows only through memory and the network, never touching slow disk.
  • Optionally, the data can be persisted to disk.

By keeping data in memory, the flow can have back pressure and can support stream computation naturally.

Documentation

Standalone Example

Word Count

Basically, you need to register the Go functions first. Registration returns a mapper or reducer function id, which can then be passed to the flow.

package main

import (
	"flag"
	"strings"

	"github.com/chrislusf/gleam/distributed"
	"github.com/chrislusf/gleam/flow"
	"github.com/chrislusf/gleam/gio"
	"github.com/chrislusf/gleam/plugins/file"
)

var (
	isDistributed   = flag.Bool("distributed", false, "run in distributed or not")
	Tokenize  = gio.RegisterMapper(tokenize)
	AppendOne = gio.RegisterMapper(appendOne)
	Sum = gio.RegisterReducer(sum)
)

func main() {

	gio.Init()   // If the command line invokes the mapper or reducer, execute it and exit.
	flag.Parse() // optional, since gio.Init() will call this also.

	f := flow.New("top5 words in passwd").
		Read(file.Txt("/etc/passwd", 2)).  // read a txt file and partitioned to 2 shards
		Map("tokenize", Tokenize).    // invoke the registered "tokenize" mapper function.
		Map("appendOne", AppendOne).  // invoke the registered "appendOne" mapper function.
		ReduceByKey("sum", Sum).         // invoke the registered "sum" reducer function.
		Sort("sortBySum", flow.OrderBy(2, true)).
		Top("top5", 5, flow.OrderBy(2, false)).
		Printlnf("%s\t%d")

	if *isDistributed {
		f.Run(distributed.Option())
	} else {
		f.Run()
	}

}

func tokenize(row []interface{}) error {
	line := gio.ToString(row[0])
	for _, s := range strings.FieldsFunc(line, func(r rune) bool {
		return !('A' <= r && r <= 'Z' || 'a' <= r && r <= 'z' || '0' <= r && r <= '9')
	}) {
		gio.Emit(s)
	}
	return nil
}

func appendOne(row []interface{}) error {
	row = append(row, 1)
	gio.Emit(row...)
	return nil
}

func sum(x, y interface{}) (interface{}, error) {
	return gio.ToInt64(x) + gio.ToInt64(y), nil
}

Now you can execute the binary directly, or pass the "-distributed" option to run in distributed mode. Distributed mode needs the simple setup described later.
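
For example, assuming the program above is saved as word_count.go (the file and binary names here are only placeholders):

// build the driver program
> go build -o word_count word_count.go

// local (standalone) mode
> ./word_count

// distributed mode, needs the cluster setup described below
> ./word_count -distributed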

A more fully worked example, using the predefined mappers and reducers, is here: https://github.com/chrislusf/gleam/blob/master/examples/word_count_in_go/word_count_in_go.go

Word Count by Unix Pipe Tools

Here is another way to do something similar with Unix pipe tools.

Unix pipes are easy for sequential processing, but limited for fan-out, and even more limited for fan-in.

With Gleam, fan-in and fan-out parallel pipes become very easy.

package main

import (
	"fmt"

	"github.com/chrislusf/gleam/flow"
	"github.com/chrislusf/gleam/gio"
	"github.com/chrislusf/gleam/gio/mapper"
	"github.com/chrislusf/gleam/plugins/file"
	"github.com/chrislusf/gleam/util"
)

func main() {

	gio.Init()

	flow.New("word count by unix pipes").
		Read(file.Txt("/etc/passwd", 2)).
		Map("tokenize", mapper.Tokenize).
		Pipe("lowercase", "tr 'A-Z' 'a-z'").
		Pipe("sort", "sort").
		Pipe("uniq", "uniq -c").
		OutputRow(func(row *util.Row) error {

			fmt.Printf("%s\n", gio.ToString(row.K[0]))

			return nil
		}).Run()

}

This example uses OutputRow() to process each output row directly.
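
As a hypothetical variation (not part of the original example), the same OutputRow() callback could split each "uniq -c" output line into its count and word before printing. This sketch assumes, as the example above does, that row.K[0] holds the whole line, and it needs "strings" added to the imports:

		OutputRow(func(row *util.Row) error {
			// row.K[0] carries the whole "  <count> <word>" line emitted by `uniq -c`
			fields := strings.Fields(gio.ToString(row.K[0]))
			if len(fields) == 2 {
				fmt.Printf("%s\t%s\n", fields[1], fields[0]) // word, then its count
			}
			return nil
		}).Run()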

Join Two CSV Files

Assume file "a.csv" has fields "a1, a2, a3, a4, a5" and file "b.csv" has fields "b1, b2, b3". We want to join the rows where a1 = b2, and the output format should be "a1, a4, b3".

package main

import (
	. "github.com/chrislusf/gleam/flow"
	"github.com/chrislusf/gleam/gio"
	"github.com/chrislusf/gleam/plugins/file"
)

func main() {

	gio.Init()

	f := New("join a.csv and b.csv by a1=b2")
	a := f.Read(file.Csv("a.csv", 1)).Select("select", Field(1,4)) // a1, a4
	b := f.Read(file.Csv("b.csv", 1)).Select("select", Field(2,3)) // b2, b3

	a.Join("joinByKey", b).Printlnf("%s,%s,%s").Run()  // a1, a4, b3

}

Distributed Computing

Setup Gleam Cluster Locally

Start a gleam master and several gleam agents

// start "gleam master" on a server
> go get github.com/chrislusf/gleam/distributed/gleam
> gleam master --address=":45326"

// start up "gleam agent" on some different servers or ports
> gleam agent --dir=2 --port 45327 --host=127.0.0.1
> gleam agent --dir=3 --port 45328 --host=127.0.0.1

Setup Gleam Cluster on Kubernetes

Start a gleam master and several gleam agents

kubectl apply -f k8s/

Change Execution Mode

After the flow is defined, the Run() function can be executed in local mode or distributed mode.

  import "github.com/chrislusf/gleam/distributed"

  f := flow.New("")
  ...

  // 1. local mode
  f.Run()

  // 2. distributed mode
  f.Run(distributed.Option())
  f.Run(distributed.Option().SetMaster("master_ip:45326"))

Status

Gleam is just getting started. Here are a few todo items. Any help is welcome!

  • Add new plugin to read external data.
  • Add windowing functions similar to Apache Beam/Flink. (in progress)
  • Add schema support for each dataset.
  • Support using SQL as a flow step, similar to LINQ.
  • Add dataset metadata for better caching of often re-calculated data.

Especially Need Help Now:

  • Go implementation to read Parquet files.

Please start using it and give feedback. Help is needed, and anything is welcome. Small things count: fixing documentation, adding a logo, adding a Docker image, blogging about it, sharing it, etc.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • Agent log crammed with CollectExecutionStatistics(_) error

    excerpt from the log file:

    2017/08/22 23:13:40 &{0xc420274000}.CollectExecutionStatistics(_) = _, rpc error: code = Unavailable desc = grpc: the connection is unavailable
    2017/08/22 23:13:40 &{0xc420252000}.CollectExecutionStatistics(_) = _, rpc error: code = Unavailable desc = grpc: the connection is unavailable
    2017/08/22 23:13:40 &{0xc4201be1c0}.CollectExecutionStatistics(_) = _, rpc error: code = Unavailable desc = grpc: the connection is unavailable
    ... (many more CollectExecutionStatistics errors of the same form) ...
    2017/08/22 23:13:40 in memory 5:1 finished writing f1213273487-d5-s1 0 bytes
    2017/08/22 23:13:40 in memory 6:0 finished reading f1213273487-d5-s1 0 bytes
    2017/08/22 23:13:40 in memory 5:6 finished writing f1213273487-d5-s6 0 bytes
    2017/08/22 23:13:40 in memory 6:0 finished reading f1213273487-d5-s6 0 bytes
    2017/08/22 23:13:41 deleting f1213273487-d5-s6
    2017/08/22 23:13:41 deleting f1213273487-d5-s11
    2017/08/22 23:13:41 deleting f1213273487-d5-s1
    2017/08/22 23:13:41 deleting f1213273487-d5-s9
    2017/08/22 23:13:46 in memory 5:5 starts writing f1750670629-d5-s5 expected reader:1
    2017/08/22 23:13:46 in memory 5:1 starts writing f1750670629-d5-s1 expected reader:1
    2017/08/22 23:13:46 in memory 6:0 waits for f1750670629-d5-s1
    2017/08/22 23:13:46 in memory 6:0 start reading f1750670629-d5-s1
    2017/08/22 23:15:15 in memory 5:5 finished writing f1750670629-d5-s5 0 bytes
    2017/08/22 23:15:15 in memory 6:0 finished reading f1750670629-d5-s5 0 bytes
    2017/08/22 23:15:15 &{0xc42009c1c0}.CollectExecutionStatistics(_) = _, rpc error: code = Unavailable desc = grpc: the connection is unavailable

    My application gives me unexpected results running in distributed mode with 4 agent nodes. If I run it with a local master and agent, everything seems fine. Not sure if this sort of error is related: 2017/08/22 23:15:15 &{0xc42014e1c0}.CollectExecutionStatistics(_) = _, rpc error: code = Unavailable desc = grpc: the connection is unavailable

  • Is there any example use case for streaming processing?

    For example, the Driver stays alive, keeps reading from a streaming source like Kafka, processes each data batch, and sends the output to a streaming sink like Kafka.

    Some questions:

    1. Are there any example use cases or best practices for processing streaming data?
    2. If Gleam does work for this, does a Gleam flow need to be restarted for each "streaming data slice"? For example, with a Kafka source, does it need to reconnect to the broker list each time?
    3. After each batch finishes, does Gleam support any method to keep some "dataset" for the next batch, or a mechanism for a "Global Context Dataset" shared across different Gleam batches?
    4. Is there a mechanism where one "never stop" distributed mapper can produce "streaming data slice" datasets and trigger those slice dataset batches to run in sequence, to avoid reconnecting to Kafka?
    5. Any other ideas that would help with processing streaming data?

    thanks~

  • MsgPack instruction to bridge pipe with msgpack

    example:

    f := flow.New("top5 words in passwd").
        Read(file.Txt("/etc/passwd", 1)).
        Map("tokenize", mapper.Tokenize).       // invoke the registered "tokenize" mapper function.
        Pipe("debugWithPipe", "tee debug.txt").
        MsgPack("fromPipeToMsgpack").
        Map("addOne", mapper.AppendOne).        // invoke the registered "addOne" mapper function.
        ReduceByKey("sum", reducer.SumInt64).   // invoke the registered "sum" reducer function.
        Sort("sortBySum", flow.OrderBy(2, true)).
        Top("top5", 5, flow.OrderBy(2, false)).
        Printlnf("%s\t%d")

  • Define Map in golang like glow?

    I'd like to begin using gleam (having used glow for a while now). Is there any way to write Map functions in Go? I.e., is there a way to do something like the following so that I can continue to utilize Go packages? Or can I only define Maps/Reduces with lua/shell scripts in gleam?

    package main
    
    import (
        ".../somepackage"
        "github.com/chrislusf/gleam/flow"
    )
    
    func main() {
    
        flow.New().TextFile("/etc/passwd").Map(func(line string) string {
            res, _ := somepackage.Transform(line)
            return res
        }).Fprintf(os.Stdout, "%s\n").Run()
    }
    
  • Word count example in distributed mode does not work

    I followed the steps below to set up the gleam cluster.

    // start "gleam master" on a server

    go get github.com/chrislusf/gleam/distributed/gleam
    gleam master --address=":45326"

    // start up "gleam agent" on some different servers or ports

    gleam agent --dir=2 --port 45327 --host=127.0.0.1
    gleam agent --dir=3 --port 45328 --host=127.0.0.1

    Inside examples/word_count_in_go directory, I ran the following:

    make run_distributed
    

    The process does not seem to finish.

    The following in standalone mode does not have problems.

    make
    
  • flow.Sort changes order of keys and values

    If I run flow.Sort(..., flow.OrderBy(2, true)) on the row "1 2 3 4", the order of the keys and values gets changed to "2 1 3 4". However, when I run flow.LocalSort(..., flow.OrderBy(2, true)), the order stays the same. Is this intentional behaviour? Ideally, any operations in the flow shouldn't change the order of the keys and values unless explicitly asked for right?

  • Reading Parquet from HDFS or S3

    Hi,

    Based on the README.md, gleam seems to be able to read data sources (e.g. Parquet) from HDFS or S3, but there doesn't seem to be any example on this.

    Is it possible to show how to read Parquet file from HDFS/S3?

    Thank you.

  • WriteTo encoding error: msgp: type "map[string][]float64" not supported

    Every time I update my own project, gleam is updated along with it. My driver program fails to serve requests with the following error:

    2017/09/03 23:22:10 Failed to run task Slices-0: WriteTo encoding error: msgp: type "map[string][]float64" not supported
    2017/09/03 23:22:10 Failed to execute on driver side: Failed to send source data: WriteTo encoding error: msgp: type "map[string][]float64" not supported
    

    Has anything changed in the mapper/reducer API?

  • error when compile example

    I get an error when I compile the examples (parquet, word_count_in_pipeline, csv, word_count_in_go, ...):

    ../../plugins/file/parquet/parquet_file_reader.go:6:2: cannot find package "github.com/xitongsys/parquet-go/ParquetFile" in any of:
            /usr/local/go/src/github.com/xitongsys/parquet-go/ParquetFile (from $GOROOT)
            /Users/lwm/go/src/github.com/xitongsys/parquet-go/ParquetFile (from $GOPATH)
    ../../plugins/file/parquet/parquet_file_reader.go:7:2: cannot find package "github.com/xitongsys/parquet-go/ParquetReader" in any of:
            /usr/local/go/src/github.com/xitongsys/parquet-go/ParquetReader (from $GOROOT)
            /Users/lwm/go/src/github.com/xitongsys/parquet-go/ParquetReader (from $GOPATH)
    ../../plugins/file/parquet/parquet_file_reader.go:8:2: cannot find package "github.com/xitongsys/parquet-go/ParquetType" in any of:
            /usr/local/go/src/github.com/xitongsys/parquet-go/ParquetType (from $GOROOT)
            /Users/lwm/go/src/github.com/xitongsys/parquet-go/ParquetType (from $GOPATH)

  • Improve test coverage.

    I would like to learn more about the project by getting started with contributing tests to improve the coverage. I believe that would be a productive start to understanding the project.

  • Join lib error

    "Join two CSV files." example gives the following error.

    # loader-engine/vendor/github.com/xitongsys/parquet-go/Layout
    vendor/github.com/xitongsys/parquet-go/Layout/DictPage.go:69:30: too many arguments in call to ts.Write
            have (context.Context, *parquet.PageHeader)
            want (thrift.TStruct)
    vendor/github.com/xitongsys/parquet-go/Layout/DictPage.go:200:30: too many arguments in call to ts.Write
            have (context.Context, *parquet.PageHeader)
            want (thrift.TStruct)
    vendor/github.com/xitongsys/parquet-go/Layout/Page.go:255:30: too many arguments in call to ts.Write
            have (context.Context, *parquet.PageHeader)
            want (thrift.TStruct)
    vendor/github.com/xitongsys/parquet-go/Layout/Page.go:354:30: too many arguments in call to ts.Write
            have (context.Context, *parquet.PageHeader)
            want (thrift.TStruct)
    
  • backward incompatible change in github.com/xitongsys/parquet-go

    % go build                                                                                                                  
    # github.com/chrislusf/gleam/plugins/file/parquet
    ../../go/pkg/mod/github.com/chrislusf/[email protected]/plugins/file/parquet/parquet_file_reader.go:8:2: imported and not used: "github.com/xitongsys/parquet-go/types"
    ../../go/pkg/mod/github.com/chrislusf/[email protected]/plugins/file/parquet/parquet_file_reader.go:73:29: undefined: ParquetTypeToGoType
    

    It seems those functions were removed in v1.6.0. Doing go get -u github.com/xitongsys/[email protected] solves the problem for me, but it would be better if gleam had a go.mod file.

    @xitongsys breaking API changes should be done in a v2 version; doing it in v1 will break everything. Please see https://blog.golang.org/v2-go-modules

  • Support for reading multiple files from S3

    I notice that the S3 virtual file system doesn't support listings. I also don't find any example code that shows how to use multiple S3 files as input so I assume it's currently not supported. I think that's a pretty common use-case so I'm creating a feature request here. ✋

    I guess the same issue applies to the Google Storage VFS. 🤷

    Finally, is there a workaround? I guess I could write my custom func(io.Writer, *pb.InstructionStat) error, perhaps? I'm new to this library.

  • Optimisation: S3/GS files shouldn't need to touch disk

    I notice that the S3 and GS virtual file systems both copy the content first to disk. Strictly speaking, given that both APIs support range requests to fetch file content, I don't think they really need to touch disk at all. Instead they could be streamed straight over HTTP while still implementing https://github.com/chrislusf/gleam/blob/53a69565476a403024c43247a7eb0eb728feaa43/filesystem/vfs.go#L20-L25.

    I guess temporarily storing to disk is a simpler implementation, though. 🤷‍♂️

  • Possible risk of temporary file collision

    I'm reading https://github.com/chrislusf/gleam/blob/766b2213edfa10ea296bba3b3adb0862606b5518/filesystem/vfs_s3.go#L83 and https://github.com/chrislusf/gleam/blob/53a69565476a403024c43247a7eb0eb728feaa43/filesystem/vfs_gs.go#L108, and both have a slight risk of file name collisions. I propose migrating to https://golang.org/pkg/io/ioutil/#TempFile, which makes that scenario impossible.

  • Performance Optimization is required

    I require a performance optimization for my code. Below is the gist link https://gist.github.com/moodleexpert/756d33613d08fc8581bce1bb4e806f21

    I have a file of size 2.5 GB. I am processing it locally, but it is taking too much time. Doing the same thing with a bash script processes the data within 5 minutes. Can you tell me what exactly I am doing wrong?
