Minimal memory usage, cloud-native Logstash alternative

Mr-Plow

A tiny, minimal tool to export data from a relational database (Postgres or MySQL) to Elasticsearch.

The tool does not implement every Logstash feature; its goal is to be a lightweight alternative to Logstash for keeping Elasticsearch in sync with a relational database.

Goals

Low memory usage: ~15 MB when idle, making it a good fit for cloud environments.

Stateless: a timestamp/date column is used to filter inserted/updated rows and avoid fetching data that has already been seen. At startup, Mr-Plow queries Elasticsearch for the last timestamp/date of the transferred data, so no local state is required.
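
To make the stateless startup concrete, here is a minimal sketch of the idea in Go (illustrative code with assumed names such as lastSeenTimestamp, not Mr-Plow's actual implementation): ask Elasticsearch for the newest value of the configured date column in the output index, and resume polling from there.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// lastSeenTimestamp asks Elasticsearch for the newest value of dateField
// already present in index, so that polling can resume from that point
// without any local state.
func lastSeenTimestamp(esURL, index, dateField string) (string, error) {
	// Sort descending on the date field and fetch only the top hit.
	body := fmt.Sprintf(`{"size":1,"sort":[{"%s":"desc"}],"_source":["%s"]}`, dateField, dateField)
	resp, err := http.Post(esURL+"/"+index+"/_search", "application/json", strings.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out struct {
		Hits struct {
			Hits []struct {
				Source map[string]string `json:"_source"`
			} `json:"hits"`
		} `json:"hits"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Hits.Hits) == 0 {
		return "", nil // empty index: transfer everything from scratch
	}
	return out.Hits.Hits[0].Source[dateField], nil
}

func main() {
	ts, err := lastSeenTimestamp("http://localhost:9200", "table1_index", "last_update")
	fmt.Println(ts, err)
}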

Usage

Mr-Plow essentially executes queries on a relational database and writes the resulting data to Elasticsearch. The configured queries run in parallel, and data is written incrementally: you only need to reference a timestamp/date column in each query to fetch newly inserted/updated rows.

Here is a basic configuration template example, where we only specify two queries and the endpoint configuration (one Postgres database and one Elasticsearch cluster):

# example of config.yml
pollingSeconds: 5 #database polling interval
database: "postgres://user:pwd@localhost:5432/postgres?sslmode=disable" #specify here the db connection
queries: #put here one or more queries (each one will be executed in parallel):
  - query: "select * from my_table1 where last_update > $1" #please add a filter on an incrementing date/ts column using the $1 value as param
    index: "table1_index" #name of the elastic output index
    updateDate: "last_update" #name of the incrementing date column
    id: "customer_id" #optional, column to use as elasticsearch id
  - query: "select * from my_table2 where ts > $1"
    index: "table2_index"
    updateDate: "ts"
elastic:
  url: "http://localhost:9200"
  user: "elastic_user" #optional
  password: "my_secret" #optional
  numWorker: 10 #optional, number of workers used to index each query
  caCertPath: "my/path/ca" #optional, path of a custom CA file (may be needed for some HTTPS connections)
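
Given such a configuration, the core polling loop can be pictured roughly as below. This is a hedged sketch under assumed names (pollQuery, and github.com/lib/pq as the driver), not the actual source: every polling interval the query is re-run with the greatest timestamp seen so far bound to the $1 placeholder, and each configured query gets its own goroutine.

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed Postgres driver, for the sketch only
)

// pollQuery re-runs one configured query every polling interval, binding the
// greatest timestamp seen so far to the $1 placeholder so that only new or
// updated rows come back.
func pollQuery(db *sql.DB, query string, lastSeen time.Time, every time.Duration) {
	for range time.Tick(every) {
		rows, err := db.Query(query, lastSeen) // lastSeen fills $1
		if err != nil {
			log.Println(err)
			continue
		}
		for rows.Next() {
			// ... build a document from the row, index it into Elasticsearch,
			// and advance lastSeen to the row's update date ...
		}
		rows.Close()
	}
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pwd@localhost:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// One goroutine per configured query, so queries run in parallel.
	go pollQuery(db, "select * from my_table1 where last_update > $1", time.Time{}, 5*time.Second)
	go pollQuery(db, "select * from my_table2 where ts > $1", time.Time{}, 5*time.Second)
	select {} // block forever (a real program would handle shutdown)
}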

Mr-Plow also has additional features. For example, with a database like Postgres that supports JSON columns, we can declare JSON fields in order to build a complex (nested) object in Elastic. In the following example, the employees table stores two dynamic JSON fields, one containing the payment data and another containing additional information about the employee:

pollingSeconds: 5
database: databaseValue
queries:
  - index: index_1
    query: select * from employees
    updateDate: last_update
    JSONFields:
      - fieldName: payment_data
      - fieldName: additional_infos
    id: MyId_1
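
Conceptually, a JSONField is a column whose raw JSON text gets expanded into a nested object before the document is indexed, roughly like this sketch (assumed behavior and illustrative names, e.g. expandJSONField):

package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// expandJSONField replaces a column holding raw JSON text with the parsed,
// nested object, so Elasticsearch indexes structured data rather than one
// opaque string.
func expandJSONField(doc map[string]interface{}, field string) error {
	raw, ok := doc[field].(string)
	if !ok {
		return nil // column missing or already structured: leave it alone
	}
	var nested map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &nested); err != nil {
		return err
	}
	doc[field] = nested
	return nil
}

func main() {
	row := map[string]interface{}{
		"name":         "alice",
		"payment_data": `{"bank_account":"abc123","validated":true}`,
	}
	if err := expandJSONField(row, "payment_data"); err != nil {
		log.Fatal(err)
	}
	fmt.Println(row) // payment_data is now a nested object
}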

Additionally, we can specify the expected type for specific fields. Note that the field type is optional; if not specified, the field is cast as a String.

The currently supported types are: String, Integer, Float and Boolean.

pollingSeconds: 5
database: databaseValue
queries:
  - index: index_1
    query: select * from employees
    updateDate: last_update
    fields: # Optional config, casting standard sql columns to specific data type
      - name: name
        type: String
      - name: working_hours
        type: Integer
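
The casting step can be pictured as a small switch over the configured type names, falling back to String when no type is given. This is a sketch of the assumed mechanics (castField is an illustrative name), not the project's actual code:

package main

import (
	"fmt"
	"strconv"
)

// castField converts a raw column value to the configured type.
// An empty or "String" type leaves the value as a string.
func castField(value, fieldType string) (interface{}, error) {
	switch fieldType {
	case "Integer":
		return strconv.Atoi(value)
	case "Float":
		return strconv.ParseFloat(value, 64)
	case "Boolean":
		return strconv.ParseBool(value)
	case "String", "":
		return value, nil
	default:
		return nil, fmt.Errorf("unsupported type %q", fieldType)
	}
}

func main() {
	v, _ := castField("40", "Integer")
	fmt.Printf("%v (%T)\n", v, v) // prints: 40 (int)
}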

Merging the previous two examples, type casting can also be applied to inner JSON fields. Here is a complete configuration example:

pollingSeconds: 5
database: databaseValue
queries:
  - index: index_1
    query: select * from employees
    updateDate: last_update
    fields: # Optional, casting standard sql columns to specific data type
      - name: name
        type: String
      - name: working_hours
        type: Integer
    JSONFields:
      - fieldName: payment_info
        fields: # Optional, casting json fields to specific data type
          - name: bank_account
            type: String
          - name: validated
            type: Boolean
    id: MyId_1

Download or build the binary (docker images will be released soon):

make 

Run the tool:

./bin/mr-plow -config /path/to/my/config.yml

To build a Docker image, create a config.yml and put it into the root folder of the project. Then run:

docker build .
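
Assuming the image's entrypoint starts Mr-Plow with the bundled config.yml (the tag name below is just an example), the container can then be run with:

docker build -t mr-plow .
docker run mr-plow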

Mr-Plow development with Visual Studio Code

Requirements

Linux and Mac users should ensure that the following programs are installed on their system:

  • bash, id, getent

Windows users should clone the Mr-Plow repository onto the WSL filesystem.
This is the recommended approach because mounting an NTFS filesystem inside a container exposes the user experience to major issues.
Windows users should also patch .devcontainer/devcontainer.json as indicated in the comments inside the file.

Steps

  1. Clone the project to a local folder
  2. VSCode -> >< (left bottom button) -> Open Folder in Container...

Instructions

Users can develop and test Mr-Plow inside a Docker container without having to install or configure anything on their own machine.
Visual Studio Code can take care of automatically downloading and building the developer's Docker image.
For Linux and Mac users, special care has been taken to make sure the host user's UID and GID match those of the user inside the container.
This ensures that every modification made from inside the container is completely transparent from the host's perspective.
Moreover, the host user's ~/.ssh directory is mounted onto the container user's ~/.ssh directory, which is especially convenient if SSH authentication is configured for GitHub.
From inside the container, users can access the host's Docker engine as if they were in a regular host shell.
This capability allows users to launch the predefined docker-compose images directly from Visual Studio Code:
open the task menu with Ctrl+Shift+P, select Docker: Compose Up, and choose one of the following services:

  • docker-compose-elastic.yml: ElasticSearch
  • docker-compose-kibana.yml: Kibana
  • docker-compose-postgres.yml: Postgres

Comments
  • Remove test_util, it's not idiomatic and looks like Java.

    • Inserted the testify test package.

    • Pending integration with elastic.

    Modified: ../go.mod, ../go.sum, complete_config_test.go, config_test.go, insert_integration_test.go, invalid_config_test.go, scheduling_integration_test.go, upsert_integration_test.go; deleted: ../test_util/test_util.go.

  • 47 vscode enhs

    Hi, this is a first attempt to show the remote-container attach capabilities of VS Code. I suggest reading the updated README.md first to get a high-level view of the user expectations for this enhancement. It was created in a Linux environment and relies heavily on bash capabilities, so I expect this patch not to work from MS Windows yet (some discussion about this will eventually be required). I was able to start the full composition (mr-plow + postgres + elastic + kibana), but I was not able to successfully launch the test suite; it seems that every integration test has service IPs and ports hardcoded. The documentation also generally lacks information on how to correctly execute the tests.

  • Would be nice having a complete dockerized env with vscode

    For both developing and testing, it would be nice to have a predefined environment available out of the box. At the moment, in order to develop and test, one must set up an environment on their own machine, which involves installing the Go compiler, make, mysql, elastic, and so on. This is not always desirable; it is much better to have the project fully enclosed in its own environment. A nice thing VS Code can do is take care of all this: simply instruct it to build and then connect to a Docker container. A developer could then immediately start building and testing the project in a replicated and verified environment.

  • 1 ci

    Testing the full import cycle is almost there. The changes are getting bigger, so I would like to merge before completing; this will help you in developing the next features.

    Feedback is welcome as usual ;)

  • add docker-compose with postgres and elastic search

    For development and testing it would be better to add a docker-compose setup for firing up Postgres and Elasticsearch instances:

    1. add the correct table in Postgres
    2. fire up Elasticsearch
    3. generate data
    4. move data
  • Casting section in README.md

    After the configuration section, add a short paragraph presenting the type-casting configuration.

    We can present one simple configuration with a single query, and then a more complex configuration presenting nested JSON.

  • 10 json data parsing

    This feature is not complete, but I'm opening a pull request to consider merging the branch, since several changes have been made to the general code (defined the configuration structure, cleaned up some code, etc.).

  • Useful resources for CI and Code Coverage

    CodeCov: an interesting tool to manage integration processes and code coverage (free for open source): https://about.codecov.io/for/open-source/

    CI templates found on GitHub: https://github.com/jandelgado/golang-ci-template-github-actions

    https://gist.github.com/Harold2017/d98607f242659ca65e731c688cb92707

  • Exiting signal priority

    In scheduler.go the exit message should have higher priority than the periodic tick message received on each polling interval. Currently the two messages are received non-deterministically, so the tool sometimes does not exit when you hit ctrl+c (see the sketch after this list).

  • Include some tools for testing and remove util classes

    We should try to avoid using utility classes and static methods:

    https://www.vojtechruzicka.com/avoid-utility-classes/#:~:text=Another%20problem%20is%20that%20existing,related%20to%20the%20original%20methods.

    Instead, we should use some helpful packages, such as:

    • https://github.com/stretchr/testify: useful for assertion and mocking (my preferred choice, for unit testing in particular)
    • https://github.com/onsi/ginkgo: this seems useful for integration / e2e tests
    • https://onsi.github.io/gomega: this seems related to Ginkgo as its preferred assertion package; it seems like overkill to me, but let's consider it
  • Json Field validation

    func validateJsonFields(_ []JSONField, _ int) error { //TODO return nil }

    @DarioBalinzo @feed3r This is pretty weird, could you point out which kind of validation is required? Just to have some context.

  • Feature request: support Kong cli parsing

    Currently we support the flag parsing mechanism. It would be interesting to add Kong, which we are using in the nuvolaris cli: https://github.com/alecthomas/kong

  • Mr Plow memory stats

    I've found an interesting tool that prints the contents of a Go binary: https://github.com/nikolaydubina/go-binsize-treemap

    Here is the result with mr-plow: [treemap image not included]

  • Implement an interface for configuration and metrics

    We should implement some kind of REST API to adjust the configuration (write the configuration file) and to expose some metrics, such as:

    • how much data has been transferred
    • system status
    • other...?
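
Regarding the exiting-signal priority comment above, a common Go pattern for giving the quit signal precedence over the ticker is to drain the quit channel first with a non-blocking select. A minimal sketch of that pattern (not scheduler.go itself):

package main

import (
	"os"
	"os/signal"
	"time"
)

// schedule checks the quit channel with a non-blocking select before waiting
// on the ticker, so an exit request always wins over a pending tick.
func schedule(quit chan os.Signal, every time.Duration, work func()) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-quit:
			return
		default:
		}
		select {
		case <-quit:
			return
		case <-ticker.C:
			work()
		}
	}
}

func main() {
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, os.Interrupt) // ctrl+c
	schedule(quit, 5*time.Second, func() { /* run the configured queries */ })
}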