DEPRECATED: Data collection and processing made easy.

This project is deprecated. Please see this email for more details.

Heka

Data Acquisition and Processing Made Easy

Heka is a tool for collecting and collating data from a number of different sources, performing "in-flight" processing of collected data, and delivering the results to any number of destinations for further analysis.

Heka is written in Go, but Heka plugins can be written in either Go or Lua. The easiest way to compile Heka is by sourcing (see below) the build script in the root directory of the project, which will set up a Go environment, verify the prerequisites, and install all required dependencies. The build process also provides a mechanism for easily integrating external plug-in packages into the generated hekad. For more details and additional installation options see Installing.

WARNING: YOU MUST SOURCE THE BUILD SCRIPT (i.e. source build.sh) TO BUILD HEKA. Setting up the Go build environment requires changes to the shell environment; if you simply execute the script (i.e. ./build.sh), these changes will not be made.

Resources:

Owner
Mozilla Services
see also http://blog.mozilla.com/services
Comments
  • (slightly) Improving deb packaging with init scripts and user creation

    As mentioned in PR #791, there were some wishes for upstart and systemd jobs as well as an init script, and there were also concerns that the install([...]) used in that PR would have included the files in all packaging. I've added a CMake module that I found here: https://github.com/sebknzl/cmake-debhelper/ which is licensed under the GPL (might be of concern? But it is only used by the build system, so it should not go viral over the entire code base). This module is "needed" to execute dh_* scripts during the setup of the project; those in turn are needed to be able to use the Debian packaging idiom of putting the init scripts into a debian/ folder.

    I also found a "bug" in the current deb-package setup: the listed Debian dependency did not get assigned to the package, which I'll admit caused me some headache - I initially tried adding the init files in a similar fashion. Passing it in as a variable to the custom make entry seems to have made it work, though. But is it really necessary to require libc6 above 2.13? It means that Heka won't install on squeeze...

    In the long run, though, I would really recommend moving away from having CPack do the Debian packaging. I didn't do that now since it would've involved rearranging a significant portion of the build, but if it is something you want to do with the project I think I could summon the time to do that too. :)

  • Support new InfluxDB 0.9.x line protocol write API

    Well, it looks like they've decided to change the default format for the write API, per this PR: https://github.com/influxdb/influxdb/pull/2696. I'm logging this here as a follow-up to update the Schema InfluxDB Write Encoder to support it, since the JSON format will eventually be deprecated and the line protocol provides better performance anyway.

  • Output buffer never flushed on restart

    With the following output config:

    [ESJsonEncoder]
    type = "ESJsonEncoder"
    index = "logs-%{Type}-%{%Y.%m.%d}"
    es_index_from_timestamp = true
    type_name = "%{Type}"
    fields = ["Timestamp", "Logger", "Severity", "Payload", "Pid", "Hostname", "DynamicFields"]
      [ESJsonEncoder.field_mappings]
      Timestamp = '@timestamp'
      #Uuid = 'heka_uuid'
      #Type = 'type' # not needed, already covered by type_name
      Logger = 'heka_logger'
      Severity = 'syslog_severity_code'
      Payload = 'message'
      #EnvVersion = 'heka_env_version' # not needed
      Pid = 'pid'
      Hostname = 'host'
    
    [ElasticSearchOutput]
    message_matcher = "Type != 'heka.all-report' && Type != 'heka.memstat'"
    encoder = "ESJsonEncoder"
    server = "http://localhost:9200"
    flush_interval = 50
      [ElasticSearchOutput.buffering]
      max_file_size = 268435456  # 256MiB
      max_buffer_size = 536870912  # 512MiB
      full_action = "shutdown"
    

    After a service heka restart, the queue is not processed; it just grows until full.

    I've looked at the code but haven't found any clue.

    I've seen that #1724 has not been merged into dev yet, but this is a different problem, isn't it?

  • Add FileOutput file rotation

    We've had several requests for FileOutput to be able to do file rotation without the use of an external rotation tool, in part because it means fewer tools, and in part because Heka needs to receive a HUP signal to actually notice that a file has been rotated out from under it, and the person running Heka doesn't always have visibility into when rotation has happened and when HUP needs to be sent.
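
    As a point of reference, here's a minimal standalone Go sketch (not Heka's code, and not a proposed API) of the kind of size-based rotation this implies; the single ".1" backup generation and the size threshold are illustrative assumptions:

    // Minimal sketch of size-based rotation: when the next write would push the
    // file past maxSize, the current file is renamed to "<path>.1" and a fresh
    // file is opened. Real rotation also needs time-based triggers, multiple
    // generations, and signal handling, which are omitted here.
    package main

    import (
        "fmt"
        "os"
    )

    type rotatingFile struct {
        path    string
        maxSize int64
        f       *os.File
        written int64
    }

    func (r *rotatingFile) Write(p []byte) (int, error) {
        if r.f == nil || r.written+int64(len(p)) > r.maxSize {
            if err := r.rotate(); err != nil {
                return 0, err
            }
        }
        n, err := r.f.Write(p)
        r.written += int64(n)
        return n, err
    }

    func (r *rotatingFile) rotate() error {
        if r.f != nil {
            r.f.Close()
            os.Rename(r.path, r.path+".1") // keep one old generation
        }
        f, err := os.OpenFile(r.path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
        if err != nil {
            return err
        }
        r.f, r.written = f, 0
        return nil
    }

    func main() {
        w := &rotatingFile{path: "/tmp/heka-demo.log", maxSize: 1 << 20} // 1 MiB
        fmt.Fprintln(w, "hello rotation")
    }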

  • Implementation of a multiline splitter

    We work with a lot of Java and Scala stacktraces and the other options I've tried for supporting them in Heka don't work as well as I'd like. This is an implementation of a regex-based MultilineSplitter which works great for our stacktraces. The implementation is that you define a regex to use as the delimiter and a regex used to match lines that should be joined together. It first splits the buffer using the delimiter and then checks each section against the multiline regex to see if it's a match. All lines that are contiguous and match the multiline regex are joined. Because of the multiline nature, it always keeps the delimiter on the EOL.

    This is going to be notably slower than the RegexSplitter because it has to find many matches on the first pass rather than just the first one. (That's limited to 99 by default and is not currently configurable without a recompile.) Secondly, it runs a second regex on all those matches. Given Go's performance-oriented regex implementation and reasonable logging levels, it appears to be tolerable. In the worst case it can re-split the first lines in a very large buffer repeatedly.
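
    To make the algorithm concrete, here's a rough standalone Go sketch of the join pass described above (it is not the splitter's actual code path, and the regexes are trimmed-down versions of the example config below):

    // Split the buffer on the delimiter regex, then merge every section matching
    // the multiline regex into the record that precedes it, re-attaching the
    // (assumed newline) delimiter at end of line.
    package main

    import (
        "fmt"
        "regexp"
    )

    func joinMultiline(buf string, delim, multiline *regexp.Regexp) []string {
        var records []string
        for _, s := range delim.Split(buf, -1) {
            if s == "" {
                continue
            }
            if multiline.MatchString(s) && len(records) > 0 {
                records[len(records)-1] += s + "\n"
            } else {
                records = append(records, s+"\n")
            }
        }
        return records
    }

    func main() {
        delim := regexp.MustCompile(`\n`)
        multi := regexp.MustCompile(`(\A\s*.+Exception: .)|(at \S+\(\S+\))|(\A\s*Caused by:.)`)
        buf := "java.net.ConnectException: Connection refused\n" +
            "\tat sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)\n" +
            "\tat org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)\n" +
            "2016-01-24 14:22:18 INFO an ordinary single-line record\n"
        for i, r := range joinMultiline(buf, delim, multi) {
            fmt.Printf("record %d: %q\n", i, r) // two records: the joined stack trace and the INFO line
        }
    }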

    Here's an example configuration for the splitter:

    [multiline_splitter]
    type = "MultilineSplitter"
    multiline = '(\] FATAL )|(\A\s*.+Exception: .)|(at \S+\(\S+\))|(\A\s+... \d+ more)|(\A\s*Caused by:.)|(\A\s*Grave:)'
    delimiter = '\n'
    

    Given a broken Kafka installation, this generates something like the following when encoded with the ESLogstashV0Encoder:

    {
      "@fields": {
        "ContainerName": "boring_bohr",
        "ContainerID": "910f097243d6"
      },
      "@source_host": "docker1",
      "@uuid": "bda1cb47-8aa4-420e-9316-543364afd5fc",
      "@timestamp": "2016-01-24T14:22:17",
      "@type": "message",
      "@logger": "stdout",
      "@severity": 7,
      "@message": "java.net.ConnectException: Connection refused\n\tat sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)\n\tat sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)\n\tat org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)\n\tat org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)\n",
      "@envversion": "",
      "@pid": 0
    }
    

    Note that this output is from a splitter-enabled Docker input plugin that I will prepare a separate PR for.

  • Raw syslog datagrams multidecoder example

    I am proposing to add an example to cover #1162 and the older #790.

    This is particularly interesting in the case of containers, if one does not want to run syslog/rsyslog inside the container but rather have Heka read /dev/log directly.

    Although the best scenario is to let syslog/rsyslog poll /dev/log and write their own log (which is then fed to and read by Heka), I prefer this approach, and I provide here a working example (tested with 0.9.0) that people should be able to easily extend and build on.

    I am new to heka, so please tell me about possible improvements. Feedback for PR changes welcome as well :)

    NOTE: might require some refactoring of the name of sibling example.toml, if this is accepted at all

    1. Is there already a decoder that does this?
    2. If not, would this be better converted to a Lua decoder?

    I didn't go with (2) because I wanted a simple message-proxying feature, but for official inclusion it might be a valid option instead of a multidecoder example (also to properly parse PRI).
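
    Also, since (2) would mainly be about parsing PRI, here is a minimal standalone Go sketch of RFC 3164 PRI handling (the function name and sample datagram are made up for illustration; this is not part of the example in this PR):

    // PRI is the leading "<N>" of a raw syslog datagram, where
    // N = facility*8 + severity (RFC 3164).
    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    func parsePri(msg string) (facility, severity int, rest string, err error) {
        if !strings.HasPrefix(msg, "<") {
            return 0, 0, msg, fmt.Errorf("no PRI prefix")
        }
        end := strings.Index(msg, ">")
        if end < 2 || end > 4 {
            return 0, 0, msg, fmt.Errorf("malformed PRI")
        }
        pri, err := strconv.Atoi(msg[1:end])
        if err != nil || pri < 0 || pri > 191 {
            return 0, 0, msg, fmt.Errorf("invalid PRI value %q", msg[1:end])
        }
        return pri / 8, pri % 8, msg[end+1:], nil
    }

    func main() {
        // <30> = facility 3 (daemon), severity 6 (informational).
        fac, sev, rest, err := parsePri("<30>Jan  2 15:04:05 myhost myapp[42]: started")
        fmt.Println(fac, sev, rest, err)
    }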

  • New input plugin for Docker containers: DockerLogInput

    Solves #1092. This is a follow-up to PR #1095.

    This PR implements DockerLogInput, an input plugin based on Logspout for sending logs from Docker containers into the heka pipeline.

    @rafrombrc Something like this? I renamed the plugin to DockerLogInput and wrote some basic docs for it.

  • Can't install Heka on Ubuntu 13.04

    Hi, all!

    We use Ubuntu 13.04, we installed Go from source under /usr/local/go, and I set the path like this:

     export GOROOT=/usr/local/go
     export PATH=$PATH:$GOROOT/bin
    
    # go version
    go version go1.1.1 linux/amd64
    

    After finishing the Go installation, installing Heka gave an error:

    root@squidloganalyzer:~/heka# ./build.sh
    CMake Error at /usr/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:97 (message):
      Could NOT find Go (missing: GO_VERSION GO_PLATFORM GO_ARCH) (Required is at
      least version "1.1")
    Call Stack (most recent call first):
      /usr/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:291 (_FPHSA_FAILURE_MESSAGE)
      cmake/FindGo.cmake:32 (find_package_handle_standard_args)
      CMakeLists.txt:16 (find_package)
    
    
    -- Configuring incomplete, errors occurred!
    make: *** No targets specified and no makefile found.  Stop.
    

    Can anyone tell me how to install Heka the right way?

  • Update Schema InfluxDB Write Encoder to use InfluxDB 0.9.x+ line protocol

    This PR updates the Schema InfluxDB Write Encoder to format metrics into the line protocol instead of the JSON format that was originally proposed for the 0.9.0 release. This also makes sure that proper formatting is done for the various fields to escape spaces, commas and double quotes as defined by the line protocol. There is also a somewhat hacky implementation to overcome the fact that Lua converts float values of 0.0 into an integer based on its internal representation of numerical data types. I've also added a couple of new configuration items that provide more flexibility in the naming of measurements as they are sent to InfluxDB. There is now more emphasis on utilizing tags instead of the Graphite style "paths" to uniquely identify series, so the defaults have been blanked out to work with this recommendation more naturally.
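
    For illustration, here is a small standalone Go sketch of the escaping and float-formatting rules described above (it mirrors what the Lua encoder has to do, but it is not the encoder itself): commas and spaces are escaped in measurement names; commas, spaces, and equals signs are escaped in tag keys and values; string field values are double-quoted with inner quotes escaped; and floats always keep a decimal point so a value of 0.0 is not written as the integer 0.

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    var (
        nameEscaper = strings.NewReplacer(",", `\,`, " ", `\ `)            // measurement names
        tagEscaper  = strings.NewReplacer(",", `\,`, " ", `\ `, "=", `\=`) // tag keys and values
        strEscaper  = strings.NewReplacer(`"`, `\"`)                       // string field values
    )

    // formatFloat always keeps a decimal point, so 0.0 stays "0.0" instead of "0".
    func formatFloat(v float64) string {
        s := strconv.FormatFloat(v, 'f', -1, 64)
        if !strings.ContainsAny(s, ".eE") {
            s += ".0"
        }
        return s
    }

    func main() {
        measurement := nameEscaper.Replace("cpu load,short term")
        tags := "host=" + tagEscaper.Replace("web 01") + ",role=" + tagEscaper.Replace("front,end")
        fields := "value=" + formatFloat(0.0) + `,note="` + strEscaper.Replace(`idle "ok"`) + `"`
        // line protocol shape: measurement,tags fields timestamp(ns)
        fmt.Printf("%s,%s %s %d\n", measurement, tags, fields, int64(1434055562000000000))
    }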

  • Can't install built heka deb package.

    I can run the generated hekad binary directly, but after running

    source build.sh
    make deb
    sudo dpkg -i heka_0.10.0_amd64.deb
    

    OS output:

    dpkg: dependency problems prevent configuration of heka:
     heka depends on libc6-amd64 (>= 2.15).
    

    However, if I install v0.10.0 directly from source, this dependency is not required at all. Is something wrong?

  • add example decoder for linux /proc/stat

    /proc/stat is a good source for a CPU utilization metric, whereas /proc/loadavg is kind of tricky to work with. /proc/stat can give 1-second resolution without any issues.

    However, getting the information is not so simple: you have to read /proc/stat twice and take the delta of the values to gauge performance.

    This is my first time playing with Lua LPEG, and I'm still very green on the Heka project.

    This pull request is just to get some thoughts on how a good Heka plugin could be created for the /proc/stat CPU metric.

    This solution is not ideal, but it's giving me good data. (A standalone sketch of the delta computation follows the config below.)

    [stat_ProcessInput]
    type = "ProcessInput"
    decoder = "StatDecoder"
    ticker_interval = 3
    stdout = true
    stderr = false
        [stat_ProcessInput.command.0]
        bin = "/bin/sh"
        args = ["-c",'A=`head -1 /proc/stat`; sleep 1; B=`head -1 /proc/stat`; echo ${A}zzz${B}zzz;']
    
    # This would be best, but I don't see how I can get the diff of the previous read:
    #    [stat]
    #    type = "FilePollingInput"
    #    ticker_interval = 1
    #    file_path = "/proc/stat"
    #    decoder = "StatDecoder"
    
    [StatDecoder]
    type = "SandboxDecoder"
    filename = "lua_decoders/linux_stat.lua"
    
  • Need OpenSearch bulk API to send logs to AWS OpenSearch

    Hi, I tried to use elasticsearch_bulk_api.lua in a new output file called opensearch.cfg, but I get HTTP 400 in the logs. The user/pass/endpoint is configured correctly, so the only thing I can think of is that OpenSearch 1.3 from AWS somehow does not support the elasticsearch_bulk_api.lua format. One more thing: as soon as the new output is added to Hindsight, the current setup with Elasticsearch (a different output.cfg file) gets disrupted.
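
    For what it's worth, when chasing a 400 from the bulk endpoint it can help to compare the request body against the basic bulk (NDJSON) shape: one action line, one document line, each newline-terminated. A minimal standalone Go sketch of that shape (the index name and document are made up) looks like this:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
    )

    // buildBulkBody emits one "index" action line followed by one document line
    // per document; json.Encoder appends the required trailing newline to each.
    func buildBulkBody(index string, docs []map[string]any) (string, error) {
        var buf bytes.Buffer
        enc := json.NewEncoder(&buf)
        for _, doc := range docs {
            if err := enc.Encode(map[string]any{"index": map[string]string{"_index": index}}); err != nil {
                return "", err
            }
            if err := enc.Encode(doc); err != nil {
                return "", err
            }
        }
        return buf.String(), nil
    }

    func main() {
        body, err := buildBulkBody("logs-2022.06.01", []map[string]any{
            {"message": "hello", "severity": 6},
        })
        if err != nil {
            panic(err)
        }
        fmt.Print(body)
    }

    Comparing that shape against the raw request the output actually sends (and the error body returned with the 400) usually points at which line the server is rejecting.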

  • CODE_OF_CONDUCT.md file missing

    As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

    1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
    2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

    If you have any questions about this file, or Code of Conduct policies and procedures, please reach out to [email protected].

    (Message COC001)

  • Kafka partition issue

    Config:

    [kafkaInputTest]
    type = "KafkaInput"
    topic = "jie"
    addrs = ["172.20.3.50:9092"]
    splitter = "KafkaSplitter"
    decoder = "ProtobufDecoder"

    [KafkaSplitter]
    type = "NullSplitter"
    use_message_bytes = true

    But I get this error:

    2018/09/17 16:01:18 Input 'kafkaInputTest' error: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.

  • build failed on master branch

    [ 85%] Performing build step for 'lua_sandbox'
    Scanning dependencies of target lua-5_1_5
    [ 1%] Creating directories for 'lua-5_1_5'
    [ 2%] Performing download step (git clone) for 'lua-5_1_5'
    Cloning into 'lua-5_1_5'...
    remote: Repository not found.
    fatal: repository 'https://github.com/trink/lua.git/' not found

  • MySQL Slow Query Log Decoder issue

    [hekad]
    maxprocs = 2
    #base_dir = "./base_dir"
    share_dir = "/usr/share/heka"
    #log_info_filename = "logs/info.log"
    #log_error_filename = "logs/error.log"
    #log_file_max_size = 64
    #log_file_max_backups = 7

    [Sync-1_5-SlowQuery]
    type = "LogstreamerInput"
    log_directory = "/data/soft/"
    file_match = 'mysqlslowq.log'
    parser_type = "regexp"
    delimiter = "\n(# User@Host:)"
    delimiter_location = "start"
    decoder = "MySqlSlowQueryDecoder"

    [MySqlSlowQueryDecoder]
    type = "SandboxDecoder"
    filename = "lua_decoders/mysql_slow_query.lua"

    [MySqlSlowQueryDecoder.config]
    truncate_sql = 64
    

    [ESJsonEncoder]
    index = "%{Type}-%{%Y.%m.%d}"
    es_index_from_timestamp = true
    type_name = "%{Type}"
      [ESJsonEncoder.field_mappings]
      Timestamp = "@timestamp"
      Severity = "level"

    [output_file]
    type = "FileOutput"
    message_matcher = "TRUE"
    path = "/data/mysql-output.log"
    perm = "666"
    flush_count = 100
    flush_operator = "OR"
    encoder = "ESJsonEncoder"

    #######################################################################

    [root@oskey heka]# hekad -config="mysql.toml"

    2018/01/25 09:59:20 Pre-loading: [output_file]
    2018/01/25 09:59:20 Pre-loading: [Sync-1_5-SlowQuery]
    2018/01/25 09:59:20 Pre-loading: [MySqlSlowQueryDecoder]
    2018/01/25 09:59:20 Pre-loading: [ESJsonEncoder]
    2018/01/25 09:59:20 Pre-loading: [ProtobufDecoder]
    2018/01/25 09:59:20 Loading: [ProtobufDecoder]
    2018/01/25 09:59:20 Pre-loading: [ProtobufEncoder]
    2018/01/25 09:59:20 Loading: [ProtobufEncoder]
    2018/01/25 09:59:20 Pre-loading: [TokenSplitter]
    2018/01/25 09:59:20 Loading: [TokenSplitter]
    2018/01/25 09:59:20 Pre-loading: [HekaFramingSplitter]
    2018/01/25 09:59:20 Loading: [HekaFramingSplitter]
    2018/01/25 09:59:20 Pre-loading: [NullSplitter]
    2018/01/25 09:59:20 Loading: [NullSplitter]
    2018/01/25 09:59:20 Loading: [MySqlSlowQueryDecoder]
    2018/01/25 09:59:20 Loading: [ESJsonEncoder]
    2018/01/25 09:59:20 Loading: [Sync-1_5-SlowQuery]
    2018/01/25 09:59:20 unknown config setting for 'Sync-1_5-SlowQuery': parser_type
    2018/01/25 09:59:20 Loading: [output_file]
    2018/01/25 09:59:20 Error reading config: 1 errors loading plugins

Related tags
Open source framework for processing, monitoring, and alerting on time series data

Kapacitor Open source framework for processing, monitoring, and alerting on time series data Installation Kapacitor has two binaries: kapacitor – a CL

Dec 24, 2022
Prometheus Common Data Exporter can parse JSON, XML, yaml or other format data from various sources (such as HTTP response message, local file, TCP response message and UDP response message) into Prometheus metric data.

Prometheus Common Data Exporter is used to parse JSON, XML, YAML, or other formats of data from various sources (such as HTTP response messages, local files, TCP response messages, and UDP response messages) into Prometheus metric data.

May 18, 2022
A stream processing API for Go (alpha)

A data stream processing API for Go (alpha) Automi is an API for processing streams of data using idiomatic Go. Using Automi, programs can process str

Dec 28, 2022
DataKit is collection agent for DataFlux.

DataKit DataKit is collection agent for DataFlux Build Dependencies apt-get install gcc-multilib: for building oracle input apt-get install tree: for

Dec 29, 2022
Go Collection Stream API, inspired in Java 8 Stream.

GoStream gostream is a data stream processing library. It can declaratively transform, filter, sort, group, and collect data without worrying about the details of the operations. Changelog 2021-11-18 add ToSet() collector Roadmap Remove the go-linq dependency Get GoStream go get

Nov 21, 2022
Dud is a lightweight tool for versioning data alongside source code and building data pipelines.

Dud Website | Install | Getting Started | Source Code Dud is a lightweight tool for versioning data alongside source code and building data pipelines.

Jan 1, 2023
CUE is an open source data constraint language which aims to simplify tasks involving defining and using data.


Jan 1, 2023
xyr is a very lightweight, simple and powerful data ETL platform that helps you to query available data sources using SQL.

xyr [WIP] xyr is a very lightweight, simple and powerful data ETL platform that helps you to query available data sources using SQL. Supported Drivers

Dec 2, 2022
Kanzi is a modern, modular, expendable and efficient lossless data compressor implemented in Go.

kanzi Kanzi is a modern, modular, expendable and efficient lossless data compressor implemented in Go. modern: state-of-the-art algorithms are impleme

Dec 22, 2022
churro is a cloud-native Extract-Transform-Load (ETL) application designed to build, scale, and manage data pipeline applications.

Churro - ETL for Kubernetes churro is a cloud-native Extract-Transform-Load (ETL) application designed to build, scale, and manage data pipeline appli

Mar 10, 2022
Dev Lake is the one-stop solution that integrates, analyzes, and visualizes software development data

Dev Lake is the one-stop solution that integrates, analyzes, and visualizes software development data throughout the software development life cycle (SDLC) for engineering teams.

Dec 30, 2022
A library for performing data pipeline / ETL tasks in Go.

Ratchet A library for performing data pipeline / ETL tasks in Go. The Go programming language's simplicity, execution speed, and concurrency support m

Jan 19, 2022
A distributed, fault-tolerant pipeline for observability data

Table of Contents What Is Veneur? Use Case See Also Status Features Vendor And Backend Agnostic Modern Metrics Format (Or Others!) Global Aggregation

Dec 25, 2022
Data syncing in golang for ClickHouse.

ClickHouse Data Synchromesh Data syncing in golang for ClickHouse. based on go-zero ARCH A typical data warehouse architecture design of data sync Aut

Jan 1, 2023
sq is a command line tool that provides jq-style access to structured data sources such as SQL databases, or document formats like CSV or Excel.

sq: swiss-army knife for data sq is a command line tool that provides jq-style access to structured data sources such as SQL databases, or document fo

Jan 1, 2023
Machine is a library for creating data workflows.

Machine is a library for creating data workflows. These workflows can be either very concise or quite complex, even allowing for cycles for flows that need retry or self healing mechanisms.

Dec 26, 2022
Stream data into Google BigQuery concurrently using InsertAll() or BQ Storage.

bqwriter A Go package to write data into Google BigQuery concurrently with a high throughput. By default the InsertAll() API is used (REST API under t

Dec 16, 2022
Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.

Gleam Gleam is a high performance and efficient distributed execution system, and also simple, generic, flexible and easy to customize. Gleam is built

Jan 5, 2023