The latest version of kapacitord, v1.6.5-1, seems to have a bug in its OpenTSDB handling.
To reproduce:
On a Debian 11 machine I have a netdata process that exports its metrics (OpenTSDB plaintext) to localhost:4242, where kapacitord is listening.
In your repo there are currently two versions of kapacitor available. I did an apt full-upgrade, which gave me v1.6.5-1, and kapacitord now fails constantly. :(
Every time a chunk of plaintext OpenTSDB metrics is received on port 4242, it logs:
Dec 14 15:25:58 netdatacentral kapacitord[1041]: ts=2022-12-14T15:25:58.592+01:00 lvl=info msg="http request" service=http host=::1 username=- start=2022-12-14T15:25:58.592460338+01:00 method=POST uri=/write?consistency=&db=_internal&precision=ns&rp=monitor protocol=HTTP/1.1 status=204 referer=- user-agent=InfluxDBClient request-id=3a524601-7bbb-11ed-800a-0666a6579300 duration=290.345µs
Dec 14 15:26:00 netdatacentral kapacitord[1041]: panic: not implemented
Dec 14 15:26:00 netdatacentral kapacitord[1041]: goroutine 109 [running]:
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/kapacitor.(*TaskMaster).WritePointsPrivileged(0x0?, {{0x4?, 0x203001?}}, {0xc001d89e80?, 0x4?}, {0x0?, 0x2000100000060?}, 0x0?, {0xc00200a000, 0x5b, ...})
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/root/kapacitor/task_master.go:273 +0x27
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/influxdb/services/opentsdb.(*Service).processBatches(0xc000124900, 0xc00235eea0)
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:483 +0x3ae
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/influxdb/services/opentsdb.(*Service).Open.func1()
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:127 +0x65
Dec 14 15:26:00 netdatacentral kapacitord[1041]: created by github.com/influxdata/influxdb/services/opentsdb.(*Service).Open
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:127 +0x2df
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Failed with result 'exit-code'.
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Service RestartSec=100ms expired, scheduling restart.
(and netdata logs that it lost its connection when kapacitord restarted itself:
Dec 14 15:25:59 netdatacentral netdata-error.log: 2022-12-14 15:25:59: netdata ERROR : MAIN : EXPORTING: 'localhost:4242' closed the socket
)
Every time a new chunk of metrics arrives, kapacitord panics and restarts; no data is ever processed.
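For a minimal repro without netdata, sending a single plaintext put line over TCP should be enough to trigger the panic. A sketch (the metric name, timestamp, and tag are just copied from my capture below; any valid put line should do):

```python
import socket

# Build one OpenTSDB telnet-style "put" line:
#   put <metric> <unix-timestamp> <value> <tag>=<value> ...
def put_line(metric, ts, value, **tags):
    tag_str = " ".join(f"{k}={v}" for k, v in tags.items())
    return f"put {metric} {ts} {value} {tag_str}\n"

line = put_line("netdata.disk_svctm.nvme0n1.svctm", 1670857326,
                "1.0000000", host="netdatacentral")

# Send it to the kapacitord OpenTSDB listener; on v1.6.5-1 this alone
# is enough to make the daemon panic and exit.
try:
    with socket.create_connection(("localhost", 4242), timeout=2) as s:
        s.sendall(line.encode())
except OSError as e:
    print("could not send:", e)
```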
I then downgraded to the other, older version available:
apt install kapacitor=1.6.4-1
reboot
Now it works again: the plaintext OpenTSDB metrics are received, processed, and sent to our InfluxDB as they should be.
I have made no changes to the configuration or the TICK script, so the bug must be in the kapacitor v1.6.5-1 package.
The regression happened after v1.6.4-1.
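To keep a later apt full-upgrade from pulling v1.6.5-1 back in, an apt pin like this holds the working version (the file name and priority are my own choices):

```
# /etc/apt/preferences.d/kapacitor
Package: kapacitor
Pin: version 1.6.4-1
Pin-Priority: 1001
```

A priority above 1000 also permits the initial downgrade itself.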
I have also tried changing the netdata export to use [opentsdb:http:opentsdb_POST_to_kapacitor]
(just in case the new version of kapacitor expects HTTP-formatted metric data instead of plaintext), but that didn't work either.
Additional info:
A tcpdump shows that the format of the plaintext metrics is unchanged (i.e. it is not netdata that has changed its export format).
16:01:59.480522 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [S], seq 2855994911, win 65495, options [mss 65495,sackOK,TS val 2211832732 ecr 0,nop,wscale 7], length 0
E..<.Y@.@..`.............;...........0.........
............
16:01:59.480537 IP 127.0.0.1.4242 > 127.0.0.1.32932: Flags [S.], seq 861833801, ack 2855994912, win 65483, options [mss 65495,sackOK,TS val 2211832732 ecr 2211832732,nop,wscale 7], length 0
E..<..@.@.<.............3^.I.;. .....0.........
............
16:01:59.480551 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [.], ack 1, win 512, options [nop,nop,TS val 2211832733 ecr 2211832732], length 0
E..4.Z@[email protected].............;. 3^.J.....(.....
........
16:02:09.484044 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [.], seq 1:32742, ack 1, win 512, options [nop,nop,TS val 2211842736 ecr 2211832732], length 32741
E....[@.@.:..............;. 3^.J....~......
..
.....put netdata.disk_svctm.nvme0n1.svctm 1670857326 1.0000000 host=netdatacentral
put netdata.disk_ext_avgsz.nvme0n1.discards 1670857326 0.0000000 host=netdatacentral
put netdata.disk_avgsz.nvme0n1.reads 1670857326 0.0000000 host=netdatacentral
put netdata.disk_avgsz.nvme0n1.writes 1670857326 -26.7857143 host=netdatacentral
...and so on... A few large packets are sent/received before the server sends a FIN, and the next packet from the client gets a RST (since nothing is listening on tcp/4242 while kapacitord is restarting).
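For reference, each captured line follows the OpenTSDB telnet-style put format, and a quick sanity parse (my own sketch, not the server's actual parser) extracts the expected fields from the lines above:

```python
# Split a telnet-style OpenTSDB line into its fields:
#   put <metric> <unix-timestamp> <value> <tag>=<value> ...
def parse_put(line):
    parts = line.split()
    assert parts[0] == "put", "not a put line"
    metric, ts, value = parts[1], int(parts[2]), float(parts[3])
    tags = dict(t.split("=", 1) for t in parts[4:])
    return metric, ts, value, tags

sample = ("put netdata.disk_avgsz.nvme0n1.writes 1670857326 "
          "-26.7857143 host=netdatacentral")
print(parse_put(sample))
```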
Let me know if you need more config files; here is what I guess is the relevant part:
# cat /etc/kapacitor/kapacitor.conf
hostname = "localhost"
data_dir = "/var/lib/kapacitor/.kapacitor"
skip-config-overrides = false
default-retention-policy = ""
[http]
bind-address = ":9092"
auth-enabled = false
log-enabled = true
write-tracing = false
pprof-enabled = false
https-enabled = false
https-certificate = "/etc/ssl/kapacitor.pem"
https-private-key = ""
shutdown-timeout = "10s"
shared-secret = ""
[replay]
dir = "/var/lib/kapacitor/.kapacitor/replay"
[storage]
boltdb = "/var/lib/kapacitor/.kapacitor/kapacitor.db"
[task]
dir = "/var/lib/kapacitor/.kapacitor/tasks"
snapshot-interval = "1m0s"
[load]
enabled = true
dir = "/etc/kapacitor/load"
[[influxdb]]
enabled = true
default = true
name = "default"
urls = ["http://localhost:8086"]
username = ""
password = ""
ssl-ca = ""
ssl-cert = ""
ssl-key = ""
insecure-skip-verify = false
timeout = "0s"
disable-subscriptions = false
subscription-protocol = "http"
subscription-mode = "cluster"
kapacitor-hostname = ""
http-port = 0
udp-bind = ""
udp-buffer = 1000
udp-read-buffer = 0
startup-timeout = "5m0s"
subscriptions-sync-interval = "1m0s"
[influxdb.excluded-subscriptions]
_kapacitor = ["autogen"]
[logging]
file = "STDERR"
level = "DEBUG"
[config-override]
enabled = true
[opentsdb]
enabled = true
bind-address = "127.0.0.1:4242"
database = "opentsdb"
retention-policy = "autogen"
consistency-level = "one"
tls-enabled = false
certificate = "/etc/ssl/influxdb.pem"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"
log-point-errors = true
[reporting]
enabled = false
url = "https://usage.influxdata.com"
[stats]
enabled = true
stats-interval = "10s"
database = "_kapacitor"
retention-policy = "autogen"
timing-sample-rate = 0.1
timing-movavg-size = 1000
# Connect to a second InfluxDB
[[influxdb]]
enabled = true
default = false
name = "InfluxCloud"
urls = ["https://blahblahblah.influxcloud.net:8086"]
username = "blahblah"
password = "blahblah"
timeout = 0
# cat /etc/netdata/exporting.conf
[exporting:global]
enabled = yes
[opentsdb:opentsdb_plaintext_to_kapacitor]
enabled = yes
destination = localhost:4242
data source = average
update every = 60
send hosts matching = *
send charts matching = system.cpu system.uptime system.load system.entropy disk_space.* system.ram system.swap disk_ops.*
# cat /etc/kapacitor/load/tasks/stream_netdata_to_influxdb.tick
// Stream data from Netdata to remote InfluxDB
dbrp "opentsdb"."autogen"
var data = stream
|from()
.database('opentsdb')
.retentionPolicy('autogen')
.groupByMeasurement()
|window()
.period(1m)
.every(1m)
data
|influxDBOut()
.database('opentsdb')
.retentionPolicy('autogen')
.cluster('InfluxCloud')