Open source framework for processing, monitoring, and alerting on time series data

Kapacitor

Installation

Kapacitor has two binaries:

  • kapacitor – a CLI program for calling the Kapacitor API.
  • kapacitord – the Kapacitor server daemon.

You can either download the binaries directly from the downloads page or go get them:

go get github.com/influxdata/kapacitor/cmd/kapacitor
go get github.com/influxdata/kapacitor/cmd/kapacitord

Configuration

An example configuration file can be found here

Kapacitor can also provide an example config for you using this command:

kapacitord config

Getting Started

This README gives you a high-level overview of what Kapacitor is and what it's like to use it, as well as some details of how it works. To get started using Kapacitor see this guide. After you finish the getting started exercise you can check out the TICKscripts for different Telegraf plugins.

Basic Example

Kapacitor uses a DSL named TICKscript to define tasks.

A simple TICKscript that alerts on high cpu usage looks like this:

stream
    |from()
        .measurement('cpu_usage_idle')
        .groupBy('host')
    |window()
        .period(1m)
        .every(1m)
    |mean('value')
    |eval(lambda: 100.0 - "mean")
        .as('used')
    |alert()
        .message('{{ .Level}}: {{ .Name }}/{{ index .Tags "host" }} has high cpu usage: {{ index .Fields "used" }}')
        .warn(lambda: "used" > 70.0)
        .crit(lambda: "used" > 85.0)

        // Send alert to handler of choice.

        // Slack
        .slack()
        .channel('#alerts')

        // VictorOps
        .victorOps()
        .routingKey('team_rocket')

        // PagerDuty
        .pagerDuty()

Place the above script into a file cpu_alert.tick then run these commands to start the task:

# Define the task (assumes cpu data is in db 'telegraf')
kapacitor define \
    cpu_alert \
    -type stream \
    -dbrp telegraf.default \
    -tick ./cpu_alert.tick
# Start the task
kapacitor enable cpu_alert
Comments
  • Compiled stateful expression

    Hi,

    This is one big pull request with the following bottom-line changes:

    • Performance of evaluating stateful expressions is significantly improved
    • Added 11 unit tests for stateful expressions; coverage went up from 16.2% to 18.2%
    • All tests are passing: I changed all usages of tick.NewStatefulExpr to use the new one, and all integration tests passed.
    • There are behaviour changes (the priority of errors has changed, etc.), but in my opinion they are not big
    • DurationNode is not supported
    • Currently, I have not replaced the existing stateful expression with the new one.

    Implementation

    These are explanations of the core algorithm; if more questions or clarifications are requested, I will update this.

    Basic explanation

    The overall idea: instead of using a stack-based AST interpreter, compile the expressions to specialized functions. For example, given the expression "value" > 8.0, let's make two assumptions:

    • "value" is float64
    • 8.0 is float64

    The specializer will take this expression and will eventually run float64 > float64 every time, instead of doing the following for every evaluation:

    • Type checking and guessing: checking the type of ref node and the right node type
    • Run through the whole AST tree

    Deeper explanation

    First, let's set up simple terminology:

    • Dynamic Node - a node whose value changes at runtime, like FunctionNode and ReferenceNode
    • Constant Node - a node whose value is constant for the whole lifetime of the TICKscript
    • Evaluation Function - a function that accepts three arguments: scope, left node and right node (this is a simplified version)

    When we get a BinaryNode we determine if it's dynamic or constant - let's examine the dynamic case.

    In the constructor (NewStatefulExpr), if the node is dynamic we set the evaluation function to the "dynamic evaluation function"; otherwise we fetch the matching evaluation function based on the node types and their operator.

    The dynamic evaluation function performs the following steps (this is where the "specialization" happens):

    • Read the values of the left and right nodes (for example, for a reference node we access the scope and read the value)
    • Find a matching evaluation function based on the types we got and save it (in a field of the StatefulExpression struct)
    • Call EvalBool

    The real meat is in EvalBool/EvalNum:

    1. If the evaluation function is null, it means that we have some error:
      • Type mismatch: int > string
      • Not a comparison/math operator: int - int
      • Invalid operator for type: bool > bool
    2. Otherwise we have an evaluation function and evaluate it; the evaluation function returns a bool and an error
    3. We examine the error to see whether it is our special error (ErrTypeGuardFailed), which indicates we ran the wrong comparison function. This can happen on type changes, for example when "value" started as int64 and eventually changed to float64
      • If we have this error, go to dynamic evaluation to specialise the evaluation function again
    4. Return the results: bool and error

    It's important to note that we handle single nodes as well, for example EvalBool(BoolNode).
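The specialize-then-guard flow described above can be illustrated with a small, self-contained Go sketch. This is not the code from the pull request: only ErrTypeGuardFailed and EvalBool are names taken from the description, while GreaterThan, Scope and the two comparison functions are hypothetical, and the sketch only covers a reference compared against a float64 constant.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTypeGuardFailed is the sentinel named in the description; every other
// identifier in this sketch is hypothetical.
var ErrTypeGuardFailed = errors.New("type guard failed")

// Scope maps reference names to their current values.
type Scope map[string]interface{}

// evalFunc is a comparison specialized for one left-hand operand type.
type evalFunc func(left interface{}, right float64) (bool, error)

// greaterFloat64 handles float64 > float64 and reports a guard failure otherwise.
func greaterFloat64(left interface{}, right float64) (bool, error) {
	l, ok := left.(float64)
	if !ok {
		return false, ErrTypeGuardFailed
	}
	return l > right, nil
}

// greaterInt64 handles int64 > float64 and reports a guard failure otherwise.
func greaterInt64(left interface{}, right float64) (bool, error) {
	l, ok := left.(int64)
	if !ok {
		return false, ErrTypeGuardFailed
	}
	return float64(l) > right, nil
}

// GreaterThan models the expression `"ref" > constant`.
type GreaterThan struct {
	ref      string
	constant float64
	fn       evalFunc // cached specialized function; nil until specialized
}

// specialize picks an evaluation function from the runtime type of the left operand.
func (e *GreaterThan) specialize(left interface{}) error {
	switch left.(type) {
	case float64:
		e.fn = greaterFloat64
	case int64:
		e.fn = greaterInt64
	default:
		e.fn = nil
		return fmt.Errorf("type mismatch: %T > float64", left)
	}
	return nil
}

// EvalBool runs the cached function and re-specializes when its type guard fails.
func (e *GreaterThan) EvalBool(scope Scope) (bool, error) {
	left, ok := scope[e.ref]
	if !ok {
		return false, fmt.Errorf("reference %q not found in scope", e.ref)
	}
	if e.fn != nil {
		result, err := e.fn(left, e.constant)
		if !errors.Is(err, ErrTypeGuardFailed) {
			return result, err // fast path: the cached function matched the types
		}
	}
	// First evaluation or guard failure: pick a function for the new types.
	if err := e.specialize(left); err != nil {
		return false, err
	}
	return e.fn(left, e.constant)
}

func main() {
	expr := &GreaterThan{ref: "value", constant: 8.0}
	fmt.Println(expr.EvalBool(Scope{"value": int64(10)})) // specializes for int64
	fmt.Println(expr.EvalBool(Scope{"value": 7.5}))       // guard fails, re-specializes for float64
	fmt.Println(expr.EvalBool(Scope{"value": "text"}))    // no matching function: type mismatch
}
```

As long as the operand types stay stable, each evaluation costs one type assertion and one comparison, which is the effect the benchmarks below measure.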

    Performance

    I ran the tests on a MacBook Pro (13-inch, Late 2011): i5 2.4 GHz, 8 GB RAM and 128 GB SSD. The tests ran with the flag "--count=5" and were compared using benchstat.

    EvalBool Benchmarks

    name                                                                       old time/op    new time/op    delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                    252ns ± 2%      68ns ± 1%   -73.02%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                           540ns ± 2%      41ns ± 2%   -92.33%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                             550ns ± 3%      43ns ± 3%   -92.23%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                               539ns ± 2%      40ns ± 3%   -92.56%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                    524ns ± 3%      76ns ± 3%   -85.57%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                      526ns ± 1%      78ns ± 6%   -85.21%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4             495ns ± 3%     121ns ± 2%   -75.46%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4     534ns ± 3%      94ns ± 3%   -82.37%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4       2.98µs ± 1%    1.25µs ± 3%   -58.21%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                 503ns ± 3%     118ns ± 4%   -76.49%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4         533ns ± 1%      89ns ± 4%   -83.23%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4           3.08µs ± 4%    1.25µs ± 3%   -59.33%  (p=0.008 n=5+5)
    
    name                                                                       old alloc/op   new alloc/op   delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                    18.0B ± 0%      8.0B ± 0%   -55.56%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                           72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                             72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                               72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                    64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                      64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4             49.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4     64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4        64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                 49.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4         64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4            64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    
    name                                                                       old allocs/op  new allocs/op  delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                     3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                            5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                              5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                                5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                     4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                       4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4              3.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4      4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4         4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                  3.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4          4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4             4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    

    AlertTask benchmarks

    name                     old time/op    new time/op    delta
    _T10_P500_AlertTask-4       138ms ± 5%     133ms ± 6%     ~     (p=0.421 n=5+5)
    _T10_P50000_AlertTask-4     13.7s ± 6%     13.1s ± 5%     ~     (p=0.421 n=5+5)
    _T1000_P500_AlertTask-4     13.7s ± 2%     13.0s ± 3%   -4.91%  (p=0.008 n=5+5)
    
    name                     old alloc/op   new alloc/op   delta
    _T10_P500_AlertTask-4      33.0MB ± 0%    32.0MB ± 0%   -2.85%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4    3.36GB ± 0%    3.26GB ± 0%   -2.86%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4    3.29GB ± 0%    3.19GB ± 0%   -2.90%  (p=0.008 n=5+5)
    
    name                     old allocs/op  new allocs/op  delta
    _T10_P500_AlertTask-4        466k ± 0%      408k ± 0%  -12.58%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     47.5M ± 0%     41.5M ± 0%  -12.62%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4     46.1M ± 0%     40.2M ± 0%  -12.73%  (p=0.008 n=5+5)
    

    Questions / Notes

    Tests

    I added more tests for the stateful expression to make sure we cover more cases. The coverage for the eval package is now 73.5%. I added these tests:

    • TestStatefulExpression_EvalBool_BinaryNodeWithDurationNode
    • TestStatefulExpression_EvalNum_FunctionWithTimeValue
    • TestStatefulExpression_Eval_NotSupportedNode
    • TestStatefulExpression_Eval_NodeAndEvalTypeNotMatching
    • TestStatefulExpression_EvalBool_BinaryNodeWithBoolUnaryNode
    • TestStatefulExpression_EvalBool_BinaryNodeWithNumericUnaryNode
    • TestStatefulExpression_EvalBool_TwoLevelsDeepBinaryWithEvalNum_Int64
    • TestStatefulExpression_EvalBool_TwoLevelsDeepBinaryWithEvalNum_Float64
    • TestStatefulExpression_EvalBool_SanityCallingFunction
    • TestStatefulExpression_EvalNum_SanityCallingFunctionWithArgs
    • TestStatefulExpression_EvalBool_SanityCallingFunctionWithArgs

    Important

    @nathanielc / pull request reviewer, please read these very carefully and answer them. The notes/questions are ordered by importance:

    1. I didn't test function return type changes. Is there a need to? If so, do we have a function to do so, or do I need to create a new one and stub it in?
    2. DurationNode is not supported. I saw that the stateful expression did handle DurationNode, but I can't figure out where it's used: not in BinaryNode and not as a single node (e.g. EvalNum(DurationNode))
    3. In StatefulExpression we are calling "node.eval". Why? In the new one we don't call this method and all tests are passing; are we missing tests?
    4. Creating an expression can return an error. This is new behaviour: compiling an expression can return an error, and there is a test for it (TestStatefulExpression_Eval_NotSupportedNode). Examples:
      • passing an invalid node to compile, for example: commentnode
      • passing an invalid node in binarynode
    5. @nathanielc, you requested separating the code into packages such as ast, etc. I didn't do this in this pull request because the PR is already too big
    6. I can fix #490 pretty easily, do you want me to?

    Nice-To-Haves

    These are nice-to-haves, maybe for this pull request and maybe for another:

    • Debug logs for optimising: add a debug log for when a guard fails, etc.; this can be useful in performance investigations
    • Performance optimisation (not related to this PR): in mergeFieldsAndTags we put all tags and fields in the scope. I think we can traverse the node AST, get a list of the needed scope variables and fetch only those. In my opinion this can yield a great performance improvement; I will research this after this PR gets merged

    Phew, I finished 👍 That was a really fun and educational experience, thanks @nathanielc for being open to changes :)

    - Yosi
  • Fork by measurement

    Hi,

    This pull request greatly improves performance on the write benchmarks. I made these performance improvements in 5 steps.

    All benchmarks ran on my MacBook Pro (13-inch, Late 2011) with an Intel Core i5 (2.4 GHz), 8 GB memory and a 120 GB SSD.

    Filtering by measurement

    I added a measurements map (from string to bool) to the fork struct and checked it in forkPoint. This gave the following improvement:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     8633314476     57042          -100.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        7915678886     8229547562     +3.97%
    Benchmark_Write_MeasurementNameNotMatches_100-4      37434          22472          -39.97%
    Benchmark_Write_MeasurementNameMatches_100-4         38474          41502          +7.87%
    Benchmark_Write_MeasurementNameNotMatches_10-4       22950          23601          +2.84%
    Benchmark_Write_MeasurementNameMatches_10-4          23109          24814          +7.38%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57450          50             -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        57426          57424          -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_100-4         49             49             +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     4264608       3950          -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261568       4261440       -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3889          3837          -1.34%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3889          +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3839          +0.03%
    

    These performance numbers are compared to the baseline: benchmarks run on master.
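A minimal Go sketch of the measurement filter described in this step. The field names dbrps and measurements and the function name forkPoint follow the description; the point type, the received counter and everything else are hypothetical stand-ins, and the sketch assumes every fork names the measurements it wants.

```go
package main

import "fmt"

// dbrp identifies a database and retention policy pair.
type dbrp struct {
	db, rp string
}

// point is a minimal stand-in for an incoming data point.
type point struct {
	Database, RetentionPolicy, Name string
}

// fork records which dbrps and measurements one task subscribes to.
type fork struct {
	dbrps        map[dbrp]bool
	measurements map[string]bool
	received     int // hypothetical counter standing in for "send to the task's edge"
}

// forkPoint delivers p only to forks whose dbrp and measurement both match.
func forkPoint(forks []*fork, p point) {
	key := dbrp{db: p.Database, rp: p.RetentionPolicy}
	for _, f := range forks {
		// A missing map key yields false, so non-matching points are skipped cheaply.
		if f.dbrps[key] && f.measurements[p.Name] {
			f.received++
		}
	}
}

func main() {
	telegraf := dbrp{db: "telegraf", rp: "default"}
	cpu := &fork{dbrps: map[dbrp]bool{telegraf: true}, measurements: map[string]bool{"cpu": true}}
	mem := &fork{dbrps: map[dbrp]bool{telegraf: true}, measurements: map[string]bool{"mem": true}}

	forkPoint([]*fork{cpu, mem}, point{Database: "telegraf", RetentionPolicy: "default", Name: "cpu"})
	fmt.Println(cpu.received, mem.received)
}
```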

    Changing equality order

    I tried changing this check:

    if fork.dbrps[dbrp] && fork.measurements[p.Name] {
       // ...
    }
    

    so that it checks the measurement first and then the dbrp, and got the following results:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57042          29203          -48.80%
    Benchmark_Write_MeasurementNameMatches_1000-4        8229547562     8787711023     +6.78%
    Benchmark_Write_MeasurementNameNotMatches_100-4      22472          36940          +64.38%
    Benchmark_Write_MeasurementNameMatches_100-4         41502          55299          +33.24%
    Benchmark_Write_MeasurementNameNotMatches_10-4       23601          36820          +56.01%
    Benchmark_Write_MeasurementNameMatches_10-4          24814          44957          +81.18%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     50             49             -2.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        57424          57438          +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_100-4         49             50             +2.04%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     3950          3837          -2.86%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261440       4262336       +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3837          3888          +1.33%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3953          +1.65%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3839          3838          -0.03%
    

    This compares the first step with the second. As you can see, the performance got better for Benchmark_Write_MeasurementNameNotMatches_1000-4 but worse for the other benchmarks (from +6.78% up to +81.18%).

    Change the fork structure - map from dbrp&measurement to edges

    I am skipping the fourth step, which is to take the "dbrp" struct assignment in forkPoint out of the loop, and going straight to the biggest perf improvement.

    Instead of checking every fork to see whether it matches the criteria, I pivoted the structure into a map from the criteria (db, rp, measurement) to edges.

    This gives a huge improvement:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     29203          20774          -28.86%
    Benchmark_Write_MeasurementNameMatches_1000-4        8787711023     5675405636     -35.42%
    Benchmark_Write_MeasurementNameNotMatches_100-4      36940          21771          -41.06%
    Benchmark_Write_MeasurementNameMatches_100-4         55299          36193          -34.55%
    Benchmark_Write_MeasurementNameNotMatches_10-4       36820          23315          -36.68%
    Benchmark_Write_MeasurementNameMatches_10-4          44957          24562          -45.37%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_1000-4        57438          57436          -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_100-4         50             49             -2.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     3837          3800          -0.96%
    Benchmark_Write_MeasurementNameMatches_1000-4        4262336       4262208       -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3888          3800          -2.26%
    Benchmark_Write_MeasurementNameMatches_100-4         3953          3888          -1.64%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3839          +0.03%
    

    The baseline is 'Changing equality order'

    Another sign of the performance improvement: while running "Benchmark_Write_MeasurementNameNotMatches_1000-4" on master, my 4 cores sit steadily at 99%; after this improvement, only 2 cores are at about 59% and the other 2 cores are at 9%.
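The pivot from "scan every fork" to "look up the matching edges" can be sketched in Go as below. All names here (forkKey, edge, forks, add, del) are hypothetical and only illustrate the data structure; the pull request's actual types may differ. The del method also shows why deletion becomes the slower and less readable part: it has to visit every key.

```go
package main

import "fmt"

// forkKey is the lookup criteria: database, retention policy and measurement.
type forkKey struct {
	db, rp, measurement string
}

// edge is a stand-in for the edge that feeds one task.
type edge struct {
	task     string
	received int
}

// forks maps each criteria directly to the edges interested in it.
type forks struct {
	byKey map[forkKey][]*edge
}

// add registers an edge under one key.
func (f *forks) add(key forkKey, e *edge) {
	f.byKey[key] = append(f.byKey[key], e)
}

// del removes every edge of a task; it must scan all keys.
func (f *forks) del(task string) {
	for key, edges := range f.byKey {
		kept := edges[:0]
		for _, e := range edges {
			if e.task != task {
				kept = append(kept, e)
			}
		}
		if len(kept) == 0 {
			delete(f.byKey, key)
		} else {
			f.byKey[key] = kept
		}
	}
}

// forkPoint delivers a point with a single map lookup and returns the number of receivers.
func (f *forks) forkPoint(db, rp, name string) int {
	edges := f.byKey[forkKey{db: db, rp: rp, measurement: name}]
	for _, e := range edges {
		e.received++
	}
	return len(edges)
}

func main() {
	f := &forks{byKey: map[forkKey][]*edge{}}
	f.add(forkKey{"telegraf", "default", "cpu"}, &edge{task: "cpu_alert"})
	f.add(forkKey{"telegraf", "default", "mem"}, &edge{task: "mem_alert"})

	fmt.Println(f.forkPoint("telegraf", "default", "cpu"))  // one receiver
	fmt.Println(f.forkPoint("telegraf", "default", "disk")) // no receivers, no scan over tasks
}
```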

    Final Results

    Here are the overall benchmark results, where the baseline is the benchmark results on master and the new numbers are the current state of this branch:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     8633314476     23139          -100.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        7915678886     6381307112     -19.38%
    Benchmark_Write_MeasurementNameNotMatches_100-4      37434          23787          -36.46%
    Benchmark_Write_MeasurementNameMatches_100-4         38474          34923          -9.23%
    Benchmark_Write_MeasurementNameNotMatches_10-4       22950          24076          +4.91%
    Benchmark_Write_MeasurementNameMatches_10-4          23109          25433          +10.06%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57450          48             -99.92%
    Benchmark_Write_MeasurementNameMatches_1000-4        57426          57442          +0.03%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_100-4         49             49             +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     4264608       3799          -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261568       4262592       +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3889          3800          -2.29%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3889          +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3838          +0.00%
    

    Drawbacks

    This improvement comes with one drawback: the creation and deletion of a task will be slower (I have no benchmarks, but we are doing more work, so we no longer have O(1) complexity), and the deletion is harder to read because of "Change the fork structure - map from dbrp&measurement to edges".

    I am open to suggestions on how to improve the delFork method for better readability.

  • Alert handler for Microsoft Teams

    Required for all non-trivial PRs
    • [x] Rebased/mergeable
    • [x] Tests pass
    • [x] CHANGELOG.md updated
    • [x] Sign CLA (if not already signed)
    Required only if applicable

    You can erase any checkboxes below this note if they are not applicable to your Pull Request. N/A

    This adds support for sending alerts via Microsoft Teams (similar to Slack or HipChat). I followed the alert handlers guide where possible, and when I ran into problems, I looked at the source code for other alerts (e.g., HipChat). The tests implemented follow the same pattern of tests performed by the HipChat handler.

    All tests are passing for me locally (except some unrelated UDF tests, which fail due to Python issues on my Mac).

  • JoinNode ignores Delete BarrierNode messages.

    After some testing, I found out that the JoinNode cardinality doesn't decrease when a BarrierMessage is emitted for a group that should expire. As a result, the JoinNode's cardinality increases forever, which causes a memory leak.

  • [Feature Request] Kapacitor needs a way to automatically load tick scripts from a directory.

    Having to manually invoke kapacitor for each script is pretty annoying for deployment. We should just be able to load scripts from a directory. The main goals are to put the scripts under version control and to ease deployment.

    Things that may need to be thought about:

    How does Kapacitor know which db/rp to use?

    • Could implement a directory structure: scripts/{db}/{rp}/myscript.tick

    How could templates be handled?

    • Not sure, I haven't used these yet.
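To make the directory-structure idea concrete, here is a small Go sketch that derives the database, retention policy and task name from a script path laid out as database directory, retention policy directory, then the .tick file. It is only an illustration of the proposal: taskSpec and parseTaskPath are hypothetical names, not part of Kapacitor.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// taskSpec holds what a loader would need to define a task (hypothetical type).
type taskSpec struct {
	Name            string
	Database        string
	RetentionPolicy string
}

// parseTaskPath expects path to look like <root>/<db>/<rp>/<name>.tick.
func parseTaskPath(root, path string) (taskSpec, error) {
	rel, err := filepath.Rel(root, path)
	if err != nil {
		return taskSpec{}, err
	}
	parts := strings.Split(filepath.ToSlash(rel), "/")
	if len(parts) != 3 || parts[0] == ".." || filepath.Ext(parts[2]) != ".tick" {
		return taskSpec{}, fmt.Errorf("%q does not match <db>/<rp>/<name>.tick", rel)
	}
	return taskSpec{
		Name:            strings.TrimSuffix(parts[2], ".tick"),
		Database:        parts[0],
		RetentionPolicy: parts[1],
	}, nil
}

func main() {
	spec, err := parseTaskPath("scripts", "scripts/telegraf/default/cpu_alert.tick")
	fmt.Println(spec, err)
}
```

A loader built on this could walk the scripts directory and define one task per file. The path alone does not say whether a task is a stream or a batch task, so that would still need its own convention.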
  • Add kafka as metrics consumer

    It would be awesome if, instead of relying on InfluxDB resources (querying it or adding UDP subscriptions), Kapacitor could be a more standalone solution that consumes metrics from Kafka and analyzes them over a sliding window.

    Streams are a very good fit for this feature and would complement the Kafka consumer. This integration may need to work with a small database to be able to store the sliding-window metrics for further queries.

    D.

  • Scope reusing & smaller scopes

    This pull request is an experiment. If you like the idea, we can improve the readability and the quality of the code.

    For each expression we create a "scope pool", which is an object pool of scopes with some extra magic. By doing a quick analysis of the node AST, I know which tags and fields it requires, so we put only the required ones in the scope. For example, for "value" > 10, I fill in only "value" from the fields or tags.

    name                     old time/op    new time/op    delta
    _T10_P500_AlertTask-4       133ms ± 4%     123ms ± 4%   -7.58%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     13.4s ± 8%     12.3s ± 7%     ~     (p=0.056 n=5+5)
    _T1000_P500_AlertTask-4     13.5s ± 4%     12.1s ± 3%  -10.46%  (p=0.008 n=5+5)
    
    name                     old alloc/op   new alloc/op   delta
    _T10_P500_AlertTask-4      32.2MB ± 0%    26.0MB ± 0%  -19.32%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4    3.26GB ± 0%    2.62GB ± 0%  -19.71%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4    3.21GB ± 0%    2.61GB ± 0%  -18.56%  (p=0.008 n=5+5)
    
    name                     old allocs/op  new allocs/op  delta
    _T10_P500_AlertTask-4        408k ± 0%      335k ± 0%  -17.85%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     41.5M ± 0%     34.1M ± 0%  -17.98%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4     40.2M ± 0%     33.1M ± 0%  -17.61%  (p=0.008 n=5+5)
    

    I thought about this idea while researching the performance of alerts, but before that I wanted to implement "compiled stateful expression" (#491). If we combine this pull request with #491, we will have great performance and low memory usage while evaluating predicates.
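The scope-pool idea can be sketched in Go with sync.Pool plus a one-time walk over the expression to collect the referenced names. This is a simplified illustration under assumptions: the node type is a toy AST, and scopePool, requiredRefs, Get and Put are hypothetical names rather than the pull request's API.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a toy expression AST: a reference (ref set), a literal (nothing set)
// or a binary operation (left and right set).
type node struct {
	ref         string
	left, right *node
}

// requiredRefs walks the AST and collects each referenced name once.
func requiredRefs(n *node, seen map[string]bool, out []string) []string {
	if n == nil {
		return out
	}
	if n.ref != "" && !seen[n.ref] {
		seen[n.ref] = true
		out = append(out, n.ref)
	}
	out = requiredRefs(n.left, seen, out)
	return requiredRefs(n.right, seen, out)
}

// scopePool hands out reusable scopes that hold only the required variables.
type scopePool struct {
	refs []string
	pool sync.Pool
}

func newScopePool(expr *node) *scopePool {
	refs := requiredRefs(expr, map[string]bool{}, nil)
	sp := &scopePool{refs: refs}
	sp.pool.New = func() interface{} {
		return make(map[string]interface{}, len(refs))
	}
	return sp
}

// Get returns a scope filled from fields first, then tags, for required names only.
func (sp *scopePool) Get(fields map[string]interface{}, tags map[string]string) map[string]interface{} {
	scope := sp.pool.Get().(map[string]interface{})
	for _, name := range sp.refs {
		if v, ok := fields[name]; ok {
			scope[name] = v
		} else if v, ok := tags[name]; ok {
			scope[name] = v
		}
	}
	return scope
}

// Put clears the scope and returns it to the pool for reuse.
func (sp *scopePool) Put(scope map[string]interface{}) {
	for k := range scope {
		delete(scope, k)
	}
	sp.pool.Put(scope)
}

func main() {
	// "value" > 10: only "value" is required.
	expr := &node{left: &node{ref: "value"}, right: &node{}}
	sp := newScopePool(expr)

	scope := sp.Get(
		map[string]interface{}{"value": 12.0, "other": 1.0},
		map[string]string{"host": "server01"},
	)
	fmt.Println(len(scope), scope["value"])
	sp.Put(scope)
}
```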

  • [Proposal] Make TICKscript branch points more readable

    Since TICKscript ignores whitespace it is possible to define a TICKscript that is really hard to read since it is not clear when a new node is being created vs a property is being set on a node. Example:

    stream.from()
    .groupBy('service')
    .alert()
    .id('kapacitor/{{ index .Tags "service" }}')
    .message('{{ .ID }} is {{ .Level }} value:{{ index .Fields "value" }}')
    .info(lambda: "value" > 10)
    .warn(lambda: "value" > 20)
    .crit(lambda: "value" > 30)
    .post("http://example.com/api/alert")
    .post("http://another.example.com/api/alert")
    .email().to('[email protected]')
    

    A possible solution is to use a different operator for what the docs call property methods and chaining methods, where a property method modifies a node and a chaining method creates a new node in the pipeline. Here is the example above with the new operator and unchanged whitespace:

    stream->from()
    .groupBy('service')
    ->alert()
    .id('kapacitor/{{ index .Tags "service" }}')
    .message('{{ .ID }} is {{ .Level }} value:{{ index .Fields "value" }}')
    .info(lambda: "value" > 10)
    .warn(lambda: "value" > 20)
    .crit(lambda: "value" > 30)
    .post("http://example.com/api/alert")
    .post("http://another.example.com/api/alert")
    .email().to('[email protected]')
    

    Or another example with more chaining methods:

    stream
    ->from()
    .where(lambda: ...)
    .groupBy(...)
    ->window()
    .period(10s)
    .every(10s)
    ->mapReduce(influxql.count('value')).as('value')
    ->alert()
    

    Or even an example where it is necessary to disambiguate between a property and chaining method.

    batch->query('SELECT mean(used_percent) FROM "telegraf"."default"."disk"')
          .period(10s)
          .every(10s)
          .groupBy('host','path') // We want to compute the mean by host and path
        ->groupBy() // But then we want to compute the top of all groups, so we need to change the groupBy. Without a different operator or a node between these steps it is impossible.
        ->top(2, 'mean')
        ->influxDBOut()
          .database('mean_output')
          .measurement('avg_disk')
          .retentionPolicy('default')
          .flushInterval(1s)
          .precision('s')
    

    Questions:

    • Does using a different operator make writing a TICKscript overly complex? You will not be able to define the task until you have used the correct operator for chaining vs property methods. You will have to learn via trial and error as well as by consulting the docs.
    • Is -> a good operator? Would | or something else read better?

    stream
    |from()
    .where(lambda: ...)
    .groupBy(...)
    |window()
    .period(10s)
    .every(10s)
    |mapReduce(influxql.count('value')).as('value')
    |alert()
    

    Using whitespace to further improve readability

    stream
        |from()
            .where(lambda: ...)
            .groupBy(...)
        |window()
            .period(10s)
            .every(10s)
        |mapReduce(influxql.count('value')).as('value')
        |alert()
    
  • Preserve tags to join/window

    Hi,

    I am creating a TICKscript for a measurement that has tags (server_group, dc, etc.). My script looks something like this:

    var windows = stream.from('some_measurement')
            .where(lambda: "dc" == 'europe')
        .window()
            .every(10s)
            .period(40s)

    var first = windows.first('value')
    var last = windows.last('value')

    first.join(last)
        .eval(lambda: "last.last" - "first.first").as('cvalue')
        .alert()
            // some levels..
            .post('http://some-service')
    

    The JSON my service receives does not contain all of the tags from "some_measurement" that I need. Is there a way to preserve the tags?

  • Custom JSON output for Alert Post and HttpPost Nodes

    This is a feature request for the ability to specify custom JSON output for the Alert Post and HttpPost nodes. As it stands now there is no control over how the JSON looks and adding additional elements to it is not a trivial task.

    In thinking how this might be implemented I could see a parameter that might point to a template file that could perform the mapping:

    .template(String template, Boolean appendUnusedValues)

    • template -- The path and name of the template file, or a string with the template definition. This would allow you to specify the template in the TICKscript as a var or separately as a file.
    • appendUnusedValues -- If true, any remaining tags or fields are appended to the end of the JSON. This would provide the ability to transform certain tags or fields while retaining many of the original tags and fields. If false, only the tag or field values specified in the template appear in the output JSON.

    stream
        |httpPost()
            .template('myTemplate.tmpl', true)
            .endpoint('example')

    Where the template file might look like:

    {
        "myParam1": {{tag.tagName}},
        "myParam2": {{field.fieldName}}
    }
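
To make the proposed behaviour concrete, here is a minimal Python sketch of how such a template could be rendered. It is not an existing Kapacitor feature: the `render` function, the placeholder grammar, and the sample tag and field names are all assumptions based on the request above.

```python
import json
import re

# Matches the placeholder style used in the request, e.g. {{tag.tagName}}.
PLACEHOLDER = re.compile(r"\{\{\s*(tag|field)\.(\w+)\s*\}\}")


def render(template, tags, fields, append_unused_values):
    """Render a JSON template from a point's tags and fields (hypothetical)."""
    used = set()

    def substitute(match):
        kind, name = match.group(1), match.group(2)
        used.add((kind, name))
        source = tags if kind == "tag" else fields
        # json.dumps quotes strings and leaves numbers bare.
        return json.dumps(source[name])

    body = json.loads(PLACEHOLDER.sub(substitute, template))
    if append_unused_values:
        # Append every tag or field the template did not reference.
        for kind, source in (("tag", tags), ("field", fields)):
            for name, value in source.items():
                if (kind, name) not in used:
                    body.setdefault(name, value)
    return body
```

With `append_unused_values` set to false, only the keys named in the template appear in the result; with true, the remaining tags and fields are carried along under their original names.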

    Thoughts?

  • RHEL7 failed to enable service

    After installing Kapacitor on RHEL7, trying to enable it at boot fails with this error:

    # systemctl enable kapacitor
    Failed to execute operation: Too many levels of symbolic links
    

    I believe this is due to...

    # ls -lh /etc/systemd/system
    lrwxrwxrwx. 1 root root   41 Apr  7 10:01 kapacitor.service -> /usr/lib/systemd/system/kapacitor.service
    -rw-rw-r--. 1 root root  466 Mar 22 22:47 kibana.service
    -rw-r--r--. 1 root root  511 Mar 30 13:47 logstash.service
    

    You can see the others are real .service files, but this is a symlink.
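
A quick way to check which unit files in a directory are symlinks is sketched below in Python. The helper name `symlinked_units` is made up for illustration, and the sketch only reports symlinks; whether the symlink is really the cause of the error is the reporter's hypothesis, not something this code verifies.

```python
import os


def symlinked_units(unit_dir):
    """Return {unit file name: link target} for each *.service symlink in unit_dir."""
    links = {}
    for entry in sorted(os.listdir(unit_dir)):
        path = os.path.join(unit_dir, entry)
        if entry.endswith(".service") and os.path.islink(path):
            links[entry] = os.readlink(path)
    return links
```

Calling `symlinked_units("/etc/systemd/system")` on the machine from the report would be expected to list `kapacitor.service` but not the two regular files.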

  • deps: bump flux to 0.191.0

    Required checklist

    • [ ] Sample config files updated (both /etc folder and NewDemoConfig methods) (influxdb and plutonium)
    • [ ] openapi swagger.yml updated (if modified API) - link openapi PR
    • [x] Signed CLA (if not already signed)

    Description

    Bumping flux to v0.191.0

    Context

    Kapacitor currently has a build issue with Rust 1.64.0 caused by its outdated flux dependency, so this PR updates flux to the latest version.

    • relates to https://github.com/influxdata/flux/pull/5273
    • also relates to https://github.com/Homebrew/homebrew-core/pull/118242
  • build: remove dep files in favor of go module build

    Required checklist

    • [ ] Sample config files updated (both /etc folder and NewDemoConfig methods) (influxdb and plutonium)
    • [ ] openapi swagger.yml updated (if modified API) - link openapi PR
    • [x] Signed CLA (if not already signed)

    Description

    • the project has already moved to the Go module build
    • the Go module build is the default build mechanism
    • the golang/dep project was deprecated a long time ago

    Context

    Remove unused files from the repo

  • [Bug]: Receiving opentsdb data (plaintext) causes panic

    The latest version of kapacitord, v1.6.5-1, seems to have a bug in its opentsdb handling.

    To reproduce: On a Debian 11 machine I have a netdata process that exports its metrics (opentsdb) to localhost:4242, where kapacitord is listening.

    In your repo, there are currently two versions of kapacitor available:

    • 1.6.5-1
    • 1.6.4-1

    I did an apt full-upgrade which gave me v1.6.5-1, and kapacitord now constantly fails. :( Every time a chunk of opentsdb metrics (plaintext) is received on port 4242 it says:

    Dec 14 15:25:58 netdatacentral kapacitord[1041]: ts=2022-12-14T15:25:58.592+01:00 lvl=info msg="http request" service=http host=::1 username=- start=2022-12-14T15:25:58.592460338+01:00 method=POST uri=/write?consistency=&db=_internal&precision=ns&rp=monitor protocol=HTTP/1.1 status=204 referer=- user-agent=InfluxDBClient request-id=3a524601-7bbb-11ed-800a-0666a6579300 duration=290.345µs
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: panic: not implemented
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: goroutine 109 [running]:
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/kapacitor.(*TaskMaster).WritePointsPrivileged(0x0?, {{0x4?, 0x203001?}}, {0xc001d89e80?, 0x4?}, {0x0?, 0x2000100000060?}, 0x0?, {0xc00200a000, 0x5b, ...})
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/root/kapacitor/task_master.go:273 +0x27
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/influxdb/services/opentsdb.(*Service).processBatches(0xc000124900, 0xc00235eea0)
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:483 +0x3ae
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/influxdb/services/opentsdb.(*Service).Open.func1()
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:127 +0x65
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: created by github.com/influxdata/influxdb/services/opentsdb.(*Service).Open
    Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:127 +0x2df
    Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
    Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Failed with result 'exit-code'.
    Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Service RestartSec=100ms expired, scheduling restart.
    
    (and netdata logs that it lost its connection when kapacitord restarted itself:
    Dec 14 15:25:59 netdatacentral netdata-error.log: 2022-12-14 15:25:59: netdata ERROR : MAIN : EXPORTING: 'localhost:4242' closed the socket
    )
    

    Every time a new chunk of metrics is received, kapacitord panics and restarts itself. No data is actually processed; kapacitord just panics and dies.

    I then downgraded to the other, older version available:

    apt install kapacitor=1.6.4-1
    reboot
    

    Now it works again. The plaintext opentsdb metrics are received, processed, and sent to our InfluxDB as they should be.

    I have made no changes to the configuration or the TICKscript, so the bug must be in the kapacitor package for v1.6.5-1. The regression happened after v1.6.4-1.

    I have also tried changing the netdata export to use [opentsdb:http:opentsdb_POST_to_kapacitor] (just in case the new version of kapacitor should expect HTTP-formatted metric data instead of plaintext) but that didn't work either.


    Additional info:

    A tcpdump shows that the format of the plaintext metrics is the same (i.e. it is not netdata that has changed its logging format).

    16:01:59.480522 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [S], seq 2855994911, win 65495, options [mss 65495,sackOK,TS val 2211832732 ecr 0,nop,wscale 7], length 0
    E..<.Y@.@..`.............;...........0.........
    ............
    16:01:59.480537 IP 127.0.0.1.4242 > 127.0.0.1.32932: Flags [S.], seq 861833801, ack 2855994912, win 65483, options [mss 65495,sackOK,TS val 2211832732 ecr 2211832732,nop,wscale 7], length 0
    E..<..@.@.<.............3^.I.;. .....0.........
    ............
    16:01:59.480551 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [.], ack 1, win 512, options [nop,nop,TS val 2211832733 ecr 2211832732], length 0
    E..4.Z@[email protected].............;. 3^.J.....(.....
    ........
    16:02:09.484044 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [.], seq 1:32742, ack 1, win 512, options [nop,nop,TS val 2211842736 ecr 2211832732], length 32741
    E....[@.@.:..............;. 3^.J....~......
    ..
    .....put netdata.disk_svctm.nvme0n1.svctm 1670857326 1.0000000 host=netdatacentral
    put netdata.disk_ext_avgsz.nvme0n1.discards 1670857326 0.0000000 host=netdatacentral
    put netdata.disk_avgsz.nvme0n1.reads 1670857326 0.0000000 host=netdatacentral
    put netdata.disk_avgsz.nvme0n1.writes 1670857326 -26.7857143 host=netdatacentral
    ...and so on... A few large packets are sent/received before the server sends a FIN and the next packet from the client gets an RST (since nothing is listening on tcp/4242 while kapacitord is restarting).
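
For reference, the `put` lines in the capture follow the OpenTSDB plaintext format `put <metric> <timestamp> <value> <tag=value ...>`. The Python sketch below shows how one such line breaks down; `parse_put` is a hypothetical helper written for illustration and is unrelated to the Go code path that panics in the trace above.

```python
def parse_put(line):
    """Parse one OpenTSDB plaintext `put` line into its components."""
    parts = line.split()
    if len(parts) < 4 or parts[0] != "put":
        raise ValueError("not a put line: %r" % line)
    # Everything after the value is a space-separated list of tag=value pairs.
    tags = dict(pair.split("=", 1) for pair in parts[4:])
    return {
        "metric": parts[1],
        "timestamp": int(parts[2]),   # seconds since the Unix epoch
        "value": float(parts[3]),
        "tags": tags,
    }
```

Running it on the last sample line above gives the metric `netdata.disk_avgsz.nvme0n1.writes` with a single `host` tag, which is consistent with the reporter's point that the payload itself is well formed.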
    

    Let me know if you need more conf files. Here is what I guess is the relevant stuff:

    # cat /etc/kapacitor/kapacitor.conf
    hostname = "localhost"
    data_dir = "/var/lib/kapacitor/.kapacitor"
    skip-config-overrides = false
    default-retention-policy = ""
    
    [http]
      bind-address = ":9092"
      auth-enabled = false
      log-enabled = true
      write-tracing = false
      pprof-enabled = false
      https-enabled = false
      https-certificate = "/etc/ssl/kapacitor.pem"
      https-private-key = ""
      shutdown-timeout = "10s"
      shared-secret = ""
    
    [replay]
      dir = "/var/lib/kapacitor/.kapacitor/replay"
    
    [storage]
      boltdb = "/var/lib/kapacitor/.kapacitor/kapacitor.db"
    
    [task]
      dir = "/var/lib/kapacitor/.kapacitor/tasks"
      snapshot-interval = "1m0s"
    
    [load]
      enabled = true
      dir = "/etc/kapacitor/load"
    
    [[influxdb]]
      enabled = true
      default = true
      name = "default"
      urls = ["http://localhost:8086"]
      username = ""
      password = ""
      ssl-ca = ""
      ssl-cert = ""
      ssl-key = ""
      insecure-skip-verify = false
      timeout = "0s"
      disable-subscriptions = false
      subscription-protocol = "http"
      subscription-mode = "cluster"
      kapacitor-hostname = ""
      http-port = 0
      udp-bind = ""
      udp-buffer = 1000
      udp-read-buffer = 0
      startup-timeout = "5m0s"
      subscriptions-sync-interval = "1m0s"
      [influxdb.excluded-subscriptions]
        _kapacitor = ["autogen"]
    
    [logging]
      file = "STDERR"
      level = "DEBUG"
    
    [config-override]
      enabled = true
    
    [opentsdb]
      enabled = true
      bind-address = "127.0.0.1:4242"
      database = "opentsdb"
      retention-policy = "autogen"
      consistency-level = "one"
      tls-enabled = false
      certificate = "/etc/ssl/influxdb.pem"
      batch-size = 1000
      batch-pending = 5
      batch-timeout = "1s"
      log-point-errors = true
    
    [reporting]
      enabled = false
      url = "https://usage.influxdata.com"
    
    [stats]
      enabled = true
      stats-interval = "10s"
      database = "_kapacitor"
      retention-policy = "autogen"
      timing-sample-rate = 0.1
      timing-movavg-size = 1000
    
    # Connect to a second InfluxDB
    [[influxdb]]
      enabled = true
      default = false
      name = "InfluxCloud"
      urls = ["https://blahblahblah.influxcloud.net:8086"]
      username = "blahblah"
      password = "blahblah"
      timeout = 0
    
    # cat /etc/netdata/exporting.conf
    [exporting:global]
        enabled = yes
    
    [opentsdb:opentsdb_plaintext_to_kapacitor]
        enabled = yes
        destination = localhost:4242
        data source = average
        update every = 60
        send hosts matching = *
        send charts matching = system.cpu system.uptime system.load system.entropy disk_space.* system.ram system.swap disk_ops.*
    
    # cat /etc/kapacitor/load/tasks/stream_netdata_to_influxdb.tick
    // Stream data from Netdata to remote InfluxDB
    dbrp "opentsdb"."autogen"
    
    var data = stream
        |from()
            .database('opentsdb')
            .retentionPolicy('autogen')
            .groupByMeasurement()
        |window()
            .period(1m)
            .every(1m)
    
    data
        |influxDBOut()
            .database('opentsdb')
            .retentionPolicy('autogen')
            .cluster('InfluxCloud')
    
  • JOINING kapacitor queries with count and difference methods

    I have been trying to run this script, but no alert is created in Alerta and the logs do not show any errors. Can anyone help me with this? The intended logic is:

    • avg_time_gap = sum(current_time - previous_time) / count(num_of_entries)
    • current_time_gap = current_value - previous_value
    • thresh = abs(current_time_gap - avg_time_gap)
    • alert when thresh > 0

    var window = 20s
    var every = 5s // defines the frequency at which the window is emitted to the next node in the pipeline.
    var timeout = 360s // alert expiry time
    var avg_time = batch
        |query('select 20.0/count(content) as avg_time_gap from mydb..measurement')
            .period(window)
            .every(every)
        //numerator must be as same as the window value
    
    var adjacent_time_gap = batch
        |query('select difference(content) as time_gap from mydb..measurement')
           .period(window)
           .every(every)
    
    var data = avg_time
        |join(adjacent_time_gap)
            .as('avg_time','actual_time')
            .tolerance(1s)
    
        |eval(lambda: abs("avg_time.avg_time_gap"-"actual_time.time_gap"))
            .as('time_delay')
    
    var alert = data
        |alert()
            .id(event_name)
            .crit(lambda: "time_delay" > 0)
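
The arithmetic the reporter describes can be checked outside Kapacitor with a few lines of Python. This is only a sketch of the stated formulas, assuming the values are timestamps in seconds; `gap_threshold` is a made-up name, and it does not reproduce the join semantics of the TICKscript above.

```python
def gap_threshold(times):
    """Return abs(current_time_gap - avg_time_gap) for a list of timestamps."""
    if len(times) < 2:
        raise ValueError("need at least two entries to compute a gap")
    # Differences between consecutive entries (current - previous).
    gaps = [current - previous for previous, current in zip(times, times[1:])]
    # The post divides the summed gaps by the number of entries, not the number of gaps.
    avg_time_gap = sum(gaps) / len(times)
    current_time_gap = gaps[-1]
    return abs(current_time_gap - avg_time_gap)
```

For timestamps `[0, 5, 10, 20]` the gaps are 5, 5, and 10, the average as defined is 20 / 4 = 5, and the threshold comes out as 5, so the "alert when thresh > 0" rule would fire.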
    
    
  • OK events are not generating even on matching condition on stateDuration. Data is in stream mode, and using window() node

    var db = 'telegraf'
    var rp = 'autogen'
    var measurement = 'cpu'
    var groupBy = ['host', 'gms-rule']
    var period = 5m
    var every = 1m
    var whereFilter = lambda: isPresent("gms-rule") AND "cpu" == 'cpu-total' AND isPresent("usage_pct") AND "host" == 'venu-test'

    var data = stream
        |from()
            .database(db)
            .retentionPolicy(rp)
            .measurement(measurement)
            .groupBy(groupBy)
            .where(whereFilter)
        |window()
            .period(period)
            .every(every)
            .align()
        |eval(lambda: "usage_pct")
            .as('value')
        |httpOut('stream')

    var trigger = custom
        |stateDuration(lambda: "warningEnabled" AND "value" >= "warningThreshold" AND "enable")
            .as('actual_warn_duration')
        |stateDuration(lambda: "criticalEnabled" AND "value" >= "criticalThreshold" AND "enable")
            .as('actual_crit_duration')
        |stateDuration(lambda: "value" < "warningThreshold" OR "value" < "criticalThreshold" AND global_ok_stateDuration_enabled)
            .as('ok_duration_counter')
        |httpOut('trigger')
        |log()
        |alert()
            .warn(lambda: "actual_warn_duration" >= "warningStateDuration")
            .crit(lambda: "actual_crit_duration" >= "criticalStateDuration")
            // .message(message)
            .critReset(lambda: "ok_duration_counter" > "okstateDuration")
            .warnReset(lambda: "ok_duration_counter" > "okstateDuration")
            .id(idVar)
            .levelTag(levelTag)
            .messageField(messageField)
            .stateChangesOnly()
            .log('/apps/helios/influx-ent/logs/kapacitor/a-test-ok-1.log')

    With this TICKscript I get warning or critical alerts, but they never reset (no OK events arrive), even when the critReset or warnReset conditions become true.

    Is this a bug, or does the reset not happen when I use a window() node?

    If I do not use the window() node, it works as expected: I get crit/warn alerts, and they reset back to OK when the condition matches.
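
When reasoning about scripts like this one, it can help to model what a state-duration counter produces for a sequence of points. The Python sketch below is a simplified model, not Kapacitor's implementation: it assumes the duration is reported in seconds, starts at 0 on the first point where the condition holds, and is -1 whenever the condition is false. It says nothing about how `window()` batches points, which is the open question in this report.

```python
def state_durations(points, predicate):
    """Model a state-duration counter over (time_in_seconds, value) points."""
    durations = []
    start = None  # time at which the current "true" run began
    for time_s, value in points:
        if predicate(value):
            if start is None:
                start = time_s
            durations.append(float(time_s - start))
        else:
            # A false point resets the run; the model reports -1 for it.
            start = None
            durations.append(-1.0)
    return durations
```

In this model a reset condition such as `"ok_duration_counter" > "okstateDuration"` can only become true after the OK state has held continuously for longer than the configured duration.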

  • Kapacitor alert template adding escape character for double quotes

    I am using Kapacitor to send alerts to Elasticsearch via the post method. That did not work, so to debug further I used a local Django application to inspect what gets posted. Using an alert template, I am able to get the data in the desired format: {"\"id\": \"4g-alert\",\"time\": \"2022-10-19 13:30:01 +0000 UTC\",\"tag1\": \"0005B951C718_10213\",\"KPI_Name\": \"KPI_name,\"KPI_value\": \"0\",\"Level\": CRITICAL,\"previousLevel\": CRITICAL\""}

    However, the Kapacitor alert adds '\' as an escape character. When I sent the same JSON using a curl command, it worked. Is there any way to remove the backslash (\), or is there another solution for this?
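
Escaped quotes like these typically appear when text that is already JSON gets encoded as a JSON string a second time. Whether that is what happens inside Kapacitor here is an assumption, but the effect itself is easy to reproduce in Python with a made-up payload:

```python
import json

# Hypothetical alert payload, loosely based on the fields in the report.
payload = {"id": "4g-alert", "Level": "CRITICAL"}

encoded_once = json.dumps(payload)        # a normal JSON object
encoded_twice = json.dumps(encoded_once)  # the object wrapped in a JSON string

# The second encoding escapes every inner double quote with a backslash.
has_backslash_once = "\\" in encoded_once
has_backslash_twice = "\\" in encoded_twice

# Decoding twice recovers the original object.
recovered = json.loads(json.loads(encoded_twice))
```

If the receiving service sees the doubly encoded form, decoding it twice on the receiving side is one workaround while the template is being fixed.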
