The latest version of kapacitord, v1.6.5-1, seems to have a bug in its OpenTSDB handling.
To reproduce:
On a Debian 11 machine I have a netdata process that exports its metrics (OpenTSDB plaintext) to localhost:4242, where kapacitord is listening.
In your repo there are currently two versions of kapacitor available. I did an apt full-upgrade, which gave me v1.6.5-1, and kapacitord now fails constantly. :(
Every time a chunk of plaintext OpenTSDB metrics is received on port 4242, it logs:
Dec 14 15:25:58 netdatacentral kapacitord[1041]: ts=2022-12-14T15:25:58.592+01:00 lvl=info msg="http request" service=http host=::1 username=- start=2022-12-14T15:25:58.592460338+01:00 method=POST uri=/write?consistency=&db=_internal&precision=ns&rp=monitor protocol=HTTP/1.1 status=204 referer=- user-agent=InfluxDBClient request-id=3a524601-7bbb-11ed-800a-0666a6579300 duration=290.345µs
Dec 14 15:26:00 netdatacentral kapacitord[1041]: panic: not implemented
Dec 14 15:26:00 netdatacentral kapacitord[1041]: goroutine 109 [running]:
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/kapacitor.(*TaskMaster).WritePointsPrivileged(0x0?, {{0x4?, 0x203001?}}, {0xc001d89e80?, 0x4?}, {0x0?, 0x2000100000060?}, 0x0?, {0xc00200a000, 0x5b, ...})
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/root/kapacitor/task_master.go:273 +0x27
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/influxdb/services/opentsdb.(*Service).processBatches(0xc000124900, 0xc00235eea0)
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:483 +0x3ae
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/influxdb/services/opentsdb.(*Service).Open.func1()
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:127 +0x65
Dec 14 15:26:00 netdatacentral kapacitord[1041]: created by github.com/influxdata/influxdb/services/opentsdb.(*Service).Open
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/[email protected]/services/opentsdb/service.go:127 +0x2df
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Failed with result 'exit-code'.
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Service RestartSec=100ms expired, scheduling restart.
(and netdata logs that it lost its connection when kapacitord restarted itself:
Dec 14 15:25:59 netdatacentral netdata-error.log: 2022-12-14 15:25:59: netdata ERROR : MAIN : EXPORTING: 'localhost:4242' closed the socket
)
Every time a new chunk of metrics arrives, kapacitord panics and restarts; no data is ever processed.
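For a minimal repro without netdata, sending a single plaintext put line over TCP should be enough to trigger the panic. A sketch (the metric name, timestamp, and tag are just copied from my capture below; any valid put line should do):

```python
import socket

# Build one OpenTSDB telnet-style "put" line:
#   put <metric> <unix-timestamp> <value> <tag>=<value> ...
def put_line(metric, ts, value, **tags):
    tag_str = " ".join(f"{k}={v}" for k, v in tags.items())
    return f"put {metric} {ts} {value} {tag_str}\n"

line = put_line("netdata.disk_svctm.nvme0n1.svctm", 1670857326,
                "1.0000000", host="netdatacentral")

# Send it to the kapacitord OpenTSDB listener; on v1.6.5-1 this alone
# is enough to make the daemon panic and exit.
try:
    with socket.create_connection(("localhost", 4242), timeout=2) as s:
        s.sendall(line.encode())
except OSError as e:
    print("could not send:", e)
```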
I then downgraded to the other, older version available:
apt install kapacitor=1.6.4-1
reboot
Now it works again: the plaintext OpenTSDB metrics are received, processed, and sent to our InfluxDB as they should be.
I have made no changes to the configuration or the TICK script, so the bug must be in the kapacitor v1.6.5-1 package.
The regression happened after v1.6.4-1.
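To keep a later apt full-upgrade from pulling v1.6.5-1 back in, an apt pin like this holds the working version (the file name and priority are my own choices):

```
# /etc/apt/preferences.d/kapacitor
Package: kapacitor
Pin: version 1.6.4-1
Pin-Priority: 1001
```

A priority above 1000 also permits the initial downgrade itself.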
I have also tried changing the netdata export to use [opentsdb:http:opentsdb_POST_to_kapacitor]
(just in case the new version of kapacitor expects HTTP-formatted metric data instead of plaintext), but that didn't work either.
Additional info:
A tcpdump shows that the format of the plaintext metrics is unchanged (i.e. it is not netdata that has changed its export format).
16:01:59.480522 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [S], seq 2855994911, win 65495, options [mss 65495,sackOK,TS val 2211832732 ecr 0,nop,wscale 7], length 0
E..<.Y@.@..`.............;...........0.........
............
16:01:59.480537 IP 127.0.0.1.4242 > 127.0.0.1.32932: Flags [S.], seq 861833801, ack 2855994912, win 65483, options [mss 65495,sackOK,TS val 2211832732 ecr 2211832732,nop,wscale 7], length 0
E..<..@.@.<.............3^.I.;. .....0.........
............
16:01:59.480551 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [.], ack 1, win 512, options [nop,nop,TS val 2211832733 ecr 2211832732], length 0
E..4.Z@[email protected].............;. 3^.J.....(.....
........
16:02:09.484044 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [.], seq 1:32742, ack 1, win 512, options [nop,nop,TS val 2211842736 ecr 2211832732], length 32741
E....[@.@.:..............;. 3^.J....~......
..
.....put netdata.disk_svctm.nvme0n1.svctm 1670857326 1.0000000 host=netdatacentral
put netdata.disk_ext_avgsz.nvme0n1.discards 1670857326 0.0000000 host=netdatacentral
put netdata.disk_avgsz.nvme0n1.reads 1670857326 0.0000000 host=netdatacentral
put netdata.disk_avgsz.nvme0n1.writes 1670857326 -26.7857143 host=netdatacentral
...and so on... A few large packets are sent/received before the server sends a FIN, and the next packet from the client gets a RST (since nothing is listening on tcp/4242 while kapacitord is restarting).
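For reference, each captured line follows the OpenTSDB telnet-style put format, and a quick sanity parse (my own sketch, not the server's actual parser) extracts the expected fields from the lines above:

```python
# Split a telnet-style OpenTSDB line into its fields:
#   put <metric> <unix-timestamp> <value> <tag>=<value> ...
def parse_put(line):
    parts = line.split()
    assert parts[0] == "put", "not a put line"
    metric, ts, value = parts[1], int(parts[2]), float(parts[3])
    tags = dict(t.split("=", 1) for t in parts[4:])
    return metric, ts, value, tags

sample = ("put netdata.disk_avgsz.nvme0n1.writes 1670857326 "
          "-26.7857143 host=netdatacentral")
print(parse_put(sample))
```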
Let me know if you need more config files; here is what I guess is the relevant part:
# cat /etc/kapacitor/kapacitor.conf
hostname = "localhost"
data_dir = "/var/lib/kapacitor/.kapacitor"
skip-config-overrides = false
default-retention-policy = ""
[http]
bind-address = ":9092"
auth-enabled = false
log-enabled = true
write-tracing = false
pprof-enabled = false
https-enabled = false
https-certificate = "/etc/ssl/kapacitor.pem"
https-private-key = ""
shutdown-timeout = "10s"
shared-secret = ""
[replay]
dir = "/var/lib/kapacitor/.kapacitor/replay"
[storage]
boltdb = "/var/lib/kapacitor/.kapacitor/kapacitor.db"
[task]
dir = "/var/lib/kapacitor/.kapacitor/tasks"
snapshot-interval = "1m0s"
[load]
enabled = true
dir = "/etc/kapacitor/load"
[[influxdb]]
enabled = true
default = true
name = "default"
urls = ["http://localhost:8086"]
username = ""
password = ""
ssl-ca = ""
ssl-cert = ""
ssl-key = ""
insecure-skip-verify = false
timeout = "0s"
disable-subscriptions = false
subscription-protocol = "http"
subscription-mode = "cluster"
kapacitor-hostname = ""
http-port = 0
udp-bind = ""
udp-buffer = 1000
udp-read-buffer = 0
startup-timeout = "5m0s"
subscriptions-sync-interval = "1m0s"
[influxdb.excluded-subscriptions]
_kapacitor = ["autogen"]
[logging]
file = "STDERR"
level = "DEBUG"
[config-override]
enabled = true
[opentsdb]
enabled = true
bind-address = "127.0.0.1:4242"
database = "opentsdb"
retention-policy = "autogen"
consistency-level = "one"
tls-enabled = false
certificate = "/etc/ssl/influxdb.pem"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"
log-point-errors = true
[reporting]
enabled = false
url = "https://usage.influxdata.com"
[stats]
enabled = true
stats-interval = "10s"
database = "_kapacitor"
retention-policy = "autogen"
timing-sample-rate = 0.1
timing-movavg-size = 1000
# Connect to a second InfluxDB
[[influxdb]]
enabled = true
default = false
name = "InfluxCloud"
urls = ["https://blahblahblah.influxcloud.net:8086"]
username = "blahblah"
password = "blahblah"
timeout = 0
# cat /etc/netdata/exporting.conf
[exporting:global]
enabled = yes
[opentsdb:opentsdb_plaintext_to_kapacitor]
enabled = yes
destination = localhost:4242
data source = average
update every = 60
send hosts matching = *
send charts matching = system.cpu system.uptime system.load system.entropy disk_space.* system.ram system.swap disk_ops.*
# cat /etc/kapacitor/load/tasks/stream_netdata_to_influxdb.tick
// Stream data from Netdata to remote InfluxDB
dbrp "opentsdb"."autogen"
var data = stream
|from()
.database('opentsdb')
.retentionPolicy('autogen')
.groupByMeasurement()
|window()
.period(1m)
.every(1m)
data
|influxDBOut()
.database('opentsdb')
.retentionPolicy('autogen')
.cluster('InfluxCloud')