💡 A Distributed and High-Performance Monitoring System. The next generation of Open-Falcon

Nightingale


About Nightingale

Nightingale is a distributed, highly available operations monitoring system. Its defining feature is hybrid-cloud support: it covers both traditional physical-machine/VM environments and Kubernetes container environments. Nightingale is also more than monitoring; it ships some CMDB and automation capabilities, and many companies have built their own operations platforms on top of it. The open-source modules are also part of the commercial edition, so they are dependable and will keep being maintained; you can use them with confidence. A screenshot:

(screenshot: Nightingale UI)

OCE Certification

OCE is a certification program and exchange platform tailored for companies running Nightingale in production. We provide OCE members with better technical support, such as dedicated technical salons, one-on-one exchanges with the team, and a members-only Q&A group. If your company has taken Nightingale into production, come join us.

Documentation

Community

Follow the official WeChat account Obsuite and reply "夜莺加群" to join the community chat group.


Comments
  • OIDC is configured, but the OIDC login option does not show up. What is going on?


    Relevant server.conf | webapi.conf

    [OIDC]
    Enable = true
    RedirectURL = "http://ip:18000/callback"
    SsoAddr = "http://ip/oidc/login"
    ClientId = "xxxxxxxxxxxxxx"
    ClientSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    CoverAttributes = true
    # default roles
    DefaultRoles = ["Standard"]
    
    # attribute mapping
    [OIDC.Attributes]
    Nickname = "nickname"
    Phone = "phone_number"
    Email = "email"
    

    Relevant logs

    OIDC is configured, but the OIDC login option does not show up. What is going on?
    

    System info

    OIDC is configured, but the OIDC login option does not show up. What is going on?

    Steps to reproduce

    1. OIDC is configured, but the OIDC login option does not show up. 2. 3. ...

    Expected behavior

    OIDC is configured, but the OIDC login option does not show up. What is going on?

    Actual behavior

    OIDC is configured, but the OIDC login option does not show up. What is going on?

    Additional info

    OIDC is configured, but the OIDC login option does not show up. What is going on?
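
    A quick way to rule out the IdP side is the standard OIDC discovery document, which every spec-compliant provider serves. Below is a minimal Go probe, assuming the SsoAddr host (the "http://ip" placeholder above) is the OIDC issuer; it is also worth confirming the [OIDC] block sits in webapi.conf, since the webapi process is the one that renders the login page:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    func main() {
        issuer := "http://ip" // placeholder host from SsoAddr above
        client := &http.Client{Timeout: 5 * time.Second}
        // Every spec-compliant OIDC provider serves this document.
        resp, err := client.Get(issuer + "/.well-known/openid-configuration")
        if err != nil {
            fmt.Println("IdP unreachable:", err)
            return
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        // Expect 200 and JSON listing authorization_endpoint, token_endpoint, ...
        fmt.Println(resp.Status)
        fmt.Println(string(body))
    }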

  • Help wanted: using ibex with categraf-v0.2.10 for self-healing scripts, the self-healing script configured on the alert rule does not take effect. categraf and ibex-agent are installed on one server, and the Nightingale stack runs via Docker on another server. Looking for troubleshooting ideas.


    Relevant server.conf | webapi.conf

    cat /data/categraf-v0.2.10-linux-amd64/conf/config.toml
    [global]
    # whether print configs
    print_configs = false
    
    # add label(agent_hostname) to series
    # "" -> auto detect hostname
    # "xx" -> use specified string xx
    # "$hostname" -> auto detect hostname
    # "$ip" -> auto detect ip
    # "$hostname-$ip" -> auto detect hostname and ip to replace the vars
    hostname = "$ip"
    
    
    /ibex/etc/agentd.conf configuration
    # debug, release
    RunMode = "debug"
    
    # task meta storage dir
    MetaDir = "./meta"
    
    [HTTP]
    Enable = true
    # http listening address
    Host = "0.0.0.0"
    # http listening port
    Port = 2090
    # https cert file path
    CertFile = ""
    # https key file path
    KeyFile = ""
    # whether print access log
    PrintAccessLog = true
    # whether enable pprof
    PProf = false
    # http graceful shutdown timeout, unit: s
    ShutdownTimeout = 30
    # max content length: 64M
    MaxContentLength = 67108864
    # http server read timeout, unit: s
    ReadTimeout = 20
    # http server write timeout, unit: s
    WriteTimeout = 40
    # http server idle timeout, unit: s
    IdleTimeout = 120
    
    [Heartbeat]
    # unit: ms
    Interval = 1000
    # rpc servers
    Servers = ["172.18.89.20:20090"]
    # $ip or $hostname or specified string
    Host = "172.18.89.18"
    

    Relevant logs

    none
    

    System info

    Frontend version: 5.9.0; backend version: v5.10.3-23d7e5a7de5d0ffea7ab941e9621ef7f53071775

    Steps to reproduce

    ...

    Expected behavior

    When the alert rule fires, the self-healing script is not triggered.

    Actual behavior

    none

    Additional info

    none
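
    Since the agent and the Dockerized n9e stack sit on different hosts, a basic reachability check of the ibex server RPC port from the agent host is worth doing first. A minimal Go probe, using the Heartbeat.Servers address from agentd.conf above:

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Address from Heartbeat.Servers in agentd.conf above; when n9e runs
        // in Docker, this port must be published (mapped) to the host.
        addr := "172.18.89.20:20090"
        conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
        if err != nil {
            fmt.Println("cannot reach ibex server:", err)
            return
        }
        conn.Close()
        fmt.Println("tcp ok:", addr)
    }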

  • 3.0 UI does not render correctly


    Installed according to the installation steps; the problem is probably in the frontend UI. The console reports the following error (logged twice):

    SyntaxError: Unexpected end of JSON input
        at JSON.parse ()
        at layout-2ad14e8510d33d9e5c84.js:81829
        at ia (layout-2ad14e8510d33d9e5c84.js:59318)
        at La (layout-2ad14e8510d33d9e5c84.js:59318)
        at Va (layout-2ad14e8510d33d9e5c84.js:59318)
        at Ha (layout-2ad14e8510d33d9e5c84.js:59318)
        at Mc (layout-2ad14e8510d33d9e5c84.js:59318)
        at xc (layout-2ad14e8510d33d9e5c84.js:59318)
        at yc (layout-2ad14e8510d33d9e5c84.js:59318)
        at Ga (layout-2ad14e8510d33d9e5c84.js:59318)

  • The AND condition in alert configuration seems buggy


    Current alert policy configuration: image

    Current state of the corresponding metrics: image

    The goal: fire an alert when both proc.port.listen < 1 and file.lock.exist < 1 are satisfied.
    The current state does satisfy the alert condition, yet no alert is triggered.

    When each condition is configured on its own, without the AND, both trigger alerts correctly.

  • Nightingale v3: no alert event is generated when the agent stops


    After redeploying Nightingale from v2 to v3, all other alerts generate alert events normally, but stopping the agent no longer generates one.

    The monitoring policy is as follows:

    { "name": "监控agent失联", "category": 1, "alert_dur": 60, "recovery_dur": 0, "recovery_notify": 1, "enable_stime": "00:00", "enable_etime": "23:59", "priority": 1, "exprs": [ { "eopt": "=", "func": "nodata", "metric": "proc.agent.alive", "params": [], "threshold": 0 } ], "tags": [], "enable_days_of_week": [ 0, 1, 2, 3, 4, 5, 6 ], "converge": [ 36000, 1 ], "endpoints": null },

    All other alerts work fine, and all my monitoring policies are placed on a single parent node. When the agent stops, the metric graphs clearly show that proc.agent.alive stops reporting, yet no event appears under active (unrecovered) alerts. Why? Any pointers would be appreciated.

  • Data reported via the API queries back as NaN


    Relevant server.conf | webapi.conf

    tsdb.yml
    rrd:
      storage: /home/storage/n9e_data/8011
    cache:
      keepMinutes: 120
    logger:
      dir: logs/tsdb
      level: WARNING
      keepHours: 2
    
    transfer.yml
    backend:
      datasource: "tsdb"
      m3db:
        enabled: false
        maxSeriesPoints: 720                       # default 720
        name: "m3db"
        namespace: "default"
        seriesLimit: 0
        docsLimit: 0
        daysLimit: 7                               # max query time
        # https://m3db.github.io/m3/m3db/architecture/consistencylevels/
        writeConsistencyLevel: "majority"          # one|majority|all
        readConsistencyLevel: "unstrict_majority"  # one|unstrict_majority|majority|all
        config:
          service:
            # KV environment, zone, and service from which to write/read KV data (placement
            # and configuration). Leave these as the default values unless you know what
            # you're doing.
            env: default_env
            zone: embedded
            service: m3db
            etcdClusters:
              - zone: embedded
                endpoints:
                  - 127.0.0.1:2379
                tls:
                  caCrtPath: /etc/etcd/certs/ca.pem
                  crtPath: /etc/etcd/certs/etcd-client.pem
                  keyPath: /etc/etcd/certs/etcd-client-key.pem
      tsdb:
        enabled: true
        name: "tsdb"
        cluster:
          tsdb01: 127.0.0.1:8011
      influxdb:
        enabled: false
        username: "influx"
        password: "admin123"
        precision: "s"
        database: "n9e"
        address: "http://127.0.0.1:8086"
      opentsdb:
        enabled: false
        address: "127.0.0.1:4242"
      kafka:
        enabled: false
        brokersPeers: "192.168.1.1:9092,192.168.1.2:9092"
        topic: "n9e"
    logger:
      dir: logs/transfer
      level: INFO
      keepHours: 24
    

    Relevant logs

    2022-07-06 18:39:47.942693 WARNING rpc/query.go:118 debug: true, /home/storage/n9e_data/8011/cd/cdf3bd9e6ba35f20b66aa65ddd330365_GAUGE_7200.rrd
    2022-07-06 18:39:47.943026 WARNING rpc/query.go:145 data: [<RRDData:Value:NaN TS:1654480800 2022-06-06 10:00:00> <RRDData:Value:NaN TS:1654488000 2022-06-06 12:00:00> <RRDData:Value:NaN TS:1654495200 2022-06-06 14:00:00> <RRDData:Value:NaN TS:1654502400 2022-06-06 16:00:00> <RRDData:Value:NaN TS:1654509600 2022-06-06 18:00:00> <RRDData:Value:NaN TS:1654516800 2022-06-06 20:00:00> <RRDData:Value:NaN TS:1654524000 2022-06-06 22:00:00> <RRDData:Value:NaN TS:1654531200 2022-06-07 00:00:00> <RRDData:Value:NaN TS:1654538400 2022-06-07 02:00:00> <RRDData:Value:NaN TS:1654545600 2022-06-07 04:00:00> <RRDData:Value:NaN TS:1654552800 2022-06-07 06:00:00> <RRDData:Value:NaN TS:1654560000 2022-06-07 08:00:00> <RRDData:Value:NaN TS:1654567200 2022-06-07 10:00:00> <RRDData:Value:NaN TS:1654574400 2022-06-07 12:00:00> <RRDData:Value:NaN TS:1654581600 2022-06-07 14:00:00> <RRDData:Value:NaN TS:1654588800 2022-06-07 16:00:00> <RRDData:Value:NaN TS:1654596000 2022-06-07 18:00:00> <RRDData:Value:NaN TS:1654603200 2022-06-07 20:00:00> <RRDData:Value:NaN TS:1654610400 2022-06-07 22:00:00> <RRDData:Value:NaN TS:1654617600 2022-06-08 00:00:00>
    

    System info

    n9e 3.8.0

    Steps to reproduce

    1. Batch-report previously collected historical data via /api/transfer/data. 2. The charts on the monitoring page show no data. 3. The logs show the data files open normally, but some of the queried Values are NaN.

    Expected behavior

    Data can be queried normally and the charts render.

    Actual behavior

    The charts do not render; as the logs show, the queried Values are NaN.

    Additional info

    No response
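
    One RRD-specific pitfall worth checking when back-filling: rrdtool rejects updates whose timestamp is not strictly newer than the last write, and each archive only covers a fixed retention window, so historical points pushed via /api/transfer/data that arrive out of order, or fall outside the archive's coverage, never land, and those slots read back as NaN. This is a hypothesis to verify against the tsdb write logs, not a confirmed diagnosis.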

  • Problems with active-alert aggregation rules


    Frontend version: 5.5.1; backend version: 5.9.3

    First problem: image image. When saving, it prompts whether to make the rule public; logically, toggling public on should make it public and toggling it off should not. Only after I set it to public and then back to not public did the rule get created normally. image. Please improve this logic.

    Second problem:

    image image. I added an aggregation rule on __name__, but what actually displays is Null. image. When I edit the aggregation rule and remove the __name__ tag, it reports: unsupported field: name. Strangely, sometimes the tag can be added and sometimes it cannot. I tried other tags from the alerts and hit the same problem.

  • [Multi-cluster setup] After configuring webapi.conf and server.conf, only node info (ident) reaches the central instance; Prometheus real-time data cannot be queried.


    Relevant server.conf | webapi.conf

    Central instance: webapi.conf
    
    # central-side cluster info
    [[Clusters]]
    # Prometheus cluster name
    Name = "Default"
    # Prometheus APIs base url
    Prom = "http://127.0.0.1:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    
    # regional cluster info
    [[Clusters]]
    # Prometheus cluster name
    Name = "zhifawang_cluster"
    # Prometheus APIs base url
    Prom = "http://局部地区ip:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    
    Regional server.conf
    [DB]
    # postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
    DSN="root:mysqlpasswd@tcp(中心端mysql-ip:3306)/n9e_v5?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
    # enable debug mode or not
    Debug = false
    # mysql postgres
    DBType = "mysql"
    # unit: s
    MaxLifetime = 7200
    # max open connections
    MaxOpenConns = 150
    # max idle connections
    MaxIdleConns = 50
    # table prefix
    TablePrefix = ""
    # enable auto migrate or not
    EnableAutoMigrate = false
    
    # Central instance: server.conf
    [Reader]
    # prometheus base url
    Url = "http://127.0.0.1:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 10
    
    [[Writers]]
    Url = "http://127.0.0.1:9090/api/v1/write"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 10000
    DialTimeout = 3000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    
    # Regional cluster server.conf
    [Reader]
    # prometheus base url
    Url = "http://prometheus:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 10
    
    [[Writers]]
    Url = "http://prometheus:9090/api/v1/write"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    

    Relevant logs

    Regional n9e server logs
    2022-06-21 09:44:09.616229 WARNING writer/writer.go:42 post to http://prometheus:9090/api/v1/write got error: push data with remote write request got status code: 500, response body: label name "busigroup" is not unique: invalid sample
    2022-06-21 09:44:09.616338 WARNING writer/writer.go:43 example timeseries:labels:<name:"__name__" value:"kernel_processes_forked" > labels:<name:"ident" value:"11.68.150.59_\351\251\254\351\201\223\345\244\264" > labels:<name:"busigroup" value:"jykj_dt" > labels:<name:"busigroup" value:"jykj_dt" > samples:<value:4.0868227e+07 timestamp:1655775848000 >
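
    The second log line pinpoints the failure: the example timeseries carries the busigroup label twice, and Prometheus rejects remote-write samples whose label names are not unique, hence the 500. Until the duplication is fixed at its source, a guard of the following shape would drop the duplicate before writing. A minimal sketch; the local Label type merely stands in for the remote-write label pair:

    package main

    import "fmt"

    // Label stands in for a remote-write label pair.
    type Label struct{ Name, Value string }

    // dedupLabels keeps the first occurrence of each label name, dropping
    // repeats such as the doubled "busigroup" in the log above.
    // Note: it reuses the input slice's backing array.
    func dedupLabels(labels []Label) []Label {
        seen := make(map[string]bool, len(labels))
        out := labels[:0]
        for _, l := range labels {
            if seen[l.Name] {
                continue
            }
            seen[l.Name] = true
            out = append(out, l)
        }
        return out
    }

    func main() {
        ls := []Label{
            {"__name__", "kernel_processes_forked"},
            {"busigroup", "jykj_dt"},
            {"busigroup", "jykj_dt"},
        }
        fmt.Println(dedupLabels(ls)) // the second busigroup is dropped
    }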
    

    System info

    前端版本:5.5.1 后端版本:5.9.2

    Steps to reproduce

    1. Edit the cluster info in the central webapi.conf. 2. Edit the DB info in the regional n9e server.conf. 3. Restart the regional n9e server service. ...

    Expected behavior

    The central instance shows node info and real-time data for all clusters. Note: the central instance is deployed from release components; the regional one runs in Docker.

    Actual behavior

    The central instance shows multi-cluster node info in the object list, but the metric explorer cannot display data from the other cluster.

    Additional info

    image

  • Gaps in the data; not sure how to troubleshoot


    What happened: the metrics from one host come in intermittently, with gaps

    What you expected to happen: the logs show no problems

    How to reproduce it (as minimally and precisely as possible):

    Anything else we need to know?:

    Environment:

    • OS (e.g: cat /etc/os-release): CentOS 6.6
    • Logs:
    • Others:
  • feat: persist notify cur number


    Persist the consecutive-notification count, so that how many times an alert has been notified in a row can be seen at a glance.

    PostgreSQL migration script:

    ALTER TABLE alert_cur_event ADD notify_cur_number int not null default 0;
    ALTER TABLE alert_his_event ADD notify_cur_number int not null default 0;

  • Dashboard PromQL that references a variable throws an error and cannot be configured


    Nightingale version: 5.6.3 release source, deployed with docker-compose

    • 前端版本:5.2.1

    • 后端版本:5.6.3

    • Chrome version 84.0.4147.105 (official build) (64-bit)

    Problem and reproduction: after deployment, configure a dashboard in the frontend.

    Add a host variable to the dashboard, then add any chart with the following PromQL: cpu_usage_user{ident="$host"}. Clicking save has no effect, and the console reports the following error:

    vendor.afcc874e.js:27 TypeError: i.replaceAll is not a function
        at u1 (index.a3d7ea18.js:33)
        at index.a3d7ea18.js:33
        at Oe (vendor.afcc874e.js:49)
        at Function.xa (vendor.afcc874e.js:49)
        at y (index.a3d7ea18.js:33)
        at index.a3d7ea18.js:33
        at Ss (vendor.afcc874e.js:27)
        at t.unstable_runWithPriority (vendor.afcc874e.js:18)
        at Ri (vendor.afcc874e.js:27)
        at _s (vendor.afcc874e.js:27)
    

    For comparison, hard-coding the ident value makes the chart load fine: cpu_usage_user{ident="telegraf01"}
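
    For context: String.prototype.replaceAll only shipped in Chrome 85 (ES2021), and the browser reported above is Chrome 84, which matches the "i.replaceAll is not a function" TypeError. Upgrading the browser, or the frontend falling back to a regex-based String.prototype.replace, would avoid the crash.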

  • Please help analyze (thanks): the server log shows a large number of client 404 errors


    Relevant server.conf | webapi.conf

    Relevant logs

    2022-12-28 16:50:42.090274 ERROR engine/worker.go:209 rule_eval:245 promql:increase(net_drop_out[1m]) > 0, error:client_error: client error: 404
    2022-12-28 16:50:42.093313 ERROR engine/worker.go:209 rule_eval:263 promql:http_response_http_response_code > 500, error:client_error: client error: 404
    2022-12-28 16:50:42.114400 ERROR engine/worker.go:209 rule_eval:246 promql:netstat_tcp_time_wait > 20000, error:client_error: client error: 404
    2022-12-28 16:50:42.127133 ERROR engine/worker.go:209 rule_eval:244 promql:increase(net_drop_in[1m]) > 0, error:client_error: client error: 404
    2022-12-28 16:50:42.128994 ERROR engine/worker.go:209 rule_eval:238 promql:target_up != 1, error:client_error: client error: 404
    2022-12-28 16:50:42.139551 ERROR engine/worker.go:209 rule_eval:249 promql:procstat_lookup_result_code != 0, error:client_error: client error: 404
    2022-12-28 16:50:42.147291 ERROR engine/worker.go:209 rule_eval:242 promql:rate(diskio_io_time[1m])/10 > 99, error:client_error: client error: 404
    2022-12-28 16:50:42.152683 ERROR engine/worker.go:209 rule_eval:277 promql:disk_used_percent > 85, error:client_error: client error: 404
    2022-12-28 16:50:42.155019 ERROR engine/worker.go:209 rule_eval:239 promql:net_response_result_code != 0, error:client_error: client error: 404
    2022-12-28 16:50:42.156042 ERROR engine/worker.go:209 rule_eval:247 promql:procstat_lookup_running == 0, error:client_error: client error: 404
    2022-12-28 16:50:42.182513 ERROR engine/worker.go:209 rule_eval:241 promql:mem_available_percent < 10, error:client_error: client error: 404
    2022-12-28 16:50:42.186227 ERROR engine/worker.go:209 rule_eval:248 promql:procstat_rlimit_num_fds_soft < 2048, error:client_error: client error: 404
    2022-12-28 16:50:42.190165 ERROR engine/worker.go:209 rule_eval:240 promql:cpu_usage_idle{cpu="cpu-total"} < 25, error:client_error: client error: 404
    2022-12-28 16:50:42.265185 ERROR engine/worker.go:209 rule_eval:237 promql:ping_result_code != 0, error:client_error: client error: 404
    2022-12-28 16:50:42.298631 ERROR engine/worker.go:209 rule_eval:243 promql:predict_linear(disk_free[1h], 4*3600) < 0, error:client_error: client error: 404
    

    System info

    n9e-v5.14.3-linux-amd64 centos

    Steps to reproduce

    Expected behavior

    Actual behavior

    Additional info

    No response
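
    The repeated client_error: client error: 404 means the datasource is answering every rule-evaluation query with HTTP 404, which usually points at a reader URL that does not actually expose the Prometheus query API. A minimal Go probe to reproduce what the evaluator sees; it assumes the default reader URL http://127.0.0.1:9090, so substitute the value from server.conf:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // /api/v1/query is the standard Prometheus HTTP query API; a 404
        // here reproduces exactly what rule_eval is reporting.
        url := "http://127.0.0.1:9090/api/v1/query?query=up"
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(resp.Status)
        fmt.Println(string(body))
    }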

  • 5.14.4: alerts under an active mute rule still fire


    Relevant server.conf | webapi.conf

    On top of 5.13.1, the only change is in server.conf, adding under
    [WriterOpt]:
    ShardingKey = "ident"  # everything else is the same as 5.13.1
    

    Relevant logs

    2022-12-23 18:38:58.012905 INFO engine/logger.go:19 event(568b8a2de13649785ce5ae2648171945 triggered) consume: rule_id=62 [__name__=port_plugin_collector_8080 env=online host=xxxxxx0004-vm instance=xxxxxx job=node_exporters_http node=xxxxxxx4-vm rulename=服务挂了(8080端口挂了) service=xxxxxx]2@1671791937
    

    System info

    n9e 5.14.4, n9e-fe 5.14.3, CentOS

    Steps to reproduce

    1. Before a rollout, mute the machines' alerts for 20 minutes (18:34 to 18:54) image

    2. Start the rollout and restart the services. 3. The muted alerts were still sent; during that window the mute rule was present in Nightingale (forgot to take a screenshot). Also, the 2s duration shown in the alert should be impossible: my rule evaluates every 60s and requires 3 consecutive hits before alerting. image

    ...

    Expected behavior

    Rules that are muted should not alert.

    Actual behavior

    The muted rules alerted anyway.

    Additional info

    No response

  • update: view metrics data by instance


    What type of PR is this? update mongo dashboard template

    What this PR does / why we need it:

    Each chart in the dashboard contains the metric data of all instances. After this update, you can select the instance.

    Which issue(s) this PR fixes:

    Fixes # https://github.com/flashcatcloud/categraf/issues/255

    Special notes for your reviewer:

  • Support a "message card" mode in the Feishu alert template, with the "lark_md" message format


    What would you like to be added: support a "message card" mode in the Feishu alert message template, so that the "lark_md" message format can be used.

    Why is this needed: a Feishu V2 alert template could then set a different title color according to alert status (similar to what the DingTalk and WeCom alert templates already support), making it easier for ops and business staff to tell alert states apart.
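
    For illustration, the kind of payload such a template would render. This is sketched from memory of the Feishu open-platform card schema (field names should be checked against the official docs), and the rule/host values are placeholders:

    {
      "msg_type": "interactive",
      "card": {
        "header": {
          "title": { "tag": "plain_text", "content": "S2 alert: service down" },
          "template": "red"
        },
        "elements": [
          {
            "tag": "div",
            "text": {
              "tag": "lark_md",
              "content": "**Rule:** disk_used_percent > 85\n**Host:** host01"
            }
          }
        ]
      }
    }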

  • toolkits/pkg 1.3.1, referenced by v5.14.1 and v5.14.2, has a bug


    Relevant server.conf | webapi.conf

    # debug, release
    RunMode = "release"
    
    # my cluster name
    ClusterName = "Default"
    
    # Default busigroup Key name
    # do not change
    BusiGroupLabelKey = "busigroup"
    
    # sleep x seconds, then start judge engine
    EngineDelay = 60
    
    DisableUsageReport = false
    
    # config | database
    ReaderFrom = "config"
    
    [Log]
    # log write dir
    Dir = "logs"
    # log level: DEBUG INFO WARNING ERROR
    Level = "INFO"
    # stdout, stderr, file
    Output = "stdout"
    # # rotate by time
    # KeepHours: 4
    # # rotate by size
    # RotateNum = 3
    # # unit: MB
    # RotateSize = 256
    
    [HTTP]
    # http listening address
    Host = "0.0.0.0"
    # http listening port
    Port = 19000
    # https cert file path
    CertFile = ""
    # https key file path
    KeyFile = ""
    # whether print access log
    PrintAccessLog = false
    # whether enable pprof
    PProf = false
    # http graceful shutdown timeout, unit: s
    ShutdownTimeout = 30
    # max content length: 64M
    MaxContentLength = 67108864
    # http server read timeout, unit: s
    ReadTimeout = 20
    # http server write timeout, unit: s
    WriteTimeout = 40
    # http server idle timeout, unit: s
    IdleTimeout = 120
    
    # [BasicAuth]
    # user002 = "ccc26da7b9aba533cbb263a36c07dcc9"
    
    [Heartbeat]
    # auto detect if blank
    IP = ""
    # unit ms
    Interval = 1000
    
    [SMTP]
    Host = "smtp.163.com"
    Port = 994
    User = "username"
    Pass = "password"
    From = "[email protected]"
    InsecureSkipVerify = true
    Batch = 5
    
    [Alerting]
    # timeout settings, unit: ms, default: 30000ms
    Timeout=30000
    TemplatesDir = "./etc/template"
    NotifyConcurrency = 10
    # use builtin go code notify
    NotifyBuiltinChannels = ["email", "dingtalk", "wecom", "feishu", "mm"]
    
    [Alerting.CallScript]
    # built in sending capability in go code
    # so, no need enable script sender
    Enable = false
    ScriptPath = "./etc/script/notify.py"
    
    [Alerting.CallPlugin]
    Enable = false
    # use a plugin via `go build -buildmode=plugin -o notify.so`
    PluginPath = "./etc/script/notify.so"
    # The first letter must be capitalized to be exported
    Caller = "N9eCaller"
    
    [Alerting.RedisPub]
    Enable = false
    # complete redis key: ${ChannelPrefix} + ${Cluster}
    ChannelPrefix = "/alerts/"
    
    [Alerting.Webhook]
    Enable = false
    Url = "http://a.com/n9e/callback"
    BasicAuthUser = ""
    BasicAuthPass = ""
    Timeout = "5s"
    Headers = ["Content-Type", "application/json", "X-From", "N9E"]
    
    [NoData]
    Metric = "target_up"
    # unit: second
    Interval = 120
    
    [Ibex]
    # callback: ${ibex}/${tplid}/${host}
    Address = "127.0.0.1:10090"
    # basic auth
    BasicAuthUser = "ibex"
    BasicAuthPass = "ibex"
    # unit: ms
    Timeout = 3000
    
    [Redis]
    # address, ip:port or ip1:port,ip2:port for cluster and sentinel(SentinelAddrs)
    Address = "127.0.0.1:6379"
    # Username = ""
    # Password = ""
    # DB = 0
    # UseTLS = false
    # TLSMinVersion = "1.2"
    # standalone cluster sentinel
    RedisType = "standalone"
    # Mastername for sentinel type
    # MasterName = "mymaster"
    
    [DB]
    # postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
    DSN="root:1234@tcp(127.0.0.1:3306)/n9e_v5?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
    # enable debug mode or not
    Debug = false
    # mysql postgres
    DBType = "mysql"
    # unit: s
    MaxLifetime = 7200
    # max open connections
    MaxOpenConns = 150
    # max idle connections
    MaxIdleConns = 50
    # table prefix
    TablePrefix = ""
    # enable auto migrate or not
    # EnableAutoMigrate = false
    
    [Reader]
    # prometheus base url
    Url = "http://127.0.0.1:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    
    [WriterOpt]
    # queue channel count
    QueueCount = 1000
    # queue max size
    QueueMaxSize = 1000000
    # once pop samples number from queue
    QueuePopSize = 1000
    # metric or ident
    ShardingKey = "ident"
    
    [[Writers]]
    Url = "http://127.0.0.1:9090/api/v1/write"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Headers = ["X-From", "n9e"]
    Timeout = 10000
    DialTimeout = 3000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    # [[Writers.WriteRelabels]]
    # Action = "replace"
    # SourceLabels = ["__address__"]
    # Regex = "([^:]+)(?::\\d+)?"
    # Replacement = "$1:80"
    # TargetLabel = "__address__"
    
    # [[Writers]]
    # Url = "http://127.0.0.1:7201/api/v1/prom/remote/write"
    # # Basic auth username
    # BasicAuthUser = ""
    # # Basic auth password
    # BasicAuthPass = ""
    # # timeout settings, unit: ms
    # Timeout = 30000
    # DialTimeout = 10000
    # TLSHandshakeTimeout = 30000
    # ExpectContinueTimeout = 1000
    # IdleConnTimeout = 90000
    # # time duration, unit: ms
    # KeepAlive = 30000
    # MaxConnsPerHost = 0
    # MaxIdleConns = 100
    # MaxIdleConnsPerHost = 100
    

    Relevant logs

    # github.com/toolkits/pkg/logger
    \go\pkg\mod\github.com\toolkits\[email protected]\logger\config.go:37:32: cannot use sb (variable of type *syslogBackend) as type Backend in argument to log.SetLogging:
    	*syslogBackend does not implement Backend (missing Close method)
    		have close()
    		want Close()
    

    System info

    n9e v5.14.1 v5.14.2

    Steps to reproduce

    1. Set the startup parameter server conf=server.json, then start.

    Expected behavior

    It starts successfully.

    Actual behavior

    It fails to start, reporting:

    github.com/toolkits/pkg/logger

    \go\pkg\mod\github.com\toolkits\[email protected]\logger\config.go:37:32: cannot use sb (variable of type *syslogBackend) as type Backend in argument to log.SetLogging:
        *syslogBackend does not implement Backend (missing Close method)
            have close()
            want Close()

    Additional info

    This is a bug in toolkits/pkg: https://github.com/toolkits/pkg/issues/10
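
    The compile error is Go's interface-satisfaction rule at work: method names match case-sensitively, so an unexported close() cannot satisfy an interface that requires an exported Close(). A minimal reproduction of the rule, not the actual toolkits/pkg code:

    package main

    import "fmt"

    type Backend interface {
        Close() error
    }

    type syslogBackend struct{}

    // With a lowercase close() this type would not implement Backend,
    // producing exactly the "have close() want Close()" error above;
    // exporting the method is what satisfies the interface.
    func (b *syslogBackend) Close() error { return nil }

    func main() {
        var be Backend = &syslogBackend{}
        fmt.Println(be.Close()) // <nil>
    }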
