💡 A Distributed and High-Performance Monitoring System. The next generation of Open-Falcon

Nightingale


About Nightingale

Nightingale is a distributed, highly available operations monitoring system. Its defining feature is hybrid-cloud support: it covers both traditional physical-machine/VM environments and Kubernetes container environments. Nightingale is also more than monitoring; it ships some CMDB and automation capabilities, and many companies have built their own operations platforms on top of it. The open-source modules are also part of the commercial edition, so they are dependable and will keep being maintained; you can use them with confidence. A screenshot:

(screenshot: Nightingale UI)

OCE Certification

OCE is a certification program and exchange platform tailored for companies running Nightingale in production. We provide OCE members with better technical support, such as dedicated technical salons, one-on-one exchanges with the team, and a members-only Q&A group. If your company has taken Nightingale into production, come join us.

Documentation

Community

Follow the official WeChat account Obsuite and reply "夜莺加群" to join the community chat group.


Comments
  • OIDC is configured, but the OIDC login option does not show up. What is going on?


    Relevant server.conf | webapi.conf

    [OIDC]
    Enable = true
    RedirectURL = "http://ip:18000/callback"
    SsoAddr = "http://ip/oidc/login"
    ClientId = "xxxxxxxxxxxxxx"
    ClientSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    CoverAttributes = true
    # default roles
    DefaultRoles = ["Standard"]
    
    # attribute mapping
    [OIDC.Attributes]
    Nickname = "nickname"
    Phone = "phone_number"
    Email = "email"
    

    Relevant logs

    OIDC is configured, but the OIDC login option does not show up. What is going on?
    

    System info

    OIDC is configured, but the OIDC login option does not show up. What is going on?

    Steps to reproduce

    1. OIDC is configured, but the OIDC login option does not show up. 2. 3. ...

    Expected behavior

    OIDC is configured, but the OIDC login option does not show up. What is going on?

    Actual behavior

    OIDC is configured, but the OIDC login option does not show up. What is going on?

    Additional info

    OIDC is configured, but the OIDC login option does not show up. What is going on?
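
    A quick way to rule out the IdP side is the standard OIDC discovery document, which every spec-compliant provider serves. Below is a minimal Go probe, assuming the SsoAddr host (the "http://ip" placeholder above) is the OIDC issuer; it is also worth confirming the [OIDC] block sits in webapi.conf, since the webapi process is the one that renders the login page:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    func main() {
        issuer := "http://ip" // placeholder host from SsoAddr above
        client := &http.Client{Timeout: 5 * time.Second}
        // Every spec-compliant OIDC provider serves this document.
        resp, err := client.Get(issuer + "/.well-known/openid-configuration")
        if err != nil {
            fmt.Println("IdP unreachable:", err)
            return
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        // Expect 200 and JSON listing authorization_endpoint, token_endpoint, ...
        fmt.Println(resp.Status)
        fmt.Println(string(body))
    }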

  • Help wanted: using ibex with categraf-v0.2.10 for self-healing scripts, the self-healing script configured on the alert rule does not take effect. categraf and ibex-agent are installed on one server, and the Nightingale stack runs via Docker on another server. Looking for troubleshooting ideas.


    Relevant server.conf | webapi.conf

    cat /data/categraf-v0.2.10-linux-amd64/conf/config.toml
    [global]
    # whether print configs
    print_configs = false
    
    # add label(agent_hostname) to series
    # "" -> auto detect hostname
    # "xx" -> use specified string xx
    # "$hostname" -> auto detect hostname
    # "$ip" -> auto detect ip
    # "$hostname-$ip" -> auto detect hostname and ip to replace the vars
    hostname = "$ip"
    
    
    /ibex/etc/agentd.conf configuration
    # debug, release
    RunMode = "debug"
    
    # task meta storage dir
    MetaDir = "./meta"
    
    [HTTP]
    Enable = true
    # http listening address
    Host = "0.0.0.0"
    # http listening port
    Port = 2090
    # https cert file path
    CertFile = ""
    # https key file path
    KeyFile = ""
    # whether print access log
    PrintAccessLog = true
    # whether enable pprof
    PProf = false
    # http graceful shutdown timeout, unit: s
    ShutdownTimeout = 30
    # max content length: 64M
    MaxContentLength = 67108864
    # http server read timeout, unit: s
    ReadTimeout = 20
    # http server write timeout, unit: s
    WriteTimeout = 40
    # http server idle timeout, unit: s
    IdleTimeout = 120
    
    [Heartbeat]
    # unit: ms
    Interval = 1000
    # rpc servers
    Servers = ["172.18.89.20:20090"]
    # $ip or $hostname or specified string
    Host = "172.18.89.18"
    

    Relevant logs

    none
    

    System info

    Frontend version: 5.9.0; backend version: v5.10.3-23d7e5a7de5d0ffea7ab941e9621ef7f53071775

    Steps to reproduce

    ...

    Expected behavior

    When the alert rule fires, the self-healing script is not triggered.

    Actual behavior

    none

    Additional info

    none
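
    Since the agent and the Dockerized n9e stack sit on different hosts, a basic reachability check of the ibex server RPC port from the agent host is worth doing first. A minimal Go probe, using the Heartbeat.Servers address from agentd.conf above:

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Address from Heartbeat.Servers in agentd.conf above; when n9e runs
        // in Docker, this port must be published (mapped) to the host.
        addr := "172.18.89.20:20090"
        conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
        if err != nil {
            fmt.Println("cannot reach ibex server:", err)
            return
        }
        conn.Close()
        fmt.Println("tcp ok:", addr)
    }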

  • 3.0 UI does not render correctly


    Installed according to the installation steps; the problem is probably in the frontend UI. The console reports the following error (logged twice):

    SyntaxError: Unexpected end of JSON input
        at JSON.parse ()
        at layout-2ad14e8510d33d9e5c84.js:81829
        at ia (layout-2ad14e8510d33d9e5c84.js:59318)
        at La (layout-2ad14e8510d33d9e5c84.js:59318)
        at Va (layout-2ad14e8510d33d9e5c84.js:59318)
        at Ha (layout-2ad14e8510d33d9e5c84.js:59318)
        at Mc (layout-2ad14e8510d33d9e5c84.js:59318)
        at xc (layout-2ad14e8510d33d9e5c84.js:59318)
        at yc (layout-2ad14e8510d33d9e5c84.js:59318)
        at Ga (layout-2ad14e8510d33d9e5c84.js:59318)

  • The AND condition in alert configuration seems buggy


    Current alert policy configuration: image

    Current state of the corresponding metrics: image

    The goal: fire an alert when both proc.port.listen < 1 and file.lock.exist < 1 are satisfied.
    The current state does satisfy the alert condition, yet no alert is triggered.

    When each condition is configured on its own, without the AND, both trigger alerts correctly.

  • Nightingale v3: no alert event is generated when the agent stops


    After redeploying Nightingale from v2 to v3, all other alerts generate alert events normally, but stopping the agent no longer generates one.

    The monitoring policy is as follows:

    { "name": "监控agent失联", "category": 1, "alert_dur": 60, "recovery_dur": 0, "recovery_notify": 1, "enable_stime": "00:00", "enable_etime": "23:59", "priority": 1, "exprs": [ { "eopt": "=", "func": "nodata", "metric": "proc.agent.alive", "params": [], "threshold": 0 } ], "tags": [], "enable_days_of_week": [ 0, 1, 2, 3, 4, 5, 6 ], "converge": [ 36000, 1 ], "endpoints": null },

    All other alerts work fine, and all my monitoring policies are placed on a single parent node. When the agent stops, the metric graphs clearly show that proc.agent.alive stops reporting, yet no event appears under active (unrecovered) alerts. Why? Any pointers would be appreciated.

  • Data reported via the API queries back as NaN


    Relevant server.conf | webapi.conf

    tsdb.yml
    rrd:
      storage: /home/storage/n9e_data/8011
    cache:
      keepMinutes: 120
    logger:
      dir: logs/tsdb
      level: WARNING
      keepHours: 2
    
    transfer.yml
    backend:
      datasource: "tsdb"
      m3db:
        enabled: false
        maxSeriesPoints: 720                       # default 720
        name: "m3db"
        namespace: "default"
        seriesLimit: 0
        docsLimit: 0
        daysLimit: 7                               # max query time
        # https://m3db.github.io/m3/m3db/architecture/consistencylevels/
        writeConsistencyLevel: "majority"          # one|majority|all
        readConsistencyLevel: "unstrict_majority"  # one|unstrict_majority|majority|all
        config:
          service:
            # KV environment, zone, and service from which to write/read KV data (placement
            # and configuration). Leave these as the default values unless you know what
            # you're doing.
            env: default_env
            zone: embedded
            service: m3db
            etcdClusters:
              - zone: embedded
                endpoints:
                  - 127.0.0.1:2379
                tls:
                  caCrtPath: /etc/etcd/certs/ca.pem
                  crtPath: /etc/etcd/certs/etcd-client.pem
                  keyPath: /etc/etcd/certs/etcd-client-key.pem
      tsdb:
        enabled: true
        name: "tsdb"
        cluster:
          tsdb01: 127.0.0.1:8011
      influxdb:
        enabled: false
        username: "influx"
        password: "admin123"
        precision: "s"
        database: "n9e"
        address: "http://127.0.0.1:8086"
      opentsdb:
        enabled: false
        address: "127.0.0.1:4242"
      kafka:
        enabled: false
        brokersPeers: "192.168.1.1:9092,192.168.1.2:9092"
        topic: "n9e"
    logger:
      dir: logs/transfer
      level: INFO
      keepHours: 24
    

    Relevant logs

    2022-07-06 18:39:47.942693 WARNING rpc/query.go:118 debug: true, /home/storage/n9e_data/8011/cd/cdf3bd9e6ba35f20b66aa65ddd330365_GAUGE_7200.rrd
    2022-07-06 18:39:47.943026 WARNING rpc/query.go:145 data: [<RRDData:Value:NaN TS:1654480800 2022-06-06 10:00:00> <RRDData:Value:NaN TS:1654488000 2022-06-06 12:00:00> <RRDData:Value:NaN TS:1654495200 2022-06-06 14:00:00> <RRDData:Value:NaN TS:1654502400 2022-06-06 16:00:00> <RRDData:Value:NaN TS:1654509600 2022-06-06 18:00:00> <RRDData:Value:NaN TS:1654516800 2022-06-06 20:00:00> <RRDData:Value:NaN TS:1654524000 2022-06-06 22:00:00> <RRDData:Value:NaN TS:1654531200 2022-06-07 00:00:00> <RRDData:Value:NaN TS:1654538400 2022-06-07 02:00:00> <RRDData:Value:NaN TS:1654545600 2022-06-07 04:00:00> <RRDData:Value:NaN TS:1654552800 2022-06-07 06:00:00> <RRDData:Value:NaN TS:1654560000 2022-06-07 08:00:00> <RRDData:Value:NaN TS:1654567200 2022-06-07 10:00:00> <RRDData:Value:NaN TS:1654574400 2022-06-07 12:00:00> <RRDData:Value:NaN TS:1654581600 2022-06-07 14:00:00> <RRDData:Value:NaN TS:1654588800 2022-06-07 16:00:00> <RRDData:Value:NaN TS:1654596000 2022-06-07 18:00:00> <RRDData:Value:NaN TS:1654603200 2022-06-07 20:00:00> <RRDData:Value:NaN TS:1654610400 2022-06-07 22:00:00> <RRDData:Value:NaN TS:1654617600 2022-06-08 00:00:00>
    

    System info

    n9e 3.8.0

    Steps to reproduce

    1. Batch-report previously collected historical data via /api/transfer/data. 2. The charts on the monitoring page show no data. 3. The logs show the data files open normally, but some of the queried Values are NaN.

    Expected behavior

    Data can be queried normally and the charts render.

    Actual behavior

    The charts do not render; as the logs show, the queried Values are NaN.

    Additional info

    No response
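
    One RRD-specific pitfall worth checking when back-filling: rrdtool rejects updates whose timestamp is not strictly newer than the last write, and each archive only covers a fixed retention window, so historical points pushed via /api/transfer/data that arrive out of order, or fall outside the archive's coverage, never land, and those slots read back as NaN. This is a hypothesis to verify against the tsdb write logs, not a confirmed diagnosis.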

  • Problems with active-alert aggregation rules


    Frontend version: 5.5.1; backend version: 5.9.3

    First problem: image image. When saving, it prompts whether to make the rule public; logically, toggling public on should make it public and toggling it off should not. Only after I set it to public and then back to not public did the rule get created normally. image. Please improve this logic.

    Second problem:

    image image. I added an aggregation rule on __name__, but what actually displays is Null. image. When I edit the aggregation rule and remove the __name__ tag, it reports: unsupported field: name. Strangely, sometimes the tag can be added and sometimes it cannot. I tried other tags from the alerts and hit the same problem.

  • [Multi-cluster setup] After configuring webapi.conf and server.conf, only node info (ident) reaches the central instance; Prometheus real-time data cannot be queried.


    Relevant server.conf | webapi.conf

    Central instance: webapi.conf
    
    # central-side cluster info
    [[Clusters]]
    # Prometheus cluster name
    Name = "Default"
    # Prometheus APIs base url
    Prom = "http://127.0.0.1:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    
    # regional cluster info
    [[Clusters]]
    # Prometheus cluster name
    Name = "zhifawang_cluster"
    # Prometheus APIs base url
    Prom = "http://局部地区ip:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    
    Regional server.conf
    [DB]
    # postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
    DSN="root:mysqlpasswd@tcp(中心端mysql-ip:3306)/n9e_v5?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
    # enable debug mode or not
    Debug = false
    # mysql postgres
    DBType = "mysql"
    # unit: s
    MaxLifetime = 7200
    # max open connections
    MaxOpenConns = 150
    # max idle connections
    MaxIdleConns = 50
    # table prefix
    TablePrefix = ""
    # enable auto migrate or not
    EnableAutoMigrate = false
    
    # Central instance: server.conf
    [Reader]
    # prometheus base url
    Url = "http://127.0.0.1:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 10
    
    [[Writers]]
    Url = "http://127.0.0.1:9090/api/v1/write"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 10000
    DialTimeout = 3000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    
    # Regional cluster server.conf
    [Reader]
    # prometheus base url
    Url = "http://prometheus:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 10
    
    [[Writers]]
    Url = "http://prometheus:9090/api/v1/write"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    

    Relevant logs

    Regional n9e server logs
    2022-06-21 09:44:09.616229 WARNING writer/writer.go:42 post to http://prometheus:9090/api/v1/write got error: push data with remote write request got status code: 500, response body: label name "busigroup" is not unique: invalid sample
    2022-06-21 09:44:09.616338 WARNING writer/writer.go:43 example timeseries:labels:<name:"__name__" value:"kernel_processes_forked" > labels:<name:"ident" value:"11.68.150.59_\351\251\254\351\201\223\345\244\264" > labels:<name:"busigroup" value:"jykj_dt" > labels:<name:"busigroup" value:"jykj_dt" > samples:<value:4.0868227e+07 timestamp:1655775848000 >
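
    The second log line pinpoints the failure: the example timeseries carries the busigroup label twice, and Prometheus rejects remote-write samples whose label names are not unique, hence the 500. Until the duplication is fixed at its source, a guard of the following shape would drop the duplicate before writing. A minimal sketch; the local Label type merely stands in for the remote-write label pair:

    package main

    import "fmt"

    // Label stands in for a remote-write label pair.
    type Label struct{ Name, Value string }

    // dedupLabels keeps the first occurrence of each label name, dropping
    // repeats such as the doubled "busigroup" in the log above.
    // Note: it reuses the input slice's backing array.
    func dedupLabels(labels []Label) []Label {
        seen := make(map[string]bool, len(labels))
        out := labels[:0]
        for _, l := range labels {
            if seen[l.Name] {
                continue
            }
            seen[l.Name] = true
            out = append(out, l)
        }
        return out
    }

    func main() {
        ls := []Label{
            {"__name__", "kernel_processes_forked"},
            {"busigroup", "jykj_dt"},
            {"busigroup", "jykj_dt"},
        }
        fmt.Println(dedupLabels(ls)) // the second busigroup is dropped
    }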
    

    System info

    前端版本:5.5.1 后端版本:5.9.2

    Steps to reproduce

    1. Edit the cluster info in the central webapi.conf. 2. Edit the DB info in the regional n9e server.conf. 3. Restart the regional n9e server service. ...

    Expected behavior

    The central instance shows node info and real-time data for all clusters. Note: the central instance is deployed from release components; the regional one runs in Docker.

    Actual behavior

    The central instance shows multi-cluster node info in the object list, but the metric explorer cannot display data from the other cluster.

    Additional info

    image

  • Gaps in the data; not sure how to troubleshoot


    What happened: the metrics from one host come in intermittently, with gaps

    What you expected to happen: the logs show no problems

    How to reproduce it (as minimally and precisely as possible):

    Anything else we need to know?:

    Environment:

    • OS (e.g: cat /etc/os-release): CentOS 6.6
    • Logs:
    • Others:
  • feat: persist notify cur number


    Persist the consecutive-notification count, so that how many times an alert has been notified in a row can be seen at a glance.

    PostgreSQL migration script:

    ALTER TABLE alert_cur_event ADD notify_cur_number int not null default 0;
    ALTER TABLE alert_his_event ADD notify_cur_number int not null default 0;

  • Dashboard PromQL that references a variable throws an error and cannot be configured


    Nightingale version: 5.6.3 release source, deployed with docker-compose

    • 前端版本:5.2.1

    • 后端版本:5.6.3

    • Chrome version 84.0.4147.105 (official build) (64-bit)

    Problem and reproduction: after deployment, configure a dashboard in the frontend.

    Add a host variable to the dashboard, then add any chart with the following PromQL: cpu_usage_user{ident="$host"}. Clicking save has no effect, and the console reports the following error:

    vendor.afcc874e.js:27 TypeError: i.replaceAll is not a function
        at u1 (index.a3d7ea18.js:33)
        at index.a3d7ea18.js:33
        at Oe (vendor.afcc874e.js:49)
        at Function.xa (vendor.afcc874e.js:49)
        at y (index.a3d7ea18.js:33)
        at index.a3d7ea18.js:33
        at Ss (vendor.afcc874e.js:27)
        at t.unstable_runWithPriority (vendor.afcc874e.js:18)
        at Ri (vendor.afcc874e.js:27)
        at _s (vendor.afcc874e.js:27)
    

    For comparison, hard-coding the ident value makes the chart load fine: cpu_usage_user{ident="telegraf01"}
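
    For context: String.prototype.replaceAll only shipped in Chrome 85 (ES2021), and the browser reported above is Chrome 84, which matches the "i.replaceAll is not a function" TypeError. Upgrading the browser, or the frontend falling back to a regex-based String.prototype.replace, would avoid the crash.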

  • Please help analyze (thanks): the server log shows a large number of client 404 errors


    Relevant server.conf | webapi.conf

    Relevant logs

    2022-12-28 16:50:42.090274 ERROR engine/worker.go:209 rule_eval:245 promql:increase(net_drop_out[1m]) > 0, error:client_error: client error: 404
    2022-12-28 16:50:42.093313 ERROR engine/worker.go:209 rule_eval:263 promql:http_response_http_response_code > 500, error:client_error: client error: 404
    2022-12-28 16:50:42.114400 ERROR engine/worker.go:209 rule_eval:246 promql:netstat_tcp_time_wait > 20000, error:client_error: client error: 404
    2022-12-28 16:50:42.127133 ERROR engine/worker.go:209 rule_eval:244 promql:increase(net_drop_in[1m]) > 0, error:client_error: client error: 404
    2022-12-28 16:50:42.128994 ERROR engine/worker.go:209 rule_eval:238 promql:target_up != 1, error:client_error: client error: 404
    2022-12-28 16:50:42.139551 ERROR engine/worker.go:209 rule_eval:249 promql:procstat_lookup_result_code != 0, error:client_error: client error: 404
    2022-12-28 16:50:42.147291 ERROR engine/worker.go:209 rule_eval:242 promql:rate(diskio_io_time[1m])/10 > 99, error:client_error: client error: 404
    2022-12-28 16:50:42.152683 ERROR engine/worker.go:209 rule_eval:277 promql:disk_used_percent > 85, error:client_error: client error: 404
    2022-12-28 16:50:42.155019 ERROR engine/worker.go:209 rule_eval:239 promql:net_response_result_code != 0, error:client_error: client error: 404
    2022-12-28 16:50:42.156042 ERROR engine/worker.go:209 rule_eval:247 promql:procstat_lookup_running == 0, error:client_error: client error: 404
    2022-12-28 16:50:42.182513 ERROR engine/worker.go:209 rule_eval:241 promql:mem_available_percent < 10, error:client_error: client error: 404
    2022-12-28 16:50:42.186227 ERROR engine/worker.go:209 rule_eval:248 promql:procstat_rlimit_num_fds_soft < 2048, error:client_error: client error: 404
    2022-12-28 16:50:42.190165 ERROR engine/worker.go:209 rule_eval:240 promql:cpu_usage_idle{cpu="cpu-total"} < 25, error:client_error: client error: 404
    2022-12-28 16:50:42.265185 ERROR engine/worker.go:209 rule_eval:237 promql:ping_result_code != 0, error:client_error: client error: 404
    2022-12-28 16:50:42.298631 ERROR engine/worker.go:209 rule_eval:243 promql:predict_linear(disk_free[1h], 4*3600) < 0, error:client_error: client error: 404
    

    System info

    n9e-v5.14.3-linux-amd64 centos

    Steps to reproduce

    Expected behavior

    Actual behavior

    Additional info

    No response
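
    The repeated client_error: client error: 404 means the datasource is answering every rule-evaluation query with HTTP 404, which usually points at a reader URL that does not actually expose the Prometheus query API. A minimal Go probe to reproduce what the evaluator sees; it assumes the default reader URL http://127.0.0.1:9090, so substitute the value from server.conf:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // /api/v1/query is the standard Prometheus HTTP query API; a 404
        // here reproduces exactly what rule_eval is reporting.
        url := "http://127.0.0.1:9090/api/v1/query?query=up"
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(resp.Status)
        fmt.Println(string(body))
    }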

  • 5.14.4: alerts under an active mute rule still fire


    Relevant server.conf | webapi.conf

    On top of 5.13.1, the only change is in server.conf, adding under
    [WriterOpt]:
    ShardingKey = "ident"  # everything else is the same as 5.13.1
    

    Relevant logs

    2022-12-23 18:38:58.012905 INFO engine/logger.go:19 event(568b8a2de13649785ce5ae2648171945 triggered) consume: rule_id=62 [__name__=port_plugin_collector_8080 env=online host=xxxxxx0004-vm instance=xxxxxx job=node_exporters_http node=xxxxxxx4-vm rulename=服务挂了(8080端口挂了) service=xxxxxx]2@1671791937
    

    System info

    n9e 5.14.4, n9e-fe 5.14.3, CentOS

    Steps to reproduce

    1. Before a rollout, mute the machines' alerts for 20 minutes (18:34 to 18:54) image

    2. Start the rollout and restart the services. 3. The muted alerts were still sent; during that window the mute rule was present in Nightingale (forgot to take a screenshot). Also, the 2s duration shown in the alert should be impossible: my rule evaluates every 60s and requires 3 consecutive hits before alerting. image

    ...

    Expected behavior

    Rules that are muted should not alert.

    Actual behavior

    The muted rules alerted anyway.

    Additional info

    No response

  • update: view metrics data by instance


    What type of PR is this? update mongo dashboard template

    What this PR does / why we need it:

    Each chart in the dashboard contains the metric data of all instances. After this update, you can select the instance.

    Which issue(s) this PR fixes:

    Fixes # https://github.com/flashcatcloud/categraf/issues/255

    Special notes for your reviewer:

  • Support a "message card" mode in the Feishu alert template, with the "lark_md" message format


    What would you like to be added: support a "message card" mode in the Feishu alert message template, so that the "lark_md" message format can be used.

    Why is this needed: a Feishu V2 alert template could then set a different title color according to alert status (similar to what the DingTalk and WeCom alert templates already support), making it easier for ops and business staff to tell alert states apart.
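
    For illustration, the kind of payload such a template would render. This is sketched from memory of the Feishu open-platform card schema (field names should be checked against the official docs), and the rule/host values are placeholders:

    {
      "msg_type": "interactive",
      "card": {
        "header": {
          "title": { "tag": "plain_text", "content": "S2 alert: service down" },
          "template": "red"
        },
        "elements": [
          {
            "tag": "div",
            "text": {
              "tag": "lark_md",
              "content": "**Rule:** disk_used_percent > 85\n**Host:** host01"
            }
          }
        ]
      }
    }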

  • toolkits/pkg 1.3.1, referenced by v5.14.1 and v5.14.2, has a bug


    Relevant server.conf | webapi.conf

    # debug, release
    RunMode = "release"
    
    # my cluster name
    ClusterName = "Default"
    
    # Default busigroup Key name
    # do not change
    BusiGroupLabelKey = "busigroup"
    
    # sleep x seconds, then start judge engine
    EngineDelay = 60
    
    DisableUsageReport = false
    
    # config | database
    ReaderFrom = "config"
    
    [Log]
    # log write dir
    Dir = "logs"
    # log level: DEBUG INFO WARNING ERROR
    Level = "INFO"
    # stdout, stderr, file
    Output = "stdout"
    # # rotate by time
    # KeepHours: 4
    # # rotate by size
    # RotateNum = 3
    # # unit: MB
    # RotateSize = 256
    
    [HTTP]
    # http listening address
    Host = "0.0.0.0"
    # http listening port
    Port = 19000
    # https cert file path
    CertFile = ""
    # https key file path
    KeyFile = ""
    # whether print access log
    PrintAccessLog = false
    # whether enable pprof
    PProf = false
    # http graceful shutdown timeout, unit: s
    ShutdownTimeout = 30
    # max content length: 64M
    MaxContentLength = 67108864
    # http server read timeout, unit: s
    ReadTimeout = 20
    # http server write timeout, unit: s
    WriteTimeout = 40
    # http server idle timeout, unit: s
    IdleTimeout = 120
    
    # [BasicAuth]
    # user002 = "ccc26da7b9aba533cbb263a36c07dcc9"
    
    [Heartbeat]
    # auto detect if blank
    IP = ""
    # unit ms
    Interval = 1000
    
    [SMTP]
    Host = "smtp.163.com"
    Port = 994
    User = "username"
    Pass = "password"
    From = "[email protected]"
    InsecureSkipVerify = true
    Batch = 5
    
    [Alerting]
    # timeout settings, unit: ms, default: 30000ms
    Timeout=30000
    TemplatesDir = "./etc/template"
    NotifyConcurrency = 10
    # use builtin go code notify
    NotifyBuiltinChannels = ["email", "dingtalk", "wecom", "feishu", "mm"]
    
    [Alerting.CallScript]
    # built in sending capability in go code
    # so, no need enable script sender
    Enable = false
    ScriptPath = "./etc/script/notify.py"
    
    [Alerting.CallPlugin]
    Enable = false
    # use a plugin via `go build -buildmode=plugin -o notify.so`
    PluginPath = "./etc/script/notify.so"
    # The first letter must be capitalized to be exported
    Caller = "N9eCaller"
    
    [Alerting.RedisPub]
    Enable = false
    # complete redis key: ${ChannelPrefix} + ${Cluster}
    ChannelPrefix = "/alerts/"
    
    [Alerting.Webhook]
    Enable = false
    Url = "http://a.com/n9e/callback"
    BasicAuthUser = ""
    BasicAuthPass = ""
    Timeout = "5s"
    Headers = ["Content-Type", "application/json", "X-From", "N9E"]
    
    [NoData]
    Metric = "target_up"
    # unit: second
    Interval = 120
    
    [Ibex]
    # callback: ${ibex}/${tplid}/${host}
    Address = "127.0.0.1:10090"
    # basic auth
    BasicAuthUser = "ibex"
    BasicAuthPass = "ibex"
    # unit: ms
    Timeout = 3000
    
    [Redis]
    # address, ip:port or ip1:port,ip2:port for cluster and sentinel(SentinelAddrs)
    Address = "127.0.0.1:6379"
    # Username = ""
    # Password = ""
    # DB = 0
    # UseTLS = false
    # TLSMinVersion = "1.2"
    # standalone cluster sentinel
    RedisType = "standalone"
    # Mastername for sentinel type
    # MasterName = "mymaster"
    
    [DB]
    # postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
    DSN="root:1234@tcp(127.0.0.1:3306)/n9e_v5?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
    # enable debug mode or not
    Debug = false
    # mysql postgres
    DBType = "mysql"
    # unit: s
    MaxLifetime = 7200
    # max open connections
    MaxOpenConns = 150
    # max idle connections
    MaxIdleConns = 50
    # table prefix
    TablePrefix = ""
    # enable auto migrate or not
    # EnableAutoMigrate = false
    
    [Reader]
    # prometheus base url
    Url = "http://127.0.0.1:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    
    [WriterOpt]
    # queue channel count
    QueueCount = 1000
    # queue max size
    QueueMaxSize = 1000000
    # once pop samples number from queue
    QueuePopSize = 1000
    # metric or ident
    ShardingKey = "ident"
    
    [[Writers]]
    Url = "http://127.0.0.1:9090/api/v1/write"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Headers = ["X-From", "n9e"]
    Timeout = 10000
    DialTimeout = 3000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    # [[Writers.WriteRelabels]]
    # Action = "replace"
    # SourceLabels = ["__address__"]
    # Regex = "([^:]+)(?::\\d+)?"
    # Replacement = "$1:80"
    # TargetLabel = "__address__"
    
    # [[Writers]]
    # Url = "http://127.0.0.1:7201/api/v1/prom/remote/write"
    # # Basic auth username
    # BasicAuthUser = ""
    # # Basic auth password
    # BasicAuthPass = ""
    # # timeout settings, unit: ms
    # Timeout = 30000
    # DialTimeout = 10000
    # TLSHandshakeTimeout = 30000
    # ExpectContinueTimeout = 1000
    # IdleConnTimeout = 90000
    # # time duration, unit: ms
    # KeepAlive = 30000
    # MaxConnsPerHost = 0
    # MaxIdleConns = 100
    # MaxIdleConnsPerHost = 100
    

    Relevant logs

    # github.com/toolkits/pkg/logger
    \go\pkg\mod\github.com\toolkits\[email protected]\logger\config.go:37:32: cannot use sb (variable of type *syslogBackend) as type Backend in argument to log.SetLogging:
    	*syslogBackend does not implement Backend (missing Close method)
    		have close()
    		want Close()
    

    System info

    n9e v5.14.1 v5.14.2

    Steps to reproduce

    1. Set the startup parameter server conf=server.json, then start.

    Expected behavior

    It starts successfully.

    Actual behavior

    It fails to start, reporting:

    github.com/toolkits/pkg/logger

    \go\pkg\mod\github.com\toolkits\[email protected]\logger\config.go:37:32: cannot use sb (variable of type *syslogBackend) as type Backend in argument to log.SetLogging:
        *syslogBackend does not implement Backend (missing Close method)
            have close()
            want Close()

    Additional info

    This is a bug in toolkits/pkg: https://github.com/toolkits/pkg/issues/10
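
    The compile error is Go's interface-satisfaction rule at work: method names match case-sensitively, so an unexported close() cannot satisfy an interface that requires an exported Close(). A minimal reproduction of the rule, not the actual toolkits/pkg code:

    package main

    import "fmt"

    type Backend interface {
        Close() error
    }

    type syslogBackend struct{}

    // With a lowercase close() this type would not implement Backend,
    // producing exactly the "have close() want Close()" error above;
    // exporting the method is what satisfies the interface.
    func (b *syslogBackend) Close() error { return nil }

    func main() {
        var be Backend = &syslogBackend{}
        fmt.Println(be.Close()) // <nil>
    }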
