kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe energy-related system stats and exports them as Prometheus metrics.

Architecture

Requirements

Kernel 4.18+, Cgroup V2
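
A quick way to verify both on a node (a sketch; the cgroup mount path may vary by distribution):

# uname -r
# stat -fc %T /sys/fs/cgroup/   # prints "cgroup2fs" when cgroup v2 is in use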

Installation and Configuration for Prometheus

Prerequisites

Need access to a Kubernetes cluster.

Deploy the Kepler exporter

Deploy the Kepler exporter as a DaemonSet so it runs on all nodes. The following deployment also creates a service listening on port 9102.

# kubectl create -f manifests/kubernetes/deployment.yaml 
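
To verify the rollout (a sketch; this assumes the exporter runs in the kepler namespace, as in the issue logs quoted later on this page):

# kubectl -n kepler get daemonset,pods,svc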

Deploy the Prometheus operator and the whole monitoring stack

  1. Clone the kube-prometheus project to your local folder.
# git clone https://github.com/prometheus-operator/kube-prometheus
  2. Deploy the whole monitoring stack using the config in the manifests directory. Create the namespace and CRDs, then wait for them to be available before creating the remaining resources.
# cd kube-prometheus
# kubectl apply --server-side -f manifests/setup
# until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
# kubectl apply -f manifests/

Configure Prometheus to scrape Kepler-exporter endpoints.

# cd ../kepler
# kubectl create -f manifests/kubernetes/keplerExporter-serviceMonitor.yaml
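
As a sanity check (a sketch; the service name kepler-exporter and port 9102 are assumed from the deployment above), port-forward the exporter service and confirm it serves metrics:

# kubectl -n kepler port-forward svc/kepler-exporter 9102:9102 &
# curl -s http://localhost:9102/metrics | head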

Sample Grafana dashboard

Comments
  • Cannot start up exporter with Kind

    Describe the bug Trying Kepler with Kind, the exporter cannot start up; I get the errors below in the logs. Not sure whether Kind is on the supported list.

    [root@experimental kepler]# k -n kepler logs -f kepler-exporter-kgvdl
    panic: runtime error: index out of range [16] with length 16
    
    goroutine 1 [running]:
    github.com/sustainable-computing-io/kepler/pkg/power/rapl/source.mapPackageAndCore()
    	/opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/power/rapl/source/msr_util.go:88 +0x234
    github.com/sustainable-computing-io/kepler/pkg/power/rapl/source.InitUnits()
    	/opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/power/rapl/source/msr_util.go:147 +0x1d
    github.com/sustainable-computing-io/kepler/pkg/power/rapl/source.(*PowerMSR).IsSupported(...)
    	/opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/power/rapl/source/msr.go:22
    github.com/sustainable-computing-io/kepler/pkg/power/rapl.init.0()
    	/opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/power/rapl/power.go:53 +0xdb
    


  • Make kepler metrics conform to the Prometheus metrics guideline

    Why is this PR needed? Currently it is hard to understand all the Prometheus metrics, or even to know which metrics we are exporting. The metric naming is complex and does not follow the Prometheus metric naming guidelines. More details are in issue #286.

    What does this PR do? It updates the Prometheus metrics, along with some changes needed to enable the new metrics.

    Additionally, Prometheus suggests reporting metrics only in joules rather than watts. Given that, we don't need to report the current power consumption, since it can be calculated using PromQL. There are more details about this in issue #286.
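
    For example (a sketch using one of the new container counters, kepler_container_package_joules_total, which also appears in the dashboard queries later on this page), the current power in watts can be derived from the joules counter, since watts are joules per second:

    rate(kepler_container_package_joules_total{}[1m])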

    For the sake of compatibility with other modules, we keep some deprecated metrics and will remove them later.

    Additional comments The changes are carefully separated in different commits to be easier to review.

    I will update the Grafana dashboard in another PR. There are already many updates in this PR....

    Signed-off-by: Marcelo Amaral [email protected]

  • [WIP][don't merge] Dev:1st impl for integration test

    Details for this PR: adds an e2e folder with end-to-end tests that run in two modes, based on the kepler_address environment setting:

    1. Build Kepler during the test process and run the tests against it locally.
    2. Run the test cases against a cluster after setting up port forwarding.

    Signed-off-by: Sam Yuan [email protected]

  • dial error: dial unix /tmp/estimator.sock: connect: no such file or directory

    Describe the bug After rolling the daemonset over to the latest image on the quay.io registry (sha256:01a86339a8acb566ddcee848640ed4419ad0bffac98529e9b489a3dcb1e671f5), the message from the title is shown constantly. Example output of the problem:

    2022/08/25 12:30:53 Kubelet Read: map[<pod-list-trimmed>]
    2022/08/25 12:30:53 dial error: dial unix /tmp/estimator.sock: connect: no such file or directory
    energy from pod (0 processes): name: <some-pod> namespace: <some-namespace>
    

    Is estimator.sock expected to be missing in the current state of the project?

    Each node reports the same error. As a side note, since then the nodes have not been logging any new Kepler metrics to Prometheus. I can't say whether these issues are connected, and the missing metrics might be some other local problem, but there it is.

    To Reproduce Steps to reproduce the behavior:

    1. Run kepler on OpenShift 4.11
    2. Check kepler-exporter container logs for presence of '/tmp/estimator.sock: connect: no such file or directory'

    Expected behavior /tmp/estimator.sock error is not reported.

    Desktop (please complete the following information):

    • OS: RedHat CoreOS 4.11
  • Exclude VM node when deploy kepler exporter

    Currently the Kepler exporter cannot collect data successfully on VM nodes. With this change, the Kepler exporter will no longer be deployed on VM nodes; only bare-metal nodes will be scheduled for the Kepler deployment.

    Signed-off-by: Hao, Ruomeng [email protected]

  • Energy consumption of CPU is 0

    Describe the bug Checking "Pod Current Energy Consumption" on the Grafana dashboard, the CPU energy consumption of each pod is 0. Checking Prometheus, "pod_curr_energy_in_core_millijoule" is 0 for all pods. "Total" and "DRAM" have data, but "CPU" is 0.

    The issue exists on both RHEL 8.6 and Ubuntu 22.04 hosts.

  • Update Grafana dashboards with the new container metrics

    Why is this PR needed? PR #287 updates the Prometheus metrics and affects the current Grafana dashboard: the new metrics report energy per container and have more meaningful names. More details are in issue #286.

    Currently, it is difficult to understand all the queries in the existing Grafana dashboard. There are some constant values that are not obvious and some queries that are wrong. For example:

    The sum_over_time(pod_curr_energy_in_core_millijoule{pod_namespace="$namespace", pod_name="$pod"}[24h])*15/3/3600000000 query: sum_over_time sums the metric within the timeframe (the value in the square brackets), producing a cumulative number from the gauge. The problem here is granularity: we know the gauge is reported every 3s, so the query does not correctly aggregate across those 3s intervals. Instead of a gauge, a counter should be used, e.g. pod_aggr_energy_in_core_millijoule, but of course sum_over_time makes no sense on a counter. If we use the counter, to get kWh we need to use the increase function:

    1W*s = 1J and 1J = (1/3600000)kWh = 0.000000277777777777778
    (sum(increase(pod_aggr_energy_in_core_millijoule{}[1h])))*0.000000277777777777778
    

    So, in Prometheus, metrics are based on averages and approximations. In fact, the increase function takes the per-second average over the time period (the rate) and multiplies it by the interval.
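
    In other words (a sketch, ignoring Prometheus's extrapolation at the window edges), increase(pod_aggr_energy_in_core_millijoule{}[1h]) is approximately equivalent to:

    rate(pod_aggr_energy_in_core_millijoule{}[1h]) * 3600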

    Also, if we are using a counter, dividing by 3 makes no sense, as the rate function already returns per-second values, and increase just takes the rate and multiplies it by the interval.

    Additionally, I didn't understand the multiplication by 15 and the division by 3600000000...

    Another example: the rate(pod_curr_energy_in_gpu_millijoule{}[1m])/3 query. The previous metric pod_curr_energy_in_gpu_millijoule was a gauge, and rate over a gauge doesn't make sense. Again, it would make sense to use the counter pod_aggr_energy_in_core_millijoule, but not to divide by 3.

    What does this PR do? It updates the Grafana dashboard with the new metrics and the proper queries.

    For the query that returns watts, we will have:

    sum without (command, container_name)(
        rate(kepler_container_package_joules_total{}[5s])
    )
    

    And another query returns kWh per day. Note that, to calculate kWh, we need to multiply the kilowatts by the hours of daily use; therefore we count how many hours within a day the container is running.

    sum by (pod_name, container_name) (
      (increase(kepler_container_package_joules_total{}[1h]) * $watt_per_second_to_kWh)
      *
      (count_over_time(kepler_container_package_joules_total{}[24h]) /
        count_over_time(kepler_container_package_joules_total{}[1h])
      )
    )
    

    I have also fixed other minor issues in the dashboard, such as:

    • having the All value in the namespace and pod variables
    • making the Coal, Natural Gas and Petroleum coefficients transparent and editable

    Additional comments (dashboard screenshot omitted)

    Signed-off-by: Marcelo Amaral [email protected]

  • Fix CI error

    resolve https://github.com/sustainable-computing-io/kepler/issues/193

    Change log:

    • Add a commit-push condition for the main branch.
    • Move test coverage to the default unit test (to avoid coverage depending on a specific build tag such as bcc).
    • Fix a bug where the test coverage file was missing.

    Signed-off-by: Sam Yuan [email protected]

  • VM: all node / pod energy report 0 (again..)

    Describe the bug After updating to the latest build with a few enhancements, my pod/node energy reports become 0 again. Switching back to v0.3, I can see the data reported correctly.

    latest

    kepler_node_energy_stat{cpu_architecture="Haswell",node_block_devices_used="0",node_curr_bytes_read="0",node_curr_bytes_writes="0",node_curr_cache_miss="0",node_curr_container_cpu_usage_seconds_total="0",node_curr_container_memory_working_set_bytes="0",node_curr_cpu_cycles="0",node_curr_cpu_instr="0",node_curr_cpu_time="0",node_curr_energy_in_core_joule="0",node_curr_energy_in_dram_joule="0",node_curr_energy_in_gpu_joule="0",node_curr_energy_in_other_joule="0",node_curr_energy_in_pkg_joule="0",node_curr_energy_in_uncore_joule="0",node_name="jitest40"} 0
    
    

    v0.3

    # TYPE node_curr_energy_in_core_joule gauge
    node_curr_energy_in_core_joule{instance="jitest40"} 0.026
    # HELP node_curr_energy_in_dram_joule node_ current energy consumption in dram (joule)
    # TYPE node_curr_energy_in_dram_joule gauge
    node_curr_energy_in_dram_joule{instance="jitest40"} 0.708
    # HELP node_curr_energy_in_gpu_joule node_ current energy consumption in gpu (joule)
    # TYPE node_curr_energy_in_gpu_joule gauge
    node_curr_energy_in_gpu_joule{instance="jitest40"} 0
    # HELP node_curr_energy_in_other_joule node_ current energy consumption in other (joule)
    # TYPE node_curr_energy_in_other_joule gauge
    node_curr_energy_in_other_joule{instance="jitest40"} 0
    # HELP node_curr_energy_in_pkg_joule node_ current energy consumption in pkg (joule)
    # TYPE node_curr_energy_in_pkg_joule gauge
    node_curr_energy_in_pkg_joule{instance="jitest40"} 0.026
    # HELP node_curr_energy_in_uncore_joule node_ current energy consumption in uncore (joule)
    # TYPE node_curr_energy_in_uncore_joule gauge
    node_curr_energy_in_uncore_joule{instance="jitest40"} 0
    # HELP node_curr_energy_joule node_ current energy consumption (joule)
    # TYPE node_curr_energy_joule gauge
    node_curr_energy_joule{instance="jitest40"} 0.026
    
    


  • getKernelVersion doesn't work at all

    Describe the bug

    https://github.com/sustainable-computing-io/kepler/blob/main/pkg/config/config.go#L65

    Paste the following into https://go.dev/play/ and run it:

    package main
    
    import (
    	"encoding/json"
    	"fmt"
    
    	"github.com/zcalusic/sysinfo"
    )
    
    func main() {
    
    	var si sysinfo.SysInfo
    
    	si.GetSysInfo()
    
    	data, err := json.MarshalIndent(&si, "", "  ")
    	if err == nil {
    		var result map[string]map[string]string
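    		// note: sysinfo emits numeric JSON fields (e.g. memory size),
    		// so unmarshalling into map[string]map[string]string fails below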
    		if err = json.Unmarshal(data, &result); err != nil {
    			fmt.Println("----")
    			fmt.Println(err)
    			fmt.Println("----")
    		}
    	}
    	fmt.Println("done")
    }
    
    
    ----
    json: cannot unmarshal number into Go value of type string
    ----
    done
    
    


  • implement model-based power estimator

    This PR introduces a dynamic way to estimate power via the Estimator class (pkg/model/estimator.go).

    • the model is expected to be dynamically downloaded into the data/model folder
    • a Python program runs as a child process and applies the trained model to the values read, communicating over a Unix domain socket
    • the model class is implemented in Python and currently supports .h5 Keras models, .sav scikit-learn models, and a simple ratio model that computes metric importance by correlation to power

    There are three additional points needed to integrate this class into Kepler:

    1. Initialize in exporter.go:
    errCh := make(chan error)
    estimator := &model.Estimator{
       Err: errCh,
    }
    // start python program (pkg/model/py/estimator.py) 
    // it will listen for PowerRequest by the unix domain socket "/tmp/estimator.sock"
    go estimator.StartPyEstimator()
    defer estimator.Destroy()
    
    2. Call the GetPower function in reader.go:
    // it will create PowerRequest and send to estimator.py via the unix domain socket
    (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32 {} 
    
    • modelName refers to the model folder in /data/model, which contains a metadata.json giving the remaining model details such as the model file, feature-engineering pkl files, features, error, and so on (the minimum-error model is auto-selected if this is empty, "")
    • xCols refers to the features
    • xValues refers to the values of each feature for each pod [no. pods x no. features]
    • corePower refers to the core power for each package (leave it empty if not available)
    • dramPower, gpuPower, and otherPower are analogous to corePower
    3. Put initial models into the data/model folder of the container (this can be done statically in the Docker image or via volumes in the deployment manifest).

    Check the example usage in pkg/model/estimator_test.go, and see the sketch below.
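
    A minimal usage sketch based on the GetPower signature above (the feature names and values are hypothetical, for illustration only):

    // hypothetical example: xCols/xValues are made up; estimator is the
    // *model.Estimator created in exporter.go as shown above
    xCols := []string{"cpu_cycles", "cache_miss"}
    xValues := [][]float32{
        {1.2e9, 3.4e5}, // pod 1
        {8.0e8, 1.1e5}, // pod 2
    }
    // empty modelName auto-selects the minimum-error model;
    // nil component powers mean those readings are not available
    podPowers := estimator.GetPower("", xCols, xValues, nil, nil, nil, nil)
    // podPowers holds one estimated power value per pod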

    If you agree with this direction, we can modify estimator.py to:

    • support other modeling classes
    • select the applicable features from available features
    • connect to kepler-model-server to update the model

    Signed-off-by: Sunyanan Choochotkaew [email protected]

  • why use dummy Impl of power component instead of estimated?

    Is your feature request related to a problem? Please describe.

    We use the dummy implementation if neither RAPL nor MSR is available: https://github.com/sustainable-computing-io/kepler/blob/main/pkg/power/components/power.go#L60

    but we do have an estimated implementation: https://github.com/sustainable-computing-io/kepler/blob/main/pkg/power/components/source/estimate.go

    So, judging by the names, wouldn't the estimated implementation be more suitable than the dummy one?


  • Consume hardware power metrics from Hardware Sentry

    Is your feature request related to a problem? Please describe. Solutions already exist to collect power metrics, and there are even semantic conventions for them. It would be nice if Kepler leveraged these.

    Describe the solution you'd like Example of a solution that collects hardware power metrics: Hardware Sentry. It's free but it's not yet open-source.

    It would be greatly beneficial to Kepler if it could use Hardware Sentry as a source for hardware power metrics (notably hw_host_energy_joules_total and hw_energy_joules_total{hw_type="cpu|gpu|memory|physical_disk|network"}).

    Also, OpenTelemetry has defined semantic conventions for hardware, including power and energy metrics. Kepler should follow these conventions.

    Describe alternatives you've considered None, really.

    Additional context I am the CEO of the company who develops Hardware Sentry. We're pushing for solutions that help companies reduce the carbon footprint of their data centers (notably with temperature optimization). I'm very happy to discover Kepler and sustainable-computing-io!

  • containerIDToContainerInfo should be updated to reflect removed container?

    Describe the bug

    https://github.com/sustainable-computing-io/kepler/blob/main/pkg/cgroup/resolve_container.go#L56

    This map seems to be defined and updated when containers are created, but there seems to be no place where it is updated when a pod is destroyed?


  • e2e tests for Kepler, estimator, and model server

    Is your feature request related to a problem? Please describe. All of the components should be e2e tested on bare metal and VMs (especially in CI).

    Describe the solution you'd like The tests should verify that:

    • [ ] all the components are configured correctly, up and running
    • [ ] the models (ebpf, cgroup, etc) can be trained and updated online
  • Build manifest deployment with options

    Is your feature request related to a problem? Please describe.

    • The current manifest build contains duplicated definitions maintained separately for plain Kubernetes, OpenShift, VM, and BM.
    • Adding integration of the estimator sidecar and the model server compounds the duplication above.
    • Some settings, such as the image or the user namespace in the rolebinding, can be handled more properly with kustomize instead of sed or manual changes.

    Describe the solution you'd like

    The solution is to build the manifests from the same base, with defined patches and additional resources.


A Prometheus exporter which scrapes metrics from CloudLinux LVE Stats 2

CloudLinux LVE Exporter for Prometheus LVE Exporter - A Prometheus exporter which scrapes metrics from CloudLinux LVE Stats 2 Help on flags: -h, --h

Nov 2, 2021
cluster-api-state-metrics (CASM) is a service that listens to the Kubernetes API server and generates metrics about the state of custom resource objects related to Kubernetes Cluster API.

Overview cluster-api-state-metrics (CASM) is a service that listens to the Kubernetes API server and generates metrics about the state of custom resou

Oct 27, 2022
Json-log-exporter - A Nginx log parser exporter for prometheus metrics

json-log-exporter A Nginx log parser exporter for prometheus metrics. Installati

Jan 5, 2022
The metrics-agent collects allocation metrics from a Kubernetes cluster system and sends the metrics to cloudability

metrics-agent The metrics-agent collects allocation metrics from a Kubernetes cluster system and sends the metrics to cloudability to help you gain vi

Jan 14, 2022
Vulnerability-exporter - A Prometheus Exporter for managing vulnerabilities in kubernetes by using trivy

Kubernetes Vulnerability Exporter A Prometheus Exporter for managing vulnerabili

Dec 4, 2022
How to build production-level services in Go leveraging the power of Kubernetes

Ultimate Service Copyright 2018, 2019, 2020, 2021, Ardan Labs [email protected] Ultimate Service 3.0 Classes This class teaches how to build producti

Oct 22, 2021
Netstat exporter - Prometheus exporter for exposing reserved ports and it's mapped process

Netstat exporter Prometheus exporter for exposing reserved ports and it's mapped

Feb 3, 2022
Metrics collector and ebpf-based profiler for C, C++, Golang, and Rust

Apache SkyWalking Rover SkyWalking Rover: Metrics collector and ebpf-based profiler for C, C++, Golang, and Rust. Documentation Official documentation

Jan 6, 2023
Openvpn exporter - Prometheus OpenVPN exporter For golang

Prometheus OpenVPN exporter Please note: This repository is currently unmaintain

Jan 2, 2022
Amplitude-exporter - Amplitude charts to prometheus exporter PoC

Amplitude exporter Amplitude charts to prometheus exporter PoC. Work in progress

May 26, 2022
📡 Prometheus exporter that exposes metrics from SpaceX Starlink Dish

Starlink Prometheus Exporter A Starlink exporter for Prometheus. Not affiliated with or acting on behalf of Starlink(™) ?? Starlink Monitoring System

Dec 19, 2022
Prometheus exporter for Chia node metrics

chia_exporter Prometheus metric collector for Chia nodes, using the local RPC API Building and Running With the Go compiler tools installed: go build

Sep 19, 2022
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM

DCGM-Exporter This repository contains the DCGM-Exporter project. It exposes GPU metrics exporter for Prometheus leveraging NVIDIA DCGM. Documentation

Dec 27, 2022
A Prometheus metrics exporter for AWS that fills in gaps CloudWatch doesn't cover

YAAE (Yet Another AWS Exporter) A Prometheus metrics exporter for AWS that fills in gaps CloudWatch doesn't cover About This exporter is meant to expo

Dec 10, 2022
Prometheus metrics exporter for libvirt.

Libvirt exporter Prometheus exporter for vm metrics written in Go with pluggable metric collectors. Installation and Usage If you are new to Prometheu

Jul 4, 2022
Prometheus Exporter for Kvrocks Metrics

Prometheus Kvrocks Metrics Exporter This is a fork of oliver006/redis_exporter to export the kvrocks metrics. Building and running the exporter Build

Sep 7, 2022
A prometheus exporter which reports metrics about your Gmail inbox.

prometheus-gmail-exporter-go A prometheus exporter for gmail. Heavily inspired by https://github.com/jamesread/prometheus-gmail-exporter, but written

Nov 15, 2022
Openshift's hpessa-exporter allows users to export SMART information of local storage devices as Prometheus metrics, by using HPE Smart Storage Administrator tool

hpessa-exporter Overview Openshift's hpessa-exporter allows users to export SMART information of local storage devices as Prometheus metrics, by using

Jan 17, 2022
Exporter your cypress.io dashboard into prometheus Metrics

Cypress.io dashboard Prometheus exporter Prometheus exporter for a project from Cypress.io dashboards, giving the ability to alert, make special opera

Feb 8, 2022