Why is this PR needed?
PR #287 updates the Prometheus metrics, which affects the current Grafana dashboard: the new metrics report energy per container and have more meaningful names. More details are in issue #286.
It is currently difficult to understand all the queries in the existing Grafana dashboard. Some constant values are not obvious and some queries are wrong. For example:
The sum_over_time(pod_curr_energy_in_core_millijoule{pod_namespace=\"$namespace\", pod_name=\"$pod\"}[24h])*15/3/3600000000 metric:
sum_over_time sums the gauge samples within the time window (the value in the square brackets) to get a cumulative number. The problem here is granularity: the gauge is reported every 3s, so summing the sampled values does not capture the energy aggregated across every 3s interval. A counter should be used instead of a gauge, e.g., pod_aggr_energy_in_core_millijoule, although with a counter it no longer makes sense to use sum_over_time. With the counter, getting kWh requires the increase function:
1 W*s = 1 J and 1 J = (1/3600000) kWh = 0.000000277777777777778 kWh. Since the metric is in millijoules, the value also has to be divided by 1000:
(sum(increase(pod_aggr_energy_in_core_millijoule{}[1h])))/1000*0.000000277777777777778
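As a quick check of the conversion above, here is a minimal Python sketch. The function names are illustrative only and are not part of Kepler or the dashboard. It also shows that 3600000000 is the number of millijoules in 1 kWh, which may be what the division in the original query was meant to do.

```python
# 1 kWh = 1000 W * 3600 s = 3,600,000 W*s = 3,600,000 J
JOULES_PER_KWH = 3_600_000


def joules_to_kwh(joules: float) -> float:
    """Convert joules (watt-seconds) to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def millijoules_to_kwh(millijoules: float) -> float:
    """Convert millijoules to kilowatt-hours (extra factor of 1000)."""
    return joules_to_kwh(millijoules / 1000.0)


if __name__ == "__main__":
    print(joules_to_kwh(1))                   # kWh in one joule
    print(millijoules_to_kwh(3_600_000_000))  # 3.6e9 mJ expressed in kWh
```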
So, in Prometheus, metrics are based on averages and approximations. In fact, the increase function computes the average per-second rate over the time period and multiplies it by the interval.
Also, when using a counter, the division by 3 makes no sense, because the rate function already returns per-second values, and increase just takes the rate and multiplies it by the interval.
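The relationship between rate and increase can be sketched in Python as follows. This is a simplified model, not Prometheus's actual implementation: it ignores counter resets and the extrapolation Prometheus applies at the window boundaries, and the sample values are made up.

```python
from typing import List, Tuple

# (timestamp in seconds, counter value in millijoules)
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_increase(samples: List[Sample], window_seconds: float) -> float:
    """Per-second rate scaled up to the length of the query window."""
    return simple_rate(samples) * window_seconds


# Hypothetical counter samples, one every 3 s, growing by 300 mJ each time.
samples = [(0.0, 0.0), (3.0, 300.0), (6.0, 600.0), (9.0, 900.0)]

if __name__ == "__main__":
    print(simple_rate(samples))            # mJ per second, no division by 3 needed
    print(simple_increase(samples, 12.0))  # mJ over a 12 s window
```

Because the rate is already per second, dividing it again by the 3s reporting interval would understate the power by a factor of three.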
Additionally, I didn't understand the multiplication by 15 and the division by 3600000000...
Another example:
The rate(pod_curr_energy_in_gpu_millijoule{}[1m])/3 metric.
The previous metric pod_curr_energy_in_gpu_millijoule was a gauge, and rate over a gauge metric doesn't make sense. Again, it would make sense to use the counter pod_aggr_energy_in_core_millijoule, but without dividing by 3.
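The granularity problem with gauges described above can be illustrated with a small Python simulation. It assumes, purely for illustration, that the exporter reports every 3s while the scraper only sees one report out of five (a hypothetical 15s scrape interval). All numbers are made up.

```python
# Hypothetical energy used in each 3 s reporting interval, in millijoules.
per_interval_mj = [300, 310, 290, 305, 295, 300, 320, 280, 300, 300]

# Gauge view: each report exposes only the latest interval's energy.
gauge_reports = list(per_interval_mj)

# Counter view: each report exposes the running total since start.
counter_reports = []
running_total = 0
for energy in per_interval_mj:
    running_total += energy
    counter_reports.append(running_total)

# Assumption: the scraper sees one report out of every five.
SCRAPE_EVERY = 5
scraped_gauge = gauge_reports[SCRAPE_EVERY - 1::SCRAPE_EVERY]
scraped_counter = counter_reports[SCRAPE_EVERY - 1::SCRAPE_EVERY]

true_total = sum(per_interval_mj)
gauge_sum = sum(scraped_gauge)       # what summing scraped gauge samples yields
counter_delta = scraped_counter[-1]  # counter started at 0, so this is the increase

if __name__ == "__main__":
    print(true_total, gauge_sum, counter_delta)
```

In this sketch the summed gauge samples miss the intervals that were never scraped, while the counter still carries the full total.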
What does this PR do?
This PR updates the Grafana dashboard to use the new metrics and corrected queries.
The query that returns watts is:
sum without (command, container_name)(
rate(kepler_container_package_joules_total{}[5s])
)
Another query returns kWh per day. To calculate kWh, we need to multiply the kilowatts by the hours of daily use, so the query counts how many hours within a day the container has been running:
sum by (pod_name, container_name) (
(increase(kepler_container_package_joules_total{}[1h]) * $watt_per_second_to_kWh)
*
(count_over_time(kepler_container_package_joules_total{}[24h]) /
count_over_time(kepler_container_package_joules_total{}[1h])
)
)
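The arithmetic behind this query can be sketched in Python. This is only an illustration of the calculation, under the assumption that $watt_per_second_to_kWh equals 1/3600000. The function name and the sample numbers are hypothetical.

```python
# Assumed value of the dashboard variable $watt_per_second_to_kWh.
WATT_SECOND_TO_KWH = 1 / 3_600_000


def kwh_per_day(joules_last_hour: float,
                samples_last_24h: int,
                samples_last_hour: int) -> float:
    """Scale the last hour's energy by the number of hours with samples."""
    kwh_last_hour = joules_last_hour * WATT_SECOND_TO_KWH
    # Ratio of sample counts approximates hours the container was running.
    hours_running = samples_last_24h / samples_last_hour
    return kwh_last_hour * hours_running


if __name__ == "__main__":
    # Hypothetical container: 360,000 J in the last hour (100 W on average),
    # one sample every 3 s (1,200 per hour), running 12 of the last 24 hours.
    print(kwh_per_day(360_000, 14_400, 1_200))
```

Note that this extrapolates the most recent hour's consumption to every hour the container was running, so it is an estimate rather than an exact daily total.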
I have also fixed other minor issues in the dashboard, such as
- add the All value to the namespace and pod variables
- make the Coal, Natural Gas and Petroleum Coefficient transparent and editable
Additional comments

Signed-off-by: Marcelo Amaral [email protected]