Describe the bug
The energy measurement seems to overflow when the application under test runs for longer than roughly 1,500 seconds.
Example:
In case no. 1, the application runs for 1546.03 sec and PKG1 reads 248,923.0 J.
In case no. 2, the application runs for 1795.02 sec and PKG1 reads 30,811.1 J.
To Reproduce
LIKWID build from GitHub
$ likwid-powermeter --version
likwid-powermeter -- Version 5.3.0 (commit: 0123456789)
OS: Rocky 9.3 (5.14.0-427.28.1.el9_4.x86_64)
CPU: 2x Intel Xeon Platinum 8260L
MPI-based application:
$ mpirun --version
mpirun (Open MPI) 4.1.1
CMD line in batch script: srun likwid-powermeter ./simulation
Additional context
I see this effect at the same point when using my own RAPL measurement plugin (C++), which accumulates the energy consumption over time from the sysfs interface instead of the MSR.
This leads me to suspect that it may be caused by summing up the measurements in uJ and only converting them at the end for the output.
This is unfortunately a limitation of the RAPL counter. The exact overflow point depends on the RAPL energy unit combined with the counter being only 32 bits wide. In your case the energy unit is 2^(-14) J, or about 61 µJ. Currently, there is not really anything you can do about this except read the counter repeatedly and keep track of how many times it has overflowed.
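To make the numbers concrete: a 32-bit counter with a unit of 2^(-14) J wraps at 2^32 × 2^(-14) J = 2^18 J = 262,144 J. The 248,923.0 J reading from case no. 1 sits just below that, and the 30,811.1 J from case no. 2 would correspond to roughly 292,955 J if the counter wrapped exactly once. The sketch below shows that arithmetic, plus a wrap-tolerant difference of two raw readings. The function names are made up for illustration and are not part of LIKWID.

```cpp
#include <cstdint>

// Energy in joules at which a 32-bit RAPL counter wraps, given the
// energy-unit exponent (14 means one count equals 2^-14 J).
constexpr double wrap_range_joules(unsigned unit_exponent) {
    return static_cast<double>(std::uint64_t{1} << 32) /
           static_cast<double>(std::uint64_t{1} << unit_exponent);
}

// Counts elapsed between two raw 32-bit readings. Unsigned subtraction is
// modulo 2^32, so the result is correct across at most one wrap.
constexpr std::uint32_t delta_counts(std::uint32_t before, std::uint32_t after) {
    return static_cast<std::uint32_t>(after - before);
}

// Converts a number of counts to joules for the given unit exponent.
constexpr double counts_to_joules(std::uint64_t counts, unsigned unit_exponent) {
    return static_cast<double>(counts) /
           static_cast<double>(std::uint64_t{1} << unit_exponent);
}
```

Summing `delta_counts` results into a 64-bit total only works if the counter is sampled at least once per wrap period. At the roughly 161 W average implied by case no. 1, that period is on the order of 1,600 seconds.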
Now, we should be able to keep track of overflows in likwid-powermeter. However, the LIKWID API currently does not expose the energy unit. So we could just guess the overflow point by taking the last measurement and rounding up to the next power of two. I'll check if we can solve this in a bugfix release for LIKWID 5.4.1.
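The power-of-two guess described above could look roughly like this. It is only a sketch of the heuristic, not what LIKWID does, and `guess_wrap_range_joules` is a hypothetical name.

```cpp
#include <cmath>

// Smallest power of two (in joules) that is >= the given reading.
// Assumes last_reading_joules > 0.
inline double guess_wrap_range_joules(double last_reading_joules) {
    return std::exp2(std::ceil(std::log2(last_reading_joules)));
}
```

For the 248,923.0 J reading this gives 262,144 J. The heuristic only works when the reading was taken before a wrap and exceeds half the true range: applied to the already wrapped 30,811.1 J reading, it would guess 32,768 J.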
In the meantime the only real option is that you keep track of the last reading yourself using sysfs. I'm not sure if sysfs exposes the range, so you'll have to guess the power-of-two where it overflows.
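For the sysfs route, the Linux powercap interface documents a `max_energy_range_uj` attribute next to `energy_uj`, which may remove the need to guess. The counter behind `energy_uj` wraps as well, which would fit seeing the effect at the same point with a sysfs-based plugin. Below is a minimal C++17 sketch of a wrap-aware accumulator. `read_u64` and `WrapAwareEnergy` are hypothetical helpers, and treating the range value as the exact wrap point is an assumption.

```cpp
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>

// Reads one unsigned integer from a sysfs file; empty if the read fails.
inline std::optional<std::uint64_t> read_u64(const std::string& path) {
    std::ifstream in(path);
    std::uint64_t value = 0;
    if (in >> value) return value;
    return std::nullopt;
}

// Accumulates energy in microjoules across counter wraps.
// Hypothetical helper for illustration, not part of LIKWID.
class WrapAwareEnergy {
public:
    WrapAwareEnergy(std::uint64_t range_uj, std::uint64_t first_uj)
        : range_uj_(range_uj), last_uj_(first_uj) {}

    // Feed the latest energy_uj reading. This must be called at least
    // once per wrap period, otherwise a full wrap goes unnoticed.
    void update(std::uint64_t now_uj) {
        if (now_uj >= last_uj_) {
            total_uj_ += now_uj - last_uj_;
        } else {
            // Counter wrapped: add the remainder up to the range, then
            // the part counted since the wrap (assumed wrap point).
            total_uj_ += (range_uj_ - last_uj_) + now_uj;
        }
        last_uj_ = now_uj;
    }

    std::uint64_t microjoules() const { return total_uj_; }
    double joules() const { return static_cast<double>(total_uj_) * 1e-6; }

private:
    std::uint64_t range_uj_;
    std::uint64_t last_uj_;
    std::uint64_t total_uj_ = 0;
};
```

A plugin would read the range once, for example from `/sys/class/powercap/intel-rapl:0/max_energy_range_uj`, and then call `update()` periodically with the value of `energy_uj` from the same directory. Which `intel-rapl:N` zone maps to which package varies between systems, so that path is an assumption too.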