[BUG] Energy count overflow at >O(1k sec)? #646

Open
rubenhorn opened this issue Nov 18, 2024 · 1 comment

Describe the bug
The energy measurement seems to overflow when the application under test runs for longer than around 1.5k seconds.

Example:
In case no. 1, the application runs for 1546.03 sec and PKG1 reads 248,923.0 J.
In case no. 2, the application runs for 1795.02 sec and PKG1 reads 30,811.1 J.

To Reproduce

  • LIKWID build from GitHub
$ likwid-powermeter --version
likwid-powermeter -- Version 5.3.0 (commit: 0123456789)
  • OS: Rocky 9.3 (5.14.0-427.28.1.el9_4.x86_64)
  • CPU: 2x Intel Xeon Platinum 8260L
  • MPI-based application:
$ mpirun --version
mpirun (Open MPI) 4.1.1
  • CMD line in batch script: srun likwid-powermeter ./simulation

Additional context

I see this effect at the same point when using my own RAPL measurement plugin (C++), which accumulates the energy consumption over time from the sysfs interface instead of the MSR.
This leads me to suspect that it may be caused by summing up the measurements in uJ and only converting them at the end for the output.

@rubenhorn rubenhorn added the bug label Nov 18, 2024
ipatix (Contributor) commented Nov 18, 2024

This is unfortunately a limitation of the RAPL counter. The exact overflow point depends on the RAPL energy unit, combined with the counter being only 32 bits wide. In your case the energy unit is 2^(-14) J, so about 61 µJ. Currently, there is not really anything you can do about this, except repeatedly reading the counter and keeping track of how many times it has overflowed.
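For reference, the wrap point implied by these numbers can be worked out directly. The sketch below assumes the 2^(-14) J unit and the 32-bit counter width quoted above, and that exactly one wrap occurred in case no. 2. `unwrap` is a hypothetical helper name for illustration, not part of LIKWID.

```python
# Assumptions: 2^-14 J energy unit and a 32-bit counter, as quoted above.
ENERGY_UNIT_J = 2.0 ** -14
COUNTER_BITS = 32

# Energy at which the raw counter wraps back to zero: 2^32 * 2^-14 J = 2^18 J.
WRAP_J = (2 ** COUNTER_BITS) * ENERGY_UNIT_J


def unwrap(raw_readings, modulus=2 ** COUNTER_BITS):
    """Sum the deltas between successive raw counter readings.

    A negative delta is treated as exactly one wrap, so readings must be
    taken more often than the counter needs to run through its full range.
    """
    total = 0
    prev = raw_readings[0]
    for cur in raw_readings[1:]:
        delta = cur - prev
        if delta < 0:
            delta += modulus  # counter wrapped once since the last reading
        total += delta
        prev = cur
    return total


# Case no. 2 from the report, assuming exactly one wrap was missed.
case2_corrected_j = 30811.1 + WRAP_J
```

Under these assumptions `WRAP_J` is 262,144 J and case no. 2 comes to roughly 292,955 J, or about 163 W on average over 1795.02 sec. That is close to the roughly 161 W average of case no. 1 (248,923.0 J over 1546.03 sec), which stays just below the wrap point.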

Now, we should be able to keep track of overflows in likwid-powermeter. However, the LIKWID API currently does not expose the energy unit. So we could just guess the overflow point by taking the last measurement and rounding up to the next power of two. I'll check if we can solve this in a bugfix release for LIKWID 5.4.1.
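A minimal sketch of the power-of-two guess described above. `guess_wrap_point` is a hypothetical name, and the heuristic only gives the right answer if the reading passed in is already above half of the true range.

```python
import math


def guess_wrap_point(last_reading_j: float) -> float:
    """Round a positive energy reading up to the next power of two (in joules)."""
    if last_reading_j <= 0:
        raise ValueError("reading must be positive")
    return 2.0 ** math.ceil(math.log2(last_reading_j))


# Reading from case no. 1 in the report (taken shortly before the wrap).
guess_case1 = guess_wrap_point(248923.0)

# Reading from case no. 2 (taken after the wrap): the guess is far too small.
guess_case2 = guess_wrap_point(30811.1)
```

With the numbers from the report, the case no. 1 reading rounds up to 262,144 J (2^18), while the case no. 2 reading rounds up to only 32,768 J (2^15). So the guess would have to be based on the largest reading seen before a wrap, not on the final value.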

In the meantime the only real option is that you keep track of the last reading yourself using sysfs. I'm not sure if sysfs exposes the range, so you'll have to guess the power-of-two where it overflows.
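As far as I know, the Linux powercap sysfs interface documents a `max_energy_range_uj` attribute next to `energy_uj`, which would avoid the guessing. The following is a sketch under that assumption. The class name and the example zone path are illustrative, which zone corresponds to PKG1 has to be checked on the machine, and the range value is used as the wrap modulus, which may be off by one counter tick.

```python
from pathlib import Path


class SysfsEnergyAccumulator:
    """Accumulate energy from a powercap zone across counter wraps.

    zone_dir is a powercap zone directory, for example
    /sys/class/powercap/intel-rapl:0 (example path, an assumption here).
    """

    def __init__(self, zone_dir):
        zone = Path(zone_dir)
        self._energy_path = zone / "energy_uj"
        # Range of the energy counter in microjoules, used as wrap modulus.
        self._range_uj = int((zone / "max_energy_range_uj").read_text())
        self._last_uj = self._read()
        self.total_uj = 0

    def _read(self):
        return int(self._energy_path.read_text())

    def sample(self):
        """Read the counter, add the delta since the last sample, return joules."""
        now_uj = self._read()
        delta = now_uj - self._last_uj
        if delta < 0:
            delta += self._range_uj  # assume exactly one wrap since last sample
        self.total_uj += delta
        self._last_uj = now_uj
        return self.total_uj / 1e6
```

`sample()` has to be called more often than the counter takes to run through its range. At the roughly 160 W seen in this report, a 262,144 J range lasts about 27 minutes, so sampling every few minutes leaves a comfortable margin.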
