Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Improve test.summary rule #462

Closed

Conversation

marshallward
Copy link
Member

@marshallward marshallward commented Aug 17, 2023

The test.summary rule was causing errors on our lustre filesystem, even when all tests were successful. While the reason is still not yet clear, I suspect it is related to our if ls results/*/* tests, which may not behave as intended on all filesystems.

To somewhat secure us against this, I have replaced the top-level check with if [ -d results ] to check if there are any output from failed tests. The subdirectory tests still remain, but at least these will only happen if any actual results exists.

Other minor changes:

  • The script to generate the summary was moved out of the Makefile and into a separate script.

  • Unrelated to these changes, error output was extended from 20 to 40 lines, to provide more readable backtrace output.

@codecov
Copy link

codecov bot commented Aug 17, 2023

Codecov Report

Merging #462 (1778eca) into dev/gfdl (d60c2e0) will increase coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 1778eca differs from pull request most recent head 95ca617. Consider uploading reports for the commit 95ca617 to get more accurate results

@@            Coverage Diff            @@
##           dev/gfdl     #462   +/-   ##
=========================================
  Coverage     38.03%   38.03%           
=========================================
  Files           269      269           
  Lines         77554    77554           
  Branches      14319    14319           
=========================================
+ Hits          29494    29496    +2     
+ Misses        42709    42707    -2     
  Partials       5351     5351           

see 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@marshallward
Copy link
Member Author

I should say that while I think this is an overall improvement, I am still unsure if it fixes the original problem, which I have yet to actually replicate.

The issue is that make test.summary produces odd output, something like this:

lustre : 

and then also raises an error.

I believe this is because (for example) ls results/*/std.*.out produces output containing the word lustre, which is placed in the output. Any error is sent /dev/null, so we ultimately are left with very little information.

Despite the supposed portability of this method, I think if ls results/*/* etc was a bad strategy, and should be phased out. I am killing it at the top, hoping that it is sufficient to catch the rest.

Copy link
Member

@Hallberg-NOAA Hallberg-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes should be helpful for providing more information, and hopefully they will reduce the rates of lustre-based failures as intended.

The test.summary rule was causing errors on our lustre filesystem, even
when all tests were successful.  While the reason is still not yet
clear, I suspect it is related to our `if ls results/*` tests, which may
not behave as intended on all filesystems.

To somewhat secure us against this, I have replaced the top-level check
with `if [ -d results ]` to check if there are any output from failed
tests.  The subdirectory tests still remain, but at least these will
only happen if any actual results exists.

Other minor changes:

- The script to generate the summary was moved out of the Makefile and
  into a separate script.

- Unrelated to these changes, error output was extended from 20 to 40
  lines, to provide more readable backtrace output.
@marshallward
Copy link
Member Author

These tests failed in the CI; the if [ -d results ] check was still evaluating as true despite all tests passing (and thus results should never have been created). Individual tests for regressions and checksums were also failing, despite no such tests.

I'll leave this open a bit longer to investigate the problem, but if the problem gets too deep then I may need to retract it.

@marshallward
Copy link
Member Author

I'm going to close this for now, since I may need to make a lot of changes and don't want to spam the channel. The good news is that whatever is going on is now reproducible.

@marshallward
Copy link
Member Author

marshallward commented Aug 24, 2023

I believe at least part of the problem here is that GNU and Intel are sharing the runner work directory on Gaea (configured with WORKSPACE), and leftover results directories are probably confusing the test summary script, which was never meant to be very sophisticated.

I don't see any benefit to even using WORKSPACE for the .testing runs on Gaea, and I would suggest we not use them, but perhaps I need to discuss this with @adcroft, since he was the one who requested this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants