CI: Improve test.summary rule #462

marshallward · 2023-08-17T20:23:15Z

The test.summary rule was causing errors on our lustre filesystem, even when all tests were successful. While the reason is not yet clear, I suspect it is related to our `if ls results/*/*` tests, which may not behave as intended on all filesystems.

To guard against this, I have replaced the top-level check with `if [ -d results ]`, which checks whether there is any output from failed tests. The subdirectory tests remain, but at least these will only run if actual results exist.
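As a rough illustration of how the two checks differ, the sketch below wraps each one in a small function. The function names are hypothetical and are not taken from the actual Makefile or summary script; only the two test expressions come from the description above.

```shell
# Hypothetical helpers for illustration; the real script may be organised differently.

# Old-style check: succeeds only if the glob matches something `ls` can list.
# All output is discarded, so a failure here leaves no diagnostics behind.
has_results_ls() {
    ls "$1"/*/* > /dev/null 2>&1
}

# New top-level check: succeeds as soon as the directory exists,
# whether or not it contains anything.
has_results_dir() {
    [ -d "$1" ]
}
```

Note that the directory check is true for an empty `results` directory, while the `ls` check needs at least one file two levels down, so the two are not strictly equivalent.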

Other minor changes:

- The script to generate the summary was moved out of the Makefile and into a separate script.
- Unrelated to these changes, error output was extended from 20 to 40 lines, to provide more readable backtrace output.

codecov · 2023-08-17T20:28:16Z

Codecov Report

Merging #462 (1778eca) into dev/gfdl (d60c2e0) will increase coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 1778eca differs from pull request most recent head 95ca617. Consider uploading reports for the commit 95ca617 to get more accurate results

@@            Coverage Diff            @@
##           dev/gfdl     #462   +/-   ##
=========================================
  Coverage     38.03%   38.03%           
=========================================
  Files           269      269           
  Lines         77554    77554           
  Branches      14319    14319           
=========================================
+ Hits          29494    29496    +2     
+ Misses        42709    42707    -2     
  Partials       5351     5351

see 1 file with indirect coverage changes


marshallward · 2023-08-18T18:49:01Z

I should say that while I think this is an overall improvement, I am still unsure if it fixes the original problem, which I have yet to actually replicate.

The issue is that make test.summary produces odd output, something like this:

lustre :

and then also raises an error.

I believe this is because (for example) `ls results/*/std.*.out` produces output containing the word `lustre`, which is then placed in the summary output. Any error is sent to `/dev/null`, so we are ultimately left with very little information.
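One possible way to avoid depending on whatever `ls` prints is to let the shell expand the glob and test each candidate directly. This is only a sketch under that assumption; the function name `list_std_out` is hypothetical and does not appear in the PR.

```shell
# Hypothetical helper: print the std.*.out files under a results directory
# without running or parsing `ls`.
list_std_out() {
    for f in "$1"/*/std.*.out; do
        # An unmatched glob is left as the literal pattern, so confirm
        # that each candidate really exists before printing it.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Because nothing is redirected to `/dev/null`, any unexpected error message would still reach the log.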

Despite the supposed portability of this method, I think `if ls results/*/*` (and similar) was a bad strategy and should be phased out. I am removing it at the top level, hoping that this is sufficient to catch the rest.

Hallberg-NOAA

These changes should be helpful for providing more information, and hopefully they will reduce the rates of lustre-based failures as intended.


marshallward · 2023-08-23T19:03:50Z

These tests failed in the CI; the `if [ -d results ]` check was still evaluating as true despite all tests passing (and thus `results` should never have been created). Individual tests for regressions and checksums were also failing, despite no such tests.

I'll leave this open a bit longer to investigate the problem, but if the problem gets too deep then I may need to retract it.

marshallward · 2023-08-24T14:15:33Z

I'm going to close this for now, since I may need to make a lot of changes and don't want to spam the channel. The good news is that whatever is going on is now reproducible.

marshallward · 2023-08-24T14:41:27Z

I believe at least part of the problem here is that GNU and Intel are sharing the runner work directory on Gaea (configured with WORKSPACE), and leftover results directories are probably confusing the test summary script, which was never meant to be very sophisticated.
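The suspected failure mode can be reproduced in isolation. The sketch below is an illustration only: the directory layout, the `tc0` name, and both function names are hypothetical, and the real runner configuration is not shown in this thread.

```shell
# Hypothetical stand-in for the summary script's top-level check.
summary_sees_failures() {
    [ -d "$1/results" ]
}

# Simulate two compiler runs that share one work directory.
simulate_shared_workdir() {
    work=$(mktemp -d)
    # Run 1 (say, GNU): a test fails and leaves output behind.
    mkdir -p "$work/results/tc0"
    # Run 2 (say, Intel): every test passes and creates nothing,
    # but the leftover directory from run 1 is still visible.
    if summary_sees_failures "$work"; then
        echo "stale results detected"
    else
        echo "clean"
    fi
    rm -rf "$work"
}
```

Calling `simulate_shared_workdir` prints `stale results detected`, which matches the symptom of the check evaluating as true even though the second run had no failures.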

I don't see any benefit to using WORKSPACE for the .testing runs on Gaea, and I would suggest we stop using it, but I should discuss this with @adcroft first, since he was the one who requested this feature.

marshallward requested a review from adcroft August 17, 2023 21:43

adcroft approved these changes Aug 21, 2023


Hallberg-NOAA approved these changes Aug 22, 2023


Hallberg-NOAA force-pushed the make_test_summary_fix branch from a259dc2 to 95ca617 Compare August 22, 2023 19:13

marshallward closed this Aug 24, 2023

marshallward mentioned this pull request Aug 24, 2023

CI: Run test (and test.summary) locally #473

Merged
