[develop] update Gaea modulefile #836
@natalie-perlin I was able to clone your forked branch and successfully build the SRW App using your updated Gaea modulefile. However, there is an issue with the workflow_tools conda environment on Gaea. In PR #793, the regional_workflow conda environment was replaced with workflow_tools, but on Gaea workflow_tools isn't currently set up to work with the SRW: when attempting to run either the fundamental or coverage tests, this conda environment is unable to find jinja2. During the SRW CM meeting, you noted that there were changes that needed to be brought over to Gaea following the library changes to the machine. Is the workflow_tools conda environment one of the modifications that still needs to be addressed before running the SRW will be possible on Gaea? |
@MichaelLueken - |
@natalie-perlin I'm seeing results similar to yours: tests and tasks are reporting back with "SUCCEEDED", "UNAVAILABLE", or "DEAD". Interestingly, for the tasks that return "UNAVAILABLE", the log files show that the test successfully ran to completion. It's not clear to me why several tasks are returning "UNAVAILABLE". I am noticing that the "UNAVAILABLE" entries correspond with "STALLED" in the verbose output from . I have also noted that your branch is seven commits behind the current HEAD of the authoritative develop branch. It might be interesting to see if things change after updating to the latest develop. |
@MichaelLueken - thank you for your comments and for looking into the logs! Launching the WE2E tests by hand may involve explicit steps to initialize Lmod (source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh), to load the module wflow_gaea, and to activate the regional_workflow environment. This is now needed because the test launch switched to Python. These Lmod initialization and module loading/activation steps are likely hard-coded in the Jenkins tests. |
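For reference, the manual steps listed in the comment above can be sketched as a single shell command assembled in Python. This is only an illustration: the function name `manual_setup_command` and the `test_cmd` placeholder are hypothetical, and the sketch assumes that `wflow_gaea` is already visible on the module path and that `conda activate` is available once the module is loaded.

```python
import shlex

# Path quoted in the comment above; everything else here is illustrative.
LMOD_INIT = "/lustre/f2/dev/role.epic/contrib/Lmod_init.sh"


def manual_setup_command(test_cmd, module="wflow_gaea", conda_env="regional_workflow"):
    """Chain the manual setup steps and a test command with '&&'.

    test_cmd is a hypothetical placeholder for whatever launches the WE2E tests.
    """
    steps = [
        f"source {shlex.quote(LMOD_INIT)}",          # initialize Lmod
        f"module load {shlex.quote(module)}",        # load the workflow module
        f"conda activate {shlex.quote(conda_env)}",  # activate the conda environment
        test_cmd,                                    # run the tests last
    ]
    return " && ".join(steps)


if __name__ == "__main__":
    print(manual_setup_command("echo run-tests"))
```

On a branch that includes PR #793, `conda_env="workflow_tools"` would be passed instead of the default.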
@natalie-perlin You are correct. The Jenkins scripts will force I use csh on the RDHPCS machines, and I'm finding issues with the current None of this explains why the tests and tasks are returning with "UNAVAILABLE" or "DEAD", unfortunately. I also need to point out that your branch is missing the changes from PR #793. This PR removed the use of regional_workflow and replaced it with workflow_tools. We need to make sure that workflow_tools is also working correctly before we can move forward with this PR. |
Oh, thanks for pointing out the changes due for the etc/lmod-setup.csh script. |
Still sorting out workflow issues for Gaea: the whole job script needs to be submitted to a specific cluster (--clusters=c3,c4), not only the srun command. Which job template would need to be examined and/or possibly updated? |
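To illustrate the distinction raised above: Slurm accepts `--clusters` on `sbatch` as well as on `srun`, so the cluster selection can be placed in the batch script header and apply to the whole job. The sketch below renders such a header in Python. The function name `sbatch_header` and the job name are hypothetical, and this does not show how the SRW workflow itself generates its job scripts.

```python
def sbatch_header(job_name, clusters, extra=()):
    """Render a batch script header that pins the whole job to the given clusters.

    job_name and extra are illustrative; only --clusters=c3,c4 comes from the thread.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --clusters={','.join(clusters)}",  # job-level cluster selection
    ]
    lines.extend(f"#SBATCH {opt}" for opt in extra)  # any further directives
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sbatch_header("example_task", ["c3", "c4"]), end="")
```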
@natalie-perlin It looks like the jobs are being submitted via the |
@MichaelLueken - thank you, Michael. |
All worked OK after removing the following from my previously tested configuration: The test
|
Branch natalie-perlin:update-gaea-stack is updated as well. |
The fundamental tests have successfully run through to completion:
Will now try running the coverage tests using the Jenkins scripts. |
The Gaea WE2E coverage test suite successfully ran through to completion:
We should re-add Gaea to the Jenkinsfile so that the automated tests will once again run on Gaea. How would you like to go about doing this? I can open a PR in your fork to update the |
The SRW App both builds and runs on Gaea (both the fundamental and coverage tests have run successfully), so I will now approve these changes.
We will need to decide how to proceed with updating the Jenkinsfile (reactivating Gaea and moving from Jet to Jet-EPIC since Jet's hfv3gfs account's quota has been exceeded) before adding the Jenkins label to this PR.
…PIC due to resource issues on Jet (requiring update to .cicd/scripts/srw_*.sh scripts).
@BruceKropp-Raytheon's Functional Workflow Task Tests have failed on Gaea ( |
@MichaelLueken - please let me know if I can help by looking into the logs. |
The directory that contains the experiment on Gaea is /lustre/f2/dev/wpo/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-836, and the pipeline can be viewed at https://jenkins.epic.oarcloud.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-836/1/pipeline/232 |
I suspect that the issue is that a version of
and then |
I was able to find a fix and have pushed it. The fix was also tested on Hera. Will resubmit the Jenkins tests on Gaea. |
@MichaelLueken - |
@MichaelLueken -
In the Jenkins file in your link, python/3.9 is loaded as one of the default modules, which is not expected. Please see lines 354 and 357 of
This would indeed create an issue with loading miniconda3. I wonder why this did not cause trouble in my tests (unless the problems were overlooked and did not cause any failures!) |
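A quick way to check for the situation described above (a python module already loaded before miniconda3) is to inspect `LOADEDMODULES`, the colon-separated list of loaded modules that Lmod maintains in the environment. The following is a minimal sketch; the function name `loaded_python_modules` is hypothetical and the check is not part of the SRW App.

```python
import os


def loaded_python_modules(loadedmodules):
    """Return entries named 'python' from a colon-separated LOADEDMODULES value."""
    return [
        entry
        for entry in loadedmodules.split(":")
        if entry and entry.split("/")[0] == "python"  # match 'python' or 'python/<version>'
    ]


if __name__ == "__main__":
    found = loaded_python_modules(os.environ.get("LOADEDMODULES", ""))
    if found:
        print("python module(s) already loaded:", ", ".join(found))
    else:
        print("no python module loaded")
```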
It looks like the issue might be in The fix I pushed on Friday only adds |
@MichaelLueken - |
The Cheyenne Intel tests were manually run on Hera and all successfully passed:
The Cheyenne GNU tests were manually run on Hera and all successfully passed:
The Jet tests were manually run and all successfully passed:
The rerun of the automated tests on Gaea also succeeded. Moving forward with merging this work now. |
DESCRIPTION OF CHANGES:
Files changed:
./modulefiles/build_gaea_intel.lua
./modulefiles/wflow_gaea.lua
./modulefiles/tasks/gaea/python_srw.lua
./modulefiles/tasks/gaea/plot_allvars.local.lua
./modulefiles/tasks/gaea/run_vx.local.lua
./ush/machine/gaea.yaml
./lmod-setup.csh
Type of change
TESTS CONDUCTED:
-- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot, task plot_allvars (DEAD)
-- test nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16, tasks make_ics_mem000, make_lbcs_mem000 (DEAD), and subsequent tasks that were not run
-- grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16, tasks run_MET_EnsembleStat_vx_APCP01h, run_MET_EnsembleStat_vx_APCP03h, run_MET_EnsembleStat_vx_APCP06h, run_MET_EnsembleStat_vx_REFC, run_MET_EnsembleStat_vx_RETOP, run_MET_EnsembleStat_vx_SFC, run_MET_EnsembleStat_vx_UPA (DEAD)
DEPENDENCIES:
DOCUMENTATION:
ISSUE:
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR:
CONTRIBUTORS (optional):