ClimaCoupler Lessons Learned
- set `export CLIMACOMMS_CONTEXT=SINGLETON` (avoids the `PMI2_Init` error)
- get an interactive session on a compute node: `srun --pty -t hh:mm:ss -n tasks -N nodes /bin/bash -l`
- for more SLURM commands see https://www.hpc.caltech.edu/documentation/slurm-commands
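For example, a 2-hour interactive session with 4 tasks on a single node (the time limit and task counts here are illustrative) could be requested with:

```sh
srun --pty -t 02:00:00 -n 4 -N 1 /bin/bash -l
```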
- We use CliMA's CaltechBox to store external files, namely in `ClimaCoupler/data`.
- To download a file, grab a link by clicking on Share. Be sure to grab the Direct Link (with the file extension) by clicking on Link Settings. If you use the default link, your file may not be readable.
- For example, to download the file `https://caltech.box.com/s/123.hdf5` from Julia, you can use two approaches:
  - use `Downloads.download("https://caltech.box.com/s/123.hdf5", "your_file_name.hdf5")`
  - use `Artifacts.jl` (and `ArtifactWrappers.jl` with some convenience functions), which we use for more formal file tracking and for more complex containers (e.g., tarballs):
```julia
import ArtifactWrappers as AW

# Register the Box file as an artifact and return the local folder it is stored in
function your_dataset_path()
    _dataset = AW.ArtifactWrapper(
        @__DIR__,
        "123",
        AW.ArtifactFile[AW.ArtifactFile(
            url = "https://caltech.box.com/s/123.hdf5",
            filename = "your_file_name.hdf5",
        ),],
    )
    return AW.get_data_folder(_dataset)
end

sst_data = joinpath(your_dataset_path(), "your_file_name.hdf5")
```
- Reading files
  - NetCDF: use `NCDatasets.NCDataset("your_file_name.nc")`
  - HDF5: use `ClimaCore.InputOutput.HDF5Reader("your_file_name.hdf5")`
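A minimal sketch of both readers, assuming `NCDatasets.jl` and `ClimaCore` are in the active environment; the variable name "SST" and field name "ocean_sst" are placeholders, and newer `ClimaCore` versions may also require a ClimaComms context argument for `HDF5Reader`:

```julia
using NCDatasets
import ClimaCore: InputOutput

# NetCDF: open the dataset, read a variable, then close the file
ds = NCDatasets.NCDataset("your_file_name.nc")
sst = Array(ds["SST"])      # "SST" is a placeholder variable name
close(ds)

# HDF5 (ClimaCore output): open a reader and read a stored field by name
reader = InputOutput.HDF5Reader("your_file_name.hdf5")
field = InputOutput.read_field(reader, "ocean_sst")  # placeholder field name
close(reader)
```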
This example runs the tests in `ClimaCoupler.jl/test/mpi_tests/run_mpi_tests.jl` using MPI with up to 3 nodes. To run other tests, navigate to the desired directory and change the `include` command accordingly. These commands should be run on the HPC cluster.
```sh
srun -n 3 -t 01:00:00 --pty bash  # request 3 processors for 1 hour
cd ClimaCoupler.jl/test           # navigate to correct test directory
module purge
module load julia/1.8.1 openmpi/4.1.1 hdf5/1.12.1-ompi411
export CLIMACORE_DISTRIBUTED="MPI"
export JULIA_MPI_BINARY="system"
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.build()'
julia --project -e 'using Pkg; Pkg.build("MPI"); Pkg.build("HDF5")'
julia --project -e 'using MPIPreferences; MPIPreferences.use_system_binary()'
julia --project -e 'include("mpi_tests/run_mpi_tests.jl")'
```
Alternatively, this set of commands can be run using `sbatch` instead of `srun` by adding all lines after the `srun` line to a bash script, e.g. `script.sh`, then running `sbatch -n 3 -t 01:00:00 script.sh`.
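A sketch of such a `script.sh`, mirroring the interactive commands above (the module versions are the ones listed there):

```sh
#!/bin/bash
# script.sh: run the MPI tests under sbatch (mirrors the interactive commands above)
cd ClimaCoupler.jl/test
module purge
module load julia/1.8.1 openmpi/4.1.1 hdf5/1.12.1-ompi411
export CLIMACORE_DISTRIBUTED="MPI"
export JULIA_MPI_BINARY="system"
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.build()'
julia --project -e 'using Pkg; Pkg.build("MPI"); Pkg.build("HDF5")'
julia --project -e 'using MPIPreferences; MPIPreferences.use_system_binary()'
julia --project -e 'include("mpi_tests/run_mpi_tests.jl")'
```

This is then submitted with `sbatch -n 3 -t 01:00:00 script.sh`.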
For debugging it may be useful to run the `MPI.mpiexec` command from the REPL. In that case, make sure you set up your environment and builds as above before entering Julia.
Note that the cluster can be unreliable when running with MPI, so these commands may raise an error. If that occurs, exit and try logging in again on a different node.
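A hedged sketch of what launching the tests through `MPI.mpiexec` from the REPL might look like (the rank count and file path are illustrative):

```julia
using MPI

# Run the MPI tests on 3 ranks from within a Julia session
MPI.mpiexec() do cmd
    run(`$cmd -n 3 julia --project mpi_tests/run_mpi_tests.jl`)
end
```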
- Our MPI package, `ClimaComms.jl`, contains convenience functions that utilize the `MPI.jl` backend, so `ClimaCommsMPI.MPICommsContext` wraps `MPI.COMM_WORLD`, which is a communicator object that describes the partitioning into multiple MPI processes.
- `ClimaComms.SingletonCommsContext` is a dummy object of type `AbstractCommsContext` used for non-distributed runs. This enables us to write more generic functions that dispatch on the type of the communicator, with the high-level code remaining unaffected.
- Make sure that the communications context being used for MPI is both instantiated (`comms_ctx = ClimaCommsMPI.MPICommsContext()`) and initialized (`pid, nprocs = ClimaComms.init(comms_ctx)`, with `pid` denoting the current process ID and `nprocs` the total number of processes) before its first use.
- If using a `ClimaCore` topology/space, make sure to use one which allows distributed computing, and initialize it using the corresponding MPI communications context (`topology = Topologies.DistributedTopology2D(comms_ctx, mesh, Topologies.spacefillingcurve(mesh))`); see the context/topology sketch after this list.
- A helpful setup is a test file which contains the tests that use MPI (e.g. `ClimaCoupler.jl/test/mpi_tests/regridder_mpi_tests.jl`) and a runner file (e.g. `ClimaCoupler.jl/test/mpi_tests/run_mpi_tests.jl`) which runs the test files with the MPI setup; see the runner sketch after this list.
- MPI can be unreliable on Windows, so we generally do not run it on that OS. This, and the fact that GH Actions only allows a limited number of processes, are the reasons why we run our MPI unit tests only on Buildkite.
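Putting the context and topology pieces together, here is a minimal sketch of instantiating and initializing the MPI communications context and building a distributed topology; the sphere domain and mesh parameters are illustrative, and constructor names may differ slightly between `ClimaCore` versions:

```julia
import ClimaComms
import ClimaCommsMPI
import ClimaCore: Domains, Meshes, Topologies

# Instantiate and initialize the MPI communications context before first use
comms_ctx = ClimaCommsMPI.MPICommsContext()
pid, nprocs = ClimaComms.init(comms_ctx)

# Build a distributed topology from a cubed-sphere mesh (illustrative parameters)
domain = Domains.SphereDomain(6.371e6)           # sphere radius in meters
mesh = Meshes.EquiangularCubedSphere(domain, 4)  # 4 elements per panel edge
topology = Topologies.DistributedTopology2D(
    comms_ctx,
    mesh,
    Topologies.spacefillingcurve(mesh),
)
```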
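And a hedged sketch of what a runner file like `run_mpi_tests.jl` might contain; the file list, rank count, and structure are illustrative rather than the actual ClimaCoupler implementation:

```julia
using MPI
using Test

# Launch each MPI test file on multiple ranks via the MPI launcher
@testset "MPI tests" begin
    for file in ["regridder_mpi_tests.jl"]   # illustrative list of test files
        MPI.mpiexec() do cmd
            p = run(ignorestatus(`$cmd -n 2 julia --project $(joinpath(@__DIR__, file))`))
            @test success(p)
        end
    end
end
```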
We’ve tried two different approaches to implement remappings:
- Non-conservative: a `clean_mask` function applied after non-monotone remapping
  - Cuts off values outside of the desired range (see the sketch after this list)
  - Advantages: fast, does not reduce spatial resolution
  - Disadvantages: does not conserve the global total of the quantity being remapped
- Conservative monotone remapping function
  - A monotone remapping has the quality that no new minima or maxima are introduced (i.e., all weights used in the mapping are in [0, 1])
  - Advantage: remapping is both conservative and monotone
  - Disadvantages: decreases spatial resolution (i.e., it is a lower-order method), slower approach
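As a rough illustration of the first approach, here is a minimal sketch of a `clean_mask`-style post-processing step; the actual ClimaCoupler function may differ in name and detail:

```julia
# Clamp remapped values back into the physically meaningful range [0, 1].
# Non-monotone remapping can overshoot this range; clamping is fast and keeps
# the spatial resolution, but it does not conserve the global total.
clean_mask(values; lo = 0.0, hi = 1.0) = clamp.(values, lo, hi)

# Example: values produced by a non-monotone remap
remapped = [-0.02, 0.4, 1.05, 0.98]
cleaned = clean_mask(remapped)   # -> [0.0, 0.4, 1.0, 0.98]
```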
`TempestRemap` has multiple functions that generate remappings. The currently-used function in `ClimaCoupler` is `GenerateOfflineMap`, and the alternative is `GenerateTransposeMap`. Note that `GenerateTransposeMap` reverses a previous remapping and can only map FV griddings to CGLL (not vice versa). Thus, to apply this function, a minimum of 2 and potentially 3 remappings must be applied, which is quite costly.
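For reference, a hedged sketch of a `GenerateOfflineMap` invocation on the command line; the flag names are taken from TempestRemap's documentation and may differ between versions, and the mesh and output file names are placeholders:

```sh
# Build CGLL -> FV offline map weights with monotone remapping (placeholder file names)
GenerateOfflineMap \
  --in_mesh source_mesh.g \
  --out_mesh target_mesh.g \
  --ov_mesh overlap_mesh.g \
  --in_type cgll --in_np 4 \
  --out_type fv \
  --mono \
  --out_map weights.nc
```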
To compare `GenerateOfflineMap` and `GenerateTransposeMap`, as well as monotone vs. non-monotone remappings, we performed a number of remappings using the seamask.nc dataset.
We find that at a spatial resolution of `h_elem = 6`, monotone and non-monotone remappings produce qualitatively similar results when applying the map created by either `GenerateOfflineMap` or `GenerateTransposeMap`. However, at a spatial resolution of `h_elem = 4`, monotonicity has a substantial effect. This suggests that enforcing monotonicity of the mapping may be important at lower resolutions, but is not as essential at slightly higher resolutions (for which monotone remapping is always recommended).
- Monotone remappings should be used when applied to quantities where global conservation is important (i.e., for fluxes), but are not strictly necessary for values where this is not required (e.g., for land cover) or when the spatial resolution is sufficiently high.
- `GenerateOfflineMap` is the method currently used to create remappings in `ClimaCoupler`, and it performs about as well as `GenerateTransposeMap`. In addition, it is easier to use as it doesn't require a previous mapping, so we will continue to use it moving forward.
Note that while it may appear that the monotone plots contain negative values, this is merely a consequence of the plotting method used. The values for these plots are contained in [0, 1].
Also note that these boundary conditions cause numerical instability when used in an atmospheric simulation with `h_elem = 4` and `n_poly = 3` (the polynomial degree, i.e., the number of GLL nodes minus 1), whether monotone or not. The same resolution is stable for the aquaplanet setup.
- dss needs to be applied to all variables that are calculated as part of the `step!` call and passed to other models. For example, spurious tropical ice growth appeared when `q_sfc` was not dss'd (compared with a run where it was).
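A minimal sketch of applying DSS to a coupler field before it is passed to another component, assuming the field is a `ClimaCore` `Field` (the surface humidity name `q_sfc` follows the bullet above):

```julia
import ClimaCore: Spaces

# Apply (weighted) direct stiffness summation so values on shared element
# boundaries agree before the field is handed to another component model.
Spaces.weighted_dss!(q_sfc)
```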
- using either just `slurm_ntasks`, or `slurm_ntasks_per_node` + `slurm_nodes` (see the docs); a sketch of the agent configuration follows below
- global `slurm_mem` clashes with `slurm_mem_per_cpu` (issue highlighted here)
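For reference, a hedged sketch of how these agent settings might appear in a Buildkite pipeline step; the step label and command are illustrative, and the exact keys accepted depend on the agent configuration:

```yaml
steps:
  - label: "MPI tests"
    command: "julia --project -e 'include(\"mpi_tests/run_mpi_tests.jl\")'"
    agents:
      # either request a total number of tasks ...
      slurm_ntasks: 3
      # ... or use slurm_ntasks_per_node together with slurm_nodes instead
      # slurm_ntasks_per_node: 3
      # slurm_nodes: 1
```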