# Job Schedulers and Platforms
MinSAR runs on a variety of HPC platforms. Job scheduling is controlled by `job_submission_scheme` and `QUEUENAME`.
Supported schedulers are SLURM, LSF, and PBS.

For each queue, job submission is controlled by `JOB_CPUS_PER_NODE`, `THREADS_PER_CORE`, `MEM_PER_NODE`, and `MAX_JOBS_PER_QUEUE`. If you work on one of the supported systems, these values are set by default; however, you can change them in your template file. The following is the part you need to edit or add in your template (check `process_rsmas.py -H`):
```
################################# HPC JOB QUEUE Parameters ####################################
## These systems are supported and default values are set;
## you can change the queue name and all other values will adjust.
# Frontera       --> available queue names: [nvdimm, development, normal, rtx, rtx-dev, flex], default = normal
# stampede2      --> available queue names: [normal, skx-normal, development, skx-dev], default = skx-normal
# comet          --> available queue names: [compute, shared, gpu], default = gpu
# pegasus        --> available queue names: [general, parallel, bigmem], default = general
# eos_sanghoon   --> default = batch
# beijing_server --> default = batch
# deqing_server  --> default = batch
QUEUENAME             = auto   # defaults based on the above systems
WALLTIME_FACTOR       = auto   # default = 1; this factor multiplies the default wall_time values
# If you are not using one of the above systems, you need to set the following options as well:
JOB_CPUS_PER_NODE     = auto   # defaults based on the above systems
THREADS_PER_CORE      = auto   # defaults based on the above systems
MEM_PER_NODE          = auto   # defaults based on the above systems
MAX_JOBS_PER_QUEUE    = auto   # defaults based on the above systems
# The following job submission schemes are supported by MinSAR:
# singleTask                    ---> submit each task of a batch file as a separate job
# multiTask_singleNode          ---> distribute tasks of a batch file into jobs with one node each
# multiTask_multiNode           ---> submit tasks of a batch file in one job with the required number of nodes
# launcher_multiTask_singleNode ---> distribute tasks of a batch file into single-node jobs, submitted with launcher
# launcher_multiTask_multiNode  ---> submit tasks of a batch file in one job with the required number of nodes, using launcher
job_submission_scheme = auto   # default = launcher_multiTask_singleNode
```
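For example, to run on Frontera's `development` queue with doubled wall times, you could override just those two values in your template (illustrative values; parameters left at `auto` keep their defaults):

```
QUEUENAME       = development   # one of the Frontera queues listed above
WALLTIME_FACTOR = 2             # doubles the default wall_time values
```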
Default parameters for each queue can be found in `minsar/defaults/queues.cfg`.
You can reserve a node with

```
idev -p <queue name> -N <number of nodes> -m <requested time>
```

and run `process_rsmas.py` from there.
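For example, to reserve one node for two hours (120 minutes) on Stampede2's `skx-dev` queue (illustrative values; pick a queue from the list above):

```bash
idev -p skx-dev -N 1 -m 120
```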
On SLURM systems, check the limits of the queues with `qlimits`.
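If `qlimits` is not available, standard SLURM commands show similar information (a generic SLURM query, not MinSAR-specific):

```bash
# Show partition name, time limit, node count, CPUs per node, and memory per node
sinfo -o "%P %l %D %c %m"

# Show all limits configured for each partition (queue)
scontrol show partition
```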
On Stampede2, `process_rsmas.py` runs on the head node and submits the jobs using `sbatch`. You can log out from your shell using the `tmux` utility (`Ctrl-b d` to detach and `tmux a` or `tmux a -t 0` to re-attach), or use `at` to submit a job at a desired time (for the default job submission scheme, check `process_rsmas.py -H`). In either case, run:
```
minsarApp.bash $SAMPLESDIR/unittestGalapagosSenDT128.template --start dem
```

(without the `--submit` option)
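A possible workflow, using the sample template above (session name and start time are illustrative):

```bash
# Keep the run alive after logging out with tmux
tmux new -s minsar                     # start a named session
minsarApp.bash $SAMPLESDIR/unittestGalapagosSenDT128.template --start dem
# press Ctrl-b d to detach; later re-attach with:
tmux a -t minsar

# Or schedule the run for a given time with at
echo "minsarApp.bash $SAMPLESDIR/unittestGalapagosSenDT128.template --start dem" | at 02:00
```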
Set permissions to make `$SCRATCHDIR` writable and `$WORK` readable for group members:

```
chmod u=rwx,g=rwx,o=rwx -R $SCRATCHDIR
chmod u=rwx,g=rx,o=rx -R $STOCKYARD
```
Check your group memberships using `groups` (for example, `G-820134 G-820609`). Add `umask 027` to your `.bashrc`. For more information, visit the Stampede User Guide.
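As an illustration (not MinSAR-specific), `umask 027` removes write permission for the group and all permissions for others on newly created files and directories:

```bash
umask 027
mkdir example_dir && touch example_file
ls -ld example_dir example_file
# example_dir:  drwxr-x---   (777 masked by 027 -> 750)
# example_file: -rw-r-----   (666 masked by 027 -> 640)
```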
Don't run on the login node! If you do, your login will be disabled (your token will stop working). On the head node you need to use `--submit`. Otherwise, start an interactive node using the alias `idev` or `idevdev` and then run `smallbaselineApp.py`.
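For example, from an interactive node (template path is illustrative, matching the sample used elsewhere on this page):

```bash
idev                                                     # reserve an interactive compute node
smallbaselineApp.py $SAMPLESDIR/unittestGalapagosSenDT128.template
```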
On Pegasus we used to submit `process_rsmas.py` to a compute node (`--submit` submits using `bsub`). This occupies one job just for job submission, which is OK for shared single-core queues but not for systems without a single-core queue, such as Stampede2. Furthermore, many systems don't allow submitting jobs from within a job. For now, run:

```
process_rsmas.py $SAMPLESDIR/unittestGalapagosSenDT128.template --start dem --submit
```

(this uses the `singleTask` scheme; the plan is to eliminate `--submit` as well).
High-performance systems such as Frontera and Triton allow only a limited number of queued jobs. On these systems we need to first start a job using e.g. 10 nodes, start `process_rsmas.py`, and then run the tasks on the available cores without calling the scheduler (`QUEUENAME = none`, as on a Mac).
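A sketch of that workflow (queue name, node count, and time are illustrative):

```bash
# Reserve the nodes interactively (or inside a batch job)
idev -p normal -N 10 -m 240

# In the template, disable scheduler calls so tasks run on the reserved nodes:
#   QUEUENAME = none

process_rsmas.py $SAMPLESDIR/unittestGalapagosSenDT128.template --start dem
```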