A set of scripts to help prepare, submit, run and monitor cluster jobs.
All scripts accept options: pass -h or --help for more information,
or run a script as-is to use the defaults.
Scripts to:
- Create a Docker image (based on the NVIDIA PyTorch container): create_docker_im.sh
- Submit a job: rubmit
- The script that is executed when the job runs: runai_startup.sh
Example:
- Create the Docker image: create_docker_im.sh --docker_push
- Submit with rubmit.
- Modify runai_startup.sh as necessary; it is run once the runai job is created.
Typical submit command:
rubmit --job-name rb-train -- "cd <somewhere>\npython training.py --output_model model.pt"
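The \n inside the quoted command is not a real line break: the shell passes it through as the two characters backslash and n. The sketch below shows what the string looks like once that sequence is expanded. It assumes, without confirmation from the scripts themselves, that rubmit or runai_startup.sh performs a similar expansion before running the command; printf '%b' is used here only to imitate that step.

```shell
# The quoted argument from the submit command above.
# <somewhere> is the README's own placeholder and is left as-is.
cmd="cd <somewhere>\npython training.py --output_model model.pt"

# Inside double quotes the shell keeps \n as two literal characters.
# ASSUMPTION: the job later expands it into a newline, as printf '%b' does here.
expanded=$(printf '%b' "$cmd")

# Print the result: one command per line.
printf '%s\n' "$expanded"
```

If the expansion works as assumed, the job runs the cd first and then the python command.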
To start an idle job that you can SSH into for remote development, simply omit the command:
rubmit -j test
You can then SSH into this job, or use port forwarding to reach e.g. TensorBoard.
To use Jupyter or VS Code, check the relevant sections of rubmit --help.
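As a rough illustration of the port-forwarding step, the sketch below builds a standard OpenSSH local-forward command for TensorBoard's default port (6006). The user and host names are hypothetical placeholders, and how a particular job exposes SSH is not specified here, so treat this as a generic pattern rather than the exact command for this cluster.

```shell
# HYPOTHETICAL placeholders: replace with the user and address of your own job.
JOB_USER="user"
JOB_HOST="job-host.example"
PORT=6006   # TensorBoard's default port

# -N: do not run a remote command; -L: forward local PORT to PORT on the job.
forward_cmd="ssh -N -L ${PORT}:localhost:${PORT} ${JOB_USER}@${JOB_HOST}"

# Dry run: print the command instead of executing it.
echo "$forward_cmd"
```

With the tunnel open and TensorBoard running inside the job, it would be reachable at http://localhost:6006 on the local machine.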
Scripts to:
- Submit a job: jubmit
- View jobs (wraps sacct): jlist
- View cluster-wide resource usage: jtop
Use jubmit to submit jobs on JADE.
Typical submit command:
jubmit -p devel -- "/jmain02/home/J2AD019/exk01/rjb87-exk01/Documents/Code/miniconda/envs/py3.11/bin/python VertSeg/instance_segmentation.py -i False"
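Because the interpreter path is long, it can help to keep it in a variable and assemble the job command from parts. The sketch below does this as a dry run, using only the path, script, and flags from the command above; the variable names are illustrative and not part of jubmit.

```shell
# Values copied from the submit command above; variable names are illustrative.
PYTHON="/jmain02/home/J2AD019/exk01/rjb87-exk01/Documents/Code/miniconda/envs/py3.11/bin/python"
SCRIPT="VertSeg/instance_segmentation.py"

# The command the job should run.
job_cmd="$PYTHON $SCRIPT -i False"

# Dry run: print the full submit line for inspection.
printf 'jubmit -p devel -- "%s"\n' "$job_cmd"
```

To submit for real, run jubmit -p devel -- "$job_cmd" in place of the printf line.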