This is the code repository used for the "Scaling Things Up" section of the EBI course Genome Bioinformatics, known in previous years as "NGS Bioinformatics".
This section follows the previous 3 days of the course, which covered command line tools and the basic bioinformatics commands used to index files and align fastqs to a reference genome. Here we reuse the commands learnt during those days and run them with parallelisation and job scheduling.
The following README is a copy of the 2021 Google Docs walkthrough of the interactive part of the session.
- Run git clone on this repository
- Go into the folder you just cloned, and then inside the “Parallelisation” folder
- Open the align_all_extra_fqs.sh script. What do you think the script will do?
- Do you think the script will take a long time to run? What command could we use to time how long a script takes?
- Modify the script so that instead of running each alignment, it echoes the align command to a file we will call align_commands.sh
- Run the script using the parallel command. You can also use the time command to measure how long it takes to run
- How long did it take when using parallel to run the command?
- Remove the `echo` we added to align_all_extra_fqs.sh so that it will run everything in a for loop
- Do you remember how to submit a job with Slurm? (hint: it's the `sbatch` command followed by what you want to run)
- Run `squeue` to see your job running. You should see something like this:
- We will now kill our job. We do this using the `scancel` command followed by the JOBID. For me, this is `scancel 8`. Find your JOBID with `squeue` and cancel the job
- Remove the bam files we generated here
- Edit the align_all_extra_fqs.sh file to submit each `bwa mem` command to Slurm
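
The steps above can be sketched roughly as follows. This is not the course's align_all_extra_fqs.sh, only an illustration of the pattern: build each alignment command as text, then either print it for a dry run or hand it to `sbatch`. The reference name `ref.fa`, the `fastqs/` folder, the `.fq` suffix, the `samtools sort` step and the function names are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only; the course's align_all_extra_fqs.sh may look different.
# Hypothetical names, not taken from the repository: ref.fa, the fastqs/
# folder, the .fq suffix, and the function names below.

REF="${REF:-ref.fa}"
FQ_DIR="${FQ_DIR:-fastqs}"

# Print the alignment command for one fastq file as a single line of text.
# Piping into samtools sort to get a bam file is an assumption.
align_cmd() {
    local fq="$1"
    local sample
    sample="$(basename "$fq" .fq)"
    echo "bwa mem $REF $fq | samtools sort -o $sample.bam -"
}

# Dry run: print one command per fastq instead of running anything.
print_commands() {
    local fq
    for fq in "$FQ_DIR"/*.fq; do
        [ -e "$fq" ] || continue    # the glob matched nothing
        align_cmd "$fq"
    done
}

# Submit one Slurm job per command; --wrap runs the string in a shell.
submit_commands() {
    local cmd
    print_commands | while IFS= read -r cmd; do
        # /dev/null keeps sbatch from consuming the loop's input.
        sbatch --wrap="$cmd" < /dev/null
    done
}
```

After sourcing the file, `print_commands > align_commands.sh` would write the dry-run file, which GNU parallel can execute line by line with `time parallel < align_commands.sh`. Calling `submit_commands` instead would submit each line as its own job, which can then be checked with `squeue`.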