The pair-ended shotgun sequencing data are first pre-processed by fastqc/fastp/bfc for QC, trimming/adaptor removal and read error correction. The processed reads are assembled by metaSPAdes or megahit to obtain contigs. The assembled contigs are further processed by Autometa or metabat/gtdbtk for taxonomic profiling. The metabat outputs can also being sent to dbcan/kofamscan for functional annotation.
Currently the workflow is still under development. The results are not guaranteed correct. Please verify all the results if you intend to publish them.
Install the workflow management system snakemake
Snakemake is a Python based language and execution environment for GNU Make-like workflows.
The workflow is defined by specifying rules in a snakefile
.
Rules further specify how to create sets of output files from sets of input files as well as the parameters and the command.
Snakemake automatically determines the dependencies between the rules by matching file names. Please refer to snakemake tutorial page to see how to define the rules.
We have to make sure whether snakemake is installed in your system before we start.
UCR hpcc have snakemake installed in environment module. If you're on UCR hpcc you can simply load snakemake module by the following command.
module load snakemake
If you're NOT on UCR hpcc and you don't have snakemake in the enviornment module, please follow the below steps to create a snakemake conda environment.
-
First, create an environment named snakemake.
conda create -n snakemake
After create the environment, activate it by:
conda activate snakemake
-
Install the package mamba, which is a faster version of conda.
conda install -c conda-forge mamba
After mamba being installed, you can later switch from
conda install [package]
tomamba install [package]
to speed up the package installation. -
Next, install the package snakemake through mamba.
mamba install snakemake
-
Then you can execute the
snakemake --help
to show the snakemake helping page.
Clone the repo to your computer.
Clone by the following command if you're using public key for github connection.
git clone --recurse-submodules git@github.com:chtsai0105/smk-metagenome.git
Or clone by https link.
git clone --recurse-submodules https://github.com/chtsai0105/smk-metagenome.git
Otherwise, clone the submodules as a second step by:
cd smk-metagenome
git submodule update --init
Next, go to the directory by cd smk-metagenome
. The entire folder structure and its details are listed below:
.
├── config/
│ ├── config.yaml # Define the path for data and metadata.
│ ├── sample.csv # The metadata for samples. Define the names of the samples and the fastq files.
├── workflow/
│ ├── rules/ # The folder that contains the rules/submodules of the workflow.
│ ├── envs/ # The folder that contains the yaml config for conda environments.
│ ├── wrappers/ # The folder that stores the self-defined wrappers.
│ ├── snakefile # The workflow entrypoint. Define the targets for the workflow.
├── data/ # The folder for the data and the workflow outputs.
│ ├── fastq/ # The folder where the initial fastq files should be placed.
├── logs/ # The folder that stores all the logs.
├── slurm/ # The folder that contains the slurm profile for stajichlab partition@UCR hpcc.
└── run_snakemake.bash # The bash script for running the workflow.
You can edit the config.yaml
to setup the behavior of the workflow.
The key metadata refers to the sample.csv
, which have all the details of the sample.
The key partition is used to define the partition parameters. By defining it here we can avoid hardcoded the partion name directly in the script. Right now we have a highmem partition for the assembly process.
The other keys represent major steps in the workflow. You can switch on/off a particular step by changing the subkey run
to True/False. You can also change the
output path in some of the steps.
You should properly defined your metadata, which is recorded in the sample.csv
, before running the workflow.
There are 4 columns in this csv table - sample, R1, R2 and interleaved.
The column sample defines the sample name. You can change it to names which are more distinguishable instead of accession numbers. These names will also being used as the wildcard to define the workflow outputs.
The column R1 and R2 defines the fastq file names you put in the folder data/fastq
.
If the sequencing data is in interleaved format. Please leave the R1 and R2 blanked and fill the interleaved file names to the interleaved column.
Please make sure they are identical to the fastq files you have otherwise the workflow may have trouble to input the files. Please also confirm that the names in each column are unique.
After setup the paramters, the next step is to run the workflow.
Snakemake provides a dry-run feature to examine the workflow before truly running it. You should always test the workflow beforehand to make sure it execute as expected by the following command:
snakemake -np
After confirming all the steps. You can run the workflow by executing the script run_snakemake.bash
or the following command:
snakemake -p --profile slurm
Note that there are additional parameter defined in the slurm profile, please refer to smk_profile-slurm for more details.
If not running on the cluster, please remove the --profile
option and add additional parameters as you need:
snakemake -p --use-conda --jobs 4 --max-threads 4