This is the main repository for this article.
It contains the pipelines used to process the data and generate results, the notebooks used to process those results and generate figures, as well as the publicly available data used in this study.
The processed results used to generate the figures are also available in this repository.
The data used in the manuscript is included in this repository; more information here.
To be able to run this pipeline, several setup steps are needed.
You will need Python >= 3.6, snakemake >= 5.26.1 and the packages specified in utils_hiv/requirements.txt. To install the necessary packages in a conda virtual environment, run:
$ cd /path/to/this/directory
$ conda create -n pipelineDRMs python=3.7 "snakemake>=5.26.1" -y
$ conda activate pipelineDRMs
$ pip install -e utils_hiv
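As an optional sanity check (not part of the pipeline itself), you can verify from Python that the activated environment meets these version requirements:

```python
# Optional sanity check: run inside the activated pipelineDRMs environment.
import sys
import snakemake

print("Python", sys.version.split()[0])       # should be >= 3.6
print("snakemake", snakemake.__version__)     # should be >= 5.26.1
```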
For this pipeline to run you will need several alignments of HIV-1 pol RT sequences from at least 2 datasets: a training set and at least one testing set.
Each training/testing set is composed of a FASTA alignment of treatment-naive sequences and another of treatment-experienced sequences.
To get the positions of each residue with respect to the reference HXB2 sequence, and a format suitable for encoding, upload each of your alignments (trainNaive.fa, trainTreated.fa, testNaive.fa and testTreated.fa) to Stanford's HIVdb program. For each uploaded alignment you will get the PrettyRTAA.tsv and ResistanceSummary.tsv files, which are needed for dataset encoding.
In our study the training data corresponds to the UK dataset, and the testing data corresponds to the African dataset.
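Before uploading, it can be useful to check that each FASTA file really is an alignment (all sequences the same length). Below is a minimal sketch assuming Biopython is installed; it is not required by the pipeline itself:

```python
# Quick alignment check; the file names are the example names used above.
from Bio import SeqIO

for fasta in ["trainNaive.fa", "trainTreated.fa", "testNaive.fa", "testTreated.fa"]:
    lengths = {len(record.seq) for record in SeqIO.parse(fasta, "fasta")}
    if len(lengths) != 1:
        raise ValueError(f"{fasta}: sequences have different lengths {sorted(lengths)}")
    print(f"{fasta}: aligned, length {lengths.pop()}")
```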
This pipeline takes the files generated above by Stanford's HIVdb and encodes them into vector form.
The pipeline takes as input the directory where those files are stored; each dataset you want to encode must be in a separate subdirectory. The pipeline also needs a directory where the metadata files are located.
In our example we want to encode two datasets, a UK dataset and an African dataset, so our directories and files should look like this:
.
├── data_dir
│   ├── Africa
│   │   ├── PrettyRT_naive.tsv
│   │   ├── PrettyRT_treated.tsv
│   │   ├── ResistanceSummary_naive.tsv
│   │   └── ResistanceSummary_treated.tsv
│   └── UK
│       ├── PrettyRT_naive.tsv
│       ├── PrettyRT_treated.tsv
│       ├── ResistanceSummary_naive.tsv
│       └── ResistanceSummary_treated.tsv
└── metadata_dir
    ├── Africa-metadata.tsv
    └── UK-metadata.tsv
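Before launching the pipeline, you can optionally check that every dataset subdirectory contains the four expected HIVdb output files and has a matching metadata file. This is only an illustrative helper based on the layout above, not part of the pipeline:

```python
# Illustrative layout check for the directory structure shown above.
from pathlib import Path

data_dir, metadata_dir = Path("data_dir"), Path("metadata_dir")
expected = [
    "PrettyRT_naive.tsv",
    "PrettyRT_treated.tsv",
    "ResistanceSummary_naive.tsv",
    "ResistanceSummary_treated.tsv",
]

for dataset in sorted(d for d in data_dir.iterdir() if d.is_dir()):
    missing = [name for name in expected if not (dataset / name).exists()]
    if missing:
        print(f"{dataset.name}: missing {missing}")
    if not (metadata_dir / f"{dataset.name}-metadata.tsv").exists():
        print(f"{dataset.name}: missing metadata file")
```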
The pipeline looks as follows, with the process_data
rule encoding sequences as binary vectors of mutation presence/absence, and the homogenize_data
rule making sure all encoded datasets have the same set of features so that classifiers trained on one dataset can predict labels for another.
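To illustrate what these two rules do conceptually (a simplified sketch, not the actual implementation in utils_hiv), encoding turns each sequence into a presence/absence vector of its mutations, and homogenization reindexes every encoded dataset onto a common feature set:

```python
# Simplified sketch of encoding and homogenization; the real rules operate on
# the HIVdb TSV outputs, not on toy mutation lists like these.
import pandas as pd

uk = {"seq1": ["M184V", "K103N"], "seq2": ["K103N"]}   # toy dataset 1
africa = {"seqA": ["M184V", "D67N"]}                   # toy dataset 2

def encode(dataset):
    """Binary presence/absence encoding of mutations for each sequence."""
    return pd.DataFrame(
        {seq: {mut: 1 for mut in muts} for seq, muts in dataset.items()}
    ).T.fillna(0).astype(int)

encoded_uk, encoded_africa = encode(uk), encode(africa)

# Homogenize: give both datasets the same feature columns, filling absent
# mutations with 0, so a model trained on one can score the other.
features = sorted(set(encoded_uk.columns) | set(encoded_africa.columns))
encoded_uk = encoded_uk.reindex(columns=features, fill_value=0)
encoded_africa = encoded_africa.reindex(columns=features, fill_value=0)
print(encoded_uk, encoded_africa, sep="\n\n")
```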
This pipeline trains the classifiers on a training set and gets predictions on a testing set. The inputs are specified in the config.yaml configuration file, and the input data is the data generated by the preprocessing pipeline.
The pipeline takes as input an encoded training and testing set (i.e. the UK dataset) and any number of external testing sets (e.g. the African dataset).
All configuration options are listed and described in the configuration file config.yaml, which must be given to the pipeline.
The following figure shows a run of our pipeline, for which we specified the following options:
- we want 3 models trained: Random Forest (RF), Naive Bayes (Bayes) and Logistic regression (Logistic)
- we want 3 training sessions:
  - training on B subtype of the training set and testing on C subtype of the training set
  - training on C subtype of the training set and testing on B subtype of the training set
  - training on All subtypes of the training set and testing on All subtypes of the external testing set
In an actual run of this pipeline we might also want to increase the number of repeated training sessions for models, such as Random Forest, that have a random component.
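The exact option names are documented in config.yaml itself. Purely as an illustration, a run like the one above could be described by a configuration of the following (hypothetical) shape, which is read as YAML:

```python
# Hypothetical sketch of such a configuration; the real keys are those
# documented in config.yaml, not the ones shown here.
import yaml

hypothetical_config = """
models: [RF, Bayes, Logistic]
training_sessions:
  - {train_subtype: B,   test_subtype: C,   test_data: internal}
  - {train_subtype: C,   test_subtype: B,   test_data: internal}
  - {train_subtype: All, test_subtype: All, test_data: external}
repetitions: 1  # increase for stochastic models such as Random Forest
"""
print(yaml.safe_load(hypothetical_config))
```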
To execute the pipeline, run the following commands:
$ conda activate pipelineDRMs
$ snakemake \
--snakefile=Snakefile_main.smk \
--configfile=path/to/config.yml \
--keep-going \
--jobs [nb. of cores/threads to use]
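Alternatively, assuming the snakemake 5.x Python API, the same workflow can be dry-run from Python before committing to a full run (shown here only as an optional check):

```python
# Optional dry run through the snakemake 5.x Python API (assumed available in
# the pipelineDRMs environment); paths match the CLI call above.
from snakemake import snakemake

ok = snakemake(
    snakefile="Snakefile_main.smk",
    configfiles=["path/to/config.yml"],
    dryrun=True,       # only list the jobs that would be executed
    keepgoing=True,
    cores=4,           # number of cores/threads to use in a real run
)
print("dry run OK" if ok else "dry run reported problems")
```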
To execute this pipeline in a SLURM cluster environment (fill out partition/account name and qos accordingly):
$ module load [modules] # (e.g. conda, python, ...)
$ conda activate pipelineDRMs
$ snakemake \
--snakefile=Snakefile_main.smk \
--configfile=path/to/config.yml \
--keep-going \
--cluster "sbatch -c {threads} -o {params.logs}/{params.name}.log -e {params.logs}/{params.name}.log --mem {params.mem} -p [partition name] --qos=[qos name] -A [account name] -J {params.name}" \
--jobs [nb. of cores/threads to use]
For more information on pipeline execution in HPC cluster environments, see the snakemake documentation.
The results from the main pipeline can then be processed by the gather_results.py
script. This script takes as input the list of result directories that were created by the pipeline and outputs a concatenated tab-delimited file with all predictions, as well as a concatenated tab-delimited file containing the importances/weights assigned by all trained models to the dataset features. This script can also be used to concatenate the results of several runs of the main pipeline.
These files can then be used for interpretation and figure generation (several examples in the notebooks
directory).
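For instance, the concatenated tab-delimited outputs can be loaded with pandas for further analysis or plotting (the file names below are placeholders; use the names produced by gather_results.py):

```python
# Load the concatenated outputs for downstream analysis; the file names here
# are placeholders for the files produced by gather_results.py.
import pandas as pd

predictions = pd.read_csv("all_predictions.tsv", sep="\t")
importances = pd.read_csv("all_feature_importances.tsv", sep="\t")
print(predictions.head())
print(importances.head())
```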