- Introduction
- Example Usage
- UniProt Parser
- Protein-Protein Interaction Parser
- PPI Experiment Count
- Build Interactome (True Binary Interactions only)
- Build Interactome (True Binary Interactions with Expansion)
- Module Input File Generator
- Uniprot2ENSG Mapper
- Naïve Approach (1-hop Neighborhood Approach)
- DREAM Challenge: Cluster File Processing
- Naïve with Clustering Approach
- Output
- Interactome Clustering Methods
- Arguments
- Metadata files
- Dependencies
- License
Introduction
- This is the main repository containing all the scripts for my Master's thesis project.
- The project develops a multi-omics method that uses supervised machine learning to score genomic variants that are potentially causal for a particular phenotype in a patient.
- The classifier takes input features (an aggregation of different omics data as scalar values) and produces a score between 0 and 1.
- Higher score: the variant is more likely to be causal
- Lower score: the variant is less likely to be causal
- The algorithm finally ranks the variants based on these scores to identify novel disease genes in each patient.
- This repository contains individual scripts which work at the Gene level.
- I have integrated these into the Exome-Seq Secondary Analysis Pipeline, which works at the Clinical level.
- The result files produced by this pipeline now contain the following data:
- Clinical
- Variant
- Gene
- Interactome
- Expression
- The Machine learning scripts are used for prioritizing genomic variants.
- They can be run on the patient sample result file generated by the pipeline.
- You can find example usage of the scripts in the MachineLearning directory of the repository.
UniProt Parser
- Reads a UniProt file on STDIN and extracts the required data from each record
- Prints to STDOUT in .tsv format
-> Grab the latest UniProt data with:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
-> Parse UniProt data to produce output with:
gunzip -c uniprot_sprot.dat.gz | python3 1_Uniprot_parser.py > Uniprot_output.tsv
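As a rough illustration of what record-by-record parsing of the UniProt flat file can look like, here is a minimal sketch. It is not the code of 1_Uniprot_parser.py: the function names, the regular expressions and the toy record are assumptions made for this example, and gene names (GN lines) are left out for brevity. It relies only on the flat-file conventions that each line starts with a two-letter code and that a record ends with `//`.

```python
import re

# Line patterns from the UniProtKB flat-file format used in this sketch
RE_TAXID = re.compile(r"^OX\s+NCBI_TaxID=(\d+)")
RE_ENSEMBL = re.compile(r"^DR\s+Ensembl;\s+(ENST\d+)[^;]*;[^;]*;\s+(ENSG\d+)")
RE_GENEID = re.compile(r"^DR\s+GeneID;\s+(\d+);")


def new_record():
    return {"acs": [], "taxid": "", "ensts": [], "ensgs": [], "gene_ids": []}


def parse_uniprot(lines):
    """Yield one dict per record; a record ends with a line starting with '//'."""
    rec = new_record()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if rec["acs"]:
                yield rec
            rec = new_record()
        elif line.startswith("AC "):
            # AC lines can repeat; accessions are separated by ';'
            rec["acs"] += [ac.strip() for ac in line[2:].split(";") if ac.strip()]
        elif line.startswith("OX "):
            match = RE_TAXID.match(line)
            if match:
                rec["taxid"] = match.group(1)
        elif line.startswith("DR "):
            match = RE_ENSEMBL.match(line)
            if match:
                rec["ensts"].append(match.group(1))
                if match.group(2) not in rec["ensgs"]:
                    rec["ensgs"].append(match.group(2))
                continue
            match = RE_GENEID.match(line)
            if match:
                rec["gene_ids"].append(match.group(1))


def to_tsv_row(rec):
    """Primary AC, TaxID, ENST(s), ENSG(s), secondary AC(s), GeneID(s)."""
    return "\t".join([
        rec["acs"][0], rec["taxid"], ",".join(rec["ensts"]),
        ",".join(rec["ensgs"]), ",".join(rec["acs"][1:]), ",".join(rec["gene_ids"]),
    ])


# Toy record with made-up identifiers (not real UniProt data)
SAMPLE = [
    "ID   TOY_HUMAN               Reviewed;         100 AA.",
    "AC   P00000; Q00000;",
    "OX   NCBI_TaxID=9606;",
    "DR   Ensembl; ENST00000000001.1; ENSP00000000001.1; ENSG00000000001.1.",
    "DR   GeneID; 1234; -.",
    "//",
]
for record in parse_uniprot(SAMPLE):
    print(to_tsv_row(record))
```

To mimic the command above, `parse_uniprot` could be fed `sys.stdin` instead of the toy list.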
Protein-Protein Interaction Parser
- Parses a Protein-Protein Interaction (PPI) File (miTAB 2.5 or 2.7)
- Maps to UniProt using the output file produced by the UniProt Parser (1_Uniprot_parser.py) and prints to STDOUT in .tsv format
1] Grab the latest BioGRID data with:
wget https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/BIOGRID-ORGANISM-LATEST.mitab.zip
-> Unzip with:
unzip BIOGRID-ORGANISM-LATEST.mitab.zip
-> This will produce one miTAB File per Organism (Use BIOGRID-ORGANISM-Homo_sapiens*.mitab.txt for human data)
2] Grab the latest IntAct data with:
wget ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psimitab/intact.zip
-> Unzip with:
unzip intact.zip
-> This will produce 2 files (intact.txt & intact_negative.txt). Use intact.txt for further steps
3] Parse PPI data with:
python3 2_Interaction_parser.py --inInteraction BIOGRID-ORGANISM-Homo_sapiens*.mitab.txt --inUniprot Uniprot_output.tsv > Exp_Biogrid.tsv
python3 2_Interaction_parser.py --inInteraction intact.txt --inUniprot Uniprot_output.tsv > Exp_Intact.tsv
-> The above example is for the Protein-Protein Interaction data from BioGRID and IntAct. However, you can retrieve PPI data (in miTAB format) from any database and feed it to the script to produce an output file.
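To make the miTAB layout more concrete, the sketch below pulls the fields that matter for the later steps out of a single miTAB row, using the standard column order of miTAB 2.5 (the first 15 columns, which miTAB 2.7 shares). The function names and the returned dictionary are assumptions for illustration, not the interface of 2_Interaction_parser.py. In particular, the real script maps identifiers through the UniProt Parser output, whereas this sketch only picks up accessions that are already written as `uniprotkb:` or `uniprot/swiss-prot:` cross-references.

```python
import re

RE_UNIPROT = re.compile(r"(?:uniprotkb|uniprot/swiss-prot):([A-Z0-9]+)")
RE_MI = re.compile(r"MI:\d{4}")
RE_PMID = re.compile(r"pubmed:(\d+)")
RE_TAXID = re.compile(r"taxid:(-?\d+)")


def first(pattern, text, group=0):
    """Return the first match of pattern in text, or '' when there is none."""
    match = pattern.search(text)
    return match.group(group) if match else ""


def parse_mitab_row(line):
    """Extract proteins, method, PMID, taxids and interaction type from one row."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) < 15:
        return None  # header line or malformed row
    return {
        # columns 1-4: IDs and alternative IDs of interactors A and B
        "prot_a": first(RE_UNIPROT, fields[0] + "|" + fields[2], 1),
        "prot_b": first(RE_UNIPROT, fields[1] + "|" + fields[3], 1),
        "method": first(RE_MI, fields[6]),      # column 7: detection method
        "pmid": first(RE_PMID, fields[8], 1),   # column 9: publication IDs
        "taxid_a": first(RE_TAXID, fields[9], 1),
        "taxid_b": first(RE_TAXID, fields[10], 1),
        "type": first(RE_MI, fields[11]),       # column 12: interaction type
    }


# Toy row with made-up identifiers
toy_row = "\t".join([
    "uniprotkb:P11111", "uniprotkb:P22222", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Author (2000)", "pubmed:12345",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000", "-",
])
print(parse_mitab_row(toy_row))
```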
PPI Experiment Count
- Parses a Protein-Protein Interaction File (miTAB 2.5 or 2.7)
- Prints the count of Human-Human Protein Interaction experiments to STDOUT
-> Provide a miTAB 2.5 or 2.7 file on STDIN with:
python3 3_Count_HumanPPIExp.py < miTAB File
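The counting step can be pictured roughly as follows. This sketch assumes that every non-header miTAB row is one experiment and that a row is human-human when both taxonomy columns (columns 10 and 11) contain `taxid:9606`. The function name is hypothetical and the actual script may count differently.

```python
import re

# \b prevents 'taxid:96060' from being counted as human (9606)
RE_HUMAN = re.compile(r"taxid:9606\b")


def count_human_ppi(lines):
    """Count miTAB rows in which both interactors are annotated as human."""
    count = 0
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if (len(fields) >= 11
                and RE_HUMAN.search(fields[9])
                and RE_HUMAN.search(fields[10])):
            count += 1
    return count
```

Calling `count_human_ppi(sys.stdin)` would reproduce the STDIN usage shown above.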
Build Interactome (True Binary Interactions only)
High-Quality Interactome Criteria:
1] Filtering Interactions based on Interaction Detection Method:
- We filter out pull down (MI:0096), genetic interference (MI:0254) & unspecified method (MI:0686)
2] Filtering Interactions based on Interaction Type:
- We keep only direct interaction (MI:0407) & physical association (MI:0915)
3] Here, we try to eliminate most of the EXPANSION DATA and consider only TRUE BINARY INTERACTIONS
4] Each Interaction must be supported by ≥ 2 experiments, at least one of which uses a BINARY METHOD
5] Eliminating Hub/Sticky proteins (A protein is considered a hub if it has > 120 interactors. This number is based upon the degree distribution of the entire Interactome before eliminating hub/sticky proteins).
-> Build High-Quality Human Interactome with:
python3 4_BuildInteractome_BinaryPPIonly.py --inExpFile Exp_Biogrid.tsv Exp_Intact.tsv --inUniprot Uniprot_output.tsv --inCanonicalFile canonicalTranscripts_*.tsv.gz > Interactome_human.tsv
-> To obtain the canonical transcripts file, please refer to grexome-TIMC-Secondary
-> The Build Interactome scripts accept multiple processed Protein-Protein Interaction experiment files (--inExpFile)
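The high-quality criteria listed in this section can be sketched as a small filtering function. This is an illustration under stated assumptions, not the code of 4_BuildInteractome_BinaryPPIonly.py: the input tuples, the function name, the way experiments are de-duplicated (by PMID and method) and especially the `BINARY_METHODS` set are placeholders chosen for the example. The README does not list which detection methods count as binary.

```python
from collections import defaultdict

EXCLUDED_METHODS = {"MI:0096", "MI:0254", "MI:0686"}  # criterion 1
KEPT_TYPES = {"MI:0407", "MI:0915"}                   # criterion 2
# ASSUMPTION: placeholder set; MI:0018 (two hybrid) is only an example
BINARY_METHODS = {"MI:0018"}


def build_interactome(experiments, min_exp=2, hub_degree=120):
    """experiments: iterable of (prot_a, prot_b, method_mi, type_mi, pmid)."""
    by_pair = defaultdict(set)
    for prot_a, prot_b, method, int_type, pmid in experiments:
        if method in EXCLUDED_METHODS or int_type not in KEPT_TYPES:
            continue
        pair = tuple(sorted((prot_a, prot_b)))  # A-B and B-A are the same pair
        by_pair[pair].add((pmid, method))

    # criterion 4: >= min_exp experiments, at least one with a binary method
    kept = {
        pair for pair, exps in by_pair.items()
        if len(exps) >= min_exp and any(m in BINARY_METHODS for _, m in exps)
    }

    # criterion 5: drop hub/sticky proteins (more than hub_degree interactors)
    neighbours = defaultdict(set)
    for prot_a, prot_b in kept:
        neighbours[prot_a].add(prot_b)
        neighbours[prot_b].add(prot_a)
    hubs = {prot for prot, nb in neighbours.items() if len(nb) > hub_degree}
    return sorted(pair for pair in kept if not set(pair) & hubs)


toy = [
    ("A", "B", "MI:0018", "MI:0915", "1"),
    ("B", "A", "MI:0019", "MI:0915", "2"),  # same pair, second experiment
    ("A", "C", "MI:0018", "MI:0915", "3"),  # only one experiment
    ("C", "D", "MI:0096", "MI:0915", "4"),  # pull down is filtered out
    ("C", "D", "MI:0018", "MI:0915", "5"),
    ("D", "E", "MI:0018", "MI:0914", "6"),  # interaction type not kept
    ("D", "E", "MI:0018", "MI:0914", "7"),
]
print(build_interactome(toy))
```

Keeping the hub threshold as a parameter makes it easy to revisit the value of 120 if the degree distribution of the input data changes.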
Build Interactome (True Binary Interactions with Expansion)
High-Quality Interactome Criteria:
1] Filtering Interactions based on Interaction Detection Method:
- We filter out genetic interference (MI:0254) & unspecified method (MI:0686)
2] Here, we consider both TRUE BINARY INTERACTIONS and PPIs derived from EXPANSION
3] Each Interaction should be proven by ≥ 2 experiments
Note: The Interactome containing expansion data should not be used for identifying disease-enriched modules, because the clustering algorithms fail to cluster this network correctly, which leads to wrong results. This script is included only in case someone wants to use it for other purposes.
-> Build High-Quality Human Interactome with:
python3 5_BuildInteractome_BinaryPPIwithExpansion.py --inExpFile Exp_Biogrid.tsv Exp_Intact.tsv --inUniprot Uniprot_output.tsv --inCanonicalFile canonicalTranscripts_*.tsv.gz > Interactome_human_binarywithexpansion.tsv
Module Input File Generator
- Parses the output produced by the Build Interactome script (e.g. Interactome_human.tsv)
- Assigns a default edge weight = 1 for each interaction and prints to STDOUT in .tsv format
- This can be used as INPUT for most of the module identification/clustering methods
-> Generate Module Input File with:
python3 6_ModuleInputFile.py < Interactome_human.tsv
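Conceptually this step only appends a constant weight to every edge. The sketch below assumes, purely for illustration, that the first two tab-separated columns of the interactome file hold the two interacting proteins. The real column layout of Interactome_human.tsv is not described here, and the function name is hypothetical.

```python
def to_module_input(lines, weight=1):
    """Yield 'proteinA<TAB>proteinB<TAB>weight' for each interactome row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue  # skip empty or incomplete rows
        yield f"{fields[0]}\t{fields[1]}\t{weight}"


for edge in to_module_input(["P1\tP2\textra", "P2\tP3", ""]):
    print(edge)
```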
Uniprot2ENSG Mapper
- Parses the output file produced by the UniProt Parser and the canonical transcripts file (Ex: canonicalTranscripts_220221.tsv)
- Maps UniProt accession to ENSG and prints to STDOUT
-> Run UniProt2ENSG Mapper with:
python3 7_Uniprot2ENSG.py --inUniprot Uniprot_output.tsv --inCanonicalFile canonicalTranscripts_220221.tsv
Naïve Approach (1-hop Neighborhood Approach)
- Parses the Sample metadata file (.xlsx), UniProt File, Canonical transcripts file, Candidate Gene file(s), Interactome file, and GTEX File
- For a given gene:
- Checks if the gene is already a known candidate
- Checks the number of Interactors
- Checks the number of Interactors that are known candidates
- Applies Fisher's Exact test to compute P-values
- Adds the total count & a comma-separated list of candidate genes within the 2-hop neighborhood
- Additionally adds GTEX data
- Prints to STDOUT in .tsv format
- This script provides one of the scoring components for the Machine Learning step
-> Run 8_NaiveApproach.py script with:
python3 8_NaiveApproach.py --inSampleFile sample.xlsx --inUniprot Uniprot_output.tsv --inCandidateFile candidateGenes.xlsx --inCanonicalFile canonicalTranscripts_220221.tsv --inInteractome Interactome_human.tsv --inGTEXFile E-MTAB-5214-query-results.tpms.tsv
-> You can use the GTEX file provided in this repository. (Note: The GTEX file provided in the repository might not be the latest. If you want to retrieve the latest GTEX file, please visit https://www.ebi.ac.uk/gxa/home).
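For readers unfamiliar with how Fisher's exact test applies here, the following sketch computes a one-sided P-value for the question "does this gene have more known-candidate interactors than expected by chance?". It is written with the standard library only. The function name, the choice of a one-sided test and the definition of the background (all genes in the interactome) are assumptions, not a description of 8_NaiveApproach.py. With SciPy, `scipy.stats.fisher_exact(table, alternative="greater")` on the corresponding 2x2 table gives the same one-sided P-value.

```python
from math import comb


def fisher_exact_greater(cand_interactors, interactors, candidates, total_genes):
    """One-sided Fisher's exact test via the hypergeometric upper tail.

    cand_interactors: interactors of the gene that are known candidates
    interactors:      all interactors of the gene
    candidates:       known candidates in the background
    total_genes:      size of the background (assumed: genes in the interactome)
    """
    max_k = min(interactors, candidates)
    denominator = comb(total_genes, interactors)
    tail = sum(
        comb(candidates, k) * comb(total_genes - candidates, interactors - k)
        for k in range(cand_interactors, max_k + 1)
    )
    return tail / denominator


# Toy numbers: 10 genes, 3 known candidates, a gene with 4 interactors,
# 2 of which are known candidates
print(fisher_exact_greater(2, 4, 3, 10))
```

A small P-value suggests that the gene's 1-hop neighborhood is enriched in known candidates for the pathology.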
DREAM Challenge: Cluster File Processing
- This script is for processing the cluster file produced by the MONET TOOL (DREAM Challenge)
- Parses on STDIN a .tsv file produced by the MONET tool, processes it, and prints to STDOUT in .cls format
- The output can be used as the input Cluster File (--inClusterFile) for the Naïve with Clustering Approach
- Note: For producing the Interactome Clustering File and using this script, please refer to the "Interactome Clustering Methods" section
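The conversion can be sketched as below. Two things are assumed here and should be checked against real MONET output: that each input line holds a cluster identifier, a score, and then the member names, all tab-separated (the DREAM Challenge module format), and that clusters are simply renumbered from 1 in the output. The .cls layout follows the format described in the "Interactome Clustering Methods" section. The function name is hypothetical.

```python
def monet_to_cls(lines):
    """Convert DREAM-style module lines into ClustnSee-style .cls text."""
    out = ["#ClustnSee analysis export"]
    cluster_id = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        members = [name for name in fields[2:] if name]  # skip id and score
        if not members:
            continue
        cluster_id += 1
        out.append(f"ClusterID:{cluster_id}||")
        out.extend(members)
        out.append("")  # an empty line closes the cluster
    return "\n".join(out) + "\n"


toy_lines = [
    "1\t1.0\tENSG00000000001\tENSG00000000002\tENSG00000000003",
    "2\t1.0\tENSG00000000004\tENSG00000000005\tENSG00000000006",
]
print(monet_to_cls(toy_lines), end="")
```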
Naïve with Clustering Approach
- This script is similar to the Naïve Approach, but the output additionally contains the Interactome Clustering data
-> Run 10_Naive_withClusteringApproach.py script with:
python3 10_Naive_withClusteringApproach.py --inSampleFile sample.xlsx --inUniprot Uniprot_output.tsv --inCandidateFile candidateGenes_*.xlsx --inCanonicalFile canonicalTranscripts_220221.tsv --inInteractome Interactome_human.tsv --inClusterFile K1Clustering_clusterFile.cls --inGTEXFile E-MTAB-5214-query-results.tpms.tsv
- Clustering data provide an additional scoring component for the Machine Learning step
Output
- For a detailed description of the scripts' output, please use the --help (-h) option.
- You can also view the sample output files provided in the Sample_Output_Files directory of the repository
Interactome Clustering Methods
We consider clusters with a size between 3 and 130 (inclusive).
If the cluster size exceeds 130, the methods are applied recursively (the MONET tool does this automatically) to obtain clusters of the desired size.
I have tested mainly four types of clustering methods:
1] Kernel clustering approach (K1 method from DREAM Challenge) (Choobdar, Sarvenaz, et al. "Assessment of network module identification across complex diseases." Nature methods vol. 16,9 (2019): 843-852. doi:10.1038/s41592-019-0509-5)
2] Modularity Optimization method (M1 method from DREAM Challenge) (Choobdar, Sarvenaz, et al. "Assessment of network module identification across complex diseases." Nature methods vol. 16,9 (2019): 843-852. doi:10.1038/s41592-019-0509-5)
3] Random-walk-based method (R1 method from DREAM Challenge) (Choobdar, Sarvenaz, et al. "Assessment of network module identification across complex diseases." Nature methods vol. 16,9 (2019): 843-852. doi:10.1038/s41592-019-0509-5)
- To run the above clustering methods on the Interactome file generated by the Build Interactome script (4_BuildInteractome_BinaryPPIonly.py), please use the MONET tool (Tomasoni, Mattia et al. "MONET: a toolbox integrating top-performing methods for network modularization." Bioinformatics (Oxford, England) vol. 36,12 (2020): 3920-3921. doi:10.1093/bioinformatics/btaa236) available at: https://github.com/BergmannLab/MONET
- If you use the cluster file produced by any of these methods, please process it with the 9_ProcessClusterFile_MONET.py script using the command:
% cat cluster_outputFile.tsv | python3 9_ProcessClusterFile_MONET.py > ClusterFile.cls
- Input File Description: cluster_outputFile.tsv is the Clustering Output File produced by the MONET tool (DREAM Challenge)
4] Randomized optimization of modularity (Didier, Gilles, et al. "Identifying communities from multiplex biological networks by randomized optimization of modularity." F1000Research vol. 7 1042. 10 Jul. 2018, doi:10.12688/f1000research.15486.2)
- To run this clustering method on the Interactome file generated by the Build Interactome script (4_BuildInteractome_BinaryPPIonly.py), please use the MolTi-DREAM tool described at: https://github.com/gilles-didier/MolTi-DREAM
- This might generate some large clusters (i.e., size > 130). In such cases, please run the tool recursively as described on the MolTi-DREAM GitHub page
- The output produced by this tool need not be processed further and can be used directly for the 5.2_addInteractome.py script
You can use one of the above or any other clustering method, but the Cluster File should have the following format (please refer to sample_clusterFile.cls in the Sample_Input_Files directory of the repository):
- Header (Ex: #ClustnSee analysis export)
- Followed by the ClusterID (Ex: ClusterID:1||)
- Followed by the Name(s) (ENSG) of the Cluster members (Ex: ENSG00000162819)
- An empty line indicates the end of a given Cluster
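A small reader for this format might look like the sketch below. It assumes one member name per line, treats any line starting with `#` as the header, and applies the 3 to 130 size window mentioned at the start of this section as an optional filter. The function name and the filtering step are illustrative, not taken from the repository's scripts.

```python
def read_cls(lines, min_size=3, max_size=130):
    """Parse a .cls cluster file into {cluster_id: [member, ...]}."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            continue  # header line
        if line.startswith("ClusterID:"):
            # 'ClusterID:1||' -> '1'
            current = line[len("ClusterID:"):].split("|")[0]
            clusters[current] = []
        elif line == "":
            current = None  # an empty line ends the current cluster
        elif current is not None:
            clusters[current].append(line)
    return {
        cid: members for cid, members in clusters.items()
        if min_size <= len(members) <= max_size
    }


toy_cls = [
    "#ClustnSee analysis export",
    "ClusterID:1||",
    "ENSG00000000001",
    "ENSG00000000002",
    "ENSG00000000003",
    "",
    "ClusterID:2||",
    "ENSG00000000004",
    "",
]
print(read_cls(toy_cls))
```

In this toy input the second cluster has a single member, so it falls outside the size window and is dropped.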
Arguments [defaults] -> Can be abbreviated to shortest unambiguous prefixes
# UniProt Files
--inUniprot A tab-separated Input File name (produced by 1_Uniprot_parser.py) containing UniProt Primary Accession, Taxonomy Identifier, ENST(s), ENSG(s), UniProt Secondary Accession(s), Gene ID(s) & Gene name(s)
# Protein-Protein Interaction File(s)
--inInteraction miTAB 2.5 or 2.7 Input File name (Protein-Protein Interaction File)
# Protein-Protein Interaction Experiment File(s)
--inExpFile PPI Experiments Input File name (produced by 2_Interaction_parser.py)
# Canonical Transcripts File
--inCanonicalFile Canonical Transcripts Input File name (.gz or non .gz)
# Sample File
--inSampleFile Sample Metadata Input File name (.xlsx)
# Candidate Gene File(s)
--inCandidateFile Candidate Gene Input Files(s) name (.xlsx)
# Interactome File
--inInteractome High-Quality Interactome Input File name (produced by 4_BuildInteractome_BinaryPPIonly.py/5_BuildInteractome_BinaryPPIwithExpansion.py)
# GTEX File
--inGTEXFile GTEX Input File name (.tsv)
# Help
-h, --help Show the help message and exit
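The prefix abbreviation described above matches the default behaviour of Python's argparse module (`allow_abbrev=True`). The sketch below assumes the scripts use argparse, which is not stated here, and declares only a subset of the options. It also shows why `--inC` alone would be rejected: it is a prefix of both `--inCanonicalFile` and `--inCandidateFile`, so the shortest unambiguous prefixes for those two are `--inCano` and `--inCand`.

```python
import argparse


def build_parser():
    """Hypothetical parser declaring a subset of the options listed above."""
    parser = argparse.ArgumentParser(description="Illustrative argument parser")
    parser.add_argument("--inUniprot", help="output of 1_Uniprot_parser.py")
    parser.add_argument("--inExpFile", nargs="+", help="one or more PPI experiment files")
    parser.add_argument("--inCanonicalFile", help="canonical transcripts file")
    parser.add_argument("--inCandidateFile", nargs="+", help="candidate gene file(s)")
    return parser


args = build_parser().parse_args([
    "--inU", "Uniprot_output.tsv",
    "--inE", "Exp_Biogrid.tsv", "Exp_Intact.tsv",
    "--inCano", "canonicalTranscripts_220221.tsv",
])
print(args.inUniprot, args.inExpFile, args.inCanonicalFile)
```

`nargs="+"` is one way to accept several files after a single option, as in the `--inExpFile Exp_Biogrid.tsv Exp_Intact.tsv` examples.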
Metadata files
- Currently, the scripts use 2 metadata files, i.e. the sample metadata file (--inSampleFile) and the candidate gene file(s) (--inCandidateFile).
- Sample metadata file: describes the samples.
- Required columns:
- pathologyID: the phenotype of each patient/sample, used to define the "cohorts".
- sampleID: unique identifier for each sample (used by the Machine learning scripts)
- Optional columns (currently not used by the scripts) such as:
- specimenID: the external identifier for each sample, typically related to the BAM or FASTQ filenames.
- patientID: a more user-friendly identifier for each sample
- Sex: 'F' or 'M'
- Candidate gene file(s): list known candidate genes/implicated seed genes
- Required columns:
- Gene: gene name (should be the HGNC name, see www.genenames.org).
- pathologyID: pathology/phenotype
- Optional column (currently not used by the scripts) such as:
- Confidence score: indicates how confident you are that LOF variants in this gene are causal for this pathology. Value: integers from 1 to 5 (5 meaning the gene is definitely causal, while 1 is a lower-confidence candidate).
Dependencies
- Python version >= 3
- External dependencies are kept to a minimum in all the scripts. The only required python modules are listed below:
- OpenPyXl == 3.0.10
- SciPy == 1.5.2
- You can install these with pip/conda
- Most other standard core modules should already be available on your system
- Additional dependencies for the Machine learning scripts:
- Scikit-learn == 1.1.1
- Imbalanced-learn == 0.9.1
- Pandas == 1.4.3
- Joblib == 1.1.0
License
Licensed under the GNU General Public License v3.0 (refer to the LICENSE file for more details)