The project primarily aims to discover correlations between specific motifs in protein structure and biases in the codons that encode these motifs. It also explores the evolutionary implications of these biases for protein families.
The project is developed in Python 3.11 or later. All packages are installed in a conda environment. An environment with the required packages is available in the nustru_environment.yml file. To create the environment, run the following command:
conda env create -f setup/nustru_environment.yml
For the phylogenetic tree construction and analysis, ete3 is used, which has some specific installation dependencies. The package therefore needs to be installed in a separate environment. The environment with the required packages is available in the nustru-phyl_environment.yml file. To create the environment, run the following command:
conda env create -f setup/nustru-phyl_environment.yml
One additional package needs to be installed manually, as the CAI python package has some problems with the pip installation. The package can be installed with the following command:
pip install git+https://github.com/Benjamin-Lee/CodonAdaptationIndex.git
Some of the scripts also require external software packages to be installed, including:
dssp 4.4.0
FastTree 2.1.11
mad 2.2
mafft 7.525
While the installation procedure for these packages is mentioned in the specific sections, there is also a script provided to install all the required software packages at once. The script can be run with the following command:
bash setup/package_helper.sh
Otherwise, try to install the packages with the following commands:
On Linux:
sudo apt install [package_name] (like dssp, mafft, fasttree, mad)
On MacOS:
brew install [package_name] (like brewsci/bio/dssp, mafft, fasttree, mad)
Sometimes mad cannot be installed with these methods; in that case, the source code can be downloaded from the website and installed manually. The source code can be found here: https://www.mikrobio.uni-kiel.de/de/ag-dagan/ressourcen
To construct and filter the data in a more straightforward way, a pipeline script is provided. Alternatively, if only parts of the pipeline are needed, the scripts can be run individually as described in the following sections. Several improvements have been made to parallelize and asynchronize the data fetching over API calls, and pagination is used to limit the number of requests where applicable. However, many websites like NCBI limit API requests to around 3-10 requests per second. When dealing with hundreds of thousands of Uniprot entries, this can slow down the process considerably.
Note: Some of the scripts require an API key for NCBI to accelerate and parallelize the data fetching. The API key can be created by following the website instructions: https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us. The scripts are optimized to use the API key and the mail address to fetch the data; if needed, the request rate can be lowered so that no API key is required.
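As a rough illustration of how the rate limit and the API key interact, the sketch below shows a throttled fetch loop using Biopython's Bio.Entrez; the delay, the example accession and the fetch parameters are illustrative assumptions and do not reproduce the exact logic of the pipeline scripts.

```python
# Minimal sketch (assumption): throttled NCBI E-utilities fetching with Biopython's Bio.Entrez.
# With an API key NCBI allows ~10 requests per second, without one ~3 requests per second.
import time
from Bio import Entrez

Entrez.email = "your.name@example.org"  # hypothetical mail address
Entrez.api_key = "YOUR_NCBI_API_KEY"    # omit this line to fall back to the lower rate limit

def fetch_nucleotide_fasta(accessions, delay=0.11):
    """Fetch FASTA records one by one, sleeping between requests to respect the rate limit."""
    records = {}
    for accession in accessions:
        handle = Entrez.efetch(db="nuccore", id=accession, rettype="fasta", retmode="text")
        records[accession] = handle.read()
        handle.close()
        time.sleep(delay)  # ~10 requests/s with an API key; use ~0.34 s without one
    return records

# Example call with a hypothetical accession:
# sequences = fetch_nucleotide_fasta(["NM_000518.5"])
```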
The database construction includes the following steps for individual entries:
- Fetch entry from PDB or Uniprot
- Map the nucleotide sequence to the protein sequence (for Uniprot this is separated into two steps)
- Filter the entries for wrong translations and redundancy
- Assign secondary structure and other features to the entries
The database construction includes the following steps for protein families (only for Uniprot):
- Fetch protein members of a family from InterPro
- Fetch entries from Uniprot
- The remaining steps are the same as for individual entries
While the individual scripts are described in the following sections, there is also a complete pipeline script called nustrufiller.py. Based on the input, a single Uniprot or InterPro ID or a list of IDs, it will fetch and construct the required data. The pipeline script can be run with the following arguments / options:
usage: nustrufiller.py [-h] -i INPUT -o OUTPUT_PATH [-n NAME] [-u UNIQUE] [-w] -m API_MAIL -k API_KEY [-d]
Complete pipeline to fetch the data for uniprot IDs or interpro IDs (complete protein families).
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Single uniprotID or interproID, or file with comma separated ids.
-o OUTPUT_PATH, --output OUTPUT_PATH
Output path or directory to store the log file and the data.
-n NAME, --name NAME Name of the output files and log file.
-u UNIQUE, --unique UNIQUE
Possibility to drop duplicates. To keep duplicates use None. Default: organisms
-w, --overwrite If file name already exists, overwrite it. Default is False.
-m API_MAIL, --mail API_MAIL
Provide mail for ncbi api account: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/.
-k API_KEY, --key API_KEY
Provide api key from ncbi api account: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/.
-d, --download Download the structure files.
Note: This pipeline script only works for Uniprot data. For PDB data, the script in the "Create nustruDB for PDB data" section should be used.
Both PDB and Uniprot provide coding sequences for the deployed protein structures, but follow different mapping strategies to the nucleotide sequence.
To create the database from PDB, a GraphQL API is used to fetch the nucleotide sequences. The coding sequence of the protein is annotated with the start and end of the nucleotide sequence, the strand, and any exon shifts from the NCBI nt database. The csv file is created with the following arguments / options:
usage: PDBmapNT [-h] -i ENTRYID [--sql] [--pandas] -o OUTPUT_PATH -n NAME [--map-uniprot] [--create-fasta {protein,nucleotide,all}] [-w]
Map PDB ID to nucleotide sequence and prints an alignment of the pdb protein sequence to the nucleotide sequence
options:
-h, --help show this help message and exit
-i ENTRYID, --input ENTRYID
Path to file containing PDB IDs with comma separation.
--sql Store the data in a SQL database.
--pandas Store the data in a pandas DataFrame.
-o OUTPUT_PATH, --output OUTPUT_PATH
Output path or directory to store the log file and the data.
-n NAME, --name NAME Name of the output files and log file.
--map-uniprot Map uniprot ID to nucleotide sequence.
--create-fasta {protein,nucleotide,all}
Create a fasta file with the nucleotide sequences, protein sequences or both.
-w, --overwrite If file name already exists, overwrite it. Default is False.
Test the pdb mapping with the following command:
python PDBmapNT.py -i Example/examples_nustruDB/example_pdbIDs.txt --pandas -o . -n example_pdbIDs_nustru --create-fasta all
To create the database from Uniprot, the nucleotide sequence is fetched from the NCBI nt database. The referenced GenBank ID or EMBL ID is used to fetch the nucleotide sequence. Two scripts are provided: the first follows a similar strategy to the PDB mapping and only requires Uniprot IDs (see nustruDB/Example/example1_uniprotIDs.txt), while the second takes in a predefined list (see nustruDB/Example/example1_uniprotList.tsv).
The database is then created with the following arguments / options:
usage: UPmapNT [-h] -i ENTRYID [--sql] [--pandas] -o OUTPUT_PATH -n NAME [--create-fasta {protein,nucleotide,all}] [-w] -m API_MAIL -k API_KEY
Map uniprot ID to nucleotide sequence and prints an alignment of the protein sequence to the nucleotide sequence
options:
-h, --help show this help message and exit
-i ENTRYID, --input ENTRYID
Path to file containing uniprot IDs with comma separation.
--sql Store the data in a SQL database.
--pandas Store the data in a pandas DataFrame.
-o OUTPUT_PATH, --output OUTPUT_PATH
Output path or directory to store the log file and the data.
-n NAME, --name NAME Name of the output files and log file.
--create-fasta {protein,nucleotide,all}
Create a fasta file with the nucleotide sequences, protein sequences or both.
-w, --overwrite If file name already exists, overwrite it. Default is False.
-m API_MAIL, --mail API_MAIL
Provide mail from ncbi api account: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
-k API_KEY, --key API_KEY
Provide api key from ncbi api account: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
Test this uniprot mapping with the following command:
python fastUPmapNT.py -i Example/examples_nustruDB/example_uniprotList.tsv -o . -n example_uniprotList_nustru [-w]
To use the parallel version of the uniprot mapping, a TSV table with the Uniprot IDs and the required column features is needed. The script is adapted from https://www.uniprot.org/help/api_queries and uses batches to retrieve the required data.
usage: fetchUPTSV.py [-h] -i INPUT_FILE -o OUTPUT_PATH -n NAME [-w]
Retrieve uniprot features from uniprot IDs and store them in a tsv file
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input file with uniprot IDs.
-o OUTPUT_PATH, --output OUTPUT_PATH
Output path or directory to store the data.
-n NAME, --name NAME Name of the output file.
-w, --overwrite If file name already exists, overwrite it. Default is False.
Test the tsv table creation with the following command:
python fetchUPTSV.py -i Example/examples_nustruDB/example_uniprotIDs.txt -o uniprot_features -n uniprot_features [-w]
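Under the hood, the batched retrieval follows the batching pattern from the UniProt API help page linked above; the sketch below is a simplified illustration, and the query format, return fields and batch size are assumptions rather than the script's exact parameters.

```python
# Illustrative sketch (assumption): batched UniProt REST retrieval of TSV-formatted features.
import requests

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"

def fetch_uniprot_tsv(accessions, fields="accession,organism_name,sequence", batch_size=100):
    """Query UniProt in batches of accessions and yield TSV chunks."""
    session = requests.Session()
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        query = " OR ".join(f"accession:{acc}" for acc in batch)
        params = {"query": query, "fields": fields, "format": "tsv", "size": batch_size}
        response = session.get(SEARCH_URL, params=params, timeout=60)
        response.raise_for_status()
        yield response.text

# Example with hypothetical accessions:
# for chunk in fetch_uniprot_tsv(["P0A7Y4", "P69905"]):
#     print(chunk)
```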
The tsv table can then be used to map the nucleotide sequences to the uniprot ids with the following arguments / options:
usage: fastUPmapNT.py [-h] -i INPUT_FILE -o OUTPUT_PATH -n NAME [-w]
Retrieve nucleotide sequences from uniprot IDs in a csv file.
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input file with uniprot IDs in a csv format.
-o OUTPUT_PATH, --output OUTPUT_PATH
Output path or directory to store the log file and the data.
-n NAME, --name NAME Name of the output files and log file.
-w, --overwrite If file name already exists, overwrite it. Default is False.
Test the uniprot mapping with the following command:
python fastUPmapNT.py -i Example/examples_nustruDB/example_uniprotList.tsv -o . -n example_uniprotList_nustru [-w]
The dataframe or database contains many redundant entries. To reduce the redundancy and to remove wrong translations of the nucleotide coding sequence, the entries are filtered by the following criteria:
- the coding sequence starts with ATG
- the length of the coding sequence is divisible by three (complete codons)
- the coding sequence contains only valid nucleotides (A, T, C, G)
- the coding sequence translates to the assigned protein sequence
The script is called with the following arguments / options:
usage: db_filter.py [-h] -i INPUT_FILE -o OUTPUT_PATH -n NAME [-u UNIQUE] [-w]
Filter the csv databases to reduce redundancy, check cds and correct translation and further improve the data contained.
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input file of csv formatted uniprot and/ or pdb entries.
-o OUTPUT_PATH, --output OUTPUT_PATH
Output to store the new filtered and reduced csv.
-n NAME, --name NAME Name of the output csv file and log file.
-u UNIQUE, --unique UNIQUE
Possibility to drop duplicates. To keep duplicates use None. Default: organisms
-w, --overwrite If file name already exists, overwrite it. Default is False.
Test the filtering with the following command:
python db_filter.py -i Example/examples_nustruDB/example_db_filtered.csv -o . -n nustruDB -u
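For reference, the validity checks described above correspond roughly to the following sketch; it assumes Biopython's Seq for the translation step and is a simplified illustration, not the exact logic of db_filter.py.

```python
# Minimal sketch (assumption): CDS validity checks similar to those applied during filtering.
from Bio.Seq import Seq

def cds_is_valid(cds: str, protein: str) -> bool:
    """Check start codon, codon length, nucleotide alphabet and translation against the protein."""
    cds = cds.upper()
    if not cds.startswith("ATG"):
        return False
    if len(cds) % 3 != 0:
        return False
    if set(cds) - set("ATCG"):
        return False
    translated = str(Seq(cds).translate(to_stop=True))
    return translated == protein

# Example: cds_is_valid("ATGGCTTAA", "MA") returns True.
```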
It is better to filter the data first and then fetch the secondary structure information, as this reduces the API load. To run this script, an installed version of the DSSP software is required. Try to install it with the following command:
On Linux:
sudo apt install dssp
On MacOS:
brew install brewsci/bio/dssp
The script will create two additional columns: one with the B-factor (for PDB entries) or the pLDDT score (for Uniprot entries), and one with the secondary structure. The chain of the model is taken into account. The script is called with the following arguments / options:
usage: db_fetch.py [-h] -i INPUT_FILE -o OUTPUT_PATH -n NAME [-d] [-w]
Fetch the pdb and map the secondary structure, as well as plddt or bfactor to the protein sequence.
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input file of csv formatted uniprot or pdb entries.
-o OUTPUT_PATH, --output OUTPUT_PATH
Output to store the new csv with secondary structure information.
-n NAME, --name NAME Name of the output files and log file.
-d, --download Download the pdb files.
-w, --overwrite If file name already exists, overwrite it. Default is False.
Test the secondary structure mapping with the following command:
python db_fetch.py -i Example/examples_nustruDB/example_dbfetch.csv -o . -n example_dbfetch_nustru_fetched [-w]
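To illustrate the kind of per-residue information involved, the sketch below uses Biopython's DSSP wrapper on a downloaded structure file; the file name and the chain handling are illustrative assumptions, and db_fetch.py may organize this differently. For AlphaFold models, the B-factor column holds the pLDDT score.

```python
# Illustrative sketch (assumption): per-residue secondary structure and B-factor/pLDDT via DSSP.
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

parser = PDBParser(QUIET=True)
structure = parser.get_structure("example", "example.pdb")  # hypothetical file name
model = structure[0]
dssp = DSSP(model, "example.pdb")  # requires the dssp/mkdssp binary to be installed

for (chain_id, res_id), values in dssp.property_dict.items():
    secondary_structure = values[2]  # one-letter DSSP code (H, E, T, S, ...)
    residue = model[chain_id][res_id]
    bfactor = next(atom.get_bfactor() for atom in residue)  # B-factor, or pLDDT for AlphaFold models
    print(chain_id, res_id[1], secondary_structure, bfactor)
```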
The domains and fold classes are not fetched during the previous steps. The script db_fclass.py is used to fetch the annotated domains and fold classes for a protein. The fold class is retrieved from the CATH or SUPFAM database; this still needs to be improved. The script is called with the following arguments / options:
usage: db_fclass.py [-h] -i INPUT_FILE -o OUTPUT_PATH -n NAME [-w]
Fetch the annotated domains and fold classes for a protein.
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input file of csv formatted uniprot or pdb entries.
-o OUTPUT_PATH, --output OUTPUT_PATH
Output to store the new csv with domain and fold class information.
-n NAME, --name NAME Name of the output files and log file.
-w, --overwrite If file name already exists, overwrite it. Default is False.
Test the domain and fold class fetching with the following command:
python db_fclass.py -i Example/examples_nustruDB/example_dbfclass.csv -o . -n example_dbfclass_nustru_fetched
Some of the analyses were done on whole protein families. While nustrufiller.py can be used to fetch the data and perform several data preparation steps automatically, the script fetchINPRO.py can be used to separately fetch the members (Uniprot IDs) of a protein family from InterPro. The script is called with the following arguments / options:
usage: fetchINPRO.py [-h] -o OUTPUT_PATH [-w] family_id
Fetch Accessions from InterPro
positional arguments:
family_id InterPro family ID to fetch accessions for
options:
-h, --help show this help message and exit
-o OUTPUT_PATH, --output OUTPUT_PATH
Output to store the interpro accessions.
-w, --overwrite If file name already exists, overwrite it. Default is False.
Test the InterPro fetching with the following command:
python fetchINPRO.py IPR000839 -o IPR000839_accessions.txt
To analyse the effects of codon usage bias and specific correlations between synonymous codons and secondary structure elements, two notebooks are provided. The first notebook, codon_metrics_analysis.ipynb, investigates the secondary structure and the codon usage bias of different parts of the protein sequence. The second notebook, cross_validation_destribution_analysis.ipynb, uses the KL divergence to compare the synonymous codon usage bias of the secondary structure elements. It also compares the corresponding codon frequencies and probabilities.
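As a pointer to what the KL comparison involves, the sketch below computes the divergence between two synonymous codon distributions with scipy; the counts are made up and the notebook's exact normalization may differ.

```python
# Illustrative sketch (assumption): KL divergence between synonymous codon distributions
# of two secondary structure classes (e.g. helix vs. sheet), using made-up counts.
import numpy as np
from scipy.stats import entropy

# Hypothetical codon counts for one amino acid (e.g. the four glycine codons) per structure class.
helix_counts = np.array([120, 80, 40, 10], dtype=float)
sheet_counts = np.array([90, 95, 50, 15], dtype=float)

p = helix_counts / helix_counts.sum()  # codon distribution in helices
q = sheet_counts / sheet_counts.sum()  # codon distribution in sheets

kl_helix_vs_sheet = entropy(p, q)  # D_KL(P || Q), in nats
print(f"KL divergence (helix || sheet): {kl_helix_vs_sheet:.4f}")
```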
The evolutionary analysis is based on the multiple sequence alignment of different protein families with data from the nustruDB (described previously). For each MSA, the normalized codon rarity of each amino acid position is calculated; the formula and its variable definitions are shown in the accompanying figure.
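One plausible form of such a normalized codon rarity, assuming it is measured relative to the most frequent synonymous codon and averaged over the sequences in an alignment column, would be the following (an illustrative assumption, not necessarily the exact definition used in the notebooks):

$$
r(c) = 1 - \frac{f(c)}{\max_{c' \in \mathrm{syn}(c)} f(c')},
\qquad
R_j = \frac{1}{|S_j|} \sum_{s \in S_j} r\left(c_{s,j}\right)
$$

where $f(c)$ is the usage frequency of codon $c$ in the source organism, $\mathrm{syn}(c)$ is the set of codons synonymous to $c$, $S_j$ is the set of sequences with a non-gap residue at alignment column $j$, and $c_{s,j}$ is the codon encoding that residue in sequence $s$.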
The multiple sequence alignment is done with the mafft
software. The software can be installed with the following command:
On Linux:
sudo apt install mafft
On MacOS:
brew install mafft
The previous database scripts should have created a protein fasta file with the filtered protein sequences, so that the alignment can then be done like this:
mafft --auto input.fasta > output_aln.fasta
The tree construction is done with the FastTree
software. The software can be installed with the following command:
On Linux:
sudo apt install fasttree
On MacOS:
brew install fasttree
From the previous alignment, the tree can be constructed like this:
FastTree input_aln.fasta > output_tree.nwk
The automatic rooting of the tree can be done with MAD. The software should be manually installed from the source code, as described in the requirements section. The rooting can be done with the following command:
python mad.py input_tree.nwk
Alternatively, the MSA and tree construction can be done automatically with the setup/align_and_tree.sh script or the setup/alignment_and_manual_tree_rooting.ipynb notebook. Both help to align the sequences and construct the tree; the notebook also helps to root the tree manually if desired.
For the analysis of the codon rarity and protein structure, the notebook msa_codon_structure.ipynb is provided. The notebook explains each individual step of the analysis and directly implements the described formula to calculate the codon rarity on the MSA. Furthermore, a log odds ratio is calculated to reveal positions with a higher likelihood of being influenced by codon usage bias.
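As a rough illustration of such a log odds ratio, the sketch below compares the observed fraction of rare codons at an alignment position against a background expectation; the counts and the background model are made-up assumptions, not the notebook's exact computation.

```python
# Illustrative sketch (assumption): log odds of observing rare codons at an MSA position
# compared to a background expectation, with hypothetical counts and additive smoothing.
import math

def rare_codon_log_odds(rare_at_position, total_at_position,
                        rare_background, total_background, pseudocount=1.0):
    """Return log( P(rare | position) / P(rare | background) )."""
    p_position = (rare_at_position + pseudocount) / (total_at_position + 2 * pseudocount)
    p_background = (rare_background + pseudocount) / (total_background + 2 * pseudocount)
    return math.log(p_position / p_background)

# Example with made-up numbers: 18 of 40 sequences use a rare codon at this column,
# while rare codons make up about 20% of all codons in the background.
print(rare_codon_log_odds(18, 40, 2000, 10000))  # positive value -> enriched for rare codons
```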
For the phylogenetic analysis, the notebook phyl_codon_rarity.ipynb is provided. The notebook explains the individual steps and the results of the analysis. It uses the ete3 package to visualize the tree and the codon rarity, and it performs a regression analysis to find the correlation between the codon rarity and the evolutionary distance.
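For orientation, a regression of codon rarity on evolutionary distance can be done along the following lines; the arrays contain made-up example values and the notebook may use a different regression setup.

```python
# Illustrative sketch (assumption): linear regression of per-sequence codon rarity
# against evolutionary distance (e.g. root-to-tip distance), with made-up values.
import numpy as np
from scipy.stats import linregress

evolutionary_distance = np.array([0.05, 0.12, 0.20, 0.33, 0.41, 0.55])  # hypothetical distances
codon_rarity = np.array([0.31, 0.29, 0.27, 0.25, 0.24, 0.21])           # hypothetical rarity scores

result = linregress(evolutionary_distance, codon_rarity)
print(f"slope={result.slope:.3f}, r={result.rvalue:.3f}, p={result.pvalue:.3g}")
```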
To analyse fold classes and domains with respect to the codon rarity score (CRS) or the relative synonymous codon usage (RSCU), the notebook fold_domain_analysis.ipynb is used. The notebook visualizes the ratio of alpha helix and beta sheet content against the CRS or RSCU. To investigate the CRS on the folds, the path to the pickle file from msa_codon_structure.ipynb, which is created when running that notebook, needs to be defined.
Some of the scripts and notebooks are planned to be further optimized. The logging for the data construction does not seem to work properly and should be fixed.