- Scientific Context
- Workflow
- Installation & Usage
- Input Data
- Test Run
- WIKI
- Bugs
- Acknowledgements
- License
- Contact
ProtTrace is a simulation-based approach to assess, for a given protein (the seed), over what evolutionary distances its orthologs can be found by means of significant sequence similarity. It thereby helps to differentiate between the true absence of an ortholog in a given species and its non-detection due to limited search sensitivity. ProtTrace was presented in 2018 at the German Conference on Bioinformatics (GCB). The high-resolution PDF of the corresponding poster is available HERE.
The workflow protTrace uses to infer the evolutionary traceability of a seed protein is shown in the figure below. It consists of three main steps:
- Parameterization: The compilation of an orthologous group for the seed protein. In the standard setting, OMA orthologous groups are used. The sequences in the orthologous group are then used to infer the parameters of the substitution process and of the insertion and deletion process.
- Traceability calculation: The in-silico evolution of the seed protein using the simulation software REvolver, and the determination of the traceability curve.
- Visualization: The inference of the traceability index for the protein in 233 species from all domains of life, and the generation of a colored tree. A high resolution PDF of the image is available HERE.
Please refer to the protTrace WIKI for a full description of the installation and usage guidelines. The WIKI also explains how to set up a virtual machine running protTrace. Below, we provide a quick excerpt.
protTrace is written in Python 2.7, with some helper scripts in Perl and R. The resources and third-party software required by protTrace are listed below.
The ProtTrace package contains scripts written in different languages. To run ProtTrace you need the following resources:
- Python v2.7.13 or higher. Note that ProtTrace will not run under Python 3
  - Also install the DendroPy module (can be done via Conda).
- Perl v5 or higher, including the following modules
  - Getopt::Long
  - List::Util
  - LWP::Simple
- Java v1.7 or higher
- R v3 or higher
- wget
Program name | Version | Description | Mandatory | BioConda |
---|---|---|---|---|
MAFFT | v6 or higher | Multiple Sequence alignment | yes | yes |
NCBI Blast | v2.7 or higher | Sequence similarity based search | yes | yes |
HMMER | 3.2 or higher | Sequence similarity based search using Hidden Markov Models | yes | yes |
IQTREE | 1.6.7.1 or higher | Phylogenetic tree reconstruction | yes | yes |
HaMStR OneSeq | v1 or higher | targeted ortholog search | no | no |
To start, we suggest omitting the optional HaMStR, since using this software comes with some strict naming conventions.
Once the dependencies are installed (we suggest using the conda package management system for this), clone this repository to get a copy of protTrace.
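Before configuring protTrace, it can help to confirm that the tools listed above are reachable on your PATH. The following is a minimal sketch, not part of protTrace: it is a standalone helper that needs Python 3.3 or newer (for `shutil.which`), and the executable names in `REQUIRED_TOOLS` are assumptions that may differ on your system.

```python
import shutil

# Executable names are assumptions; adjust them to your installation
# (for example, IQ-TREE may be installed as "iqtree" or "iqtree2").
REQUIRED_TOOLS = [
    "perl", "java", "Rscript", "wget",
    "mafft", "blastp", "hmmscan", "iqtree",
]


def find_missing(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that each executable can be found; it does not verify the version requirements given above.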
git clone https://github.com/BIONF/protTrace
To configure protTrace, move into the protTrace directory and run the configuration script:
perl bin/create_conf.pl -name=prog.conf -getOMA -getPfam
This checks that all dependencies are present, lets you set all parameters required for the protTrace run, and, with the options -getOMA and -getPfam, downloads the required data from the OMA database and from the Pfam database.
- If you are confident that you already have this data available, you can omit either or both of the options -getOMA and -getPfam. You will then have to tell protTrace, via the create_conf.pl script, where this data is located.
- Make sure to adhere to the formatting requirements for the OMA data, and that you have run hmmpress on the Pfam database.
Once everything is set, you are ready to run protTrace.
Enter the protTrace directory and type
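If you supply your own Pfam database, one quick way to see whether hmmpress has been run on it is to look for the four index files that hmmpress writes next to the HMM file. A minimal sketch, assuming a hypothetical database path `Pfam-A.hmm` (use the path you pass to create_conf.pl):

```python
import os

# Index files that hmmpress writes next to the HMM database.
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_hmmpress_files(hmm_path):
    """Return the hmmpress index files that do not exist for `hmm_path`."""
    return [
        hmm_path + suffix
        for suffix in HMMPRESS_SUFFIXES
        if not os.path.isfile(hmm_path + suffix)
    ]


if __name__ == "__main__":
    # Hypothetical location; replace with your own Pfam database path.
    pfam = "Pfam-A.hmm"
    missing = missing_hmmpress_files(pfam)
    if missing:
        print("hmmpress index files not found: " + ", ".join(missing))
    else:
        print("All hmmpress index files are present.")
```

An empty result only shows that the index files exist, not that they are up to date with the HMM file.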
python bin/protTrace.py -h
This should print:
USAGE: protTrace.py -i <omaIdsFile> | -f <fastaSeqsFile> -c <configFile> [-h]
-i Text file containing protein OMA ids (1 id per line)
-f List of input protein sequences in fasta format
-c Configuration file for setting program's dependencies
protTrace accepts either OMA protein ids or protein sequences in fasta format as input.
In toy_example/ you can find two files, test.ids and test.fasta, for performing a test run with protTrace.
The input is described in the section Test Run of our WIKI.
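To illustrate the two input types, here is a minimal sketch of how an id file (one id per line) and a fasta file could be read and sanity-checked before a run. This is not protTrace's own parser; the function names are hypothetical, and the ids and sequences in the usage lines are made-up placeholders.

```python
def read_ids(lines):
    """Return non-empty, stripped ids from an iterable of lines (one id per line)."""
    return [line.strip() for line in lines if line.strip()]


def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder data for illustration only.
ids = read_ids(["PLACEHOLDER00001\n", "\n"])
records = read_fasta([">example_protein\n", "MKV\n", "LLA\n"])
```

Both functions take any iterable of lines, so an open file handle can be passed directly.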
The directory toy_example contains two files for testing protTrace:
- test.ids: This file contains the OMA protein id of the yeast protein DIM1. To run this test:
  - create a config file prot.conf using the create_conf.pl script. We recommend leaving all values at their defaults to start with
  - place the config file into the directory toy_example
  - enter the directory toy_example and run protTrace by typing
    python ../bin/protTrace.py -i test.id -c prot.conf
  - The output generated by this run is described in the WIKI
- test.fasta: This file contains the protein sequence of human ZNT3. To run this test:
  - create or modify the config file prot.conf using the create_conf.pl script. Make sure to set the entry species in the section General Options to HUMAN
  - place the config file into the directory toy_example
  - enter the directory toy_example and run protTrace by typing
    python ../bin/protTrace.py -f test.fasta -c prot.conf
  - The output generated by this run is described in the WIKI
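The two test runs above differ only in the input flag, so they can be wrapped in a small launcher. The sketch below is an illustration, not part of protTrace: the function names are hypothetical, the choice of flag by file extension is an assumption, and `python` must resolve to a Python 2.7 interpreter as required above.

```python
import subprocess


def build_command(input_file, config_file,
                  script="../bin/protTrace.py", python="python"):
    """Build the protTrace command line for an id file (-i) or fasta file (-f)."""
    # Assumption: files ending in .fasta or .fa are sequence input,
    # everything else is treated as a list of OMA protein ids.
    is_fasta = input_file.lower().endswith((".fasta", ".fa"))
    flag = "-f" if is_fasta else "-i"
    return [python, script, flag, input_file, "-c", config_file]


def run_prottrace(input_file, config_file):
    """Run protTrace and raise CalledProcessError on a non-zero exit code."""
    subprocess.check_call(build_command(input_file, config_file))
```

Called from within toy_example, `run_prottrace("test.fasta", "prot.conf")` would start the same run as the second command above.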
Read the WIKI to explore the functionality of protTrace.
Bug reports, comments, and suggestions are highly appreciated. Please open an issue on GitHub or get in touch via email.
We would like to thank the members of the Ebersberger group for many valuable suggestions and ...bug reports :)
- Arpit Jain
- Ingo Ebersberger
- Dominik Perisa
This tool is released under the GNU GPL 3.0 license.
Arpit Jain, Arndt von Haeseler, Ingo Ebersberger. The evolutionary Traceability of protein (2018). BioRxiv.
Ingo Ebersberger ebersberger@bio.uni-frankfurt.de