Content summary:
LRRpredictor is an open-source tool for detecting LRR motifs in leucine-rich repeat proteins. It relies on secondary structure, relative solvent accessibility and disorder predictions made with RaptorX-Property [1-4], and on sequence variability profiles generated with HH-suite [5,6] against the Uniprot20 sequence database.
Future updates planned for version 1.1:
- RaptorX-Property is currently being updated. Once that update is finished, LRRpredictor will integrate it together with a newer version of HH-suite (which is currently not compatible with RaptorX-Property).
- The project cannot currently be installed on Ubuntu under WSL (Windows Subsystem for Linux) or on macOS. This will be fixed in v1.1.
- LRRpredictor currently accepts only single-sequence input in FASTA format. Support for other sequence formats, as well as for multi-sequence FASTA input files, is being implemented.
A Docker image with LRRpredictor already installed and set up can be pulled from our repository.
- Install the Docker client.
- Download the SSL security certificate.
For Ubuntu/Debian:
sudo wget old.biochim.ro/ib/departments/strbiochem/LRRpred/193.231.158.8_5000.crt -O /usr/local/share/ca-certificates/193.231.158.8_5000.crt
sudo update-ca-certificates
sudo service docker restart
- Once the CA certificate is installed, the image can be pulled:
sudo docker pull 193.231.158.8:5000/lrrpredictor:1.0.1
You can check that the image has been pulled with:
sudo docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
193.231.158.8:5000/lrrpredictor 1.0.0 24d71f69f21f 4 hours ago 2.61GB
- Run the docker image with the command below. Once the container starts, the terminal prompt changes to root@<CONTAINER ID>.
sudo docker run -it 193.231.158.8:5000/lrrpredictor:1.0.1 bash
- Next, download the Uniprot20 database. Make sure that you have around 50 GB of disk space available at the location where the docker image is stored ❗:
bash download-uniprot.sh
Optionally, you can also download the supplemental training data used for cross-validation and testing. These files are not needed for running predictions:
bash download-validation-set.sh
Everything is now set up and you can use LRRpredictor - see the Usage section.
- gcc 5.4 and higher
- CMake 3.1 and higher
- Python 3.6 and higher
- scikit-learn v0.22
- imbalanced-learn v0.6.1
- numpy v1.17 and higher
You can install them by:
sudo apt-get install build-essential cmake
sudo apt-get install python3.6 python3-pip
If you have Ubuntu 14.04 or 16.04, additional steps might be required to install Python 3.6 ❗. Please see link1 and link2
Before continuing, please check that you are using Python 3.6 or higher by typing python3.
Next, install the scikit-learn and imbalanced-learn libraries using pip:
pip3 install scikit-learn==0.22 imbalanced-learn==0.6.1 numpy
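As a quick sanity check of the requirements listed above, the following minimal sketch reports the interpreter version and the versions of the three libraries. The helper names `version_tuple` and `report_versions` are illustrative and not part of LRRpredictor; `sklearn`, `imblearn` and `numpy` are the import names of scikit-learn, imbalanced-learn and numpy.

```python
import importlib
import sys


def version_tuple(version):
    """Turn a string such as '0.22.1' into (0, 22, 1).

    Parsing stops at the first chunk that does not start with a digit,
    so '0.22.post1' becomes (0, 22).
    """
    parts = []
    for chunk in version.split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def report_versions(modules=("sklearn", "imblearn", "numpy")):
    """Return {module name: version string, or None if it cannot be imported}."""
    found = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found


if __name__ == "__main__":
    python_ok = sys.version_info >= (3, 6)
    print("Python", sys.version.split()[0], "OK" if python_ok else "too old")
    for name, version in report_versions().items():
        print(name, version if version else "not installed")
```

The reported versions can then be compared with the list above, for example `version_tuple(numpy_version) >= (1, 17)`.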
Please make sure you clone the project in a location where you have write permissions and at least 5 GB available ❗.
# You can clone the project anywhere on your computer. This is only an example...
cd /home/test/
git clone --recursive https://github.com/eliza-m/LRRpredictor_v1
Check that you see the LRRpredictor_v1 directory when issuing:
ls -l
The following variables should not be changed for now. In the next update we plan to make the installation more customisable.
echo "export LRRpredictor_HOME=\"$(pwd)/LRRpredictor_v1\"" >> ~/.bashrc
source ~/.bashrc
The whole setup workflow for HH-suite, RaptorX-Property and LRRpredictor, including the download of all required files, is run by typing:
cd ${LRRpredictor_HOME}
bash setupAll.sh
LRRpredictor requires the Uniprot20 database. The following steps need to be done only once, when LRRpredictor is set up for the first time.
Case 1: If you already have the Uniprot20 database on your computer, please run the following:
UNIPROT20_PATH="/***replace_with_your_path***/"
# defining a symbolic link to point to your local Uniprot20 copy.
cd ${LRRpredictor_HOME}/RaptorX_Property_Fast/databases
rm uniprot20
ln -s ${UNIPROT20_PATH}/uniprot20_2016_02 uniprot20
cd ${LRRpredictor_HOME}
Case 2: If you do not have the Uniprot20 database, you will need to download it. Please make sure that you have ~50 GB of disk space available ❗. The database can be downloaded anywhere on your computer, as a symbolic link to its exact path is set up afterwards.
UNIPROT20_PATH="/***replace with the path where you want to download Uniprot***/"
# Downloading Uniprot20
cd ${UNIPROT20_PATH}
wget http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/uniprot20_2016_02.tgz
tar -xvzf uniprot20_2016_02.tgz
rm uniprot20_2016_02.tgz
# defining a symbolic link to point to your local Uniprot20 copy.
cd ${LRRpredictor_HOME}/RaptorX_Property_Fast/databases
rm uniprot20
ln -s ${UNIPROT20_PATH}/uniprot20_2016_02 uniprot20
cd ${LRRpredictor_HOME}
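Because the Uniprot20 download in Case 2 needs roughly 50 GB, it can help to check the free space at the target location before starting. This is a minimal sketch using only the Python standard library; the function names and the `REQUIRED_GB` constant are illustrative, with the 50 GB figure taken from the note above.

```python
import shutil

REQUIRED_GB = 50  # approximate figure quoted in this README


def free_gigabytes(path):
    """Free space, in GB (10**9 bytes), on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path, required_gb=REQUIRED_GB):
    """True if at least `required_gb` GB are free at `path`."""
    return free_gigabytes(path) >= required_gb


if __name__ == "__main__":
    # "." stands in for the directory you intend to use as UNIPROT20_PATH.
    print("free GB:", round(free_gigabytes("."), 1))
    print("enough room for Uniprot20:", has_room("."))
```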
The installation is now complete and you can use LRRpredictor.
LRRpredictor can be used with the following syntax:
cd $LRRpredictor_HOME
python3 LRRpred.py <file.fasta> <OutputDirectory>
An example is provided for testing the installation :
python3 LRRpred.py gpa2.fasta results
The provided input file needs to be in fasta format. At the moment only single sequence files can be provided. ❗
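Since only single-sequence files are accepted, an input file can be checked before it is passed to LRRpred.py. The sketch below is a minimal FASTA reader written for illustration; the function names are not part of LRRpredictor, and it assumes that every record starts with a `>` header line.

```python
def read_fasta_records(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; the header is stored without the '>'.
            records.append([line[1:], ""])
        elif records:
            # Sequence lines are appended to the most recent record.
            records[-1][1] += line
    return [(header, sequence) for header, sequence in records]


def is_single_sequence(lines):
    """True if the FASTA content holds exactly one record."""
    return len(read_fasta_records(lines)) == 1


if __name__ == "__main__":
    example = [">gpa2", "MAYAA", "VTSLM"]
    print(read_fasta_records(example))
    print(is_single_sequence(example))
```

For a file on disk, the same functions can be called on an open file handle, for example `is_single_sequence(open("gpa2.fasta"))`.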
After the run, the output directory results contains a folder named after the protein file, here gpa2.
user@ea846151db35:/home/test/LRRpredictor_v1# ls -l results/gpa2/
total 328
drwxr-xr-x 3 root root 4096 Feb 26 17:51 RaptorX-Property # Contains RaptorX-Property structural predictions
-rw-r--r-- 1 root root 97165 Feb 26 17:51 gpa2.input # The data used as input for the 8 classifiers used by LRRpredictor
-rw-r--r-- 1 root root 60378 Feb 26 17:52 gpa2.pred.txt # Long version of the prediction results
-rw-r--r-- 1 root root 1738 Feb 26 17:52 gpa2.predshort.txt # Short version of the results
-rw-r--r-- 1 root root 165984 Feb 26 17:51 gpa2.pssm # variability profile used (also present in input file)
The long version of the prediction results contains:
#prot resid aa unused hasPred clf1 clf2 clf3 clf4 clf5 clf6 clf7 clf8 LRRpredictor
gpa2 0 M - 0 - - - - - - - - -
gpa2 1 A - 0 - - - - - - - - -
gpa2 2 Y - 0 - - - - - - - - -
gpa2 3 A - 0 - - - - - - - - -
gpa2 4 A - 0 - - - - - - - - -
gpa2 5 V - 1 0.0 0.0 0.0001 0.0002 0.0 0.0 0.0 0.0001 0.0001
gpa2 6 T - 1 0.0 0.0 0.0002 0.0 0.0 0.0 0.0 0.0 0.0
gpa2 7 S - 1 0.0001 0.0001 0.0001 0.0004 0.0 0.0 0.0 0.0004 0.0001
gpa2 8 L - 1 0.0001 0.0 0.0002 0.0 0.0 0.0 0.0 0.0 0.0
gpa2 9 M - 1 0.0093 0.0007 0.0008 0.0119 0.0001 0.0 0.0 0.0012 0.003
Header description:
- prot - protein name
- resid - residue number
- aa - amino acid one letter code
- unused - unused field (it is used only in the training and testing data, where it marks the position at which a true LRR motif starts; these positions were identified from structural files).
- hasPred - '1' for positions for which predictions are generated, '0' for margins.
- clf1-8 - each classifier's predicted probability (min: 0, max: 1)
- LRRpredictor - LRRpredictor probability based on all eight classifiers.
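The column layout described above can also be read programmatically. The following is a minimal sketch, assuming the long results file is whitespace-separated with exactly the 14 columns shown in the header and that `-` marks a missing value; the function names are illustrative and not part of LRRpredictor.

```python
def _to_float(text):
    """Convert a probability field to float; '-' means no prediction."""
    return None if text == "-" else float(text)


def parse_long_predictions(lines):
    """Yield one dict per residue from lines in the long results format."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        if len(fields) != 14:
            continue  # skip anything that does not match the 14-column layout
        yield {
            "prot": fields[0],
            "resid": int(fields[1]),
            "aa": fields[2],
            "hasPred": fields[4] == "1",
            "clf": [_to_float(f) for f in fields[5:13]],
            "LRRpredictor": _to_float(fields[13]),
        }


if __name__ == "__main__":
    sample = [
        "#prot resid aa unused hasPred clf1 clf2 clf3 clf4 clf5 clf6 clf7 clf8 LRRpredictor",
        "gpa2 4 A - 0 - - - - - - - - -",
        "gpa2 9 M - 1 0.0093 0.0007 0.0008 0.0119 0.0001 0.0 0.0 0.0012 0.003",
    ]
    for row in parse_long_predictions(sample):
        print(row["resid"], row["aa"], row["hasPred"], row["LRRpredictor"])
```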
The short version of the prediction results contains only the potential LRR motifs detected with a probability above 0.5.
#Prot pos clf1 clf2 clf3 clf4 clf5 clf6 clf7 clf8 LRRpred -5 -4 -3 -2 -1 L x x L x L +6 +7 +8 +9 +10
gpa2 498 0.7261 0.5799 0.8533 0.513 0.602 0.0606 0.8891 0.4282 0.5815 C S F K S R S R I S I H N E E E
gpa2 520 0.716 0.5073 0.9835 0.5105 0.3683 0.0358 0.5765 0.3352 0.5041 S E A H S I I T L C I F K C V T
gpa2 538 0.9436 0.9636 1.0 0.7928 0.9191 0.8902 1.0 0.649 0.8948 L S F K L V R V L D L G L T T C
gpa2 561 0.8005 0.9763 1.0 0.747 0.7626 0.895 1.0 0.8371 0.8773 L S L I H L R Y L S L R F N P R
gpa2 599 0.928 0.976 1.0 0.8087 0.9609 0.9056 1.0 0.8324 0.9264 S S L C Y L Q T F K L Y H P F P
gpa2 626 0.9298 0.9768 1.0 0.5384 0.9491 0.9053 1.0 0.8258 0.8907 L T M P Q L R K L C M G W N Y L
gpa2 651 0.5822 0.3423 0.9926 0.5818 0.3902 0.0031 0.9687 0.5873 0.556 L V L K S L Q C L N E L N P R Y
gpa2 654 0.7745 0.5717 1.0 0.7384 0.9009 0.5866 0.996 0.5031 0.7589 K S L Q C L N E L N P R Y C T G
gpa2 676 0.8824 0.8095 1.0 0.7785 0.9525 0.8812 0.9969 0.7421 0.8804 P N L K K L E V F G V K E D F R
Header description:
- prot - protein name
- pos - residue number where a detected LRR motif starts (i.e. the first L of the minimal LxxLxL motif)
- clf1-8 - each classifier's predicted probability (min: 0, max: 1)
- LRRpred - LRRpredictor probability based on all eight classifiers.
From column 12 to the end, the amino acid sequence of the detected LRR motif is shown: 5 positions upstream of the motif (-5 to -1), the minimal LxxLxL motif, and 5 positions downstream (+6 to +10).
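To illustrate the layout of the short results file, the sketch below splits each row into the probabilities and the 16-residue window. It assumes whitespace-separated columns in the order given in the header; the function name and the dictionary keys `window` and `core` are illustrative and not part of LRRpredictor.

```python
def parse_short_predictions(lines):
    """Return one dict per detected motif from lines in the short results format."""
    motifs = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the header
        motifs.append({
            "prot": fields[0],
            "pos": int(fields[1]),
            "clf": [float(f) for f in fields[2:10]],
            "LRRpred": float(fields[10]),
            "window": "".join(fields[11:]),  # columns 12 onward: -5 .. +10
            "core": "".join(fields[16:22]),  # the six LxxLxL positions
        })
    return motifs


if __name__ == "__main__":
    sample = [
        "#Prot pos clf1 clf2 clf3 clf4 clf5 clf6 clf7 clf8 LRRpred "
        "-5 -4 -3 -2 -1 L x x L x L +6 +7 +8 +9 +10",
        "gpa2 538 0.9436 0.9636 1.0 0.7928 0.9191 0.8902 1.0 0.649 0.8948 "
        "L S F K L V R V L D L G L T T C",
    ]
    for motif in parse_short_predictions(sample):
        print(motif["pos"], motif["LRRpred"], motif["core"], motif["window"])
```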
If these files were not generated when running the provided example, something went wrong during installation. The most likely reasons are:
- The environment variable is not set. Try the following and check that you get a similar path.
root@ea846151db35:/home/test/LRRpredictor_v1# echo $LRRpredictor_HOME
/home/test/LRRpredictor_v1
- Uniprot20 was not downloaded, or the symbolic link does not point to the uniprot20 files. Check the following:
root@ea846151db35:/home/test/LRRpredictor_v1# ls -lh ${LRRpredictor_HOME}/RaptorX_Property_Fast/databases/uniprot20/
total 39G
-rw------- 1 1001 1001 637 Feb 26 2016 md5sum
-rw------- 1 1001 1001 1.9G Feb 26 2016 uniprot20_2016_02.cs219
-rw------- 1 1001 1001 18 Feb 26 2016 uniprot20_2016_02.cs219.sizes
-rw------- 1 1001 1001 29G Feb 26 2016 uniprot20_2016_02_a3m.ffdata
-rw------- 1 1001 1001 190M Feb 26 2016 uniprot20_2016_02_a3m.ffindex
lrwxrwxrwx 1 1001 1001 28 Feb 26 2016 uniprot20_2016_02_a3m_db -> uniprot20_2016_02_a3m.ffdata
-rw------- 1 1001 1001 222M Feb 26 2016 uniprot20_2016_02_a3m_db.index
-rw------- 1 1001 1001 1.8G Feb 25 2016 uniprot20_2016_02_cs219.ffdata
-rw------- 1 1001 1001 181M Feb 25 2016 uniprot20_2016_02_cs219.ffindex
-rw------- 1 1001 1001 5.1G Feb 26 2016 uniprot20_2016_02_hhm.ffdata
-rw------- 1 1001 1001 2.9M Feb 26 2016 uniprot20_2016_02_hhm.ffindex
lrwxrwxrwx 1 1001 1001 28 Feb 26 2016 uniprot20_2016_02_hhm_db -> uniprot20_2016_02_hhm.ffdata
-rw------- 1 1001 1001 3.4M Feb 26 2016 uniprot20_2016_02_hhm_db.index
- RaptorX-Property or HH-suite encountered a problem with your sequence. Check whether you have the following RaptorX log files, as they should indicate what went wrong.
root@ea846151db35:/home/test/LRRpredictor_v1# ls -l results/gpa2/RaptorX-Property/gpa2.tgt_log*
total 16
-rw-rw-r-- 1 eliza eliza 285 Jan 12 23:08 gpa2.tgt_log1
-rw-rw-r-- 1 eliza eliza 6897 Jan 12 23:08 gpa2.tgt_log2
- The pre-compiled binaries provided by RaptorX-Property do not work on your architecture/OS. Try running the following:
root@ea846151db35:/home/test/LRRpredictor_v1# ${LRRpredictor_HOME}/RaptorX_Property_Fast/util/SS8_Predict/bin/bcnf_mpitp
The expected output of this internal binary looks similar to this:
I am 0
$JOB_ID 433 3
$JOB_ID bcast 433 3
If the binary does not work, it will report something like:
cannot execute binary file: Exec format error
This issue is still being addressed: these binaries cannot easily be recompiled, as they come from third-party software used by RaptorX. This will be solved in the near future.
- HH-suite is not working properly. Try the following:
root@ea846151db35:/home/test/LRRpredictor_v1# ${LRRpredictor_HOME}/hh-suite/install/bin/hhblits
If the help menu is printed, HH-suite should be fine. If an exec error appears, please reinstall HH-suite or rerun the setupAll.sh script.
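The file-system side of the checks above can be gathered into one small script. This is only a sketch under the assumption that the installation uses the paths shown in this section; the function name `check_installation` is illustrative and not part of LRRpredictor.

```python
import os


def check_installation(home=None):
    """Return a list of (description, passed) pairs for the paths used above.

    `home` defaults to the LRRpredictor_HOME environment variable.
    """
    home = home or os.environ.get("LRRpredictor_HOME", "")
    database = os.path.join(home, "RaptorX_Property_Fast", "databases", "uniprot20")
    hhblits = os.path.join(home, "hh-suite", "install", "bin", "hhblits")
    bcnf = os.path.join(home, "RaptorX_Property_Fast", "util",
                        "SS8_Predict", "bin", "bcnf_mpitp")
    return [
        ("LRRpredictor_HOME points to a directory",
         bool(home) and os.path.isdir(home)),
        ("uniprot20 link resolves to a directory", os.path.isdir(database)),
        ("hhblits is present and executable", os.access(hhblits, os.X_OK)),
        ("bcnf_mpitp is present and executable", os.access(bcnf, os.X_OK)),
    ]


if __name__ == "__main__":
    for description, passed in check_installation():
        print("OK  " if passed else "FAIL", description)
```

A file that exists and is marked executable can still fail with an Exec format error, so this sketch complements running the binaries as shown above and does not replace it.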
If you use LRRpredictor please cite:
[1] Wang, S.; Li, W.; Liu, S.; Xu, J. RaptorX-Property: a web server for protein structure property prediction. Nucleic Acids Res. 2016, 44, W430–W435.
[2] Wang, S.; Peng, J.; Ma, J.; Xu, J. Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Sci. Rep. 2016, 6, 1–11.
[3] Wang, S.; Ma, J.; Xu, J. AUCpreD: Proteome-level protein disorder prediction by AUC-maximized deep convolutional neural fields. In Proceedings of the Bioinformatics; Oxford University Press, 2016; Vol. 32, pp. i672–i679.
[4] Wang, S.; Sun, S.; Xu, J. AUC-maximized deep convolutional neural fields for protein sequence labeling. In Proceedings of the Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer Verlag, 2016; Vol. 9852 LNAI, pp. 1–16.
[5] Remmert, M.; Biegert, A.; Hauser, A.; Söding, J. HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 2012, 9, 173–175.
[6] Steinegger, M.; Meier, M.; Mirdita, M.; Vöhringer, H.; Haunsberger, S. J.; Söding, J. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 2019, 20, 473. doi: 10.1186/s12859-019-3019-7