AnnoTEP is a platform dedicated to the annotation of transposable elements (TEs) in plant genomes. Built on the Plant genome Annotation pipeline, it combines sophisticated annotation tools integrated with HTML resources to offer researchers an enhanced experience during the annotation process. By integrating these tools with a user-friendly interface, AnnoTEP aims to facilitate and optimize the work of TE annotation, providing an effective solution for plant genomic analysis.
AnnoTEP is currently available in three formats: a web server, a container with a graphical interface, and a container with a bash interface. The platform offers the following functions:
- Identification, validation and annotation of SINE and LINE elements
- Genome masking (local mode)
- Report generation on TEs
- Generation of graphs illustrating repeated elements
- Generation of age graphs for Gypsy and Copia elements
- Generation of LTR phylogeny and density graphs
AnnoTEP can be installed on your machine in several ways, one of which is Docker. The tool is available in two formats: with a graphical interface and without an interface (terminal mode). To follow the steps below, you need Docker installed on your machine; you can download it from the official Docker website.
Important: this version requires your machine to have internet access.
Open the terminal and run the following commands:
Step 1. Download the AnnoTEP image:
docker pull annotep/graphic-interface:v1
Step 2. Next, run the container with the command below, specifying a folder to store the annotation results on your machine:
docker run -it -v {folder-results}:/root/TEs/www/results -dp 0.0.0.0:5000:5000 annotep/graphic-interface:v1
- -v {folder-results}:/root/TEs/www/results: Creates a volume between the host and the container to store data. You can replace {folder-results} with any folder path on your machine; if the folder does not exist, Docker will create it. /root/TEs/www/results is the directory path inside the container and does not need to be changed.
- -dp 0.0.0.0:5000:5000: Maps the container's port 5000 to the host's port 5000.
- annotep/graphic-interface:v1: Specifies the image to be used.
docker run -it -v $HOME/results-annotep:/root/TEs/www/results -dp 0.0.0.0:5000:5000 annotep/graphic-interface:v1
Step 3. After running the container with the previous command, access the AnnoTEP interface by typing the following address into your web browser:
127.0.0.1:5000
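Before opening the browser, it can help to confirm that the container is already accepting connections on the mapped port. The following is a minimal sketch using only the Python standard library; the function name port_is_open and the timeout value are illustrative assumptions and are not part of AnnoTEP.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError when the connection is refused,
        # the host is unreachable, or the timeout expires
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_is_open("127.0.0.1", 5000) should return True once the container started in Step 2 is listening.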
Step 4. When you access 127.0.0.1:5000 you will see a version of the AnnoTEP platform similar to the WEB version.
- If you want to run tests, you can download the Arabidopsis thaliana (Chromosome 4) file AtChr4.fasta from the repository. Its SINE and LINE annotation can take about 5 minutes, and its complete annotation can take between 30 and 50 minutes if 10 threads are used.
- This version includes a field for the number of threads to be used. This option only applies to the complete annotation, and it is recommended to have at least 4 threads on your machine. The fewer the threads, the longer the analysis takes.
- The type of annotation and the results obtained are explained in the Results Container section.
Step 5. Within the interface you can enter your data (email, genome and annotation type) and submit it for analysis. When the job finishes without errors, you will receive an email informing you that the results are available in the directory given in -v {folder-results}.
Step 6. You can follow the progress of the annotation via the Docker logs.
- In the terminal, type docker ps.
- A list of active containers will appear. Copy the CONTAINER ID of the AnnoTEP image.
- With the ID copied, type docker logs {CONTAINER ID}.
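The container lookup can also be scripted. The sketch below assumes the text comes from docker ps --format "{{.ID}} {{.Image}}", which prints one "ID image" pair per line; the helper name find_container_ids and the sample IDs are made up for illustration.

```python
from typing import List


def find_container_ids(ps_output: str, image_prefix: str = "annotep/") -> List[str]:
    """Return the IDs of containers whose image name starts with image_prefix.

    ps_output is expected to be the text printed by:
        docker ps --format "{{.ID}} {{.Image}}"
    """
    ids = []
    for line in ps_output.splitlines():
        parts = line.split()
        # Each useful line has exactly two fields: container ID and image name
        if len(parts) == 2 and parts[1].startswith(image_prefix):
            ids.append(parts[0])
    return ids


# Made-up sample output, for illustration only
sample = "3f2a9c1b7d10 annotep/graphic-interface:v1\n9b8e7d6c5a41 postgres:16\n"
print(find_container_ids(sample))
```

Each returned ID can then be passed to docker logs {CONTAINER ID}.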
Important 2: Avoid shutting down the machine during the process, as this would interrupt the data analysis. Although the interface runs in a web browser, processing takes place locally on your machine.
Important 3: The speed of the annotation depends on the performance of your local machine.
Step 1. Download the AnnoTEP image:
docker pull annotep/bash-interface:v1
Step 2. Use the -h parameter to display a user guide describing how to use the script:
docker run annotep/bash-interface:v1 python run_annotep.py -h
- The following usage guide is displayed:
usage: run_annotep.py [-h] --file FILE --type {1,2,3,4} [--threads THREADS]
Run annotep with specified parameters.
optional arguments:
-h, --help show this help message and exit
--threads THREADS Number of threads used to complete annotation (default threads: 4).
This parameter does not need to be set for the other annotation types [1, 2, 3].
Required arguments:
--file FILE Genome file name (.fasta)
--type {1,2,3,4} Type annotation:
[1] SINE Annotation
[2] LINE Annotation
[3] SINE and LINE annotation
[4] Complete Annotation
- The type of annotation and the results obtained are explained in section Results Container
Step 3. To simplify this step, we recommend creating a folder to insert your genomic data in FASTA format. Once created, run the container using the command below as a guide. Make sure you provide the full path to the folder where you want to save the results, as well as the full path to the genomes folder:
docker run -it -v {folder-results}:/root/TEs/results -v {absolute-path-to-folder-genomes}:{absolute-path-to-folder-genomes} annotep/bash-interface:v1 python run_annotep.py --file {absolute-path-to-folder-genomes/genome.fasta} --type {type-annotation} --threads {optional}
- -v {folder-results}:/root/TEs/results: Creates a volume between the host and the container to store data. You can replace {folder-results} with any folder path on your machine where you want to save the results; if the folder does not exist, Docker will create it. /root/TEs/results is the directory path inside the container and does not need to be changed.
- -v {absolute-path-to-folder-genomes}:{absolute-path-to-folder-genomes}: Mounts the folder that stores the genomes inside the container at the same path, which is why you must enter the correct absolute path of that folder in {absolute-path-to-folder-genomes}.
- --file {absolute-path-to-folder-genomes/genome.fasta}: The absolute path of the genomes folder followed by the name of the genome file you want to annotate.
- --type {type-annotation}: The annotation type, as listed in Step 2.
- --threads {optional}: Optional parameter that sets the number of threads used by the complete annotation (type 4; default: 4). It is not needed for the other annotation types (1, 2, 3).
- If you want to run tests, you can download the Arabidopsis thaliana (Chromosome 4) file AtChr4.fasta from the repository. Its SINE and LINE annotation can take about 5 minutes, and its complete annotation can take between 30 and 50 minutes if 10 threads are used.
docker run -it -v $HOME/results-annotep:/root/TEs/results -v /home/user/TEs:/home/user/TEs annotep/bash-interface:v1 python run_annotep.py --file /home/user/TEs/AtChr4.fasta --type 2
docker run -it -v $HOME/results-annotep:/root/TEs/results -v /home/user/TEs:/home/user/TEs annotep/bash-interface:v1 python run_annotep.py --file /home/user/TEs/AtChr4.fasta --type 4 --threads 12
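The two example commands follow the same pattern, so they can be assembled programmatically. This is a sketch under stated assumptions: the function name build_annotep_command is hypothetical, and the only facts taken from this guide are the image name, the container results path, and the run_annotep.py options shown in Step 2.

```python
from pathlib import PurePosixPath
from typing import List, Optional

IMAGE = "annotep/bash-interface:v1"


def build_annotep_command(results_dir: str, genome: str, annotation_type: int,
                          threads: Optional[int] = None) -> List[str]:
    """Assemble the docker run argument list for the bash interface."""
    if annotation_type not in (1, 2, 3, 4):
        raise ValueError("annotation type must be 1, 2, 3 or 4")
    genome_path = PurePosixPath(genome)
    if not genome_path.is_absolute():
        raise ValueError("the genome path must be absolute")
    genomes_dir = str(genome_path.parent)
    cmd = [
        "docker", "run", "-it",
        "-v", f"{results_dir}:/root/TEs/results",
        # Mount the genomes folder at the same path inside the container
        "-v", f"{genomes_dir}:{genomes_dir}",
        IMAGE, "python", "run_annotep.py",
        "--file", str(genome_path),
        "--type", str(annotation_type),
    ]
    # --threads is only meaningful for the complete annotation (type 4)
    if threads is not None and annotation_type == 4:
        cmd += ["--threads", str(threads)]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True); building it as a list avoids shell quoting problems with paths that contain spaces.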
Step 4. Wait for the genome annotation to finish; you can follow the analysis in the terminal.
Each annotation type produces different results:
1. SINE annotation: Generates a folder named “SINE”, containing files in .fa format and alignment images.
2. LINE Annotation: Creates a folder named “LINE”, containing files in .fa and .gff3 formats.
3. Complete Annotation: Covers the generation of data for SINEs, LINEs, TIRs, Helitrons, among others. When performing this annotation, a folder called “complete-analysis” is created, containing several subfolders and files in .fa and .gff3 formats. Some of the subfolders include:
- {genome}.fasta.mod.EDTA.raw: Contains refined files from the SINE and LINE annotations, as well as the LTR, TIR and Helitrons annotations.
- TE-REPORT: Provides a general summary of the elements present in the genome and presents quantitative data on them.
- LTR-AGE: Analyzes the ages of the Gypsy and Copia superfamilies.
- TREE: Displays the phylogenetic trees of the LTR elements.
The Results section presents the additional data obtained from the complete annotation.
- The installation guide to be presented was adapted from Plant genome Annotation, with some modifications throughout the code.
- Plant Genome Annotation uses modified code from the AnnoSINE, MGEScan-non-LTR, TEsorter and EDTA pipelines.
- System: Ubuntu
After downloading the Miniconda installer, run it in the terminal:
bash Miniconda3-latest-Linux-x86_64.sh
Step 1. In the terminal run:
git clone https://github.com/Marcos-Fernando/AnnoTEP.git $HOME/TEs
Step 2. Access the repository location on the machine:
cd $HOME/TEs
Note: Pay attention to the name of the folder. In this guide, we use a folder named TEs. To make configuration easier, we recommend using this name.
Step 1. In the terminal download the following libraries:
sudo apt-get install libgdal-dev lib32z1 python-is-python3 python3-setuptools python3-biopython python3-xopen trf hmmer2 seqtk
sudo apt-get install hmmer emboss python3-virtualenv python2 python2-setuptools-whl python2-pip-whl cd-hit iqtree
sudo apt-get install python2-dev build-essential linux-generic libmpich-dev libopenmpi-dev bedtools pullseq bioperl
sudo apt-get install pdf2svg
# R dependencies
sudo apt-get install r-cran-ggplot2 r-cran-tidyr r-cran-reshape2 r-cran-reshape rs r-cran-viridis r-cran-tidyverse r-cran-gridextra r-cran-gdtools r-cran-phangorn r-cran-phytools r-cran-ggrepel
Access the R program from the terminal and install libraries from within it:
R
install.packages("hrbrthemes")
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ggtree")
BiocManager::install("ggtreeExtra")
- In the event of an error with BiocManager or the ggtree and ggtreeExtra packages, you can use another method:
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("YuLab-SMU/ggtree")
devtools::install_github("YuLab-SMU/ggtreeExtra")
Step 2. After installing the libraries, copy the irf and break_fasta.pl scripts to /usr/local/bin on your machine:
sudo cp Scripts/irf /usr/local/bin
sudo cp Scripts/break_fasta.pl /usr/local/bin
Step 3. Then configure the TEsorter:
cd $HOME/TEs/TEsorter
sudo python3 setup.py install
Check the version of Python on your machine to proceed with the configuration:
- Python 3.7
cd /usr/local/lib/python3.7/dist-packages/TEsorter-1.4.1-py3.7.egg/TEsorter/database/
- Python 3.10
cd /usr/local/lib/python3.10/dist-packages/TEsorter-1.4.1-py3.10.egg/TEsorter/database/
...
sudo hmmpress REXdb_v3_TIR.hmm
sudo hmmpress Yuan_and_Wessler.PNAS.TIR.hmm
sudo hmmpress REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
sudo hmmpress REXdb_protein_database_viridiplantae_v3.0.hmm
sudo hmmpress REXdb_protein_database_metazoa_v3.hmm
sudo hmmpress Kapitonov_et_al.GENE.LINE.hmm
sudo hmmpress GyDB2.hmm
sudo hmmpress AnnoSINE.hmm
cd $HOME/TEs
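Because the TEsorter database directory depends on the installed Python version, it can be located with a glob instead of being typed by hand. This is a minimal sketch that assumes the egg layout shown in the two paths above; the function name find_tesorter_databases is hypothetical.

```python
import glob
import os
from typing import List


def find_tesorter_databases(lib_root: str = "/usr/local/lib") -> List[str]:
    """List TEsorter database folders under any python3.* dist-packages in lib_root."""
    pattern = os.path.join(
        lib_root, "python3.*", "dist-packages", "TEsorter-*.egg", "TEsorter", "database"
    )
    # Keep directories only, in a stable order
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
```

An empty list suggests that TEsorter was installed somewhere else, or that the setup step did not complete.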
At this stage you can choose to use your data or download some examples for testing:
- Theobroma cacao
wget https://cocoa-genome-hub.southgreen.fr/sites/cocoa-genome-hub.southgreen.fr/files/download/Theobroma_cacao_pseudochromosome_v1.0_tot.fna.tar.gz
tar xvfz Theobroma_cacao_pseudochromosome_v1.0_tot.fna.tar.gz
mv Theobroma_cacao_pseudochromosome_v1.0_tot.fna Tcacao.fasta
rm Theobroma_cacao_pseudochromosome_v1.0_tot.fna.tar.gz
- Arabidopsis thaliana
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz
gzip -d TAIR10_chr_all.fas.gz
cat TAIR10_chr_all.fas | cut -f 1 -d" " > At.fasta
rm TAIR10_chr_all.fas
- If you can't download Arabidopsis thaliana automatically, you can download it directly from TAIR by clicking on TAIR10_chr_all.fas.gz and then following the commands above from the second line onwards.
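The cut -f 1 -d" " step in the Arabidopsis commands shortens every FASTA header to its first word, so that sequence IDs carry no description text. The same operation is sketched below in Python; the function name trim_fasta_headers and the sample header are illustrative assumptions.

```python
from typing import Iterable, Iterator


def trim_fasta_headers(lines: Iterable[str]) -> Iterator[str]:
    """Keep only the first space-delimited field of every line.

    Mirrors cut -f 1 -d" ": header lines lose their description, while
    sequence lines (which contain no spaces) pass through unchanged.
    """
    for line in lines:
        yield line.rstrip("\n").split(" ", 1)[0]


# Made-up header, for illustration only
for out in trim_fasta_headers([">Chr1 example description text\n", "ACGTACGT\n"]):
    print(out)
```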
Step 1. Create and activate the AnnoSINE conda environment:
cd SINE/AnnoSINE/
conda env create -f AnnoSINE.conda.yaml
cd bin
conda activate AnnoSINE
In this pipeline, we will be using HMMER version 3.4 due to a bug in version 3.3. We therefore need to configure the environment variables.
Step 1. Make sure vim is installed on your machine, then type in the terminal:
vim ~/.bashrc
The file will open in the editor. Move to the last line of the document, press i to enter insert mode, and add the PATH commands:
export PATH="$HOME/miniconda3/envs/AnnoSINE/bin:$PATH";
export PATH="$HOME/TEs/SINE/AnnoSINE/hmmer-develop:$PATH"
export PATH="$HOME/TEs/SINE/AnnoSINE/hmmer-develop/src:$PATH"
export PATH="$HOME/TEs/SINE/AnnoSINE/hmmer-develop/bin:$PATH"
When finished, press ESC to leave insert mode, type :wq and press ENTER to save the changes and close the document.
After making the changes, close the terminal and open it again.
Step 2. Apply the changes and activate the environment:
source ~/.bashrc
conda activate AnnoSINE
Check that the current version is 3.4:
hmmsearch -h
If everything is correct, we can continue. If not, check the environment variables.
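The version check can be automated by reading the banner that hmmsearch -h prints. The sketch below assumes the banner contains a line of the form "# HMMER 3.4 (...)"; the function name parse_hmmer_version is hypothetical, and the date in the sample banner is illustrative only.

```python
import re
from typing import Optional


def parse_hmmer_version(help_text: str) -> Optional[str]:
    """Extract the version number from a '# HMMER <version> ...' banner line."""
    match = re.search(r"^#\s*HMMER\s+(\d+(?:\.\d+)*)", help_text, flags=re.MULTILINE)
    return match.group(1) if match else None


# Illustrative banner text; the date shown is an assumption
banner = "# hmmsearch :: search profile(s) against a sequence database\n# HMMER 3.4 (Aug 2023); http://hmmer.org/\n"
print(parse_hmmer_version(banner))
```

In practice the text would come from subprocess.run(["hmmsearch", "-h"], capture_output=True, text=True).stdout, and any result other than "3.4" points to the environment variables needing another look.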
Step 3. Configuring HMMER:
cd ..
cd hmmer-develop
make clean
./configure
make -j
Step 1. Run the test data (chromosome 4 of A. thaliana) to verify the installation:
python3 AnnoSINE.py 3 ../AtChr4.fasta ../Output_Files
- A file 'Seed_SINE.fa' will be created in '../Output_Files'. This file contains all the predicted SINE elements and will be used in the next steps.
We are now ready to annotate the SINE elements of your genome project file.
Step 2. In this example we annotate the A. thaliana genome downloaded earlier; replace At.fasta with your own data if you prefer:
python3 AnnoSINE.py 3 $HOME/TEs/At.fasta At
cp ./At/Seed_SINE.fa $HOME/TEs/At-Seed_SINE.fa
- Deactivate the environment
conda deactivate
cd $HOME/TEs
Step 1. Enter the Non-LTR folder and create a virtual environment
cd non-LTR/mgescan/
virtualenv -p /usr/bin/python2 mgescan-virtualenv
source mgescan-virtualenv/bin/activate
pip2 install biopython==1.76
pip2 install bcbio-gff==0.6.6
pip2 install docopt==0.6.1
python setup.py install
Follow the instructions on the installer screens. If you are unsure about any settings, accept the defaults.
mgescan is now installed and ready to work. Test the installation:
mgescan --help
MGEScan uses HMMER version 3.2, so we need to configure the environment variables again.
Step 1. In the terminal type:
vim ~/.bashrc
The file will open in the editor; add the following lines:
export PATH="$HOME/miniconda3/envs/AnnoSINE/bin:$PATH";
export PATH="$HOME/miniconda3/envs/EDTA/bin:$PATH";
export PATH="$HOME/TEs/non-LTR/hmmer-3.2/src/:$PATH";
When finished, press ESC to leave insert mode, type :wq and press ENTER to save the changes and close the document.
After making the changes, close the terminal and open it again.
source ~/.bashrc
Step 2. In the terminal, run (only once):
cd ..
cd hmmer-3.2
make clean
./configure
make -j
Now we can run MGEScan-non-LTR. First, set up the directories in the terminal:
cd $HOME/TEs/non-LTR
mkdir At-LINE
cd At-LINE
ln -s $HOME/TEs/At.fasta At.fasta
cd ..
# Raise the limit on open files (ulimit)
ulimit -n 8192
Step 3. Run MGEScan-non-LTR
mgescan nonltr $HOME/TEs/non-LTR/At-LINE --output=$HOME/TEs/non-LTR/At-LINE-results --mpi=4
Step 4. Remove false positives with TEsorter and generate the pre-final non-redundant LINE library in a format compatible with the modified EDTA pipeline:
cd At-LINE-results
Step 5. Run the following command to generate the non-redundant LINE-lib.fa file:
cat info/full/*/*.dna > temp.fa
cat temp.fa | grep \> | sed 's#>#cat ./info/nonltr.gff3 | grep "#g' | sed 's#$#" | cut -f 1,4,5#g' > ver.sh
bash ver.sh | sed 's#\t#:#' | sed 's#\t#\.\.#' > list.txt
mkdir TMP
break_fasta.pl < temp.fa TMP/
cat temp.fa | grep \> | sed 's#>#cat ./TMP/#g' | sed 's#$#.fasta#g' > A.txt
cat temp.fa | grep \> > list2.txt
paste list2.txt list.txt | sed 's/>/ sed "s#/g' | sed 's/\t/#/g' | sed 's/$/#g"/g' > B.txt
paste A.txt B.txt -d"|" > rename.sh
bash rename.sh > candidates.fa
/usr/local/bin/TEsorter -db rexdb-plant --hmm-database rexdb-plant -pre LINE -p 22 -cov 80 -eval 0.0001 -rule 80-80-80 candidates.fa
more LINE.cls.lib | sed 's/#/__/g' | sed 's#.fa##g' | cut -f 1 -d" " | sed 's#/#-#g' > pre1.fa
mkdir pre1
break_fasta.pl < pre1.fa pre1
cat pre1/*LINE.fasta | sed 's#__#\t#g' | cut -f 1 > pre2.fa
/usr/local/bin/TEsorter -db rexdb-line --hmm-database rexdb-line -pre LINE2 -p 22 -cov 60 -eval 0.0001 -rule 80-80-80 pre2.fa
more LINE2.cls.lib | sed 's/#/__/g' | sed 's#.fa##g' | cut -f 1 -d" " | sed 's#/#-#g' > pre-final.fa
mkdir pre-final
break_fasta.pl < pre-final.fa pre-final
cat pre-final/*LINE*.fasta > pre-final2.fa
cdhit-est -i pre-final2.fa -o clustered -c 0.8 -G 1 -T 22 -d 100 -s 0.6 -aL 0.6 -aS 0.6
cat clustered | sed 's/__/#/g' | sed 's#-#/#g' > LINE-lib.fa
#
rm -rf pre1/ pre-final/ TMP/
rm LINE2*
rm LINE.cls.*
rm A.txt B.txt clustered.clstr clustered LINE.dom* list2.txt list.txt pre1.fa pre2.fa pre-final2.fa pre-final.fa rename.sh temp.fa ver.sh candidates.fa
cp LINE-lib.fa $HOME/TEs/At-LINE-lib.fa
- Deactivate the environment and return to the pipeline root folder:
deactivate
cd $HOME/TEs
Step 1. Install and activate the EDTA conda environment:
cd EDTA
bash
conda env create -f EDTA.yml
conda activate EDTA
perl EDTA.pl
- In some cases the RunCmdsMP.py file is not added to the EDTA environment. To avoid later errors, it is recommended to copy it manually into the environment folder:
sudo cp $HOME/TEs/Scripts/RunCmdsMP.py $HOME/miniconda3/envs/EDTA/lib/python3.6/site-packages/
Step 2. Now let's use the At-LINE-lib.fa and At-Seed_SINE.fa files generated in the previous steps:
cd ..
mkdir Athaliana
cd Athaliana
nohup $HOME/TEs/EDTA/EDTA.pl --genome ../At.fasta --species others --step all --line ../At-LINE-lib.fa --sine ../At-Seed_SINE.fa --sensitive 1 --anno 1 --threads 10 > EDTA.log 2>&1 &
Step 3. Track the progress with:
tail -f EDTA.log
Notes:
1. Set the number of threads to the maximum available on your computer or server. In our example it is set to 10.
2. For more accurate TE detection and annotation, activate the "sensitive" flag. This will activate the RepeatModeler to identify remaining TEs and other repeats. The RepeatModeler step will also generate the Superfamily and Lineage TE classification and can capture other unknown LINEs and repeats. Our modified EDTA pipeline will do this automatically. This step is strongly recommended.
3. The SINE and LINE structural annotations are available in the $genome.EDTA.raw folder. Look for SINE.intact.fa, SINE.intact.gff3, LINE.intact.fa and LINE.intact.gff3
4. The final LINE library is embedded in the TElib.fa file. So if you want to recover all the LINEs, use this file.
Generally, non-autonomous elements can carry passenger genes (for example, non-autonomous LARDs and Helitrons). Therefore, for proper annotation of the genome, these elements must be partially masked. The modified EDTA pipeline will take care of this automatically and generate a suitably masked genome sequence for structural gene annotation. The softmasked genome sequence is available in the EDTA folder, with the name $genome-Softmasked.fa .
Still in the EDTA environment run:
cd $HOME/TEs
cd Athaliana
mkdir TE-REPORT
cd TE-REPORT
ln -s ../At.fasta.mod.EDTA.anno/At.fasta.mod.cat.gz .
perl $HOME/TEs/ProcessRepeats/ProcessRepeats-complete.pl -species viridiplantae -nolow -noint At.fasta.mod.cat.gz
mv At.fasta.mod.tbl TEs-Report-Complete.txt
perl $HOME/TEs/ProcessRepeats/ProcessRepeats-lite.pl -species viridiplantae -nolow -noint -a At.fasta.mod.cat.gz
mv At.fasta.mod.tbl TEs-Report-Lite.txt
The results obtained are TEs-Report-Complete.txt and TEs-Report-Lite.txt.
TEs-Report-Complete.txt presents a table with the classifications of the transposable elements; partial elements are named with the suffix “-like” (e.g. Angela-like).
TEs-Report-Lite.txt is a similar but simpler report.
Continuing in the TE-REPORT folder, we will generate the graphs using TEs-Report-Lite.txt as a base.
cat TEs-Report-Lite.txt | grep "%" | cut -f 2 -d":" | awk '{print $1}' > count.txt
cat TEs-Report-Lite.txt | grep "%" | cut -f 2 -d":" | awk '{print $2}' > bp.txt
cat TEs-Report-Lite.txt | grep "%" | cut -f 2 -d":" | awk '{print $4}' > percentage.txt
cat TEs-Report-Lite.txt | grep "%" | cut -f 1 -d":" | sed 's# ##g' | sed 's#-##g' | sed 's#|##g' > names.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w NonLTR > plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w LTRNonauto | sed 's#LTRNonauto#LTR_nonauto#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w "LTR/Copia" >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w "LTR/Gypsy" >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w "Pararetrovirus" >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w "ClassIUnknown" | sed 's#ClassIUnknown#Class_I_Unknown#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w "TIRs" >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w "ClassIIUnknown" | sed 's#ClassIIUnknown#Class_II_Unknown#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w "Unclassified" >> plot.txt
echo "Type Number length percentage" > header.txt
cat header.txt plot.txt > plot1.txt
python $HOME/TEs/Scripts/plot_TEs.py
mv TE-Report.pdf TE-Report1.pdf
pdf2svg TE-Report1.pdf TE-Report1.svg
python $HOME/TEs/Scripts/plot_TEs-bubble.py
mv TE-Report.pdf TE-Report1-bubble.pdf
pdf2svg TE-Report1-bubble.pdf TE-Report1-bubble.svg
paste names.txt count.txt bp.txt percentage.txt | grep -w SINEs > plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w LINEs >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w LARDs >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w TRIMs >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w TR_GAG >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w BARE2 >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Ale >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Alesia >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Angela >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Bianca >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Bryco >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Lyco >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w GymcoI >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w GymcoII >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w GymcoIII >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w GymcoIV >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Ikeros >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Ivana >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Osser >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w SIRE >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w TAR >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Tork >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Ty1outgroup | sed 's#Ty1outgroup#Ty1-outgroup#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Phygy >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Selgy >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w OTA >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w OTAAthila | sed 's#OTAAthila#Athila#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w OTATatI | sed 's#OTATatI#TatI#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w OTATatII | sed 's#OTATatII#TatII#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w OTATatIII | sed 's#OTATatIII#TatIII#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w OTATatOgre | sed 's#OTATatOgre#Ogre#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w OTATatRetand | sed 's#OTATatRetand#Retand#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Chlamyvir >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Tcn1 >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w CRM >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Galadriel >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Tekay >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w Reina >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w MITE >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w EnSpm_CACTA | sed 's#EnSpm_CACTA#CACTA#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w hAT >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w MuDR_Mutator | sed 's#MuDR_Mutator#MuDR#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w PIF_Harbinger | sed 's#PIF_Harbinger#Harbinger#g' >> plot.txt
paste names.txt count.txt bp.txt percentage.txt | grep -w "RC/Helitron" | sed 's#RC/Helitron#Helitron#g' >> plot.txt
cat header.txt plot.txt > plot1.txt
python $HOME/TEs/Scripts/plot_TEs.py
mv TE-Report.pdf TE-Report2.pdf
pdf2svg TE-Report2.pdf TE-Report2.svg
python $HOME/TEs/Scripts/plot_TEs-bubble.py
mv TE-Report.pdf TE-Report2-bubble.pdf
pdf2svg TE-Report2-bubble.pdf TE-Report2-bubble.svg
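The data obtained will be the files TE-Report1, TE-Report1-bubble, TE-Report2 and TE-Report2-bubble, each in .pdf and .svg format.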
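The cut, awk and sed calls above split every report line that contains "%" into a name, a count, a length in bp and a percentage. The same parsing is sketched here in Python for readers who want to inspect the values directly. The line layout ("name: count length bp percentage %") is inferred from those shell commands, and both the function name parse_report_line and the sample line are assumptions.

```python
from typing import Optional, Tuple


def parse_report_line(line: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (name, count, bp, percentage) for a report line containing '%'."""
    if "%" not in line or ":" not in line:
        return None
    fields = line.split(":")
    name_part, value_part = fields[0], fields[1]
    # Same clean-up as the sed calls: drop spaces, '-' and '|'
    name = name_part.replace(" ", "").replace("-", "").replace("|", "")
    values = value_part.split()
    if len(values) < 4:
        return None
    # values holds: count, length, the literal 'bp', percentage
    return name, values[0], values[1], values[3]


# Made-up line, for illustration only
print(parse_report_line("    LINEs:   10   5000 bp   1.25 %"))
```

Note that the name clean-up turns a label such as Ty1-outgroup into Ty1outgroup, which is why the grep patterns above search for the hyphen-free form and restore the hyphen afterwards.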
The repeat landscape graph is a reasonable inference of the relative ages of each element identified in a given genome. To create it we will use the file with the .align extension created after running ProcessRepeats-lite.pl.
In the terminal, run:
cd $HOME/TEs
cd Athaliana/TE-REPORT
cat At.fasta.mod.align | sed 's#TIR/.\+ #TIR &#g' | sed 's#DNA/Helitron.\+ #Helitron &#g' | sed 's#LTR/Copia.\+ #LTR/Copia &#g' | sed 's#LTR/Gypsy.\+ #LTR/Gypsy &#g' | sed 's#LINE-like#LINE#g' | sed 's#TR_GAG/Copia.\+ #LTR/Copia &#g' | sed 's#TR_GAG/Gypsy.\+ #LTR/Gypsy &#g' | sed 's#TRBARE-2/Copia.\+ #LTR/Copia &#g' | sed 's#BARE-2/Gypsy.\+ #LTR/Gypsy &#g' | sed 's#LINE/.\+ #LINE &#g' > tmp.txt
#
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "LTR/Copia" -A 5 | grep -v "\-\-" > align2.txt
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "LTR/Gypsy" -A 5 | grep -v "\-\-" >> align2.txt
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "TIR" -A 5 | grep -v "\-\-" >> align2.txt
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "LINE" -A 5 | grep -v "\-\-" >> align2.txt
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "LARD" -A 5 | grep -v "\-\-" >> align2.txt
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "TRIM" -A 5 | grep -v "\-\-" >> align2.txt
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "Helitron" -A 5 | grep -v "\-\-" >> align2.txt
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "SINE" -A 5 | grep -v "\-\-" >> align2.txt
cat tmp.txt | grep "^[0-9]" -B 6 | grep -v "\-\-" | grep "Unknown" -A 5 | grep -v "\-\-" >> align2.txt
#
perl $HOME/TEs/ProcessRepeats/calcDivergenceFromAlign.pl -s At.divsum align2.txt
genome_size="`perl $HOME/TEs/EDTA/util/count_base.pl ../At.fasta.mod | cut -f 2`"
perl $HOME/TEs/ProcessRepeats/createRepeatLandscape.pl -g $genome_size -div At.divsum > ../RepeatLandscape.html
tail -n 72 At.divsum > divsum.txt
cat $HOME/TEs/Rscripts/plotKimura.R | sed "s#_SIZE_GEN_#$genome_size#g" > plotKimura.R
Rscript plotKimura.R
mv Rplots.pdf RepeatLandScape.pdf
pdf2svg RepeatLandScape.pdf RLandScape.svg
rm align2.txt
rm tmp.txt
The graphics obtained will be RepeatLandScape.pdf and RLandScape.svg.
To plot the ages of the LTR Gypsy and LTR Copia elements, we will use a ggplot2 Rscript.
cd $HOME/TEs
cd Athaliana
mkdir LTR-AGE
cd LTR-AGE
ln -s ../At.fasta.mod.EDTA.raw/At.fasta.mod.LTR-AGE.pass.list .
ln -s $HOME/TEs/Rscripts/plot-AGE-Gypsy.R .
ln -s $HOME/TEs/Rscripts/plot-AGE-Copia.R .
cat -n At.fasta.mod.LTR-AGE.pass.list | grep Gypsy | cut -f 1,13 | sed 's# ##g' | sed 's#^#Cluster_#g' | awk '{if ($2 > 0) print $n}' > AGE-Gypsy.txt
cat -n At.fasta.mod.LTR-AGE.pass.list | grep Copia | cut -f 1,13 | sed 's# ##g' | sed 's#^#Cluster_#g' | awk '{if ($2 > 0) print $n}' > AGE-Copia.txt
Rscript plot-AGE-Gypsy.R
Rscript plot-AGE-Copia.R
pdf2svg AGE-Copia.pdf AGE-Copia.svg
pdf2svg AGE-Gypsy.pdf AGE-Gypsy.svg
The final files are AGE-Copia.pdf, AGE-Gypsy.pdf, AGE-Copia.svg and AGE-Gypsy.svg.
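The cat -n, grep, cut and awk chain above numbers the lines of the pass list, keeps one superfamily, and retains only entries whose age (field 13 after numbering, that is, column 12 of the file) is greater than zero. A Python sketch of the same selection follows; the function name select_ages is hypothetical and the column position is taken from the shell commands, not from a file specification.

```python
from typing import Iterable, List, Tuple


def select_ages(lines: Iterable[str], superfamily: str) -> List[Tuple[str, float]]:
    """Return (Cluster_<line number>, age) pairs with age > 0 for one superfamily."""
    selected = []
    for number, line in enumerate(lines, start=1):  # cat -n numbers lines from 1
        if superfamily not in line:                 # grep Gypsy / grep Copia
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 12:
            continue
        try:
            age = float(columns[11])  # field 13 after cat -n is column 12 of the file
        except ValueError:
            continue                  # skip header or non-numeric entries
        if age > 0:
            selected.append((f"Cluster_{number}", age))
    return selected
```

Because the cluster label is derived from the line number in the full list, the labels of Gypsy and Copia entries never collide.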
Plotting the phylogeny of the alignments of all the LTR-RT domains.
cd $HOME/TEs
cd Athaliana
mkdir TREE
cd TREE
ln -s ../At.fasta.mod.EDTA.TElib.fa .
cat At.fasta.mod.EDTA.TElib.fa | sed 's/#/_CERC_/g' | sed 's#/#_BARRA_#g' > tmp.txt
mkdir tmp
break_fasta.pl < tmp.txt ./tmp
cat tmp/*LTR* | sed 's#_CERC_#\t#g' | cut -f 1 > TE.fasta
rm -f tmp.txt ; rm -f At.fasta.mod.EDTA.TElib.fa ; rm -Rf tmp
/usr/local/bin/TEsorter -db rexdb-plant --hmm-database rexdb-plant -pre TE -dp2 -p 40 TE.fasta
concatenate_domains.py TE.cls.pep GAG > GAG.aln
concatenate_domains.py TE.cls.pep PROT > PROT.aln
concatenate_domains.py TE.cls.pep RH > RH.aln
concatenate_domains.py TE.cls.pep RT > RT.aln
concatenate_domains.py TE.cls.pep INT > INT.aln
cat GAG.aln | cut -f 1 -d" " > GAG.fas
cat PROT.aln | cut -f 1 -d" " > PROT.fas
cat RH.aln | cut -f 1 -d" " > RH.fas
cat RT.aln | cut -f 1 -d" " > RT.fas
cat INT.aln | cut -f 1 -d" " > INT.fas
perl $HOME/TEs/Scripts/catfasta2phyml.pl -c -f *.fas > all.fas
iqtree2 -s all.fas -alrt 1000 -bb 1000 -nt AUTO
cat TE.cls.tsv | cut -f 1 | sed "s#^#cat ../At.fasta.mod.EDTA.TEanno.sum | grep -w \"#g" | sed 's#$#"#g' > pick-occur.sh
bash pick-occur.sh > occur.txt
cat occur.txt | sed 's#^ TE_#TE_#g' | awk '{print $1,$2,$3}' | sed 's# #\t#g' | sort -k 2 -V > sort_occur.txt
cat occur.txt | sed 's#^ TE_#TE_#g' | awk '{print $1,$2,$3}' | sed 's# #\t#g' | sort -k 3 -V > sort_size.txt
cat all.fas | grep \> | sed 's#^>##g' > ids.txt
cat sort_occur.txt | cut -f 1,2 | sed 's#^#id="#g' | sed 's#\t#" ; data="#g' | sed 's#$#" ; ver="`cat ids.txt | grep $id`" ; echo -e "$ver\\t$data" #g' > pick.sh
bash pick.sh | grep "^TE" | grep "^TE" | sed 's/#/_/g' | sed 's#/#_#g' > occurrences.tsv
cat sort_size.txt | cut -f 1,3 | sed 's#^#id="#g' | sed 's#\t#" ; data="#g' | sed 's#$#" ; ver="`cat ids.txt | grep $id`" ; echo -e "$ver\\t$data" #g' > pick.sh
bash pick.sh | grep "^TE" | grep "^TE" | sed 's/#/_/g' | sed 's#/#_#g' > size.tsv
rm -f pick-occur.sh sort_occur.txt sort_size.txt ids.txt pick.sh
ln -s $HOME/TEs/Rscripts/LTR_tree.R .
ln -s $HOME/TEs/Rscripts/LTR_tree-density.R .
ln -s $HOME/TEs/Rscripts/LTR_tree_rec_1.R .
ln -s $HOME/TEs/Rscripts/LTR_tree_rec_2.R .
Rscript LTR_tree.R all.fas.contree TE.cls.tsv LTR_RT-Tree1.pdf
Rscript LTR_tree-density.R all.fas.contree TE.cls.tsv occurrences.tsv size.tsv LTR_RT-Tree2.pdf
Rscript LTR_tree_rec_1.R all.fas.contree TE.cls.tsv LTR_RT-Tree3.pdf
Rscript LTR_tree_rec_2.R all.fas.contree TE.cls.tsv LTR_RT-Tree4.pdf
pdf2svg LTR_RT-Tree1.pdf LTR_RT-Tree1.svg
pdf2svg LTR_RT-Tree2.pdf LTR_RT-Tree2.svg
pdf2svg LTR_RT-Tree3.pdf LTR_RT-Tree3.svg
pdf2svg LTR_RT-Tree4.pdf LTR_RT-Tree4.svg
The files generated will be LTR_RT-Tree1.pdf, LTR_RT-Tree2.pdf, LTR_RT-Tree3.pdf, LTR_RT-Tree4.pdf, LTR_RT-Tree1.svg, LTR_RT-Tree2.svg, LTR_RT-Tree3.svg and LTR_RT-Tree4.svg.
- The outer circle (purple) represents the length (in bp) occupied by each element, while the inner circle (red) represents the number of occurrences of each element.
Step 1. Access the graphic-interface folder and create a Python virtual environment by running the following commands in your terminal. Make sure you have completed the environment setup before proceeding.
python -m venv .venv
. .venv/bin/activate
Important 4: If you cloned the git repository to a directory different from the recommended one, which is $HOME/TEs, you will need to adjust some lines of code to avoid potential issues.
Follow these steps:
1. Adjust the main.py file:
- Inside the graphic-interface folder, locate the file named main.py.
- Open the file and find the following line of code:
UPLOAD_FOLDER = os.path.join(os.environ['HOME'], 'TEs')
- Modify this line to reflect the directory where you installed the repository. For example:
UPLOAD_FOLDER = {folder installation location}
#or
UPLOAD_FOLDER = os.path.join(os.environ['HOME'], 'new_directory')
- Replace "new_directory" with the correct path of the folder where the repository was cloned.
2. Adjust the annotation.py file:
- Still in the graphic-interface folder, go to the extensions subfolder and locate the annotation.py file.
- Repeat the same process: find the line of code that defines the UPLOAD_FOLDER path:
UPLOAD_FOLDER = os.path.join(os.environ['HOME'], 'TEs')
- Change this line to the new directory where the repository was installed, just like in the previous example:
UPLOAD_FOLDER = {folder installation location}
#or
UPLOAD_FOLDER = os.path.join(os.environ['HOME'], 'new_directory')
By following these steps, the system will correctly recognize the new installation path, preventing any errors during processing.
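An alternative to editing the path by hand in each file is to resolve it once from the environment. The sketch below is only an illustration of that idea: ANNOTEP_HOME is a hypothetical variable that AnnoTEP does not read, and `resolve_upload_folder` is a name invented here. Without the variable, the function falls back to the recommended $HOME/TEs.

```python
import os


def resolve_upload_folder(env=os.environ):
    """Return the directory where the repository was cloned.

    ANNOTEP_HOME is a hypothetical override, assumed for this sketch only.
    """
    custom = env.get("ANNOTEP_HOME")
    if custom:
        return os.path.abspath(custom)
    # Default location recommended by the installation steps.
    return os.path.join(env["HOME"], "TEs")
```

With a helper like this, the assignment in main.py, annotation.py or run_annotep.py could read `UPLOAD_FOLDER = resolve_upload_folder()`, so a non-default location would be set in one place.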
Step 2: Install the packages needed for the application by running the following command (this only needs to be done once):
pip install -r required.txt
- Inside the required.txt file, you'll find the fundamental libraries, such as Flask and python-dotenv. If any package shows an error, you'll need to install it manually.
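To find out which packages still need a manual install, the requirement lines can be compared with what is present in the active virtual environment. This is a minimal sketch using the standard library; the function names are assumptions, and the parser only handles simple `name==version` style lines.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def requirement_names(lines):
    """Extract bare package names, skipping comments, blank lines and pip options."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        # Keep the leading name; drop version specifiers, extras and markers.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(lines):
    """Return the required package names that are not installed."""
    missing = []
    for name in requirement_names(lines):
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing
```

For example, `missing_packages(open("required.txt"))` run inside the activated environment lists the names to pass to `pip install` one by one.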
Step 3: Now, inside the "graphic-interface" folder and with the virtual environment activated, run the following command to start the application:
flask run
If all the settings are correct, you will see a message similar to this one:
* Serving Flask app 'main.py' (lazy loading)
* Environment: development
* Debug mode: on
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: 264-075-516
Step 4. Click on the link http://127.0.0.1:5000/ or copy and paste it into your browser to access the platform and start testing it.
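When the application is started from a script rather than by hand, it can be useful to wait until the address above answers before opening the browser. The following is a small sketch with the standard library; the function name and the timeout values are assumptions, not part of AnnoTEP.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://127.0.0.1:5000/", timeout=30.0, interval=1.0):
    """Poll `url` until it answers or `timeout` seconds pass; return True on success."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status.
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet: wait and try again.
            time.sleep(interval)
    return False
```

A return value of False after the timeout usually means `flask run` did not start or is listening on a different port.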
- The type of annotation and the results obtained are explained in section Results Container
- This mode is entirely command-line based, so there's no need to create a development environment. Make sure you have done the environment setup before proceeding.
- Go to the bash-interface folder
Important 5: Just like in the graphic-interface folder, if you cloned the git repository to a directory different from the suggested one, which is $HOME/TEs, you will need to adjust some lines of code to avoid potential issues.
Follow these steps:
- Inside the bash-interface folder, locate the file named run_annotep.py.
- Open the file and find the following line of code:
UPLOAD_FOLDER = os.path.join(os.environ['HOME'], 'TEs')
- Modify this line to reflect the directory where you installed the repository. For example:
UPLOAD_FOLDER = {folder installation location}
#or
UPLOAD_FOLDER = os.path.join(os.environ['HOME'], 'new_directory')
- Replace "new_directory" with the correct path of the folder where the repository was cloned.
By following these steps, the system will correctly recognize the new installation path, preventing any errors during processing.
Step 1. Go to the "local" folder and run the run_annotep.py script by typing the following command:
python run_annotep.py -h
- The -h parameter displays a user guide describing how to use the script:
usage: run_annotep.py [-h] --file FILE --type {1,2,3,4} [--threads THREADS]
Run annotep with specified parameters.
optional arguments:
-h, --help show this help message and exit
--threads THREADS Number of threads used to complete annotation (default threads: 4).
This parameter does not need to be set for the other annotation types [1, 2, 3].
required arguments:
--file FILE Genome file name (.fasta)
--type {1,2,3,4} Type annotation:
[1] SINE Annotation
[2] LINE Annotation
[3] SINE and LINE annotation
[4] Complete Annotation
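For readers who want to see how an interface like this is typically declared, the sketch below reconstructs a parser from the help text using argparse. It is an assumption about the shape of the interface, not the actual source of run_annotep.py; only the option names, choices and the default of 4 threads come from the help output.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="run_annotep.py",
        description="Run annotep with specified parameters.",
    )
    parser.add_argument(
        "--threads", type=int, default=4,
        help="Number of threads used to complete annotation (default threads: 4).",
    )
    required = parser.add_argument_group("required arguments")
    required.add_argument("--file", required=True, help="Genome file name (.fasta)")
    required.add_argument(
        "--type", type=int, choices=[1, 2, 3, 4], required=True,
        help="1: SINE, 2: LINE, 3: SINE and LINE, 4: complete annotation",
    )
    return parser
```

With this declaration, a `--type` value outside 1 to 4 or a missing `--file` makes argparse print the usage line and exit.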
Step 2: Run the command with the absolute path to the genome file and the type of annotation you want:
python run_annotep.py --file {absolute-path-to-folder-genomes}/genome.fasta --type number
python run_annotep.py --file /home/user/TEs/At.fasta --type 2
python run_annotep.py --file $HOME/TEs/At.fasta --type 4 --threads 10
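When several genomes have to be annotated, the same command line can be assembled programmatically. The helper below is a hypothetical sketch: `build_command` is a name invented here, and it only uses the `--file`, `--type` and `--threads` options shown in the help text.

```python
import os
import sys


def build_command(genome, annotation_type, threads=None, script="run_annotep.py"):
    """Return the argument list for one run_annotep.py invocation."""
    if annotation_type not in (1, 2, 3, 4):
        raise ValueError("annotation_type must be 1, 2, 3 or 4")
    command = [
        sys.executable, script,
        # The script expects an absolute path to the genome file.
        "--file", os.path.abspath(os.path.expanduser(genome)),
        "--type", str(annotation_type),
    ]
    # According to the help text, threads only matter for the complete annotation.
    if threads is not None:
        command += ["--threads", str(threads)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the folder that contains run_annotep.py.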
- The type of annotation and the results obtained are explained in section Results Container