Current annotation pipeline of the International Weed Genomics Consortium (IWGC)
Authors: Dr. Nathan D. Hall (Developer), Nicholas A. Johnson (Documentation), and Dr. Eric L. Patterson (P.I.)
GitHub Repo: IWGC_annotation_pipeline
This repository was established to document the genome annotation methods used by the International Weed Genomics Consortium and in the article "Subtelomeric 5-enolpyruvylshikimate-3-phosphate synthase copy number variation confers glyphosate resistance in Eleusine indica" and to also provide a publicly available pipeline or framework for computational genome annotation. The current state of the pipeline works well for the Patterson Lab but will require a substantial time investment for tweaking of small things like directory paths of files or programs and subsequent troubleshooting during the initial setup. It should also be noted that for functional annotation the MetaCyc database requires a usage license (conditionally free for academic use) that will be required for you to obtain or you will have to omit these parts to use the pipeline. We hope at the very least this documentation will provide researchers with a reference for designing their own genome annotation pipeline. We may attempt to release a more distributable version in the future, but there are no set plans currently.
-
Repeat regions are annotated using RepeatModeler (for program versions see Dependencies).
-
Annotated repeat regions are masked with RepeatMasker to reduce the computational burden of further analysis.
-
bedtools is used to soft mask the genome using the output of RepeatMasker.
- IsoSeq reads are mapped to the repeat-masked genome using minimap2.
- SAMtools is used to convert the minimap2 alignment SAM file into a BAM file.
- Isoforms are collapsed with Cupcake.
- Genome, collapsed cDNA from Cupcake, repeat libraries from RepeatModeler, and a protein FASTA from a close relative species are fed into MAKER.
- Genes that produce proteins under 27 amino acids long are removed from further annotation with only the longest proteins from each gene and unique untranslated regions (UTRs) used for functional annotation.
-
Isoforms are selected using AGAT which selects the longest isoforms with longest intact open read frame.
-
Isoforms are then extracted from GFF3 format using using gffread
-
These predicted proteins are used for all downstream analysis.
- MultiLoc2 Workstation Edition (ML2) is used predict protein location within the cell. Briefly, ML2 uses a trained machine learning classifer to predict protein location. For more on this program see the excellent work by Blum et al. (2009).
-
InterPro is run locally using iprscan5 version 5.47-82.0.
-
It uses several programs to predict InterPro Accessions which can be broken down into the following categories.
- Sites
- Conserved Site
- Active Site
- Binding Site
- PTM
- Repeats
- Domains
- Families
- Homologous SuperFamilies
- Sites
-
For more information see release notes here.
-
IPRSCAN Results provide MetaCyc accessions but not descriptions. To obtain descriptions a custom python script was used to extract Pathway IDs and link them to their descriptions. MetaCyc is a proprietary database that is made conditionally available to academic researchers free of charge.
*This search is for the best hit in each database using MMSeqs2. Which is an extremley fast method of searching that can be up to 36 times faster than BLAST (Steinegger and Söding 2017).
- UniRef50 is a clustered database that is regularly updated. See work by Suzek et al. (2015).
- We use UniRef_50 version 2021_03 and downloaded representative plants from the same release to add descriptions.
Cluster information and UniRef IDs are reported
All available proteins from fully sequenced plants were downloaded May 2021 and put into a MMeqs2 database using NCBI Datasets and MMSeqs2 tool
Herbicide interacting genes were identified from the literature and mapped to KEGG Ortholog. UniRef50 IDs were converted to KEGG Orthologs and cross matched with herbicide interacting genes. Matches were flagged and Annotated with HRAC and WSSA classifiers.
The pipeline runs all analysis seperately, but a series of python scripts is used to collect, filter and and format the data. Results are reported in JSON format which is ideal for NOSQL databases and queries. Given the number of Null values or missing values this format is ideal. The results are gappy because not every analysis returns a result for each isoform. This is expected. Finally, GFF3 is produced that contains Ontology Terms, and Notes.
The below program version numbers are the exact versions used by the current IWGC_annotation_pipeline. Different versions may also be compatible but no other versions have been verified to be compatible with this pipeline by the PattersonWeedLab.
- RepeatModeler (version 2.0.2)
- h5py (version 2.9.0)
- RepeatMasker (version 4.1.2)
- bedtools (version 2.30.0)
- minimap2 (version 2.24)
- SAMtools (version 1.11)
- Cupcake (version 28.0)
- gffread (version 0.12.7)
- MAKER (version 3.01.04)
- gff3 (version 1.0.1)
- BCBio (version 0.7.0)
- gfftools (version 0.0.4)
- AGAT (version 0.8.0)
- gffread (version 0.12.7)
- InterProScan5 (version 5.47-82.0)
- MMseqs2 (version 4.1)
- MultiLoc2 Workstation Edition (version 2-1)
Genomes are first structurally annotated in terms of repeat regions and gene models. Structural annotation is then used to functionally annotate gene models.
First build a database for RepeatModeler.
BuildDatabase -name Species_Name genome.fasta
Next use repeat database from last step to model repeats. -pa 1 uses 4 threads, -pa 25 uses 100 threads. Adjust accordingly.
RepeatModeler -database Species_Name -pa 25 -LTRStruct
Output from RepeatModeler used as input for RepeatMasker.
RepeatMasker -gff -a -pa 20 -u -lib RepeatModeler_output-families.fa genome.fasta
May need to load module h5py if the above fails.
Output from RepeatMasker used to soft mask repeat regions in genome with bedtools.
bedtools maskfasta -fi genome.fasta -bed RepeatMasker_output.gff -soft -fo genome.softmasked.fasta
minimap2 -a -x splice -H -t 100 -O6,24 -B4 genome.softmasked.fasta isoseq.fastq -o output.sam
Prepare algnment for CupCake with samtools.
samtools view -b -T original_genome.fasta minimap2_alignment.sam > minimap2_alignment.bam
Collapse isoforms with CupCake.
python path/to/collapse_isoforms_by_sam.py --input ISOseq.fq --fq -b minimap2.sorted.bam -o Species_Name
This is our workaround to parallelize MAKER.
grep '>' genome.fasta | sed 's/>//g' > chrs.list
For intital setup, ensure all paths to MAKER required executables are accurate on your system in the below section.
#-----Location of Executables Used by MAKER/EVALUATOR
Edit maker_split_and_run.sh for your input files as shown below.
#!/bin/bash
##==============================================
## files that get modified
genome="Species_Name.softmasked.fasta"
cupcake="Species_Name.collapsed.gff"
##==============================================
## files that don't get modified
repeat_lib="Species_Name-families.fa"
proteins="Related_Species_Name.proteins.fa"
Run maker_split_and_run.sh to separate the genome into different directories by chromosome/scaffold/contig/number of headers in fasta and setup a MAKER.ctl for each.
bash maker_split_and_run.sh
Edit the path /path/to/Species_Name/Maker
in maker_run.sh
shown below to your MAKER working directory.
#!/bin/bash
module load gffread
module load MakerP
while read i
do
cd /path/to/Species_Name/Maker/${i}
nohup maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl &
done < chrs.list
Use maker_run.sh
to submit MAKER runs in each directory setup by maker_split_and_run.sh
.
bash maker_run.sh
Edit Species_Name before running.
global_gff="Species_Name.gff"; printf "##gff-version 3\n" >${global_gff} ;tail -n +2 */*maker.output/*_datastore/*/*/*/*gff | awk -F"\t" 'NF==9 && ($3=="gene" || $3 =="CDS" || $3 =="mRNA" || $3 =="exon" || $3 == "five_prime_UTR" || $3 == "three_prime_UTR" || $3 =="tRNA" )' >>${global_gff}
gffread -S -y Species_Name.prots -g Species_Name.genome.fa Species_Name.gff
samtools faidx Species_Name.prots
Pick smallest protein.
sort -r -k2 -n Species_Name.prots.fai | cut -f -2 | grep est2genome
sort -k2 -n Species_Name.prots.fai | cut -f -2 | grep protein2genome | awk '$2 < 27 {print $1}' | sed 's/-mRNA-[0-9]*//' | sort | uniq > exclude_lt27.list`
Needs pip install gff3.
printf "##gff-version 3\n" > Species_Name.ge27.gff ; python /path/to/gff_filter.py -e exclude_lt27.list -g Species_Name.gff >> Species_Name.ge27.gff
python /path/to/gffPrepare/validate_gff.py --gff Species_Name.ge27.gff > Species_Name.ge27_uniq.gff
Used to rename gff to the Patterson Lab/IWGC common naming convention. Eleusine indica = EleIn. Eleusine indica glyphosate-resistant = EleInR. Species_name = SpeNa.
python /path/to/renameGff.py -g Species_Name.ge27_uniq.gff -t SpeNa > SpeNa.v2.gff
/path/to/gff3sort/gff3sort.pl --precise --chr_order natural SpeNa.v2.gff > SpeNa.v2.sorted.gff`
Unlike with structural annotation, the functional annotation pipeline is completely contained within a few custom scripts. Config variables and functions are imported from pipeline_config.sh
. ml2_one_by_one.sh
is referenced by Functional_Annotation_v4.sh
.
Once configured, the below command is the only one to run. Get your species' ID from NCBI.
/path/to/Functional_Annotation_v4.sh -G Species_Name.genome.fa -g SpeNa.v2.sorted.gff -s SpeNa -t 102 -i NCBI_Species_Name_ID -n Scientific_Name -c Common_Name