Subtelomeric 5-enolpyruvylshikimate-3-phosphate synthase copy number variation confers glyphosate resistance in Eleusine indica
DOI: 10.1038/s41467-023-40407-6
GitHub Repo: Eindica_Subtel_EPSPS_CNV
Repo Author: Nicholas A. Johnson
This repository was developed to improve the transparency and reproducibilty of the computational methods used in the Nature Communications article "Subtelomeric 5-enolpyruvylshikimate-3-phosphate synthase copy number variation confers glyphosate resistance in Eleusine indica". Please feel free to use any or all of the visualization methods or plot code provided in this repository and use the below citation if this information was helpful for you.
- Zhang, C., Johnson, N.A., Hall, N. et al. Subtelomeric 5-enolpyruvylshikimate-3-phosphate synthase copy number variation confers glyphosate resistance in Eleusine indica. Nat Commun 14, 4865 (2023). https://doi.org/10.1038/s41467-023-40407-6
The below data is included for the sake of transparency and is not required to create any of the plots, which are self-contained within their own directories.
- CoGe: Glyphosate-Susceptible and Glyphosate-Resistant Eleusine indica assembled genomes and annotation files
- NCBI: Bioproject containing raw sequencing data (SRA) and genome FASTA files (GenBank)
- a. Karyotype file (
eindicakaryotype.circos
) is a space delimited genome index that was generated from the assembled GS genome using standard command line tools/languages, likegrep
andAWK
. - b. Gene density file (
geneCoverageHeatMap.circos
) gives normalized gene density in 500Kb genomic windows. Genome windows were made with bedtools (version 2.30.0). Total base pairs contained within genes were counted in each window from the E_indica.gff3 annotation file, divided by total base pairs in each window (500Kb), and normalized. Genes were annotated with the International Weed Genomics Consortium Annotation Pipeline. - c. Transposable element genomic coverage for Gypsy (
gypsycoverage.circos
), Copia (copiacoverage.circos
), and all transposable elements excluding Gypsy and Copia elements (othercoverage.circos
) was calculated in terms of total base pairs contained in TEs divided by total base pairs in each 500Kb bedtools-generated genomic window. TEs were annotated with RepeatModeler (version 2.0.2). - d. Transposable element density file (
TE_total_coverage.circos
) gives normalized TE density in 500Kb genomic windows. TE element density windows were generated in the same fashion as the gene density file (geneCoverageHeatMap.circos
). - Region-A and Region-B labels contained within
labels_epsps.circos.txt
are punctuated with highlights on the karyotype track (a.) contained withinhighlights_epsps.txt
.
Install Circos directly from Circos website or from Conda. This plot was made with Circos version 0.69.9 from Bioconda.
If you see this error when installing via Conda...
UnsatisfiableError: Note that strict channel priority may have removed packages required for satisfiability.
Try running the below command.
conda create --strict-channel-priority --override-channels --channel conda-forge --channel bioconda --channel defaults --name circos -c bioconda circos
This Circos plot uses Arial font, which does not come preinstalled. Please move ArialBold.ttf
from EindicaGS_Circos to /circos/fonts
(Conda Path: /path/to/opt/anaconda3/envs/circos/fonts
) on your system to reproduce the included plot with Arial font.
Alternatively, use a font included with Circos by editing the eindica_circos_006/eindica_circos.conf
file as shown below.
<fonts>
default = fonts/modern/cmunsx.otf
#default = fonts/ArialBold.ttf
</fonts>
Run the below command from the directory EindicaGS_Circos.
circos -conf eindica_circos_006/eindica_circos.conf -outputdir eindica_circos_006/tmp
The below Circos plot will be produced. Note: a, b, c, and d labels in the publication version of the figure were added to the final image using Inkspace.
The above Circos plot shows (a.) the length (Mb) of chromosomes one through nine as an index with corresponding (b.) gene-rich (blue) and gene-poor (yellow) genomic regions, (c.) Gypsy (red), Copia (blue), and other (black) transposable element family coverage across the genome (scale: 0-50%), (d.) transposable element rich (red) and transposable element poor (yellow) genomic regions, and the native locations of Region-A (red label) and Region-B (blue label) of the subtelomeric EPSPS-Cassette.
I have also included a commented out <links>
section in eindica_circos.conf
which was not included in Fig. 1 of the publication. If you are curious about what links will look like between interchromosomal syntenic regions which are 14.5Kb or longer, uncomment the <links>
section.
GS and GR Eindica genomes were aligned to eachother using minimap2 (version 2.24) and the resulting .paf
was converted into a .coords
file using RagTag (version 2.1.0) and MUMmer4 (version 4.0.0) as shown below.
minimap2 -cx asm20 -t 14 EleInS.fa EleInR.fa > EleInS_v_EleInR.paf
#ragtag.py is from RagTag
ragtag.py paf2delta EleInS_v_EleInR.paf > EleInS_v_EleInR.delta
#Show-coords is from MUMmer4
show-coords -lTH EleInS_v_EleInR.delta > EleInS_v_EleInR.coords
The resulting coords file was given an additional column fill
to colorize links using command line and saved as RvsS.dualsynteny.cis.txt
. RvsS.dualKaryotype.txt
is genome index of both the GS and GR genomes with four additional columns (fill
, species
, size
, and color
) that were added with command line. These files were visualized using Eindica_Synteny_Ideogram_V1.R
.
EindicaRS_Synteny_Ideogram contains Eindica_Synteny_Ideogram_V1.R
and the only two source files, RvsS.dualsynteny.cis.txt
and RvsS.dualKaryotype.txt
. Running the R script will produce the below ideogram. Note: Numbers in boxes above and below the ideogram and bold letter “T”s on the karyotype in the publication version of this figure were added in Microsoft PowerPoint.
Grey links indicate shared synteny between chromosome pairs. Red links indicate large inversions of synteny between the genomes. Black links represent Region-A and Region-B of the EPSPS-Cassette in their native locations.
Fig. 3: Copy number variation in chromosome three across eight glyphosate-resistant and eight glyphosate-susceptible Eleusine indica individuals.
Illumina resequencing data of eight GR and eight GS Eindica individuals was aligned to their corresponding GR or GS genomes using HiSat2 (version 2.1.0). CNVnator (version 0.4.1) was used with these alignments to assess copy number variation across the genomes in terms of read depth. The resulting CNV files for chromosome three, the native location of the EPSPS locus, were visualized using EindicaRIdeogram_v5.R
.
Eindica_EPSPS_CNV_Ideogram contains EindicaRIdeogram_v5.R
to create the below CNV ideograms and the associated source files in Eindica_EPSPS_CNV_Ideogram/data
. Note: The publication version of Fig. 3 combines these two figures into one and adds additional visuals using Microsoft PowerPoint.
The above plots show copy number variation in chromosome three across eight glyphosate-resistant and eight glyphosate-susceptible Eleusine indica individuals. The ideogram shows deletions below 0.25x of average read depth (blue color spectrum), copy number variation above 0.25 of average read depth and below 4x of average read depth and with a p-value greater than 0.01 (white), and duplications above 4x of average read depth (red color spectrum) across chromosome three in eight glyphosate-resistant (R) versus eight glyphosate-susceptible (S) E. indica individuals at a scale of (the top plot) full chromosome length (63,742,515 base pairs) and (the bottom plot) the first 5,000,000 base pairs of chromosome three. Band thickness is proportional to the length of the genomic region exhibiting copy number variation. Region-A (green triangle), containing EPSPS, and Region-B (purple triangle), the genomic region co-duplicated with EPSPS, are consistently duplicated around 25x compared to average read depth in R individuals but are not duplicated in any of the S individuals.
Fig. 5: Relatedness of EPSPS-cassette subtelomere sequence to chromosomal subtelomeric sequences of the glyphosate-resistant and glyphosate-susceptible Eleusine indica genomes.
Subtelomeric repeat units from across the GR and GS genomes which are >86% similar in DNA sequence to the EPSPS-Cassette 451bp subtelomere repeat unit were obtained via BLAST+ (version 2.14.0). The extracted subtelomeric repeat units were aligned to eachother and the EPSPS-Cassette subtelomere repeat unit with MUSCLE5 (version 5.1.0) before creating a tree with RAXML-NG (version 1.1.0).
Eindica_Subtel_Tree contains Eindica_Subtel_Tree_V5.R
to create the below tree as well as the only needed source file, Eindica_R_S_EPSPS_SubTels.afa.raxml.bestTree.clean_labels
. Make sure to change the working directory in the Rscript. The included Eindica_Subtel_Tree.svg
was exported as a 600x600 SVG file using RStudio (version 2023.06.0+421).
The above plot shows the relatedness of subtelomeric sequences found on the glyphosate-resistant (R; gold) and glyphosate-susceptible (S; blue) E. indica genomes to the subtelomeric sequence found on the EPSPS-cassette (Cassette; green). Chromosomes at branch tips further from Cassette are less related to the EPSPS-cassette than chromosomes closer to Cassette. Branch distance is based on similarity. The sequences with the highest relatedness to the EPSPS-cassette subtelomere sequence on each chromosome were used as representative sequences to make this tree.
At the bottom of Eindica_Subtel_Tree_V5.R
you will find the code for a version of the tree with a table for a legend (shown below), just for fun. This version worked well for a poster.
The above plot shows the relatedness of subtelomeric sequences found on the glyphosate-resistant (R; red) and glyphosate-susceptible (S; blue) E. indica genomes to the subtelomeric sequence found on the EPSPS-Cassette (green). Chromosomes at branch tips further from cassette are less related to cassette than chromosomes closer to cassette. Branch distance is based on BLAST similarity. The sequences with the highest relatedness to the EPSPS-Cassette subtelomere sequence on each chromosome were used as representative sequences to make a tree.
Supp. Fig. 4: Differential expression of eight glyphosate-resistant versus eight glyphosate-susceptible Eleusine indica individuals.
Illumina paired-end cDNA reads from eight GR and eight GS E. indica individuals were mapped with HiSat2 (version 2.1.0) to the GS transcriptome, which was pulled from E_indica.gff3 annotation file using gffread (version 0.12.7). HiSat2 alignments were converted into a counts table EindicaRS.counts.v2.txt
using SAMtools (version 1.11). Labels in gene_names_v2.csv
were created for all genes annotated within Region-A and -B of the EPSPS-Cassette with command line.
The script EindicaRS_edgeR_v4.R
contained in EindicaRS_edgeR calculates differential expression using the two-sided quasi-likelihood F-test in edgeR, writes the calculations into resistant.v.susceptible.txt
and plots it into the below volcano plot.
The above plot from RNA-Seq data shows over-expressed (red) and underexpressed (blue) genes in GR E. indica individuals with labels for all identified genes within the EPSPS-Cassette. Gene labels with a non-integer numerical value represent splice variants of the same gene. Genes below a p-value of 0.01 or a fold change value below two were considered not differentially expressed (grey) between the two treatment groups.