<html> <head> <title>rnaQUAST 2.3.0 Manual</title> <style type="text/css"> .highlight pre { background-color: #f0f2f4; border-radius: 2px; font-size: 100%; line-height: 1.45; overflow: auto; padding: 16px; } </style> </head> <body><p dir="auto"><a href="https://anaconda.org/bioconda/rnaquast" rel="nofollow"><img src="https://camo.githubusercontent.com/976f964354e062323b77ef17988548388c237fe20afbff2404c26b49dcc859b9/68747470733a2f2f696d672e736869656c64732e696f2f636f6e64612f646e2f62696f636f6e64612f726e6171756173742e7376673f7374796c653d666c6167266c6162656c3d42696f436f6e6461253230696e7374616c6c" alt="BioConda Install" data-canonical-src="https://img.shields.io/conda/dn/bioconda/rnaquast.svg?style=flag&label=BioConda%20install" style="max-width: 100%;"></a><br>
<a href="https://www.python.org/downloads/" rel="nofollow"><img src="https://camo.githubusercontent.com/f3798b635ffeed6d54324f4874bbeaafe27280993e8e15ba6e14882c3ba8fa66/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e372d626c7565" alt="Python version" data-canonical-src="https://img.shields.io/badge/python-3.7-blue" style="max-width: 100%;"></a><br>
<a href="https://www.gnu.org/licenses/old-licenses/gpl-2.0" rel="nofollow"><img src="https://camo.githubusercontent.com/69529f2804f510d2b63f86bfb81274cb8362b99f440b880837014885f25e0308/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e63652d47504c76322d626c7565" alt="License" data-canonical-src="https://img.shields.io/badge/licence-GPLv2-blue" style="max-width: 100%;"></a><br>
<a href="https://github.com/ablab/rnaquast/releases/"><img src="https://camo.githubusercontent.com/50b9d7a4dc10e3263441000f65c0c39c500b3de2f3d16842f838fb305eab03ee/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f61626c61622f726e617175617374" alt="GitHub release (latest by date)" data-canonical-src="https://img.shields.io/github/v/release/ablab/rnaquast" style="max-width: 100%;"></a><br>
<a href="https://github.com/ablab/rnaquast/releases"><img src="https://camo.githubusercontent.com/f07f45b85cfb569ec8d6097612aece4113d85d962f804ac577e6eb4ef7b6b1ca/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f646f776e6c6f6164732f61626c61622f726e6171756173742f746f74616c2e7376673f7374796c653d736f6369616c266c6f676f3d676974687562266c6162656c3d446f776e6c6f6164" alt="GitHub Downloads" data-canonical-src="https://img.shields.io/github/downloads/ablab/rnaquast/total.svg?style=social&logo=github&label=Download" style="max-width: 100%;"></a></p>
<h1 dir="auto">rnaQUAST 2.3 manual</h1>
<ol dir="auto">
<li><a href="#sec1">About rnaQUAST</a></li>
<li><a href="#sec2">Installation & requirements</a><br>
2.1. <a href="#sec2.1">General requirements</a><br>
2.2. <a href="#sec2.2">Software for <em>de novo</em> quality assessment</a><br>
2.3. <a href="#sec2.3">Read alignment software</a></li>
<li><a href="#sec3">Options</a><br>
3.1. <a href="#sec3.1">Input data options</a><br>
3.2. <a href="#sec3.2">Basic options</a><br>
3.3. <a href="#sec3.3">Advanced options</a></li>
<li><a href="#sec4">Understanding rnaQUAST output</a><br>
4.1. <a href="#sec4.1">Reports</a><br>
4.2. <a href="#sec4.2">Detailed output</a><br>
4.3. <a href="#sec4.3">Plots</a></li>
<li><a href="#sec5">Citation</a></li>
<li><a href="#sec6">Feedback and bug reports</a></li>
</ol>
<p dir="auto"><a name="sec1"></a></p>
<h2 dir="auto">1 About rnaQUAST</h2>
<p dir="auto">rnaQUAST is a tool for evaluating RNA-Seq assemblies using a reference genome and a gene database. In addition, rnaQUAST can estimate gene database coverage by raw reads and perform <em>de novo</em> quality assessment using third-party software.</p>
<p dir="auto">rnaQUAST version 2.3 was released under GPLv2 on June 21st, 2024 and can be downloaded from <a href="https://github.com/ablab/rnaquast/releases">https://github.com/ablab/rnaquast/releases</a>.</p>
<p dir="auto">There is also a <a href="https://github.com/SimonHegele/rnaQAUSTcompare">visualizer</a> developed by rnaQUAST user <a href="https://github.com/SimonHegele">@SimonHegele</a>.</p>
<p dir="auto"><strong>For impatient people:</strong></p>
<ul dir="auto">
<li>
<p dir="auto">You will need Python, <a href="https://pythonhosted.org/gffutils/installation.html" rel="nofollow">gffutils</a>, <a href="http://matplotlib.org/" rel="nofollow">matplotlib</a> and <a href="https://joblib.readthedocs.io/en/latest/" rel="nofollow">joblib</a>. You will also need <a href="http://research-pub.gene.com/gmap/" rel="nofollow">GMAP</a> (or <a href="http://hgwdev.cse.ucsc.edu/~kent/exe/" rel="nofollow">BLAT</a>) and <a href="http://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/" rel="nofollow">BLASTN</a> installed on your machine and added to the <code class="notranslate">$PATH</code> variable.</p>
</li>
<li>
<p dir="auto">You may also install rnaQUAST via conda</p>
<pre class="notranslate"><code class="notranslate"> conda install -c bioconda rnaquast
</code></pre>
</li>
<li>
<p dir="auto">To verify your installation run</p>
<pre class="notranslate"><code class="notranslate"> python rnaQUAST.py --test
</code></pre>
</li>
<li>
<p dir="auto">To run rnaQUAST on your data use the following command</p>
<pre class="notranslate"><code class="notranslate"> python rnaQUAST.py \
--transcripts /PATH/TO/transcripts1.fasta /PATH/TO/ANOTHER/transcripts2.fasta /PATH/TO/MULTIPLE/*.fasta [...] \
--reference /PATH/TO/reference_genome.fasta --gtf /PATH/TO/gene_coordinates.gtf
</code></pre>
</li>
</ul>
<p dir="auto"><a name="sec2"></a></p>
<h2 dir="auto">2 Installation & requirements</h2>
<p dir="auto"><a name="sec2.1"></a></p>
<h3 dir="auto">2.1 General requirements</h3>
<p dir="auto">rnaQUAST can be installed via conda:</p>
<pre class="notranslate"><code class="notranslate"> conda install -c bioconda rnaquast
</code></pre>
<p dir="auto">If you wish to run rnaQUAST from <a href="https://github.com/ablab/rnaquast/releases">the release archive</a> you need:</p>
<ul dir="auto">
<li>Python3 or Python2 (2.5+)</li>
<li><a href="http://matplotlib.org/" rel="nofollow">matplotlib</a> python package</li>
<li><a href="https://joblib.readthedocs.io/en/latest/" rel="nofollow">joblib</a> python package</li>
<li><a href="https://pythonhosted.org/gffutils/installation.html" rel="nofollow">gffutils</a> python package (needs <a href="http://biopython.org" rel="nofollow">biopython</a>)</li>
<li><a href="http://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/" rel="nofollow">NCBI BLAST+ (blastn)</a></li>
<li><a href="http://research-pub.gene.com/gmap/" rel="nofollow">GMAP</a> (or <a href="http://hgwdev.cse.ucsc.edu/~kent/exe/" rel="nofollow">BLAT</a>) aligner</li>
</ul>
<p dir="auto">rnaQUAST still works under Python2 (2.5+), but since Python2 is outdated, its support has not been maintained since version 2.0.</p>
<p dir="auto">Note that, due to the limitations of <code class="notranslate">BLAT</code>, <code class="notranslate">pslSort</code> is also required to work with reference genomes larger than 4 Gb.</p>
<p dir="auto">Paths to <code class="notranslate">blastn</code> and <code class="notranslate">GMAP</code> (or <code class="notranslate">BLAT</code>) should be added to the <code class="notranslate">$PATH</code> environment variable. To check that everything is installed correctly, we recommend running:</p>
<pre class="notranslate"><code class="notranslate">python rnaQUAST.py --test
</code></pre>
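<p dir="auto">Before running the test, it can help to confirm that the external tools are visible on <code class="notranslate">$PATH</code>. The following is a minimal sketch, not part of rnaQUAST: the executable names <code class="notranslate">blastn</code>, <code class="notranslate">gmap</code> and <code class="notranslate">blat</code> are assumptions about how the tools are installed on your system, and <code class="notranslate">missing_tools</code> is a hypothetical helper.</p>

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED = ["blastn"]
ALIGNERS = ["gmap", "blat"]  # at least one aligner is needed


def missing_tools(which=shutil.which):
    """Return a list of tools that cannot be found; empty means all found."""
    problems = [name for name in REQUIRED if which(name) is None]
    # GMAP and BLAT are alternatives, so only complain if both are absent.
    if not any(which(name) is not None for name in ALIGNERS):
        problems.append(" or ".join(ALIGNERS))
    return problems


if __name__ == "__main__":
    problems = missing_tools()
    print("missing: " + ", ".join(problems) if problems else "all tools found")
```

<p dir="auto">This only checks that the executables can be located; <code class="notranslate">python rnaQUAST.py --test</code> remains the actual installation check.</p>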
<p dir="auto">Note that <code class="notranslate">gffutils</code> is used to complete gene coordinates when transcript or gene records are missing. For more information, see <a href="#sec3.3">advanced options</a>.<a name="sec2.2"></a></p>
<h3 dir="auto">2.2 Software for <em>de novo</em> quality assessment</h3>
<p dir="auto">When a reference genome and a gene database are unavailable, we recommend running <a href="http://busco.ezlab.org/" rel="nofollow">BUSCO</a> and <a href="http://topaz.gatech.edu/GeneMark/" rel="nofollow">GeneMarkS-T</a> in the rnaQUAST pipeline.</p>
<p dir="auto"><strong>BUSCO requirements</strong></p>
<p dir="auto">BUSCO detects core genes in the assembled transcripts. To use it, install <a href="http://busco.ezlab.org/" rel="nofollow">BUSCO v4+</a>, <a href="http://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/" rel="nofollow">tblastn</a>, <a href="http://hmmer.janelia.org/" rel="nofollow">HMMER</a> and transeq, and add these tools to the <code class="notranslate">$PATH</code> variable.</p>
<p dir="auto">To run BUSCO, provide the lineage-specific database name via the <code class="notranslate">--busco</code> option. You may also download the appropriate database from <a href="http://busco.ezlab.org" rel="nofollow">http://busco.ezlab.org</a> manually and provide it using the same option (see <a href="#busco">options</a> for details).</p>
<p dir="auto"><strong>GeneMarkS-T requirements</strong></p>
<p dir="auto"><a href="http://topaz.gatech.edu/GeneMark/" rel="nofollow">GeneMarkS-T</a> predicts genes in the assembled transcripts without a reference genome. To use it in the rnaQUAST pipeline, GeneMarkS-T should be properly installed and added to the <code class="notranslate">$PATH</code> variable.</p>
<p dir="auto"><a name="sec2.3"></a></p>
<h3 dir="auto">2.3 Read alignment software</h3>
<p dir="auto">rnaQUAST can also calculate various statistics using raw reads (e.g. database coverage by reads). To obtain them, install the <a href="https://github.com/alexdobin/STAR">STAR</a> aligner and add it to the <code class="notranslate">$PATH</code> variable. To learn more, see <a href="#readopts">input options</a>.</p>
<p dir="auto"><a name="sec3"></a></p>
<h2 dir="auto">3 Options</h2>
<p dir="auto"><a name="sec3.1"></a></p>
<h3 dir="auto">3.1 Input data options</h3>
<p dir="auto">To run rnaQUAST you need to provide either FASTA files with transcripts (recommended), or align transcripts to the reference genome manually and provide the resulting PSL files.</p>
<p dir="auto"><code class="notranslate">-r <REFERENCE>, --reference <REFERENCE></code><br>
Single file with reference genome containing all chromosomes/scaffolds in FASTA format (preferably with <code class="notranslate">*.fasta, *.fa, *.fna, *.ffn or *.frn</code> extension) OR<br>
<strong><code class="notranslate">*.txt</code></strong> file containing the one-per-line list of FASTA files with reference sequences.</p>
<p dir="auto"><code class="notranslate">--gtf <GENE_COORDINATES></code><br>
File with gene coordinates in GTF/GFF format (needs information about parent relations). We recommend using files downloaded from <a href="http://www.gencodegenes.org/" rel="nofollow">GENCODE</a> or Ensembl.</p>
<p dir="auto"><code class="notranslate">--gene_db <GENE_DB></code><br>
Path to the gene database generated by gffutils. The database is created during the first run. This option is not compatible with the <code class="notranslate">--gtf</code> option. Once the database has been created, we recommend using this option to speed up subsequent runs.</p>
<p dir="auto"><code class="notranslate">--gmap_index <INDEX FOLDER></code><br>
Folder containing a pre-built GMAP index for the reference genome. Using a previously constructed index decreases running time. Note that you still need to provide the reference genome that was used for index construction when this option is used.</p>
<p dir="auto"><code class="notranslate">-c <TRANSCRIPTS ...>, --transcripts <TRANSCRIPTS, ...></code><br>
File(s) with transcripts in FASTA format separated by space. Wildcards can be used, e.g. <code class="notranslate">--transcripts */*.fasta</code>.</p>
<p dir="auto"><code class="notranslate">-psl <TRANSCRIPTS_ALIGNMENT ...>, --alignment <TRANSCRIPTS_ALIGNMENT, ...></code><br>
File(s) with transcript alignments to the reference genome in PSL format separated by space.</p>
<p dir="auto"><a name="readopts"></a></p>
<p dir="auto"><code class="notranslate">-sam <READS_ALIGNMENT>, --reads_alignment <READS_ALIGNMENT></code><br>
File with read alignments to the reference genome in SAM format.</p>
<p dir="auto"><code class="notranslate">-1 <LEFT_READS>, --left_reads <LEFT_READS></code><br>
File with forward paired-end reads in FASTQ or gzip-compressed FASTQ format.</p>
<p dir="auto"><code class="notranslate">-2 <RIGHT_READS>, --right_reads <RIGHT_READS></code><br>
File with reverse paired-end reads in FASTQ or gzip-compressed FASTQ format.</p>
<p dir="auto"><code class="notranslate">-s <SINGLE_READS>, --single_reads <SINGLE_READS></code><br>
File with single reads in FASTQ or gzip-compressed FASTQ format.</p>
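<p dir="auto">When rnaQUAST is launched from a script, the input options above can be assembled into an argument list. The sketch below is illustrative only: <code class="notranslate">build_command</code> is a hypothetical helper, the file names are placeholders, and it uses just the long option names documented in this section.</p>

```python
import shlex


def build_command(transcripts, reference, gtf, left=None, right=None):
    """Assemble an rnaQUAST command line from the documented input options."""
    cmd = ["python", "rnaQUAST.py", "--transcripts", *transcripts,
           "--reference", reference, "--gtf", gtf]
    # Paired-end reads are optional and only added when both files are given.
    if left and right:
        cmd += ["--left_reads", left, "--right_reads", right]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace with real files before running the command.
    cmd = build_command(["assembly1.fasta", "assembly2.fasta"],
                        "reference_genome.fasta", "gene_coordinates.gtf")
    print(shlex.join(cmd))
```

<p dir="auto">The resulting list can be passed to <code class="notranslate">subprocess.run</code>; keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.</p>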
<p dir="auto"><a name="sec3.2"></a></p>
<h3 dir="auto">3.2 Basic options</h3>
<p dir="auto"><code class="notranslate">-o <OUTPUT_DIR>, --output_dir <OUTPUT_DIR></code><br>
Directory to store all results. Default is <code class="notranslate">rnaQUAST_results/results_<datetime></code>.</p>
<p dir="auto"><code class="notranslate">--test</code><br>
Run rnaQUAST on the test data from the <code class="notranslate">test_data</code> folder; the output directory is <code class="notranslate">rnaQUAST_test_output</code>.</p>
<p dir="auto"><code class="notranslate">-d, --debug</code><br>
Report detailed information, typically used only for detecting problems.</p>
<p dir="auto"><code class="notranslate">-h, --help</code><br>
Show help message and exit.</p>
<p dir="auto"><a name="sec3.3"></a></p>
<h3 dir="auto">3.3 Advanced options</h3>
<p dir="auto"><code class="notranslate">-t <INT>, --threads <INT></code><br>
Maximum number of threads. Default is min(number of CPUs / 2, 16).</p>
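<p dir="auto">The default thread count can be reproduced as follows. This is a sketch of the formula above, not rnaQUAST's own code: the integer division and the lower bound of one thread are assumptions, and <code class="notranslate">default_threads</code> is a hypothetical name.</p>

```python
import os


def default_threads(cpu_count=None):
    """min(number of CPUs / 2, 16), rounded down and never below 1."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cpus // 2, 16))


if __name__ == "__main__":
    print(default_threads())
```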
<p dir="auto"><code class="notranslate">-l <LABELS ...>, --labels <LABELS ...></code><br>
Name(s) of assemblies that will be used in the reports separated by space and given in the same order as files with transcripts / alignments.</p>
<p dir="auto"><code class="notranslate">--prokaryote</code><br>
Use this option if the genome is prokaryotic.</p>
<p dir="auto"><code class="notranslate">-ss, --strand_specific</code><br>
Set if transcripts were assembled using strand-specific RNA-Seq data in order to benefit from knowing whether the transcript originated from the + or - strand.</p>
<p dir="auto"><code class="notranslate">--min_alignment <MIN_ALIGNMENT></code><br>
Minimal alignment length to be used, default value is 50.</p>
<p dir="auto"><code class="notranslate">--no_plots</code><br>
Do not draw plots (makes rnaQUAST run a bit faster).</p>
<p dir="auto"><code class="notranslate">--blat</code><br>
Run with <a href="http://hgwdev.cse.ucsc.edu/~kent/exe/" rel="nofollow">BLAT alignment tool</a> instead of <a href="http://research-pub.gene.com/gmap/" rel="nofollow">GMAP</a>.</p>
<p dir="auto"><a name="busco"></a><br>
<code class="notranslate">--busco</code><br>
Run <a href="http://busco.ezlab.org/" rel="nofollow">BUSCO tool</a>, which detects core genes in the assembly (see <a href="#sec2">Installation & requirements</a>). Use this option to provide either the name of the BUSCO database or the path to a local database. You can also set <code class="notranslate">auto-lineage</code> for automated lineage selection.</p>
<p dir="auto"><code class="notranslate">--gene_mark</code><br>
Run with <a href="http://topaz.gatech.edu/GeneMark/" rel="nofollow">GeneMarkS-T</a> gene prediction tool. Use <code class="notranslate">--prokaryote</code> option if the genome is prokaryotic.</p>
<p dir="auto"><code class="notranslate">--disable_infer_genes</code><br>
Use this option if your GTF file already contains gene records; otherwise gffutils will infer them. Note that gffutils may run for quite a long time.</p>
<p dir="auto"><code class="notranslate">--disable_infer_transcripts</code><br>
Use this option if your GTF file already contains transcript records; otherwise gffutils will infer them. Note that gffutils may run for quite a long time.</p>
<p dir="auto"><code class="notranslate">--lower_threshold</code><br>
Lower threshold for x-assembled/covered/matched metrics, default: 50%.</p>
<p dir="auto"><code class="notranslate">--upper_threshold</code><br>
Upper threshold for x-assembled/covered/matched metrics, default: 95%.</p>
<p dir="auto"><a name="sec4"></a></p>
<h2 dir="auto">4 Understanding rnaQUAST output</h2>
<p dir="auto">In this section we describe metrics, statistics and plots generated by rnaQUAST. Metrics highlighted in <strong><em>bold italic</em></strong> are considered the most important and are included in the short summary report (<code class="notranslate">short_report.txt</code>).</p>
<p dir="auto">For simplicity, <em>transcript</em> refers below to a sequence generated by the assembler, and <em>isoform</em> denotes a sequence from the gene database. The figure below shows how rnaQUAST classifies transcript and isoform sequences using alignment information. <a target="_blank" rel="noopener noreferrer" href="fig1.png"><img src="fig1.png" alt="" style="max-width: 100%;"></a></p>
<p dir="auto"><a name="sec4.1"></a></p>
<h3 dir="auto">4.1 Reports</h3>
<p dir="auto">The following text report files are located in the <code class="notranslate">comparison_output</code> directory and include results for all input assemblies. In addition, the same reports are written to the <code class="notranslate"><assembly_label>_output</code> directory of each assembly separately.</p>
<p dir="auto"><strong><code class="notranslate">database_metrics.txt</code></strong><br>
Gene database metrics.</p>
<ul dir="auto">
<li><strong><em>Genes</em></strong> / Protein coding genes – number of genes / protein coding genes</li>
<li>Isoforms / Protein coding isoforms – number of isoforms / protein coding isoforms</li>
<li>Exons / Introns – total number of exons / introns</li>
<li>Total / Average length of all isoforms, bp</li>
<li>Average exon length, bp</li>
<li>Average intron length, bp</li>
<li><strong><em>Average</em></strong> / Maximum number of exons per isoform</li>
</ul>
<p dir="auto"><a name="readcov"></a>Coverage by reads. The following metrics are calculated only when the <code class="notranslate">--left_reads</code>, <code class="notranslate">--right_reads</code>, <code class="notranslate">--single_reads</code> or <code class="notranslate">--reads_alignment</code> options are used (see <a href="#readopts">options</a> for details).</p>
<ul dir="auto">
<li>Database coverage – the total number of bases covered by reads (in all isoforms) divided by the total length of all isoforms.</li>
<li>x%-covered genes / isoforms / exons – number of genes / isoforms / exons from the database that have at least x% of bases covered by all reads, where x is specified with <code class="notranslate">--lower_threshold / --upper_threshold</code> options (50% / 95% by default).</li>
</ul>
<p dir="auto"><strong><code class="notranslate">basic_metrics.txt</code></strong><br>
Basic transcript metrics are calculated without a reference genome or gene database.</p>
<ul dir="auto">
<li><strong><em>Transcripts</em></strong> – total number of assembled transcripts.</li>
<li><strong><em>Transcripts > 500 bp</em></strong></li>
<li>Transcripts > 1000 bp</li>
<li>Average length of assembled transcripts</li>
<li>Longest transcript</li>
<li>Total length</li>
<li>Transcript N50 – the maximal number N such that the total length of all transcripts of length at least N bp is at least 50% of the total length of all transcripts.</li>
</ul>
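<p dir="auto">The N50 definition above can be computed by sorting the transcript lengths in decreasing order and accumulating them until half of the total length is reached. The sketch below is illustrative rather than rnaQUAST's implementation; it reads the definition as "transcripts of length at least N bp", and <code class="notranslate">nx</code> is a hypothetical function name.</p>

```python
def nx(lengths, x=50):
    """Largest N such that transcripts of length >= N cover at least x% of the total."""
    total = sum(lengths)
    running = 0
    # Walk from the longest transcript down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Total length is 30; 10 + 6 = 16 is the first prefix reaching 15.
    print(nx([2, 3, 4, 5, 6, 10]))
```

<p dir="auto">The same function with a different <code class="notranslate">x</code> gives the Nx values, and applying it to alignment lengths instead of transcript lengths gives NA50.</p>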
<p dir="auto"><strong><code class="notranslate">alignment_metrics.txt</code></strong><br>
Alignment metrics are calculated with a reference genome but without a gene database. To calculate the following metrics, rnaQUAST filters out all short partial alignments (see <a href="#sec3.3"><code class="notranslate">--min_alignment</code> option</a>) and attempts to select the best hits for each transcript.</p>
<ul dir="auto">
<li><strong><em>Transcripts</em></strong> – total number of assembled transcripts.</li>
<li><strong><em>Aligned</em></strong> – the number of transcripts having at least 1 significant alignment.</li>
<li><strong><em>Uniquely aligned</em></strong> – the number of transcripts having a single significant alignment.</li>
<li>Multiply aligned – the number of transcripts having 2 or more significant alignments. Multiply aligned transcripts are stored in <code class="notranslate"><assembly_label>.paralogs.fasta</code> file.</li>
<li>Misassembly candidates reported by GMAP (or BLAT) – transcripts that have a discordant best-scoring alignment (partial alignments that are mapped to different strands, to different chromosomes, in reverse order, or too far apart).</li>
<li><strong><em>Unaligned</em></strong> – the number of transcripts without any significant alignments. Unaligned transcripts are stored in <code class="notranslate"><assembly_label>.unaligned.fasta</code> file.</li>
</ul>
<p dir="auto">Number of assembled transcripts = Unaligned + Aligned = Unaligned + (Uniquely aligned + Multiply aligned + Misassembly candidates reported by GMAP (or BLAT)).</p>
<p dir="auto">Alignment metrics for non-misassembled transcripts</p>
<ul dir="auto">
<li><strong><em>Average aligned fraction.</em></strong> Aligned fraction for a single transcript is defined as total number of aligned bases in the transcript divided by the total transcript length.</li>
<li>Average alignment length. Aligned length for a single transcript is defined as total number of aligned bases in the transcript.</li>
<li>Average blocks per alignment. A block is defined as a continuous alignment fragment without indels.</li>
<li>Average block length (see above).</li>
<li><strong><em>Average mismatches per transcript</em></strong> – average number of single nucleotide differences with reference genome per transcript.</li>
<li>NA50 – N50 for alignments.</li>
</ul>
<p dir="auto"><a name="misassemblies"></a><strong><code class="notranslate">misassemblies.txt</code></strong></p>
<ul dir="auto">
<li><strong><em>Transcripts</em></strong> – total number of assembled transcripts.</li>
<li>Misassembly candidates reported by GMAP (or BLAT) – transcripts that have a discordant best-scoring alignment (partial alignments that are mapped to different strands, to different chromosomes, in reverse order, or too far apart).</li>
<li>Misassembly candidates reported by BLASTN – transcripts are aligned to the isoform sequences extracted from the genome using gene database with BLASTN and then transcripts that have partial alignments to multiple isoforms are selected.</li>
<li><strong><em>Misassemblies</em></strong> – misassembly candidates confirmed by both methods described above. Using both methods simultaneously helps to avoid counting misalignments that can be caused, for example, by paralogous genes or genomic repeats. Misassembled transcripts are stored in <code class="notranslate"><assembly_label>.misassembled.fasta</code> file.</li>
</ul>
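<p dir="auto">Conceptually, the final misassembly set is the intersection of the two candidate sets. The following sketch illustrates this with made-up transcript IDs; <code class="notranslate">confirmed_misassemblies</code> is a hypothetical helper and does not reflect how rnaQUAST stores its candidates internally.</p>

```python
def confirmed_misassemblies(aligner_candidates, blastn_candidates):
    """Keep only transcripts flagged by both GMAP/BLAT and BLASTN."""
    return sorted(set(aligner_candidates) & set(blastn_candidates))


if __name__ == "__main__":
    # Hypothetical transcript IDs, for illustration only.
    by_aligner = ["t1", "t4", "t7"]
    by_blastn = ["t4", "t7", "t9"]
    print(confirmed_misassemblies(by_aligner, by_blastn))
```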
<p dir="auto"><strong><code class="notranslate">sensitivity.txt</code></strong><br>
Assembly completeness (sensitivity). For the following metrics (calculated with reference genome and gene database) rnaQUAST attempts to select best-matching database isoforms for every transcript. Note that a single transcript can contribute to multiple isoforms in the case of, for example, paralogous genes or genomic repeats. At the same time, an isoform can be covered by multiple transcripts in the case of fragmented assembly or duplicated transcripts in the assembly.</p>
<ul dir="auto">
<li><strong><em>Database coverage</em></strong> – the total number of bases covered by transcripts (in all isoforms) divided by the total length of all isoforms.</li>
<li>Duplication ratio – total number of aligned bases in assembled transcripts divided by the total number of covered isoform bases. This metric counts neither paralogous genes nor shared exons, only real overlaps of the assembled sequences that are mapped to the same isoform.</li>
<li>Average number of transcripts mapped to one isoform.</li>
<li><strong><em>x%-assembled genes / isoforms</em></strong> / exons – number of genes / isoforms / exons from the database that have at least x% captured by a single assembled transcript, where x is specified with <code class="notranslate">--lower_threshold / --upper_threshold</code> options (50% / 95% by default). 95%-assembled isoforms are stored in <code class="notranslate"><assembly_label>.95%assembled.fasta</code> file.</li>
<li>x%-covered genes / isoforms – number of genes / isoforms from the database that have at least x% of bases covered by all alignments, where x is specified with <code class="notranslate">--lower_threshold / --upper_threshold</code> options (50% / 95% by default).</li>
<li><strong><em>Mean isoform assembly</em></strong> – assembled fraction of a single isoform is calculated as the largest number of its bases captured by a single assembled transcript divided by its length; average value is computed for isoforms with > 0 bases covered.</li>
<li>Mean isoform coverage – coverage of a single isoform is calculated as the number of its bases covered by all assembled transcripts divided by its length; average value is computed for isoforms with > 0 bases covered.</li>
<li>Mean exon coverage – coverage of a single exon is calculated as the number of its bases covered by all assembled transcripts divided by its length; average value is computed for exons with > 0 bases covered.</li>
<li>Average percentage of isoform x%-covered exons, where x is specified with <code class="notranslate">--lower_threshold / --upper_threshold</code> options (50% / 95% by default). For each isoform rnaQUAST calculates the number of x%-covered exons divided by the total number of exons. Afterwards it computes average value for all covered isoforms.</li>
</ul>
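<p dir="auto">The difference between the <em>assembled</em> and <em>covered</em> fractions of an isoform is easiest to see on a small example. The sketch below is an illustration under simplifying assumptions, not rnaQUAST's implementation: alignments are given as half-open <code class="notranslate">(start, end)</code> intervals in isoform coordinates, and the function names and transcript IDs are hypothetical.</p>

```python
def merged_length(intervals):
    """Number of bases covered by half-open (start, end) intervals, overlaps counted once."""
    covered, current_end = 0, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            covered += end - start          # disjoint from everything seen so far
            current_end = end
        elif end > current_end:
            covered += end - current_end    # only the part extending past the overlap
            current_end = end
    return covered


def isoform_fractions(isoform_length, hits_by_transcript):
    """Return (assembled, covered) fractions for one isoform."""
    per_transcript = [merged_length(ivs) for ivs in hits_by_transcript.values()]
    # Assembled: best single transcript. Covered: all transcripts together.
    assembled = max(per_transcript, default=0) / isoform_length
    all_intervals = [iv for ivs in hits_by_transcript.values() for iv in ivs]
    covered = merged_length(all_intervals) / isoform_length
    return assembled, covered


if __name__ == "__main__":
    hits = {"t1": [(0, 400)], "t2": [(300, 900)]}
    print(isoform_fractions(1000, hits))
```

<p dir="auto">In this example the best single transcript captures 600 of 1000 bases, while both transcripts together cover 900 bases. With the default thresholds the isoform is therefore 50%-assembled and 50%-covered, but neither 95%-assembled nor 95%-covered.</p>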
<p dir="auto"><a href="http://busco.ezlab.org/" rel="nofollow">BUSCO</a> metrics. The following metrics are calculated only when <code class="notranslate">--busco</code> option is used (see <a href="#busco">options</a> for details).</p>
<ul dir="auto">
<li><strong><em>Complete</em></strong> – percentage of completely recovered genes.</li>
<li><strong><em>Partial</em></strong> – percentage of partially recovered genes.</li>
</ul>
<p dir="auto"><a href="http://topaz.gatech.edu/GeneMark/" rel="nofollow">GeneMarkS-T</a> metrics. The following metrics are calculated when reference and gene database are not provided or <code class="notranslate">--gene_mark</code> option is used (see <a href="#sec3.3">options</a> for details).</p>
<ul dir="auto">
<li><strong><em>Genes</em></strong> – number of predicted genes in transcripts.</li>
</ul>
<p dir="auto"><strong><code class="notranslate">specificity.txt</code></strong><br>
Assembly specificity. To compute the following metrics we use only transcripts that have at least one significant alignment and are not misassembled.</p>
<ul dir="auto">
<li><strong><em>Unannotated</em></strong> – total number of transcripts that do not cover any isoform from the database. Unannotated transcripts are stored in <code class="notranslate"><assembly_label>.unannotated.fasta</code> file.</li>
<li><strong><em>x%-matched</em></strong> – total number of transcripts that have at least x% covering an isoform from the database, where x is specified with <code class="notranslate">--lower_threshold / --upper_threshold</code> options (50% / 95% by default).</li>
<li><strong><em>Mean fraction of transcript matched</em></strong> – matched fraction of a single transcript is calculated as the number of its bases covering an isoform divided by the transcript length; average value is computed for transcripts with > 0 bases matched.</li>
<li>Mean fraction of block matched – matched fraction of a single block is calculated as the number of its bases covering an isoform divided by the block length; average value is computed for blocks with > 0 bases matched.</li>
<li>x%-matched blocks – percentage of blocks that have at least x% covering an isoform from the database, where x is specified with <code class="notranslate">--lower_threshold / --upper_threshold</code> options (50% / 95% by default).</li>
<li>Matched length – total number of transcript bases covering isoforms from the database.</li>
<li>Unmatched length – total alignment length minus matched length.</li>
</ul>
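<p dir="auto">The transcript-level specificity metrics above can be illustrated with a short sketch. It assumes that each transcript is summarized by its length and by the number of its bases covering an isoform, and that a transcript with zero matched bases counts as unannotated; <code class="notranslate">matched_summary</code> and its input format are hypothetical, not rnaQUAST's internal representation.</p>

```python
def matched_summary(transcripts, lower=0.5, upper=0.95):
    """transcripts: list of (transcript_length, matched_bases) pairs."""
    fractions = [matched / length for length, matched in transcripts]
    positive = [f for f in fractions if f > 0]
    return {
        "unannotated": sum(1 for f in fractions if f == 0),
        "lower_matched": sum(1 for f in fractions if f >= lower),
        "upper_matched": sum(1 for f in fractions if f >= upper),
        # Averaged only over transcripts with > 0 bases matched, as in the report.
        "mean_fraction_matched": sum(positive) / len(positive) if positive else 0.0,
    }


if __name__ == "__main__":
    print(matched_summary([(1000, 0), (1000, 600), (500, 500)]))
```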
<p dir="auto"><strong><code class="notranslate">relative_database_coverage.txt</code></strong><br>
Relative database coverage metrics are calculated only when raw reads (or read alignments) are provided. rnaQUAST uses read alignments to estimate the upper bound of the database coverage and the number of x%-covered genes / isoforms / exons (see <a href="#readcov">read coverage</a>) and computes the following metrics:</p>
<ul dir="auto">
<li><strong><em>Relative database coverage</em></strong> – ratio between transcripts database coverage and reads database coverage.</li>
<li>Relative x%-assembled genes / isoforms / exons – ratio between transcripts x%-assembled and reads x%-covered genes / isoforms / exons.</li>
<li>Relative x%-covered genes / isoforms / exons – ratio between transcripts x%-covered and reads x%-covered genes / isoforms / exons.</li>
</ul>
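<p dir="auto">Each relative metric is a plain ratio between a transcript-based value and the corresponding read-based value. The sketch below uses made-up numbers; <code class="notranslate">relative_metric</code> is a hypothetical helper, and returning <code class="notranslate">None</code> when the read-based value is zero is an assumption made here to avoid division by zero.</p>

```python
def relative_metric(transcript_value, read_value):
    """Ratio between a transcript-based metric and its read-based upper bound."""
    if read_value == 0:
        return None  # undefined when the reads cover nothing
    return transcript_value / read_value


if __name__ == "__main__":
    # Hypothetical values: transcripts cover 45% of the database, reads cover 60%.
    print(relative_metric(0.45, 0.60))
```

<p dir="auto">A relative database coverage close to 1 means the assembly recovers nearly everything that the reads support, even if the absolute database coverage is low.</p>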
<p dir="auto"><a name="sec4.2"></a></p>
<h3 dir="auto">4.2 Detailed output</h3>
<p dir="auto">These files are contained in <code class="notranslate"><assembly_label>_output</code> directories for each assembly separately.</p>
<ul dir="auto">
<li><code class="notranslate"><assembly_label>.unaligned.fasta</code> – transcripts without any significant alignments.</li>
<li><code class="notranslate"><assembly_label>.paralogs.fasta</code> – transcripts having 2 or more significant alignments.</li>
<li><code class="notranslate"><assembly_label>.misassembled.fasta</code> – misassembly candidates detected by methods described above. See <a href="#misassemblies"><code class="notranslate">misassemblies.txt</code></a> description for details.</li>
<li><code class="notranslate"><assembly_label>.correct.fasta</code> – transcripts with exactly 1 significant alignment that do not contain misassemblies.</li>
<li><code class="notranslate"><assembly_label>.x%-assembled.list</code> – IDs of the isoforms from the database that have at least x% captured by a single assembled transcript, where x is specified by the user with an option <code class="notranslate">--upper_threshold</code> (95% by default).</li>
<li><code class="notranslate"><assembly_label>.unannotated.fasta</code> – transcripts that do not cover any isoform from the database.</li>
</ul>
<p dir="auto">The following text file is contained in <code class="notranslate">comparison_output</code> directory and <code class="notranslate"><assembly_label>_output</code> directories for each assembly separately.</p>
<ul dir="auto">
<li><code class="notranslate">reads.x%-covered.list</code> – IDs of the isoforms from the database that have at least x% bases covered by all reads, where x is specified with <code class="notranslate">--lower_threshold / --upper_threshold</code> options (50% / 95% by default).</li>
</ul>
<p dir="auto"><a name="sec4.3"></a></p>
<h3 dir="auto">4.3 Plots</h3>
<p dir="auto">The following plots are likewise contained in both the <code class="notranslate">comparison_output</code> directory and the <code class="notranslate">&lt;assembly_label&gt;_output</code> directories. Note that most of the plots show cumulative distributions and some use a logarithmic scale.</p>
<p dir="auto"><strong>Basic</strong></p>
<ul dir="auto">
<li><strong><em><code class="notranslate">transcript_length.png</code></em></strong> – assembled transcripts length distribution (+ database isoforms length distribution).</li>
<li><code class="notranslate">block_length.png</code> – alignment blocks length distribution (+ database exons length distribution).</li>
<li><code class="notranslate">x-aligned.png</code> – transcript aligned fraction distribution.</li>
<li><code class="notranslate">blocks_per_alignment.png</code> – distribution of number of blocks per alignment (+ distribution of number of database exons per isoform).</li>
<li><code class="notranslate">alignment_multiplicity.png</code> – distribution of the number of significant alignments for each multiply-aligned transcript.</li>
<li><strong><em><code class="notranslate">mismatch_rate.png</code></em></strong> – substitution errors per alignment distribution.</li>
<li><code class="notranslate">Nx.png</code> – Nx plot for transcripts. Nx is the largest length N such that the transcripts of length N bp or longer together make up at least x% of the total length of all transcripts.</li>
<li><code class="notranslate">NAx.png</code> – Nx plot for alignments.</li>
</ul>
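<p dir="auto">The Nx definition above can be illustrated with a minimal Python sketch. This is not rnaQUAST's own code: the function name <code class="notranslate">nx</code> and the example lengths are hypothetical, and the sketch assumes that transcripts of exactly N bp count towards the total.</p>
<div class="highlight"><pre>
```python
def nx(lengths, x):
    """Return Nx for a list of sequence lengths; x is a percentage from 0 to 100."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length until
    # the running sum reaches x% of the total length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return min(lengths)


# Hypothetical transcript lengths in bp (illustration only).
example_lengths = [100, 200, 300, 400]
print(nx(example_lengths, 50))  # prints 300
```
</pre></div>
<p dir="auto">In this example the two longest transcripts (400 bp and 300 bp) sum to 700 bp, which is the first point at which the running total reaches 50% of the overall 1000 bp, so N50 is 300.</p>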
<p dir="auto"><strong>Sensitivity</strong></p>
<ul dir="auto">
<li><strong><em><code class="notranslate">x-assembled.png</code></em></strong> – a histogram in which each bar represents the number of isoforms from the database that have at least x% captured by a single assembled transcript.</li>
<li><code class="notranslate">x-covered.png</code> – a histogram in which each bar represents the number of isoforms from the database that have at least x% of bases covered by all alignments.</li>
<li><code class="notranslate">x-assembled_exons.png</code> – a histogram in which each bar represents the number of exons from the database that have at least x% captured by a single assembled transcript.</li>
<li><code class="notranslate">x-covered_exons.png</code> – a histogram in which each bar represents the number of exons from the database that have at least x% of bases covered by all alignments.</li>
<li><code class="notranslate">alignments_per_isoform.png</code> – plot showing the number of transcript alignments per isoform.</li>
</ul>
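<p dir="auto">The "at least x%" bars in the histograms above are cumulative counts. A minimal Python sketch of that counting follows; it is an illustration rather than rnaQUAST's implementation, and the function name <code class="notranslate">count_at_least</code>, the bin step and the input percentages are all assumptions.</p>
<div class="highlight"><pre>
```python
def count_at_least(percentages, step=10):
    """Map each threshold x (0, step, ..., 100) to the number of values that are at least x."""
    thresholds = range(0, 101, step)
    return {x: sum(1 for p in percentages if p >= x) for x in thresholds}


# Hypothetical per-isoform assembled percentages (illustration only).
assembled = [25, 50, 97, 100]
bars = count_at_least(assembled, step=50)
print(bars)  # prints {0: 4, 50: 3, 100: 1}
```
</pre></div>
<p dir="auto">Because every bar counts all isoforms at or above its threshold, the bar heights never increase as x grows.</p>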
<p dir="auto"><strong>Specificity</strong></p>
<ul dir="auto">
<li><code class="notranslate">x-matched.png</code> – a histogram in which each bar represents the number of transcripts that have at least x% matched to an isoform from the database.</li>
<li><code class="notranslate">x-matched_blocks.png</code> – a histogram in which each bar represents the number of all blocks from all transcript alignments that have at least x% matched to an isoform from the database.</li>
</ul>
<p dir="auto">To compare different reports you can also use a <a href="https://github.com/SimonHegele/rnaQAUSTcompare">visualization tool</a> developed by one of the rnaQUAST users, <a href="https://github.com/SimonHegele">@SimonHegele</a>.</p>
<p dir="auto"><a name="sec5"></a></p>
<h2 dir="auto">5 Citation</h2>
<p dir="auto"><a href="https://academic.oup.com/bioinformatics/article/32/14/2210/1743439" rel="nofollow">Bushmanova, E., Antipov, D., Lapidus, A., Suvorov, V. and Prjibelski, A.D., 2016. rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics, 32(14), pp.2210-2212.</a></p>
<p dir="auto"><a name="sec6"></a></p>
<h2 dir="auto">6 Feedback and bug reports</h2>
<p dir="auto">Your comments, bug reports, and suggestions are very welcome. They will help us to further improve rnaQUAST. If you have any trouble running rnaQUAST, please send us <code class="notranslate">logs/rnaQUAST.log</code> from the output directory.<br>
Submit your issues and comments to our <a href="https://github.com/ablab/rnaquast/issues">GitHub repository</a>.</p><br/><br/><br/><br/><br/> </body> </html>