===========================================================
PAIRAGON - A PairHMM Based cDNA to Genome Alignment Program
===========================================================
Table of Contents
=================
1. INTRODUCTION
2. RUNNING PAIRAGON
2.1 Required Software
2.2 Required Components
2.3 Running the Program
3. OUTPUT FORMATS
4. VERSION HISTORY
REFERENCES
1. INTRODUCTION
===============
Pairagon is a pair-HMM based cDNA to genome alignment program.
It is written in C using the Twinscan/Pairagon libraries from
the Laboratory of Computational Genomics, Washington University.
There are two modes of aligning two sequences using Pairagon:
a) optimal Viterbi decoding, which guarantees optimal alignment
subject to the given alignment scoring scheme, and
b) Stepping Stone Viterbi decoding (Meyer and Durbin, 2002)
that saves space and time by compromising on the optimality
guarantee.
There are also two different implementations of the Viterbi
algorithm:
a) standard Viterbi algorithm (that is faster, but uses more
memory)
b) Treeterbi algorithm (that is slower, but uses a tree structure
to store the Viterbi variables, thereby decreasing the memory
requirements; 500 MB has been the maximum in our experience)
2. RUNNING PAIRAGON
===================
2.1 Required Software
---------------------
2.1.1 Seed alignment programs
Running Pairagon using the Stepping Stone algorithm requires a seed
alignment program to generate the Stepping Stone regions of the
alignment. Pairagon can use any one of the following three
alignment programs, and you need at least one of them to run the
Stepping Stone mode. Because different seed programs can produce
different seed alignments, the final alignment of the same two
sequences may depend on which one you use.
BLAT:
The following executables should be available in your path:
(1) blat
(2) faToTwoBit
(3) pslSort
(4) pslCDnaFilter
GMAP:
Pairagon uses the PSL style output from GMAP. Therefore you need
the programs that manipulate PSL files as well.
The following executables should be available in your path:
(1) gmap
(2) pslSort
(3) pslCDnaFilter
WU-BLAST:
The following executables should be available in your path:
(1) blastn
(2) xdformat
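The executable lists above can be checked before starting a run. The
following is a minimal sketch in Python, not part of the Pairagon
distribution; the function names are hypothetical, and only the
executable names are taken from the lists above.

```python
import shutil

# Executables each seed aligner needs, as listed in section 2.1.1.
REQUIRED_EXECUTABLES = {
    "BLAT": ["blat", "faToTwoBit", "pslSort", "pslCDnaFilter"],
    "GMAP": ["gmap", "pslSort", "pslCDnaFilter"],
    "WU-BLAST": ["blastn", "xdformat"],
}


def missing_executables(aligner, which=shutil.which):
    """Return the required executables for `aligner` that are not on PATH.

    `which` is injectable so the check can be exercised without the real tools.
    """
    return [exe for exe in REQUIRED_EXECUTABLES[aligner] if which(exe) is None]


def usable_aligners(which=shutil.which):
    """Return the seed aligners whose executables were all found."""
    return [name for name in REQUIRED_EXECUTABLES
            if not missing_executables(name, which)]


if __name__ == "__main__":
    # Report the status of every aligner on this machine.
    for name in REQUIRED_EXECUTABLES:
        missing = missing_executables(name)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

Stepping Stone mode needs only one aligner to be complete, so a
non-empty result from usable_aligners() is enough to proceed.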
2.1.2 Perl
You also need Perl to use our driver script.
2.1.3 GLIB2.0 (Recommended)
We highly recommend that you install the glib2.0 library. The faster
version of Pairagon (invoked using --unopt by the runPairagon.pl
driver script, see below) uses an efficient memory allocator
from glib2.0. The default version does not need it. The included
Makefile assumes that glib2.0 is installed on Linux machines and that
it is not installed on other architectures. If you do not
want to install glib2.0 on your Linux machine, or if you have glib2.0
on a non-Linux machine, you will have to modify the Makefile.
2.2 Required Components
-----------------------
(1) Pairagon executable
(2) Pairagon HMM parameter file
(3) cDNA sequence
(4) genomic sequence
(5) seed alignment file for Stepping Stone (optional)
(1) Pairagon executable
Depending on the package you downloaded, you will have either the
source code in the src/ directory or the executables in the bin/
directory. If you have the source code, run one of the following
commands to make the executable:
linux:
make pairagon-linux
solaris:
make pairagon-sparc
(Choose your compiler and make sure that the right arguments
are set for your compiler)
Mac OS X:
make pairagon-macosx
(Choose your compiler and make sure that the right arguments
are set for your compiler)
Pairagon has been tested on these three architectures. If you have a
different architecture, you can still try to make the executable,
but we do not guarantee that Pairagon will compile and run on
your architecture.
(2) Pairagon HMM parameter file
The HMM parameter file describes the pairHMM state model and the
probabilities associated with it. Two parameter files are included
in this distribution in the parameters/ directory:
pairagon_simple.zhmm - uses a simple dinucleotide model at the donor
and acceptor sites. These parameters were
generated by bootstrapped training of
the model using 21249 alignments of human MGC
clone sequences to the NCBI Build 34 of the
human genome sequence available at the UCSC
Genome browser. Since this parameter file is
trained on high quality cDNA sequence, we do
not make guarantees on the performance of
Pairagon on EST sequences and low quality cDNA
sequences.
pairagon_branch.zhmm - uses a position specific scoring matrix for donor
and acceptor sites and models the branch sites in
the U12 introns. These parameters were generated
using 20945 Pairagon alignments of human MGC cDNA
sequences to the NCBI build 34 of the human genome
and the U12 intron model parameters were obtained
from 405 U12 introns from Levine and Durbin (2001).
This is still an experimental version, and we have
only tested it on the human genome.
(3) cDNA sequence
The cDNA sequence should be in FASTA format.
(4) genomic sequence
The genomic sequence should be in FASTA format.
(5) seed alignment file for Stepping Stone (optional)
File listing the seed alignment to be used by the Stepping Stone
algorithm. The seed alignment file has the following format:
>header information (same as the cDNA Fasta file's header)
genomic_boundary_start=<number> genomic_boundary_end=<number> strand={+|-}
count=n
(g1b, c1b) (g1e, c1e)
(g2b, c2b) (g2e, c2e)
...
(gnb, cnb) (gne, cne)
The second line specifies the subsequence of the genomic sequence that
you want to use. Since the time and space complexity of Pairagon is
proportional to the product of the sequence lengths, it helps to restrict
the search space whenever possible. This line is the only one that is
optional; if it is absent, the whole genomic sequence is used.
The strand keyword tells Pairagon which
orientation of the cDNA it should run. The cDNA coordinates of the
HSPs (see below) refer to this strand of the cDNA.
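To make the effect of the genomic boundary concrete, here is a small
worked example in Python. The sequence lengths and boundary coordinates
are hypothetical, the function names are not part of Pairagon, and the
boundary is assumed to be 1-based and inclusive.

```python
def dp_cells(cdna_length, genomic_length):
    """Number of (cDNA position, genomic position) pairs the alignment spans."""
    return cdna_length * genomic_length


def restriction_factor(cdna_length, genomic_length, boundary_start, boundary_end):
    """How many times smaller the search space becomes when only the genomic
    interval [boundary_start, boundary_end] is aligned instead of the whole
    genomic sequence (1-based, inclusive coordinates are an assumption)."""
    window = boundary_end - boundary_start + 1
    return dp_cells(cdna_length, genomic_length) / dp_cells(cdna_length, window)


# Hypothetical sizes: a 2,000 bp cDNA against a 1,000,000 bp genomic
# sequence, restricted to a 50,000 bp window.
print(dp_cells(2000, 1000000))                             # 2000000000
print(restriction_factor(2000, 1000000, 100001, 150000))   # 20.0
```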
The count=n line lists the number of HSPs in the seed alignment, and each line
that follows lists the coordinates of the HSPs in the following format:
(hsp_genomic_begin, hsp_cdna_begin) (hsp_genomic_end, hsp_cdna_end)
It is important that the header information is the same as the header in
the cDNA fasta file, since the program uses it to match the seed alignment to
the right cDNA.
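The seed file layout described above can be produced with a few lines
of Python. This is a sketch under assumptions, not a tool shipped with
Pairagon: the function names are hypothetical, and the exact spacing
inside the coordinate pairs simply copies the layout shown above, so
check it against a seed file that Pairagon is known to accept.

```python
def format_seed(header, hsps, boundary=None, strand="+"):
    """Build seed-file text for one cDNA.

    header   -- cDNA FASTA header without the leading '>'
    hsps     -- list of ((g_begin, c_begin), (g_end, c_end)) coordinate pairs
    boundary -- optional (genomic_start, genomic_end); when None the optional
                second line is omitted and the whole genomic sequence is used
    strand   -- '+' or '-'; only written together with the boundary line
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    lines = [">" + header]
    if boundary is not None:
        start, end = boundary
        lines.append(
            f"genomic_boundary_start={start} genomic_boundary_end={end} strand={strand}"
        )
    lines.append(f"count={len(hsps)}")
    for (g_begin, c_begin), (g_end, c_end) in hsps:
        lines.append(f"({g_begin}, {c_begin}) ({g_end}, {c_end})")
    return "\n".join(lines) + "\n"


def fasta_header(fasta_text):
    """Return the first FASTA header line without its leading '>'."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            return line[1:]
    raise ValueError("no FASTA header found")
```

Taking the header from the cDNA file with fasta_header() and passing it
straight to format_seed() keeps the two headers identical, which is what
the program relies on to match a seed alignment to its cDNA.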
2.3 Running the Program
-----------------------
STAND-ALONE PAIRAGON:
If you have all the files listed in the "Required Components" section,
you can run pairagon in one of two ways:
Faster version (uses approximately 297 MB on our linux box, might require
several GBs depending on the input sequences)
bin/pairagon parameters/pairagon_simple.zhmm examples/cdnatest1.fa examples/genomictest1.fa --seed=seed_file
Treeterbi version (uses approximately 14 MB on our linux box, takes longer
to finish. We haven't seen it use more than 500 MB, irrespective of the
length of the input sequences)
bin/pairagon parameters/pairagon_simple.zhmm examples/cdnatest1.fa examples/genomictest1.fa --seed=seed_file -o
Either command runs Pairagon in two iterations, once with the forward and
once with the reverse cDNA sequence, assuming in both cases that the cDNA
sequence is in the sense orientation. The higher-scoring of the two
alignments is selected and reported. If you have prior
knowledge about the orientation of the cDNA in the alignment and the
orientation of introns in the genomic sequence, you can specify them with
--alignment_mode and --splice_mode, respectively, and only those modes
will be tested.
USING THE PERL SCRIPT:
We have included a script in the bin/ directory of the distribution,
runPairagon.pl, which is a driver file for running Pairagon. It is mostly useful
for running the Stepping Stone implementation of Pairagon, since the global optimal
implementation only needs 3 files: parameters, genomic sequence and cDNA sequence.
Seed alignments for Stepping Stone can be obtained in two ways:
a) Running the seed alignment program
BLAT, GMAP or WU-BLAST will be run to generate the respective output files. The best locus
and other loci that are within 1% of the best locus will be chosen and Pairagon will be run
for each locus.
E.g.,
bin/runPairagon.pl --seed=BLAT --exedir bin --outdir examples --params parameters/pairagon_simple.zhmm --unopt examples/cdnatest1.fa examples/genomictest1.fa
b) Using batch output file from seed alignment program
BLAT and GMAP batch output files in PSL format can be fed into the script, and
it will choose the lines matching the cDNA sequence that is being aligned. Pairagon
will be run for each alignment (locus) from BLAT/GMAP. This is useful for full genome alignments.
In this mode any genomic sequence given explicitly is ignored; the script uses the genomic
sequence from the --genome directory corresponding to the locus.
E.g.,
bin/runPairagon.pl --seed=BLAT --seedfile=your_batch_file.psl --genome=your_genome_file.2bit --exedir bin --outdir examples --params parameters/pairagon_simple.zhmm --unopt examples/cdnatest1.fa
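The locus selection in option a) ("the best locus and other loci that
are within 1% of the best locus") can be pictured with the sketch
below. This README does not say which quantity the 1% applies to, so
the sketch assumes a positive seed alignment score where higher is
better; the function name and the example loci are hypothetical, and
the code is not taken from runPairagon.pl.

```python
def select_loci(locus_scores, tolerance=0.01):
    """Keep the best-scoring locus and any locus within `tolerance` of it.

    locus_scores -- dict mapping a locus name to a positive score where
                    higher is better (an assumption, see the text above)
    """
    if not locus_scores:
        return []
    best = max(locus_scores.values())
    cutoff = best * (1.0 - tolerance)
    kept = [name for name, score in locus_scores.items() if score >= cutoff]
    # Report the best locus first.
    return sorted(kept, key=lambda name: locus_scores[name], reverse=True)


# Hypothetical loci and scores.
scores = {"chr1:1000-5000": 1000, "chr7:200-4100": 995, "chrX:10-3000": 900}
print(select_loci(scores))
```

Under these assumptions Pairagon would be run once for each locus that
select_loci() returns.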
We have also included BPdeluxe.pm (Zhang and Gish 2005) and FAlite.pm, Perl modules for parsing BLAST
output and FASTA files, in the lib/perl5/ directory of the distribution. FAlite.pm is
required to run runPairagon.pl. BPdeluxe.pm is required if you use WU-BLAST as the seed
alignment program. Please make sure that the lib/perl5 directory
is in the library path of your Perl installation.
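One way to put lib/perl5 on Perl's library path is the PERL5LIB
environment variable. The sketch below shows how a Python wrapper could
set it and assemble a driver command from the options used in the
examples in this README. The wrapper itself and its function names are
hypothetical; it assumes it is started from the top directory of the
distribution.

```python
import os


def perl_env(lib_dir, base_env=None):
    """Return a copy of the environment with `lib_dir` prepended to PERL5LIB."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PERL5LIB")
    env["PERL5LIB"] = lib_dir if not existing else lib_dir + os.pathsep + existing
    return env


def driver_command(cdna, genomic, params, script="bin/runPairagon.pl",
                   exedir="bin", outdir="examples", seed=None, unopt=False):
    """Assemble a runPairagon.pl command line as a list of arguments."""
    cmd = [script]
    if seed is not None:
        cmd.append("--seed=" + seed)
    cmd += ["--exedir", exedir, "--outdir", outdir, "--params", params]
    if unopt:
        cmd.append("--unopt")  # faster Viterbi version, needs glib2.0
    cmd += [cdna, genomic]
    return cmd


# To actually run the driver (requires Perl and the Pairagon distribution):
#   import subprocess
#   subprocess.run(driver_command("examples/cdnatest1.fa",
#                                 "examples/genomictest1.fa",
#                                 "parameters/pairagon_simple.zhmm"),
#                  env=perl_env("lib/perl5"), check=True)
```

Setting PERL5LIB in the shell before calling runPairagon.pl directly has
the same effect.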
You can run the example alignment using the driver script in one of two ways:
Viterbi version:
bin/runPairagon.pl --exedir bin --outdir examples --params parameters/pairagon_simple.zhmm --unopt examples/cdnatest1.fa examples/genomictest1.fa
Treeterbi version:
bin/runPairagon.pl --exedir bin --outdir examples --params parameters/pairagon_simple.zhmm examples/cdnatest1.fa examples/genomictest1.fa
File examples/cdnatest1.fa.estgen contains the alignment in est_gen format.
examples/cdnatest1.fa.progress lists all the commands the pipeline
executed to get the final alignment.
3. OUTPUT FORMATS
=================
The current implementation of Pairagon can generate two output
formats: the state sequence of the Viterbi parse, or the alignment in
est_genome style. Existing parsers for est_genome output can therefore
be used on Pairagon's est_genome style output. We also include a program,
pairagon2estgen, that converts the Viterbi parse into est_genome style
output. You can run it by typing:
pairagon2estgen examples/cdnatest1.pair -cdna=examples/cdnatest1.fa -genomic=examples/genomictest1.fa
4. VERSION HISTORY
==================
Pairagon 1.01: 19 July 2007
Release Update
Bugfix
Pairagon 1.0: 29 August 2006
Release Update
Pairagon 0.99 beta: 06 July 2006
Model changes including branch points for U12 introns
Pairagon 0.95 beta: 22 June 2006
Added Treeterbi decoding
Several memory and speed optimizations
Pairagon 0.7 alpha: 12 January 2006
Several speed optimizations; new parameter file
Pairagon 0.5 alpha: 02 June 2005
Alpha release made public.
REFERENCES
==========
A. Levine and R. Durbin. A computational scan for U12-dependent introns in the
human genome sequence, Nucleic Acids Research (2001) 29(19), 4006-4013
I.M. Meyer and R. Durbin. Comparative ab initio prediction of gene structures
using pair HMMs, Bioinformatics (2002) 18(10), 1309-1318
M. Zhang and W. Gish. Improved spliced alignment from an information theoretic
approach, Bioinformatics (2005) 22(1), 13-20