README-pairagon

===========================================================
PAIRAGON - A PairHMM Based cDNA to Genome Alignment Program
===========================================================

Table of Contents
=================

1. INTRODUCTION
2. RUNNING PAIRAGON
   2.1 Required Software
   2.2 Required Components
   2.3 Running the Program
3. OUTPUT FORMATS
4. VERSION HISTORY
REFERENCES

1. INTRODUCTION
===============

Pairagon is a pair-HMM based cDNA to genome alignment program. 
It is written in C using the Twinscan/Pairagon libraries from 
the Laboratory of Computational Genomics, Washington University. 

There are two modes of aligning two sequences using Pairagon: 
 a) optimal Viterbi decoding, which guarantees an optimal alignment 
    under the given alignment scoring scheme, and 
 b) Stepping Stone Viterbi decoding (Meyer and Durbin, 2002), 
    which saves space and time at the cost of the optimality 
    guarantee.

There are also two different implementations of the Viterbi
algorithm:
 a) the standard Viterbi algorithm, which is faster but uses more
    memory, and
 b) the Treeterbi algorithm, which stores the Viterbi variables in
    a tree structure, thereby decreasing the memory requirements
    (500 MB has been the maximum in our experience).
2. RUNNING PAIRAGON
===================

2.1 Required Software
---------------------

2.1.1 Seed alignment programs

Running Pairagon using the Stepping Stone algorithm needs a seed
alignment program to generate the Stepping Stone regions of the
alignment. Pairagon can use any one of the following three
alignment programs. You need at least one of these to run the
Stepping Stone mode. The final alignment of two sequences might differ
from one seed alignment program to another if the programs produce
different seed alignments.

   BLAT:
   
   The following executables should be available in your path:
      (1) blat
      (2) faToTwoBit
      (3) pslSort
      (4) pslCDnaFilter
   
   GMAP:
   
   Pairagon uses the PSL style output from GMAP. Therefore you need
   the programs that manipulate PSL files as well.
   The following executables should be available in your path:
      (1) gmap
      (2) pslSort
      (3) pslCDnaFilter
   
   WU-BLAST:
   
   The following executables should be available in your path:
      (1) blastn
      (2) xdformat
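
Before running the Stepping Stone mode, it can help to confirm that the
executables listed above are actually on your path. The following is a
minimal sketch in Python, not part of the Pairagon distribution; the
function name missing_executables and the REQUIRED table are
illustrative, and the table simply restates the lists above.

```python
import shutil
from typing import Callable, List, Optional

# Executables listed in section 2.1.1 for each seed alignment program.
REQUIRED = {
    "BLAT": ["blat", "faToTwoBit", "pslSort", "pslCDnaFilter"],
    "GMAP": ["gmap", "pslSort", "pslCDnaFilter"],
    "WU-BLAST": ["blastn", "xdformat"],
}


def missing_executables(
    seed_program: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required executables that cannot be found on PATH."""
    return [exe for exe in REQUIRED[seed_program] if which(exe) is None]


if __name__ == "__main__":
    for program in REQUIRED:
        missing = missing_executables(program)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{program}: {status}")
```

This only checks that a program with the right name exists. It cannot
tell, for example, whether a blastn found on the path is the WU-BLAST
one, so treat an "ok" as a necessary check rather than a sufficient one.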

2.1.2 Perl

   You also need Perl to use our driver script.

2.1.3 GLIB2.0 (Recommended)

   We highly recommend that you install the glib2.0 library. The faster
   version of Pairagon (invoked using --unopt by the runPairagon.pl
   driver script -- see below) uses an efficient memory allocator
   from glib2.0. The default version does not need it. The included
   Makefile assumes that glib2.0 is installed on Linux boxes and that
   it is not installed on other architectures. If you do not
   want to install glib2.0 on your Linux box, or if you have glib2.0
   on a non-Linux box, you will have to modify the Makefile.

2.2 Required Components
-----------------------

   (1) Pairagon executable
   (2) Pairagon HMM parameter file
   (3) cDNA sequence
   (4) genomic sequence
   (5) seed alignment file for Stepping Stone (optional)


(1) Pairagon executable

Depending on the package you downloaded, you will have either the 
source code in the src/ directory or the executables in the bin/
directory. If you have the source code, run one of the following 
commands to build the executable:

linux:
make pairagon-linux

solaris:
make pairagon-sparc
(Choose your compiler and make sure that the right arguments 
are set for your compiler)

Mac OS X:
make pairagon-macosx
(Choose your compiler and make sure that the right arguments 
are set for your compiler)

Pairagon has been tested on these three architectures. If you have a 
different architecture, you can still try to build the executable, 
but we do not guarantee that Pairagon will compile and run on 
your architecture.

(2) Pairagon HMM parameter file

The HMM parameter file summarizes the pairHMM state model and the 
probabilities associated with it. Two parameter files are included 
in this distribution in the parameters/ directory:
 pairagon_simple.zhmm  - uses a simple dinucleotide model at the donor
                         and acceptor sites. These parameters were
                         generated by bootstrapped training of 
                         the model using 21249 alignments of human MGC 
                         clone sequences to the NCBI Build 34 of the 
                         human genome sequence available at the UCSC 
                         Genome browser. Since this parameter file is 
                         trained on high quality cDNA sequence, we do 
                         not make guarantees on the performance of 
                         Pairagon on EST sequences and low quality cDNA 
                         sequences.
 pairagon_branch.zhmm  - uses a position specific scoring matrix for donor
                         and acceptor sites and models the branch sites in
                         the U12 introns. These parameters were generated 
                         using 20945 Pairagon alignments of human MGC cDNA 
                         sequences to the NCBI build 34 of the human genome 
                         and the U12 intron model parameters were obtained 
                         from 405 U12 introns from Levine and Durbin (2001).
                         This is still an experimental version, and we have
                         only tested it on the human genome.

(3) cDNA sequence

The cDNA sequence should be in FASTA format.

(4) genomic sequence

The genomic sequence should be in FASTA format.

(5) seed alignment file for Stepping Stone (optional)

File listing the seed alignment to be used by the Stepping Stone 
algorithm. The seed alignment file has the following format:

>header information (same as the cDNA Fasta file's header)
genomic_boundary_start=<number> genomic_boundary_end=<number> strand={+|-}
count=n
(g1b, c1b) (g1e, c1e)
(g2b, c2b) (g2e, c2e)
...
(gnb, cnb) (gne, cne)

The second line specifies the subsequence of the genomic sequence that 
you want to use. Since the time and space complexity of Pairagon is 
proportional to the product of the sequence lengths, it helps to restrict 
the search space whenever possible. This is the only optional line; if 
it is absent, the whole genomic sequence is used. The strand keyword 
tells Pairagon which orientation of the cDNA it should run. The cDNA 
coordinates of the HSPs (high-scoring segment pairs; see below) refer 
to this strand of the cDNA.

The count=n line lists the number of HSPs in the seed alignment, and each line 
that follows lists the coordinates of the HSPs in the following format:
(hsp_genomic_begin, hsp_cdna_begin) (hsp_genomic_end, hsp_cdna_end)

It is important that the header information is the same as the header in 
the cDNA fasta file, since the program uses it to match the seed alignment to 
the right cDNA.
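
If you build seed alignment files yourself, the layout described above
can be produced with a few lines of code. The following is a minimal
sketch in Python, not part of the Pairagon distribution. The function
name format_seed and its arguments are illustrative. It assumes that
the header is passed without the leading '>' and that the coordinates
are already in whatever convention Pairagon expects, which this README
does not spell out.

```python
from typing import List, Optional, Tuple

# One HSP: (genomic_begin, cdna_begin, genomic_end, cdna_end)
HSP = Tuple[int, int, int, int]
# Optional second line: (genomic_boundary_start, genomic_boundary_end, strand)
Boundary = Tuple[int, int, str]


def format_seed(header: str, hsps: List[HSP],
                boundary: Optional[Boundary] = None) -> str:
    """Render one seed alignment record in the layout shown above."""
    lines = [">" + header]
    if boundary is not None:
        start, end, strand = boundary
        if strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        lines.append(
            f"genomic_boundary_start={start} "
            f"genomic_boundary_end={end} strand={strand}"
        )
    # count=n must match the number of HSP lines that follow.
    lines.append(f"count={len(hsps)}")
    for g_begin, c_begin, g_end, c_end in hsps:
        lines.append(f"({g_begin}, {c_begin}) ({g_end}, {c_end})")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    print(format_seed("cdna1",
                      [(100, 1, 250, 151), (900, 152, 1000, 252)],
                      boundary=(50, 1200, "+")), end="")
```

Deriving count from the HSP list, rather than passing it separately,
keeps the count=n line and the number of coordinate lines in step.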


2.3 Running the Program
-----------------------

STAND-ALONE PAIRAGON:

If you have all the files listed in the "Required Components" section, 
you can run pairagon in one of two ways:

Faster version (uses approximately 297 MB on our Linux box; might require 
several GB, depending on the input sequences)

	bin/pairagon parameters/pairagon_simple.zhmm examples/cdnatest1.fa examples/genomictest1.fa --seed=seed_file

Treeterbi version (uses approximately 14 MB on our Linux box, but takes 
longer to finish. We haven't seen it use more than 500 MB, irrespective 
of the length of the input sequences)

	bin/pairagon parameters/pairagon_simple.zhmm examples/cdnatest1.fa examples/genomictest1.fa --seed=seed_file -o

This runs Pairagon twice, once with the forward and once with the reverse 
cDNA sequence, in each case assuming that the cDNA sequence is in the 
sense orientation. The higher-scoring of the two alignments is selected 
and reported. If you have prior knowledge about the orientation of the 
cDNA in the alignment and the orientation of introns in the genomic 
sequence, you can specify them with --alignment_mode and --splice_mode, 
respectively, and only those modes will be tested.

USING THE PERL SCRIPT:

We have included a driver script for running Pairagon, runPairagon.pl, in the 
bin/ directory of the distribution. It is mostly useful for running the Stepping 
Stone implementation of Pairagon, since the globally optimal implementation only 
needs 3 files: parameters, genomic sequence and cDNA sequence.

Seed alignments for Stepping Stone can be obtained in two ways:

a) Running the seed alignment program
BLAT, GMAP or WU-BLAST will be run to generate the respective output files. The best locus
and other loci that are within 1% of the best locus will be chosen and Pairagon will be run
for each locus.
E.g.,
bin/runPairagon.pl --seed=BLAT --exedir bin --outdir examples --params parameters/pairagon_simple.zhmm --unopt examples/cdnatest1.fa examples/genomictest1.fa
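
The "within 1% of the best locus" rule can be pictured as a simple
score cutoff. The following is a minimal sketch in Python and is not
the code used by runPairagon.pl. It assumes that each locus carries a
single positive score and that "within 1%" means a score of at least
99% of the best score; this README does not define the measure, so
both points are assumptions. The name select_loci is illustrative.

```python
from typing import List, Tuple


def select_loci(scored_loci: List[Tuple[str, float]],
                tolerance: float = 0.01) -> List[str]:
    """Keep the best locus and every locus scoring within `tolerance` of it.

    Assumes positive scores where higher is better (an assumption, see
    the text above).
    """
    if not scored_loci:
        return []
    best = max(score for _, score in scored_loci)
    cutoff = best * (1.0 - tolerance)
    return [locus for locus, score in scored_loci if score >= cutoff]


if __name__ == "__main__":
    # Made-up loci and scores, for illustration only.
    loci = [("chr1:100-900", 1000.0),
            ("chr7:5-800", 995.0),
            ("chrX:1-700", 980.0)]
    print(select_loci(loci))
```

With the made-up scores above, the cutoff is 990, so the first two loci
are kept and Pairagon would be run once for each of them.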

b) Using batch output file from seed alignment program
BLAT and GMAP batch output files in PSL format can be fed to the script, and
it will choose the lines matching the cDNA sequence that is being aligned. Pairagon 
will be run for each alignment (locus) from BLAT/GMAP. This is useful for full genome alignments.
In this mode, any genomic sequence given explicitly is ignored; the script instead uses the
genomic sequence from the --genome directory corresponding to the locus.
E.g.,
bin/runPairagon.pl --seed=BLAT --seedfile=your_batch_file.psl --genome=your_genome_file.2bit --exedir bin --outdir examples --params parameters/pairagon_simple.zhmm --unopt examples/cdnatest1.fa 
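
To preview which lines of a batch PSL file belong to a given cDNA, you
can filter on the query name column. The following is a minimal sketch
in Python and is not the code used by runPairagon.pl; how the script
itself matches names is not described here. It assumes standard
21-column tab-separated PSL, where column 10 is the query name and
column 14 is the target name. The function name psl_rows_for_query is
illustrative.

```python
from typing import Iterable, List

PSL_COLUMNS = 21   # number of tab-separated fields in a standard PSL row
Q_NAME = 9         # 0-based index of the query (cDNA) name
T_NAME = 13        # 0-based index of the target (genomic) name


def psl_rows_for_query(lines: Iterable[str],
                       query_name: str) -> List[List[str]]:
    """Return the PSL rows whose query name equals `query_name`."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the optional psLayout header, separators and blank lines.
        if len(fields) != PSL_COLUMNS or not fields[0].isdigit():
            continue
        if fields[Q_NAME] == query_name:
            rows.append(fields)
    return rows


if __name__ == "__main__":
    import sys
    # Usage (hypothetical script name): python psl_filter.py batch.psl cdna_name
    with open(sys.argv[1]) as handle:
        for row in psl_rows_for_query(handle, sys.argv[2]):
            print(row[T_NAME], "\t".join(row), sep="\t")
```

The query name must match exactly, so a cDNA header that carries extra
description after the identifier may need trimming before comparison.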

We have also included BPdeluxe.pm (Zhang and Gish 2005) and FAlite.pm, Perl modules for parsing BLAST 
output and FASTA files, in the lib/perl5/ directory of the distribution. FAlite.pm is 
required to run runPairagon.pl successfully. BPdeluxe.pm is required if you use WU-BLAST as the seed
alignment program. Please make sure that the lib/perl5 directory
is in the library path of your Perl installation.
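
One common way to add a directory to Perl's library path is the
PERL5LIB environment variable. The following is a minimal sketch in
Python, not part of the Pairagon distribution, that builds an
environment with lib/perl5 prepended to PERL5LIB for launching the
driver script. The function name env_with_perl_lib is illustrative.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_perl_lib(lib_dir: str,
                      base_env: Optional[Mapping[str, str]] = None
                      ) -> Dict[str, str]:
    """Return a copy of the environment with lib_dir first in PERL5LIB."""
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = os.path.abspath(lib_dir)
    existing = env.get("PERL5LIB", "")
    # Keep any directories that were already listed, after ours.
    env["PERL5LIB"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


if __name__ == "__main__":
    print(env_with_perl_lib("lib/perl5")["PERL5LIB"])
```

The returned dictionary can be passed as the env argument of
subprocess.run when starting bin/runPairagon.pl from the top directory
of the distribution.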

You can run the example alignment using the driver script in one of two ways:

Viterbi version:
bin/runPairagon.pl --exedir bin --outdir examples --params parameters/pairagon_simple.zhmm --unopt examples/cdnatest1.fa examples/genomictest1.fa

Treeterbi version:
bin/runPairagon.pl --exedir bin --outdir examples --params parameters/pairagon_simple.zhmm examples/cdnatest1.fa examples/genomictest1.fa

After either run, the file examples/cdnatest1.fa.estgen contains the 
alignment in est_genome format, and examples/cdnatest1.fa.progress lists 
all the commands the pipeline executed to get the final alignment.

3. OUTPUT FORMATS
=================

The current implementation of Pairagon can generate two formats of 
output: the state sequence of the Viterbi parse, or the alignment in 
est_genome style output. Since there are parsers for est_genome output, 
you can parse our outputs using them. We also include a program 
pairagon2estgen that converts the Viterbi parse into est_genome style 
output. You can run it by typing

pairagon2estgen examples/cdnatest1.pair -cdna=examples/cdnatest1.fa -genomic=examples/genomictest1.fa

4. VERSION HISTORY
==================

Pairagon 1.01: 19 July 2007
 Release Update
 Bugfix
Pairagon 1.0: 29 August 2006
 Release Update
Pairagon 0.99 beta: 06 July 2006
 Model changes including branch points for U12 introns
Pairagon 0.95 beta: 22 June 2006
 Added Treeterbi decoding
 Several memory and speed optimizations
Pairagon 0.7 alpha: 12 January 2006
 Several speed optimizations; new parameter file
Pairagon 0.5 alpha: 02 June 2005
 Alpha release made public.

REFERENCES
==========
A. Levine and R. Durbin. A computational scan for U12-dependent introns in the
   human genome sequence, Nucleic Acids Research (2001) 29(19), 4006-4013
I.M. Meyer and R. Durbin. Comparative ab initio prediction of gene structures 
   using pair HMMs, Bioinformatics (2002) 18(10), 1309-1318 
M. Zhang and W. Gish. Improved spliced alignment from an information theoretic
   approach, Bioinformatics (2005) 22(1), 13-20