Sequences

Evolution

What does Molecular Evolution means?

Something changes in the "material" coming up with something else, "branching" from the original element along the timeline representing the lifetime of the elements

the mutation can result in something better or something that can be deadly for the newbe (survival of the fittest)

Simmetry is a dominant trait in the sistem

Fundamental Mechanism

The system composed by DNA->RNA->Protein was dominant among all the others and became the main base from which evolution started

Main features

Mutation
Randomness
Time
Selection
Constraints (physical)

Based on what we are seing right now we infer what came before us

The study of changes occurring in DNA and in its products is the object of study of Molecular Evolution

Sequence-wise we cannot infer something basing ourselves only on positional informations because it could be a case of pure coincidence

Gene Duplication

main terms

orthologs (strong relationship) (a gene has the same function in different organisms)
paralogs (weak relationship)
homologs (deriving from the same tree)

Sequence Alignment

Final goal: structural Alignment

align two sequences

find positionally and sequentially identical part

Sequence alignment have to deal with probabilistic similarities

Random sequences can have a 5-10% "similarity index"

Alignment

Main paradigms

Similarity matrix
Gap cost
both previous points concurr in the creation of the "Dynamic Programming Algorithm"

Visualizing alignment

Software Jalview

Dot matrix

each sequence is aligned to a side of the matrix. When there is a match between the x and y axis the point of the matrix is filled with a dot

To prevent checking the whole matrix a Sliding window is used

With approximation noise becomes more blurry and the useful information is highlighted (Blosum 62 => a particular matrix to overlap to the original one)

Dynamic programming

The alignment is computed in 2 steps

Computation of the best solution in every box
Backtracking: choice of the optimal path on the basis of data computed in the boxes

The difference between global and local alignment stands in box filling and in the choice of the backtracking starting point

Global Alignment

Notation:

x_i = i-th element of the sequence x
y_j = j-th element of the sequence y
x_(1..i) = Prefix of x from 1 to i
F = optimal score matrix
- F(i,j) = optimal alignment x_(1..i) with y_(1..j)
d = gap penalty
s = scoring matrix

Algorithm

Build F
Initialize F(0,0) = 0; F(i,0) = -di; F(0,j)= -dj
Fill the table from top-left to bottom-right corner using the recursive relationship F(i,j) = max( F(i-1,j-1)+s(x_i,y_i) , F(i-1,j)-d , F(i,j-1)-d )

GA-Backtracking

The path always starts from the last cell. By definition it ends at the cell(1,1).

Shifts:

Diagonal - both
Up - gap up
Sx - gap down

Local Alignment

Discovered a decade after the global alignment

F(i,j) = max( 0, F(i-1,j-1)+s(x_i,y_j), F(i-1,j)-d, F(i,j-1)-d)

Local Alignment focuses on region where similarities are

LA-Backtracking

The Traceback always starts from the cell with the highest score

Semi-Global Alignment

Matrix initialization (first row or column) is done with 0s as in the “local” alignment

Matrix compilation is done as in the “global” alignment

SGA-Backtracking

Traceback always starts from the last row or column

Semi Global Alignment isn't as used as the prev ones since Local Alignment already does most of the work

Setting the upper-left side to 0s "frees" the match of the left side of the two sequences

Forcing to start from the low-right corner forces the match of the right side of the two sequences

Both properties are symmetrical

Suboptimal Alignment

Sometimes it may be useful to have more than one alignment for two sequences. These alignments are called suboptimal.

It is possible to compute them using small variations during the backtracking phase. This phase is then reiterated to get different alignments, up to a threshold T

Similarity Matrices

Similarity matrices are tables associating a similarity value to each substitution

The most common matrices are based on statistical methods indicating the substitution frequency among aminoacids in homologous protein families

Given p_a as the probability of finding residue a, the probability of a random substitution between residues is computed as the probability of two independent events => C(a,b) = p_a * p_b

Probability of a mutation between amminoacids a and b is observed on a set of homolog sequences => M(a,b) = p_(a,b)

The relationship between match and random can be expressed as "likelihood" or "odds ratio" => M(a,b)/C(a,b)

To keep the odd ratio more stable and get a larger number, we take the logarithm

PAM matrices

PAM = Point accepted mutation

Matrices are built on homologous sequences showing only 1% of accepted mutations

Two sequences are at 1 PAM of distance if, on average, one mutation on 100 amminoacids has been detected.

Blosum matrices

BLOSUM = Blocks Amino Acid Substitution Matrices

Blosum is calculated with the same formula of PAM, but has better performances being based on direct observations

The number after BLOSUM name is the percentage of sequence identities that are held by the matrix

Affine Gap Costs

Two parameters are used for gaps:

Gap open (γ) = opening the first gap of an indel
Gap extension (δ) = extending an already existing indel

Gap penalty:d = γ + δ * (length(i)– 1)

Alignment differences

Global Alignment (red) impose an alignment which involves all the residues of the two sequences, no matter their similarity
Local Alignment (green) on the other hand allows to align only most similar residues of the two sequences
Semiglobal Alignment (blue) tries to combine both methods

Multiple sequence alignments

Purposes:

Demonstrate homology
Molecular phylogeny
Structural prediction
Functional prediction
Identification of functionally important sites

other thoughts:

Usage of algorithms for the search of an optimal alignment between two sequences creates problems in its generation

If L is the length of the sequences, it would take O(L^N) units of time to align N sequences. Not feasible!

Usageof euristic methods or progressive based on the hypotesis that the sequences to be aligned are phylogenetically correlated

CLUSTALW

Pairwise alignment of all the starting sequences with:
- Approximate methods (n-tuple)
- Dynamic algorithm
Scoring of the alignment used to build the phylogenetic tree
Progressive alignment of the sequences according to the tree order

The tree is a dendogram of priority for the pairwise alignment of multiple sequences

Features:

Sequence weighting
Matrix score
Special gap score

Progressive methods: disadvantages

Once an alignment is fixed is not modified in the subsequent steps. In particular, the gap location cannot change
Initial errors are propagated in subsequent steps
Initial phylogenetic trees are derived from distance matrices between pairs of independently aligned sequences. These are less reliable than phylogenetic trees derived from complete multiple sequence alignments
Alignment errors depend on sequence similarities. Care must be taken in selecting input sequence to be real homologs and of comparable length to avoid the insertion of too many gaps
If sequences are too divergent (< 25-30% sequence identity) progressive methods become unreliable

Other "multiple alignment" methods

CLUSTAL-OMEGA has replaced CLUSTALW
T-COFFEE
MAFFT

Sequence databases

Uniprot

EMBL-EBI (european bioinformatics institute)
SIB (swiss institute of bioinformatics)
PIR (protein information resource)

Swiss-Prot manually annotated and reviewed

vs

TrEMBL automatically annotated and not reviewed

InterPro

Proteins can be classified into different groups based on:

The FAMILIES to which they belong
The DOMAINS they contain
The SEQUENCE FEATURES they possess

A protein family is a group of proteins that share a common evolutionary origin

Domains are distinct functional and/or structural units in a protein

Usually they are responsible for a particular function or interaction, contributing to the overall role of a protein

Domains may exist in a variety of biological contexts, where similar domains can be found in proteins with different functions

Sequences features are group of amino acids that confer certain characteristics upon a protein, and may be important for its overall function

Pfam

one of the most relevant InterPro database

Collection of multiple sequence alignment based on Hidden Markov Models

Database searches

Interesting searches for non identical sequences

The main idea is that homologous proteins have a common ancestor and therefore share extensive regions of similarity

In most cases low levels of similarity need to be faced

It is not trivial to distinguish real and false homologs

SLIDE 28 L3

Statistical Significance

Considering a score system to compare sequences and rank similarities: in order to evaluate the significance of the score we need to assess the statistical significance of the result

Negative cases are naturally more abundant. Trivial reasons

Where to draw the Threshold Line

The Z-score is defined as:

Z = (opt_query-mu_random) / stddev_random

The Z-score quantifies the distance of the opt value from the mean as function of the standard deviation.

E-value

After sampling the scores, we will obtain a distribution similar to the normal one called extreme value distribution (Gumbel distribution)

To understand how significative is the score obtained from my real alignment with respect to the observed distribution, the E-value is used.

Instead of comparing the whole strings we search for a really small subsequence of the original sequence in the database of sequences discarding the ones not containing the choosen subsequence (i.e. the majority of sequences) obtaining an euristically decent solution (we need to consider that we may discard suboptimal solutions since the substring we choose may be the only different part)

The problem is no longer quadratic but linear

Two hit method

The algorithm considers only cases where two hits exists on the same diagonal at the same distance lower than an A parameter before looking for HSPs

To avoid loosing sensitivity, the T threshold has been lowered. This is useful since we tend to discard a lot of sequences a priori

Consensus Sequences

can we use this information to extract informations about the degree of sequence conservation for each aminoacid?
there is an “evolutionary trace” derived from the natural selection of functional proteins?
Then we will analyze the most common techniques to answer these questions, and above all sequence profiles

Color usage

Sequence logos

Beyond PSI-BLAST

Given a protein family, how can we fix the information in the multiple sequence alignment in order to look for other sequences that are still unknown?

The most common alignment methods, even if they use profiles, i.e. they do not evaluate indels positions

Idea: create a Hidden Markov Model that best represents reality.

Columns can be treated as states in a hidden markov model with a certain probability of going from one to the next state (not backwards)

The probability from one state to another can be lower than 1, the complement probability takes us to another state allowing us to skip the next column/state or to insert another state an indefinite amount of times before going back to the original column/state

reminder: adding something is arbitrary, deletion is more precise since they work with blanks (fixed number of them)

All previous steps are used to create the HMM that the sequences need to match

An HMM represents a generalization of the profile concept.

Insertion and deletion probabilities are different at each position

Application of HMMs

HMM are used in the database Pfam
The most popular program is HMMER

HMM can be built from the alignment

Nucleotide sequence databases

Ensembl

Ensembl connects with UniProt.

Files

2_sequences.md

Latest commit

History