What does Molecular Evolution means?
Something changes in the "material" coming up with something else, "branching" from the original element along the timeline representing the lifetime of the elements
the mutation can result in something better or something that can be deadly for the newbe (survival of the fittest)
Simmetry is a dominant trait in the sistem
The system composed by DNA->RNA->Protein was dominant among all the others and became the main base from which evolution started
- Mutation
- Randomness
- Time
- Selection
- Constraints (physical)
Based on what we are seing right now we infer what came before us
The study of changes occurring in DNA and in its products is the object of study of Molecular Evolution
Sequence-wise we cannot infer something basing ourselves only on positional informations because it could be a case of pure coincidence
main terms
- orthologs (strong relationship) (a gene has the same function in different organisms)
- paralogs (weak relationship)
- homologs (deriving from the same tree)
Final goal: structural Alignment
- align two sequences
- find positionally and sequentially identical part
Sequence alignment have to deal with probabilistic similarities
Random sequences can have a 5-10% "similarity index"
Main paradigms
- Similarity matrix
- Gap cost
- both previous points concurr in the creation of the "Dynamic Programming Algorithm"
Software Jalview
each sequence is aligned to a side of the matrix. When there is a match between the x and y axis the point of the matrix is filled with a dot
To prevent checking the whole matrix a Sliding window is used
With approximation noise becomes more blurry and the useful information is highlighted (Blosum 62 => a particular matrix to overlap to the original one)
The alignment is computed in 2 steps
- Computation of the best solution in every box
- Backtracking: choice of the optimal path on the basis of data computed in the boxes
The difference between global and local alignment stands in box filling and in the choice of the backtracking starting point
Notation:
- x_i = i-th element of the sequence x
- y_j = j-th element of the sequence y
- x_(1..i) = Prefix of x from 1 to i
- F = optimal score matrix
- F(i,j) = optimal alignment x_(1..i) with y_(1..j)
- d = gap penalty
- s = scoring matrix
- Build F
- Initialize F(0,0) = 0; F(i,0) = -di; F(0,j)= -dj
- Fill the table from top-left to bottom-right corner using the recursive relationship F(i,j) = max( F(i-1,j-1)+s(x_i,y_i) , F(i-1,j)-d , F(i,j-1)-d )
The path always starts from the last cell. By definition it ends at the cell(1,1).
Shifts:
- Diagonal - both
- Up - gap up
- Sx - gap down
Discovered a decade after the global alignment
F(i,j) = max( 0, F(i-1,j-1)+s(x_i,y_j), F(i-1,j)-d, F(i,j-1)-d)
Local Alignment focuses on region where similarities are
The Traceback always starts from the cell with the highest score
Matrix initialization (first row or column) is done with 0s as in the “local” alignment
Matrix compilation is done as in the “global” alignment
Traceback always starts from the last row or column
Semi Global Alignment isn't as used as the prev ones since Local Alignment already does most of the work
Setting the upper-left side to 0s "frees" the match of the left side of the two sequences
Forcing to start from the low-right corner forces the match of the right side of the two sequences
Both properties are symmetrical
Sometimes it may be useful to have more than one alignment for two sequences. These alignments are called suboptimal.
It is possible to compute them using small variations during the backtracking phase. This phase is then reiterated to get different alignments, up to a threshold T
Similarity matrices are tables associating a similarity value to each substitution
The most common matrices are based on statistical methods indicating the substitution frequency among aminoacids in homologous protein families
Given p_a as the probability of finding residue a, the probability of a random substitution between residues is computed as the probability of two independent events => C(a,b) = p_a * p_b
Probability of a mutation between amminoacids a and b is observed on a set of homolog sequences => M(a,b) = p_(a,b)
The relationship between match and random can be expressed as "likelihood" or "odds ratio" => M(a,b)/C(a,b)
To keep the odd ratio more stable and get a larger number, we take the logarithm
PAM = Point accepted mutation
Matrices are built on homologous sequences showing only 1% of accepted mutations
Two sequences are at 1 PAM of distance if, on average, one mutation on 100 amminoacids has been detected.
BLOSUM = Blocks Amino Acid Substitution Matrices
Blosum is calculated with the same formula of PAM, but has better performances being based on direct observations
The number after BLOSUM name is the percentage of sequence identities that are held by the matrix
Two parameters are used for gaps:
- Gap open (γ) = opening the first gap of an indel
- Gap extension (δ) = extending an already existing indel
Gap penalty:d = γ + δ * (length(i)– 1)
- Global Alignment (red) impose an alignment which involves all the residues of the two sequences, no matter their similarity
- Local Alignment (green) on the other hand allows to align only most similar residues of the two sequences
- Semiglobal Alignment (blue) tries to combine both methods
Purposes:
- Demonstrate homology
- Molecular phylogeny
- Structural prediction
- Functional prediction
- Identification of functionally important sites
other thoughts:
- Usage of algorithms for the search of an optimal alignment between two sequences creates problems in its generation
If L is the length of the sequences, it would take O(L^N) units of time to align N sequences. Not feasible!
- Usageof euristic methods or progressive based on the hypotesis that the sequences to be aligned are phylogenetically correlated
- Pairwise alignment of all the starting sequences with:
- Approximate methods (n-tuple)
- Dynamic algorithm
- Scoring of the alignment used to build the phylogenetic tree
- Progressive alignment of the sequences according to the tree order
The tree is a dendogram of priority for the pairwise alignment of multiple sequences
Features:
- Sequence weighting
- Matrix score
- Special gap score
- Once an alignment is fixed is not modified in the subsequent steps. In particular, the gap location cannot change
- Initial errors are propagated in subsequent steps
- Initial phylogenetic trees are derived from distance matrices between pairs of independently aligned sequences. These are less reliable than phylogenetic trees derived from complete multiple sequence alignments
- Alignment errors depend on sequence similarities. Care must be taken in selecting input sequence to be real homologs and of comparable length to avoid the insertion of too many gaps
- If sequences are too divergent (< 25-30% sequence identity) progressive methods become unreliable
- CLUSTAL-OMEGA has replaced CLUSTALW
- T-COFFEE
- MAFFT
- EMBL-EBI (european bioinformatics institute)
- SIB (swiss institute of bioinformatics)
- PIR (protein information resource)
Swiss-Prot manually annotated and reviewed
vs
TrEMBL automatically annotated and not reviewed
Proteins can be classified into different groups based on:
- The FAMILIES to which they belong
- The DOMAINS they contain
- The SEQUENCE FEATURES they possess
A protein family is a group of proteins that share a common evolutionary origin
Domains are distinct functional and/or structural units in a protein
Usually they are responsible for a particular function or interaction, contributing to the overall role of a protein
Domains may exist in a variety of biological contexts, where similar domains can be found in proteins with different functions
Sequences features are group of amino acids that confer certain characteristics upon a protein, and may be important for its overall function
one of the most relevant InterPro database
Collection of multiple sequence alignment based on Hidden Markov Models
Interesting searches for non identical sequences
The main idea is that homologous proteins have a common ancestor and therefore share extensive regions of similarity
In most cases low levels of similarity need to be faced
It is not trivial to distinguish real and false homologs
SLIDE 28 L3
Considering a score system to compare sequences and rank similarities: in order to evaluate the significance of the score we need to assess the statistical significance of the result
Negative cases are naturally more abundant. Trivial reasons
Where to draw the Threshold Line
The Z-score is defined as:
Z = (opt_query-mu_random) / stddev_random
The Z-score quantifies the distance of the opt value from the mean as function of the standard deviation.
After sampling the scores, we will obtain a distribution similar to the normal one called extreme value distribution (Gumbel distribution)
To understand how significative is the score obtained from my real alignment with respect to the observed distribution, the E-value is used.
Instead of comparing the whole strings we search for a really small subsequence of the original sequence in the database of sequences discarding the ones not containing the choosen subsequence (i.e. the majority of sequences) obtaining an euristically decent solution (we need to consider that we may discard suboptimal solutions since the substring we choose may be the only different part)
The problem is no longer quadratic but linear
The algorithm considers only cases where two hits exists on the same diagonal at the same distance lower than an A parameter before looking for HSPs
To avoid loosing sensitivity, the T threshold has been lowered. This is useful since we tend to discard a lot of sequences a priori
- can we use this information to extract informations about the degree of sequence conservation for each aminoacid?
- there is an “evolutionary trace” derived from the natural selection of functional proteins?
- Then we will analyze the most common techniques to answer these questions, and above all sequence profiles
Given a protein family, how can we fix the information in the multiple sequence alignment in order to look for other sequences that are still unknown?
The most common alignment methods, even if they use profiles, i.e. they do not evaluate indels positions
Idea: create a Hidden Markov Model that best represents reality.
Columns can be treated as states in a hidden markov model with a certain probability of going from one to the next state (not backwards)
The probability from one state to another can be lower than 1, the complement probability takes us to another state allowing us to skip the next column/state or to insert another state an indefinite amount of times before going back to the original column/state
reminder: adding something is arbitrary, deletion is more precise since they work with blanks (fixed number of them)
All previous steps are used to create the HMM that the sequences need to match
An HMM represents a generalization of the profile concept.
- Insertion and deletion probabilities are different at each position
- HMM are used in the database Pfam
- The most popular program is HMMER
HMM can be built from the alignment
Ensembl connects with UniProt.