Assignments for Wednesday:
Powerpoint slides on trees and tree building are here
Approaches to phylogenetic reconstruction
A) Distance analyses
- calculate pairwise distances
(different distance measures, correction for multiple hits, correction for codon bias)
- make distance matrix (table of pairwise corrected distances)
- calculate tree from distance matrix
i) using an optimality criterion
(e.g., smallest error between the distance matrix
and the distances in the tree), or
ii) using algorithmic approaches (UPGMA or neighbor joining)
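The steps listed under A) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular phylogenetics package: it assumes the Jukes-Cantor formula as the correction for multiple hits and UPGMA as the algorithmic tree-building step, and all function names (p_distance, jc69_distance, distance_matrix, upgma) are hypothetical. Only the tree topology is returned, without branch lengths.

```python
import math
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor correction for multiple hits: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        return float("inf")  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Table of pairwise corrected distances, keyed by (name, name)."""
    d = {(n, n): 0.0 for n in seqs}
    for a, b in combinations(seqs, 2):
        d[a, b] = d[b, a] = jc69_distance(p_distance(seqs[a], seqs[b]))
    return d

def upgma(names, d):
    """UPGMA clustering; returns the topology as nested tuples of names."""
    size = {n: 1 for n in names}  # cluster -> number of leaves it contains
    dist = {frozenset((a, b)): d[a, b] for a, b in combinations(names, 2)}
    while len(size) > 1:
        # join the two closest clusters
        a, b = sorted(min(dist, key=dist.get), key=str)
        merged = (a, b)
        for c in size:
            if c not in (a, b):
                # size-weighted average of the distances to the merged clusters
                dist[frozenset((merged, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / (size[a] + size[b])
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))
```

For example, calling distance_matrix on a dictionary of aligned sequences and passing the result to upgma together with the list of names yields a nested tuple such as (("A", "B"), ("C", "D")). Neighbor joining follows the same overall pattern but uses a different rule for choosing which pair to join.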
B) Parsimony analyses
find the tree that explains the sequence data with the minimum number of substitutions
(tree includes hypothesis of sequence at each of the nodes)
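Counting the minimum number of substitutions on one given tree can be sketched with Fitch's algorithm, which also produces the set of possible states at each internal node. The sketch below assumes a rooted binary tree written as nested 2-tuples of taxon names and equal cost for all substitutions; the function names fitch and parsimony_score are hypothetical. A full parsimony analysis would repeat this count over many candidate trees and keep the shortest.

```python
def fitch(tree, site):
    """Fitch pass for one alignment column.

    tree: a taxon name (leaf) or a 2-tuple of subtrees
    site: dict mapping taxon name -> character in this column
    Returns (set of possible states at this node, substitutions below it).
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_set, left_cost = fitch(tree[0], site)
    right_set, right_cost = fitch(tree[1], site)
    common = left_set & right_set
    if common:
        # the children agree on at least one state: no extra substitution
        return common, left_cost + right_cost
    # the children disagree: one substitution is required on this node's branches
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Minimum number of substitutions for the whole alignment on this tree."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )
```

Comparing parsimony_score for the three possible groupings of four taxa, for instance (("A", "B"), ("C", "D")) against (("A", "C"), ("B", "D")), shows which grouping requires the fewest changes.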
C) Maximum Likelihood analyses
given a model for sequence evolution, find the tree under which the observed sequence data have the highest probability (likelihood).
This approach can also be used to successively refine the model.
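The simplest case of the likelihood principle involves just two sequences, where the "tree" is a single branch and the only parameter is its length. The sketch below assumes the Jukes-Cantor model and uses a plain grid search instead of a proper optimizer; the function names are hypothetical. Real ML programs carry out the same kind of maximization over all branch lengths and model parameters, and over many tree topologies.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences at distance d under Jukes-Cantor.

    n_sites: number of aligned sites, n_diff: number of sites that differ.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability that a site ends in the same base
    p_diff = 0.25 - 0.25 * e  # probability of ending in one specific other base
    # 0.25 is the equilibrium frequency of the base in the first sequence
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def ml_distance(n_sites, n_diff, grid_step=0.001, d_max=3.0):
    """Grid search for the distance that maximizes the likelihood."""
    grid = [i * grid_step for i in range(1, int(d_max / grid_step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))
```

For this two-sequence case the maximum found by the search agrees, up to the grid resolution, with the analytical Jukes-Cantor distance -3/4 ln(1 - 4p/3), where p is the proportion of differing sites.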
Bayesian analyses use likelihood calculations to obtain posterior probabilities for trees, clades, and evolutionary parameters. MCMC approaches in particular have recently become very popular, because they allow one to estimate evolutionary parameters (e.g., which site in a virus protein is under positive selection) without assuming that one actually knows the "true" phylogeny.
D - ...) Else:
spectral analyses and evolutionary parsimony, i.e., methods that look only at patterns of substitutions.
Another way to categorize methods of phylogenetic reconstruction is to ask if they are using
- an optimality criterion (e.g.: smallest error between distance matrix and distances in tree, least number of steps), or
- algorithmic approaches (UPGMA or neighbor joining)
Introduction to Bayesian Analyses
An illustration of the usefulness of Bayesian thinking is here.
Paul Lewis, a colleague in Ecology and Evolutionary Biology at the University of Connecticut, is one of the pioneers in applying a Bayesian framework to the analysis of molecular data. His lecture notes for his Introduction to Bayesian Phylogenetics are here (about four hours of lecture). Essentially, the Bayesian approach tries to assess the probability of a model, a bipartition, or a range of parameter values given the data. This is in contrast to ML, which assesses the probability of the data given a model.
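The difference between the probability of the data given a model and the probability of a model given the data can be made concrete with Bayes' rule over a small set of hypotheses. In the sketch below, the function name posterior is hypothetical, and any numbers passed to it in an example (such as likelihoods for three alternative tree topologies) are invented purely for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of hypotheses.

    priors:      dict hypothesis -> P(H)
    likelihoods: dict hypothesis -> P(D | H)
    Returns:     dict hypothesis -> P(H | D)
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(D), the normalizing constant
    return {h: j / evidence for h, j in joint.items()}
```

With equal priors for three topologies and made-up likelihoods of 0.02, 0.01, and 0.01, the posterior probabilities come out as 0.5, 0.25, and 0.25: the likelihoods do not sum to one, but the posterior probabilities do.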
It has been shown that under some conditions a biased sampling of tree and parameter space converges on the posterior probability distribution. The approach most often used recently is Markov Chain Monte Carlo (MCMC) sampling. The principle is illustrated by a little program that Paul Lewis wrote, called MCRobot. This little robot runs around in a two-dimensional space over which different distributions can be defined. The walk of the robot is biased in such a way that the probability of finding the robot in a place is proportional to the defined distribution.
--- MCRobot demo in class ---
- Start the program. The black space you look at is the absolutely flat space where the robot walks. To start the robot, press "Ctrl-F" (you see the 100 steps that the robot took, connected to each other). To continue walking, press "Ctrl-N" [you can hold "Ctrl-N" for a continuous walk].
- Run the robot for a sufficient amount of time. How well does the robot explore the space?
- Now change the terrain by introducing hills. To do so, drag the mouse somewhere in the space and release it. The hill is depicted as yellow contours. Run the robot for some time. Do you observe any noticeable difference in the robot's behavior?
- Go to the "Chains" menu and toggle "2 chains". Now MCRobot will run two chains simultaneously (the cold chain [original, blue] and the heated chain [red]). Go to the "Show" menu and choose "All chains" to have both chains depicted. Run the chains for some time. Are there any differences in how the red and blue chains explore the space?
- Go to "Robot" -> "Options" and toggle the "Allow rotation" option. This allows rotation of the plane on which the robot walks [before, we always looked from the top]. To change the viewing angle, press the right mouse button and move the mouse. Now run the chains again. Are there any differences in how the red and blue chains explore the space when seen from this angle?
The programs that evaluate molecular sequences (e.g., MrBayes) do the same as MCRobot, but they walk around in tree and parameter space. For each place it visits, the program calculates the likelihood. The decision to accept or reject a step is based on the likelihood (more precisely, on how the likelihood of the proposed state, weighted by its prior, compares with that of the current state). From the evaluation of all the trees and parameters visited (minus a burn-in phase), one can calculate the posterior probabilities of trees and parameters.
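The robot's walk can be imitated with a few lines of Python. This is a minimal sketch, not MCRobot's or MrBayes' actual code: it assumes the Metropolis acceptance rule with a symmetric proposal, a single Gaussian hill at the origin as the target distribution, and hypothetical names (density, metropolis) and settings (step size, starting point, seed). A single cold chain is run; heated chains are not shown.

```python
import math
import random

def density(x, y):
    """Unnormalized target: one Gaussian 'hill' centered at the origin."""
    return math.exp(-(x * x + y * y) / 2.0)

def metropolis(n_steps, step=1.0, start=(5.0, 5.0), seed=1):
    """Random walk whose visit frequency is proportional to density()."""
    rng = random.Random(seed)
    x, y = start
    samples = []
    for _ in range(n_steps):
        # propose a symmetric random step from the current position
        nx = x + rng.uniform(-step, step)
        ny = y + rng.uniform(-step, step)
        # uphill steps are always accepted; downhill steps are accepted
        # with probability equal to the ratio of the two densities
        if rng.random() < density(nx, ny) / density(x, y):
            x, y = nx, ny
        samples.append((x, y))  # a rejected step counts the old position again
    return samples

def fraction_inside(samples, radius, burn_in):
    """Estimated probability of being within `radius` of the origin,
    after discarding the first `burn_in` samples."""
    kept = samples[burn_in:]
    return sum(x * x + y * y < radius * radius for x, y in kept) / len(kept)
```

Calling fraction_inside(metropolis(20000), 1.0, 2000) estimates the probability mass of the hill within one unit of its center from the visit counts alone. In a phylogenetic program the position would be a tree plus parameter values, and the fraction of sampled trees containing a given clade would estimate that clade's posterior probability.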
Additional material:
Paul Lewis (EEB - UConn) has written a very readable and thorough description of the Bayesian approach: from MCB/EEB372 class 22
Paul Lewis' MCRobot program that illustrates the MCMC approach to estimate posterior probabilities is here.
Olga's exercise on the value of Bayesian thinking is here.
For those interested in reading more about the application of probability mapping to comparative genome analyses: an article on the use of ML mapping in comparative genome analyses is here (see Figs. 1, 2, 3, 4, 7, and Tab. 4); an improved version of probability mapping that solves the problem of poor taxon sampling inherent in quartet analyses is here; and an article that describes the extension to more than four genomes is here.