For Friday: try to understand non-parametric bootstrapping
Friday class: morning session in Chemistry, afternoon session in the Computer Science Building (ITEB), Room C13. (If you do not want to be part of an experimental group using new software, pick the morning class.)
Questions on quiz 6?
Intro to phylogenetic reconstruction
1. Compilation of sequence dataset
2. Alignment
3. Determination of substitution model
4. Tree building
5. Tree evaluation
Baron Karl Friedrich Hieronymus von Münchhausen
Bootstrapping is one of the most popular ways to assess the reliability of branches. The term goes back to Baron Münchhausen, who is said to have pulled himself out of a swamp by his shoelaces. Briefly, positions of the aligned sequences are randomly sampled, with replacement, from the multiple sequence alignment. The sampled positions are assembled into new data sets of the same length as the original alignment, the so-called bootstrapped samples. Each position has about a 63% chance (1 - 1/e for long alignments) of making it into a particular bootstrapped sample. If a grouping has a lot of support, it will be supported by at least some positions in each of the bootstrapped samples, and (nearly) all of the bootstrapped samples will yield this grouping.

Bootstrapping has become very popular for assessing the reliability of reconstructed phylogenies. Its advantage is that it can be applied to all methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantage is that the support for individual groups decreases as you add more sequences to the dataset, and that it only measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, the error will be reproduced in every bootstrapped sample and will result in high bootstrap support values.
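The resampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy alignment and the function names are made up for this example, and the tree-building step that would follow each bootstrapped sample is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrapped sample and the column indices that were drawn.

    alignment: list of equal-length strings (one aligned sequence each).
    Columns (positions) are sampled with replacement, so some columns
    appear several times and others not at all.
    """
    n_sites = len(alignment[0])
    # Draw as many column indices as the alignment has positions.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    resampled = ["".join(seq[c] for c in columns) for seq in alignment]
    return resampled, columns


def inclusion_probability(n_sites):
    """Chance that a given position occurs at least once in one sample."""
    return 1.0 - (1.0 - 1.0 / n_sites) ** n_sites


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    # Hypothetical toy alignment, four sequences of ten positions.
    toy = ["ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTCC", "TCGAACGTCC"]
    sample, columns = bootstrap_alignment(toy, rng)
    print("columns drawn:", columns)
    for seq in sample:
        print(seq)
    print("inclusion probability for 1000 sites:",
          round(inclusion_probability(1000), 3))
```

In a real analysis one would build a tree from each of, say, 100 bootstrapped samples and report, for every branch, the percentage of those trees that contain it.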
Create a bootstrap sample on the blackboard
Continue with last year's class here on bootstrap and the shortcomings of trees calculated with clustalw.
Bootstrap and non-informative data (here)
Questions and answers on bootstrap
How many different groups of homologous proteins are there?
Problems: homology and detection of homology are two different things.
Paradox (?): If all genes evolved through duplication and diversification from the same first self-replicating RNA molecule, aren't all genes homologs?
At present there are about 2000 known types of protein folds in the PDB data banks. How many of these folds can be joined into a single class?
(See the earlier example of helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide-binding fold (picture); are they homologous?)
A) Maximum likelihood ratio test
The reconstruction of phylogenetic trees from molecular sequences requires that you assume a model that describes the evolutionary process. Often these assumptions are not clearly spelled out, and some claim that parsimony analysis does not assume a model at all: it just searches for the tree that explains the data with the least number of substitutions. However, an alternative view is that parsimony corresponds to a model in which all substitutions are equally likely. One of the major problems, especially if one wants to calibrate the data with respect to time, is the correction for multiple substitutions. The situation is complicated by two factors:
1. different sites along a sequence experience substitutions with different frequency
2. if a replacement occurs, the different types of replacements occur with different probabilities
Both of these considerations are valid for amino acid and nucleotide sequences. Taking both of these processes into account greatly improves the validity of the obtained trees. Two approaches have been used to address problem 1: assign different weights to different positions a priori (e.g. first, second, third codon position, or stem versus loop regions in rRNA), or have the program decide which distribution of among-site rate variation (ASRV) is present in the data and make the appropriate corrections for multiple substitutions.
The so-called gamma distribution has become very popular for this purpose. A good and readable overview was published by Z. Yang: The among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11: 367-372 (1996), here.
The gamma distribution is useful because a single parameter (the shape parameter a) continuously alters the character of the distribution. With a = infinity all sites change at the same rate; an extreme ASRV, where only a few sites vary and the majority of sites are invariant or change only very slowly, is obtained with a approaching 0. In between are cases that resemble exponential (a = 1), Poisson (a = 2), and normal distributions (a > 10). (See here for graphs.)
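To get a feeling for how the shape parameter a changes the distribution of rates, one can draw random site rates for a few values of a. The sketch below is an illustration only: it assumes the usual convention that the mean rate is fixed at 1 (scale = 1/a), and the function name and the cut-off of 0.1 for a "slow" site are chosen just for this example.

```python
import random
import statistics


def sample_site_rates(shape, n_sites, rng):
    """Draw relative substitution rates from a gamma distribution with mean 1."""
    # random.gammavariate takes (shape, scale); scale = 1/shape gives mean 1,
    # so the variance of the rates is 1/shape.
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    for shape in (0.2, 1.0, 2.0, 20.0):
        rates = sample_site_rates(shape, 10000, rng)
        slow = sum(r < 0.1 for r in rates) / len(rates)
        print(f"a = {shape:5.1f}  mean = {statistics.mean(rates):.2f}  "
              f"variance = {statistics.pvariance(rates):.2f}  "
              f"fraction of sites with rate < 0.1: {slow:.2f}")
```

With a small a most sites are nearly invariant while a few change very fast; with a large a the rates cluster tightly around 1, which approaches the equal-rates case.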
If you want to compare two models of evolution (this includes the tree) given a certain data set, you can use the so-called maximum likelihood ratio test. If L1 and L2 are the likelihoods of the two models, d = 2(log L1 - log L2) approximately follows a chi-square distribution with n degrees of freedom. Usually n is the difference in the number of model parameters (i.e., how many parameters are used to describe the substitution process). In particular, n can be the difference in the number of branches between two trees (one tree is more resolved than the other). In principle, this test can only be applied if one model is a more refined version of the other. However, if you compare two completely resolved trees that differ only in a single branch, you can, following a suggestion by Joe Felsenstein, use one degree of freedom. In the case of a molecular clock assumption, the model that assumes a clock has n-2 fewer parameters (here n is the number of sequences, not the degrees of freedom; as all sequences end up in the present at the same level, their branches cannot be freely chosen).
You can either look up the chi-square distribution in a table, or you can go to Paul Lewis' webpage (http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/chiscalc.exe).
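As an alternative to a table, the test statistic and its chi-square tail probability can be computed directly. The following is a minimal sketch that uses only the Python standard library and handles integer degrees of freedom; the function names and the log-likelihood values in the example are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X >= x) for a chi-square distribution.

    Works for positive integer degrees of freedom, using the recurrence
    Q(s + 1, y) = Q(s, y) + y**s * exp(-y) / Gamma(s + 1) with y = x / 2.
    """
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        s, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        s, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    while s < df / 2.0:
        q += math.exp(s * math.log(y) - y - math.lgamma(s + 1.0))
        s += 1.0
    return q


def likelihood_ratio_test(log_l_complex, log_l_simple, df):
    """Return d = 2(log L1 - log L2) and its chi-square tail probability."""
    d = 2.0 * (log_l_complex - log_l_simple)
    return d, chi2_sf(d, df)


if __name__ == "__main__":
    # Hypothetical log-likelihoods: model 1 has one parameter more than model 2.
    d, p = likelihood_ratio_test(-1234.5, -1238.9, df=1)
    print(f"d = {d:.2f}, p = {p:.4f}")
```

A small p means the simpler model is rejected in favor of the more refined one; the log-likelihoods must be natural logarithms, as reported by most phylogeny programs.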
B) Maximum likelihood mapping
An often-encountered problem in inspecting trees is the assessment of support for different groupings. E.g., does Giardia lamblia form the deepest branch within the known eukaryotes? Maximum likelihood mapping offers a graphic approach to this problem.
You can generate an ml-map for a single branch or for the complete dataset. The principle is that for each possible quartet, the probabilities for the three tree types are plotted in a simplex: Pi = Li/(L1+L2+L3), where Pi is the (kind of) posterior probability and Li is the likelihood of tree i; note that P1+P2+P3 = 1. If you generate a plot for the whole tree, you learn how many quartets are resolved with confidence; if you plot only a single branch, you learn how many quartets support each of the possible orientations of the branch.
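The calculation for a single quartet can be sketched as follows. This is an illustration, not the code used by any particular program: the function names, the placement of the triangle corners, and the log-likelihood values are assumptions made for this example.

```python
import math


def quartet_posteriors(log_likelihoods):
    """Compute Pi = Li / (L1 + L2 + L3) from the three log-likelihoods.

    The largest log-likelihood is subtracted first so that exp() does not
    underflow for values such as -1000.
    """
    top = max(log_likelihoods)
    weights = [math.exp(v - top) for v in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


# Corners of an equilateral triangle; corner i is the point where Pi = 1.
# (Assumed layout: tree 1 at the top, trees 2 and 3 at the bottom.)
CORNERS = [(0.5, math.sqrt(3) / 2), (0.0, 0.0), (1.0, 0.0)]


def simplex_point(p):
    """Map (P1, P2, P3) to x, y coordinates inside the triangle."""
    x = sum(pi * cx for pi, (cx, _) in zip(p, CORNERS))
    y = sum(pi * cy for pi, (_, cy) in zip(p, CORNERS))
    return x, y


if __name__ == "__main__":
    # Hypothetical log-likelihoods of the three topologies for one quartet.
    p = quartet_posteriors([-1000.0, -1002.0, -1005.0])
    print("P1, P2, P3 =", [round(v, 3) for v in p])
    print("point in the simplex:", simplex_point(p))
```

A quartet with three equal likelihoods lands in the center of the triangle (unresolved), while a quartet that strongly favors one topology lands close to that corner.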
E.g., if one wants to know if Giardia lamblia is the deepest branch within the eukaryotes, one can choose the "higher" eukaryotes as cluster a, another deep branching eukaryote (one that competes against Giardia) as cluster b, Giardia as cluster c, and the outgroup as cluster d. For an easy sample output see this sample ml-map. A more complicated result from the analysis of carbamoyl phosphate synthetase domains is here.
One can also use ml-mapping to illustrate the information content in a dataset consisting of many sequences. An example is here. Fig. 4 (simulation) and 5 (real data).
C) Bayesian posterior probabilities with TREEPUZZLE and MrBayes
The formula used by Strimmer and von Haeseler (here) to calculate posterior probabilities (i.e. the probability that tree topology Ti is true given an aligned set of four sequences) considers only three trees (i.e. branch lengths and topology), each with the same a priori probability. These three trees are those that have the highest likelihood for the three possible topologies. However, there are infinitely many other trees that differ from the three chosen ones only by differences in branch lengths. What is the effect on the calculated posterior probability to use only the single best tree as a representative of all the trees with the same topology? There is no a priori reason to exclude the other trees that have slightly lower likelihoods.
A different approach that does not make these assumptions is the use of Markov chain Monte Carlo (MCMC) methods to explore tree space. The program MrBayes, written by Huelsenbeck and Ronquist, performs such a random walk in tree space. Trees with a higher probability are visited more often than those with a lower probability. Some slides regarding ml mapping and Bayesian posterior probability are here.
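Why a random walk visits more probable trees more often can be illustrated with a toy Metropolis sampler. This is a deliberately simplified sketch and not how MrBayes is implemented: the "tree space" here consists of only three hypothetical topologies with made-up, unnormalized posterior weights, and every proposal picks one of the three at random.

```python
import random
from collections import Counter


def metropolis_walk(weights, n_steps, rng):
    """Random walk over states whose unnormalized probabilities are given.

    weights: dict mapping a state (e.g. a topology) to a positive number
    proportional to its posterior probability.
    Returns a Counter with the number of steps spent in each state.
    """
    states = list(weights)
    current = rng.choice(states)
    visits = Counter()
    for _ in range(n_steps):
        proposal = rng.choice(states)  # symmetric proposal
        # Accept with probability min(1, weight(proposal) / weight(current)).
        if rng.random() < weights[proposal] / weights[current]:
            current = proposal
        visits[current] += 1
    return visits


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so the run is repeatable
    # Hypothetical weights for the three topologies of a four-taxon tree.
    weights = {"((A,B),(C,D))": 6.0, "((A,C),(B,D))": 3.0, "((A,D),(B,C))": 1.0}
    visits = metropolis_walk(weights, 50000, rng)
    for topology, count in visits.most_common():
        print(topology, round(count / 50000, 3))
```

Only ratios of probabilities are needed for the acceptance step, which is why the normalizing constant (the probability of the data) never has to be calculated.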
If, after going through the slides in the last link, you are interested in exploring Bayes' theorem [i.e., the posterior probability of an event given the data = (the probability of the data given the event) times (the probability of the event) / (the probability of the data)], go to this illustration by Olga Zhaxybayeva.
You can use MrBayes to calculate the probabilities of trees with different topologies, of bipartitions, etc. If you calculate the consensus of all trees visited after the "burn in" phase, the percentage of trees that have a certain partition directly reflects the posterior probability of this partition.
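The counting step behind such a consensus can be sketched as below. The representation is an assumption made for this example (each sampled tree is a set of partitions, each partition written as the set of taxa on one side of a branch) and is not the output format of MrBayes; the function name and the sampled trees are hypothetical.

```python
def partition_support(sampled_trees, burn_in):
    """Fraction of post-burn-in trees that contain each partition.

    sampled_trees: list of trees in the order they were visited; each tree
    is a set of frozensets (the taxa on one side of a branch).
    burn_in: number of trees at the start of the list to discard.
    """
    kept = sampled_trees[burn_in:]
    if not kept:
        raise ValueError("no trees left after discarding the burn-in")
    counts = {}
    for tree in kept:
        for split in tree:
            counts[split] = counts.get(split, 0) + 1
    return {split: n / len(kept) for split, n in counts.items()}


if __name__ == "__main__":
    ab, cd, de = frozenset("AB"), frozenset("CD"), frozenset("DE")
    # Hypothetical chain over five taxa A-E: the first two trees are burn-in.
    chain = [{ab, cd}] * 2 + [{ab, de}] * 6 + [{ab, cd}] * 2
    for split, support in partition_support(chain, burn_in=2).items():
        print(sorted(split), support)
```

The resulting fractions are the numbers usually written on the branches of a Bayesian consensus tree.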
Paul Lewis (EEB - UConn) has written a very readable and thorough description of the Bayesian approach: from MCB/EEB372 class 22
Paul Lewis' MCRobot program that illustrates the MCMC approach to estimate posterior probabilities is here.
For those interested in reading more about the application of probability mapping to comparative genome analyses: an article on the use of ml mapping in comparative genome analyses is here (see Fig. 1, 2, 3, 4, 7, and Tab. 4); an improved version of probability mapping that solves the problem of poor taxon sampling inherent in quartet analyses is here; and an article that describes the extension to more than 4 genomes is here.