Assignments

  • for Wednesday, read: Why should/might you care about Molecular Evolution
  • Contemplate optional essay assignment
  • "Quiz" 6-- due on Monday

Review: What is in a tree?

Trees are often used to depict the evolutionary history of organisms, species and molecules. (see slides)

  • Trees can be either rooted or unrooted (at least the ones calculated from molecular data :-)).
  • The assumption of a molecular clock is usually not justified a priori.
  • Gene tree - species tree - genealogy
  • Lineage sorting
  • Gene Duplications
  • HGT
  • Trees from molecular data are usually calculated as unrooted trees (at least they should be - if they are not, this is usually a mistake). To root a tree you can either assume a molecular clock (substitutions occur at a constant rate; again, this assumption is usually not warranted and needs to be tested), or you can use an outgroup (i.e. something that you know forms the deepest branch).

For example, to root a phylogeny of birds, you could use the homologous characters from a reptile as outgroup; to find the root in a tree depicting the relations between different human mitochondria, you could use the mitochondria from chimpanzees or from Neanderthals as an outgroup; to root a phylogeny of alpha hemoglobins you could use a beta hemoglobin sequence, or a myoglobin sequence as outgroup.

  • Trees have a branching pattern (also called the topology), and branch lengths. Often the branch lengths are ignored in depicting trees (these trees often are referred to as cladograms - note that cladograms should be considered rooted).
    You can swap branches attached to a node, and you can depict the tree as rooted in any branch you like without changing the tree.

Tree exercise: Which of these trees are identical, when you consider them as unrooted and only consider the topology? here
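
One way to make "identical as unrooted topologies" concrete is to compare the sets of bipartitions (splits) that the internal branches induce on the taxa: two trees on the same taxa have the same unrooted topology exactly when these sets agree. The following is a minimal Python sketch, not part of the exercise itself; the nested-tuple tree representation and all function names are assumptions chosen for illustration.

```python
def leaf_set(tree):
    """Return the set of taxon names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Collect the bipartitions induced by internal branches.

    Each split is stored as the side that does NOT contain a fixed
    reference taxon, so the result does not depend on where the tree
    happens to be drawn as rooted or on the order of the children.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        side = all_leaves - clade if reference in clade else clade
        # Splits separating fewer than two taxa carry no topological information.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def same_unrooted_topology(tree_a, tree_b):
    """True if both trees have the same taxa and the same set of splits."""
    return (leaf_set(tree_a) == leaf_set(tree_b)
            and nontrivial_splits(tree_a) == nontrivial_splits(tree_b))


# Hypothetical example trees: t1 and t2 differ only in where they are rooted.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ("A", ("B", ("C", ("D", "E"))))
t3 = (("A", "C"), "B", ("D", "E"))
print(same_unrooted_topology(t1, t2))  # True
print(same_unrooted_topology(t1, t3))  # False
```

Branch lengths are ignored here on purpose, because the exercise asks about topology only.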

While many trees have identical topologies, there is an enormous number of possible different tree topologies even for a rather small number of terminal taxa. An illustrative table is here.
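
Numbers of this kind follow from the standard counting result for strictly bifurcating trees with labeled tips: for n taxa there are (2n-5)!! = 1 x 3 x 5 x ... x (2n-5) unrooted topologies and (2n-3)!! rooted ones. A small Python sketch (the function names are illustrative, not from the linked table):

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies for n labeled taxa (n >= 3)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return odd_double_factorial(2 * n_taxa - 5)


def num_rooted_trees(n_taxa):
    """Number of rooted bifurcating topologies for n labeled taxa (n >= 2)."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return odd_double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For 10 taxa this already gives 2,027,025 unrooted topologies, which is why exhaustive searches quickly become impractical.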


IMPORTANT TERMS IN MOLECULAR EVOLUTION

Evolution of protein families:
      Homology (shared ancestry) versus  Analogy (convergent evolution)

Homology: Two sequences are homologous, if there existed an ancestral molecule in the past that is ancestral to both of the sequences

Homology is a "yes" or "no" character (don't know is also possible). Either sequences (or characters) share ancestry or they don't (like pregnancy). Molecular biologists often use homology as synonymous with similarity or percent identity. One often reads: sequences A and B are 70% homologous. To an evolutionary biologist this sounds as wrong as 70% pregnant.

Types of Homology

Especially with respect to molecular evolution the following types of homology are really important!
(Especially the ones in bold. Yes, it will be in the final!):

Orthology: bifurcation in molecular tree reflects speciation
Paralogy: bifurcation in molecular tree reflects gene duplication
Xenology: gene was obtained by organism through horizontal transfer
Synology: genes ended up in one organism through fusion of lineages.

Orthologs: bifurcation in molecular tree reflects speciation. These are the molecules people interested in the taxonomic classification of organisms want to study.

Paralogs: bifurcation in molecular tree reflects gene duplication. The study of paralogs and their distribution in genomes provides clues on the way genomes evolved.
Gene and genome duplications have emerged as the most important pathway to molecular innovation, including the evolution of developmental pathways.

Xenologs: gene was obtained by organism through horizontal transfer. The classic examples of xenologs are antibiotic resistance genes, but the history of many other molecules also fits into this category: inteins, self-splicing introns, transposable elements, ion pumps, and other transporters.

Synologs: genes ended up in one organism through fusion of lineages. The paradigmatic examples are genes that were transferred into the eukaryotic cell together with the endosymbionts that evolved into mitochondria and plastids.
(the -logs are often spelled with "ue" like in orthologues)

Discussion and examples from Fitch's article (TIG 2000, Fig. 1, see reading assignment). See also globin trees.


How many different groups of homologous proteins are there?
 Problem: homology and the detection of homology are two different things.

Paradox (?): If all genes evolved through duplication and diversification from the same first self replicating RNA molecule, aren't all genes homologs?

At present there are about 2000 known types of protein folds in the PDB data bank. How many of these folds can be joined into a single class?
(See the earlier example of helicase and F1-ATPase: both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (picture); are they homologous? See here for recent speculation.)

Intro to phylogenetic reconstruction

Phylogenetic analysis is an inference of evolutionary relationships between organisms.
Those relationships are usually represented by tree-like diagrams. Note: the assumption that evolution is tree-like is controversial.

Steps of the phylogenetic analysis:


Compilation of sequence dataset
Alignment
Determination of substitution model
Tree building
Tree evaluation

 

 

 

Why phylogenetic reconstruction of molecular evolution?

    1. Systematic classification of organisms

      e.g.: Who were the first angiosperms? (i.e. where are the first angiosperms located relative
      to present day angiosperms?)

      Where in the tree of life is the last common ancestor located?

    2. Evolution of molecules

e.g.: domain shuffling, reassignment of function, gene duplications, horizontal gene transfer, drug targets, detection of genes that drive evolution of a species/population (e.g. influenza virus, see here for more examples)

How:

1) Obtain sequences

Sequencing

Databank searches -> NCBI: a) Entrez, b) BLAST, c) BLAST of pre-release data

Friends

 

2) Determine homology (see notes for earlier classes for practical implementation)

Reminder on Definitions:
Homology: Two sequences are homologous, if there existed an ancestral molecule in the past that is ancestral to both of the sequences

Types of homology:

Orthology: bifurcation in molecular tree reflects speciation
Paralogy: bifurcation in molecular tree reflects gene duplication
Xenology: gene was obtained by organism through horizontal transfer
Synology: genes ended up in one organism through fusion of lineages.

 

3) Align sequences

(Most algorithms used for phylogenetic reconstruction require a global alignment. An exception is statalign,
from Thorne JL and Kishino H, 1992, Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162.)

      1. algorithms doing a global alignment: clustalw 1.7, or pile_up (GCG)
      2. local alignments (MACAW)

Select part of the alignment that is reliable! Modify alignment, if necessary.

 

4) Reconstruct evolutionary history

    A) Distance analyses

      1. calculate pairwise distances
        (different distance measures, correction for multiple hits, correction for codon bias)
      2. make distance matrix (table of pairwise corrected distances)
      3. calculate tree from distance matrix
        i) using an optimality criterion (e.g.: smallest error between distance matrix and distances in tree), or
        ii) using algorithmic approaches (UPGMA or neighbor joining)
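
Steps 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it uses the Jukes-Cantor formula d = -3/4 ln(1 - 4p/3) as one example of a correction for multiple hits (the notes do not prescribe a particular distance measure), it simply skips gapped positions, and the function names and toy sequences are made up.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites among aligned positions without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(sequences):
    """Table of pairwise corrected distances, keyed by (name_a, name_b)."""
    names = list(sequences)
    return {(x, y): jukes_cantor(p_distance(sequences[x], sequences[y]))
            for x in names for y in names}


# Hypothetical toy alignment.
toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "ACGAACGA"}
for pair, d in distance_matrix(toy).items():
    print(pair, round(d, 4))
```

The corrected distance is always at least as large as the observed fraction of differences, because some substitutions are hidden by later substitutions at the same site.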

    B) Parsimony analyses

      find the tree that explains the sequence data with the minimum number of substitutions

      (tree includes hypothesis of sequence at each of the nodes)
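
As a sketch of how the minimum number of substitutions can be counted for one given tree, the following Python code uses Fitch's algorithm on a rooted, strictly bifurcating tree (for this kind of parsimony the count does not depend on where the root is placed). The state sets computed at internal nodes correspond to the hypotheses about the sequence at each node. The nested-tuple trees, the toy alignment, and the function names are assumptions for illustration; a real analysis would also have to search over many topologies.

```python
def fitch(tree, states):
    """Return (possible_states, substitutions) for one site on one tree.

    tree   : leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping leaf name -> character observed at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra substitution needed.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch counts over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical toy data: the first site favors grouping A with B.
alignment = {"A": "GA", "B": "GA", "C": "TA", "D": "TC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # 2
print(parsimony_score(tree_2, alignment))  # 3
```

Under the parsimony criterion tree_1 is preferred here, because it needs one substitution fewer than tree_2.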

       

    C) Maximum Likelihood analyses

      given a model for sequence evolution, find the tree under which the observed sequence data have the highest probability (the likelihood of the tree).

      This approach can also be used to successively refine the model.
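
To illustrate what "probability of the data under a model" means in practice, here is a minimal Python sketch that computes the likelihood of a fixed tree with fixed branch lengths using Felsenstein's pruning algorithm. It assumes the Jukes-Cantor model (equal base frequencies, equal rates) purely because it is the simplest choice; the tree encoding, function names, and toy data are assumptions, and no tree search or branch length optimization is attempted.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, t):
    """Jukes-Cantor probability that base i is replaced by base j
    along a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def conditional(node, site):
    """For each base b, P(observed tips below node | node has base b).

    node : leaf name (str) or a tuple of (child, branch_length) pairs
    site : dict mapping leaf name -> base observed at this site
    """
    if isinstance(node, str):
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:
        below = conditional(child, site)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, length) * below[c] for c in BASES)
    return result


def site_likelihood(tree, site):
    """Average over the four possible root states (each has frequency 1/4)."""
    partial = conditional(tree, site)
    return sum(0.25 * partial[b] for b in BASES)


def log_likelihood(tree, alignment):
    """Sites are treated as independent, so their log-likelihoods add up."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {k: s[i] for k, s in alignment.items()}))
        for i in range(n_sites)
    )


# Hypothetical example: ((A:0.1, B:0.1):0.05, C:0.2)
tree = (((("A", 0.1), ("B", 0.1)), 0.05), ("C", 0.2))
alignment = {"A": "ACGT", "B": "ACGA", "C": "ACTA"}
print(log_likelihood(tree, alignment))
```

A maximum likelihood program repeats this calculation for many candidate topologies and branch lengths and keeps the combination with the highest value.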

      Bayesian approaches build on the likelihood calculations used in ML analyses to calculate posterior probabilities for trees, clades and evolutionary parameters. MCMC approaches in particular have become very popular in the last few years, because they allow one to estimate evolutionary parameters (e.g., which site in a virus protein is under positive selection) without assuming that one actually knows the "true" phylogeny.

       

      D - ...) Other approaches:
      spectral analyses and evolutionary parsimony, i.e., methods that look only at patterns of substitutions.

Another way to categorize methods of phylogenetic reconstruction is to ask if they are using

  • an optimality criterion (e.g.: smallest error between distance matrix and distances in tree, least number of steps), or
  • algorithmic approaches (UPGMA or neighbor joining)
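
To show what an algorithmic approach looks like, here is a minimal Python sketch of UPGMA: the two closest clusters are merged repeatedly, and distances to the new cluster are size-weighted averages. There is no optimality criterion being searched over; the procedure itself defines the tree. The input format (a dictionary keyed by frozenset pairs), the function name, and the toy distances are assumptions, and branch lengths are omitted to keep the sketch short.

```python
def upgma(names, distances):
    """Build a rooted topology (nested tuples) by UPGMA clustering.

    names     : list of taxon names
    distances : dict mapping frozenset({a, b}) -> distance for every pair
    """
    sizes = {name: 1 for name in names}  # cluster -> number of taxa in it
    d = dict(distances)                  # working copy, updated while merging
    while len(sizes) > 1:
        pair = min(d, key=d.get)         # closest pair of current clusters
        a, b = tuple(pair)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            # Average distance to the merged cluster, weighted by cluster size.
            d[frozenset({merged, other})] = (
                size_a * d.pop(frozenset({a, other}))
                + size_b * d.pop(frozenset({b, other}))
            ) / (size_a + size_b)
        del d[pair]
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical clock-like toy distances.
toy = {
    frozenset({"A", "B"}): 2.0,
    frozenset({"A", "C"}): 6.0,
    frozenset({"B", "C"}): 6.0,
    frozenset({"A", "D"}): 10.0,
    frozenset({"B", "D"}): 10.0,
    frozenset({"C", "D"}): 10.0,
}
print(upgma(["A", "B", "C", "D"], toy))
```

The order of the children inside each tuple is arbitrary, which is harmless because swapping branches at a node does not change the tree. Note that UPGMA implicitly assumes a molecular clock, so it can be misleading when rates differ between lineages.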

5) Interpret the result.

It is especially important to consider artifacts that might originate in phylogenetic reconstruction, and to assess the reliability of your results.

 



Bootstrapping
- how to assess reliability of partitions given in a tree.

Baron Karl Friedrich Hieronymus von Münchhausen

Bootstrapping is one of the most popular ways to assess the reliability of branches. The term bootstrapping goes back to the Baron Münchhausen (who pulled himself out of a swamp by his shoe laces). Briefly, positions of the aligned sequences are randomly sampled from the multiple sequence alignment with replacement. The sampled positions are assembled into new data sets of the same length as the original, the so-called bootstrapped samples. Each position has about a 63% chance (1 - 1/e for long alignments) of making it into a particular bootstrapped sample. If a grouping has a lot of support, it will be supported by at least some positions in each of the bootstrapped samples, and all the bootstrapped samples will yield this grouping. Bootstrapping can be applied to all methods of phylogenetic reconstruction.
Bootstrapping thus realizes the impossible: the evolution of sequences in real life happened only once, and it is impossible to run the evolution of, let's say, small subunit ribosomal RNAs again. Nevertheless, using the resampling approach, pseudosamples are generated that have a variation that resembles the variation one would have obtained if it were possible to sample 100 or 1000 parallel worlds in which the evolution of 16S rRNAs occurred over and over again. You end up with a statistical analysis using a single original sample only.
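
The resampling step described above can be sketched as follows. This minimal Python example draws alignment columns with replacement and checks empirically that a given column ends up in roughly 63% of the bootstrapped samples. The function names, the toy alignment, and the fixed random seed are assumptions for illustration; building a tree from each pseudosample and counting how often a grouping appears is left to whatever reconstruction method is used.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement; return (new_alignment, chosen_columns).

    alignment : dict mapping sequence name -> aligned sequence (equal lengths)
    rng       : a random.Random instance, so that results are reproducible
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    resampled = {name: "".join(seq[i] for i in columns)
                 for name, seq in alignment.items()}
    return resampled, columns


def fraction_of_columns_included(n_sites, n_replicates, seed=0):
    """Average fraction of original columns present in a bootstrapped sample."""
    rng = random.Random(seed)
    included = 0
    for _ in range(n_replicates):
        included += len({rng.randrange(n_sites) for _ in range(n_sites)})
    return included / (n_sites * n_replicates)


# Hypothetical toy alignment.
toy = {"A": "ACGTACGTAC", "B": "ACGAACGTAC", "C": "ACTAACGTTC"}
pseudo, cols = bootstrap_alignment(toy, random.Random(1))
print(cols)
print(pseudo)
print(fraction_of_columns_included(n_sites=1000, n_replicates=200))
```

The same columns are drawn for every sequence, so each pseudosample is still a valid alignment; only the weight given to each position changes.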

Bootstrapping has become very popular to assess the reliability of reconstructed phylogenies. Its advantages are that it can be applied to different methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantages are that the support for individual groups decreases as you add more sequences to the dataset, and that it just measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, this will be reproduced for every one of the bootstrapped samples, and it will result in high bootstrap support values.

 

 

Continue at last year's class here.

Slides on trees and Phylip are here