Reading assignment for Friday: chapter 11

Draw a diagram of the genome rearrangements that relate the Mycobacterium tuberculosis CDC1551 genome to the genome of Mycobacterium avium subsp. paratuberculosis K-10 (see http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi?tax1=83331&tax2=262316 )

Go over results from class 20.

Go over results from gene plot exercise: Borrelia burgdorferi vs B. garinii at NCBI and at EMU (import and analyze in Excel here)

How do the two approaches differ? Which one would be preferable?

 

 

From:<http://dml.cmnh.org/2002Jul/msg00351.html>

----- Original Message -----
From: <Dinogeorge@aol.com>
Sent: Thursday, July 11, 2002 6:47 PM
Subject: Re: New finds

 

> > --+--+-----------A
> >   |  `--+--+-----B
> >   |     |  `--+--C
> >   |     |     `--D
> >   |     `--------E
> >   `--------------F
>
> This is >not< a Hennigian comb. Only the entire ABCDE clade and the F lineage
> make a (two-toothed) Hennigian comb in this cladogram. In a Hennigian comb
> the side branches are left unbranched, like the teeth of a comb. Hence the
> name.

This _is_ a Hennigian comb, because in a cladogram, _only_ topology counts.
A cladogram is a mobile. Look at the following -- it's exactly the same
cladogram as above:

--+--F
  `--+--A
     `--+--E
        `--+--B
           `--+--D
              `--C

... what a side branch is lies completely in the hand of the presentator.
All I did was I rotated a few stems around their long axes.
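The point that only topology counts can be checked mechanically. The following is a minimal sketch, assuming the two cladograms above are transcribed by hand as nested tuples (the transcription and the function name are mine, not part of the original message). Each clade is turned into an unordered set of its children, so rotating branches cannot change the result.

```python
def topology(node):
    """Return an order-independent form of a nested-tuple cladogram."""
    if isinstance(node, str):          # a leaf (taxon name)
        return node
    # A clade is the unordered set of its child clades.
    return frozenset(topology(child) for child in node)

# First cladogram of the message, read top to bottom.
first = (("A", (("B", ("C", "D")), "E")), "F")
# Second cladogram (the "comb" drawing).
second = ("F", ("A", ("E", ("B", ("D", "C")))))
# A hypothetical tree with B and E swapped, for contrast.
other = ("F", ("A", ("B", ("E", ("D", "C")))))

print(topology(first) == topology(second))   # True
print(topology(first) == topology(other))    # False
```

Swapping B and E changes which taxa share a clade, so the third tree is a different cladogram, not just a different drawing.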

* References:

The Clay of Evolution -
How to study genes and genomes.

How can genes get duplicated:
Whole genome duplication, partial genome duplication, single genes get duplicated (tandem repeats)

Whole genome duplication: a frequent event in plants, also speculated to have occurred at least twice in the early evolution of vertebrates. 15% of the yeast genome is present in duplicated form; the currently accepted idea is that there was an ancient duplication followed by rearrangement and gene loss. The idea of genome duplications in early vertebrate evolution has become very popular, but the phylogeny of regulatory proteins does not support it (see here and here for pro and here for contra).

The picture below is a comparison of the yeast proteome with itself (the diagonal is removed). It clearly shows many small regions of duplication.

The diagram depicts the result of a BLAST search of each ORF in a genome against the genome (=collection of ORFs). The proteins encoded in the genome are listed in order on both axes (could be different genomes as well, see below). The color of each dot reflects the E-value for the comparison of the ORFs. The smaller the E-value, the lighter the point.

Parts of chromosomes get duplicated: traces of this are seen in Arabidopsis and Caenorhabditis

Single genes get duplicated -> gene families, originally tandemly replicated (see the Caenorhabditis paper above)

Some TOOLS at NCBI

The NCBI provides several different interfaces to browse through and analyze genomes. For example, in the Borrelia genome, if you click on the complete genome, you get a graphical representation; further clicks move you down through several levels to the nucleotide and encoded amino acid sequence. If you click on an ORF, you retrieve the sequence followed by the output of a BLAST search of this sequence against the nr database. The graphic representation shows you which part of the ORF generated the match; if you click on the number that represents the score, you open a new window with the alignment (again with nice graphics included). If you click on the number, a window with the matching sequence in gb-format opens up. If the ORF is part of a cluster of putatively orthologous genes, you can get information on the cluster by clicking on the COG number.

From the Borrelia genome page, you can go to tables listing all ORFs, or to taxtable, which provides an interesting nearest neighbor coloring of the genome.  It is noteworthy that many of the pink dots are endonucleases.  Also, there are many transporters among the odd colored genes.

In an attempt to capture some phylogenetic information in blast comparisons, Olendzenski et al. pioneered an approach to use multiple reference genomes to screen for putatively horizontally transferred genes (see Fig. 4). A similar approach, but using only two instead of three reference genomes is implemented in the TAX PLOT program at the NCBI's genome page (see below).

You pick one genome to analyze, and two reference genomes. The program returns a plot of every ORF in the selected genome represented in a coordinate system, where the two coordinates are the highest alignment score with the two reference genomes:
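The coordinate calculation described above can be sketched in a few lines. This is only an illustration of the idea, not the NCBI implementation: the function name, the input format (ORF, reference genome, score) and the labels "ref1" and "ref2" are assumptions, and the extra low-scoring hit is invented to show that only the best score per reference genome is kept. The proS and metG scores are taken from the list of selected genes in these notes.

```python
def taxplot_points(hits, ref_x, ref_y):
    """Map each ORF to (best score vs ref_x, best score vs ref_y).

    hits: iterable of (orf, reference_genome, score) tuples (assumed format).
    ORFs without a hit in one reference genome get a score of 0 there.
    """
    best = {}
    for orf, genome, score in hits:
        if genome not in (ref_x, ref_y):
            continue
        scores = best.setdefault(orf, {ref_x: 0, ref_y: 0})
        scores[genome] = max(scores[genome], score)
    return {orf: (s[ref_x], s[ref_y]) for orf, s in best.items()}

hits = [
    ("proS", "ref1", 655), ("proS", "ref2", 167),
    ("proS", "ref2", 40),                     # invented weaker hit, ignored
    ("metG", "ref1", 873), ("metG", "ref2", 436),
]
points = taxplot_points(hits, "ref1", "ref2")
print(points)   # {'proS': (655, 167), 'metG': (873, 436)}
```

ORFs that score much higher against one reference genome than the other fall far from the diagonal of the plot.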

Selected genome was from Borrelia burgdorferi. The list of selected genes is below:

Definition | Blast2Seq score | GenBank

V-type ATPase, subunit B (atpB) [Borrelia burgdorferi] |  | 15594439
V-TYPE ATP SYNTHASE BETA CHAIN (V-TYPE A | 722 | 12585403
ATP synthase F1 alpha subunit [Aquifex a | 261 | 15606090

V-type ATPase, subunit A (atpA) [Borrelia burgdorferi] |  | 15594440
H+-transporting ATP synthase, subunit A | 1051 | 11498766
ATP synthase F1 beta subunit [Aquifex ae | 221 | 15607015

prolyl-tRNA synthetase (proS) [Borrelia burgdorferi] |  | 15594747
prolyl-tRNA synthetase (proS) [Archaeogl | 655 | 11499201
proline-tRNA synthetase [Aquifex aeolicu | 167 | 15605873

phenylalanyl-tRNA synthetase, beta subunit (pheT) [Borrelia burgdorferi] |  | 15594859
phenylalanyl-tRNA synthetase, subunit be | 709 | 11499019
phenylalanyl-tRNA synthetase beta subuni | 153 | 15606806

chemotaxis histidine kinase (cheA-1) [Borrelia burgdorferi] |  | 15594912
chemotaxis histidine kinase (cheA) [Arch | 798 | 11498645
histidine kinase sensor protein [Aquifex | 86 | 15605839

methionyl-tRNA synthetase (metG) [Borrelia burgdorferi] |  | 15594932
methionyl-tRNA synthetase (metS) [Archae | 873 | 11499048
methionyl-tRNA synthetase alpha subunit | 436 | 15606482

spermidine/putrescine ABC transporter, ATP-binding protein (potA) [Borrelia burgdorferi] |  | 15594987
spermidine/putrescine ABC transporter, A | 678 | 11499200
ABC transporter [Aquifex aeolicus] | 325 | 15607081

lysyl-tRNA synthetase [Borrelia burgdorferi] |  | 15595004
lysyl-tRNA synthetase (lysS) [Archaeoglo | 642 | 11498815
cysteinyl-tRNA synthetase [Aquifex aeoli | 92 | 15606347

More on Comparing Genomes:

Genome dot plots allow one to compare two genomes (or rather the ORFs encoded in these genomes). In contrast to a normal dot plot, one does not move a window through the sequence; rather, one takes one ORF at a time and compares it to the other genome.
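To make the procedure concrete, here is a minimal sketch of how the dots could be computed once the ORF-by-ORF comparisons are available. It assumes the comparisons have already been run and are given as (ORF in genome A, ORF in genome B, E-value) tuples; the function name, the input format, the E-value cutoff and the toy data are all hypothetical.

```python
def dot_plot_points(hits, order_a, order_b, max_evalue=1e-5):
    """Return sorted (x, y) dots: the genome positions of ORF pairs that match.

    order_a, order_b: ORF names in the order they occur in each genome.
    hits: iterable of (orf_a, orf_b, evalue) tuples (assumed format).
    """
    pos_a = {orf: i for i, orf in enumerate(order_a)}
    pos_b = {orf: j for j, orf in enumerate(order_b)}
    return sorted(
        (pos_a[a], pos_b[b])
        for a, b, evalue in hits
        if evalue <= max_evalue and a in pos_a and b in pos_b
    )

order_a = ["a1", "a2", "a3", "a4"]
order_b = ["b1", "b2", "b3", "b4"]
hits = [
    ("a1", "b1", 1e-50), ("a2", "b3", 1e-40), ("a3", "b2", 1e-30),
    ("a4", "b4", 1e-60), ("a4", "b1", 0.5),   # last hit fails the cutoff
]
dots = dot_plot_points(hits, order_a, order_b)
print(dots)   # [(0, 0), (1, 2), (2, 1), (3, 3)]
```

In this toy data the dots (1, 2) and (2, 1) run against the main diagonal, which is how a small inversion would show up in the plot.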

Robert L. Charlebois' genome and bioinformatics site performed these and other analyses. Most of these are now available at the EMU server maintained by Robert Beiko.

For example, the BLASTP-based dot plot of Pyrococcus abyssi vs Pyrococcus horikoshii depicted below clearly reveals inversions and duplications (two parallel diagonals); the latter can also be detected by comparing a genome to itself.

See this paper from Tillier and Collins for a discussion of this and similar patterns.

Recently, the NCBI added a pairwise genome comparison of protein homologs (symmetrical best hits) to their web page: from any summary sequence view of a genome (e.g. here), select GenPlot (e.g. here). This analysis differs from the above in that it does not consider all pairwise scores, but only those ORFs that pick each other as top-scoring BLAST hits, i.e., at best each ORF is represented by one point.

The blastall program might be your best chance to generate a plot that includes all significant BLAST hits (possibly covered on Wednesday).
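The symmetrical best hit criterion can be sketched as follows. This is an illustration under stated assumptions, not the NCBI code: scores are assumed to be available as dictionaries keyed by (query, subject), and the function names and toy scores are hypothetical.

```python
def best_hits(scores):
    """scores: {(query, subject): score}. Return {query: top-scoring subject}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def symmetrical_best_hits(a_vs_b, b_vs_a):
    """Keep only pairs in which each ORF picks the other as its top hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

a_vs_b = {("a1", "b1"): 500, ("a1", "b2"): 300,
          ("a2", "b2"): 100, ("a3", "b2"): 400}
b_vs_a = {("b1", "a1"): 500, ("b2", "a3"): 400,
          ("b2", "a1"): 300, ("b2", "a2"): 100}
pairs = symmetrical_best_hits(a_vs_b, b_vs_a)
print(pairs)   # [('a1', 'b1'), ('a3', 'b2')]
```

Here a2 has a hit to b2, but b2 prefers a3, so a2 contributes no point. A plot of all significant hits would still show it.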

 

Selection versus genetic drift.

Selection

Deterministic models to describe selection:  (diploid organisms, two alleles A1 and A2)

    codominance (kind of logistic equation); q = frequency of allele A2

Genotype:                        A1A1    A1A2    A2A2

Relative number of offspring     1       1+s     1+2s

Fitness                          w11     w12     w22

Frequency                        p^2     2pq     q^2

(p, q: allele frequencies => genotype frequencies in Hardy-Weinberg equilibrium)

Change in frequency (approximately):
dq/dt = s*q*(1-q) and

q(t) = 1/(1 + ((1-q0)/q0)*e^(-s*t))
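A quick numerical check that the closed-form q(t) solves dq/dt = s*q*(1-q). This is a sketch only: the step size and the parameter values (q0 = 0.01, s = 0.1) are arbitrary choices, and simple Euler integration is used for transparency rather than accuracy.

```python
import math

def q_closed_form(q0, s, t):
    """Logistic solution q(t) for starting frequency q0 and selection s."""
    return 1.0 / (1.0 + ((1.0 - q0) / q0) * math.exp(-s * t))

def q_euler(q0, s, t, dt=0.01):
    """Integrate dq/dt = s*q*(1-q) with small Euler steps of size dt."""
    q = q0
    for _ in range(int(round(t / dt))):
        q += s * q * (1.0 - q) * dt
    return q

for t in (0, 25, 50, 100):
    print(t, round(q_closed_form(0.01, 0.1, t), 4), round(q_euler(0.01, 0.1, t), 4))
```

The two columns should agree closely, and the allele frequency follows the familiar S-shaped curve: slow at first, fastest around q = 0.5, slow again near fixation.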

    overdominance

Genotype:                        A1A1    A1A2    A2A2

Relative number of offspring     1       1+s1    1+s2

          s1>s2 and s1>0:   balancing selection (try it)

Go to Kent Holsinger's collection of JAVA applets here and explore some of the time courses with different values of s1 and s2.  

Under which conditions of w11, w12, and w22 can one maintain both alleles over long periods of time?
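If the applets are not available, the question can also be explored with a few lines of code. The sketch below iterates the standard deterministic one-locus, two-allele recursion for random mating, q' = q*(p*w12 + q*w22)/wbar with wbar = p^2*w11 + 2pq*w12 + q^2*w22. This textbook recursion is not spelled out in the notes above, and the fitness values in the example are arbitrary.

```python
def next_q(q, w11, w12, w22):
    """One generation of selection; q is the frequency of allele A2."""
    p = 1.0 - q
    mean_w = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return q * (p * w12 + q * w22) / mean_w

def iterate(q0, w11, w12, w22, generations):
    """Frequency of A2 after the given number of generations."""
    q = q0
    for _ in range(generations):
        q = next_q(q, w11, w12, w22)
    return q

# Heterozygote fittest (s1 = 0.1, s2 = 0.05): try a rare and a common start.
print(iterate(0.01, 1.0, 1.1, 1.05, 5000), iterate(0.99, 1.0, 1.1, 1.05, 5000))
# Codominance (s = 0.1): 1, 1+s, 1+2s.
print(iterate(0.01, 1.0, 1.1, 1.2, 5000))
```

Comparing where the two starting frequencies end up in the first case with the outcome of the second case is a good way to approach the question.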

Stochastic approaches -- random drift - neutral evolution:

Law of the gutter (see also Stephen Jay Gould's interpretation on the trend to increasing complexity)

Explore some simulations: 
     Drift only (vary the population size N),

How does the survival of multiple alleles in a population depend on the population size?

     Drift and Selection (interesting setting: P=0.01, N=50)

Note: Even though the allele conveys a strong selective advantage of 10%, the allele has a rather large chance to go extinct quickly.

     This simulation follows many populations (with the selected parameters) over time. It plots a histogram that shows how many of the populations have the allele frequency indicated on the y-axis. If you set the mutation rate to 0, this provides a nice illustration of the law of the gutter. (In the presence of the alleles converting back and forth, fixation does not occur.)
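For readers without access to the applets, a comparable experiment can be sketched with a simple Wright-Fisher model. This is not the code behind the simulations linked above. It assumes that P=0.01 is the starting allele frequency, that N is the number of diploid individuals (2N gene copies), that fitnesses are 1, 1+s, 1+2s, and that there is no mutation. The function names and the seed are arbitrary.

```python
import random

def wright_fisher(n_diploid, q0, s=0.0, max_generations=2000, rng=None):
    """Follow one population until the allele is lost (0.0) or fixed (1.0)."""
    rng = rng or random.Random()
    copies = 2 * n_diploid                  # gene copies in the population
    q = q0
    for _ in range(max_generations):
        if q == 0.0 or q == 1.0:
            break
        # Selection step for fitnesses 1, 1+s, 1+2s (no effect if s = 0).
        q_sel = q * (1 + s * (1 + q)) / (1 + 2 * s * q)
        # Drift step: sample 2N gene copies for the next generation.
        count = sum(1 for _ in range(copies) if rng.random() < q_sel)
        q = count / copies
    return q

def run_replicates(replicates, n_diploid, q0, s=0.0, seed=1):
    """Return how many populations ended lost, fixed, or still segregating."""
    rng = random.Random(seed)
    finals = [wright_fisher(n_diploid, q0, s, rng=rng) for _ in range(replicates)]
    lost = sum(1 for q in finals if q == 0.0)
    fixed = sum(1 for q in finals if q == 1.0)
    return lost, fixed, replicates - lost - fixed

lost, fixed, segregating = run_replicates(500, n_diploid=50, q0=0.01, s=0.1)
print("lost:", lost, "fixed:", fixed, "segregating:", segregating)
```

With no mutation every population ends up in one of the two "gutters" (frequency 0 or 1); setting s=0 and varying n_diploid shows how quickly that happens at different population sizes.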

Mutation rate versus Substitution rate

The following assumes co-dominance or no selection:

s=0:  Probability of fixation, P, is equal to frequency of allele in population, q

mutation rate (per gene/per unit of time) = u ;  

frequency with which new alleles are generated in a diploid population of size N equals u*2N

Probability of fixation for each new allele = 1/(2N)

Substitution rate = frequency with which allele is generated * Probability of fixation= u*2N *1/(2N) = u

Therefore:
The substitution rate is independent of population size if s=0 and equal to the mutation rate!!!!
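The cancellation of 2N can be seen directly with numbers. This is a tiny sketch with an arbitrary, purely illustrative mutation rate of u = 1e-8 per gene per generation.

```python
import math

def substitution_rate(u, n_diploid):
    """Neutral substitution rate = (new alleles per generation) * P(fixation)."""
    new_alleles = u * 2 * n_diploid        # mutations arising per generation
    p_fix = 1.0 / (2 * n_diploid)          # fixation probability of each one
    return new_alleles * p_fix

for n in (50, 10**4, 10**6):
    print(n, substitution_rate(1e-8, n))
```

A large population produces more new alleles per generation, but each one is correspondingly less likely to drift to fixation, so the product stays at u.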

This is the reason that there is hope that the molecular clock might sometimes work.

For advantageous mutations: 
      Probability of fixation, P, is approximately equal to 2s;
      e.g., if selective advantage s = 1% then P = 2%

      Does this correspond to the simulations you performed above?
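The 2s rule of thumb can be compared with Kimura's diffusion approximation for a new mutation, P = (1 - e^(-2s)) / (1 - e^(-4Ns)). This formula is not given in the notes above; it is the standard result for fitnesses 1, 1+s, 1+2s when the effective population size equals N. The function name and parameter values below are illustrative.

```python
import math

def fixation_probability(s, n_diploid):
    """Kimura's approximation for a single new copy (frequency 1/(2N)).

    Assumes Ne = N and fitnesses 1, 1+s, 1+2s; intended for s >= 0.
    """
    if s == 0:
        return 1.0 / (2 * n_diploid)       # neutral limit
    return (1.0 - math.exp(-2 * s)) / (1.0 - math.exp(-4 * n_diploid * s))

for s, n in ((0.01, 10**6), (0.1, 50), (0.0, 50)):
    print(s, n, round(fixation_probability(s, n), 4), "vs 2s =", 2 * s)
```

For small s and large N the result is very close to 2s; for s = 0.1 it is somewhat below 2s, and in every case most new advantageous alleles are still lost.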

Fixation time

Neutral mutations:  tav = 4*Ne generations
(Ne = effective population size; for n discrete generations Ne = n/(1/N1 + 1/N2 + ... + 1/Nn))

s unequal to 0:  tav = (2/s) ln(2N) generations  (also true for mutations with negative s --  How can this be??)

E.g.:  N=10^6, s=0:  average time to fixation: 4*10^6 generations

N=10^6, s=0.01:  average time to fixation: 2900 generations
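The numbers in these examples can be reproduced directly from the formulas. The following is a minimal sketch; using the absolute value of s for negative s is my reading of the remark above, and the population sizes in the bottleneck example are invented.

```python
import math

def effective_size(sizes):
    """Harmonic mean of the population sizes over discrete generations."""
    return len(sizes) / sum(1.0 / n for n in sizes)

def fixation_time_neutral(ne):
    """Average time to fixation of a neutral mutation, in generations."""
    return 4 * ne

def fixation_time_selected(s, n):
    """Average time to fixation for s != 0, in generations (uses |s|)."""
    return (2.0 / abs(s)) * math.log(2 * n)

print(fixation_time_neutral(10**6))                 # 4000000
print(round(fixation_time_selected(0.01, 10**6)))   # 2902
print(round(effective_size([1000, 10, 1000]), 1))   # 29.4
```

The last line shows how strongly a single bottleneck generation pulls Ne down: the harmonic mean is dominated by the smallest population size.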

Neutral theory: 

The vast majority of observed sequence differences between members of a population are neutral (or close to neutral). These differences can be fixed in the population through random genetic drift. Some mutations are strongly counter selected (this is why there are patterns of conserved residues). Only very seldom is a mutation under positive selection. 

The neutral theory does not say that all evolution is neutral and that everything is due only to genetic drift.

(Nearly neutral theory:  Even synonymous mutations do not lead to random composition but to codon bias.  Small negative selection might be sufficient to produce this bias. )

Note: the larger the population, the better selection works, and the closer to neutral a mutation needs to be in order to be fixed by genetic drift. (If N*s<<1 the mutation behaves as neutral, and the fixation probability is 1/(2N); if N*s~1 then the fixation probability is only about 2s, which is small, but seems to work.)

Is evolution in humans only neutral? Does selection still play a role? See, e.g., here for the distribution of alleles that encode a protein presumably involved in brain development (here for the article, in case you are interested in reading more; a similar case is reported here), and here for a comment arguing that the haplotype frequencies might be due to drift and small founder populations and might not reflect selection.