Comments on / Additions to Class 3:

When searching Entrez, you can add links to online journals to which UConn subscribes. (If you are outside UConn, you need to set up a proxy account for the links to work.)

The link to use is http://www.ncbi.nlm.nih.gov/sites/entrez?otool=uconnlib
(more info at http://www.lib.uconn.edu/online/shortcuts.html)

Use MyNCBI at Entrez to repeat searches at regular intervals (an alternative is PubCrawler, see class 2).

Web of Science Cited Reference Search - problems?

Do example on clipboard and index. (use GI 2266989 (nucl) and 3334404 (prot))
How many related sequences does the nucleotide sequence have?
How many related sequences does the encoded protein sequence have? (check page 400 and 1000)
Demonstrate Links and BLINK

Bottom lines:
a) GenBank is redundant
b) If possible, it is preferable to use a protein sequence (20-letter alphabet) as the query rather than a nucleotide sequence (4-letter alphabet)!

Outstanding questions: Honors conversion?


Other web pages:

Nucleic Acids Research Database Issue
Every year, the first issue of Nucleic Acids Research is devoted to updates on biological databases

http://www.ebi.ac.uk/
The European homolog/analog to NCBI, software archive.

http://rdp.cme.msu.edu/
The US ribosomal databank project

http://www.jgi.doe.gov/
Genomes at the DOE joint genome institute

http://www.genomesonline.org/
List of completed genomes and ongoing genomes

http://www.tigr.org
Home of several "completed" genome projects (TIGR has since become part of the J. Craig Venter Institute)

http://genome-www.stanford.edu/
Yeast and Arabidopsis genome projects

http://www.flybase.org/
Database of Drosophila Genome

http://www.arabidopsis.org/
TAIR - The Arabidopsis Information Resource

http://www.ensembl.org/
Ensembl Genome Browser (Eukaryotic genomes, including Human and Mouse genomes)

 

 

Sequence and structure databanks can be divided into many different categories.
One of the most important distinctions is whether or not a databank has a gatekeeper:

 

Supervised databanks with gatekeeper.

Examples:

  • Swissprot
  • Refseq (at NCBI)

Entries are checked for accuracy.
+ more reliable annotations
-- frequently out of date

 

 

Repositories without gatekeeper.

Examples:

  • GenBank
  • EMBL
  • TrEMBL

Everything is accepted.
+ everything is available
-- many duplicates
-- poor reliability of annotations

Old Powerpoint slides on data banks are here

If you want more information on databanks, a computer-science-oriented introduction to biological databanks, their design, and their management is at https://gcrcweba.lacusc.co.la.ca.us/bifresources/BuildingUsingBiolDB.pdf

 

PRSS -
When are two similar sequences homologous?
I.e., when is their similarity due to shared ancestry?
(The opposite of homology is analogy, i.e., similarity due to convergent evolution.)

(Note: we will discuss alignment algorithms later; for now it is sufficient to know that, given a scoring matrix and two sequences, one can calculate an alignment that has an optimal score.)

One way to quantify the similarity between two sequences is to

1.    compare the actual sequences and calculate the alignment score

2.    randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences.

3.    repeat step 2 at least 100 times

4.    describe the distribution of the randomized alignment scores (e.g., by its mean and standard deviation)

5.    do a statistical test to determine if the score obtained for the real sequences is significantly better than the score for the randomized sequences
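The five steps above can be sketched in Python. This is only a minimal illustration, not PRSS itself: the scoring function is a plain Smith-Waterman local alignment with a simple match/mismatch/gap scheme (the values 2, -1 and -2 are assumptions chosen for illustration), whereas PRSS uses a substitution matrix and more accurate statistics than a z-value. The function names and the example sequence are hypothetical.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Compare the real score with scores against shuffled copies of b.

    Returns (actual score, mean of shuffled scores, their standard
    deviation, z-value of the actual score).
    """
    rng = random.Random(seed)
    actual = local_alignment_score(a, b)           # step 1
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):                    # steps 2 and 3
        rng.shuffle(letters)                       # keeps composition, destroys order
        shuffled_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)        # step 4
    sd = statistics.stdev(shuffled_scores)
    z = (actual - mean) / sd if sd > 0 else float("inf")  # step 5 (crude z-value test)
    return actual, mean, sd, z


if __name__ == "__main__":
    # Hypothetical example: a sequence compared with itself.
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
    actual, mean, sd, z = shuffle_test(seq, seq)
    print(f"actual={actual} mean={mean:.1f} sd={sd:.2f} z={z:.1f}")
```

Shuffling preserves the length and amino acid composition of the sequence, so the randomized scores show what can be expected from composition alone.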

To illustrate the assessment of similarity/homology we will use a program from Pearson's FASTA package called PRSS. 
This and many other programs by Bill Pearson are available from his web page at http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

A web version is available here.

Go through example. Sequences are here (fl), here (B), here (A) and here (A2)

There are many other alignment programs.  BLAST is a program that is widely used and offered through the NCBI (go here for more info).  It also offers to do pairwise comparisons (go here, do example).

To force BLAST to report an alignment, increase the E-value threshold (the Expect cutoff).


Rules of thumb:

Usually E-values (in a BLAST search or through randomization) smaller than 10^-4 are convincing.

(For small values, the E-value gives the probability of finding a match of this quality by chance alone in a search of a databank of the same size. For more detailed information see the Terminology section and the BLAST help manual on P-values.)

If you can demonstrate significant similarity using either randomization or an unweighted BLAST search, your sequences are homologous (i.e., related by common ancestry).  Convergent evolution has not been shown to lead to sequence similarities detectable by these means (see above; this might not be true for scores in PSI-BLAST).

If the actual alignment score is more than three standard deviations (of the randomized sequences) better than the mean for the randomized sequences, the two sequences are homologous (i.e., related by common ancestry).  PRSS and many other programs use more accurate distributions to describe the distribution of random hits.  The expectation value for the alignment score of the actual sequences is based on these statistics.

A similar approach is used in the FASTA database search. If one chooses to display a histogram of the search, the output includes the histogram of all the alignment scores obtained with the individual sequences contained in the database. Included are the actual sequence scores, and the ones that are expected based on a probability distribution. An example is here or do a search with gi 2493127 here.

Terminology:

E-values give the expected number of matches with an alignment score this good or better.
P-values give the probability of finding a match of this quality or better. P-values lie in [0,1], E-values in [0,infinity).
For small values, E is approximately equal to P.
z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences expressed as multiples of the standard deviation calculated for the randomized scores.
For example, a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences. Z-values > 3 are usually considered suggestive of homology; z-values > 5 are considered sufficient demonstration (but see the "BUT" below). A somewhat readable description of E, P, HSP and other values is here.
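The statement that E and P agree for small values can be made concrete. The sketch below assumes that the number of chance matches follows a Poisson distribution with mean E (the assumption used in BLAST-style statistics), so that the probability of at least one chance match is P = 1 - exp(-E). The function name p_from_e is hypothetical.

```python
import math


def p_from_e(e_value):
    """P-value for a given E-value, assuming Poisson-distributed chance hits.

    P = 1 - exp(-E); expm1 keeps the result accurate for very small E.
    """
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (1e-10, 1e-4, 0.01, 0.1, 1.0, 10.0):
        print(f"E = {e:g}  ->  P = {p_from_e(e):.6g}")
```

Under this assumption P stays below 1 however large E becomes, while for E well below 0.01 the two values are practically indistinguishable.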

BUT:
Failure to detect significant similarity only shows our inability to detect homology; it does not prove that the sequences are not homologous.

Examples:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and in antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structures and can therefore be assumed to be homologous; however, the above tests fail to detect these homologies. (For example, enzymes with GRASP nucleotide binding sites are depicted here.)

DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g.: sliding clamp, E. coli dnaN and eukaryotic PCNA, see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.

Helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (picture)

Assignments:

For Friday:

go through the yellow boxes in class4 (above).
Re-read Chapter 2.
Check the example GenBank-formatted file (here) and read through the list of other frequently used formats (here)
Email a question that would be suitable for the 1st take-home quiz to the instructor.

For Monday, Sept. 10:

Read through posted take-home-quiz #1
Read Chapter on Evolution as Algorithm in "Darwin's Dangerous Idea" by Daniel C. Dennett [available through WebCT]

Optional for Monday (Sept 10)(required for grad students :)) :

 

 

What does Bioinformatics have to do with Molecular Evolution? 

Problem: Application of first principles does not (yet) work

The following chain is (believed to be) determined mainly by the DNA sequence (plus other components of the cell, which are in turn encoded by other parts of the genome), but at present it cannot be simulated in a computer.

DNA sequence ->
transcription ->
translation ->
protein folding ->
protein function (catalytic and other properties) ->
properties of the organism(s) ->
ecology (taking also the non biological environment into account) ->

... .

 

Most scientists believe that the principle of reductionism (plus new laws and relations emerging on each level) is true for this chain; however, this is clearly "in principle" only.
Biology relies on this sequence to work more or less unambiguously (prions), but:

At several steps along the way from DNA to function our understanding of the chemical and physical processes involved is so incomplete that prediction of protein function based on only a single DNA sequence is at present impossible (at least for a protein of reasonable size).

Solution:
Use evolutionary context:

"Nothing in biology makes sense except in the light of evolution"

Theodosius Dobzhansky



Present day proteins evolved through substitution and selection from ancestral proteins. Related proteins have similar sequence AND similar structure AND similar function.

In the above mantra "similar function" can refer to:

  • identical function,

  • similar function, e.g.:
    • identical reactions catalyzed in different organisms; or
    • same catalytic mechanism but different substrate (malic and lactic acid dehydrogenases);
    • similar subunits and domains that are brought together through a (hypothetical) process called domain shuffling, e.g. nucleotide binding domains in hexokinase, myosin, HSP70, and ATPsynthases.

The Size of Protein Sequence Space (back of the envelope calculation):

Consider a protein of 600 amino acids.
Assume that every position could be occupied by any of the twenty possible amino acids.
Then the total number of possibilities is
20 choices for the first position times 20 for the second position times 20 for the third ... = 20^600, or about 4*10^780 different possible proteins with a length of 600 amino acids.

For comparison the universe contains only about 10^89 protons and has an age of about 5*10^17 seconds or 5*10^29 picoseconds.

If every proton in the universe were a computer that had explored one possible protein sequence per picosecond since the universe began, we would have explored only 5*10^118 sequences, i.e., a negligible fraction of the possible sequences of length 600 (one in about 10^662).
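The back-of-the-envelope numbers above can be checked with exact integer arithmetic in Python. The sketch simply reuses the figures given in the text (600 positions, 20 amino acids, about 10^89 protons, about 5*10^29 picoseconds); the variable names are hypothetical.

```python
import math

length = 600                 # amino acids in the protein
alphabet = 20                # possible amino acids per position
protons = 10 ** 89           # figure used in the text
age_ps = 5 * 10 ** 29        # age of the universe in picoseconds (from the text)

n_sequences = alphabet ** length      # exact integer, 781 digits long
explored = protons * age_ps           # one sequence per proton per picosecond

# Work with logarithms for the comparison, since 20^600 exceeds float range.
log10_total = length * math.log10(alphabet)
log10_explored = math.log10(explored)
log10_ratio = log10_total - log10_explored

if __name__ == "__main__":
    print(f"20^600 is about 10^{log10_total:.2f}")
    print(f"explored is about 10^{log10_explored:.2f}")
    print(f"explored fraction: one in about 10^{log10_ratio:.0f}")
```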

The following is based on observation and not on an a priori truth:

If two proteins (not necessarily true for nucleotide sequences) show significant similarity in their primary sequence, they have shared ancestry, and probably similar function.
(Some proteins have, however, acquired radically new functional assignments, e.g., lysozyme -> lens crystallin.)


To date no example is known where convergent evolution has led to significant similarity of the primary sequence (although there are examples where similar selection pressures have resulted in similar convergent substitutions in homologous proteins).

THE REVERSE IS NOT TRUE:

PROTEINS WITH THE SAME OR SIMILAR FUNCTION DO NOT ALWAYS SHOW SIGNIFICANT SEQUENCE SIMILARITY
for one of two reasons:

a)  they evolved independently
(e.g. different types of nucleotide binding sites);

or

b)   they underwent so many substitution events that there is no readily detectable similarity remaining.

In particular, PROTEINS WITH SHARED ANCESTRY DO NOT ALWAYS SHOW SIGNIFICANT SIMILARITY
(reason: see b above); many recent advances concern the improved detection of similarity.