Assignment for Friday's class:

  • Read the section on Databases below.
  • Spend at least 5 minutes exploring one of the other databases linked above.

Assignment for Wednesday's class (next week):

  • Read chapter 2 of the textbook.
  • Finish the computer lab exercise.
  • If you have problems with PubMed, go through the interactive tutorial at NCBI, then ask questions.

Problems accessing huskyCT?

Student questions

Bioinformatics Minor

Is it Bioinformatics or not?


 

DataBank Searches at NCBI. Information Retrieval using Entrez.



NCBI (National Center for Biotechnology Information) is home to many public biological databases (see diagram below). All of the databases are interlinked, and they all share a common search and retrieval system, Entrez.

Another representation of the connections between the different databases in Entrez is here.

For the interactive PubMed tutorial click here. An Entrez tutorial (non-interactive) is here (both go well beyond what you need to know for Friday).

Use Boolean operators (AND, OR, NOT) to perform advanced searches. Here is an excellent explanation of the Boolean operators from the Library of Congress Help Page.

Search Field Tags: listed here.
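
To illustrate how Boolean operators and search field tags combine, here is a minimal Python sketch that assembles a PubMed query and the corresponding URL for NCBI's E-utilities ESearch service. The example search terms and the helper names (tagged, build_esearch_url) are assumptions made up for illustration; the sketch only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term: str, field: str) -> str:
    """Attach an Entrez search field tag, e.g. 'ATPase[Title]'."""
    return f"{term}[{field}]"


def build_esearch_url(query: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Return an ESearch URL for the query (hypothetical helper, no network access)."""
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example terms are invented; parentheses group the OR before AND/NOT are applied.
query = (
    f"({tagged('archaea', 'Title')} OR {tagged('archaebacteria', 'Title')})"
    f" AND {tagged('ATPase', 'Title/Abstract')}"
    f" NOT {tagged('review', 'Publication Type')}"
)
url = build_esearch_url(query)

print(query)
print(url)
```

The same query string can be pasted directly into the PubMed search box; the URL form is only needed for scripted searches.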

Explore features of Entrez interface: Limits, Index, History, Clipboard and MyNCBI.

 

Other Useful Databases and Services:

While Medline is incorporating more and more non-medical literature, there might still be gaps in the coverage. Alternatives are other databanks available through the National Library of Medicine (here) and the local services offered at the UConn libraries. Current Contents and Agricola in particular complement PubMed nicely. The best way to access them is through the "SilverPlatter" database. Also, the "Web of Science" database gives access to the Science Citation Index, a database that tracks cited references in journals.

Note that many resources are restricted to the UConn domain, thus you either need to access them from a campus computer or through the proxy account. In some instances you are prompted to connect to the UConn VPN network.


Want to be informed about new sequences/articles in your research area? Check out these services:

PubCrawler
Swiss-Shop


In short, these services allow a user to define queries that are stored in the user's profile. The stored queries are run regularly against the updates to the databases, and the user is informed (by email alert) if there is anything new that matches the queries. A great way to save time!
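
The idea behind such alert services can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how PubCrawler or Swiss-Shop are actually implemented: the function check_alerts, the stand-in fake_search, and the record IDs are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def check_alerts(
    saved_queries: Iterable[str],
    seen_ids: Dict[str, Set[str]],
    run_search: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Return, per stored query, the record IDs that were not reported before."""
    alerts: Dict[str, List[str]] = {}
    for query in saved_queries:
        already_seen = seen_ids.setdefault(query, set())
        fresh = [rid for rid in run_search(query) if rid not in already_seen]
        if fresh:
            alerts[query] = fresh
            already_seen.update(fresh)  # remember what has been reported
    return alerts


def fake_search(query: str) -> List[str]:
    """Toy stand-in for a databank search; a real service would query the databank."""
    database = {"archaea AND ATPase": ["id1", "id2", "id3"]}
    return database.get(query, [])


# The user's profile: one stored query, with one record already reported.
seen = {"archaea AND ATPase": {"id1"}}
first_run = check_alerts(["archaea AND ATPase"], seen, fake_search)
second_run = check_alerts(["archaea AND ATPase"], seen, fake_search)

print(first_run)   # only the records not reported before
print(second_run)  # empty: nothing new since the first run
```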

Other web pages:

Nucleic Acids Research Database Issue
Every year, the first issue of Nucleic Acids Research is devoted to updates on biological databases.

http://www.ebi.ac.uk/
The European homolog/analog to NCBI, software archive.

http://rdp.cme.msu.edu/
The US ribosomal databank project

http://www.jgi.doe.gov/
Genomes at the DOE joint genome institute

http://www.genomesonline.org/
List of completed and ongoing genome projects

http://www.tigr.org
Home of several "completed" genome projects (now renamed the J. Craig Venter Institute)

http://genome-www.stanford.edu/
Yeast and Arabidopsis genome projects

http://www.flybase.org/
Database of the Drosophila genome

http://www.arabidopsis.org/
TAIR - The Arabidopsis Information Resource

http://www.ensembl.org/
Ensembl Genome Browser (Eukaryotic genomes, including Human and Mouse genomes)

 

Sequence and structure databanks can be divided into many different categories.
One of the most important distinctions is whether entries pass a gatekeeper before they are accepted:

 

Supervised databanks with gatekeeper.

Examples:

  • Swiss-Prot
  • RefSeq (at NCBI)

Entries are checked for accuracy.
+ more reliable annotations
-- frequently out of date

 

 

Repositories without gatekeeper.

Examples:

  • GenBank
  • EMBL
  • TrEMBL

Everything is accepted.
+ everything is available
-- many duplicates
-- poor reliability of annotations

Old PowerPoint slides on data banks are here.

In case you want more information on databanks, a computer-science-oriented introduction to biological databanks, their design, and their management is at
https://gcrcweba.lacusc.co.la.ca.us/bifresources/BuildingUsingBiolDB.pdf

 

Discussion: What separates living from dead?

 

What does Bioinformatics have to do with Molecular Evolution? 

Problem: Application of first principles does not (yet) work

The following chain, although (believed to be) mainly determined by the DNA sequence (plus other components of the cell, which in turn are encoded by other parts of the genome), cannot at present be simulated in a computer.

DNA sequence ->
transcription ->
translation ->
protein folding ->
protein function (catalytic and other properties) ->
properties of the organism(s) ->
ecology (taking also the non biological environment into account) ->

... .

 

Most scientists believe that the principle of reductionism (plus new laws and relations emerging on each level) is true for this chain; however, this is clearly "in principle" only.
Biology relies on this sequence to work more or less unambiguously (prions are an exception), but:

At several steps along the way from DNA to function our understanding of the chemical and physical processes involved is so incomplete that prediction of protein function based on only a single DNA sequence is at present impossible (at least for a protein of reasonable size).

Solution:
Use evolutionary context:

"Nothing in biology makes sense except in the light of evolution"

Theodosius Dobzhansky



Present day proteins evolved through substitution and selection from ancestral proteins. Related proteins have similar sequence AND similar structure AND similar function.

In the above mantra "similar function" can refer to:

  • identical function,

  • similar function, e.g.:
    • identical reactions catalyzed in different organisms; or
    • same catalytic mechanism but different substrate (malic and lactic acid dehydrogenases);
    • similar subunits and domains that are brought together through a (hypothetical) process called domain shuffling, e.g. nucleotide binding domains in hexokinase, myosin, HSP70, and ATP synthases.

The Size of Protein Sequence Space (back of the envelope calculation):

Consider a protein of 600 amino acids.
Assume that every position could be occupied by any of the twenty possible amino acids.
Then the total number of possibilities is
20 choices for the first position times 20 for the second position times 20 for the third ... = 20^600 = 4*10^780 different proteins possible with a length of 600 amino acids.

For comparison the universe contains only about 10^89 protons and has an age of about 5*10^17 seconds or 5*10^29 picoseconds.

If every proton in the universe were a computer that explored one possible protein sequence per picosecond, we would have explored only 5*10^118 sequences, i.e. a negligible fraction of the possible sequences of length 600 (one in about 10^662).
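
The back of the envelope calculation can be checked with exact integer arithmetic. The following Python sketch uses only the numbers given above; the helper sci is a hypothetical convenience function that truncates (rather than rounds) to one leading digit.

```python
length = 600    # amino acids in the protein
alphabet = 20   # possible amino acids per position

# Python integers have arbitrary precision, so this is exact.
total = alphabet ** length

protons = 10 ** 89                 # protons in the universe (figure from the text)
age_picoseconds = 5 * 10 ** 29     # age of the universe (figure from the text)
explored = protons * age_picoseconds  # one sequence per proton per picosecond


def sci(n: int) -> str:
    """Format a positive integer as 'd*10^e', truncated to one leading digit."""
    digits = str(n)
    return f"{digits[0]}*10^{len(digits) - 1}"


print(sci(total))              # 4*10^780
print(sci(explored))           # 5*10^118
print(sci(total // explored))  # 8*10^661, i.e. one in about 10^662
```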

The following is based on observation and not on an a priori truth:

If two proteins show significant similarity in their primary sequence, they share ancestry and probably have similar function (this is not necessarily true for nucleotide sequences),
although some proteins have acquired radically new functional assignments, e.g. lysozyme -> lens crystallin.


To date there is no known example where convergent evolution has led to significant similarity of the primary sequence (although there are examples where similar selection pressures have resulted in similar convergent substitutions in homologous proteins).

THE REVERSE IS NOT TRUE:

PROTEINS WITH THE SAME OR SIMILAR FUNCTION DO NOT ALWAYS SHOW SIGNIFICANT SEQUENCE SIMILARITY
for one of two reasons:

a)  they evolved independently
(e.g. different types of nucleotide binding sites);

or

b)   they underwent so many substitution events that there is no readily detectable similarity remaining.

In particular, PROTEINS WITH SHARED ANCESTRY DO NOT ALWAYS SHOW SIGNIFICANT SIMILARITY
(reason: see b) above); many recent advances concern the improved detection of similarity.
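
Reason b) can be illustrated with a toy simulation. The Python sketch below rests on strong simplifying assumptions that are not part of the notes: substitutions hit random positions with equal probability, every replacement amino acid is equally likely, and there are no insertions, deletions, or selection. The function names and the fixed random seed are illustrative only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids


def mutate(seq: str, n_substitutions: int, rng: random.Random) -> str:
    """Apply n point substitutions at random positions (a site may be hit repeatedly)."""
    residues = list(seq)
    for _ in range(n_substitutions):
        pos = rng.randrange(len(residues))
        # Always replace with one of the 19 other amino acids.
        residues[pos] = rng.choice(AMINO_ACIDS.replace(residues[pos], ""))
    return "".join(residues)


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(AMINO_ACIDS) for _ in range(600))

for n in (0, 60, 300, 600, 3000):
    descendant = mutate(ancestor, n, rng)
    print(n, round(identity(ancestor, descendant), 3))
```

Under these assumptions the identity between ancestor and descendant falls toward roughly 1 in 20, the level expected for two unrelated random sequences, so the shared ancestry is no longer detectable from simple identity alone.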