Introduction to Bioinformatics - Course at FH Mannheim, Summer 2009
Dept. Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269-3125
Participation, Assignments, Exam on Wednesday (Start 13.00)
Notes and assignments will be available through the WWW.
The first set of assignments is at the bottom of this page.
Textbook: none is required but the following are recommended.
Understanding Bioinformatics (Paperback)
Essential Bioinformatics (Paperback)
Excellent book, it provides a very readable and concise overview of the most important tools and concepts in Bioinformatics
by Jean-Michel Claverie
Excellent introductory bioinformatics book.
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Third Edition
Edited by Andreas D. Baxevanis and B. F. Francis Ouellette
The book covers many aspects of bioinformatics that we do not cover in class, but it is an excellent reference. The section on phylogenetics is weak, but you have your instructor to provide you with much more detail.
Don't buy the 2nd edition by mistake!
A book to look things up in and to consult if faced with a real-world problem.
And Molecular Evolution (Paperback)
The authors discuss in detail many applications in molecular evolution and bioinformatics. This book should be very useful to those who want to study some aspects of things covered in this course in more detail.
: A Phylogenetic Approach
Blackwell Science Inc; ISBN: 0865428891
This book gives an excellent introduction to terms, methods, and problems in molecular evolution. It does not contain too many details on individual algorithms, but it provides a very readable overview.
Other recommended books:
Graur and Li: Fundamentals of Molecular Evolution, Second Edition
Bioinformatics (general definition):
Area between Computer Sciences (Informatics) and Biology (genomics)
(or application of the tools of informatics to biology)
Bioinformatics took off only with the availability of large
amounts of genome information, thus a more narrow delineation might be:
Area between Informatics and Genomics
Related areas: Computational biology, Cybernetics
Typically bioinformatics is considered to include:
management of biological databanks, access to biological data, and extracting useful information from biological data.
For more detailed discussion see Mark Gerstein's introduction
What does Bioinformatics have to do with Molecular Evolution?
Problem: Application of first principles does not (yet) work:
The following chain of events, although (believed to be) mainly determined by the DNA sequence (plus other components of the cell, which in turn are encoded by other parts of the genome), cannot at present be simulated in a computer.
DNA sequence ->
protein folding ->
protein function (catalytic and other properties) ->
properties of the organism(s) ->
ecology (taking also the non biological environment into account) -> ... .
Most scientists believe that the principle
of reductionism (plus new laws and relations emerging on each level) is
true for this chain; however, this is clearly “in principle” only.
Biology usually assumes that this sequence works more or less unambiguously (prions being a notable exception), but:
At several steps along the way from DNA to function our understanding of the chemical and physical processes involved is so incomplete that prediction of protein function based on only a single DNA sequence is at present impossible (at least for a protein of reasonable size).
Use evolutionary context -
“Everything in biology makes sense only if considered in the context of evolution.”
Present day proteins evolved through substitution and selection from ancestral proteins. As a result
related proteins have similar sequence AND similar structure AND similar function.
In the above mantra "similar function" can refer to:
Experience shows that protein sequence space is so big that similar sequences do not arise through convergent evolution (at least if significant similarity is detectable through pairwise comparison; in contrast, simple similar protein folds might have evolved twice independently).
The Size of Protein Sequence Space (back of the envelope calculation):
Consider a protein of 600 amino acids. Assume that for every position there could be any of the twenty possible amino acids. Then the total number of possibilities is 20 choices for the first position times 20 for the second position times 20 for the third .... = 20 to the 600 = 4*10^780 different proteins with lengths of 600 amino acids.
For comparison, the universe contains only about 10^89 protons and has an age of about 5*10^17 seconds or 5*10^29 picoseconds.
If every proton in the universe were a computer that explored one possible protein sequence per picosecond, we would have explored only 5*10^118 sequences, i.e. a negligible fraction of the possible sequences with length 600 (one in about 10^662).
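The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up for this example, and the inputs (10^89 protons, 5*10^17 seconds, one sequence per picosecond) are simply the figures quoted in the text, not independently verified values. Working with base-10 logarithms lets the results be read directly as exponents.

```python
import math

# Inputs taken from the text above (assumptions of the estimate, not measurements).
PROTEIN_LENGTH = 600          # amino acids
ALPHABET_SIZE = 20            # possible amino acids per position
LOG10_PROTONS = 89            # the text assumes about 10^89 protons
AGE_SECONDS = 5e17            # age of the universe used in the text
PICOSECONDS_PER_SECOND = 1e12

# log10 of the number of possible sequences: 600 * log10(20), about 780.6
log10_space = PROTEIN_LENGTH * math.log10(ALPHABET_SIZE)

# log10 of the sequences explored if every proton tests one sequence
# per picosecond for the whole age of the universe, about 118.7
log10_explored = (LOG10_PROTONS
                  + math.log10(AGE_SECONDS)
                  + math.log10(PICOSECONDS_PER_SECOND))

# The explored fraction is one in 10^(difference), i.e. one in about 10^662
log10_one_in = log10_space - log10_explored

print(f"possible sequences: about 10^{log10_space:.1f}")
print(f"explored sequences: about 10^{log10_explored:.1f}")
print(f"explored fraction : one in about 10^{log10_one_in:.0f}")
```

Python integers have arbitrary precision, so `20 ** 600` can also be evaluated exactly; the logarithmic form is used here only because it is easier to read.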
The following is based on observation and not on an a priori truth:
If two sequences show significant similarity
in their primary sequence, they have shared ancestry,
and probably similar function.
To date there is no example known where convergent evolution has led to significant similarity of the primary sequence (although there are examples where similar selection pressures have resulted in similar convergent substitutions in homologous proteins).
THE REVERSE IS NOT TRUE:
DOMAINS WITH THE SAME OR SIMILAR FUNCTION DO NOT ALWAYS SHOW SIGNIFICANT SEQUENCE SIMILARITY for one of two reasons:
a) they evolved independently (e.g. different types of nucleotide binding sites); or
b) they underwent so many substitution events that there is no readily detectable similarity remaining.
In particular, DOMAINS WITH SHARED ANCESTRY DO NOT ALWAYS SHOW SIGNIFICANT SIMILARITY (reason: see b) above). Many recent breakthroughs in bioinformatics concern the improved detection of similarity.
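Reason b) can be illustrated with a toy simulation. The sketch below is not a realistic model of protein evolution: it assumes that every substitution hits a random position and replaces the residue with a uniformly chosen different amino acid, with no selection, insertions or deletions. The function names `mutate` and `identity` are made up for this example. As substitutions accumulate, the fraction of identical positions between ancestor and descendant drifts toward the level expected by chance (about 1 in 20), even though the two sequences are homologous by construction.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids


def mutate(seq, n_substitutions, rng):
    """Return a copy of seq after n random single-residue substitutions."""
    residues = list(seq)
    for _ in range(n_substitutions):
        pos = rng.randrange(len(residues))
        # Always substitute with a residue different from the current one.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


def identity(a, b):
    """Fraction of identical positions in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(2009)  # fixed seed so that runs are repeatable
ancestor = "".join(rng.choice(AMINO_ACIDS) for _ in range(300))

for n in (0, 30, 300, 3000):
    descendant = mutate(ancestor, n, rng)
    print(f"{n:5d} substitutions: identity = {identity(ancestor, descendant):.2f}")
```

Real database searches judge similarity statistically rather than by raw identity, so this only shows why the signal fades, not how significance is assessed.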
Problems in finding exons in eukaryotic genomes illustrate the difficulties in "first principle approaches". (See this afternoon's class.) Again, consideration of the evolutionary context provides a solution.
Powerpoint slides on protein space and homology are here
Databanks and databank searches:
Sites for databank searches and retrieval:
The NCBI maintains several databanks. The entries in each databank are pre-linked to other entries in the same databank and to entries in the other databanks
(PubMed), including books
Medline - DNA - protein genome data banks - protein structures - books
Everything is already cross-linked between the three databanks.
"Homologous" sequences and papers (!) one click away (related sequence / related medline buttons)
Warning: Sometimes CROSSLINKS are updated only slowly. Links of papers to sequences often never make it into the databanks.
In addition to using the prelinked relationships you can search for similar sequences at the NCBI's site; however, often this is not necessary.
A sample GenBank-formatted entry is here. Explore the meaning of the different links.
Other formats that are frequently used and notes on the different alphabets are here.
An easy way to stay up to date is to use services (agents) that search the web for new and interesting publications or sequences. Many companies offer this; some services that are available to everyone are:
Also: While Medline is incorporating more and more non-medical literature, there are still gaps in the coverage. Alternatives are other databanks available through the National Library of Medicine (here), through the ISI Web of Science and through local services.
The Web of Science databases allow you to search articles that cite a particular article or author (do demo on laptop via squid)
Powerpoint slides on data banks are here (use only the first 5 slides)
[ A) write down your answers!
B) write your name on a piece of paper, fold it into a sign and put it on your desk. If you want to, fold a paper crane (really short instructions are here, really long instructions are here) and write your name on the wings!
C) if you need help with an assignment, move your name sign to the top of your screen ]
1. Use PubMed in NCBI's Entrez to find an article written by Carl R. Woese (famous scientist, co-discoverer of the Archaea), published in the journal Proceedings of the National Academy of Sciences with the words primary kingdoms in the title of the paper. Try to use Boolean operators and field tags; if you cannot recall the tags, use the Preview/Index tool.
What query found the 1977 article?
How many related articles are linked to this article?
When was the most recent of the related articles published?
Search for the same article in Google Scholar. How many articles cited Woese's 1977 PNAS paper? When was the most recent citation?
In what order does Google Scholar list the articles that cite the Woese paper? (A consequence of this is that the rich get richer.)
In what order does Entrez list the related articles?
2. Dr. JP Gogarten seems obsessed by an important protein called ATP synthase. Is he interested in anything else? How many articles did he publish that are NOT related to the ATP synthase OR ATPase?
What query did you assemble?
How many articles did you find?
Find a paper co-authored by Senejani, Hilario and Gogarten published in BMC Biochemistry.
What was the topic of the paper?
Display the abstract of this paper and click on book in the link menu (on top right of abstract). Items in the abstract that are covered in any of the reference books turn into hyperlinks. If you need more information on any of the items follow these links. What item did you look up? Was this helpful?
3. To what domain, phylum/kingdom and family does Thermoplasma belong? (Use the Taxonomy link in Entrez)
4. How many protein sequences are available for Thermoplasma acidophilum? How many are available for the genus Thermoplasma?
(In the taxonomy browser go to Thermoplasma and check protein in the header then hit "return".)
5. Use Entrez to find a protein sequence that is of interest to you. (If you don't find something of interest, use gi|405795.)
How many related protein sequences does your sequence have (see the pulldown menu under LINK)?
How many related nucleotide sequences does your sequence have (see the pulldown menu under LINK)?
How many related nucleotide sequences does the nucleotide sequence have?
Explore the BLink page (results from a data bank search with this sequence).
What is shown on this page? (check where some of the links lead to)
What do the colors in the symbolic alignment on the right hand side signify?
Where do the three links in every entry link to?
Note: all of these results are already linked to your sequence, you did not need to perform a new search to get the results.
The symbolic alignment is particularly helpful in case your protein consists of many different domains (go here for a striking example).
6. How many different archaeal RubisCO (= ribulose bisphosphate carboxylase oxygenase = rbcl = ribulose bisphosphate carboxylase large subunit) encoding genes can you find in the protein data bank? Pretend that you are only interested in RubisCO genes in Archaea, NOT in bacterial RubisCOs. (Archaea and Bacteria are the two domains of prokaryotes.)
One possibility is to utilize your clipboard at the NCBI. Start by selecting "protein" in ENTREZ. Explore different search strategies (names, fields, enzyme and substrate names ... .) Save positives to the clipboard. If you later go to the clipboard, you can retrieve related sequences. Remember, nobody claimed that this is a perfect world. It certainly is not easy to formulate a good search strategy. If you don't know if an organism is an Archaeon, click on the taxonomy link associated with most sequences.
How many different archaea that have a RubisCO homologue can you find?
Do many of these have more than one RubisCO gene?
Can you be sure that you found all of them?
7. If you have time and/or interest sign up for the pubcrawler service (see above) to send you a notice when a paper is published on something you are interested in.
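For those who later want to script searches like the ones in the assignments, the Boolean operators and field tags used in the Entrez web form can also be combined in a small program. The sketch below only builds a query string and the corresponding URL for NCBI's E-utilities `esearch` service; it does not contact the server. The helper names (`tagged`, `all_of`, `any_of`, `without`, `esearch_url`) are made up for this example, and the author, journal and title words are deliberately arbitrary placeholders, not the answer to any assignment.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach an Entrez field tag, e.g. 'Doe J' + 'Author' -> 'Doe J[Author]'."""
    return f"{term}[{field}]"


def all_of(*parts):
    """Combine sub-queries with AND (every part must match)."""
    return "(" + " AND ".join(parts) + ")"


def any_of(*parts):
    """Combine sub-queries with OR (at least one part must match)."""
    return "(" + " OR ".join(parts) + ")"


def without(query, excluded):
    """Keep records matching query but NOT matching excluded."""
    return f"({query} NOT {excluded})"


def esearch_url(db, query, retmax=20):
    """Build an esearch URL; urlencode escapes spaces and brackets."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# Placeholder example: papers by a hypothetical author in a given journal,
# excluding those with either of two words in the title.
query = without(
    all_of(tagged("Doe J", "Author"), tagged("Nature", "Journal")),
    any_of(tagged("kinase", "Title"), tagged("phosphatase", "Title")),
)
print(query)
print(esearch_url("pubmed", query))
```

The explicit parentheses are a design choice: they make the order in which AND, OR and NOT are applied visible instead of relying on how Entrez evaluates an unparenthesized query.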