Take Home Test #1
(This is an open-book exam based on the honor system: you may use notes, lecture notes, online manuals, and textbooks.
No teamwork and no chat rooms, please; write down your own answers.)
Please hand in a typed copy of your answers before class on Monday 9/17! (Electronic submissions will be accepted only in case of emergencies!)
- What is Entrez?
- Entrez is so effective because it only uses a non-redundant database.
True or false?
- Define the term "homologous". (Use fewer than 10 words for your answer.)
- In a databank search, what is an E-value? (Use fewer than 10 words for your answer.)
- You do a databank search using FASTA with an amino acid sequence as the query. The only reported match has an E-value of 10. What does this mean for the similarity of the two sequences? (Use fewer than 10 words for your answer.)
- A databank search returns a match with an E-value of “3.4 e-178”. (check ALL that apply)
A) The probability of obtaining a match of this quality or better when searching a database of the same size with an unrelated sequence is about 3.4 × 10^-178.
B) Due to chance we should expect this result (or better) 3.4 × 10^-178 times.
C) Due to chance we should expect this result (or better) 3.4 e-178 times. (e denotes Euler's number and is about 2.71...)
D) This match is insignificant.
E) The two sequences are very likely to be homologous.
F) The P-value is about 3.4 × 10^-182 (i.e. the E-value divided by 10000).
G) The P-value is about 3.4 × 10^-178.
- What is the possible range for P-values?
- What is the possible range for E-values?
- You work as a teaching assistant in a course. One day a rather mediocre student hands in an excellent report on a research project on building trees from molecular data. You suspect that the report was copied from a research article published in the scientific literature. What would you do to find the paper that the student copied from?
- What are the three Boolean operators that can be used to conduct literature searches in Entrez/PubMed?
- Suppose you wanted to find the most recent journal article authored by Dr. James Cole of the University of Connecticut about protein kinases. What exact search query would you perform? (Please include the fields (examples: author, date, journal, etc.) and Boolean operators in your exact search query.)
What is the PubMed ID of the article?
- In which order does Entrez combine searches connected by Boolean operators? Place parentheses in the following example to indicate the order in which Entrez combines the searches.
gogarten jp [AU] AND Doolittle W [AU] OR Lapierre P [au]
What happens if another term is added? (Again, place parentheses to indicate the order of combination.)
gogarten jp [AU] AND Doolittle W [AU] OR Lapierre P [au] AND Zhaxybayeva [au]
- How would you need to formulate a search for articles that Gogarten JP co-authored with either Lapierre P or Zhaxybayeva O?
- Which types of data are stored in GenBank? (check all that apply)
A) DNA
B) mRNA
C) protein sequences
D) protein structures
E) scientific literature, including abstracts and some full text articles
- What kind of data is stored in the “TrEMBL” database?
- Name two advantages and two disadvantages of each: databases with gatekeepers and databases without gatekeepers.
- What properties are traditionally considered as defining characteristics of life?
- You are presented with hemagglutinin gene sequences of two avian influenza virus strains. These strains differ markedly in their virulence in chickens. Align the two sequences using BLAST 2 for proteins (set to the blastp program), with GI number 111054739 as the first sequence and GI number 111054751 as the second.
What is the % identity of the sequences?
Switch the view option from Standard to Mismatch-highlighting and scan through the aligned sequences. What do the dots in the ‘subject’ line represent?
Identify any differences in the amino acid sequences between these two genes by their one letter symbols and position.
Are these sequences homologous? Explain your answer in a sentence.
For Graduate Students:
I. What is the Gaia hypothesis, and who is (are) usually credited as its developer(s)?
II. Who was Alan Turing, and why should you care?
Extra credit:
- Quiz1.xls contains the number of genome projects registered with the GOLD database over time. Assuming that these data can be described by an exponential function, what is the doubling time for the total number of genomes, and for the bacterial and eukaryotic genomes respectively? Describe how you arrived at your answer.
- In the analysis for the previous question, the start of the time axis is assumed to coincide with the completion of the first genome. What starting time for the curve would be more appropriate? Does this affect the doubling time estimate?
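One possible way to approach the doubling-time estimate in the extra-credit questions is sketched below in Python. It is only an illustration under stated assumptions: the function name doubling_time is hypothetical, the numbers are synthetic and are not taken from Quiz1.xls, and the method (a least-squares straight line through the logarithm of the counts) is one common choice among several. A spreadsheet trend line would serve equally well.

```python
import math


def doubling_time(times, counts):
    """Estimate the doubling time of roughly exponential data.

    Fits ln(count) = a + r * t by ordinary least squares and returns
    ln(2) / r, expressed in the same units as `times`.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    # Slope of the least-squares line through (t, ln(count)).
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    sxx = sum((t - mean_t) ** 2 for t in times)
    rate = sxy / sxx
    return math.log(2) / rate


# Synthetic illustration only (NOT the values in Quiz1.xls):
# counts constructed so that they double every 2 time units.
years = [0, 1, 2, 3, 4, 5, 6]
counts = [10 * 2 ** (t / 2) for t in years]
print(round(doubling_time(years, counts), 3))
```

To apply this to the actual data, the time and count columns would first have to be exported from the spreadsheet, and each category (total, bacterial, eukaryotic) fitted separately.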