Future Monday and Wednesday classes will take place in CB 206!

 

Assignments for Friday's class:

Assignments for Monday's class:

 

Annotation problem, e.g., missing ORFs in E. coli; see http://mic.sgmjournals.org/content/156/7/1909.full

Types of Error in a Databank search

False positives: The number of false positives expected by chance is estimated by the E-value. The P-value, or significance value, gives the probability that a match at least this good arises by chance alone, i.e., that a positive identification is made in error (same as with drug tests).
Danger: avoid fishing expeditions. If you do 100 tests on random data, you expect one to be positive at the 1% significance level.
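
To put numbers on the fishing-expedition danger, here is a minimal Python sketch. It assumes the tests are independent of each other (an assumption made for this illustration, not something stated above); the function names are made up for the example.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of chance positives among n_tests tests on random data."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Probability that at least one test is positive by chance.

    Assumes the tests are independent (assumption for this sketch).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests, alpha = 100, 0.01  # the numbers used in the text
print(expected_false_positives(n_tests, alpha))          # about 1
print(prob_at_least_one_false_positive(n_tests, alpha))  # about 0.63
```

So with 100 tests at the 1% level, a "significant" result somewhere in the batch is more likely than not, even on purely random data.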

You could apply the Bonferroni correction:

For each individual test, divide the overall desired significance level by the number of parallel tests. The hypothesis to be rejected is that ALL of the individual tests are not significantly different from chance.
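
The correction can be sketched in a few lines of Python. The P-values below are invented for illustration, and the helper names are hypothetical; only the rule itself (overall level divided by the number of tests) comes from the text.

```python
def bonferroni_threshold(overall_alpha: float, n_tests: int) -> float:
    """Per-test significance level: overall level divided by number of tests."""
    if n_tests < 1:
        raise ValueError("need at least one test")
    return overall_alpha / n_tests


def significant_after_bonferroni(p_values, overall_alpha=0.05):
    """Flag each P-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(overall_alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Made-up P-values from four parallel tests (illustration only).
p_values = [0.0001, 0.003, 0.02, 0.04]
print(bonferroni_threshold(0.05, len(p_values)))  # per-test level, 0.05 / 4
print(significant_after_bonferroni(p_values))     # [True, True, False, False]
```

Note that 0.02 and 0.04 would each pass an uncorrected 5% test but fail once the four parallel tests are taken into account.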
(For more discussion of fishing expeditions see here)

False negatives: Homologous sequences in the databank that are not recognized as such. If there are only 12000 different protein families, then on average a sequence should have (size of the databank)/12000 matches. A search usually reports far fewer hits than that, so the number of false negatives is probably very large.
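
A rough back-of-the-envelope version of this estimate in Python. The databank size and the number of hits found are hypothetical values chosen for illustration, and the sketch assumes all families are equally large, which is a simplification.

```python
N_FAMILIES = 12000  # number of protein families quoted in the text


def expected_homologs(databank_size: int, n_families: int = N_FAMILIES) -> float:
    """Average number of homologs per query, assuming equally sized families."""
    return databank_size / n_families


def estimated_false_negatives(databank_size: int, hits_found: int) -> float:
    """Expected homologs minus the hits actually reported (never below zero)."""
    return max(0.0, expected_homologs(databank_size) - hits_found)


databank_size = 6_000_000  # hypothetical databank size
hits_found = 40            # hypothetical number of significant hits

print(expected_homologs(databank_size))                      # 500.0
print(estimated_false_negatives(databank_size, hits_found))  # 460.0
```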