PSI BLAST (ppt-slides)

BLAST Example:

A) download genomes as faa files (all ORFs as mupltiple fasta files) -- one source ftp://ftp.ncbi.nlm.nih.gov

B) to use numbers as identifiers replace GI numbers with consecutive numbers
   
(possible perl script

#!/usr/bin/perl -w

#decided to have input file entered in command line
#####INPUT Name of multiple seq file, open file and assign IN filehandle #############
unless(@ARGV==1) {die "please provide file name in command line \n
file should contain multiple sequences in fasta format \n\n";}
$num=0;
$filename=$ARGV[0];
open(IN, "< $filename") or die "cannot open $filename:$!";
$outfilename = "$filename"."\.num";
open(OUT, "> $outfilename")||die "cannot open: $!";

##################################################

while (defined ($line=<IN>)){ # read through file line by line

if ($line=~/^>/) { #look for beginning of line starting with > (^ is an anchor for the beginning of the line)
$num++;
$line =~ s/^>//;
$line= ">"."$num \t"." $line";
};
print "$line"; #print to screen
print OUT "$line"; #print to OUT
}

close(IN);
close(OUT);

Rename renumbered genomes

install blastall programs on your computer http://www.ncbi.nlm.nih.gov/blast/download.shtml

format data bank to search:
formatdb -i tmaritima -p T -o T

run blast search of all ORFs in one genome against other genome:
blastall -p blastp -d tmaritima -i tpetrophila -o results.tab -F F -m 9 -W 2 -a 2 -e 0.0000001

To generate a file that directly can go to excel use option -m 8

(type blastall or formatdb - to get more info)

DO EXAMPLE...

similar for PSI Blast commandline:
blastpgp -d nr -i test -j 5 -C test.chk -a 2 -o test.out -h 0.00001 -F f

blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i test -a 2 -R test.chk -o test.out3 -F f


PSI-blast provides an enormous advantage over normal blast in the detection of distantly related sequences.  It only works, if some closely related sequences are already available, but if this is the case it finds a lot of other distantly related sequences. 

The NCBI page describes PSI blast as follows:
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases. 

A diagram giving an overview on the PSI-blast procedure is here.

The results of a normal blast search are aligned and a pattern of conserved residues is extracted from the alignment.  This pattern (the Position Specific Scoring Matrix) is used as query for the next iteration.  An important parameter to adjust is the E-value threshold up to which matches are included in the alignment and pattern extraction. 
At higher iterations a PSI blast profile can be corrupted and false positives are identified with significant E-values.   I.e., in a traditional blast search one can be quite certain that a match with an E-value of 10^-13 represents a homologue; this is not clear with PSI blast.  Test studies indicate that profile corruptions are likely after more than 5 iterations. On the positive side: there are many fewer false negatives with PSI blast than with normal blast.

  • False negatives: Homologous sequences that are not detected -- less in PSI blast than in normal blast
  • False positives: Non-homologous sequences that are listed as matches -- very few in normal blast - possible in PSI blast after profile corruption

The "problem" is that the E-value reported in a PSI-blast search represents the match with the profile, not with the original sequence!!

PSI BLAST Example

Query sequence:

>gi|163506|gb|AAA30693.1| peripherin
KFDQKKRVKLAQGLWLMNWFSVLAGIIIFGLGLFLKIELRKRSDVMNNSESHFVPNSLIGVGVLSCVFNS
LAGKICYDALDPAKYAKWKPWLKPYLAVCVLFNVVLFLVALCCFLLRGSLESTLAHGLKNGMKFYRDTDT
PGRCFMKKTIDMLQIEFKCCGNNGFRDWFEIQWISNRYLDFSSKEVKDRIKSNVDGRYLVDGVPFSCCNP
NSPRPCIQYQLTNNSAHYSYDHQTEELNLWLRGCRAALLSYYSNLMNTTGAVTLLVWLFEVTITVGLRYL
HTALEGMANPEDPECESEGWLLEKSVPETWKAFLESVKKLGKGNQVEAEGEDAGQAPAAG

Blast page

  • Use SWISSPROT
  • Check format for PSI-blast
  • Check inclusion limit
  • Increase number of reported matches
  • Point out manual selection of sequences to be included in profile
  • Use PSSM to search something else (individual genomes (here), or PDB)

Assignments for Friday (10/27)

Selection versus genetic drift.

Selection

Deterministic models to describe selection:  (diploid organisms, two alleles A1 and A2)

    codominance (kind of logistic equation) q=frequency of allele A2, 

Genotype:                                     A1A1        A1A2        A2A2

Relative number of offspring           1             1+s          1+2s

Fitness                                            w11          w12          w22

frequency                                      p^2             2pq          q^2

(pq: allele frequencies,=> genotype frequencies in Hardy Weinberg equilibrium)

Change in frequency (approximately):  
dq/dt= s* q*(1-q) and

q(t)=1/(1+((1-q0)/q0)*e-st)

    over dominance

Genotype:                                          A1A1    A1A2    A2A2

 Relative number of offspring                1         1+s1     1+s2

          s1>s:   balancing selection (try it)

Go to Kent Holsinger's collection of JAVA applets here and explore some of the time courses with different values of s1 and s2.  

Under which conditions of w11, w12, and w22 can one maintain both alleles over long periods of time?

Stochastic approaches -- random drift - neutral evolution:

Law of the gutter (see also Steven J Gould?s interpretation on the trend to increasing complexity)

Explore some simulations: 
     Drift only (vary the population size N),

How does the survival of multiple alleles in a population depend on the population size.

     Drift and Selection (interesting setting: P=0.01, N=50)

Note: Even though the allele conveys a strong selective advantage of 10%, the allele has a rather large chance to go extinct quickly.

     This simulation follows many populations (with the selected parameters) over time. It plots a histogram that shows how many of the populations have the allele frequency indicated on the y-axis. If you set the mutation rate to 0, this provides a nice illustration of the law of the gutter. (In the presence of the alleles converting back and forth, fixation does not occur.)

Mutation rate versus Substitution rate

The following assumes co-dominance or no selection:

s=0:  Probability of fixation, P, is equal to frequency of allele in population, q

mutation rate (per gene/per unit of time) = u ;  

frequency with which new alleles are generated in a diploid population size N equals to u*2N

Probability of fixation for each new allele = 1/(2N)

Substitution rate = frequency with which allele is generated * Probability of fixation= u*2N *1/(2N) = u

Therefore:
The substitution rate is independent of population size if s=0 and equal to the mutation rate!!!!

This is the reason that there is hope that the molecular clock might sometimes work.

For advantageous mutations: 
      Probability of fixation, P, is approximately equal to 2s;
      e.g., if selective advantage s = 1% then P = 2%

      Does this correspond to the simulations you performed above?

Fixation time

Neutral mutations:  tav=4*Ne generations 
(Ne=effective population size; For n discrete generations Ne= n/(1/N1+1/N2+?..1/Nn)

S unequal to 0:  tav= (2/s) ln (2N) generations  (also true for mutations with negative s --  How can this be??)

E.g.:  N=106, s=0:  average time to fixation: 4*106 generations

N=106, s=0.01:  average time to fixation: 2900 generations