Powerpoint slides on BLAST are here

Blast, PSI–blast, and Homology

Are two similar sequences homologous (i.e., is their similarity due to shared ancestry) ?

One way to quantify the similarity between two sequences is to

1.    compare the actual sequences and calculate alignment score

2.    randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences.

3.    repeat step 2 at least 100 times

4.    describe distribution of randomized alignment scores

5.    do a statistical test to determine if the score obtained for the real sequences is significantly better than the score for the randomized sequences

To illustrate the assessment of similarity/homology we will use a program from Pearson's FASTA package called PRSS. 
This and many other programs by Bill Pearson, web page at http://fasta.bioch.virginia.edu/ 

A web version is available here. (Output of old PRSS is here, fasta search here)

Go through example. Sequences are here (fl), here (B), here (A) and here (A2)

Summary of results is here.

There are many other alignment programs.  BLAST is a program that is widely used and offered through the NCBI (go here for more info).  It also offers to do pairwise comparisons  (go here, do example).

To force the program to report an alignment increase the E-value.

Rules of thumb:

Usually E values (in a blast search or through randomization) smaller than 10-4 are convincing.

(For small values the E value gives the probability to find a match of this quality in a search of a data bank of the same size by chance alone - for more detailed information see Terminology section and the BLAST help manual on P values)

If you can demonstrate significant similarity using either randomization or an unweighted blast search, your sequences are homologous (i.e. related by common ancestry).  Convergent evolution has not been shown to lead to sequence similarities detectable by these means (see above - this might not be true for scores in PSI-blast)

If the actual alignment score is more than three standard deviations (of the randomized sequences) better than the mean for the randomized sequences, the two sequences are homologous (i.e. related by common ancestry).  PRSS and many other program use more accurate distributions to describe the distribution of random hits.  The expectation value for the alignment-score of the actual sequences is based on these statistics.

Terminology:

E-values give the expected number of matches with an alignment score this good or better,
P-values give the probability of to find a match of this quality or better. P values are [0,1], E-values are [0,infinity).
For small values E=P
z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences expressed as multiples of the standard deviation calculated for the randomized scores.
For example: a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences. Z-values > 3 are usually considered as suggestive of homology, z-values > 5 are considered as sufficient demonstration. (see the but below). A somewhat readable description of E, P, HSP and other values is here.

BUT:
Failure to detect significant similarity does only shows our inability to detect homology, it does not prove that the sequences are not homologous.

Examples:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structure, and therefore can be assumed to be homologous, however, the above tests fail to detect this homologies. (for example, enzymes with GRASP nucleotide binging sites are depicted here.)

DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g.: sliding clamp, E. coli dnaN and eukaryotic PCNA, see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.

Helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (see gogarten.uconn.edu rotating F-ATPases)

BLAST and PSI BLAST

Run a blast trial with  this sequence

A BLAST tutorial for standard blast is here; a more general tutorial on BLAST, including PSI BLAST, is here

An easy way to force the program to report less significant matches is to increase the expect threshold in the blast search form pages

PSI-blast provides an enormous advantage over normal blast in the detection of distantly related sequences.  It only works if some closely related sequences are already available, but if this is the case it finds a lot of other distantly related sequences. 

The NCBI page describes PSI blast as follows:
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases. 

A diagram giving an overviw over the PSI-blast procedure is here.

The results of a normal blast search are aligned and a pattern of conserved residues is extracted from the alignment.  This pattern is used for the next iteration.  An important parameter to adjust is the E-value threshold down to which matches are included in the alignment and pattern extraction. 
At higher iterations a PSI blast profile can be corrupted and false positives are identified with “significant” E-values.   I.e., in a traditional blast search one can be quite certain that a match with an E-value of 10^-13 represents a homologue; this is not clear with PSI blast.  Test studies indicate that profile corruptions are likely after more than 5 iterations. On the positive side: there are many fewer false negatives with PSI blast than with normal blast.

 

Assignment #3:

Write down your answers!

Blast using the NCBI web interface:
  1. There are different BLAST programs.
    Why/when would you want to use BLASTP? When blastN, blastx, tblastn or tblastx?
    Which program would you use in the following situation: A genome has been sequences and annotated, but the F-ATPase is not present as an annotated ORF. You have databanks of the annotated ORFs and of the original nucleotide sequence.


  2. Using a protein sequence of your choice (see above, or gi|59713171) search NR, SWISSPROT and env_nr.
    How many significant hits did you find in each of these databanks?
    How do the results change, when you use a smaller word size or a different scoring matrix (use swissprot to explore these questions)?
    Do all the target sequences have similar description lines? At what E-values does the content of the annotation lines change?
    Compare your results with that of your neighbors.
    Did the low complexity filter replace any part of your query sequence with XXX?
    If you want to search an individual genome, you can go to the corresponding genome page (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi click in the genome in the Refseq column), and start a blast search. You also can start searches of several genomes simultaneously at the microbial genome blast page.

  3. Using Entrez download the protein sequence with PID# 2506213. What type of proteins do you find among the related proteins?

  4. Copy the FASTA format of gi2506213 onto your computer's clipboard (not the clipboard of the NCBI page), go to the NCBI BLAST page and do a BLAST search (default options). Do you obtain any different results?

  5. Download the fasta formated sequences for D-Alanin-D-Alanin ligase (gi|145722) and a glutathione synthetase (gi|121663). Also check among the related sequences for carbamoyl phosphate synthetases.

  6. Do a BLAST search of Swissprot with the D-Ala D-Ala ligase or the glutathione synthetase as a query. Increase the "Expect" parameter to 1000 and the "descriptions" to 500. (You actually can skip this item, because the first iteration of a PSI Blast search corresponds to the normal Blast search!)

  7. Do a PSI-BLAST search with the same sequence as a query (use the SWISSPROT database as target). YOU NEED TO INCREASE THE MAXIMUM NUMBER OF TARGET SEQUENCES TO 20000 IN THE FIRST BLAST FORM. After how many iterations do you start to pick up the other sequence types? Which other types of enzyme are included among the hits?
    (If it is too slow check here for 7 rounds of PSI-BLAST).

  8. Do a PSI-BLAST search for 3 iterations with the following sequence:
    >gi|7436316|pir||D75028 Pab VMA intein CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
    AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
    HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
    LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFI GGNMPTLLHNT
    What types of enzymes do you get as hits? Do you notice anything strange about the search results? Save the PSSM (Position Specific Scoring Matrix, or profile) from your search on the 4th iteration. To do that choose PSSM from download menu on top of the result page. We are going to use this profile in the next question.

  9. Now we will use the PSSM to BLAST the completed genomes. Go to Microbial Genomes Genomic BLAST page (Let it load completely before choosing any options). Paste intein sequence into query sequence box, change Query and Database entries to "Protein". Choose one or several of the following genomes as Database:
    • Pyrobaculum aerophilum
    • Aeropyrum pernix
    • Sulfolobus tokodaii
    • Archaeoglobus fulgidus
    • Methanothermobacter thermautotrophicus
    • Thermoplasma volcanium
    • Methanococcus jannaschii
    • Saccharomyces cerevisiae (This genome is on the eukaryotes Genomic BLAST page - select eukaryoted on top of the genome blast page)

    After that click "Adv. BLAST" button. This will redirect you to the BLAST search window. Under Program selection check PSI blast and load your PSSM from Question #7 under the PSI/PHI blast menu.
    What are the results of your search? Did you get any significant matches? What are they? If you have significant matches, does the match occur over the full lengths of both query and subject sequences? Use Blink to investigate if the hits are indeed inteins.
    What is your conclusion? In your answer indicate
     - genomes searched,
     - number of significant matches found,
     - the E-values of these matches, and
     - the identity of these matches
         (i.e., are these probable inteins, or are they likely to be something else?).




    The following is an exercise that you might want to try on a computer where you are able to run local blast searches, and where you are the admistrator. You might want to return to this exercise if you have time this afternoon or next week.

    Using Blast on a "local" machine (demo using bbcxsrv using files in Blasttest_):
    Many completed genomes can be found on the NCBI FTP site (ftp://ftp.ncbi.nih.gov). The eukaryotic genomes are accessible from the genomes directory while all prokaryotic (bacteria + archaea) genomes are stored in the the subfolder /bacteria (a clear case of anti archaeal discrimination). Go into the bacteria sub-folder and click on the Pyrococcus abyssi folder (Pyrococcus is a member of the archaeal domain). Among the files in these folders we are are interested in the .ffn file which contains all nucleotide sequences for individual predicted ORF's and the .faa file which contains the predicted ORF's amino acids sequences. Both files are in a fasta format. Copy the .faa file on to your computer and rename the file p_abyssi.faa. Also download the .faa file form Thermotoga maritima and rename it to t_maritima.faa.
    Also download any other genomes you might be interested in!

  10. On the cluster (remote machine), you should NOT run long processes on the master node. Either qrsh to a node that is not busy (type qrsh **), or submit your command to the queue using qsub.

    On the XSERV cluster using the ssh connection established earlier type
    qrsh -l hipri=TRUE <enter> enter your password when prompted,
    cd blasttest
    formatdb -i p_abyssi.faa -o T -p T
    The -i is to indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence, you need to use -p F (for False); the -o T instructs the formatdb program to create indices. For more information on the formatdb program check http://bioinformatics.ubc.ca/resources/tools/formatdb or type  formatdb - <enter>
    You need to replace p_abyssi.faa with one of the genomes you downloaded. This genome will be converted into a searchable database.

    To do a blast search of every sequence in t_maritima.faa against the P.abyssi genome, we use blastall. To get information on the program type
    blastall <enter>
    For details check one of the many help pages on hereand here.

    We will use the following options (you need to change the command to reflect the names of the genomes you want to compare; -d gives the name of the databank that you created using formatdb):

    blastall -i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10 -m 8 -a 2
    -i the query sequence(s)
    -d the database
    -o the output file
    -p the blast program to use
    -e the e-value cut off
    -m the format of the output. option 8 gives a tabular output, that can be easily read into excel spreadsheet or into a database program. option 9 is similar but includes comment lines.
    the individual columns in the output denote the following
    # Query id # Subject id # % identity # alignment length # number of mismatches # number of gap openings # position of query start # position of query end # position of subject start # position of subject end # e-value of a hit # bit score of a hit

    Which option turns off the the low complexity filter?
    Which option, and which setting, sets the wordsize to 2?
    Which option allows to use two processors?

    Inspect the output typing more blast.out on the commandline on the cluster

    Often one is only interested in the top scoring hit for each query. A simple perlscript (here) goes through the blastall output in m8 or m9 format and retains only the top scoring hit, and it adds a header line that will be correctly interpreted by EXCEL. Save this script into your blasttest folder on the cluster (the easiest is to ctrl click on here and select save link as, then select the blasttemp folder in the finder). Glance through the script using a text editor (vi, or textwrangler, or just click on the here link in your browser).

    To run the blast.out file through the perl script type
    perl extract_lines.pl blast.out
    This will create a file in your directory called blast.out.top that only contains the top scoring hits, and from which the comment lines (in case you used the m9 option in blastall) are removed.

    A modified version of this program (here) adds another column to the listing that for each alignment gives the bitscore per residue in the alignment.

    Import this file into an excel spreadsheet. If you have an afp connection to the cluster, you can

    start EXCEL on you local machine,
    open an empty spreadsheet,
    select Data -> Get external data;
    select your file (you'll need to tell it to show you all files, not only files with a recognized extension)

    Sorting the data on the different columns you can figure out which gene was found to be most similar between the two genomes.

    Which were the genomes you selected?
    Which gene pair had the highest % identity?

    (to retrieve the information on target sequence, you can use the command fastacmd. If the GI number of the target is 14521962, and the database is p_all.faa the command is as follows:
    fastacmd -s 14521962 -d p_all.faa
    By default the databank is nr as installed on the cluster. Thus fastacmd -s 14521962 also works (and you get more annotation), provided the sequences in your databank are also in nr.

    Next, we will calculate a histogram for the % identity values for all the significant alignments.
    In Excel select
    Tools -> Data Analysis, (the 2008 version of Exel for Macs no longer allows to calculate histograms via the add-ons, but there are spreadsheets abailable on the net into which you can copy paste the the data you want to plot - a histrogram generating spreadsheet for Excel2008 is here)
    select histogram. (If this does not show-up in your menu, you need to install the analysis tool pack. If you did not install office with all options, you will need to have the office installation disk, sorry.)
    In the window select the column you want to analyze, to beautify things, you might add a columns with bin ranges - see demo in class.

    Do you observe a smooth distribution of % identiy values?
    Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons?
    Can you come up with other interesting things you could do with these data in Excel?

    (histogram for the values in other columns, correlation between the values in different columns, calculate additional columns with corrected values;
    one informative exercise is to sort the data table on the E-values and to exclude the hits with non-significant E-values, and to compare the histograms you calculated with all hits (E<10 and he histogram you calculated with E< 0.0001. Can you explain the observation? If you don't have time to do this yourself, go here - this compares T.maritima top scoring hits to three Pyrococcus genomes, check the %identity sheet.)