Assignments

Think about the Intron early and the Intron late theories.
What does the Go-plot depict (SEE BELOW)?
Where in biology does exon shuffling occur? (see here)

For those interested in the origin of introns:
Why might the successful prediction of an intron in insertion site #5 (found in Culex) not be fully convincing evidence for the intron early theory? (see here, and Figs. 3 and 4 here)

 

Dotlet

Dotlet can also handle DNA-DNA comparisons, including comparisons against the reverse complement:
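As a reminder of what the reverse complement is, here is a minimal Python sketch. The function name is made up for this illustration and is not part of Dotlet; only the four standard bases are handled.

```python
# Complement table for the four standard bases (upper and lower case).
# Ambiguity codes such as N are left unchanged by translate().
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then read the strand backwards.
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCCG"))
```

A match between one sequence and the reverse complement of the other shows up in a dotplot as a diagonal running in the opposite direction.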

 


Comparison of a nucleotide sequence containing introns with the protein sequence it encodes.

in Dotlet:

exons in dotlet

Using BLAST:

finding exons using blast

Aside: recall types of introns and where they occur, the "role" of introns. See below for more details.

Repetitive proteins in Dotlet

How many repeats do you identify when you compare the Methanopyrus sequence against itself?

repetitive domains

BLAST -> Genome DotPlot

Discuss figure: Where are the corresponding genes, and where are duplicated genes and gene family members?

genome dot-plot

 

Discuss the rest of the PowerPoint slides from class 4.

 

Sequence alignment

Pairwise alignment

A) DOT PLOT

The easiest way to align two sequences is to use a dotplot. In its most straightforward implementation, the two sequences to be aligned are written along the two coordinate axes, and a dot is placed wherever the residues on the two axes are identical.

In more realistic implementations a window of 5 to 20 nucleotides or amino acids is slid along one of the axes (i.e., sequences) and compared to every possible window on the other axis (sequence). The dot intensity is adjusted to reflect the percent identity (or similarity) in the two windows.
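The windowed comparison can be sketched in a few lines of Python. This is only an illustration under simple assumptions (plain percent identity, no substitution matrix, text output instead of gray levels); the function names are made up here and this is not how Dotlet itself is implemented.

```python
def window_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equally long windows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dotplot(seq_x: str, seq_y: str, window: int = 1) -> list:
    """Compare every window of seq_x with every window of seq_y.

    Rows follow seq_y, columns follow seq_x. Each cell holds the
    fraction of identical residues in the two windows (0.0 to 1.0).
    """
    n_cols = len(seq_x) - window + 1
    n_rows = len(seq_y) - window + 1
    return [
        [
            window_identity(seq_x[j:j + window], seq_y[i:i + window])
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


def render(matrix: list, threshold: float = 1.0) -> str:
    """Draw '*' where the window identity reaches the threshold."""
    return "\n".join(
        "".join("*" if value >= threshold else "." for value in row)
        for row in matrix
    )


# Self-comparison with a window of 3: only the main diagonal is drawn.
print(render(dotplot("GATTACA", "GATTACA", window=3)))
```

Lowering the threshold (for example to 0.6) lets imperfect matches appear, at the price of more background noise.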

See the Dotlet exercises from last Friday.


Optimal global and local alignments.

There are many different algorithms for calculating pairwise sequence alignments. For two sequences it is "easy" to calculate an optimal global alignment (according to the motto "It can be easily shown" -- see here). The so-called Needleman-Wunsch algorithm is widely used; it maximizes an alignment score. A related approach, equivalent under some conditions, is to minimize the differences between the two sequences.
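A minimal sketch of the Needleman-Wunsch idea in Python. The scoring values (match +1, mismatch -1, gap -2 per position) are assumptions chosen for the example; real programs use substitution matrices and usually score gap opening and gap extension separately.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the lower right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(row_a)
print(row_b)
```

Several alignments can share the optimal score; the traceback above returns just one of them.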


Multiple Sequence Alignments


CLUSTAL, CLUSTALW and CLUSTALX

Usually global alignments are the easiest to calculate (local see below)

One of the easiest to use, most sophisticated, and most versatile alignment programs is clustalw:

Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244.
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680.

Clustalw runs on all common platforms (Unix, Mac, PC), is part of most multiprogram packages, and is also available via different web interfaces (for example here and here).

Clustalw uses a very simple menu-driven interface, and you can also run it entirely from the command line (i.e., it is easy to incorporate into scripts).

Clustalx uses the same algorithms as clustalw. However, it has a much nicer interface, it displays information on the level of similarity, and it uses color in the alignment. Especially for amino acids, the use of color greatly enhances the ability to recognize conservative replacements. Clustalx 2.1 is available for different platforms at the EBI's ftp site (follow your platform; clustalx is stored in the clustalw folders).

Clustal reads and writes most formats used by different programs.  The easiest format is the FASTA format:

> name of sequence or any other information goes in the first line. This line starts with ">". The line can be longer than 80 characters. The first line ends with the first paragraph sign (line break).
The second line contains the sequence itself; numbers and other nonstandard characters are ignored. Be careful if you download sequences: transfer programs often introduce paragraph signs every 100 characters, and the end of a command line frequently ends up as the beginning of the sequence.
All sequences to be read should be in a single file.
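To make the format concrete, here is a small Python sketch of a FASTA reader that follows the rules above. The function name and the example records are invented for illustration; the choice to keep only letters, "-" and "*" is an assumption and is not necessarily identical to what clustalw does.

```python
def parse_fasta(text: str) -> dict:
    """Read FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new record starts; everything after ">" is the header.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence line: drop digits, spaces and other stray characters.
            records[name].append(
                "".join(ch for ch in line if ch.isalpha() or ch in "-*")
            )
    return {header: "".join(parts) for header, parts in records.items()}


example = """>seq1 made-up example
ATGC 10
GGTA 20
>seq2 made-up example
ATGCGG-A
"""

for header, sequence in parse_fasta(example).items():
    print(header, sequence)
```

Text that appears before the first ">" line is ignored, which guards against stray command-line residue at the top of a downloaded file.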

(sample clustalw input file)

(sample clustalw output file)

Clustal also reads aligned sequences.  If you input aligned sequences you can go directly to the tree section.
Be careful: if you make a mistake and the sequences are not aligned, your tree will look strange!
ALWAYS CHECK YOUR ALIGNMENT!

Clustal is also useful for reformatting and editing alignments, and it is very forgiving in reading formats. For example, you can open a file in clustal format (*.aln) in a text editor, delete columns, reload the file into clustalw, and write it out in any of the other available formats.

For calculating an alignment, you can select different substitution matrices and gap penalties (end gaps can be treated differently!).

Clustal is better than its reputation. It is doing a great job in handling gaps, especially terminal gaps, and it makes good use of different substitution matrices.

To align sequences, clustal performs the following steps (also known as progressive alignment):

1) Pairwise distance calculation
2) Clustering analysis of the sequences
3) Iterated alignment of the two most similar sequences or groups of sequences.

The second step is the most important one: the relationships found here will strongly bias the final alignment. The better your guide tree, the better your final alignment. You can load a guide tree into clustal; this tree will then be used instead of the neighbor joining tree that clustalw calculates by default. (The guide tree needs to be in normal parenthesis notation WITH branch lengths.)
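To illustrate how a clustering step turns pairwise distances into a tree in parenthesis notation with branch lengths, here is a small Python sketch. Note the assumptions: it uses UPGMA (average linkage) because it is short to write down, not the neighbor joining method that clustalw uses by default, and the distances in the example are invented.

```python
def upgma_newick(names, dist):
    """Cluster sequences with UPGMA and return a parenthesis-notation tree.

    names: list of sequence labels
    dist:  dict mapping frozenset({label1, label2}) to a pairwise distance
    """
    # Each cluster stores (subtree text, number of sequences, node height).
    clusters = {name: (name, 1, 0.0) for name in names}
    d = dict(dist)  # copy so the caller's table is not modified
    while len(clusters) > 1:
        # Find the two closest clusters.
        pair = min(
            (frozenset((a, b)) for a in clusters for b in clusters if a < b),
            key=lambda p: d[p],
        )
        a, b = sorted(pair)
        tree_a, size_a, height_a = clusters.pop(a)
        tree_b, size_b, height_b = clusters.pop(b)
        # The new node sits halfway between the two clusters.
        height = d[pair] / 2
        merged = (
            f"({tree_a}:{height - height_a:.3f},"
            f"{tree_b}:{height - height_b:.3f})"
        )
        new = f"{a}+{b}"
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = (merged, size_a + size_b, height)
    return next(iter(clusters.values()))[0] + ";"


distances = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
print(upgma_newick(["A", "B", "C"], distances))
```

In a progressive alignment the sequences would then be aligned in the order given by the tree: first A with B, then C to the A/B alignment.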


Other programs often used for multiple sequence alignment
(We will not use these programs in this course; if you are already confused by the information provided, skip to the assignments):

A program available via the www is SAM (sequence alignment and modeling system) by Richard Hughey, Anders Krogh, Christian Barrett, & Leslie Grate at UCSC. The input consists of a multiple sequence file (aligned or not aligned) in FASTA format. The program uses secondary structure predictions, neighboring sites, etc. to place gaps. The program can be accessed at http://www.cse.ucsc.edu/research/compbio/sam.html

If your sequences are not very similar, and if you are not able to generate a trustworthy multiple sequence alignment, you can calculate distance trees based on pairwise alignments only. The best program for this purpose is statalign from Jeff Thorne (Thorne JL, Kishino H (1992) Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162). It runs under standard UNIX. It is only worth your effort if you are getting gray hairs because of a data set you cannot reliably align.

MUSCLE is the current alignment program of choice. It is thought to give better alignments than clustal, it is faster, and it works with larger datasets. The program is available through a web server at the EBI, and as a command-line program to download here.

Alignments by Eye:

One useful sequence editor is seaview. It runs on PC and most unix flavors. The latest version (4.3) includes phylogenetic reconstruction using phyml and parsimony.

Introns and Their Evolution

 


Three groups of introns based on their splicing mechanisms:

Group I and group II introns are self-splicing [they have different splicing mechanisms; see this figure for a comparison of splicing]:
BOOK2


Group III introns are present in the eukaryotic nucleus and need spliceosomes to be spliced out:

BOOK

Where do the different groups of introns occur?

  • Group I: first discovered in the ciliated protozoan Tetrahymena; also found in Physarum, fungal mitochondria and phage T4; rare in Bacteria; one is present in the Thermotoga 23S rRNA
  • Group II: common in Bacteria, and so far found only in one Archaeal genus, Methanosarcina
  • Spliceosomal Introns: present throughout eukaryotes, but more common in "crown-group" eukaryotes

Where do spliceosomal introns come from, and how did the splicing machinery evolve?

Hypothesis:

Spliceosomal introns evolved from group II introns; the functions of some of the internal loops of the group II introns were taken over by the spliceosomal snRNAs (small nuclear RNAs).

Support:

Gratuitous complexity hypothesis for evolution of spliceosomal machinery: See reading assignment on WebCT [the portions for the reading are highlighted in the PDF file]

Problem:

Group II introns are found in Bacteria, and in only one archaeal genus, Methanosarcina; why is it that predominantly "crown-group" eukaryotes have introns?

Not much of a splice site consensus (exon1 GT-intron-AG exon2)

Group I introns often have homing endonucleases.
Homing endonucleases and intron mobility. Spread in populations, selective pressure on endonuclease. See the excellent paper by Goddard and Burt on the reinvasion cycle.

Also: reverse splicing

Possible benefits of having introns:

Exon shuffling, alternative splicing (1 gene -> different protein products) ....

Two rival hypotheses: Intron Early vs. Intron Late

Intron early:

Protein diversity arose in analogy to exon shuffling in the generation of antibody diversity (see your biochemistry or genetics textbook on the maturation of the immune system).

Claims:

Intron late:

Present day introns are late invaders of already functional genes. Exon shuffling might play some role in eukaryotes, but most of protein diversity arose before introns invaded protein coding genes.

Claims:
  • The distribution of introns mapped on phylogenetic trees unambiguously points towards late invasion (and here).
  • The correlation between structure and intron position is not unambiguous.
  • The finding that mitochondrial (eubacterial) and nucleocytoplasmic genes have introns in the same location could reflect a preferred intron integration site. The phase pattern is also observed in vertebrate genes, in which the introns are of late origin.
  • Exon shuffling requires introns located in the same phase, but there might be other reasons for having a slight excess of introns in the same phase. For introns to frequently invade genes, there need to be mechanisms for introns to find new "homes" (see above).
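The phase of an intron is its position relative to the reading frame: phase 0 introns lie between two codons, phase 1 introns after the first nucleotide of a codon, and phase 2 introns after the second. A short Python sketch shows how phases follow from exon lengths and why matching phases matter for exon shuffling. The function names and exon lengths are made up for illustration, and exon lengths are assumed to count coding nucleotides only.

```python
def intron_phases(exon_lengths):
    """Phase (0, 1 or 2) of the intron that follows each exon but the last."""
    phases = []
    upstream = 0  # coding nucleotides before the current intron
    for length in exon_lengths[:-1]:
        upstream += length
        phases.append(upstream % 3)
    return phases


def symmetric_exons(exon_lengths):
    """Indices of internal exons flanked by two introns of the same phase.

    Such an exon has a length divisible by 3, so it can be inserted,
    duplicated or removed without shifting the downstream reading frame.
    """
    phases = intron_phases(exon_lengths)
    return [
        i for i in range(1, len(exon_lengths) - 1)
        if phases[i - 1] == phases[i]
    ]


lengths = [90, 121, 60, 200]  # invented coding lengths of four exons
print(intron_phases(lengths))
print(symmetric_exons(lengths))
```

Here the third exon (index 2, 60 nucleotides) sits between two phase 1 introns, so it is the kind of exon that could be shuffled without disturbing the reading frame.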

Compromise:

mixed model of intron evolution
  • version 1 - while some introns are recent, most are old. E.g.: [Roy, 2003].
  • version 2 - while most introns are recent, some are older, but not necessarily very old. E.g.: [Rogozin et al., 2003]

Else:

It was suggested that group II introns were the reason for the separation between transcription and translation in eukaryotes (accomplished through the nuclear envelope). Martin and Koonin's hypothesis suggests that group II introns were brought into the eukaryotic cell by the mitochondrial endosymbiont.