Old tasks
6.1 read chapter 6
6.2 Write a script that reads in a multiple sequence file (FASTA formated, see here and here for examples), and writes out each sequence in a seperate file. Assuming the input file follows NCBI conventions, can you find a way to name the files in a useful fashion? If you dare, extend the program to calculate a multiple sequence alignment of the these sequences in either clustalw or muscle.
One version is here
A while loop reads input file line by line
Note regex /^>/ The ^ anchors pattern to beginning of line
If new annotation line is encountered extract gi and make new filename and assign filehandle OUT to new name
For all: print last line to OUT
Another solution is here
Alternative input of filename
Very simple name to create filenames
Note that the elaborate closing of files in not necessary
A third version is here
Note the reassignment of the end of record variable $/
as a result a complete sequence including the annotation line is read into one string variable (which can include many \n!)
This is a cool way to shatter files that contain multiple results.
6.3 (this assignment has some difficulties, try it, design an approach, but if it doesn't work, stop after 60 minutes and send where you are at that point).
Modify the program from ex4-7.pl so that it calculates and prints (every n nucleotides) a running total of nucleotide excesses in one strand over the complementary strand, aka the cummulative strand bias. For example A over T, or G over C, or keto over amino base excess ((G+T)-(A+C))).
What does the result mean? Which of the above measures (any others you have tried?) shows most bias?
Microbial genomes are available through the NCBI either at http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi or at ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria (which currently also includes the Archaea!!). Each genome is in its own folder. *.fna contains the whole nucleotide sequence as single fasta file, *.ffn, contains the individual CDS, *.gbk is the whole genome (nucleotides with CDSes translated); *.faa is a multiple sequence file that contains all recognized encoded proteins (useful for blasting). For this exercise we can use any *.fna file. For example this genome from Frankia_Ccl3. NOTE: if you write the program use a shorter sequence, not the whole genome!
A program that goes through the *.fna present in the same directory and calculates three types of bias is here.
Note that the find-the-file-to-work-on approach is very different form the more typical glob.
The program goes through the genome line by line, and nucleotide by nucleotide, and counts the number of bases for each nucleotide. Every 1000 nucleotides (or 5000), it prints cummulative strand biases to a file called my_table. This file is tab delimited and can be loaded into Exel, for example. Having headers in the first line make ploting things in Excel a lot easier, but for other programs you might want to delete the line.
An alternative to plot data is gnuplot, an opensource commandline driven plot and analysis program. On the Biotech cluster it is ONLY installed on the master node, i.e. you cannot call it from within a script running on a node. Usually you would use gnuplot in an interactive manner, but you also can bundle your commands in a file and execute them all at one. To execute them type gnuplot name_of_file in our case gnuplot plot.p.
Some comments on gnuplot are here . In our case the plot.p tells gnuplot to create the plot in a postscript file, on your personal machine you can use the screen directly. (For Macs the easiest way to install this is to use FINK).
A short intro to homology searching is here (ppt: here)
blastall, formatdb and fastacmd are installed on the cluster and can be invoked form the commandline.
typing blastall alone gives a list of options. Note: -F, -W, -a, -e
typing "formatdb -" or "fastacmd -" gives you a list of options
Download sequences from http://genome.ornl.gov/microbial/tmel/
Do example of formating databank (Thermotoga MSB8 in geneomes)
formatdb -i *.faa -p T -o T -n msb8.faa
A typical blastall command using the commandline is
blastall -p blastp -d /common/data/swissprot -i query.fa -o results.tab -F F -m 9 -W 2 -a 2 -e 0.000001
blastall -p blastp -d /Users/jpgogarten/genomes/Thermotoga_maritima/MSB8.faa -i tmel.faa -o /scratch/results.tab -F F -m 9 -W 2 -a 2 -e 0.001
DISCUSS /scratch and NFS problems.
An easy way to extract individual fasta files form a databank is the fastacmd for example, if we want to retrieve sequences with gi numbers 32699731 and 3915985 :
fastacmd -d /common/data/swissprot -p T -s 32699731,3915985
Note: no space between numbers. This only works if in making the database with formatdb you used the -o T option.
Within a script you could use the -o option to write the sequences to file, or you could assign the sequences to a variable using backtics:
$gi1=32699731;
$gi2=3915985;
$seq1=`fastacmd -d nr -p T -s $gi1,$gi2` ;
print "the sequences are: \n$seq1\n";
More info on fastacmd, blastall and formatdb is available at the NCBI: fastacmd, blastall, formatdb
7.1 read through the PSIBLAST notes at ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastpgp.html
Try to figure out which commands will allow you to use nr as a database to develop a PSSM and then use the result to search another, possibly user defined database.
(hints: Use checkpoints (-C and -R option) the end of the file has an example (more complicated because it uses tblastn).
7.2 Develop an idea for a student project, write outline and outline for programs
7.3 Modify the cumulative nucleotide strand bias program to
(A)
calculate the (G-C)/(G+C) bias in a sliding window
(B) writes a plotfile for use in gnuplot that has the correct name for the genome as title for the plot
7.4 Write a script that takes the input and ouput from a blastall search and returns all sequences from the query file that did NOT have a match in the databank.
Else: All the programs available through the emboss suite are available on the cluster (type the name at the command line). See http://emboss.sourceforge.net/ for more details.
A listing of program groups is here. The information for a program that calculated IEP and statistics is here.
run_pepstats.pl runs pepstat on all the sequences in the genome.
($%#^&* this requires aa sequences without any ambiguous characters, to accomplish this, you can use the script check_genome.pl . This generates a new *.faa file where all the non aa characters are removed.)
parse_pepstats.pl goes through the outputs of the pepstat program and extracts information. The resulting files can be loaded into excel (see here for an example).