5.1. Read chapter on input/output, glob and printf
5.2. Rewrite the program from 4.2/3 using a printf statement, so that the gi numbers and the number of occurences are printed in a nice way (columns, right handed entries).
The complete program is here (the input here). The pertinent portion is :
foreach (@sorted_by_value){
printf "GI number %-9d occurred %4d times\n", $_, $gi_hash{$_};
}
5.3 Write a script that takes all files with the extension .fa and writes their contents in a single multiple sequence file
One way to solve this problem is to use a simple unix statement -- we would not want to do this, because we would not learn anything on filehandles in perl:
cat *.fa > name_of_outfile
cat: lists the content of the file.
* is a wildcard,
*.fa is expanded into a list containing all directory entire that fit the description
> redirects the output to a file (away form the default, which is the screen). As the result all files with name *.fa will be written into a fileA perl script is here ex5-3.pl (derived from class5.pl)
#!usr/bin/perl/ -w
$outname="";# reset name of outfile
open (OUT, ">temp.out") || die "cannot open: $!";#open a file for output -
# as the names of the infiles have not been read yet, I assign this a temporary name to be renamed later
#
while(defined($file=glob("*.fa"))){ #loop opens one file ending in .fa after the other. stores the name in $file
open (IN, "<$file") || die "cannot open: $!"; #reports error if file cannot be opened
#and assigns $file to IN
@filename_parts=split(/\./,$file); # writes first part of filename into $filename_parts[0]
$outname .= "$filename_parts[0]".'.'; ### concatenated $filename_part[0] into new outname ###
#repeats this for every cycle
while (defined($line=<IN>)){print OUT $line;} #writes contents of $file line by line into filehandle OUT
} #no more *.fa files in directory
close OUT; #closes filehandle OUT and temp.out
$outname="$outname".'fa.multiple';#adds useful extension to outfile
system ("mv temp.out $outname"); # uses unix command mv (move)to rename temp.out into new $outname
5.5 Which command(s) could you use to open a file for output and append new things to its end rather than overwriting it?
open (OUT, ">>/some/directory/file.out") || die "cannot open: $!";
Go through sequence alignemtn for at least 30 minutes :)
Powerpoint slides on sequence alignment are here
New things:
6.1 read chapter 6
6.2 Write a script that reads in a multiple sequence file (FASTA formated, see here and here for examples), and writes out each sequence in a seperate file. Assuming the input file follows NCBI conventions, can you find a way to name the files in a useful fashion? If you dare, extend the program to calculate a multiple sequence alignment of the these sequences in either clustalw or muscle.
6.3 (this assignment has some difficulties, try it, design an approach, but if it doesn't work, stop after 60 minutes and send where you are at that point).
Modify the program from ex4-7.pl so that it calculates and prints (every n nucleotides) a running total of nucleotide excesses in one strand over the complementary strand, aka the cummulative strand bias. For example A over T, or G over C, or keto over amino base excess ((G+T)-(A+C))).
What does the result mean? Which of the above measures (any others you have tried?) shows most bias? .
Regular expressions:
A good regex link:
http://www.anaesthetist.com/mnm/perl/regex.htm
Some regular expression examples (Thanks to Tim Harlow):
1. Position of a match
Some code:
$string = "ATCGCATGGAA";
while ($string =~ /T.G/g) {
print "$& ends at position ", pos($string)-1, "\n";
}
Output:
TCG ends at position 3
TGG ends at position 8
2. Capturing a match
Some code:
#NCBI Complete Microbial Genomes web page
my $url = 'http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi';
use LWP::Simple; #to get info use the perldoc command: perldoc LWP::SIMPLE
my $content = get $url;
die "Couldn't get $url" unless defined $content;
# Then go do things with $content, like this:
if ($content =~ m/Complete Microbial Genomes selected: .A. - (\d+), .B.
- (\d+)/) {
print "There are $1 completely sequenced archael genomes";
print ", and $2 completely sequenced bacterial genomes";
print " - that's a total of ", $1+$2, " genomes!\n";
}
else {
print "Couldn't find what I was looking for, sorry!\n";
}
Output:
There are 26 completely sequenced archeal genomes, and 285 completely
sequenced bacterial genomes - that's a total of 311 genomes!
A (very) short introduction to BioPerl:
1. Get a sequence by gi number
Some code:
use Bio::DB::GenBank; #perldoc works!
$gb = new Bio::DB::GenBank;
$seqobj = $gb->get_Seq_by_gi('86742390');
print $seqobj->description();
print "\n";
print $seqobj->seq();
print "\n";
print "this is a ", $seqobj->alphabet(), " sequence\n";
use Bio::Tools::SeqStats;
$weight = Bio::Tools::SeqStats->get_mol_wt($seqobj);
print "molecular weight of ", $seqobj->id(), " is between ",
$$weight[0], " and ", $$weight[1], "\n";
Output:
ATP synthase F1, alpha subunit [Frankia sp. CcI3].
MTELSIRPEEIRDALREYVDSFQATSAGREEVGRVVVTGDGIARVEGLPHTMTNELLEFHGGVLGLALNLEVGEIGTVILGESENIEEGQEVRRTGQILSAPVGDGFLGRVVDPLGRPIDGLGEIVAEGSRELELQAPTVVQRQPVKEPLQTGIKAIDAMTAIGRGQRQLIIGDRQTGKTTVAIDAIINQRDNWTSGDPSKQVKCVYVAIGQKKSTIREVVNSLEEAGALAYTTIVAAPADEPAGFKYIAPYTGSAIAQHWMYNGQHALIVFDDLTKQAEAYRAISLLLRRPPGREAYPGDVFYLHSRLLERCAKLSDELGGGSLTGLPIIETKANDISAYIPTNVISITDGQVFLESDLFNQGVRPAINVGTSVSRVGGSAQVKAMKSVAGRLRLDLAQYRELEAFSAFGSDLDKASRDQLARGARLVELLKQPQGQPFPVERQVVSIWAGTTGKLDDVPVADIRRFESEFLDFVGRSYPGVYDAIVTTGKLSDDTIAMLESAVAEFKKQFTLSDGKPLVNEPAPSPLDPGLVRQESIPVHRPAARKDDEG
this is a protein sequence
molecular weight of YP_482790 is between 59691 and 59691
A somewhat readable overview of Bioperl (which modules do what) is at http://www.bioperl.org/