Class 5

 

Send questions, requests for help, programs and assignments to
gogarten@uconn.edu

 

 

Reminder : DON'T RUN THINGS ON THE MASTER NODE !

qrsh
or
ssh node0nn

 

Most of the smaller assignments should be solvable within half an hour. Using the notes, the text book and the internet you should try to solve one problem for not more than one hour. Then ask me or Pascal for help.

In total, the assignments for one week might take a few hours, but if it goes beyond 6 hours total, ask for help, or hand in the latest version of your attempt to solve the assignment. Sometimes, a little help can go a long way. The main reason for the assignments is to make you actually write code and to learn form the mistakes you make.

 wrong code in removing blank lines from input file (class 4; still worked, but ...):
The code should read as follows: (would have caught it using strict and warnings!)

while(defined($line=<IN>)){
   $line =~ s/\s//g;    # substitutes all white spaces \s with nothing globally in $line
   #chomp only removes the \n at the end, this also removes white spaces inside the line
   if ($line ne '') {push(@gi,$line)};    # this was wrong, it used !=; compare class2
   }
close(IN);

chomp removes the \n at the end of a string stored in a variable, and stores the string back into the variable. chomp also is a function, but the value returned by the function is not the string with the \n removed but zero or one depending on the successful execution of the chomp. See example here, output is here

 

Old assignments

4.1 Read chapter on hashes (chapter 5)

4.2/4.3 See class4.pl; file with gi-numbers is here.

Useful assignment:

@gi_names = sort(keys(%gi_hash)); #takes all the keys of %gi_hash and assigns them as elements to the array @gi_names

One solution is here, includes answers to 4.4. and 4.5.

4.4 Write a script that determines the number of elements in %ash.

@keys = keys(%ash); #assigns keys to an array

$number =@keys; # determines number of different keys (uses array in scalar context).

print "$number \n";

4.5 Write a script (or subroutine) that prints out a hash sorted on the keys in alphabetical order.

@gi_names = sort(keys(%gi_hash)); # sorts key and assigns keys to an array
foreach (@gi_names){
print "$_ occurred $gi_hash{$_} times\n";
}

This is not great see class4A.pl for better ways to sort.

4.6 How can you remove an entry in a hash (key and value)?

delete $gi_hash{$varaible_denoting_some_key};

From Perl in a Nutshell:

sort

sort [code] list
Sorts a list and returns the sorted list value. By default (without a code argument), it sorts in standard string comparison order (undefined values sorting before defined null strings, which sort before everything else). code, if given, may be the name of a subroutine or a code block (anonymous subroutine) that defines its own comparison mechanism for sorting elements of list. The routine must return to the sort function an integer less than, equal to, or greater than 0, depending on how the elements of the list are to be ordered. (The handy <=> and cmp operators can be used to perform three-way numeric and string comparisons.)

The normal calling code for subroutines is bypassed, with the following effects: the subroutine may not be a recursive subroutine, and the two elements to be compared are passed into the subroutine as $a and $b, not via @_. The variables $a and $b are passed by reference, so don't modify them in the subroutine. ...

More in class4A.pl

see example here

examples to sort a hash by value (followed by key)

@sorted_by_value = sort { $gi_hash{$a} <=> $gi_hash{$b}} keys (%gi_hash);

or

@sorted_by_value = sort by_value keys %gi_hash; # in contrast to the comment in the subroutine chapter this does not work with &
sub by_value { $gi_hash{$a} <=> $gi_hash{$b}} # defines the order smaller befor larger (a before b)

or

@sorted_by_value = sort by_value (keys (%gi_hash));
sub by_value {
$gi_hash{$a} <=> $gi_hash{$b}
or
$a <=> $b #if the values are the same, then sort ascibethically (cmp) or numerically (<=>) on the keys
} # defines the order smaller befor larger (a before b)

4.7 Rewrite the program in 3.4 so that it uses hashes and calculates di-, tri-, and quartet-nucleotide frequencies.:q

see here

#!/usr/bin/perl -w
##########INPUT Sequence concatenated into single string##########
//

####################sequence to array
@bases=split(//,$seq); #splits string into separate elements (bases)

$num_bases=@bases; #length of array

###################calculate GC content
$n=3; #if you want different n-mers change this number
%nlet=(); #reset things
$trip='';

for ($i=0; $i<$num_bases+1-$n; $i++) #go through @bases and form nlets of consecutive nucleotides
{
for ($k=0;$k<$n;$k+=1){ #joins $n consecutive nucleotides to form string

$trip .= $bases[$i+$k];#form nlet
}
$nlet{$trip} += 1; #increase nlet counter for one particular nlet by one
$trip='';

}
#print hash
@nlet_present = sort {$nlet{$b} <=> $nlet{$a} or $a cmp $b } (keys(%nlet));
foreach (@nlet_present){
print "$_ occurred $nlet{$_} times\n";
};

See here, here, and here for an attempt to use Excel to generate a nicer output

New Stuff

From http://vergil.chemistry.gatech.edu/resources/programming/perl-tutorial

What if we cannot get a filehandle? We should handle this case graciously by using die. For example:
open (inf, "/some/directory/file.in") || die "cannot open: $!";
So what does this do? Perl tries to open file.in OR it calls die with the string. The $! contains the most recent system error, so it will append a useful tag to the output of die. You could even make a dienice subroutine that could be more helpful. You can exit a Perl program immediately by calling exit;.
Notice how we used the logical operator, || as a control structure! It's basically saying open this file or die! You can even use &&.

One of the nice things about perl is the ease with which other programs can be run from a perl script.

In the example today we will use multiple sequence files in fasta format. The Wikipedia entry for fasta is here .

Note that there are relaxed and more defined versions of the comment line. NCBI adheres to the format described in the wikipedia entry. A typical comment line looks as follows:

>gi|73666634|ref|YP_302650.1| Ubiquinone biosynthesis hydroxylase, UbiH/UbiF/VisC/COQ6 [Ehrlichia canis str. Jake]

Note the [species designation]

The files to work on are here and here

clustalw is one of many multiple sequence alignment programs. It is already installed on the cluster.

see class5.pl for continuation

Powerpoint slides on sequence alignment are here

From Perl in a NUTSHELL:

glob

glob expr
Performs filename expansion (globbing) on expr, returning the next successive name on each call. If expr is omitted, $_ is globbed instead. This is the internal function implementing the <*> operator, except that it may be easier to type this way.

5.1. Read chapter on input/output, glob and printf

5.2. Rewrite the program from 4.2/3 using a printf statement, so that the gi numbers and the number of occurences are printed in a nice way (columns, right handed entries).

5.3 Write a script that takes all files with the extension .fa and writes their contents in a single multiple sequence file

5.5 Which command(s) could you use to open a file for output and append new things to its end rather than overwriting it?