Class 4

 

Send questions, requests for help, programs and assignments to
gogarten@uconn.edu

 

 

Reminder : DON'T RUN THINGS ON THE MASTER NODE !

qrsh
or
ssh node0nn

 

Most of the smaller assignments should be solvable within half an hour. Using the notes, the text book and the internet you should try to solve one problem for not more than one hour. Then ask me or Pascal for help.

In total, the assignments for one week might take a few hours, but if it goes beyond 6 hours total, ask for help, or hand in the latest version of your attempt to solve the assignment. Sometimes, a little help can go a long way. The main reason for the assignments is to make you actually write code and to learn form the mistakes you make.

 

Old Assignments:

Forcing strings to be evaluated exercise 2G: the eval function takes a string as an argument and executes it as code. Using this function the assignment 2G could have been solved as follows:

#!/usr/bin/perl -s -w
my @array = ('1', '0&&1', '0||1', '45', '45-45', '45/45', '45==45', '$a=45', '45<=>45',);
for (my $n=0; $n<9; $n += 1) {
   my $answer=eval($array[$n]);
   if($answer){
      print "$array[$n] is true \n"}
  else {
     print "$array[$n] is false \n"};
};

An shorter version is here

3-1: The Sieve of Eratosthenes of Cyrene can be used to calculate prime numbers.
Additional reading is here and here
Go through principle here.

Write a script that uses the Sieve of Eratosthenes to calculate all prime numbers up to a a number $n provided by the user.
Once you have a working version, what might you be able to do to accelerate the script?
Try to come up with a nice representation of the results.

Go over program in vim (indentation/colors) ex3-1.pl

#!/usr/bin/perl/ -w
# setting up#############################
$n=1000; #maximum number to test
print "This program runs the sieve of Eratosthenes \nEnter number up to which you want to run the sieve: ";
$n=<STDIN>;
if ($n eq "\n") {print "please enter number\n";exit};
chomp ($n);
print "testing up to $n; please wait.\n";

@array=(1..$n); #assign numbers to array

###################start of the sieve###############
#I left some of the bug finding stuff asa commented lines
# - Note: slot 1 of @array contains 2!
for ($i=1;$i<($n/2)+1; $i++) #loop to pick array entries to use as divisor= $i -
# test only upt $n/2 as divisor - one could choose a better limit here
{
print "divisor tested: $i; testing $array[$i] \n";#reassures the user that the program is doing things :)

if ($array[$i]==0){next} #if divisor is already 0 skip to next number in array!!!

for ($k=$i+1;$k<$n;$k++) #loop to pick numerator $k
{
if ($array[$k]==0) {next}; # if number is already set at zero skip ahead
$test = $array[$k] % $array[$i]; #check if devisor in loop fits into numerator
# print "$k : $array[$k] -- $i : $array[$i] \$test:$test \n ";

if ($test==0) {$array[$k]=0}; #if it fits, then numerator is not prime and set to 0

}
# print "\n\n";

}
###########################End of sieve##############
#Print result: ################################
print "\n\n\n tested up to $n \n\t\t"; #print what was done
foreach (@array) {
$count++;
if ($count % 20 ==0) {print "\nNext #$count:\n\t"}; #print a new line every 20 numbers and add a little comment
print "\t$_";#print the prime or 0 \t works, because the numbers will not get too big.
}
print "\n";

Especially for large numbers the above approach #@#%^&^(. It is way faster to just erase multiples of the divisor form the list, rather than going though all the modulo operations!

ex3-1B.pl

#!/usr/bin/perl/ -w
# setting up#############################
#$n=1000; #maximum number to test
print "This program runs the sieve of Eratosthenes \nEnter number up to which you want to run the sieve: ";
$n=<STDIN>;
if ($n eq "\n") {print "please enter number\n";exit};
chomp ($n);
print "testing up to $n; please wait.\n";

@array=(1..$n); #assign numbers to array

###################start of the sieve###############
#I left some of the bug finding stuff as commented lines
# - Note: slot 1 of @array contains 2!

for ($i=1;$i<($n/2)+1; $i++) #loop to pick array entries to use as divisor= $i -
# test only upt $n/2 as divisor - one could choose a better limit here
{
# print "divisor tested: $i+1; testing $array[$i] \n";#reassures the user that the program is doing things :)

if ($array[$i]==0){next} #if divisor is already 0 skip to next number in array!!!
$r=2;
while ((($i+1)*$r)<($n+1)) {
# print "$r ($i+1) $array[($i+1)*$r-1]\n";
$array[($i+1)*$r-1]=0;
$r++;}

# print "$k : $array[$k] -- $i : $array[$i] \$test:$test \n ";

# print "\n\n";

}
###########################End of sieve##############
#Print result: ################################
print "\n\n\n tested up to $n \n\t\t"; #print what was done
foreach (@array) {
$count++;
if ($count % 20 ==0) {print "\nNext #$count:\n\t"}; #print a new line every 20 numbers and add a little comment
#############How would this need to be changed to start the lines on 1 ...21....41
print "\t$_";#print the prime or 0 \t works, because the numbers will not get too big.
}
print "\n";

 

3-2: read chapter 4 on subroutines
Note: parameter handed to the subroutine, results returned from the subroutine

3-3: write a subroutine that takes a DNA sequence and calculates the reverse complement.
This is what you might need:

reverse ($text) reverses the order of characters e.g.:

#!/usr/bin/perl/
$text='ATGCC';
$rev= reverse $text;
print " $rev \n" ;
output CCGTA

tr/ATGC/TACG/ translates every A into a T, T into A ....
$seq =~ tr/ATGC/TACG/ # Takes the string in $seq, performs the translation and returns the result into $seq. E.G.:

#!/usr/bin/perl/
$text=$comp='ATTGTCCCCGG';
$comp =~ tr/ATGC/TACG/;
print " the complement of \$text=$text is $comp\n" ;

output: the complement of $text=ATTGTCCCCGG is TAACAGGGGCC

ex3-3.pl

#!/usr/bin/perl -w
###############INPUT####################
#input sequence; chomp every line, and concatenate into one big scalar called $seq
unless(@ARGV==1) {die "please provide name of the file in the command line!!\n";}
$filename=$ARGV[0];
open(IN, "< $filename") or die "cannot open $filename:$!";

$seq='';
while(defined($line=<IN>)){

chomp($line);
$seq .= $line ;
}
# print "input sequence is \n $seq\n"; #make sure sequence is read accurately
########### END INPUT#########################

####Program: call sub and print output
$invcomp=&RevComp($seq);#call subroutine hand over content of $seq as parameter
# the result of the subroutine's operation will be assigned to $invcomp

print "\n\n\ninverse complement is \n$invcomp\n"; #print output
########End Program###############
#### Sub:
sub RevComp {
my ($DNAseq, $rev, $rev_comp);# 'local' variables
# print "subroutine RevComp was called\n"; #make sure sub is called
$DNAseq=$_[0];
# print "subroutine is working on \n $DNAseq\n";# make sure sub got the right sequence
$rev = reverse $DNAseq;
$rev_comp=$rev;
$rev_comp =~ tr/atgcATGC/TACGTACG/;
$rev_comp; # returns contents of this variable, the last thing executed,to the main program -NOT superfluous - else the result is the number of translations made
};

A modified version using two subroutines, and illustrating the value of strict and local variables is here

3-4 If time, go through class3.pl, assignment at end

ex3-4.pl

#!/usr/bin/perl -w
# Note this is the unimproved version!
##########INPUT Sequence concatenated into single string##########
unless(@ARGV==1) {die "please provide name of the file in the command line!!\n";}
$filename=$ARGV[0];
open(IN, "< $filename") or die "cannot open $filename:$!";

$seq=''; #assigns empty string

while(defined($line=<IN>)){

chomp($line);
$seq .= $line ;
}

####################sequence to array
@bases=split(//,$seq); #splits string into separate elements (bases)

$num_bases=@bases; #length of array

###################calculate GC content

$num_GC=0;

for ($i=0; $i<$num_bases; $i++) #counts Gs and Cs in @bases
{
if(($bases[$i]=~"G") or ($bases[$i]=~"C")) #if it matches G or C increase counter
{&
$num_GC++;
}
}

$GC_content=($num_GC/$num_bases)*100;
print "\nThe GC content of the sequence in the file ",'"',$filename,'"', " is ",$GC_content,"%.\n\n";

The script in the link is a little more sophisticated. It handles FASTA comment lines, lowercase and whitespaces and non standard characters.

 

New assignments

4.1 Read handout on hashes

4.2/4.3 See class4.pl; file with gi-numbers is here.

Useful assignment:

@gi_names = sort(keys(%gi_hash)); #takes all the keys of %gi_hash and assigns them as elements to the array @gi_names

4.4 Write a script that determines the number of elements in a %ash.

4.5 Write a script (or subroutine) that prints out a hash sorted on the keys in alphabetical order.

4.6 How can you remove an entry in a hash (key and value)?

4.7 Rewrite the program in 3.4 so that it uses hashes and calculates di-, tri-, and quartet-nucleotide frequencies.