Perl for Bioinformatics • Introduction Unix, Perl and Python • Data types • Input and output Perl for Bioinformatics • Functions • Control structures George W. Bell, Ph.D. George W. Bell, Ph.D. WIBR Bioinformatics and Research Computing • Comparisons http://jura.wi.mit.edu/bio/education/hot_topics/Unix_Perl_Python/ • Sample script 2 Objectives Why Perl? • Good for text processing (sequences and data) • write, modify, and run simple Perl scripts • Easy to learn and quick to write • Built from good parts of lots of languages/tools • design customized and streamlined data • Lots of bioinformatics tools available manipulation and analysis pipelines with manipulation and analysis pipelines with • Open source and free for Unix, Windows, and Perl scripts Mac 3 4
A first Perl program Scalar data • Describe one thing • Create this program and call it hey.pl • Start with $ #!/usr/local/bin/perl –w • Can be numbers or text (a “string”) C b b t t ( “ t i ”) # The Perl "Hey" program • Strings need single or double quotes print "What is your name? "; chomp ($name = <STDIN>); $numSeq = 5; # number; no quotes print "Hey, $name, welcome to the $seqName = "GAL4"; # “string”; use quotes $level = -3.75; # numbers can be decimals too BaRC course.\n"; print "The level of $seqName is $level\n"; perl hey.pl • To run: or chmod +x hey.pl • To run: • Perl has some strange-looking “special variables” too: ./hey.pl $_ default input variable $. input line number 5 6 Hash Array • An ordered list of scalar variables • An unordered pair (“keys” and “values”) of lists • The entire list is indicated by a @ • Each key points to a corresponding value. Each key points to a corresponding value. @genes = ("BMP2", "GATA-2", "Fez1"); • The entire list is indicated by a % @orfLengths = (395, 475, 431); %geneToLength = (); # Create an empty hash @info = (12, "student", 5.0e-05, "comic books"); • An item of the hash is accessed like $foo{key} • One item of the list is accessed like $foo[2] $g $geneToLength{"BMP2"} = 395; g { } ; • The first item is actually the 0 th item • The first item is act all the 0 th item $gene = "BMP2"; print "The ORF of $genes[0] is $orfLengths[0] nt."; print "The ORF of $gene is $geneToLength{$gene} nt."; The ORF of BMP2 is 395 nt. Prints out: • Prints out: The ORF of BMP2 is 395 nt. 7 8
Perl input and output Filehandles To read from or write to a file in Perl, it first needs to be opened. open(filehandle, filename); In general, • Types of input: – keyboard (STDIN) Filehandles can serve at least three purposes: open(IN, $inFile); # Open for input – files open(OUT, ">$outFile"); # Open for output • Types of output: open(OUT, ">>$outFile"); # Open for appending – screen (STDOUT) Then, get data all at once @lines = <IN>; – files or one line at a time while (<IN>) { • Unix redirection can be very helpful $line = $_; do stuff with this line; ex: ./ hey.pl > hey_output.txt print OUT "This line: $line"; } 9 10 Perl functions – a sample Control Structures 1 if (condition) # note that 0, “”, and (undefined) are false print opendir closedir open close { print If statement is true ; print "If statement is true"; chomp mkdir split join die } else # optional; ‘if’ can be used alone; elsif also possible length chdir readdir chmod sort { print "If statement is false"; substr push unlink rename use } m// s/// tr/// lc uc if ($exp >= 2) # gene is up-regulated { print "The gene $seq is up-regulated ($exp)"; Description of command: slides exercises } 11 12
Control Structures 2 Control Structures 3 while (condition) for ( initialize; test; increment ) { { print "condition is true"; print "condition is true"; # Do something interesting with this value # Do interesting things... } } # Go through an array (@seqs) where open(DATA, "myData.txt"); # Open a file to read # $#seqs = index of the last element in @seqs while (<DATA>) { for ($i = 0; $i <= $#seqs; $i++) # Split by tabs and make an array { # Print elements of @seqs and @orf on a line @dataThisRow = split /\t/, $_; print "$seqs[$i]\t"; # Print first field followed by "\n" (line end) print "$orf[$i]\n"; print "$dataThisRow[0]\n"; } 13 14 } String comparisons Arithmetic & numeric comparisons • Arithmetic operators: + - / * % • Choices: eq (equals), ne (not equal to) Notation: $i $i = $i + 1; $i + 1; $i += 1; $i + 1; $i++; $i++; • Notation: • Comparisons: > , < , <= , >= , = = , != if ($gene1 ne $gene2) if ($num1 != $num2) # If these are different { { print "$gene1 and $gene2 are different"; print "$num1 and $num2 are different"; } } } else { • Note that = = is very different from = print "$gene1 and $gene2 are the same"; = = used as a test: if ($num = = 50) } = used to assign a variable: $num = 50 15 16
Multiple comparisons Embedding shell commands • use backquotes ( ` ) around shell command && • AND • example using EMBOSS to reverse-complement: l i EMBOSS l || • OR `revseq mySeq.fa mySeq_rc.fa`; if (($exp > 2) || ($exp > 1.5 && $numExp > 10)) • Capture stdout from shell command if desired { • EMBOSS qualifier “-filter” prints to stdout print "Gene $gene is up-regulated"; } $date = `date`; $rev_comp = `revseq mySeq.fa -filter`; print $date; print "Reverse complement:\n$rev_comp\n"; 17 18 Perl modules Programming issues • "a unit of software reuse" • What should the program do? What does it do? • adds a collection of commands related to a specific task p • Who will be using/updating your software? • see https://tak.wi.mit.edu/trac/wiki/Perl to find Perl modules installed on tak – Reusability • BioPerl is a collection of bioinformatics tasks – Commenting • Example of a descriptive statistics module: – Error checking • Development vs execution time • Development vs. execution time use Statistics::Lite qw(:all); St ti ti Lit ( ll) @nums = (324, 456, 876, 678, 654, 789); • Debugging tools: printing and commenting $mean = mean(@nums); print "The mean of my numbers is $mean\n"; • Beware of OBOBs ("off-by-one bugs") 19 20
Example: align_pairs.pl Summary #!/usr/local/bin/perl –w • Input/output # Automatically do lots of pairwise sequence alignments $ $seqs = $ARGV[0]; q $ [ ]; # Get first argument (word after command) # g ( ) • Variables (scalars and arrays) V i bl ( l d ) $hs = "human"; # directory with human proteins $mm = "mouse"; # directory with mouse proteins open(SEQ_LIST, $seqs); # Open file for reading • Functions (brief look) while(<SEQ_LIST>) # Read one line at a time { • Control structures $seq = chomp($_); # trim end-of-line character print STDERR "Aligning $seqFile…\n"; • Comparisons # Create EMBOSS command for S-W (optimal) alignment $CMD = "water $hs/$seq $mm/$seq –outfile $seq.aligned"; # Execute the command (needs EMBOSS package) • Sample script: align_pairs.pl `$CMD`; } BMP7 Example print "All done with alignments\n"; GATA4 file LIN28A To run: ./align_pairs.pl SeqList.txt 21 22 Demo scripts: Books with more information http://iona.wi.mit.edu/bio/bioinfo/scripts/ and /nfs/BaRC_Public/BaRC_code/ • O’Reilly books at http://proquest.safaribooksonline.com/search/perl y p p q p – Thanks to the MIT Libraries – Learning Perl (Schwartz et al.) – Programming Perl (Wall, Christiansen, and Orwant) • Beginning Perl for Bioinformatics – Tisdall • ‘Using Perl to Facilitate Biological Analysis’ (Stein) in Bioinformatics (Baxevanis & Ouellette) • ‘Bioinformatics Programming using Perl and Perl Modules’ in Bioinformatics: Sequence and Genome Analysis, 2 nd ed. (Mount) AND several good web sites (see course page) 23 24
Exercises • Parsing a SAM short read alignment file • Parsing a SAM short-read alignment file into a BED file • Retrieving and aligning a list of human- mouse orthologs ouse o t o ogs 25
Recommend
More recommend