perl for bioinformatics
play

Perl for Bioinformatics Introduction Unix, Perl and Python Data - PowerPoint PPT Presentation

Perl for Bioinformatics Introduction Unix, Perl and Python Data types Input and output Perl for Bioinformatics Functions Control structures George W. Bell, Ph.D. George W. Bell, Ph.D. WIBR Bioinformatics and Research


  1. Perl for Bioinformatics • Introduction Unix, Perl and Python • Data types • Input and output Perl for Bioinformatics • Functions • Control structures George W. Bell, Ph.D. George W. Bell, Ph.D. WIBR Bioinformatics and Research Computing • Comparisons http://jura.wi.mit.edu/bio/education/hot_topics/Unix_Perl_Python/ • Sample script 2 Objectives Why Perl? • Good for text processing (sequences and data) • write, modify, and run simple Perl scripts • Easy to learn and quick to write • Built from good parts of lots of languages/tools • design customized and streamlined data • Lots of bioinformatics tools available manipulation and analysis pipelines with manipulation and analysis pipelines with • Open source and free for Unix, Windows, and Perl scripts Mac 3 4

  2. A first Perl program Scalar data • Describe one thing • Create this program and call it hey.pl • Start with $ #!/usr/local/bin/perl –w • Can be numbers or text (a “string”) C b b t t ( “ t i ”) # The Perl "Hey" program • Strings need single or double quotes print "What is your name? "; chomp ($name = <STDIN>); $numSeq = 5; # number; no quotes print "Hey, $name, welcome to the $seqName = "GAL4"; # “string”; use quotes $level = -3.75; # numbers can be decimals too BaRC course.\n"; print "The level of $seqName is $level\n"; perl hey.pl • To run: or chmod +x hey.pl • To run: • Perl has some strange-looking “special variables” too: ./hey.pl $_ default input variable $. input line number 5 6 Hash Array • An ordered list of scalar variables • An unordered pair (“keys” and “values”) of lists • The entire list is indicated by a @ • Each key points to a corresponding value. Each key points to a corresponding value. @genes = ("BMP2", "GATA-2", "Fez1"); • The entire list is indicated by a % @orfLengths = (395, 475, 431); %geneToLength = (); # Create an empty hash @info = (12, "student", 5.0e-05, "comic books"); • An item of the hash is accessed like $foo{key} • One item of the list is accessed like $foo[2] $g $geneToLength{"BMP2"} = 395; g { } ; • The first item is actually the 0 th item • The first item is act all the 0 th item $gene = "BMP2"; print "The ORF of $genes[0] is $orfLengths[0] nt."; print "The ORF of $gene is $geneToLength{$gene} nt."; The ORF of BMP2 is 395 nt. Prints out: • Prints out: The ORF of BMP2 is 395 nt. 7 8

  3. Perl input and output Filehandles To read from or write to a file in Perl, it first needs to be opened. open(filehandle, filename); In general, • Types of input: – keyboard (STDIN) Filehandles can serve at least three purposes: open(IN, $inFile); # Open for input – files open(OUT, ">$outFile"); # Open for output • Types of output: open(OUT, ">>$outFile"); # Open for appending – screen (STDOUT) Then, get data all at once @lines = <IN>; – files or one line at a time while (<IN>) { • Unix redirection can be very helpful $line = $_; do stuff with this line; ex: ./ hey.pl > hey_output.txt print OUT "This line: $line"; } 9 10 Perl functions – a sample Control Structures 1 if (condition) # note that 0, “”, and (undefined) are false print opendir closedir open close { print If statement is true ; print "If statement is true"; chomp mkdir split join die } else # optional; ‘if’ can be used alone; elsif also possible length chdir readdir chmod sort { print "If statement is false"; substr push unlink rename use } m// s/// tr/// lc uc if ($exp >= 2) # gene is up-regulated { print "The gene $seq is up-regulated ($exp)"; Description of command: slides exercises } 11 12

  4. Control Structures 2 Control Structures 3 while (condition) for ( initialize; test; increment ) { { print "condition is true"; print "condition is true"; # Do something interesting with this value # Do interesting things... } } # Go through an array (@seqs) where open(DATA, "myData.txt"); # Open a file to read # $#seqs = index of the last element in @seqs while (<DATA>) { for ($i = 0; $i <= $#seqs; $i++) # Split by tabs and make an array { # Print elements of @seqs and @orf on a line @dataThisRow = split /\t/, $_; print "$seqs[$i]\t"; # Print first field followed by "\n" (line end) print "$orf[$i]\n"; print "$dataThisRow[0]\n"; } 13 14 } String comparisons Arithmetic & numeric comparisons • Arithmetic operators: + - / * % • Choices: eq (equals), ne (not equal to) Notation: $i $i = $i + 1; $i + 1; $i += 1; $i + 1; $i++; $i++; • Notation: • Comparisons: > , < , <= , >= , = = , != if ($gene1 ne $gene2) if ($num1 != $num2) # If these are different { { print "$gene1 and $gene2 are different"; print "$num1 and $num2 are different"; } } } else { • Note that = = is very different from = print "$gene1 and $gene2 are the same"; = = used as a test: if ($num = = 50) } = used to assign a variable: $num = 50 15 16

  5. Multiple comparisons Embedding shell commands • use backquotes ( ` ) around shell command && • AND • example using EMBOSS to reverse-complement: l i EMBOSS l || • OR `revseq mySeq.fa mySeq_rc.fa`; if (($exp > 2) || ($exp > 1.5 && $numExp > 10)) • Capture stdout from shell command if desired { • EMBOSS qualifier “-filter” prints to stdout print "Gene $gene is up-regulated"; } $date = `date`; $rev_comp = `revseq mySeq.fa -filter`; print $date; print "Reverse complement:\n$rev_comp\n"; 17 18 Perl modules Programming issues • "a unit of software reuse" • What should the program do? What does it do? • adds a collection of commands related to a specific task p • Who will be using/updating your software? • see https://tak.wi.mit.edu/trac/wiki/Perl to find Perl modules installed on tak – Reusability • BioPerl is a collection of bioinformatics tasks – Commenting • Example of a descriptive statistics module: – Error checking • Development vs execution time • Development vs. execution time use Statistics::Lite qw(:all); St ti ti Lit ( ll) @nums = (324, 456, 876, 678, 654, 789); • Debugging tools: printing and commenting $mean = mean(@nums); print "The mean of my numbers is $mean\n"; • Beware of OBOBs ("off-by-one bugs") 19 20

  6. Example: align_pairs.pl Summary #!/usr/local/bin/perl –w • Input/output # Automatically do lots of pairwise sequence alignments $ $seqs = $ARGV[0]; q $ [ ]; # Get first argument (word after command) # g ( ) • Variables (scalars and arrays) V i bl ( l d ) $hs = "human"; # directory with human proteins $mm = "mouse"; # directory with mouse proteins open(SEQ_LIST, $seqs); # Open file for reading • Functions (brief look) while(<SEQ_LIST>) # Read one line at a time { • Control structures $seq = chomp($_); # trim end-of-line character print STDERR "Aligning $seqFile…\n"; • Comparisons # Create EMBOSS command for S-W (optimal) alignment $CMD = "water $hs/$seq $mm/$seq –outfile $seq.aligned"; # Execute the command (needs EMBOSS package) • Sample script: align_pairs.pl `$CMD`; } BMP7 Example print "All done with alignments\n"; GATA4 file LIN28A To run: ./align_pairs.pl SeqList.txt 21 22 Demo scripts: Books with more information http://iona.wi.mit.edu/bio/bioinfo/scripts/ and /nfs/BaRC_Public/BaRC_code/ • O’Reilly books at http://proquest.safaribooksonline.com/search/perl y p p q p – Thanks to the MIT Libraries – Learning Perl (Schwartz et al.) – Programming Perl (Wall, Christiansen, and Orwant) • Beginning Perl for Bioinformatics – Tisdall • ‘Using Perl to Facilitate Biological Analysis’ (Stein) in Bioinformatics (Baxevanis & Ouellette) • ‘Bioinformatics Programming using Perl and Perl Modules’ in Bioinformatics: Sequence and Genome Analysis, 2 nd ed. (Mount) AND several good web sites (see course page) 23 24

  7. Exercises • Parsing a SAM short read alignment file • Parsing a SAM short-read alignment file into a BED file • Retrieving and aligning a list of human- mouse orthologs ouse o t o ogs 25

Recommend


More recommend