Programming in Python
Lecture 3: Patterns and Functions
Michael Schroeder, Sven Schreiber (sven.schreiber@tu-dresden.de)
Slides derived from Ian Holmes, Department of Statistics, University of Oxford. Updates by Andreas Henschel.
Overview
• Patterns (Regular Expressions)
• Functions and Lambda Functions
Patterns
What is a pattern?
https://commons.wikimedia.org/wiki/Tree
Pattern-matching
• A logical test that asks whether a string contains a pattern
• E.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?
    name = 'YBR007C'
    dna = 'TAATAAAAAACGCGTTGTCG'    # 20 bases upstream of the yeast gene YBR007C
    if 'ACGCGT' in dna:             # 'ACGCGT' is the pattern for the MCB binding site; in is the membership operator
        print('%s has MCB!' % name)
Output:
    YBR007C has MCB!
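The membership test only answers yes or no. As a small sketch of how the position of the site could also be reported, the snippet below uses the string method find on the same sequence; the variable name position is illustrative and not part of the slides.

```python
name = 'YBR007C'
dna = 'TAATAAAAAACGCGTTGTCG'

# in answers "is the pattern there?", find answers "where does it start?"
has_mcb = 'ACGCGT' in dna
position = dna.find('ACGCGT')   # index of the first occurrence, or -1 if absent

if has_mcb:
    print('%s has MCB at index %i' % (name, position))
```

Indices are zero-based, so the reported index counts from the first base of the upstream region.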
Regular expressions
• We already defined a simple pattern: ACGCGT
• What if we don't care about the 3rd position? => ACGCGT, ACCCGT, ACACGT, ACTCGT
• Python provides a pattern-matching engine
• Patterns are called regular expressions
• They are extremely powerful
• Often called "regex" for short
• Available in the module re
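A minimal sketch of the "don't care about the 3rd position" case, assuming the sequence only contains the bases A, C, G and T. It uses a character class, which the following slides introduce; the names mcb_variants and candidates are illustrative.

```python
import re

# The third position may be any of the four bases; all other positions are fixed.
mcb_variants = re.compile("AC[ACGT]CGT")

candidates = ["ACGCGT", "ACCCGT", "ACACGT", "ACTCGT", "AGGCGT"]
matches = [s for s in candidates if mcb_variants.fullmatch(s)]
print(matches)
```

The last candidate differs in the second position, so it is the only one that is rejected.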
Motivation: N-glycosylation motif
• Common post-translational modification
• Attachment of a sugar group
• Occurs at asparagine residues with the consensus sequence N-X1-X2, where
  – X1 can be anything (but proline inhibits)
  – X2 is serine or threonine
• Can we detect potential N-glycosylation sites in a protein sequence?
Building regexes I: Character Classes
• Square brackets define a set of alternative characters (character class)
• E.g. [abc] matches a, b, or c
• Use - to match a range of characters: [A-Z]
• Negation: [^X] matches anything but X
• [^A-Z] matches anything but A-Z
• . matches any character (by default except a newline)
• [a] is equivalent to a
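The rules above can be checked directly in Python. This is a small sketch, not part of the original slides: each pattern from the list is paired with one sample character chosen for illustration, and re.fullmatch tests whether the whole sample matches.

```python
import re

# pattern -> one sample character that should match it
examples = {
    "[abc]": "b",    # one of a, b, c
    "[A-Z]": "Q",    # any uppercase letter in the range A-Z
    "[^X]": "Y",     # anything but X
    "[^A-Z]": "q",   # anything but an uppercase letter
    ".": "#",        # any character except a newline
    "[a]": "a",      # same as the plain pattern a
}

results = {p: bool(re.fullmatch(p, s)) for p, s in examples.items()}
print(results)

# Two counterexamples: the negated class rejects X, and . rejects a newline.
rejects_x = re.fullmatch("[^X]", "X") is None
rejects_newline = re.fullmatch(".", "\n") is None
```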
Building regexes II: Abbreviations
• \d matches any decimal digit [0-9]
• \D matches any non-digit [^0-9]
• Equivalent syntax for:
  – whitespace ( \s and \S )
  – alphanumeric characters and the underscore ( \w and \W )
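A short sketch of the abbreviations in use. The sample line is borrowed from the BLAST-style text used later in the lecture; the variable names are illustrative. Raw strings (r"...") are used so that Python passes the backslashes to the regex engine unchanged.

```python
import re

line = "Raw Bit Score: 165"

digits = re.findall(r"\d+", line)      # runs of decimal digits
words = re.findall(r"\w+", line)       # runs of letters, digits, or underscores
parts = re.split(r"\s+", line)         # split on runs of whitespace
only_digits = re.sub(r"\D", "", line)  # delete every non-digit character

print(digits, words, parts, only_digits)
```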
Building regexes III: Quantifiers
• * matches zero or more times
  – E.g. ca*t matches: ct, cat, caat, caaat, caaaat, ...
• + matches one or more times
  – E.g. ca+t matches cat, caat, caaat, caaaat, ...
• ? matches zero times or once
  – E.g. bio-?info matches bioinfo and bio-info
• {n} matches exactly n times
• {n,m} matches from n (min) to m (max) times
  – E.g. ab{1,3}c will match abc, abbc, abbbc
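The quantifier examples above can be verified with a small helper. This is a sketch; the helper name matching and the extra negative candidates (cut, bio--info, ac, abbbbc, ...) are additions for illustration.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches completely."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = matching("ca*t", ["ct", "cat", "caat", "cut"])
plus = matching("ca+t", ["ct", "cat", "caat"])
optional = matching("bio-?info", ["bioinfo", "bio-info", "bio--info"])
exact = matching("a{3}", ["aa", "aaa", "aaaa"])
ranged = matching("ab{1,3}c", ["ac", "abc", "abbc", "abbbc", "abbbbc"])

print(star, plus, optional, exact, ranged)
```

re.fullmatch is used instead of re.match so that a candidate only counts when the pattern covers the entire string.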
Using Regular Expressions
• Compile a regular expression object (pattern) using re.compile
• pattern has a number of methods
  – match (returns a Match object on success, otherwise None; matches only at the beginning of the string!)
  – search (scans through the whole string looking for a match)
  – findall (returns a list of all matches)
>>> import re
>>> pattern = re.compile('[ACGT]')
>>> if pattern.match("A"): print("A matches")    # successful match
...
A matches
>>> if pattern.match("a"): print("a matches")    # unsuccessful, returns None: case sensitive by default
...
>>> if re.match('[ACGT]', "A"): print("Matched")    # without compiling: shorter, but slightly less efficient when a pattern is reused often
...
Matched
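To make the difference between the three methods concrete, here is a small sketch that applies them to the same string. The test string is made up for illustration.

```python
import re

pattern = re.compile("[ACGT]+")
text = "xxACGTxxTTxx"

at_start = pattern.match(text)         # None: the string does not start with a base
first_hit = pattern.search(text)       # Match object for the first run of bases
all_hits = pattern.findall(text)       # list of all non-overlapping matches

print(at_start)
print(first_hit.group(), first_hit.span())
print(all_hits)
```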
Matching alternative strings
• (this|that) matches "this" or "that"
• ...and is equivalent to th(is|at)
Case-insensitive search pattern:
>>> pattern = re.compile("(this|that|other)", re.IGNORECASE)
>>> pattern.search("Will match THIS")    ## success
<_sre.SRE_Match object at 0x00B52860>
>>> pattern.search("Also THat will be matched")    ## success
<_sre.SRE_Match object at 0x00B528A0>
>>> pattern.search("Will not match ot-her")    ## will return None
>>>
Python returns a description of the match object (its exact form depends on the Python version).
Word and string boundaries
• ^ matches the start of a string
• $ matches the end of a string
• \b matches word boundaries
"Escaping" special characters
• Characters with special meaning: . ^ $ * + ? { [ ] \ | ( )
• \ is used to free or "escape" those characters from their special meaning
• So \[ just matches the character "["
  – if not escaped, "[" signifies the start of a character class, as in [ACGT]
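A brief sketch of boundaries and escaping in practice. The sample strings are invented for illustration; re.escape is a standard helper in the re module that escapes the special characters in a literal string.

```python
import re

starts_with_atg = re.search(r"^ATG", "ATGCCCTAA") is not None
ends_with_taa = re.search(r"TAA$", "ATGCCCTAA") is not None

# \b only matches "cat" as a whole word, not the "cat" inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# Escaped brackets match literal brackets instead of opening a character class.
literal = re.search(r"\[ACGT\]", "the class [ACGT] here").group()

# re.escape builds the escaped pattern automatically.
version = re.compile(re.escape("1.5"))
dot_is_literal = version.fullmatch("1.5") is not None and version.fullmatch("1x5") is None
```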
Substitutions/Match Retrieval
• Regex can also be used to substitute patterns using re.sub
Regex use without compiling:
>>> re.sub("(red|blue|green)", "colored", "blue socks and red shoes")
'colored socks and colored shoes'
Retrieving matches (\d+ matches one or more digits; the raw string r"..." keeps the backslash intact):
>>> e, raw, frm, to = re.findall(r"\d+", \
... "E-value: 4, \
... Raw Bit Score: 165, \
... Match position: 362-419")
>>> print(e, raw, frm, to)
4 165 362 419
• The result, a list of 4 strings, is assigned to 4 variables
• \ allows commands that span multiple lines; alternatively, construct multi-line strings using triple quotes """ ... """
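The triple-quote alternative mentioned above could look like the following sketch. The variable names report, numbers and masked are illustrative; converting the matched strings with int is an extra step not shown on the slide.

```python
import re

# Triple quotes keep the line breaks, so no backslashes are needed.
report = """E-value: 4,
Raw Bit Score: 165,
Match position: 362-419"""

e, raw, frm, to = re.findall(r"\d+", report)   # four strings
numbers = [int(e), int(raw), int(frm), int(to)]

# re.sub works on the same text, here replacing every number with a placeholder.
masked = re.sub(r"\d+", "N", report)

print(numbers)
print(masked)
```

findall always returns strings, so a conversion is needed before doing arithmetic on the values.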
N-glycosylation site detector
>>> protein = "\
... MGMFFNLRSNIKKKAMDNGLSLPISRNGSSNNIKDKRSEHNSNSLKGKYRYQPRSTPSKFQLTVSITSLI\
... IIAVLSLYLFISFLSGMGIGVSTQNGRSLLGSSKSSENYKTIDLEDEEYYDYDFEDIDPEVISKFDDGVQ\
... HYLISQFGSEVLTPKDDEKYQRELNMLFDSTVEEYDLSNFEGAPNGLETRDHILLCIPLRNAADVLPLMF\
... KHLMNLTYPHELIDLAFLVSDCSEGDTTLDALIAYSRHLQNGTLSQIFQEIDAVIDSQTKGTDKLYLKYM\
... DEGYINRVHQAFSPPFHENYDKPFRSVQIFQKDFGQVIGQGFSDRHAVKVQGIRRKLMGRARNWLTANAL\
... KPYHSWVYWRDADVELCPGSVIQDLMSKNYDVI"
>>> regex = "N[^P][ST]"
>>> for match in re.finditer(regex, protein):
...     print(match.group(), match.span())
...
NGS (26, 29)
NLT (214, 217)
NGT (250, 253)
• N[^P][ST] is the main regular expression
• re.finditer provides an iterator over match objects
• match.group and match.span return the actual matched string and the position tuple
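As a sketch of how the detector could be packaged for reuse, the loop can be wrapped in a function. The function name find_nglyc_sites is hypothetical; the test fragment is the first 70 residues of the protein above.

```python
import re

def find_nglyc_sites(protein):
    """Return (motif, start, end) for each non-overlapping N[^P][ST] match."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer("N[^P][ST]", protein)]

# First 70 residues of the protein from the slide.
fragment = "MGMFFNLRSNIKKKAMDNGLSLPISRNGSSNNIKDKRSEHNSNSLKGKYRYQPRSTPSKFQLTVSITSLI"
sites = find_nglyc_sites(fragment)
print(sites)

# A proline in the middle position blocks the motif.
blocked = find_nglyc_sites("NPS")
```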
Another Example: [KHDAS]DEL
Courtesy of Chris Bystroff
Another Example: Zinc finger motif
Image by Thomas Splettstoesser (www.scistyle.com), self-made, based on PDB structure 1A1L, the open source molecular visualization tool PyMol and Cinema 4D, GFDL, https://commons.wikimedia.org/w/index.php?curid=3106866
C\w{2,4}C\w{3}[LIVMFYWC]\w{8}H\w{3,5}H
• [LIVMFYWC] is the hydrophobic position
Courtesy of Chris Bystroff
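A sketch of how the zinc finger pattern above could be tried out. Both test sequences are synthetic and were constructed only to satisfy or violate the pattern; they are not real protein sequences.

```python
import re

zinc_finger = re.compile(r"C\w{2,4}C\w{3}[LIVMFYWC]\w{8}H\w{3,5}H")

# Synthetic sequence built piece by piece so that each part of the pattern is visible.
seq = "MK" + "C" + "PE" + "C" + "GKS" + "F" + "SQKSNLQK" + "H" + "QRT" + "H" + "TG"
hit = zinc_finger.search(seq)
print(hit.group(), hit.span())

# Same sequence with the hydrophobic F replaced by K: no match is expected.
mutated = seq.replace("F", "K")
miss = zinc_finger.search(mutated)
```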
Test your Regular Expressions
www.pythex.org
• Develop regular expressions
• Test them on examples of your choice
• Example: PDB IDs such as 2REG, 9ins, 1VSN, 1osn, 1a1b are matched by ^[1-9]\w{3}$
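The PDB ID pattern can also be tested locally. In this sketch the first five candidates come from the slide; the last three are made-up strings added as negative examples.

```python
import re

pdb_id = re.compile(r"^[1-9]\w{3}$")

candidates = ["2REG", "9ins", "1VSN", "1osn", "1a1b", "0abc", "1abcd", "abc1"]
valid = [c for c in candidates if pdb_id.search(c)]
invalid = [c for c in candidates if not pdb_id.search(c)]

print(valid)
print(invalid)
```

Note that \w also accepts an underscore, so this pattern is slightly more permissive than a strict "digit followed by three letters or digits" rule.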
Functions
Functions
• Similar code is often needed in different places of a program
• But copying and pasting code is a bad idea!
• We need to separate those pieces of code and call them from different places
• Separated code for a self-contained task is called a function
• Examples of such tasks:
  – cleaning up a sequence (lowercase, strip newlines, ...)
  – reverse complementing a sequence
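The two example tasks could be written as functions roughly as follows. This is a sketch under the assumption that sequences contain only the bases a, c, g and t; the function names and the COMPLEMENT table are illustrative.

```python
def clean_sequence(raw):
    """Lowercase a sequence and remove all whitespace, including newlines."""
    return "".join(raw.split()).lower()

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

cleaned = clean_sequence("ACG\nTTA\n")
rc = reverse_complement(cleaned)
print(cleaned, rc)
```

Both functions can now be called from anywhere in a program instead of repeating the same lines.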
Function Syntax
Syntax:
    def <functionname>(<arg1>, <arg2>, ...):
        <block>
        return <something>
Example:
    def sum_up_numbers(num1, num2):
        my_sum = num1 + num2
        return my_sum
Calling a function
Function definition:
>>> def sum_up_numbers(num1, num2):
...     my_sum = num1 + num2
...     return my_sum
...
Function calls:
>>> sum_up_numbers(1, 5)
6
>>> sum_up_numbers(num1=1, num2=5)
6
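The two call styles above can be combined. The sketch below repeats the slide's function and adds two further calls (reordered keywords, and a positional argument followed by a keyword) that are not on the slide.

```python
def sum_up_numbers(num1, num2):
    my_sum = num1 + num2
    return my_sum

positional = sum_up_numbers(1, 5)            # arguments matched by position
keyword = sum_up_numbers(num1=1, num2=5)     # arguments matched by name
reordered = sum_up_numbers(num2=5, num1=1)   # keyword order does not matter
mixed = sum_up_numbers(1, num2=5)            # positional first, then keyword

print(positional, keyword, reordered, mixed)
```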
Example: Largest number
• Function to find the largest number in a list
    def find_max(aList):                # function declaration
        largest = aList[0]              # function body; indexing (instead of pop) leaves the caller's list unchanged
        for x in aList:
            if x > largest:
                largest = x
        return largest                  # function result
    numbers = [1, 5, 1, 12, 3, 4, 6]
    print("Maximum: %i" % find_max(numbers))    # function call
Output:
    Maximum: 12
Lambda Functions
Lambda Functions
• A kind of anonymous function
• Similar to normal functions but...
  – ...not necessarily bound to a name
  – ...different syntax
  – ...can be assigned to variables and passed to functions
  – ...restricted to a single expression
Normal function definition:
>>> def calc(x): return (x-3)*2
...
>>> calc(5)
4
Lambda function:
>>> calc1 = lambda x: (x-3)*2
>>> calc1(5)
4
>>> calc2 = calc1
>>> calc2(5)
4
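As a sketch of "passed to functions", a lambda can serve as the key argument of the built-in sorted. The list of sequences is made up for illustration.

```python
# A lambda used directly as an argument, without ever giving it a name.
sequences = ["ACGT", "AC", "ACG"]
by_length = sorted(sequences, key=lambda s: len(s))

# The same lambda assigned to a variable behaves like a normal function.
calc1 = lambda x: (x - 3) * 2
result = calc1(5)

print(by_length, result)
```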
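Map, filter, and reduce
• Lambda functions can be passed as arguments to functions
• Powerful in combination with map, filter, and reduce
• General form: map(lambda_function, sequence), filter(lambda_function, sequence), reduce(lambda_function, sequence)
  – the function is applied to each element of the given sequence
  – map, filter, or reduce decides what to do with the result:
    map -> apply to each element, return the modified elements
    filter -> return only the elements that tested True
    reduce -> return one element resulting from the cumulative computation
• In Python 3, map and filter return iterators (wrap them in list() to get a list), and reduce must be imported from functools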
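Examples
>>> from functools import reduce
>>> list(map(lambda x: x*3, [1,2,3]))
[3, 6, 9]
>>> list(filter(lambda x: x>=1.0, [1.2,0.5,0.7,1.3]))
[1.2, 1.3]
>>> list(filter(lambda x: x!=0, map(lambda x: x-2, [4,2,5])))    # the inner map yields 2, 0, 3
[2, 3]
>>> reduce(lambda x,y: x+y, (1,2,3,4))
10
Steps of reduce (x, y -> result): 1, 2 -> 3; 3, 3 -> 6; 6, 4 -> 10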
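The same three tools can be applied to sequence data. This is a sketch for Python 3 with a made-up list of sequences; reduce is imported from functools, and map and filter are wrapped in list() because they return iterators.

```python
from functools import reduce

sequences = ["ACGCGT", "TTAA", "GGACGCGTCC"]   # illustrative data

# map: compute the length of every sequence
lengths = list(map(lambda s: len(s), sequences))

# filter: keep only the sequences that contain the MCB binding site
with_mcb = list(filter(lambda s: "ACGCGT" in s, sequences))

# reduce: add up all lengths into a single number
total_length = reduce(lambda x, y: x + y, lengths)

print(lengths, with_mcb, total_length)
```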
Summary
• Regular expressions are powerful tools to detect patterns
• They allow matching of character classes, repetitions, alternatives, etc.
• Learn the meaning of the special characters . ^ $ * + ? { [ ] \ | ( )
• Python offers regex functions in the re module
  – match, search, findall, finditer, etc.
• Regular expressions can be used to find motifs in sequences
• Functions are a way to separate self-contained tasks and to structure code
• Lambda functions combine with map, filter, and reduce for concise list processing