Programming in Python
Lecture 2: Sequences
Michael Schroeder, Sven Schreiber (sven.schreiber@tu-dresden.de)
Slides derived from Ian Holmes, Department of Statistics, University of Oxford. Updates by Andreas Henschel.
Overview
• Types of sequences and their properties
  – Lists, tuples, strings, ranges
• Building, accessing and modifying sequences
• List comprehensions
• File operations
Types and Properties of Sequences
Lists vs tuples
• Both are sequences (used to store collections of objects)
• Tuples are immutable, lists are mutable
• Lists are more flexible
• Tuples provide better performance
• Rule of thumb: lists for objects of a similar kind, tuples for objects of different kinds

Construction (syntax)
    l = [1, 2, 3, 4]
    l2 = ['Apple', 'Banana', 'Orange']
    t = ('sebastian', 'm', 28)
    t2 = ('motif', 'ATTCG', 'E44')
Accessing elements
    l[0]    # 1
    t[0]    # 'sebastian'
Adding/modifying elements
    l.append(3)
    l[1] = 5
    t.append(3)    # fails: tuples are immutable!
    t[1] = 5       # fails: tuples are immutable!
Concatenating
    l3 = l + [3, 2]
    t3 = t + ('phd', 'biotec')
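The difference between mutable lists and immutable tuples can be checked directly. The following is a minimal sketch that reuses the slide's variables l and t; the errors list is only an illustrative helper for recording which exception each failed tuple operation raises.

```python
# Lists can be changed in place; tuples cannot.
l = [1, 2, 3, 4]
t = ('sebastian', 'm', 28)

l.append(3)      # works: l is now [1, 2, 3, 4, 3]
l[1] = 5         # works: l is now [1, 5, 3, 4, 3]

errors = []      # illustrative helper: names of the exceptions raised below
try:
    t.append(3)  # tuples have no append method
except AttributeError as exc:
    errors.append(type(exc).__name__)

try:
    t[1] = 5     # tuples do not support item assignment
except TypeError as exc:
    errors.append(type(exc).__name__)

t3 = t + ('phd', 'biotec')  # concatenation builds a new tuple, t is unchanged
print(l, errors, t3)
```

Concatenation works for both types because it creates a new object instead of modifying an existing one.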
Range
• Provides a sequence of consecutive integers
• Allows iteration with loops

    for x in range(10000):
        print(x)

  Output (the end value, 10000, is excluded!):
    0
    1
    2
    3
    ...
    9998
    9999

• The numbers are not stored in memory; they are generated only when needed (while looping)
• Saves time and memory for large sets of numbers
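The memory claim can be illustrated by comparing a range with the equivalent list. This is a rough sketch: sys.getsizeof reports only the size of the container object itself, and the exact byte counts depend on the Python implementation, so only the comparison is meaningful.

```python
import sys

r = range(10000)
as_list = list(r)   # materializes all 10000 numbers

# Size of the container object in bytes (implementation-specific values)
range_size = sys.getsizeof(r)
list_size = sys.getsizeof(as_list)
print(range_size < list_size)

# A range still supports len(), indexing and membership tests
print(len(r), r[-1], 9999 in r, 10000 in r)
```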
Working with Lists
Lists
A list is a collection of values/objects

    nucleotides = ['a', 'c', 'g', 't']
    print("Nucleotides:", nucleotides)

Output:
    Nucleotides: ['a', 'c', 'g', 't']

We can think of the above as a container with 4 entries; the list is the collection of all four elements:

    element 0   element 1   element 2   element 3
        a           c           g           t

Note that the element indices start at zero!
List literals
• There are several ways to create or obtain lists. This is the most common: a comma-separated list, delimited by square brackets

    a = [1, 2, 3, 4, 5]
    print("a =", a)
    b = ['a', 'c', 'g', 't']
    print("b =", b)

• Lists can also be obtained from other objects, e.g. from a range or by splitting a string

    c = list(range(1, 6))
    print("c =", c)
    d = "a c g t".split()
    print("d =", d)

Output:
    a = [1, 2, 3, 4, 5]
    b = ['a', 'c', 'g', 't']
    c = [1, 2, 3, 4, 5]
    d = ['a', 'c', 'g', 't']
Accessing lists
To access list elements, use square brackets, e.g. x[0] means "element zero of list x"

    x = ['a', 'c', 'g', 't']
    i = 2
    print(x[0], x[i], x[-1])

Output:
    a g t

• Remember, element indices start at zero!
• Negative indices count from the end, e.g. x[-1] means "last element of list x"
List operations
• You can sort and reverse lists...

    x = ['a', 't', 'g', 'c']
    print("x =", x)    # x = ['a', 't', 'g', 'c']
    x.sort()
    print("x =", x)    # x = ['a', 'c', 'g', 't']
    x.reverse()
    print("x =", x)    # x = ['t', 'g', 'c', 'a']

• You can add, delete and count elements

    nums = [2, 2, 5, 2, 6]
    nums.append(8)
    print(nums)             # [2, 2, 5, 2, 6, 8]
    print(nums.count(2))    # 3
    nums.remove(5)
    print(nums)             # [2, 2, 2, 6, 8]
More list operations

    >>> x = [1, 0] * 2      # multiplying lists
    >>> x
    [1, 0, 1, 0]
    >>> x.pop()             # pop() obtains and removes the last element of a list
    0
    >>> x
    [1, 0, 1]
    >>> x += x              # concatenating lists with + or +=
    >>> x
    [1, 0, 1, 1, 0, 1]
    >>> x.index(0)          # index(..) searches for the first occurrence of an element
    1
Example: Reverse complementing DNA
A common operation due to the double-helix symmetry of DNA
• Start by making the string lower case. This is generally good practice
• Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'. The placeholders '_a' and '_g' prevent letters that were already swapped from being swapped back
• Convert to a list and reverse
• Convert back to a string using join

    dna = "accACgttAGgtct".lower()
    replaced = dna.replace("a", "_a") \
                  .replace("t", "a").replace("_a", "t") \
                  .replace("g", "_g").replace("c", "g") \
                  .replace("_g", "c")
    replacedList = list(replaced)
    replacedList.reverse()
    print("".join(replacedList))

Output:
    agacctaacgtggt
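As an alternative to the chained replace calls, the same result can be obtained with a translation table. This is a sketch using the built-in str.maketrans and str.translate, which are not part of the slides; the function name reverse_complement is chosen here for illustration, and the reversal uses the slice [::-1].

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (lower case)."""
    # Map each base to its complement in a single pass: a<->t, c<->g
    table = str.maketrans("acgt", "tgca")
    complemented = dna.lower().translate(table)
    # A slice with step -1 reverses the string
    return complemented[::-1]

result = reverse_complement("accACgttAGgtct")
print(result)
```

Because every character is translated exactly once, no placeholder characters are needed.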
Taking a slice of a list
• The syntax x[i:j] returns a list containing elements i, i+1, …, j-1 of list x

    nucleotides = ['a', 'g', 'c', 't']
    print(nucleotides)
    print(nucleotides[0:2])     # nucleotides[:2] also works
    print(nucleotides[2:4])     # nucleotides[2:] also works
    print(nucleotides[-2:])     # takes last two elements
    print(nucleotides[::2])     # takes every second element
    print(nucleotides[::-1])    # obtains reversed list

Output:
    ['a', 'g', 'c', 't']
    ['a', 'g']
    ['c', 't']
    ['c', 't']
    ['a', 'c']
    ['t', 'c', 'g', 'a']
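Slicing is not limited to lists; it works the same way on the other sequence types covered in this lecture. The short sketch below uses made-up example values (dna, codons, numbers) to show slices of a string, a tuple and a range. A slice always has the same type as the sequence it was taken from.

```python
dna = "agct"                            # a string
codons = ('atg', 'taa', 'tag', 'tga')   # a tuple
numbers = range(10)                     # a range

prefix = dna[:2]            # first two characters
reversed_dna = dna[::-1]    # reversed string
stops = codons[1:]          # everything except the first element
evens = numbers[::2]        # every second number, still a range
beyond = dna[2:100]         # slice ends past the string are cut off silently

print(prefix, reversed_dna, stops, list(evens), beyond)
```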
Lists and Strings
• A string can be translated into a list of strings
  – Using the split method: string.split(separator)
• A list of strings can be translated into one string
  – Using the join method: separator.join(list)

    sentence = 'This is a complete sentence.'
    print(sentence.split())
    datarow = 'Apples,Bananas,Oranges'
    print(datarow.split(','))
    cities = ['Dresden', 'Munich', 'Hamburg', 'Cologne']
    print(' -> '.join(cities))

Output:
    ['This', 'is', 'a', 'complete', 'sentence.']
    ['Apples', 'Bananas', 'Oranges']
    Dresden -> Munich -> Hamburg -> Cologne
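split and join are inverse operations when the same separator is used. The sketch below shows this round trip with the slide's datarow, and also that join only accepts strings, so other values have to be converted with str() first. The variable names fields, rejoined and joined_numbers are illustrative.

```python
datarow = 'Apples,Bananas,Oranges'

fields = datarow.split(',')     # string -> list of strings
rejoined = ','.join(fields)     # list of strings -> string

# join() requires strings: convert numbers with str() first
numbers = [1, 2, 3]
joined_numbers = ' '.join(str(n) for n in numbers)

print(fields, rejoined == datarow, joined_numbers)
```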
List Comprehensions
What are list comprehensions?
• Very concise way to build and transform lists
• Typically replaces a for loop and an if construction
• Used very often in Python
• Syntax: [expr(var) for var in sequence if condition]

Example: squares of all odd numbers between 1 and 10, i.e. [1, 9, 25, 49, 81]

Verbose construction of the list:
    newlist = []
    for x in range(1, 11):
        if x % 2:
            newlist.append(x**2)

Construction with a list comprehension:
    newlist = [x**2 for x in range(1, 11) if x % 2]
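The three parts of the syntax (expression, loop, optional condition) can be seen separately in a small sketch. The list sequences and the result names are made up for illustration.

```python
sequences = ['ATTCG', 'GGCC', 'AT', 'TAGACGG']

# Transform only: one output element per input element
lengths = [len(s) for s in sequences]

# Filter only: keep the elements that satisfy the condition
long_ones = [s for s in sequences if len(s) > 3]

# Transform and filter combined
long_lower = [s.lower() for s in sequences if len(s) > 3]

print(lengths, long_ones, long_lower)
```

The if part is optional; without it, the new list has exactly as many elements as the original sequence.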
Examples: List comprehensions

    sentence = 'I like MySQL but not Python'
    print([(w.lower(), len(w)) for w in sentence.split()])

Output:
    [('i', 1), ('like', 4), ('mysql', 5), ('but', 3), ('not', 3), ('python', 6)]

Sum up all positive integers in a tuple:

    numbers = (1, 0, -1, 6, 3, -2, 3, 4)
    total = sum([x for x in numbers if x > 0])    # do not name the variable "sum": that would hide the built-in function
    print(total)

Output:
    17
File IO
Opening and reading a file

Contents of myfile.txt:
    #Old number
    1234
    # New number
    5555
    # Test
    1

    f = open('myfile.txt', 'r')    # returns a file object; second argument is the file mode (r, w, a, ...)
    for line in f:                 # line-wise iteration over the file; line is the loop variable
        if not line.startswith('#'):
            print(line.rstrip())   # rstrip() removes the trailing newline of each line
    f.close()

Output:
    1234
    5555
    1

Shorter and better form (the file is closed after the block!):

    with open('myfile.txt', 'r') as f:
        for line in f:
            if not line.startswith('#'):
                print(line.rstrip())
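The file modes mentioned above (r, w, a) can be tried out together in one small sketch. It writes the example file into a temporary directory so that it runs anywhere; the use of tempfile and the appended value 42 are assumptions made for this illustration and are not part of the slides.

```python
import os
import tempfile

# Create the example file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')

# 'w' creates the file (or overwrites an existing one)
with open(path, 'w') as f:
    f.write('#Old number\n1234\n# New number\n5555\n# Test\n1\n')

# 'a' appends to the end of an existing file
with open(path, 'a') as f:
    f.write('42\n')

# 'r' reads; collect all non-comment lines without their newline
with open(path, 'r') as f:
    values = [line.rstrip() for line in f if not line.startswith('#')]

print(values)
```

Each with block closes its file on exit, so the data written in one block is available to the next.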
Example: FASTA format
• A format for storing multiple named sequences
• The name of each sequence is preceded by a > symbol
• NB sequences can span multiple lines
• This file contains 3' UTRs for Drosophila genes

fly3utr.txt:
    >CG11604
    TAGTTATAGCGTGAGTTAGT
    TGTAAAGGAACGTGAAAGAT
    AAATACATTTTCAATACC
    >CG11455
    TAGACGGAGACCCGTTTTTC
    TTGGTTAGTTTCACATTGTA
    AAACTGCAAATTGTGTAAAA
    ATAAAATGAGAAACAATTCT
    GGT
    >CG11488
    TAGAAGTCAAAAAAGTCAAG
    TTTGTTATATAACAAGAAAT
    CAAAAATTATATAATTGTTT
    TTCACTCT
Example: FASTA format
Printing the name of each sequence in fly3utr.txt:

    with open('fly3utr.txt', 'r') as f:
        for line in f:
            if line.startswith('>'):
                print(line[1:].rstrip())    # drop the leading '>' and the trailing newline

Output:
    CG11604
    CG11455
    CG11488

What if we want to show the length of each sequence record?
Example: FASTA format
Printing the name and length of each sequence record in fly3utr.txt:

    name = None
    length = None
    with open('fly3utr.txt', 'r') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                # None -> False: nothing to print before the first record
                if name:
                    print(name, length)
                name = line[1:]
                length = 0
            else:
                length += len(line)
    # the last record is not followed by another '>' line, so print it here
    print(name, length)

Output:
    CG11604 58
    CG11455 83
    CG11488 68
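A variation of the same loop collects the lengths instead of printing them. The sketch below is one possible way to do this, not the slides' method: it stores the results in a dictionary (a data type not covered in this lecture), and the function name read_fasta_lengths is hypothetical. To stay self-contained, it reads the first two records from an in-memory io.StringIO object, which can be iterated line by line just like an open file.

```python
import io

# First two records of fly3utr.txt, held in memory for this example
fasta_text = """>CG11604
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTC
TTGGTTAGTTTCACATTGTA
AAACTGCAAATTGTGTAAAA
ATAAAATGAGAAACAATTCT
GGT
"""

def read_fasta_lengths(handle):
    """Map each sequence name to its total length (hypothetical helper)."""
    lengths = {}
    name = None
    for line in handle:
        line = line.rstrip()
        if line.startswith('>'):
            name = line[1:]
            lengths[name] = 0            # start a new record
        elif name is not None:
            lengths[name] += len(line)   # add this line to the current record
    return lengths

lengths = read_fasta_lengths(io.StringIO(fasta_text))
print(lengths)
```

With a real file, the same function could be called as read_fasta_lengths(f) inside a with open(...) as f: block.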
Summary
• Strings, lists, tuples and ranges are all sequences
• Lists (usually for elements of the same type)
  – More flexible, but higher memory consumption
• Tuples (usually for elements of different types)
  – Immutable, lower memory consumption
• Ranges for fast numeric iteration
  – Lowest memory consumption
• List comprehensions are a concise way to transform sequences
• Convert strings into lists and vice versa with split and join
• File objects provide line-wise iteration