Dot Plot in Python We want to write a python program that create a - PowerPoint PPT Presentation

Dot Plot in Python We want to write a python program that create a dotoplot. The program should take from command line: 1. a fasta file with two sequences, 2. a scoring matrix 3. a word size 4. a display threshold

Text example Id matrix, wordsize=3, threshold=1 \ S A A A S S A A T T T T A * * * A * * * * * A * * * T * T T

matplotlib example Id matrix, wordsize=3, threshold=1

Building an internal description of the dotplot def computeDP(s1,s2,sm,ws,th): ''' s1,s2= sequences, sm= score matrix (as dict), ws=word size, th= cutoff threshold ''' dp=set() for i in range (len(s1)-ws+1): for j in range (len(s2)-ws+1): sc=0.0 for k in range (ws): sc=sc+sm.get((s1[i+k],s2[j+k]), 0.0) if scoreij >= th: dp.add((i,j)) return dp

Printing in in text format def printDP(s1,s2,dp,shift=0): ''' s1,s2=sequences, dp=dotplot, shif=shifting print positions''' print "\\", for j in range (len(s2)): print s2[j], print for i in range (len(s1)): print s1[i], for j in range (len(s2)): pc=" " if (i-shift,j-shift) in dp : pc="*" print pc, print

Printing using matplotlib def printMDP(s1,s2,dp,shift=0): import matplotlib.pyplot as plt L1= len (s1) L2= len (s2) plt.axis([0, L1+1, 0, L2+1]) x,y=[],[] for i,j in dp: x.append(i+shift+1) y.append(j+shift+1) plt.plot(x, y, 'ro') plt.show()

Computational complexity is quadratic: O (|s1|*|s2|*ws) def computeDP(s1,s2,sm,ws,th): ''' s1,s2= sequences, sm= score matrix (as dict), ws=word size, th= cutoff threshold ''' dp=set() for i in range (len(s1)-ws+1): for j in range (len(s2)-ws+1): sc=0.0 for k in range (ws): sc=sc+sm.get((s1[i+k],s2[j+k]), 0.0) if scoreij >= th: dp.add((i,j)) return dp

Computational complexity is quadratic: O (|s1|*|s2|*ws) Is it possible to improve the computational complexity?

Idea: pre-process one sequence in order to know in advance the possible matches, then scan the other sequence.

Example: assume you have a function getSimpos >>> s1="TTTAAATTT" >>> s2="SAAASSAAT" >>> al,sm=readScoreMatrix('id.mat') >>> getSimpos(s1,al,sm, ws=1 , th=0 ) {'A': [3, 4, 5], 'T': [0, 1, 2, 6, 7, 8]} >>> getSimpos (s1,al,sm, ws=3 , th=3 ) {'AAA': [3], 'TTT': [0, 6], 'AAT': [4], 'ATT': [5], 'TAA': [2], 'TTA': [1]} >>>

With the preprocess given by getSimpos, it is possible to know for each word in the second sequence if there is match with the first and directly in which positions! >>> s1="TTTAAATTT" >>> s2="SAAASSAAT" >>> ws,th = 3,2 >>> simpos=getSimpos(s1,al,sm,ws,th) >>> simpos {'AAA': [3], 'TTT': [0, 6], 'AAT': [4], 'ATT': [5], 'TAA': [2], 'TTA': [1]} >>> for j in range(len(s2)-ws+1): ... word=s2[j:j+ws] ... for i in simpos.get(word,[]): ... print i,j,s1[i:i+ws],s2[j:j+ws] ... 3 1 AAA AAA 4 6 AAT AAT >>>

getSimpos : simplified version def getSimpos_simple(s1,sm,ws,th): ''' s1=sequence, ws=score matrix, ws=word size, th=threshold ''' simpos={} for i in range ( len (query)-ws+1): word=query[i:i+ws] simpos[word]=simpos.get(word,[]) simpos[word].append(i) return simpos It works only for identity matrix! Why??

Blosum62 has more possible matches! >>> al,sm=readScoreMatrix('id.mat') >>> s1 'TTTAAATTT' >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], 'T': [0, 1, 2, 6, 7, 8]} >>> al,sm=readScoreMatrix('blosum62.mat') >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], ' S ': [0, 1, 2, 3, 4, 5, 6, 7, 8], 'T': [0, 1, 2, 6, 7, 8]} >>>

Blosum62 A R N D C Q E G H I L K M F P S T W Y V A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

Idea: add a dictionary that takes care of all the words that we can predict will have a score higher than a threshold for a given matrix and a given word in our sequence.

Given a word: find all high scoring similar words: itertools.product >>> import itertools >>> alphabet='AB' >>> ws=3 >>> for e in itertools.product (alphabet, repeat =ws): ... print "".join(e), e ... AAA ('A', 'A', 'A') AAB ('A', 'A', 'B') ABA ('A', 'B', 'A') ABB ('A', 'B', 'B') BAA ('B', 'A', 'A') BAB ('B', 'A', 'B') BBA ('B', 'B', 'A') BBB ('B', 'B', 'B')

Given a word: find all high scoring similar words: def findSumWords(word,alphabet,sm,th): ''' findSumWords(word,alphabet,sm,th) ''' Similar=[] # list of similar words for w in itertools.product(alphabet,repeat=len(word)): cscore=0.0 for i in range(len(word)): cscore+=sm[word[i],w[i]] if cscore >= th: similar.append("".join(w)) return similar

GetSimpos def getSimpos(query,alphabet,sm,ws,th): simpos={} # stores positions of a given matching word simwords={} # for each used word, store similar words for i in range(len(query)-ws+1): word=query[i:i+ws] # create word # test if we searched for similar words if not simwords.has_key(word): simwords[word]=findSumWords(word,alphabet,sm,th) for k in simwords[word]: simpos[k]=simpos.get(k,[]) simpos[k].append(i) return simpos

Finally: find all the dotplot matches! def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): word=s2[j:j+ws] for i in simpos.get(word,[]): dp.add((i+hw,j+hw)) return dp

Computational complexity ??? Blue + Red def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw)) return dp

Computational complexity: Blue def getSimpos(query,alphabet,sm,ws,th): simpos={} # stores positions of a given matching word simwords={} # for each used word, store similar words for i in range(len(query)-ws+1): word=query[i:i+ws] # create word # test if we searched for similar words if not simwords.has_key(word): simwords[word]=findSumWords(word,alphabet,sm,th) for k in simwords[word]: simpos[k]=simpos.get(k,[]) simpos[k].append(i) return simpos O(|s1|*dictionary) = O(c* n ) ~ O( n ) with a big constant c!

Computational complexity: Red for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw)) O(|s2|*simpos) => worst case O( n 2 ) !!! But In the average case with reasonable threshold end score The number of positions that matches can be considered constant on average, so for average case: O(n) !

Computational complexity ??? Blue + Red def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw)) return dp O(n)+O(n) = O(n) !

Experimental time performace of the two implementations * dotplot ^ lindotplot

Exercise: Use the lin_dotplot idea to find local ungapped alignments: 1. use the dotplot pairs and link them together in a single alignment if the pairs are on diagonals ( eg. (i-1,j-1), (i,j), (i+1,j+1), …) 2. extend the idea to process a sequence and search similar local alignments in a set of sequences.

Dot Plot in Python We want to write a python program that create a - PowerPoint PPT Presentation

Dot Plot in Python We want to write a python program that create a dotoplot. The program should take from command line: 1. a fasta file with two sequences, 2. a scoring matrix 3. a word size 4. a display threshold Text example Id matrix,

Session 4 : Python for specialists Exercises 3 1) Plot and use fsolve to find the first root of

Perform EDA AN ALYZ IN G IOT DATA IN P YTH ON Matthias Voppichler IT Developer Plot dataframe

Making a scatter plot IN TR OD U C TION TO DATA SC IE N C E IN P YTH ON Hillar y Green -

Novel Plot Structure Hello! 2 what is plot is all the events that happen in a story

Python Tidbits Python created by that guy ---> Python is named after Monty Pythons

Checking Assumptions Normal distributions: use probability plot (or quantile-quantile plot);

Plot the definition The organized sequence of events that make up a story Every plot is

Contour plot: NO Old plot New plot Likelihood space: NO Outlook: 2 more plots almost done.

1 Parallelism in Python The Problem with Shared State Python provides two mechanisms for

Getting Started with Python The Python Interpreter A piece of software that executes

A Plot for Visualizing Multivariate Data Rida E. A. Moustafa George Mason University ADM

An introduction to Python Andreas Bjerre-Nielsen Agenda 1. Python: what it is; why and how we

Introduction to Data Science: Basic Let's create a dot plot of the example points and text, by

We already know Java. Why learn Python? Using Python to Implement Algorithms Python has far less

16,500,000 Fully upgraded and furnished PLOT # 36 271.50 15,500,000 Completion December 2018

First Tool: Python! Introduction to python programming Gholamhossein Tavasoli @ ZNU First Tool:

Hillas Plot: trivial and non-trivial implications Felix Aharonian 1984 Hillas Plot : B L

Mafic magma types (chemical distinction) Alkalis vs silica plot Alkali versus Silica plot was

Py Python Strings Python strings are immuatable: s = abc s[2] = d s = abd

Key point: character is how you get the readers, plot is how you keep them. the great-souled

Outline Spend 30-40 minutes on Python - Not an intro! - Very quick run-through of how Python

12,600,000 Completion October 2017 14,490,000 3 bedroom on the ground floor PLOT # 32 322.47

Python 1 Python Python is high-level programming language for general-purpose programming.

HPC Python Programming Ramses van Zon July 10, 2019 Ramses van Zon HPC Python Programming July

Dot Plot in Python We want to write a python program that create a - PowerPoint PPT Presentation

Dot Plot in Python We want to write a python program that create a dotoplot. The program should take from command line: 1. a fasta file with two sequences, 2. a scoring matrix 3. a word size 4. a display threshold Text example Id matrix,

Session 4 : Python for specialists Exercises 3 1) Plot and use fsolve to find the first root of

Perform EDA AN ALYZ IN G IOT DATA IN P YTH ON Matthias Voppichler IT Developer Plot dataframe

Making a scatter plot IN TR OD U C TION TO DATA SC IE N C E IN P YTH ON Hillar y Green -

Novel Plot Structure Hello! 2 what is plot is all the events that happen in a story

Python Tidbits Python created by that guy ---&gt; Python is named after Monty Pythons

Checking Assumptions Normal distributions: use probability plot (or quantile-quantile plot);

Plot the definition The organized sequence of events that make up a story Every plot is

Contour plot: NO Old plot New plot Likelihood space: NO Outlook: 2 more plots almost done.

1 Parallelism in Python The Problem with Shared State Python provides two mechanisms for

Getting Started with Python The Python Interpreter A piece of software that executes

A Plot for Visualizing Multivariate Data Rida E. A. Moustafa George Mason University ADM

An introduction to Python Andreas Bjerre-Nielsen Agenda 1. Python: what it is; why and how we

Introduction to Data Science: Basic Let's create a dot plot of the example points and text, by

We already know Java. Why learn Python? Using Python to Implement Algorithms Python has far less

16,500,000 Fully upgraded and furnished PLOT # 36 271.50 15,500,000 Completion December 2018

First Tool: Python! Introduction to python programming Gholamhossein Tavasoli @ ZNU First Tool:

Hillas Plot: trivial and non-trivial implications Felix Aharonian 1984 Hillas Plot : B L

Mafic magma types (chemical distinction) Alkalis vs silica plot Alkali versus Silica plot was

Py Python Strings Python strings are immuatable: s = abc s[2] = d s = abd

Key point: character is how you get the readers, plot is how you keep them. the great-souled

Outline Spend 30-40 minutes on Python - Not an intro! - Very quick run-through of how Python

12,600,000 Completion October 2017 14,490,000 3 bedroom on the ground floor PLOT # 32 322.47

Python 1 Python Python is high-level programming language for general-purpose programming.

HPC Python Programming Ramses van Zon July 10, 2019 Ramses van Zon HPC Python Programming July

Python Tidbits Python created by that guy ---> Python is named after Monty Pythons