Composite Pattern Discovery for PCR Application Stanislav Angelov - PowerPoint PPT Presentation

Composite Pattern Discovery for PCR Application Stanislav Angelov University of Pennsylvania, USA Shunsuke Inenaga Kyushu University, Japan Japan Society for the Promotion of Science

Pattern Discovery
Input: text data. Output: pattern (knowledge, rule).
[Figure: example DNA strings such as ACGTTGACGT, CAGTGTCACA, GTTATGCCCC and ACTGTGCCTT as the input text data, with a shared pattern such as TG extracted as the output.]

Finding Missing Patterns
Input: text T and threshold α
Output: pattern pair (A, B) satisfying:
1. the distance between any occurrences of A and B in T is at least α,
2. |A| = |B|, and
3. |A| (= |B|) is the shortest possible.

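Finding Missing Patterns [cont.]
If A and B are non α-close, (A, B) is said to be a missing pair.
[Figure: three cases on text T. Case 1 (α-close): an occurrence of A and an occurrence of B lie within distance α of each other. Case 2 (non α-close): A and B both occur, but never within distance α. Case 3 (non α-close): only A occurs in T.]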
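As a concrete reading of the three cases, here is a minimal brute-force sketch in Python. It assumes that the distance between two occurrences is the absolute difference of their start positions, which the slides do not spell out; the function names are illustrative and not taken from the authors' code.

```python
def occurrences(text, pattern):
    """Return all start positions of pattern in text (overlaps allowed)."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def is_alpha_close(text, a, b, alpha):
    """True if some occurrence of a and some occurrence of b start
    fewer than alpha positions apart (assumed distance measure)."""
    occ_a = occurrences(text, a)
    occ_b = occurrences(text, b)
    return any(abs(i - j) < alpha for i in occ_a for j in occ_b)


def is_missing_pair(text, a, b, alpha):
    """(a, b) is a missing pair when a and b are not alpha-close.
    This also covers Case 3, where one of the patterns never occurs."""
    return not is_alpha_close(text, a, b, alpha)
```

For example, in `ACGTTGACGT` the patterns `ACG` (positions 0 and 6) and `TTG` (position 3) are always exactly 3 apart, so under this distance measure they form a missing pair for a threshold of 3 but not for a threshold of 4.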

Application - PCR
PCR (Polymerase Chain Reaction)
- Standard technique to produce many copies of a region of DNA (even from a tiny sample).
- In medicine, used to detect infections.
- In forensic science, used to identify individuals.

Application - PCR [cont.]
Nested PCR
- Repeated PCR with nested primers.
- Achieves ultra-sensitive detection.
- Good adapter primers for nested PCR bind only to the adapters and amplify nothing directly from the samples.

Application - PCR [cont.]
[Figure: sample strand S (5' to 3') and its complement S' (3' to 5'), with the adapters and the left and right specific primers marked.]
- We want a pair of good adapter primers that amplify nothing directly from S or S'. (Adapter primers are complements of the adapters.)
- If (A, B) is a missing pair in S and S', then (A', B) is not a pair of binding sites for any region of length less than α.
- So (A', B) satisfies a necessary condition for being a good adapter primer pair.

Previous Work
- Inenaga, Kivioja and Mäkinen [WABI'04] proposed a bit-table based algorithm to find a missing pattern pair of the same length.
- The same paper also gave a suffix tree based algorithm for the generalized problem in which the two patterns of the pair may have different lengths.

Complexity Comparisons
Finding a missing pattern pair of the same length:

algorithm                                          time                       space
our algorithm                                      O(αn log log_σ n)          O(n)
bit-table algorithm of Inenaga et al. [WABI'04]    O(  n (  + loglog  n ))    O(  n )

- σ is the alphabet size.
- α is typically 5000 (due to the PCR application).

Complexity Comparisons [cont.]
Finding a missing pattern pair of different lengths:

algorithm                                             time                space
our algorithm                                         O(αn log_σ n)       O(n)
suffix tree algorithm A of Inenaga et al. [WABI'04]   O(n^2)              O(n)
suffix tree algorithm B of Inenaga et al. [WABI'04]   O(  n log  n )      O( n log  n )

- Our algorithm does not need a suffix tree: it is not only faster but also simpler.

Single Missing Pattern
- We start by finding a single missing pattern.
- KEY: there are at most σ^k patterns of length k, while T (of length n) has only n - k + 1 substrings of length k.
[Figure: text T of length n with its n - k + 1 windows of length k, set against the σ^k patterns P_1, P_2, ..., P_{σ^k}, with σ^k < n.]
- If all σ^k patterns of length k occur in T, then k < log_σ n.
- If k is the largest integer for which all σ^k patterns of length k occur in T, then there is a missing pattern of length k + 1 ≤ ⌈log_σ n⌉.
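The counting argument behind this bound can be written out as follows, where σ is the alphabet size, n the length of T, and k a length for which every pattern occurs. The numeric illustration assumes n ≈ 2.5 × 10^9 for a DNA text with σ = 4; that figure is an assumption for illustration, not a measured value.

```latex
% T has n - k + 1 substrings of length k, so at most that many
% distinct patterns of length k can occur in T.
\[
  \sigma^{k} \;\le\; n - k + 1 \;<\; n \quad (k \ge 2)
  \qquad\Longrightarrow\qquad
  k \;<\; \log_{\sigma} n .
\]
% Illustration with \sigma = 4 and n \approx 2.5 \times 10^{9}:
\[
  4^{15} \approx 1.07 \times 10^{9} \;<\; n \;<\; 4^{16} \approx 4.29 \times 10^{9},
\]
% so at least one pattern of length 16 cannot occur in such a text.
```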

Single Missing Pattern [cont.]
- Compute a bit table of all patterns of length ⌊log_σ n⌋ using a bijective mapping f from patterns to integers (O(n) time, using e.g. the Karp & Rabin algorithm).
1) If there exists a missing pattern of length ⌊log_σ n⌋, output it.
2) Otherwise (all patterns of length ⌊log_σ n⌋ are present in T), there is a missing pattern of length ⌊log_σ n⌋ + 1; compute and output it.
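A minimal Python sketch of the bit-table idea follows. It is a simplification, not the slides' O(n) procedure: it tries lengths 1, 2, 3, ... in turn instead of jumping straight to a length near log_σ n. The mapping f is assumed to read a pattern as a base-σ number, and the alphabet `ACGT` and all function names are illustrative.

```python
def encode_positions(text, k, alphabet="ACGT"):
    """Yield f(text[i:i+k]) for every start i, where f reads a pattern
    as a base-sigma number; each value is updated from the previous one."""
    sigma = len(alphabet)
    digit = {c: d for d, c in enumerate(alphabet)}
    top = sigma ** (k - 1)
    code = 0
    for i, c in enumerate(text):
        # Drop the oldest character, shift, and append the new one.
        code = (code % top) * sigma + digit[c]
        if i >= k - 1:
            yield code


def decode(code, k, alphabet="ACGT"):
    """Inverse of f: turn an integer back into a pattern of length k."""
    sigma = len(alphabet)
    chars = []
    for _ in range(k):
        code, d = divmod(code, sigma)
        chars.append(alphabet[d])
    return "".join(reversed(chars))


def shortest_missing_pattern(text, alphabet="ACGT"):
    """Return a shortest pattern over alphabet that does not occur in text."""
    sigma = len(alphabet)
    k = 1
    while True:
        present = bytearray(sigma ** k)      # the bit table for length k
        for code in encode_positions(text, k, alphabet):
            present[code] = 1
        missing = present.find(0)            # first unset entry, or -1
        if missing != -1:
            return decode(missing, k, alphabet)
        k += 1
```

The loop always terminates, because once σ^k exceeds the number of length-k windows in the text, some table entry must stay unset.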

Missing Pair of Fixed Length
- Input: text T, threshold α, pattern lengths a and b.
- Output: missing pattern pair (A, B) such that |A| = a and |B| = b.
- Assume w.l.o.g. a > b.
- We consider the case a < m, where m is the length of the shortest single missing pattern P in T; otherwise P can be paired with any pattern of length b.
- Let N_a = σ^a and N_b = σ^b (note n > N_a > N_b).

Missing Pair of Fixed Length [cont.]
- Let f(A) = h. L is an array of size N_a (indices 0 to N_a - 1), where L[h] is the list of occurrences i_1, i_2, i_3, ... of A in T.
- Let f(B) = h'. H is an array of size n - b + 1 (indices 0 to n - b), where H[j] = h' for the pattern B of length b starting at position j of T.
[Figure (animation over four slides): for the occurrence i_1 of A, the patterns B_1 and B_2 of length b within distance α of i_1 are looked up in H; the entries f(B_1) = h_1 and f(B_2) = h_2 are set to 1 in a table M with indices 0 to N_b - 1, and the counter C_M goes from 0 to 1 to 2.]

Missing Pair of Fixed Length [cont.]
- The iteration for a pattern A ends
  - when C_M = N_b: in this case, all patterns of length b are α-close to A;
  - or when all positions in L[h] have been processed: in this case, scan M to find a missing pattern B of length b, and the algorithm outputs the missing pair (A, B).
- The algorithm runs in a total of O(αn) time and O(n) space.
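The procedure with the arrays L, H, M and the counter C_M can be sketched in Python as below. Several details are assumptions rather than the authors' implementation: distance is taken as the absolute difference of start positions, f reads a pattern as a base-σ number, and M stores the code of the current pattern A as a stamp instead of a bit so that it never has to be cleared between patterns. The alphabet `ACGT` and the function names are illustrative.

```python
def kmer_codes(text, k, alphabet):
    """Return [f(text[j:j+k]) for each start j], f being the base-sigma value."""
    sigma = len(alphabet)
    digit = {c: d for d, c in enumerate(alphabet)}
    top = sigma ** (k - 1)
    codes, code = [], 0
    for i, c in enumerate(text):
        code = (code % top) * sigma + digit[c]
        if i >= k - 1:
            codes.append(code)
    return codes


def decode(code, k, alphabet):
    """Inverse of f: integer back to a pattern of length k."""
    sigma = len(alphabet)
    chars = []
    for _ in range(k):
        code, d = divmod(code, sigma)
        chars.append(alphabet[d])
    return "".join(reversed(chars))


def missing_pair_fixed_length(text, alpha, a, b, alphabet="ACGT"):
    """Return a missing pair (A, B) with |A| = a and |B| = b, or None."""
    sigma = len(alphabet)
    n_a, n_b = sigma ** a, sigma ** b
    # L[h]: start positions of the pattern A with f(A) = h.
    L = [[] for _ in range(n_a)]
    for i, h in enumerate(kmer_codes(text, a, alphabet)):
        L[h].append(i)
    # H[j]: code of the length-b pattern starting at position j.
    H = kmer_codes(text, b, alphabet)
    # M[h2] == h means pattern h2 was seen alpha-close to pattern h.
    M = [-1] * n_b
    for h in range(n_a):
        c_m = 0                                  # the counter C_M
        for i in L[h]:
            lo = max(0, i - alpha + 1)
            hi = min(len(H), i + alpha)
            for j in range(lo, hi):              # all j with |i - j| < alpha
                if M[H[j]] != h:
                    M[H[j]] = h
                    c_m += 1
            if c_m == n_b:                       # every B is alpha-close to A
                break
        if c_m < n_b:
            h2 = next(x for x in range(n_b) if M[x] != h)
            return decode(h, a, alphabet), decode(h2, b, alphabet)
    return None
```

Each occurrence inspects fewer than 2α entries of H, which is where the factor α in the running time comes from.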

Missing Pair of Same Length [cont.]
- Monotonicity property: if (A, B) is a missing pair, then for any superstrings C, D of A, B respectively, (C, D) is also a missing pair.
- By the monotonicity property, we can binary-search the pattern length over 1 … log_σ n using the aforementioned algorithm and find the shortest missing pair of the same length. This takes O(αn log log_σ n) time and O(n) space.
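The binary search over the pattern length can be sketched as follows. To stay self-contained, the sketch uses a brute-force stand-in (`find_missing_pair`) where the fixed-length algorithm would go, so it shows only the search structure, not the stated running time. Distance is assumed to be the absolute difference of start positions, and the alphabet `ACGT` and the function names are illustrative.

```python
from itertools import product


def find_missing_pair(text, alpha, k, alphabet="ACGT"):
    """Brute-force stand-in: return some missing pair of length k, or None."""
    positions = {}
    for i in range(len(text) - k + 1):
        positions.setdefault(text[i:i + k], []).append(i)
    for a in map("".join, product(alphabet, repeat=k)):
        for b in map("".join, product(alphabet, repeat=k)):
            if all(abs(i - j) >= alpha
                   for i in positions.get(a, [])
                   for j in positions.get(b, [])):
                return a, b
    return None


def shortest_missing_pair(text, alpha, alphabet="ACGT"):
    """Binary-search the smallest length k that admits a missing pair."""
    sigma = len(alphabet)
    # Smallest hi with sigma**hi > len(text): some pattern of that length
    # is absent from text, so a missing pair of length hi must exist.
    hi = 1
    while sigma ** hi <= len(text):
        hi += 1
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if find_missing_pair(text, alpha, mid, alphabet) is not None:
            hi = mid          # monotonicity: longer lengths also succeed
        else:
            lo = mid + 1
    return find_missing_pair(text, alpha, lo, alphabet)
```

The search is valid only because of monotonicity: once a missing pair exists at some length, one exists at every greater length, so the predicate flips from false to true exactly once.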

Missing Pair of Different Length
- It is not hard to extend the algorithm to the case where A and B do not necessarily have the same length.
- We can find such a missing pair in O(αn log_σ n) time and O(n) space.

Experiments
- Linux on a 1 GHz CPU with 2 GB RAM.
- Implemented in Java: http://www.cis.upenn.edu/~angelov
- Human genome (2.5 GB) from ftp://ftp.ensembl.org/pub/current_human/
- α = 5000.

Experiments [cont.]
- We found 238 pairs of missing patterns of length 8 for the human genome.
- For the Baker's yeast genome, the patterns in the shortest missing pairs are also of length 8! [Inenaga et al. WABI'04]
- There are common missing pairs of patterns of length 8 for the human and yeast genomes.

Experiments [cont.]
Missing pattern pairs of length 8 for both the human and the yeast genomes. The reverse complements are also missing.

missing pair            yeast α_AB   human α_AB
(AATCGACG,CGATCGGT)     5008         6458
(CCGATCGG,CCGTACGG)     5658         6839
(CGACCGTA,TACGGTCG)     13933        7585
(CGACCGTA,TCGCGTAC)     5494         5345
(CGAGTACG,GTCGATCG)     5903         8090
(CGATCGGA,GCGCGATA)     6432         6619
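Since this slide refers to reverse complements, here is a small Python helper for computing them. It assumes plain upper-case DNA over A, C, G, T; the function name is illustrative.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(_COMPLEMENT)[::-1]
```

Applied to the listed pairs, `reverse_complement("CGACCGTA")` gives `TACGGTCG`, so the two patterns of the third pair are reverse complements of each other, and `CCGATCGG` is its own reverse complement.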

Conclusions
- We solved the missing pattern pair problem in O(αn log log_σ n) time for the same-length case and in O(αn log_σ n) time for the different-length case, both in O(n) space.
- We also developed an alternative algorithm to solve this problem, and moreover solved extended problems (see the proceedings).
