Su ffi x arrays Ben Langmead You are free to use these slides. If - PowerPoint PPT Presentation

Su ffi x arrays Ben Langmead You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me brie fl y how you’re using them. For original Keynote fi les, email me.

Su ffi x array As with su ffi x tree, T$ = abaaba$ T is part of index SA (T) = 6 $ (SA = “Su ffi x Array”) 5 a $ 2 a a b a $ m + 1 integers 3 a b a $ 0 a b a a b a $ 4 b a $ b a a b a $ 1 Su ffi x array of T is an array of integers in [0, m ] specifying the lexicographic order of T $’s su ffi xes

Su ffi x array O ( m ) space, same as su ffi x tree. Is constant factor smaller? 32-bit integer can distinguish characters in the human genome, so su ffi x array is ~12 GB, smaller than MUMmer’s 47 GB su ffi x tree.

Su ffi x array: querying Is P a substring of T ? 1. For P to be a substring, it must $ 6 be a pre fi x of ≥ 1 of T ’s su ffi xes a $ 5 2. Su ffi xes sharing a pre fi x are a a b a $ consecutive in the su ffi x array 2 a b a $ 3 Use binary search a b a a b a $ 0 b a $ 4 b a a b a $ 1

Su ffi x array: binary search Python has bisect module for binary search bisect.bisect_left(a, ¡x) : Leftmost o ff set where we can insert x into a to maintain sorted order. a is already sorted! bisect.bisect_right(a, ¡x) : Like bisect_left , but returning rightmost instead of leftmost o ff set from ¡bisect ¡import ¡bisect_left, ¡bisect_right a ¡= ¡[1, ¡2, ¡3, ¡3, ¡3, ¡4, ¡5] print(bisect_left(a, ¡3), ¡bisect_right(a, ¡3)) ¡# ¡output: ¡(2, ¡5) a ¡= ¡[2, ¡4, ¡6, ¡8, ¡10] print(bisect_left(a, ¡5), ¡bisect_right(a, ¡5)) ¡# ¡output: ¡(2, ¡2) Python example: http://nbviewer.ipython.org/6753277

Su ffi x array: binary search We can straightforwardly use binary search to fi nd a range of elements in a sorted list that equal some query: from ¡bisect ¡import ¡bisect_left, ¡bisect_right strls ¡= ¡['a', ¡'awkward', ¡'awl', ¡'awls', ¡'axe', ¡'axes', ¡'bee'] # ¡Get ¡range ¡of ¡elements ¡that ¡equal ¡query ¡string ¡‘awl’ st, ¡en ¡= ¡bisect_left(strls, ¡'awl'), ¡bisect_right(strls, ¡'awl') print(st, ¡en) ¡# ¡output: ¡(2, ¡3) Python example: http://nbviewer.ipython.org/6753277

Su ffi x array: binary search Can also use binary search to fi nd a range of elements in a sorted list with some query as a pre fi x : from ¡bisect ¡import ¡bisect_left, ¡bisect_right strls ¡= ¡['a', ¡'awkward', ¡'awl', ¡'awls', ¡'axe', ¡'axes', ¡'bee'] # ¡Get ¡range ¡of ¡elements ¡with ¡‘aw’ ¡as ¡a ¡prefix st, ¡en ¡= ¡bisect_left(strls, ¡'aw'), ¡bisect_left(strls, ¡'ax') print(st, ¡en) ¡# ¡output: ¡(1, ¡4) Python example: http://nbviewer.ipython.org/6753277

Su ffi x array: binary search We can do the same thing for a sorted list of su ffi xes: 6 $ from ¡bisect ¡import ¡bisect_left, ¡bisect_right 5 a $ t ¡= ¡'abaaba$' 2 a a b a $ suffixes ¡= ¡sorted([t[i:] ¡for ¡i ¡in ¡xrange(len(t))]) 3 a b a $ st, ¡en ¡= ¡bisect_left(suffixes, ¡'aba'), 0 a b a a b a $ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡bisect_left(suffixes, ¡'abb') 4 b a $ print(st, ¡en) ¡# ¡output: ¡(3, ¡5) 1 b a a b a $ Python example: http://nbviewer.ipython.org/6753277

Su ffi x array: querying Is P a substring of T ? Do binary search, check whether P is a $ 6 pre fi x of the su ffi x there a $ 5 How many times does P occur in T ? a a b a $ 2 Two binary searches yield the range of a b a $ 3 su ffi xes with P as pre fi x; size of range equals # times P occurs in T a b a a b a $ 0 Worst-case time bound? b a $ 4 O (log 2 m ) bisections, O ( n ) comparisons b a a b a $ 1 per bisection, so O( n log m )

Su ffi x array: querying Contrast su ffi x array: O( n log m ) with su ffi x tree: O( n ) 6 $ $ a a ba 5 a $ 6 a a b a $ 2 $ ba ba $ a b a $ 3 5 aba$ 4 0 a b a a b a $ aba$ $ 1 3 3 4 b a $ aba$ 1 2 b a a b a $ 0 0 But we can improve bound for su ffi x array...

Su ffi x array: querying Consider further: binary search for su ffi xes with P as a pre fi x Assume there’s no $ in P . So P can’t be equal to a su ffi x. Initialize l = 0 , c = floor(m/2) and r = m (just past last elt of SA) “left” “center” “right” Notation: We’ll use use SA[ l ] to refer to the su ffi x corresponding to su ffi x-array element l . We could write T [ SA[ l ] :], but that’s too verbose. Throughout the search, invariant is maintained: SA[ l ] < P < SA[ r ]

Su ffi x array: querying Throughout search, invariant is maintained: SA[ l ] < P < SA[ r ] What do we do at each iteration? Let c = fl oor(( r + l ) / 2 ) If P < SA[ c ] , either stop or let r = c and iterate If P > SA[ c ] , either stop or let l = c and iterate When to stop? P < SA[ c ] and c = l + 1 - answer is c P > SA[ c ] and c = r - 1 - answer is r

Su ffi x array: querying def ¡binarySearchSA(t, ¡sa, ¡p): ¡ ¡ ¡ ¡assert ¡t[-‑1] ¡== ¡'$' ¡# ¡t ¡already ¡has ¡terminator ¡ ¡ ¡ ¡assert ¡len(t) ¡== ¡len(sa) ¡# ¡sa ¡is ¡the ¡suffix ¡array ¡for ¡t ¡ ¡ ¡ ¡if ¡len(t) ¡== ¡1: ¡return ¡1 ¡ ¡ ¡ ¡l, ¡r ¡= ¡0, ¡len(sa) ¡# ¡invariant: ¡sa[l] ¡< ¡p ¡< ¡sa[r] ¡ ¡ ¡ ¡while ¡True: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡c ¡= ¡(l ¡+ ¡r) ¡// ¡2 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡# ¡determine ¡whether ¡p ¡< ¡T[sa[c]:] ¡by ¡doing ¡comparisons ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡# ¡starting ¡from ¡left-‑hand ¡sides ¡of ¡p ¡and ¡T[sa[c]:] ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡plt ¡= ¡True ¡# ¡assume ¡p ¡< ¡T[sa[c]:] ¡until ¡proven ¡otherwise ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡i ¡= ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡while ¡i ¡< ¡len(p) ¡and ¡sa[c]+i ¡< ¡len(t): ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡p[i] ¡< ¡t[sa[c]+i]: # loop iterations ≈ length ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break ¡# ¡p ¡< ¡T[sa[c]:] of Longest Common Pre fi x ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡elif ¡p[i] ¡> ¡t[sa[c]+i]: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡plt ¡= ¡False (LCP) of P and SA[ c ] ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break ¡# ¡p ¡> ¡T[sa[c]:] ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡i ¡+= ¡1 ¡# ¡tied ¡so ¡far ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡plt: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡c ¡== ¡l ¡+ ¡1: ¡return ¡c If we already know something about ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡r ¡= ¡c ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡else: LCP of P and SA[ c ] , we can save work ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡c ¡== ¡r ¡-‑ ¡1: ¡return ¡r ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡l ¡= ¡c Python example: http://nbviewer.ipython.org/6765182

Su ffi x array: querying Say we’re comparing P to SA[ c ] and we’ve already compared P to SA[ l ] and SA[ r ] in previous iterations. ... l LCP( P , SA[ l ] ) = 3 “Length of the LCP” More generally: LCP( P , SA[ c ] ) ≥ min( LCP( P , SA[ l ] ), LCP( P , SA[ r ] ) ) c SA(T) LCP( P , SA[ c ] ) ≥ 3 We can skip character comparisons LCP( P , SA[ r ] ) = 5 r ...

Su ffi x arrays Ben Langmead You are free to use these slides. If - PowerPoint PPT Presentation

Su ffi x arrays Ben Langmead You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me brie fl y how youre using them. For original

Lecture 3. Su ffi ciency Lecture 3. Su ffi ciency 1 (114) 3. Su ffi ciency 3.1. Su ffi cient

Arrays (2) Higher-Dimensional Arrays Arrays of Character Strings Topics Variables and Arrays

Data Abstraction Copying Arrays. Sorting Arrays. 2D Arrays. Janyl Jumadinova September 30 and

FFI The good, the bad and the ugly Esteban Lorenzano (The Pharo firefighter) Current status of

Lecture 11 Multidimensional arrays Two-dimensional Arrays Just an array of arrays useful

Lecture 11 Multidimensional arrays Two-dimensional Arrays Just an array of arrays useful

Arrays Arrays and Methods Searching Sorting Arrays Reading: => Continue with

Objectives: Discuss arrays Syntax Multi-dimensional arrays Arrays

Taming the C Monster Haskell FFI Techniques Fraser Tweedale @hackuador May 22, 2018 FFI basics

Topic 22 2D Arrays in Java Two Dimensional Arrays Arrays with multiple dimensions may be

Topics Declaring and Instantiating Arrays Accessing Array Elements Writing Methods

Topic #6 CS162 Topic #6 1 CS162 - Topic #6 Lecture: Arrays with Structured Elements

Arrays: Arrays: FIT100 FIT100 FIT100 Just a thought FIT100 FIT100 FIT100 Arrays FIT100

JavaScript Primer Arrays, Functions, Closures Arrays Arrays in JavaScript are a special type

EE 355 Unit 5 Multidimensional Arrays Mark Redekopp 2 MULTIDIMENSIONAL ARRAYS 3

1 Arrays Arrays The array's element type is the type of values it stores The size of an array

company cleaned and optimized its database to increase online revenue 8% Session Title LA LAZ

Web User Profiling using Data Redundancy http://aminer.org/profiling Xiaotao Gu, Hong Yang, Jie

Middleware integration in the Sympa mailing list software Olivier Salan - CRU 1. Sympa, its

Email is Still Your Ace in the Hole SCORE Becky Livingston Becky Livingston Owner Penheel

Lecture 9: Transformers, ELMO Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center

Course Information CS 6355: Structured Prediction Building up structured output prediction

QueryCompletion/Expansion COMP90042 LECTURE 4, THE UNIVERSITY OF MELBOURNE by Matthias Petri

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #4: SYSTEM

Su ffi x arrays Ben Langmead You are free to use these slides. If - PowerPoint PPT Presentation

Su ffi x arrays Ben Langmead You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me brie fl y how youre using them. For original

Lecture 3. Su ffi ciency Lecture 3. Su ffi ciency 1 (114) 3. Su ffi ciency 3.1. Su ffi cient

Arrays (2) Higher-Dimensional Arrays Arrays of Character Strings Topics Variables and Arrays

Data Abstraction Copying Arrays. Sorting Arrays. 2D Arrays. Janyl Jumadinova September 30 and

FFI The good, the bad and the ugly Esteban Lorenzano (The Pharo firefighter) Current status of

Lecture 11 Multidimensional arrays Two-dimensional Arrays Just an array of arrays useful

Lecture 11 Multidimensional arrays Two-dimensional Arrays Just an array of arrays useful

Arrays Arrays and Methods Searching Sorting Arrays Reading: =&gt; Continue with

Objectives: Discuss arrays Syntax Multi-dimensional arrays Arrays

Taming the C Monster Haskell FFI Techniques Fraser Tweedale @hackuador May 22, 2018 FFI basics

Topic 22 2D Arrays in Java Two Dimensional Arrays Arrays with multiple dimensions may be

Topics Declaring and Instantiating Arrays Accessing Array Elements Writing Methods

Topic #6 CS162 Topic #6 1 CS162 - Topic #6 Lecture: Arrays with Structured Elements

Arrays: Arrays: FIT100 FIT100 FIT100 Just a thought FIT100 FIT100 FIT100 Arrays FIT100

JavaScript Primer Arrays, Functions, Closures Arrays Arrays in JavaScript are a special type

EE 355 Unit 5 Multidimensional Arrays Mark Redekopp 2 MULTIDIMENSIONAL ARRAYS 3

1 Arrays Arrays The array's element type is the type of values it stores The size of an array

company cleaned and optimized its database to increase online revenue 8% Session Title LA LAZ

Web User Profiling using Data Redundancy http://aminer.org/profiling Xiaotao Gu, Hong Yang, Jie

Middleware integration in the Sympa mailing list software Olivier Salan - CRU 1. Sympa, its

Email is Still Your Ace in the Hole SCORE Becky Livingston Becky Livingston Owner Penheel

Lecture 9: Transformers, ELMO Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center

Course Information CS 6355: Structured Prediction Building up structured output prediction

QueryCompletion/Expansion COMP90042 LECTURE 4, THE UNIVERSITY OF MELBOURNE by Matthias Petri

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #4: SYSTEM

Arrays Arrays and Methods Searching Sorting Arrays Reading: => Continue with