The k-mer index (PDF document)

Bioinformatics Algorithms (Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics, academic year 2018/19, II. semester

Suffix Trees (and other string indexes)¹

¹ Some of these slides are based on slides of Jens Stoye's.

Text indexes

Let T be a string of length n over alphabet Σ (which we refer to as text in the following).

A text index (or string index) is a data structure built on the text which allows us to answer a certain type of query (e.g. pattern matching) without traversing the whole text.

Typically, we want
1. the index not to use too much space (linear or sublinear in n), and
2. the query time to be fast (ideally: independent of n).

A common string problem: Pattern matching

Pattern matching (aka exact string matching) is at the core of almost every text-managing application.

Pattern matching: Given a (typically long) string T (the text), and a (typically much shorter) string P (the pattern) over the same alphabet Σ, find all occurrences of P as a substring of T.

Variants:
• output all occurrences of P in T ("all-occurrences version")
• decide whether P occurs in T, yes or no ("decision version")
• output the number of occurrences of P in T ("counting version")

We usually refer to the number of occurrences of P as occ_P.

Pattern matching (p.m.)

text: T = T_1 … T_n of length n,
pattern: P = P_1 … P_m of length m

• The best non-index-based algorithms solve this problem in time O(n + m) (e.g. Knuth-Morris-Pratt).
• This is optimal, since one has to read both strings at least once.
• But not tolerable with the data sizes we are seeing now!
• That is why we need text indexes.

The k-mer index

Recall that a k-mer (or k-gram) is a string of length k.

Earlier in this course, we saw the k-mer profile P_k(s) (or q-gram profile) of a string s.

Ex. s = ACAGGGCA; the table below is P_2(s), where r is the rank of the 2-mer u_r in lexicographic order.

     r   u_r   P_2(s)
     0   AA    0
     1   AC    1
     2   AG    1
     3   AT    0
     4   CA    2
     5   CC    0
     6   CG    0
     7   CT    0
     8   GA    0
     9   GC    1
    10   GG    2
    11   GT    0
    12   TA    0
    13   TC    0
    14   TG    0
    15   TT    0
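The k-mer profile above can be computed with one pass over s. The following is a minimal sketch, assuming the DNA alphabet ACGT; the function name `kmer_profile` and its parameters are hypothetical and not taken from the slides.

```python
from itertools import product


def kmer_profile(s, k, alphabet="ACGT"):
    """Return (kmers, counts): all k-mers over `alphabet` in lexicographic
    order, and how often each one occurs in s."""
    # Enumerate all sigma^k k-mers; product() over a sorted alphabet
    # yields them in lexicographic order, so the list index is the rank r.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    rank = {u: r for r, u in enumerate(kmers)}
    counts = [0] * len(kmers)
    # Slide a window of length k over s: n - k + 1 windows in total.
    for i in range(len(s) - k + 1):
        counts[rank[s[i:i + k]]] += 1
    return kmers, counts


kmers, counts = kmer_profile("ACAGGGCA", 2)
for r, (u, c) in enumerate(zip(kmers, counts)):
    print(r, u, c)
```

The counts sum to n − k + 1 = 7 for the example string, since every window of length k contributes exactly one k-mer.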

The k-mer index

Replacing the number of occurrences by the occurrences themselves, we get the k-mer index of s.

Ex. s = ACAGGGCA; the table below is the 2-mer index of s (positions are 1-based; "-" marks a 2-mer with no occurrence).

     r   u_r   occurrences
     0   AA    -
     1   AC    1
     2   AG    3
     3   AT    -
     4   CA    2, 7
     5   CC    -
     6   CG    -
     7   CT    -
     8   GA    -
     9   GC    6
    10   GG    4, 5
    11   GT    -
    12   TA    -
    13   TC    -
    14   TG    -
    15   TT    -

Analysis (for p.m.)

Space: total space is O(σ^k + n), since the number of rows is σ^k and the total number of entries is n − k + 1.

Time (p.m.): O(k) for decision, O(k + occ_P) for all-occurrences.

N.B.: works only for patterns of length exactly k.

The suffix tree

T = BANANA$ (add sentinel character $ ∉ Σ)

[Figure: suffix tree of BANANA$ with leaves labelled 1 to 7. The edge labels are only conceptual; each is stored as two pointers into the string: $ as [7,7], A as [2,2], NA as [3,4], NA$ as [5,7], BANANA$ as [1,7].]

Given a string T over Σ (a finite ordered alphabet), and $ ∉ Σ.

Definitions
• ST(T) is a rooted tree with edge-labels from (Σ ∪ {$})^+ such that
  • the labels of all edges outgoing from a node begin with different characters;
  • the paths from the root to the leaves of ST(T) spell the suffixes of T$;
  • each node in ST(T) is either the root, a leaf, or a branching node;
• L(u) is the path-label of node u: the concatenation of edge labels on the path from the root to u,
• a leaf v has leaf-label i if and only if L(v) = T_i … T_n $ (the i'th suffix),
• sd(v), the string-depth of a node v, is the length of its path-label,
• a locus (u, d) is a position on an edge (v, u) where u is a node of ST(T) and sd(v) < d ≤ sd(u): d is the string-depth of locus (u, d).

N.B.: the edge labels are not stored explicitly:
• they are represented by two pointers [b, e] into T: beginning and end of an occurrence of the edge label;
• this representation is not necessarily unique;
• e.g. in the example, any edge with label NA can be represented by [3, 4] or [5, 6].
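Returning to the k-mer index analysed in this part: the sketch below builds it and answers pattern matching queries for patterns of length exactly k. It is an illustration under stated assumptions, not the slides' implementation. The names `build_kmer_index` and `occurrences` are hypothetical, and a dictionary holding only the k-mers that occur is used instead of a table with all σ^k rows, so empty rows are simply absent.

```python
from collections import defaultdict


def build_kmer_index(s, k):
    """Map each k-mer occurring in s to the sorted list of its
    1-based starting positions (assumption: dict instead of a full table)."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        # Positions are appended in increasing order, so lists stay sorted.
        index[s[i:i + k]].append(i + 1)
    return dict(index)


def occurrences(index, k, pattern):
    """All-occurrences query; only defined for patterns of length exactly k."""
    if len(pattern) != k:
        raise ValueError("the k-mer index only supports patterns of length k")
    # One lookup on a key of length k, then the stored positions are returned.
    return index.get(pattern, [])


index = build_kmer_index("ACAGGGCA", 2)
print(occurrences(index, 2, "CA"))   # positions of CA
print(occurrences(index, 2, "TT"))   # no occurrence: empty list
```

The decision version is `bool(occurrences(...))` and the counting version is `len(occurrences(...))`. With the dictionary variant the index holds n − k + 1 positions in total, matching the entry count in the analysis, while the σ^k term for empty rows disappears.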

Suffix tree properties

• The leaves of ST(T) correspond to the suffixes of T$.
• ST(T) represents exactly the substrings of T$: there is a one-to-one correspondence between loci in ST(T) (possibly within an edge) and substrings of T$.
• This allows us to define locus(P) for a substring P of T.
• The leaves in the subtree under locus(P) correspond to the (beginning positions of) P's occurrences in T$: one-to-one correspondence between leaves in the subtree under locus(P) and occurrences of substring P.
• ST(T) requires O(n) space (details next).

Space usage of suffix trees

Lemma: ST(T) requires O(n) space.

Proof sketch:
1. ST(T) has exactly n + 1 leaves (one for each suffix).
2. Each internal node is branching, therefore there are at most n internal nodes.
3. A tree with at most 2n + 1 nodes has at most 2n edges.
4. Each node can be represented in constant space.
5. Each edge is labeled by a substring of T$ and hence can be represented by a pair of pointers [i, j] into T$.

MAGIC! The suffix tree represents a possibly quadratic number of objects (the substrings) in linear space!

The suffix array

Definition
The SA is a permutation of {1, 2, …, n + 1} s.t. SA[i] = j if the j'th suffix Suf_j = T_j … T_n $ is the i'th among all suffixes in lexicographic order.

Example:

    position: 1 2 3 4 5 6 7
    T       = B A N A N A $

SA = [7, 6, 4, 2, 1, 5, 3]

    i   SA[i]   suffix starting at SA[i]
    1   7       $
    2   6       A$
    3   4       ANA$
    4   2       ANANA$
    5   1       BANANA$
    6   5       NA$
    7   3       NANA$

Note: $ is smaller than all other characters.

Suffix tree and suffix array

[Figure: the suffix tree of BANANA$ next to its suffix array SA = [7, 6, 4, 2, 1, 5, 3].]

N.B. When reading the leaves of the ST from left to right, we get the SA. (Note that children of inner nodes are ordered according to the alphabet's order.)

One can imagine the suffix array as the leaves of the suffix tree that fell down and stayed in order …

Some Applications of Suffix Trees/Suffix Arrays

• exact string matching
• exact set matching
• text statistics
• DNA contamination problem
• common substrings of more than two strings
• matching statistics
• overlap computation (all-pairs prefix-suffix matching)
• exact repeats and palindromes problem
• tandem repeats problem
• shortest unique substring
• maximal unique matches
• approximate string matching (k-mismatch and k-differences)
• computation of the q-gram distance
• Lempel-Ziv data compression
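To make the suffix array definition and the first application (exact string matching) concrete, here is a minimal sketch. It builds the SA by plainly sorting the suffixes, which is simple but much slower than dedicated construction algorithms, and it answers all-occurrences queries by binary search over the SA. The binary search is a standard technique that is not taken from the slides. The names `suffix_array` and `find_occurrences` are hypothetical, and the code relies on the ASCII fact that `$` sorts before all letters.

```python
def suffix_array(t):
    """Return the 1-based suffix array of t + '$' (naive construction by sorting)."""
    text = t + "$"
    # Sort the 0-based start positions by the suffix they begin, then shift to 1-based.
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


def find_occurrences(t, sa, p):
    """Return the sorted 1-based positions where p occurs in t, using binary search on sa."""
    text = t + "$"
    m = len(p)

    def prefix(rank):
        # First m characters of the suffix at the given 0-based rank in sa.
        start = sa[rank] - 1
        return text[start:start + m]

    # Lower bound: first rank whose length-m prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first rank whose length-m prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes starting with p form the contiguous interval sa[first:lo].
    return sorted(sa[first:lo])


sa = suffix_array("BANANA")
print(sa)
print(find_occurrences("BANANA", sa, "ANA"))
```

The suffixes that start with P occupy a contiguous interval of the SA, just as they are the leaves under locus(P) in the suffix tree; each binary search step compares at most m characters, so a query takes O(m log n) time plus the time to report the occurrences.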
