Efficient List-based Computation of the String Subsequence Kernel


  1. Sections: Introduction; String Subsequence Kernels; List and Layered Range Tree based Approach; Conclusion.
      Efficient List-based Computation of the String Subsequence Kernel
      Slimane Bellaouar (1), Hadda Cherroun (1), Djelloul Ziadi (2)
      (1) Laboratoire LIM, Université Amar Telidji, Laghouat, Algérie
      (2) Laboratoire LITIS - EA 4108, Université de Rouen, Rouen, France

  2. Outline
      1. Introduction
      2. String Subsequence Kernels: Naive Implementation; Efficient Implementations
      3. List and Layered Range Tree based Approach: Suffix Table Representation; Location of Points in a Range; Fractional Cascading; List of Lists Building and GWSK Computation
      4. Conclusion

  3. Introduction: Many machine learning algorithms are designed for linearly separable problems.

  4. Introduction: Kernel methods project the data into a high-dimensional feature space where linear learning machines can be applied.

  5. Introduction: Strings are among the most important data types, and a great deal of research has been devoted to string kernels. All string kernels share the same philosophy: they count, in different ways, the common substrings or subsequences that occur in the two strings.

  6. Introduction, Motivation: Computational efficiency is a key property of kernel methods.

  7. String Subsequence Kernels: The SSK measures the similarity between two strings based on their non-contiguous elements (subsequences). A gap penalty $\lambda \in\, ]0, 1]$ is introduced, and the feature map is
      $$\phi^p_u(s) = \sum_{I : u = s(I)} \lambda^{l(I)}, \quad u \in \Sigma^p.$$
      The associated kernel can be written as
      $$K_p(s, t) = \langle \phi^p(s), \phi^p(t) \rangle = \sum_{u \in \Sigma^p} \sum_{I : u = s(I)} \sum_{J : u = t(J)} \lambda^{l(I) + l(J)}.$$
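
The definition on this slide can be checked on small inputs by enumerating index tuples directly. The following is a minimal Python sketch, not the authors' code: it assumes the usual convention $l(I) = i_p - i_1 + 1$ (the span covered by the index tuple), and the names `ssk_features` and `ssk_bruteforce` are hypothetical. Its cost grows exponentially with $p$, so it is only useful as a reference.

```python
from collections import defaultdict
from itertools import combinations


def ssk_features(s, p, lam):
    """Explicit feature map: phi[u] sums lam**l(I) over index tuples I with s(I) == u."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), p):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # assumed convention: l(I) = i_p - i_1 + 1
        phi[u] += lam ** span
    return phi


def ssk_bruteforce(s, t, p, lam):
    """K_p(s, t) as the inner product of the two explicit feature vectors."""
    phi_s = ssk_features(s, p, lam)
    phi_t = ssk_features(t, p, lam)
    return sum(w * phi_t[u] for u, w in phi_s.items() if u in phi_t)
```

For example, `ssk_bruteforce("gatta", "cata", 1, lam)` sums one $\lambda \cdot \lambda$ term per pair of matching symbols.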

  9. Naive Implementation: A suffix kernel is defined to assist in the computation of the SSK:
      $$\phi^{p,S}_u(s) = \sum_{I \in I^{|s|}_p : u = s(I)} \lambda^{l(I)}, \quad u \in \Sigma^p.$$
      The SSK can be expressed in terms of its suffix version as follows:
      $$K_p(s, t) = \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} K^S_p(s(1:i), t(1:j)),$$
      with $K^S_1(s, t) = [s_{|s|} = t_{|t|}]\,\lambda^2$.

  10. Naive Implementation: A recursion has to be devised. The similarity between two strings ($sa$ and $tb$) is conditioned by their last symbols:
      $$K^S_p(sa, tb) = [a = b] \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} \lambda^{2 + |s| - i + |t| - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$
      The recursion leads to an $O(p\,|s|^2 |t|^2)$ time complexity.
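
The recursion on this slide translates almost literally into code. The sketch below is an illustration under stated assumptions rather than the presenters' implementation: `suffix_kernel` and `ssk_naive` are hypothetical names, strings are plain Python `str`, and no memoization is used, so it is even slower than the stated bound and is suitable only for tiny inputs.

```python
def suffix_kernel(s, t, p, lam):
    """K^S_p(s, t): only index tuples ending at the last symbols of s and t count."""
    if not s or not t or s[-1] != t[-1]:
        return 0.0
    if p == 1:
        return lam ** 2  # base case K^S_1
    head_s, head_t = s[:-1], t[:-1]
    total = 0.0
    for i in range(1, len(head_s) + 1):
        for j in range(1, len(head_t) + 1):
            gap = len(head_s) - i + len(head_t) - j
            total += lam ** (2 + gap) * suffix_kernel(head_s[:i], head_t[:j], p - 1, lam)
    return total


def ssk_naive(s, t, p, lam):
    """K_p(s, t) as the sum of suffix kernels over all pairs of prefixes."""
    return sum(
        suffix_kernel(s[:i], t[:j], p, lam)
        for i in range(1, len(s) + 1)
        for j in range(1, len(t) + 1)
    )
```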

  11. Naive Implementation: Example (computation of the SSK) with $s = \text{gatta}$, $t = \text{cata}$ and $p = 1$.

      | $K^S_1$ | g | a | t | t | a |
      |---|---|---|---|---|---|
      | c | 0 | 0 | 0 | 0 | 0 |
      | a | 0 | $\lambda^2$ | 0 | 0 | $\lambda^2$ |
      | t | 0 | 0 | $\lambda^2$ | $\lambda^2$ | 0 |
      | a | 0 | $\lambda^2$ | 0 | 0 | $\lambda^2$ |

      $$K_1(\text{gatta}, \text{cata}) = 6\lambda^2.$$

  13. Efficient Implementations: There exist three efficient approaches to compute the SSK: the dynamic programming approach, the trie-based approach, and the sparse dynamic programming approach.

  14. Efficient Implementations, Dynamic Programming Approach: The similarity between two strings ($sa$ and $tb$) is conditioned by their final symbols:
      $$K^S_p(sa, tb) = [a = b] \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} \lambda^{2 + |s| - i + |t| - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$
      We can consider a separate dynamic programming table:
      $$DP_p(k, l) = \sum_{i=1}^{k} \sum_{j=1}^{l} \lambda^{k - i + l - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$

  15. Efficient Implementations, Dynamic Programming Approach:
      $$DP_p(k, l) = \sum_{i=1}^{k} \sum_{j=1}^{l} \lambda^{k - i + l - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$
      Computing $DP_p$ from this double sum for each $(k, l)$ would be inefficient. We can devise a recursion:
      $$DP_p(k, l) = K^S_{p-1}(s(1:k), t(1:l)) + \lambda\, DP_p(k-1, l) + \lambda\, DP_p(k, l-1) - \lambda^2\, DP_p(k-1, l-1).$$
      The computation of the SSK then has an $O(p\,|s|\,|t|)$ time complexity.
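
A possible rendering of this dynamic programming scheme is sketched below. It is an illustration, not the authors' code: the name `ssk_dp` is hypothetical, and the step $K^S_p(s(1:k+1), t(1:l+1)) = [s_{k+1} = t_{l+1}]\,\lambda^2\, DP_p(k, l)$ is obtained by substituting the definition of $DP_p$ into the suffix-kernel recursion.

```python
def ssk_dp(s, t, p, lam):
    """K_p(s, t) in O(p |s| |t|) time via the DP_p recursion."""
    n, m = len(s), len(t)
    # ks[i][j] = K^S_q(s(1:i), t(1:j)) for the current order q (1-based i, j).
    ks = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                ks[i][j] = lam ** 2  # base case K^S_1
    for _ in range(2, p + 1):
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        nxt = [[0.0] * (m + 1) for _ in range(n + 1)]
        for k in range(1, n + 1):
            for c in range(1, m + 1):
                # Inclusion-exclusion recursion from the slide.
                dp[k][c] = (ks[k][c]
                            + lam * dp[k - 1][c]
                            + lam * dp[k][c - 1]
                            - lam ** 2 * dp[k - 1][c - 1])
                # Derived link: suffix kernel of the next order at (k+1, c+1).
                if k < n and c < m and s[k] == t[c]:
                    nxt[k + 1][c + 1] = lam ** 2 * dp[k][c]
        ks = nxt
    return sum(map(sum, ks))
```

Only two tables are kept per order, so memory stays at $O(|s|\,|t|)$ regardless of $p$.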

  16. Efficient Implementations, Trie-based Approach: This approach is based on search trees known as tries, introduced by E. Fredkin in 1960. The key idea is that the leaves of the trie serve as indices of the feature space, which is indexed by the set $\Sigma^p$. The kernel is evaluated as follows:
      $$K_p(s, t) = \sum_{u \in \Sigma^p} \phi^p_u(s)\, \phi^p_u(t) = \sum_{u \in \Sigma^p} \sum_{g_s, g_t} \lambda^{g_s + p}\, |L_s(u, g_s)| \cdot \lambda^{g_t + p}\, |L_t(u, g_t)|.$$
      The worst-case time complexity of the algorithm is $O\!\left(\binom{p+m}{m} (|s| + |t|)\right)$.

  17. Efficient Implementations, Sparse Dynamic Programming Approach: Observation: most of the entries of the DP matrix are zero. Proposition: use two data structures, a set of match lists instead of the $K^S_p$ matrix and a range sum tree (B-tree) instead of the $DP_p$ matrix. The time complexity is $O(p\,|L| \log \min(|s|, |t|))$, where $L = \{(i, j) \mid s_i = t_j\}$.

  18. List and Layered Range Tree based Approach: Objective: improve the complexity of the SSK. Observation 1: the computation of $K^S_p(s, t)$ is required only when $s_{|s|} = t_{|t|}$. Proposition: keep only a list of index pairs rather than the whole suffix table, $L(s, t) = \{(i, j) : s_i = t_j\}$.

  19. List and Layered Range Tree based Approach: Example.

      | $K^S_1$ | g | a | t | t | a |
      |---|---|---|---|---|---|
      | c | 0 | 0 | 0 | 0 | 0 |
      | a | 0 | $\lambda^2$ | 0 | 0 | $\lambda^2$ |
      | t | 0 | 0 | $\lambda^2$ | $\lambda^2$ | 0 |
      | a | 0 | $\lambda^2$ | 0 | 0 | $\lambda^2$ |

      $$L(\text{gatta}, \text{cata}) = \{(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)\}.$$
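
Building the list of matching index pairs is straightforward. The snippet below is a small sketch with a hypothetical function name, `match_list`; it uses 1-based pairs $(i, j)$ with $i$ indexing $s$ and $j$ indexing $t$, and iterates over $t$ in the outer loop so that the pairs come out in the same order as in the example.

```python
def match_list(s, t):
    """Return L(s, t) = [(i, j) : s_i == t_j] with 1-based indices, ordered by j then i."""
    return [
        (i, j)
        for j in range(1, len(t) + 1)
        for i in range(1, len(s) + 1)
        if s[i - 1] == t[j - 1]
    ]


print(match_list("gatta", "cata"))
# [(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)]
```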

  20. List and Layered Range Tree based Approach: Observation 2: it is not obvious how to compute $K^S_p(s, t)$ efficiently on a list data structure (a direct approach takes $O(p\,|L(s, t)|^2)$ time). Proposition: the suffix table of $K^S_p(s, t)$ can be represented in a two-dimensional space.
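
To make the quadratic cost concrete, here is a hedged sketch of a direct list-based computation. It is not the layered range tree method of the talk: it simply scans the whole match list for every pair, which gives the $O(p\,|L(s, t)|^2)$ behaviour mentioned above. The name `ssk_list` is hypothetical, and the update rule is the suffix-kernel recursion restricted to matching pairs.

```python
def ssk_list(s, t, p, lam):
    """K_p(s, t) computed on the match list only, in O(p |L|^2) time."""
    # 1-based matching pairs (i, j) with s_i == t_j.
    pts = [
        (i, j)
        for j in range(1, len(t) + 1)
        for i in range(1, len(s) + 1)
        if s[i - 1] == t[j - 1]
    ]
    val = {pt: lam ** 2 for pt in pts}  # K^S_1 at every match
    for _ in range(2, p + 1):
        nxt = {}
        for (i, j) in pts:
            total = 0.0
            for (a, b) in pts:
                # Only matches strictly before (i, j) in both strings contribute.
                if a < i and b < j:
                    total += lam ** ((i - 1 - a) + (j - 1 - b)) * val[(a, b)]
            nxt[(i, j)] = lam ** 2 * total
        val = nxt
    return sum(val.values())
```

The inner loop is exactly the part that a range query structure is meant to replace.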

  21. List and Layered Range Tree based Approach:

      | $K^S_1$ | g | a | t | t | a |
      |---|---|---|---|---|---|
      | c | 0 | 0 | 0 | 0 | 0 |
      | a | 0 | $\lambda^2$ | 0 | 0 | $\lambda^2$ |
      | t | 0 | 0 | $\lambda^2$ | $\lambda^2$ | 0 |
      | a | 0 | $\lambda^2$ | 0 | 0 | $\lambda^2$ |

  22. List and Layered Range Tree based Approach: ⇒ The computation of $K^S_p(s, t)$ can be interpreted as a set of orthogonal range queries. Several data structures from computational geometry can answer such queries. Kd-tree: the time cost is $O(p\,(|L| \sqrt{|L|} + K))$, where $K$ is the total number of reported points. Range tree: better query time for rectangular range queries.
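
The range query view can be illustrated on the example list. The sketch below is a deliberately simple stand-in, neither a kd-tree nor a range tree: it sorts the points, uses binary search on the first coordinate, and filters linearly on the second. The name `dominated_points` is hypothetical; it reports the matches lying strictly below and to the left of a query point $(i, j)$, which are the ones that contribute to the suffix kernel at $(i, j)$.

```python
from bisect import bisect_left


def dominated_points(points, i, j):
    """Report the points (x, y) with x < i and y < j."""
    by_x = sorted(points)
    xs = [x for x, _ in by_x]
    cut = bisect_left(xs, i)  # index of the first point with x >= i
    return [(x, y) for x, y in by_x[:cut] if y < j]


L = [(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)]
print(dominated_points(L, 5, 4))
# [(2, 2), (3, 3), (4, 3)]
```

A real range tree avoids the linear filter on the second coordinate, which is where its better query time comes from.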

  24. Suffix Table Representation: $L(\text{gatta}, \text{cata}) = \{(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)\}$.
