Ranked Subsequence Matching in Time-Series Databases

Wook-Shin Han (Kyungpook National University, Korea)
Jinsoo Lee (Kyungpook National University, Korea)
Yang-Sae Moon (Kangwon National University, Korea)
Haifeng Jiang (Google Inc., USA)

Contents
• Introduction
• Overview of DTW and Existing Lower Bounds
• Basic Ranked Subsequence Matching Algorithms
• Minimum-Distance Matching-Window Pair (MDMWP) and mdmwp-Distance Based Pruning
• Deferred Group Subsequence Retrieval
• Performance Evaluation
• Conclusions

Time-Series Databases [AFS93, FRM94, MWL01]
• Time-series data
  - Sequences of values sampled at a fixed time interval
  - Examples: music data, stock prices, and network traffic data
• Time-series databases
  - Data sequence: time-series data stored in a database
  - Query sequence: time-series data given by a user for similarity search

Similarity Metric
• Similarity is measured as the distance between a data sequence and a given query sequence
• We use the dynamic time warping (DTW) distance [BC96, SC78]
  - One of the most robust similarity measures
  - Widely used in applications such as query by humming [ZS03], image searching [BCP05], and speech recognition [RJ93]

Motivation
• Ranked subsequence matching under DTW
  - Finds the top-k subsequences most similar to a query sequence from the data sequences under DTW
• All existing methods were developed only for either whole matching or range subsequence matching

Contributions
• Propose the first approach for ranked subsequence matching
• Propose the concept of the minimum-distance matching-window pair (MDMWP) and pruning with the mdmwp-distance
• Propose deferred group subsequence retrieval along with another lower bound, the window-group distance
• Show the efficiency of the proposed methods using many real and synthetic datasets

Review of DTW
[Figure: Sakoe-Chiba band and warping width]
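A minimal sketch of DTW restricted to a Sakoe-Chiba band may help make the slide concrete. The function name `dtw_band`, the use of squared point differences with a final square root, and the equal-length restriction are assumptions for illustration and are not taken from the slides.

```python
import math

def dtw_band(q, s, rho):
    """DTW distance between equal-length sequences q and s, with the
    warping path restricted to |i - j| <= rho (Sakoe-Chiba band).
    Assumption: squared point differences, square root at the end."""
    n = len(q)
    if len(s) != n:
        raise ValueError("this sketch assumes equal-length sequences")
    inf = math.inf
    # cost[i][j] = best cumulative squared cost aligning q[:i] with s[:j]
    cost = [[inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - rho), min(n, i + rho) + 1):
            d = (q[i - 1] - s[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # q advances
                                 cost[i][j - 1],      # s advances
                                 cost[i - 1][j - 1])  # both advance
    return math.sqrt(cost[n][n])
```

With rho = 0 the band collapses to the diagonal and the result is the Euclidean distance; a wider band allows more warping at a higher cost per comparison.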

Query Envelope [Keo02, ZS03]
[Figure: query Q with upper envelope U and lower envelope L]
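As a rough illustration of the envelope, the sketch below takes U and L as the running maximum and minimum of Q over a window of plus or minus the warping width. The function name `envelope` and the clipping of the window at the sequence boundaries are assumptions.

```python
def envelope(q, rho):
    """Return (upper, lower) envelopes of q for warping width rho.
    upper[i] is the max and lower[i] the min of q over [i - rho, i + rho],
    clipped to the valid index range (an assumption of this sketch)."""
    n = len(q)
    upper = [max(q[max(0, i - rho):min(n, i + rho + 1)]) for i in range(n)]
    lower = [min(q[max(0, i - rho):min(n, i + rho + 1)]) for i in range(n)]
    return upper, lower
```

For example, `envelope([1, 3, 2, 5, 4], 1)` yields the upper envelope `[3, 3, 5, 5, 5]` and the lower envelope `[1, 1, 2, 2, 4]`.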

LB_Keogh [Keo02]
• Distance between a query envelope E(Q) and a data sequence S
• Lower-bounding distance under DTW at the sequence level
[Figure: S and Q]
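The bound can be sketched as follows: only the parts of S that fall outside the envelope contribute to the distance. The signature (precomputed envelopes passed in as lists) and the Euclidean form are assumptions made for this sketch.

```python
import math

def lb_keogh(upper, lower, s):
    """Distance between a query envelope (upper, lower) and a data
    sequence s of the same length. Points inside the envelope add
    nothing; points outside add their squared gap to the envelope."""
    total = 0.0
    for u, l, x in zip(upper, lower, s):
        if x > u:
            total += (x - u) ** 2
        elif x < l:
            total += (l - x) ** 2
    return math.sqrt(total)
```

Because the envelope covers every value the query can be warped to within the band, this distance never exceeds the banded DTW distance, which is what makes it usable for pruning.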

Piecewise Aggregate Approximation (PAA) [YF00, Keo02]
• Dimension reduction: N dimensions → f dimensions
[Figure: sequence S and PAA(S)]
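A short sketch of the reduction from N to f dimensions: the sequence is cut into f equal segments and each segment is replaced by its mean. The function name `paa` and the requirement that f divides N are assumptions of this sketch.

```python
def paa(s, f):
    """Piecewise aggregate approximation: reduce a length-N sequence
    to f values, each the mean of one of f equal-length segments."""
    n = len(s)
    if f <= 0 or n % f != 0:
        raise ValueError("this sketch assumes f divides len(s)")
    seg = n // f
    return [sum(s[i * seg:(i + 1) * seg]) / seg for i in range(f)]
```

For example, `paa([1, 3, 2, 4, 6, 8], 3)` gives `[2.0, 3.0, 7.0]`.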

PAA(ENV(Q))
[Figure: Q with PAA(U) and PAA(L)]

LB_PAA [ZS03]
• Distance between the PAA of the query envelope, P(E(Q)), and the PAA of the data sequence, P(S)
• Lower-bounding distance under DTW at the index level
[Figure: S and Q]
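The index-level bound can be sketched in the same way as the sequence-level one, but on PAA coefficients, with each coefficient weighted by the segment length N/f. The signature and the Euclidean form below are assumptions for illustration.

```python
import math

def lb_paa(paa_upper, paa_lower, paa_s, n):
    """Distance between the PAA of a query envelope (paa_upper,
    paa_lower) and the PAA of a data sequence (paa_s). n is the
    original sequence length; every coefficient stands for n / f
    points, so its squared gap is weighted by n / f."""
    f = len(paa_s)
    weight = n / f
    total = 0.0
    for u, l, x in zip(paa_upper, paa_lower, paa_s):
        if x > u:
            total += weight * (x - u) ** 2
        elif x < l:
            total += weight * (l - x) ** 2
    return math.sqrt(total)
```

Since it works on f values instead of N, this bound can be evaluated against index entries without reading the underlying sequence.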

Lower Boundness of the Two Distances for Whole Matching [Keo02, ZS03]
Lemma 1. Given two subsequences Q and S of the same length and a warping width ρ, the following inequality holds:
$DTW_\rho(Q, S) \ge LB\_Keogh(E(Q), S) \ge LB\_PAA(P(E(Q)), P(S))$
• We can exploit these lower bounds whenever pruning is possible at the index level or at the sequence level.

Related Work
• Range whole matching [AFS93]
• Ranked whole matching
  - Under Euclidean distance [Keo01, Cha03]
  - Under DTW [Keo02]
• Range subsequence matching
  - FRM: divides a data sequence into sliding windows and a query sequence into disjoint windows [FRM94]
  - Dual Match: the dual approach of FRM [MWL01]
  - General Match [MWH02]

Two Basic Algorithms for Ranked Subsequence Matching
• DualMatchTopK
  - Applies the window construction mechanism of DualMatch [MWL01] to the ranked whole matching algorithm [Cha03, Keo02]
• RangeTopK
  - Obtains the top-k entries at the index level using DualMatchTopK, and derives an upper bound ε by retrieving the data subsequences corresponding to those entries
  - Then finds the top-k subsequences using the range subsequence matching algorithm with ε

[Figure: pruning at the index level and pruning at the sequence level]

Example: DualMatchTopK walkthrough
[Animation over an index with RootNode, child nodes R1 and R2, and leaf entries s1 to s4; the query envelope E(Q) is divided into windows E(q1), E(q2), E(q3), …, E(q8). The recoverable steps are listed below.]
1. The priority queue is initialized with <RootNode, 0, q1, -1, -1>, <RootNode, 0, q2, -1, -1>, <RootNode, 0, q3, -1, -1>, …, <RootNode, 0, q8, -1, -1>; δcur = ∞.
2. The top entry <RootNode, 0, q1, -1, -1> is taken from the queue.
3. MINDIST(P(E(q1)), R1) = 1.3 and MINDIST(P(E(q1)), R2) = 3.2.
4. <R1, 1.3, q1, -1, -1> and <R2, 3.2, q1, -1, -1> are inserted into the queue.
5. The top entry is now <R1, 1.3, q1, -1, -1>; δcur is shown as 5.3.
6. LB_PAA(P(E(q1)), s1) = 6.5 and LB_PAA(P(E(q1)), s2) = 4.0.
7. Since 6.5 > δcur, s1 is pruned.
8. <s2, 4.0, q1, 3, 8> is inserted into the queue.
9. The top entry is now <s2, 4.0, q1, 3, 8>.
10. The entry identifies a candidate by sid and offset: sid 3, offset 8.
11. For the candidate subsequence D3[8:8+Len(Q)-1], LB_Keogh(E(Q), D3[8:8+Len(Q)-1]) = 5.0 < δcur.
12. DTW_ρ(Q, D3[8:8+Len(Q)-1]) = 5.2 < δcur.
13. <D3[8:8+Len(Q)-1], 5.2, -1, 3, 8> is inserted into the queue.
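The filter-and-refine pattern in this example (cheap index-level bound first, then LB_Keogh, then exact DTW, with δcur tightening as results arrive) can be sketched in a simplified form. This is not DualMatchTopK itself: there is no index tree and no window construction, and the names `topk_best_first`, `seq_lower_bound`, and `exact_distance` are hypothetical. Candidates are assumed to arrive with an index-level lower bound already attached.

```python
import heapq

def topk_best_first(candidates, k, seq_lower_bound, exact_distance):
    """Simplified best-first top-k search with two lower-bound filters.
    candidates: iterable of (index_lb, item) pairs.
    seq_lower_bound, exact_distance: hypothetical callbacks on an item."""
    # Min-heap on the index-level lower bound; n breaks ties so that
    # items themselves never need to be comparable.
    queue = [(lb, n, item) for n, (lb, item) in enumerate(candidates)]
    heapq.heapify(queue)
    best = []                    # max-heap via negated distances
    delta_cur = float("inf")     # k-th best exact distance so far
    while queue:
        lb, n, item = heapq.heappop(queue)
        if lb > delta_cur:
            break                # all remaining bounds are larger too
        if seq_lower_bound(item) > delta_cur:
            continue             # pruned at the sequence level
        d = exact_distance(item)
        if d < delta_cur:
            heapq.heappush(best, (-d, n, item))
            if len(best) > k:
                heapq.heappop(best)      # drop the current worst
            if len(best) == k:
                delta_cur = -best[0][0]  # tighten the threshold
    return sorted(((-nd, item) for nd, _, item in best), key=lambda t: t[0])
```

Ordering the queue by lower bound is what allows the early `break`: once the smallest remaining bound exceeds δcur, no later entry can enter the top k.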

Comments on DualMatchTopK
• Many unnecessary subsequences are likely to be retrieved because the lower bound is loose
• To solve this problem, we propose an approach that prunes the index search space using the novel notion of the minimum-distance matching-window pair

Minimum-Distance Matching-Window Pair
[Figure: subsequence S[i:j] of S with windows s1 to s4, and query envelope windows E(q1) to E(q4) (U, L) of Q, each of width ω; the LB_PAA(P(E(q_i)), P(s_i)) values are labeled = 9.2, = 11.2, = 7.1, = 6.9]

MDMWP Distance
• Suppose that the MDMWP of P(E(Q)) and P(S[i:j]) is (P(E(q_m)), P(s_m))
• mdmwp-distance = [formula not preserved in the slide text]
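One plausible reading, offered only as a sketch: if the MDMWP is the window pair with the smallest LB_PAA, then every one of the r matching window pairs contributes at least that minimum, so (r × min^p)^(1/p) cannot exceed the p-norm combination of all r window distances. This form is an assumption inferred from the shape of the window-group distance example in this deck, not a formula recovered from this slide. The function name `mdmwp_distance` is hypothetical.

```python
def mdmwp_distance(window_lbs, p=2):
    """Assumed form of the mdmwp-distance: scale the smallest
    window-pair LB_PAA to all r window pairs under the L_p norm.
    window_lbs: LB_PAA values of the r matching window pairs."""
    if not window_lbs:
        raise ValueError("at least one window pair is required")
    r = len(window_lbs)
    smallest = min(window_lbs)   # distance of the assumed MDMWP
    return (r * smallest ** p) ** (1.0 / p)
```

With the four window values from the preceding figure (9.2, 11.2, 7.1, 6.9) and p = 2, this assumed form gives about 13.8, which stays below the p-norm combination of all four values.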

Lower Boundness of MDMWP-distance
• The algorithm that incorporates mdmwp-distance based pruning into DualMatchTopK is called AdvTopK

Correctness of AdvTopK

Deferred Group Subsequence Retrieval
• I/O optimization over AdvTopK
  - Avoids excessive random disk I/Os
  - Maximizes buffer utilization
• Delays a fixed-size set of subsequence retrieval requests and enables batch retrieval in a sequential access manner
• Introduces the group subsequence access list, which stores all requests delayed until the next bulk access
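The batching idea can be sketched as a small buffer that collects retrieval requests and, once a fixed group size is reached, serves them in (sid, offset) order so that disk access is closer to sequential. The class name `DeferredRetriever`, the `fetch` callback, and the sort-by-position policy are assumptions for illustration; the slides do not specify the access list's layout.

```python
class DeferredRetriever:
    """Delay subsequence retrieval requests and serve them in batches."""

    def __init__(self, group_size, fetch):
        self.group_size = group_size
        self.fetch = fetch       # hypothetical callback: (sid, offset) -> data
        self.pending = []        # stands in for the group subsequence access list

    def request(self, sid, offset):
        """Queue a request; return fetched results only when a batch fires."""
        self.pending.append((sid, offset))
        if len(self.pending) >= self.group_size:
            return self.flush()
        return []

    def flush(self):
        """Fetch all pending requests in (sid, offset) order."""
        batch = sorted(set(self.pending))   # dedupe, then order by position
        self.pending.clear()
        return [(sid, off, self.fetch(sid, off)) for sid, off in batch]
```

A caller would invoke `flush()` once more at the end of the search so that requests left in a partially filled group are not lost.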

Example of Group Subsequence Access List
[Figure: group subsequence access list, with labels for window, request, and group]

Window-Group Distance
• Derived by exploiting both the delayed matching windows in each group and the largest distance in the group subsequence access list
[Figure: subsequence S[i:j] of S with windows s1 to s4, and query envelope windows E(q1) to E(q4) (U, L) of Q; the LB_PAA(P(E(q_i)), P(s_i)) values are labeled = 27, = 11, ≥ 38, ≥ 38]
$\text{WG-dist}(P(E(Q)), P(S[i:j])) = \sqrt[p]{11^p + 27^p + 38^p \times (4 - 2)}$
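The arithmetic of this example can be sketched as follows, under one reading of the slide: windows whose LB_PAA is known contribute their own value, and each remaining window contributes the largest distance in the access list, since its distance is only known to be at least that large. The function name `wg_distance` and its parameters are hypothetical.

```python
def wg_distance(known_lbs, max_list_distance, r, p=2):
    """Window-group distance as read from the slide's example.
    known_lbs: LB_PAA values of the delayed matching windows in the group.
    max_list_distance: largest distance in the group subsequence access list.
    r: total number of matching window pairs."""
    unknown = r - len(known_lbs)
    if unknown < 0:
        raise ValueError("r must be at least the number of known windows")
    total = sum(lb ** p for lb in known_lbs)
    total += unknown * max_list_distance ** p   # bound for unseen windows
    return total ** (1.0 / p)
```

With the slide's values, `wg_distance([11, 27], 38, 4, p=1)` evaluates 11 + 27 + 38 × (4 − 2) = 114.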

Experimental Setup
• Algorithms compared
  - DualMatchTopK, RangeTopK, AdvTopK, DeferredTopK
  - SeqTopK: sequential-scan-based algorithm exploiting LB_Keogh
• Datasets used
  - UCR-DATA (33 datasets of different characteristics from the UCR time-series archive, 1,055,525 entries)
  - WALK-DATA (random walk data consisting of one million entries)
  - STOCK-DATA (real dataset consisting of 329,112 entries)
  - MUSIC-DATA (pitch dataset consisting of 2,373,120 entries extracted from 500 MIDI files)
• Environment: PC running Linux kernel 2.6 with 512 MB of RAM and a 2.8 GHz Pentium IV CPU

• Experimental parameters [table not preserved in the slide text]

Effect of k Using UCR-DATA
• In terms of the number of candidates, AdvTopK and DeferredTopK significantly outperform RangeTopK and SeqTopK due to MDMWP-distance and WG-distance based pruning.
• In terms of the number of page accesses, for small k, all index-based algorithms perform much better than SeqTopK and RangeTopK. As k increases, the number of page accesses of all the index-based algorithms increases.
• We see similar trends in terms of wall clock time.

Effect of Buffer Size Using UCR-DATA
• As the buffer size increases, both the number of page accesses and the wall clock time decrease for all the index-based algorithms.
• DeferredTopK shows almost constant performance, and much better performance with a very small buffer size.
