Chapter 7-2: Discrete Sequential Data
Jilles Vreeken
IRDM ‘15/16, 26 Nov 2015
IRDM Chapter 7, overview
Time Series
  1. Basic Ideas
  2. Prediction
  3. Motif Discovery
Discrete Sequences
  4. Basic Ideas
  5. Pattern Discovery
  6. Hidden Markov Models
You’ll find this covered in Aggarwal Ch. 3.4, 14, 15
Chapter 7.3, ctd: Motif Discovery
Aggarwal Ch. 14.4, 3.4
Dynamic Time Warping
DTW stretches the time axis of one series to enable better matches
(Aggarwal Ch. 3.4)
DTW, formally
Let $\mathrm{DTW}(i, j)$ be the optimal distance between the first $i$ and first $j$ elements of time series $X$ of length $m$ and $Y$ of length $n$
$$\mathrm{DTW}(i, j) = \mathit{distance}(X_i, Y_j) + \min \begin{cases} \mathrm{DTW}(i, j-1) & \text{repeat } x_i \\ \mathrm{DTW}(i-1, j) & \text{repeat } y_j \\ \mathrm{DTW}(i-1, j-1) & \text{repeat neither} \end{cases}$$
We initialise as follows
  $\mathrm{DTW}(0, 0) = 0$
  $\mathrm{DTW}(0, j) = \infty$ for all $j \in \{1, \ldots, n\}$
  $\mathrm{DTW}(i, 0) = \infty$ for all $i \in \{1, \ldots, m\}$
We can then simply iterate by increasing $i$ and $j$
(Aggarwal Ch. 3.4)
Computing DTW (1)
Let $\mathrm{DTW}(i, j)$ be the optimal distance between the first $i$ and first $j$ elements of time series $X$ of length $m$ and $Y$ of length $n$
$$\mathrm{DTW}(i, j) = \mathit{distance}(X_i, Y_j) + \min \begin{cases} \mathrm{DTW}(i, j-1) & \text{repeat } x_i \\ \mathrm{DTW}(i-1, j) & \text{repeat } y_j \\ \mathrm{DTW}(i-1, j-1) & \text{repeat neither} \end{cases}$$
From the initialised values, we can simply iterate by increasing $i$ and $j$:
  for $i = 1$ to $m$
    for $j = 1$ to $n$
      compute $\mathrm{DTW}(i, j)$
We can also compute it recursively, by dynamic programming.
Both naïve strategies cost $O(nm)$, however.
(Aggarwal Ch. 3.4)
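The iteration above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `dtw` and the default pointwise distance (absolute difference) are assumptions, since the slide leaves the distance function open.

```python
import math


def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Fill the DTW table bottom-up; `dist` is an assumed pointwise distance."""
    m, n = len(x), len(y)
    # table[i][j] holds DTW(i, j); row 0 and column 0 hold the initial values
    table = [[math.inf] * (n + 1) for _ in range(m + 1)]
    table[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = dist(x[i - 1], y[j - 1]) + min(
                table[i][j - 1],      # repeat x_i
                table[i - 1][j],      # repeat y_j
                table[i - 1][j - 1],  # repeat neither
            )
    return table[m][n]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0, because the second series only repeats a value of the first. The table has $(m+1)(n+1)$ cells, which is where the $O(nm)$ cost comes from.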
Computing DTW (2)
Let $\mathrm{DTW}(i, j)$ be the optimal distance between the first $i$ elements of time series $X$ of length $m$ and the first $j$ elements of time series $Y$ of length $n$
$$\mathrm{DTW}(i, j) = \mathit{distance}(X_i, Y_j) + \min \begin{cases} \mathrm{DTW}(i, j-1) & \text{repeat } x_i \\ \mathrm{DTW}(i-1, j) & \text{repeat } y_j \\ \mathrm{DTW}(i-1, j-1) & \text{repeat neither} \end{cases}$$
We can speed up computation by imposing constraints.
  e.g. a window constraint, to compute $\mathrm{DTW}(i, j)$ only when $|i - j| \le w$
  the inner loop then only needs to run over $j$ from $\max\{0, i - w\}$ to $\min\{n, i + w\}$
(Aggarwal Ch. 3.4)
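A window constraint changes only the bounds of the inner loop. The sketch below makes the same assumptions as before (hypothetical function name, absolute difference as pointwise distance) and starts the inner loop at 1 rather than 0, because column 0 only holds initial values.

```python
import math


def dtw_window(x, y, w, dist=lambda a, b: abs(a - b)):
    """DTW restricted to cells with |i - j| <= w; other cells stay infinite."""
    m, n = len(x), len(y)
    table = [[math.inf] * (n + 1) for _ in range(m + 1)]
    table[0][0] = 0.0
    for i in range(1, m + 1):
        # only visit the band of width w around the diagonal
        for j in range(max(1, i - w), min(n, i + w) + 1):
            table[i][j] = dist(x[i - 1], y[j - 1]) + min(
                table[i][j - 1],      # repeat x_i
                table[i - 1][j],      # repeat y_j
                table[i - 1][j - 1],  # repeat neither
            )
    return table[m][n]
```

Each row now costs at most $2w + 1$ cell updates instead of $n$. If the lengths differ by more than $w$, the final cell is never reached and the function returns infinity.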
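Lower bounds on DTW
Even smarter is to speed up DTW using a lower bound.
$$\mathit{LB\_Keogh}(X, Y) = \sum_{i=1}^{n} \begin{cases} (Y_i - U_i)^2 & \text{if } Y_i > U_i \\ (Y_i - L_i)^2 & \text{if } Y_i < L_i \\ 0 & \text{otherwise} \end{cases}$$
$$U_i = \max\{X_{i-r} : X_{i+r}\} \qquad L_i = \min\{X_{i-r} : X_{i+r}\}$$
where $r$ is the reach, the allowed range of warping
[Figure: series $Y$ shown against $X$ and its upper envelope $U$ and lower envelope $L$]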
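A direct transcription of this bound might look as follows. It is a sketch under two assumptions that the slide does not state: both series have the same length, and the envelope is clipped at the series boundaries. The function name `lb_keogh` is chosen for illustration.

```python
def lb_keogh(x, y, r):
    """Sum of squared excursions of y outside the envelope [L_i, U_i] of x."""
    total = 0.0
    for i, y_i in enumerate(y):
        # envelope over x[i - r .. i + r], clipped at the ends (assumption)
        lo, hi = max(0, i - r), min(len(x), i + r + 1)
        upper = max(x[lo:hi])
        lower = min(x[lo:hi])
        if y_i > upper:
            total += (y_i - upper) ** 2
        elif y_i < lower:
            total += (y_i - lower) ** 2
        # otherwise y_i lies inside the envelope and contributes 0
    return total
```

The bound costs one pass over the series per comparison, so it can be used to discard candidates before running the quadratic DTW computation on the remaining ones.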
Discrete Sequences
Chapter 7.4: Basic Ideas
Aggarwal Ch. 14.1-14.2
Trouble in Time Series Paradise
Continuous real-valued time series have their downsides
  mining results rely on either a distance function or assumptions
  indexing, pattern mining, summarisation, clustering, classification, and outlier detection results hence rely on arbitrary choices
Discrete sequences are often easier to deal with
  mining results rely mostly on counting
How to transform a time series into an event sequence?
  discretisation
Approximating a Time Series
(Lin et al. 2002, 2007)
SAX
Symbolic Aggregate Approximation (SAX)
  most well-known approach to discretise a time series
  type of piecewise aggregate approximation (PAA)
How to do SAX
  divide the data into $w$ frames
  compute the mean per frame
  perform equal-height binning over the means, to obtain an alphabet of $a$ characters
(Lin et al. 2002, 2007)
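The three steps can be sketched as below. The sketch follows the simplified description on this slide (equal-height, i.e. equal-frequency, binning over the frame means); SAX as published by Lin et al. first z-normalises the series and uses breakpoints derived from a Gaussian instead. The function name `sax` and the use of the letters a, b, c, ... as symbols are assumptions for illustration.

```python
def sax(series, w, a):
    """Discretise `series` into a word of w symbols over an alphabet of size a."""
    n = len(series)
    if not (1 <= w <= n and 1 <= a <= 26):
        raise ValueError("need 1 <= w <= len(series) and 1 <= a <= 26")
    # 1. divide the data into w frames of (nearly) equal length
    bounds = [k * n // w for k in range(w + 1)]
    # 2. compute the mean per frame
    means = [sum(series[lo:hi]) / (hi - lo) for lo, hi in zip(bounds, bounds[1:])]
    # 3. equal-height binning: rank the means, put equally many ranks in each bin
    order = sorted(range(w), key=lambda k: means[k])
    word = [""] * w
    for rank, k in enumerate(order):
        word[k] = chr(ord("a") + rank * a // w)
    return "".join(word)
```

For instance, `sax([1, 1, 2, 2, 9, 9, 5, 5], w=4, a=2)` gives `"aabb"`: the frame means are 1, 2, 9 and 5, and the two lowest means map to `a`. Binning by rank means that tied means can end up in different bins, which a more careful version would have to handle.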
Definitions
A discrete sequence $X_1 \ldots X_n$ of length $n$ and dimensionality $d$ contains $d$ discrete feature values at each of $n$ different timestamps $t_1 \ldots t_n$.
Each of the $n$ components $X_i$ contains $d$ discrete behavioral attributes $(x_i^1 \ldots x_i^d)$ collected at the $i$-th timestamp.
The actual timestamps are usually ignored: they only induce an order on the components, or events.
Types of discrete sequences
In many applications, the dimensionality is 1
  e.g. strings, such as text or genomes
  for AATCGTAC over an alphabet $\Sigma = \{A, C, G, T\}$, each $X_i \in \Sigma$
In some applications, each $X_i$ is not a vector, but a set
  e.g. a supermarket transaction, $X_i \subseteq \Sigma$
  there is no order within $X_i$
We will consider the set-setting, as it is most general
Chapter 7.5: Frequent Patterns
Aggarwal Ch. 15.2
Sequential patterns
A sequential pattern is a sequence.
  to occur in the data, it has to be a subsequence of the data.
  $\mathcal{X}$ = a b a a b b a b d c a d b a a b c a
  $\mathcal{Z}$ = a b
Definition: Given two sequences $\mathcal{X} = X_1 \ldots X_n$ and $\mathcal{Z} = Z_1 \ldots Z_k$, where all elements $X_i$ and $Z_i$ in the sequences are sets. Then, the sequence $\mathcal{Z}$ is a subsequence of $\mathcal{X}$ if $k$ elements $X_{i_1} \ldots X_{i_k}$ can be found in $\mathcal{X}$, such that $i_1 < i_2 < \cdots < i_k$ and $Z_j \subseteq X_{i_j}$ for each $j \in \{1 \ldots k\}$
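The definition translates into a simple greedy check: match each pattern element to the earliest remaining data element that contains it. The sketch below assumes sequences are represented as Python lists of sets; the function name `is_subsequence` is chosen for illustration.

```python
def is_subsequence(pattern, sequence):
    """Return True if pattern Z_1..Z_k is a subsequence of X_1..X_n (lists of sets)."""
    pos = 0
    for z in pattern:
        # advance to the first remaining element X_i with Z_j a subset of X_i
        while pos < len(sequence) and not z <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False  # ran out of data before matching all of the pattern
        pos += 1  # the matched indices must be strictly increasing
    return True
```

With the example above, where every element is a singleton set, `is_subsequence([{"a"}, {"b"}], [{c} for c in "abaabbabdcadbaabca"])` is `True`. Matching greedily is safe here, because taking the earliest possible match never rules out a later one.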
Support
Depending on whether we have a database $\mathbf{D}$ of sequences, or a single long sequence, we have to define the support of a sequential pattern differently.
Standard, or ‘per sequence’ support counting
  given a database $\mathbf{D} = \{\mathcal{X}_1, \ldots, \mathcal{X}_N\}$, the support of a subsequence $\mathcal{Z}$ is the number of sequences in $\mathbf{D}$ that contain $\mathcal{Z}$.
Window-based support counting
  given a single sequence $\mathcal{X}$, the support of a subsequence $\mathcal{Z}$ is the number of windows over $\mathcal{X}$ that contain $\mathcal{Z}$.
(analogously, we can define frequency as relative support)
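Per-sequence support counting can be sketched as a loop over the database. As an assumption for illustration, sequences are lists of sets, and the helper names `contains` and `support` are hypothetical.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (both lists of sets)."""
    remaining = iter(sequence)
    # each pattern element must be contained in some later element of the data;
    # the shared iterator guarantees that matches occur in increasing order
    return all(any(z <= x for x in remaining) for z in pattern)


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(sequence, pattern) for sequence in database)
```

On a toy database of the four strings `abc`, `acb`, `bca` and `aab`, with every character turned into a singleton set, the pattern a b has support 3; dividing by the database size gives the relative support, here 0.75.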
Windows
A window $\mathcal{X}[s; e]$ is a strict (contiguous) subsequence of sequence $\mathcal{X}$.
$$\mathcal{X}[s; e] = \langle X_i \in \mathcal{X} \mid s \le i \le e \rangle$$
  $\mathcal{X}$ = a b d c a d b a a b c a d a b a b c
  $\mathcal{Z}$ = a b
Window-based support counting
  we can choose a window length $w$, and sweep over the data
  the support now depends on $w$: what happens with longer $w$?
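Sweeping a window over a single sequence can be sketched as follows. The sketch counts only full windows of exactly $w$ consecutive elements, which is an assumption; other definitions also count partial windows at both ends of the sequence. The function name `window_support` is hypothetical.

```python
def window_support(sequence, pattern, w):
    """Count the length-w windows over `sequence` that contain `pattern`."""
    count = 0
    for start in range(len(sequence) - w + 1):
        window = iter(sequence[start:start + w])
        # same in-order containment test as for whole sequences
        if all(any(z <= x for x in window) for z in pattern):
            count += 1
    return count


data = [{c} for c in "abdcadbaabcadababc"]  # the example sequence above
pattern = [{"a"}, {"b"}]
print(window_support(data, pattern, 2), window_support(data, pattern, 3))
```

Under these assumptions the pattern a b has support 4 for $w = 2$ and support 8 for $w = 3$ on the example sequence. Longer windows allow larger gaps between a and b, so the same occurrence is counted in more windows.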