Chapter 7-2: Discrete Sequential Data



  1. Chapter 7-2: Discrete Sequential Data. Jilles Vreeken, IRDM ‘15/16, 26 Nov 2015

  2. IRDM Chapter 7, overview
     - Time Series: 1. Basic Ideas, 2. Prediction, 3. Motif Discovery
     - Discrete Sequences: 4. Basic Ideas, 5. Pattern Discovery, 6. Hidden Markov Models
     You’ll find this covered in Aggarwal Ch. 3.4, 14, 15

  3. IRDM Chapter 7, today
     - Time Series: 1. Basic Ideas, 2. Prediction, 3. Motif Discovery
     - Discrete Sequences: 4. Basic Ideas, 5. Pattern Discovery, 6. Hidden Markov Models
     You’ll find this covered in Aggarwal Ch. 3.4, 14, 15

  4. Chapter 7.3, ctd: Motif Discovery (Aggarwal Ch. 14.4, 3.4)

  5. Dynamic Time Warping. DTW stretches the time axis of one series to enable better matches (Aggarwal Ch. 3.4)

  6. DTW, formally. Let $\mathrm{DTW}(i, j)$ be the optimal distance between the first $i$ elements of time series $X$ of length $m$ and the first $j$ elements of time series $Y$ of length $n$:

     $$\mathrm{DTW}(i, j) = \mathrm{distance}(X_i, Y_j) + \min \begin{cases} \mathrm{DTW}(i, j-1) & \text{repeat } x_i \\ \mathrm{DTW}(i-1, j) & \text{repeat } y_j \\ \mathrm{DTW}(i-1, j-1) & \text{repeat neither} \end{cases}$$

     We initialise as follows:
     - $\mathrm{DTW}(0, 0) = 0$
     - $\mathrm{DTW}(0, j) = \infty$ for all $j \in \{1, \ldots, n\}$
     - $\mathrm{DTW}(i, 0) = \infty$ for all $i \in \{1, \ldots, m\}$

     We can then simply iterate by increasing $i$ and $j$ (Aggarwal Ch. 3.4)

  7. Computing DTW (1). Let $\mathrm{DTW}(i, j)$ be the optimal distance between the first $i$ elements of time series $X$ of length $m$ and the first $j$ elements of time series $Y$ of length $n$:

     $$\mathrm{DTW}(i, j) = \mathrm{distance}(X_i, Y_j) + \min \begin{cases} \mathrm{DTW}(i, j-1) & \text{repeat } x_i \\ \mathrm{DTW}(i-1, j) & \text{repeat } y_j \\ \mathrm{DTW}(i-1, j-1) & \text{repeat neither} \end{cases}$$

     From the initialised values, we can simply iterate by increasing $i$ and $j$:

         for i = 1 to m
           for j = 1 to n
             compute DTW(i, j)

     We can also compute it recursively, by dynamic programming. Both naïve strategies cost $O(nm)$, however. (Aggarwal Ch. 3.4)
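As a minimal sketch of the bottom-up iteration, the following Python function fills the table row by row. The name `dtw` and the squared-difference default for `distance` are assumptions made for illustration; the slides do not fix a particular distance.

```python
import math


def dtw(x, y, distance=lambda a, b: (a - b) ** 2):
    """Return DTW(m, n) for sequences x (length m) and y (length n).

    `distance` defaults to the squared difference, which is an assumption.
    """
    m, n = len(x), len(y)
    # table[i][j] holds DTW(i, j); row 0 and column 0 are the initial values.
    table = [[math.inf] * (n + 1) for _ in range(m + 1)]
    table[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best_previous = min(
                table[i][j - 1],      # repeat x_i
                table[i - 1][j],      # repeat y_j
                table[i - 1][j - 1],  # repeat neither
            )
            table[i][j] = distance(x[i - 1], y[j - 1]) + best_previous
    return table[m][n]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0, because the warping lets the single 2 in the first series match both 2s in the second.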

  8. Computing DTW (2). Let $\mathrm{DTW}(i, j)$ be the optimal distance between the first $i$ elements of time series $X$ of length $m$ and the first $j$ elements of time series $Y$ of length $n$:

     $$\mathrm{DTW}(i, j) = \mathrm{distance}(X_i, Y_j) + \min \begin{cases} \mathrm{DTW}(i, j-1) & \text{repeat } x_i \\ \mathrm{DTW}(i-1, j) & \text{repeat } y_j \\ \mathrm{DTW}(i-1, j-1) & \text{repeat neither} \end{cases}$$

     We can speed up computation by imposing constraints.
     - e.g. a window constraint, to compute $\mathrm{DTW}(i, j)$ only when $|i - j| \le w$
     - the inner loop over $j$ then only needs to run from $\max\{0, i - w\}$ to $\min\{n, i + w\}$
     (Aggarwal Ch. 3.4)
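The window constraint can be sketched as follows, under the same assumptions as before (squared-difference `distance`, hypothetical function name `dtw_window`). Cells outside the band keep the value infinity, so the result is infinite whenever the two lengths differ by more than `w`.

```python
import math


def dtw_window(x, y, w, distance=lambda a, b: (a - b) ** 2):
    """DTW restricted to cells with |i - j| <= w (a band around the diagonal)."""
    m, n = len(x), len(y)
    table = [[math.inf] * (n + 1) for _ in range(m + 1)]
    table[0][0] = 0.0
    for i in range(1, m + 1):
        # Only visit j inside the window; column 0 is already initialised.
        for j in range(max(1, i - w), min(n, i + w) + 1):
            best_previous = min(
                table[i][j - 1],      # repeat x_i
                table[i - 1][j],      # repeat y_j
                table[i - 1][j - 1],  # repeat neither
            )
            table[i][j] = distance(x[i - 1], y[j - 1]) + best_previous
    return table[m][n]
```

With `w = 0` only the diagonal is computed, which reduces to a point-by-point comparison of two equal-length series.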

  9. Lower bounds on DTW. Even smarter is to speed up DTW using a lower bound.

     $$\mathrm{LB\_Keogh}(X, Y) = \sum_{i=1}^{n} \begin{cases} (Y_i - U_i)^2 & \text{if } Y_i > U_i \\ (Y_i - L_i)^2 & \text{if } Y_i < L_i \\ 0 & \text{otherwise} \end{cases}$$

     $$U_i = \max\{X_{i-r} : X_{i+r}\}, \qquad L_i = \min\{X_{i-r} : X_{i+r}\}$$

     where $r$ is the reach, the allowed range of warping. (Figure: series $Y$ plotted against the envelope $U$, $L$ around $X$.)
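A small sketch of this bound, assuming two series of equal length and the squared terms of the formula (no square root is taken). The function name `lb_keogh` and the truncation of the envelope at the series boundaries are assumptions.

```python
def lb_keogh(x, y, r):
    """Sum of squared excursions of y outside the envelope [L_i, U_i] around x."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes series of equal length")
    total = 0.0
    for i, y_i in enumerate(y):
        # Envelope over X_{i-r} .. X_{i+r}, truncated at the series boundaries.
        neighbourhood = x[max(0, i - r): i + r + 1]
        upper, lower = max(neighbourhood), min(neighbourhood)
        if y_i > upper:
            total += (y_i - upper) ** 2
        elif y_i < lower:
            total += (y_i - lower) ** 2
    return total
```

The bound is cheap to compute (linear in the series length for fixed `r`), so a candidate whose bound already exceeds the best distance found so far can be skipped without running the full DTW.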

  10. Discrete Sequences

  11. Chapter 7.4: Basic Ideas (Aggarwal Ch. 14.1-14.2)

  12. Trouble in Time Series Paradise
      Continuous real-valued time series have their downsides
      - mining results rely on either a distance function or assumptions
      - indexing, pattern mining, summarisation, clustering, classification, and outlier detection results hence rely on arbitrary choices
      Discrete sequences are often easier to deal with
      - mining results rely mostly on counting
      How to transform a time series into an event sequence?
      - discretisation

  13. Approximating a Time Series (Lin et al. 2002, 2007)

  14. SAX. Symbolic Aggregate Approximation (SAX)
      - most well-known approach to discretise a time series
      - type of piece-wise aggregated approximation (PAA)
      How to do SAX
      - divide the data into $w$ frames
      - compute the mean per frame
      - perform equal-height binning over the means, to obtain an alphabet of $a$ characters
      (Lin et al. 2002, 2007)
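The three steps can be sketched as below. This follows the slide literally and reads "equal-height binning" as equal-frequency binning over the frame means; the SAX of Lin et al. instead z-normalises the series and uses breakpoints derived from a standard normal distribution. The function name `sax` and the lowercase-letter alphabet are assumptions.

```python
def sax(series, w, a):
    """Discretise `series` into a string of w symbols over an alphabet of a letters.

    Assumes w <= len(series) and a <= 26.
    """
    n = len(series)
    # Step 1: split into w frames of (nearly) equal size.
    bounds = [k * n // w for k in range(w + 1)]
    # Step 2: piece-wise aggregate approximation, one mean per frame.
    means = [sum(series[s:e]) / (e - s) for s, e in zip(bounds, bounds[1:])]
    # Step 3: equal-frequency binning by rank, about w / a frames per symbol.
    # Ties are broken by position, so equal means can receive different symbols.
    order = sorted(range(w), key=lambda k: means[k])
    symbols = [""] * w
    for rank, k in enumerate(order):
        symbols[k] = chr(ord("a") + rank * a // w)
    return "".join(symbols)
```

For instance, `sax([1, 1, 5, 5, 3, 3, 9, 9], 4, 2)` has frame means 1, 5, 3, 9; the two lowest become `a` and the two highest `b`, giving `"abab"`.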

  15. Definitions. A discrete sequence $X_1 \ldots X_n$ of length $n$ and dimensionality $d$ contains $d$ discrete feature values at each of $n$ different timestamps $t_1 \ldots t_n$. Each of the $n$ components $X_i$ contains $d$ discrete behavioral attributes $(x_i^1 \ldots x_i^d)$ collected at the $i$-th timestamp. The actual timestamps are usually ignored; they only induce an order on the components, or events.

  16. Types of discrete sequences
      In many applications, the dimensionality is 1
      - e.g. strings, such as text or genomes
      - for AATCGTAC over an alphabet $\Sigma = \{A, C, G, T\}$, each $X_i \in \Sigma$
      In some applications, each $X_i$ is not a vector, but a set
      - e.g. a supermarket transaction, $X_i \subseteq \Sigma$
      - there is no order within $X_i$
      We will consider the set-setting, as it is most general

  17. Chapter 7.5: Frequent Patterns (Aggarwal Ch. 15.2)

  18. Sequential patterns. A sequential pattern is a sequence.
      - to occur in the data, it has to be a subsequence of the data
      Example: $\mathcal{X}$ = a b a a b b a b d c a d b a a b c a, $\mathcal{Z}$ = a b
      Definition: Given two sequences $\mathcal{X} = X_1 \ldots X_n$ and $\mathcal{Z} = Z_1 \ldots Z_k$, where all elements $X_i$ and $Z_i$ in the sequences are sets. Then the sequence $\mathcal{Z}$ is a subsequence of $\mathcal{X}$ if $k$ elements $X_{i_1} \ldots X_{i_k}$ can be found in $\mathcal{X}$ such that $i_1 < i_2 < \cdots < i_k$ and $Z_j \subseteq X_{i_j}$ for each $j \in \{1 \ldots k\}$
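The definition can be checked with a greedy left-to-right scan: matching each pattern element at the earliest possible position never rules out a later match. In this sketch sequences are Python lists of sets, and the name `is_subsequence` is an assumption.

```python
def is_subsequence(z, x):
    """True if pattern z (a list of sets) is a subsequence of x (a list of sets)."""
    matched = 0  # number of pattern elements matched so far
    for x_i in x:
        # Z_j must be contained in some X_i that comes after the previous match.
        if matched < len(z) and z[matched] <= x_i:
            matched += 1
    return matched == len(z)


# The example from the slide, with every event a singleton set.
x = [{c} for c in "abaabbabdcadbaabca"]
z = [{"a"}, {"b"}]
print(is_subsequence(z, x))
```

An element of the pattern may be a proper subset of the matching event, so `[{"a", "b"}, {"c"}]` occurs in `[{"a", "b", "d"}, {"e"}, {"c"}]`.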

  19. Support. Depending on whether we have a database $\boldsymbol{D}$ of sequences, or a single long sequence, we have to define the support of a sequential pattern differently.
      Standard, or ‘per sequence’ support counting
      - given a database $\boldsymbol{D} = \{\mathcal{X}_1, \ldots, \mathcal{X}_N\}$, the support of a subsequence $\mathcal{Z}$ is the number of sequences in $\boldsymbol{D}$ that contain $\mathcal{Z}$
      Window-based support counting
      - given a single sequence $\mathcal{X}$, the support of a subsequence $\mathcal{Z}$ is the number of windows over $\mathcal{X}$ that contain $\mathcal{Z}$
      (frequency can be defined analogously, as relative support)
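A sketch of ‘per sequence’ support counting, with the database as a list of sequences and each sequence a list of sets. The names `contains`, `support` and `frequency` are assumptions, and the toy database is made up for illustration.

```python
def contains(x, z):
    """True if pattern z is a subsequence of sequence x (both lists of sets)."""
    matched = 0
    for x_i in x:
        if matched < len(z) and z[matched] <= x_i:
            matched += 1
    return matched == len(z)


def support(database, z):
    """Number of sequences in the database that contain pattern z."""
    return sum(1 for x in database if contains(x, z))


def frequency(database, z):
    """Relative support: the fraction of sequences that contain z."""
    return support(database, z) / len(database)


# Hypothetical toy database of three sequences over singleton events.
database = [[{c} for c in s] for s in ("abc", "bca", "acb")]
print(support(database, [{"a"}, {"b"}]))
```

Each sequence counts at most once, no matter how often the pattern occurs inside it.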

  20. Windows. A window $\mathcal{X}[s; t]$ is a strict subsequence of sequence $\mathcal{X}$.
      $$\mathcal{X}[s; t] = \langle X_i \in \mathcal{X} \mid s \le i \le t \rangle$$
      Example: $\mathcal{X}$ = a b d c a d b a a b c a d a b a b c, $\mathcal{Z}$ = a b
      Window-based support counting
      - we can choose a window length $w$, and sweep over the data

  25. Windows. A window $\mathcal{X}[s; t]$ is a strict subsequence of sequence $\mathcal{X}$.
      $$\mathcal{X}[s; t] = \langle X_i \in \mathcal{X} \mid s \le i \le t \rangle$$
      Example: $\mathcal{X}$ = a b d c a d b a a b c a d a b a b c, $\mathcal{Z}$ = a b
      Window-based support counting
      - we can choose a window length $w$, and sweep over the data
      - support is now dependent on $w$; what happens with longer $w$?
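To see how the count depends on $w$, here is a minimal sketch of the sweep. It assumes that only full windows of length `w` lying entirely inside the sequence are counted, which is one possible convention; the names `contains` and `window_support` are assumptions.

```python
def contains(x, z):
    """True if pattern z is a subsequence of sequence x (both lists of sets)."""
    matched = 0
    for x_i in x:
        if matched < len(z) and z[matched] <= x_i:
            matched += 1
    return matched == len(z)


def window_support(x, z, w):
    """Number of length-w windows over x that contain pattern z."""
    return sum(1 for s in range(len(x) - w + 1) if contains(x[s:s + w], z))


# The example sequence from the slide, with every event a singleton set.
x = [{c} for c in "abdcadbaabcadababc"]
z = [{"a"}, {"b"}]
for w in (2, 3):
    print(w, window_support(x, z, w))
```

On this sequence the support of a b is 4 for `w = 2` and 8 for `w = 3`: a longer window admits occurrences with gaps, and the same occurrence is counted in several overlapping windows.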
