Continuous Imputation of Missing Values in Streams of Pattern-Determining Time Series

  1. Continuous Imputation of Missing Values in Streams of Pattern-Determining Time Series
     Kevin Wellenzohn¹, Michael H. Böhlen¹, Anton Dignös², Johann Gamper², Hannes Mitterer²
     ¹ Department of Computer Science, University of Zurich
     ² Faculty of Computer Science, Free University of Bolzano
     March 24, 2017

  2. South Tyrol

  3. Overview
     - Problem. Streaming time series often have missing values, e.g. due to sensor failures or transmission delays.
     - Goal. Accurately impute (i.e. recover) the latest measurement by exploiting the correlation among streams.
     - Challenge. Streaming time series are often non-linearly correlated, e.g. due to phase shifts.

  4. Example
     [Figure: streaming time series s, temperature [°C] over time from 13:25 to 14:20; missing values are marked with "?".]
     - The latest value at time 14:20 is missing and needs to be imputed (i.e. recovered).

  5. Approach

  9. Top-k Case Matching (TKCM)
     Intuition. Impute a missing value in time series s with past values from s when a set of correlated reference time series exhibited similar patterns.
     Imputation steps:
     1. Draw query pattern over most recent values
     2. Find k most similar non-overlapping patterns
     3. Impute missing value using the k most similar patterns

  10. Applying TKCM
      [Figure: time series s and reference time series r1 and r2, temperature [°C] over time from 13:25 to 14:20.]

  11. Applying TKCM
      1. Define query pattern P(14:20) over d = 2 reference time series {r1, r2} in a time frame of l = 10 minutes

  12. Applying TKCM
      2. The k = 2 most similar non-overlapping patterns are P(14:00) and P(13:35)

  13. Applying TKCM
      3. Missing value is imputed as $\hat{s}(14{:}20) = \frac{1}{2}\left(s(14{:}00) + s(13{:}35)\right) = 21.85\,^{\circ}\mathrm{C}$
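The three imputation steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the Euclidean pattern distance, the greedy choice of non-overlapping patterns, and the names `pattern_distance` and `tkcm_impute` are assumptions, since the slides do not specify them. The demo data reuses the phase-shifted sine waves from the later slides rather than the temperature values, which are not given.

```python
import math

def pattern_distance(refs, t_a, t_b, l):
    """Euclidean distance (an assumption) between the length-l patterns anchored at t_a and t_b."""
    total = 0.0
    for r in refs:
        for j in range(l):
            diff = r[t_a - j] - r[t_b - j]
            total += diff * diff
    return math.sqrt(total)

def tkcm_impute(s, refs, k, l):
    """Impute the latest value of s (stored as None) from k similar past patterns."""
    t_q = len(s) - 1  # index of the missing latest value

    # Step 1: the query pattern is the last l values of every reference series.
    # Step 2: rank all past anchors whose pattern does not overlap the query pattern.
    ranked = sorted(
        (pattern_distance(refs, t, t_q, l), t)
        for t in range(l - 1, t_q - l + 1)
        if s[t] is not None
    )
    chosen = []
    for _, t in ranked:
        # Greedy selection (a simplification): keep an anchor only if its
        # pattern does not overlap a pattern that was already chosen.
        if all(abs(t - c) >= l for c in chosen):
            chosen.append(t)
        if len(chosen) == k:
            break
    if not chosen:
        raise ValueError("no candidate pattern available")

    # Step 3: the imputed value is the mean of s at the chosen anchors.
    return sum(s[t] for t in chosen) / len(chosen), sorted(chosen)

# Demo: s is a sine wave, the single reference series r is s shifted by 90 degrees.
s = [math.sin(math.radians(t)) for t in range(841)]
r = [math.sin(math.radians(t - 90)) for t in range(841)]
truth = s[-1]
s[-1] = None  # the latest measurement is missing
value, anchors = tkcm_impute(s, [r], k=2, l=100)
print(anchors, round(value, 3), round(truth, 3))
```

With k = 2 the mean is taken over two anchors, which mirrors the averaging of s(14:00) and s(13:35) in the example.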

  14. Query Pattern
      Query pattern with pattern length l = 3 over d = 2 reference time series:

      |    | 14:10 | 14:15 | 14:20 |
      |----|-------|-------|-------|
      | r1 | 16.3  | 17.1  | 17.5  |
      | r2 | 20.2  | 19.9  | 18.2  |

      - With l > 1, TKCM takes the temporal context into account and captures how time series change over time
      - Pattern length l is important to deal with non-linear correlations
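As a small illustration of the query pattern, the sketch below slices the last l values out of each reference series. The helper name `query_pattern` and the list-of-lists layout are assumptions made for this example; the values are the ones shown on the slide.

```python
def query_pattern(refs, anchor, l):
    """Return the d x l pattern: the l values of each reference series ending at index anchor."""
    if anchor - l + 1 < 0:
        raise ValueError("not enough history for a pattern of length l")
    return [r[anchor - l + 1 : anchor + 1] for r in refs]

times = ["14:10", "14:15", "14:20"]
r1 = [16.3, 17.1, 17.5]
r2 = [20.2, 19.9, 18.2]

anchor = times.index("14:20")
pattern_l3 = query_pattern([r1, r2], anchor, l=3)  # full temporal context
pattern_l1 = query_pattern([r1, r2], anchor, l=1)  # only the latest values
print(pattern_l3)
print(pattern_l1)
```

With l = 1 the pattern keeps only the current reference values, so it cannot tell whether the series are rising or falling.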

  15. Related Work
      1. Centroid Decomposition (CD)
         - M. Khayati, M. H. Böhlen, and J. Gamper. Memory-efficient centroid decomposition for long time series. ICDE 2014
         - Singular Value Decomposition (SVD) that expects linear correlations
      2. SPIRIT
         - S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series. VLDB 2005
         - Principal Component Analysis (PCA) that expects linear correlations
      3. MUSCLES
         - B. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. ICDE 2000
         - Multi-variate linear regression that expects linear correlations

  16. Linear vs. Non-Linear Correlations

  19. Linear Correlations
      $s(t) = \operatorname{sind}(t)$, $r(t) = 1.5 \times \operatorname{sind}(t) + 1$
      [Figure: s and r plotted over time t, next to a scatter plot of s(t) against r(t); the marked point is s(t) = 0.86, r(t) = 2.3.]
      - Time series s and r have different amplitude and offset
      - They are linearly correlated and their Pearson Correlation Coefficient is 1!

  21. Non-Linear Correlations
      $s(t) = \operatorname{sind}(t)$, $r(t) = \operatorname{sind}(t - 90)$
      [Figure: s and r plotted over time t, next to a scatter plot of s(t) against r(t); for r(t) = 0.5 the plot marks both s(t) = 0.86 and s(t) = -0.86.]
      - Time series s and r are phase-shifted by 90 degrees
      - They are non-linearly correlated and their Pearson Correlation Coefficient is 0!
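The two Pearson claims on the Linear and Non-Linear Correlations slides can be checked numerically. The sketch below is an illustration under one assumption: it samples whole degrees over two full periods (0 to 719), whereas the plots run to 840, because the coefficient of the phase-shifted pair is 0 only over complete periods. `pearson` and `sind` are helper names written for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def sind(t):
    """Sine of an angle given in degrees."""
    return math.sin(math.radians(t))

ts = range(720)  # two full periods, sampled at whole degrees
s = [sind(t) for t in ts]
r_linear = [1.5 * sind(t) + 1 for t in ts]  # different amplitude and offset
r_shifted = [sind(t - 90) for t in ts]      # phase-shifted by 90 degrees

rho_linear = pearson(s, r_linear)
rho_shifted = pearson(s, r_shifted)

# The phase-shifted pair is still fully dependent: every point lies on the unit circle.
on_circle = all(abs(a * a + b * b - 1) < 1e-12 for a, b in zip(s, r_shifted))
print(rho_linear, rho_shifted, on_circle)
```

The circle check shows why a coefficient of 0 does not mean the series are unrelated: r still determines s up to its sign.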

  23. Pattern Length l and Non-Linear Correlations
      [Figure: two panels, pattern length l = 1 and pattern length l = 100; each shows s, r, and the pattern dissimilarity over time t.]
      - With l > 1 there are fewer patterns with pattern dissimilarity 0
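The effect of the pattern length can be reproduced on the two sine waves from the previous slides. The sketch below lists every past anchor whose pattern over r has (numerically) zero dissimilarity to the query pattern at t = 840. The Euclidean dissimilarity, the tolerance, and the function names are assumptions for illustration.

```python
import math

def pattern_dissimilarity(r, t_a, t_b, l):
    """Euclidean distance (an assumption) between the length-l patterns ending at t_a and t_b."""
    return math.sqrt(sum((r[t_a - j] - r[t_b - j]) ** 2 for j in range(l)))

def matching_anchors(r, t_query, l, tol=1e-9):
    """Past anchors whose pattern is numerically identical to the query pattern."""
    return [
        t for t in range(l - 1, t_query)
        if pattern_dissimilarity(r, t, t_query, l) < tol
    ]

s = [math.sin(math.radians(t)) for t in range(841)]
r = [math.sin(math.radians(t - 90)) for t in range(841)]
t_query = 840

anchors_l1 = matching_anchors(r, t_query, l=1)
anchors_l100 = matching_anchors(r, t_query, l=100)

# Sign of s at each matching anchor: mixed signs mean the match is ambiguous.
signs_l1 = [s[t] > 0 for t in anchors_l1]
signs_l100 = [s[t] > 0 for t in anchors_l100]
print(anchors_l1, signs_l1)
print(anchors_l100, signs_l100)
```

With l = 1 the matching anchors alternate between positive and negative values of s, so averaging them would cancel out. With l = 100 only the anchors in the same phase remain.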

  24. Chlorine Dataset
      - Chlorine dataset is phase-shifted and hence non-linearly correlated
      [Figure: chlorine level of s and r over time t, next to a scatter plot of s(t) against r(t).]

  25. Importance of Pattern Length l
      [Figure: chlorine level of s and of s imputed by TKCM over time t, for pattern length l = 1 and pattern length l = 72.]
      - A larger pattern length decreases the oscillation in the imputed time series

  26. Experiments

  27. Datasets
      We use 4 datasets:
      1. SBR
         - 130 meteorological time series from South Tyrol
         - linearly correlated
      2. SBR-1d
         - SBR dataset shifted up to 1 day
         - non-linearly correlated
      3. Flights
         - 8 time series
         - non-linearly correlated
      4. Chlorine
         - 166 time series
         - non-linearly correlated

  28. Pattern Length l
      [Figure: RMSE as a function of pattern length l (1 to 144) in four panels: SBR (linearly correlated), SBR-1d, Flights, and Chlorine (non-linearly correlated).]

  29. Comparison
      [Figure: bar charts of the RMSE of TKCM, SPIRIT, MUSCLES, and CD in four panels: SBR (linearly correlated), SBR-1d, Flights, and Chlorine (non-linearly correlated).]
      - TKCM is more accurate on all non-linearly correlated datasets (SBR-1d, Flights, and Chlorine).
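The experiments report accuracy as RMSE. A minimal sketch of that measure follows, assuming it is computed between the imputed values and the true values that were withheld. The function name `rmse` and the sample numbers are hypothetical and are not taken from the experiments.

```python
import math

def rmse(imputed, truth):
    """Root mean squared error between imputed and true values."""
    if len(imputed) != len(truth) or not truth:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(imputed, truth))
    return math.sqrt(squared / len(truth))

# Hypothetical values, for illustration only.
example = rmse([21.0, 22.5, 23.0], [21.0, 22.0, 24.0])
print(example)
```

A lower RMSE means the imputed values are closer to the true ones, which is how the four methods are ranked on each dataset.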

  30. Conclusion & Future Work
      Conclusion
      - TKCM imputes the current missing value in a stream using reference time series
      - TKCM exploits linear and non-linear correlations among time series
      Future work
      - Automatically choose reference time series
      - Improve efficiency of TKCM by pruning candidate patterns

  31. Thanks!
