
Quality of Similarity Rankings in Time Series

Quality of Similarity Rankings in Time Series. T. Bernecker, M. E. Houle, H.-P. Kriegel, P. Kröger, M. Renz, E. Schubert, A. Zimek. 12th International Symposium on Spatial and Temporal Databases (SSTD 2011).


  1. Quality of Similarity Rankings in Time Series
     12th International Symposium on Spatial and Temporal Databases (SSTD 2011)
     Thomas Bernecker (1), Michael E. Houle (2), Hans-Peter Kriegel (1), Peer Kröger (1), Matthias Renz (1), Erich Schubert (1), Arthur Zimek (1)
     (1) Ludwig-Maximilians-Universität München, Munich, Germany
     (2) National Institute of Informatics, Tokyo, Japan
     2011-08-26, Minneapolis, MN
     Outline:
     ◮ Motivation: Interpreting Distance Fct.
     ◮ Distance Functions: Curse of Dimens., SNN Distance
     ◮ Experiments: SNN performance, Histograms, Effects of noise
     ◮ Conclusions

  2. Time Series Distances
     Time series research ... has plenty of:
     ◮ New distance functions
     ◮ Dimensionality reduction
     ◮ Approximations
     ... but:
     ◮ How big is a distance of 0.432?
     ◮ How big is a difference of 0.123?
     What is the meaning of these values?

  3. Interpreting distance functions
     Distance functions used to have a physical meaning:
     ◮ “As the crow flies”
     ◮ “Taxicab metric”
     This worked well for the three-dimensional world.
     But this is not so in time series:
     ◮ “Curse of dimensionality”: loss of contrast in high-dimensional data
     ◮ Dimension alignment as done by time warping
     ◮ Edit distances treat big and small edits the same
     But: the distance functions work!

  4. The “Curse of Dimensionality”
     Commonly described as:
     ◮ Distances become “indiscernible”
     ◮ Distances “lose their usefulness”
     ◮ Hypercube becomes “vastly” bigger than hypersphere
     ◮ Nearest and farthest neighbor become similar
     ◮ Mathematical:
       $\lim_{\mathrm{dim} \to \infty} \frac{\mathrm{dist}_{\max} - \mathrm{dist}_{\min}}{\mathrm{dist}_{\min}} \to 0$
     So they should not work.
     But: they do!
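
The relative contrast from the formula above can be observed numerically. The following is a minimal sketch, assuming i.i.d. uniform data in the unit hypercube and Euclidean distance; the function names `relative_contrast` and `mean_contrast`, the sample sizes, and the seed are illustrative choices, not taken from the slides.

```python
import math
import random


def relative_contrast(points, query):
    """Return (dist_max - dist_min) / dist_min for Euclidean distances to query."""
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


def mean_contrast(dim, n_points=200, trials=5, seed=0):
    """Average relative contrast over a few random i.i.d. uniform data sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
        query = [rng.random() for _ in range(dim)]
        total += relative_contrast(points, query)
    return total / trials


# Print the average contrast for a low, medium, and high dimensionality.
for dim in (2, 20, 200):
    print(dim, round(mean_contrast(dim), 3))
```

On such i.i.d. data the printed contrast should shrink as the dimensionality grows, which is the setting the limit statement refers to.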

  5. How bad is the “Curse of Dimensionality”?
     Some facts on the “Curse of Dimensionality” (from Houle et al. 2010):
     ◮ Mathematics proven for i.i.d. data only
     ◮ Relevant dimensions make the problem easier
     ◮ Irrelevant dimensions make the problem harder
     ◮ ⇒ mostly a matter of “signal to noise ratio”
     ◮ Numerical contrast goes away, but ranking still remains meaningful
     Goal: Restore contrast and intuition using the ranking information of the existing distance functions!

  6. Shared Nearest Neighbor Similarity
     Idea: Similar objects have similar neighbors.
       $\mathrm{SNN}_s(x, y) = |\mathrm{NN}_s(x) \cap \mathrm{NN}_s(y)|$
       $\mathrm{simcos}_s(x, y) = \frac{\mathrm{SNN}_s(x, y)}{s}$
     Properties:
     ◮ Intuitive value range from “None” to “All”
     ◮ Intuitive interpretation (“social”)
     ◮ Good contrast, good performance
     ◮ Needs an “okay” existing ranking
     ◮ Extra parameter s to choose
     ◮ More expensive to use (second order distance)
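
The two definitions above can be sketched in a few lines. This is a brute-force illustration, not the authors' implementation: the helper names `knn`, `snn`, and `simcos` are hypothetical, the primary distance defaults to Euclidean, and excluding a point from its own neighbor set is an assumption the slide does not settle.

```python
import math


def knn(data, i, s, dist=math.dist):
    """Indices of the s nearest neighbors of data[i] under the primary distance.

    Assumption: the point itself is not counted as its own neighbor.
    """
    others = [j for j in range(len(data)) if j != i]
    others.sort(key=lambda j: dist(data[i], data[j]))
    return set(others[:s])


def snn(data, x, y, s, dist=math.dist):
    """SNN_s(x, y): number of neighbors shared by the two s-neighborhoods."""
    return len(knn(data, x, s, dist) & knn(data, y, s, dist))


def simcos(data, x, y, s, dist=math.dist):
    """simcos_s(x, y) = SNN_s(x, y) / s, ranging from 0 (none) to 1 (all)."""
    return snn(data, x, y, s, dist) / s


# Toy data: two well-separated groups of one-dimensional points.
data = [(0,), (1,), (2,), (3,), (100,), (101,), (102,), (103,)]
print(snn(data, 0, 1, 3), simcos(data, 0, 4, 3))
```

Any primary distance (for example a DTW implementation) could be passed as `dist`; only the ranking it induces is used, which is why the slide notes that an “okay” existing ranking is required.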

  7. Shared Nearest Neighbor Distance
     The similarity function needs to be transformed to a (non-metrical) distance function:
       $\mathrm{dinv}_s(x, y) = 1 - \mathrm{simcos}_s(x, y)$
       $\mathrm{dacos}_s(x, y) = \arccos(\mathrm{simcos}_s(x, y))$
       $\mathrm{dln}_s(x, y) = -\ln \mathrm{simcos}_s(x, y)$
     Just like cosine distance. Interpretable as “cosine distance” in “neighbor space”.
     Similar: Jaccard distance (metrical)
       $J(x, y) := 1 - \frac{|\mathrm{NN}_s(x) \cap \mathrm{NN}_s(y)|}{|\mathrm{NN}_s(x) \cup \mathrm{NN}_s(y)|}$
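
The transformations above are simple enough to write out directly. A minimal sketch follows; it takes an already computed similarity value in [0, 1] (or, for the Jaccard variant, two neighbor sets). Returning infinity for `dln` at similarity 0 and 0.0 for two empty neighbor sets are handling choices assumed here, not stated on the slide.

```python
import math


def dinv(sim):
    """dinv_s = 1 - simcos_s."""
    return 1.0 - sim


def dacos(sim):
    """dacos_s = arccos(simcos_s)."""
    return math.acos(sim)


def dln(sim):
    """dln_s = -ln(simcos_s); assumed to be infinite when nothing is shared."""
    return -math.log(sim) if sim > 0 else math.inf


def jaccard_distance(nn_x, nn_y):
    """1 - |intersection| / |union| of two neighbor sets."""
    union = nn_x | nn_y
    if not union:
        return 0.0  # assumption: two empty neighborhoods count as identical
    return 1.0 - len(nn_x & nn_y) / len(union)


print(dinv(0.5), dacos(0.5), dln(0.5), jaccard_distance({1, 2, 3}, {2, 3, 4}))
```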

  8. Experiments
     Experimental results

  9. Data sets used
     Four very different data sets:
     ◮ Cylinder-Bell-Funnel (CBF): artificial
     ◮ Synthetic control: artificial
     ◮ Leaf dataset: outlines of tree leaves
     ◮ Lightning-7: lightning strike emissions
     Each modified in different ways:
     ◮ Original data set
     ◮ Extended with noise (irrelevant attributes)
     ◮ Extended with “signal” (relevant attributes)
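
One way to picture the “extended with noise” variant is to append irrelevant random attributes to every series. The slides do not say how the noise was generated, so the uniform distribution, its range, and the name `extend_with_noise` below are assumptions made purely for illustration.

```python
import random


def extend_with_noise(series_list, n_extra, low=0.0, high=1.0, seed=0):
    """Append n_extra irrelevant attributes (assumed uniform noise) to each series."""
    rng = random.Random(seed)
    return [
        list(series) + [rng.uniform(low, high) for _ in range(n_extra)]
        for series in series_list
    ]


original = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
noisy = extend_with_noise(original, n_extra=4)
print(len(noisy[0]), noisy[0][:3])
```

The original values are kept unchanged at the front of each series, so only the ratio of relevant to irrelevant attributes changes.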

  10. Unmodified data sets
      Results on unmodified data sets
      Benefits of using SNN, shown by example on the Cylinder-Bell-Funnel (artificial) data set

  11. Contrast gain using SNN
      Visual improvement (unmodified CBF data set):
      [Figure, six panels: Euclidean; DTW 20%; LCSS 20%; DTW, s = 70; DTW, s = 100; LCSS, s = 100]

  12. Distance Histograms
      Numerical contrast improved (unmodified CBF data set):
      [Figure, six panels: Euclidean; Manhattan; DTW 20%; ERP 20%; EDR 20%; LCSS 20%. Each panel shows the PDF of the primary distance and the PDF of the SNN 100 distance (range 0 to 1).]

  13. Effect of neighborhood size s
      Effect of variation of SNN size parameter s (CBF):
      [Figure, six panels: Euclidean; Manhattan; DTW 20%; ERP 20%; EDR 20%; LCSS 20%. x-axis: SNN size (0 to 200); y-axis: mean ROC AUC (0.6 to 1).]
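
The quality measure on the y-axis, mean ROC AUC, can be sketched as follows. The evaluation protocol is assumed here, not taken from the slides: every object is used as a query, the remaining objects are ranked by distance, and objects with the same class label count as relevant. The names `roc_auc` and `mean_roc_auc` are hypothetical.

```python
def roc_auc(distances, is_relevant):
    """ROC AUC of one ranking: share of (relevant, irrelevant) pairs ordered correctly.

    Ties in distance count as half a correctly ordered pair.
    """
    pos = [d for d, r in zip(distances, is_relevant) if r]
    neg = [d for d, r in zip(distances, is_relevant) if not r]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant object")
    wins = sum(
        1.0 if p < n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_roc_auc(dist_matrix, labels):
    """Average ROC AUC over all queries, leaving the query itself out of its ranking."""
    aucs = []
    for q, row in enumerate(dist_matrix):
        others = [j for j in range(len(labels)) if j != q]
        aucs.append(
            roc_auc([row[j] for j in others], [labels[j] == labels[q] for j in others])
        )
    return sum(aucs) / len(aucs)


# Toy example: two classes that the distance separates perfectly.
points = [0, 1, 10, 11]
labels = ["a", "a", "b", "b"]
matrix = [[abs(p - q) for q in points] for p in points]
print(mean_roc_auc(matrix, labels))
```

Because the measure depends only on the order of the distances, it can compare a primary distance with its SNN-based counterpart even though their value ranges differ.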

  14. Modified data sets
      Results on modified data sets:
      adding noise to the data set, changing the signal-to-noise ratio
