Uncertain Time-Series Similarity: Return to the Basics Dallachiesa et al., VLDB 2012 Li Xiong, CS730
Problem • Problem: uncertain time-series similarity • Applications: – location tracking of moving objects; traffic monitoring; remote sensing • Uncertain time series are pervasive – Imprecision of sensor observations – Privacy-preserving transformations • Similarity matching is the basis for many analysis and mining tasks – Clustering – Shapelet – Motif – …
Overview • Review of 3 state-of-the-art techniques for similarity matching in uncertain time series – MUNICH, PROUD, DUST • Experimental comparison of the techniques for similarity matching on 17 real (perturbed) datasets • Two additional (simple) similarity measures that unexpectedly outperform the state of the art • Discussion of research directions
Modeling/Representing uncertain time-series • Repeated measurements (samples) • Probability density function (pdf) over the uncertain values
Similarity metrics • Euclidean Distance (ED) • Dynamic Time Warping (DTW)
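As a reference point for the two metrics on certain (exact) series, here is a minimal Python sketch. The function names are illustrative, and the DTW variant shown (squared point cost, no warping-window constraint, square root of the accumulated cost) is one common formulation assumed here, not necessarily the one used in the paper.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw(x, y):
    """DTW distance via dynamic programming (assumed variant:
    squared point cost, unconstrained warping path)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat y[j-1]
                                 cost[i][j - 1],      # repeat x[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])
```

Because the diagonal path is always allowed, this DTW never exceeds the Euclidean distance on equal-length inputs, and it can align sequences of different lengths.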
Similarity based range query • Range query: given a collection of time-series C, a query sequence Q, find similar series S in C • Probabilistic range query
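The two query types can be written out as follows. The symbols $\varepsilon$ (distance threshold) and $\tau$ (probability threshold) match those used on the later DUST slides; the names RQ and PRQ and the generic distance $d$ are notation assumed here for illustration.

```latex
% Range query on certain series
\mathrm{RQ}(Q, C, \varepsilon) = \{\, S \in C : d(Q, S) \le \varepsilon \,\}

% Probabilistic range query on uncertain series
\mathrm{PRQ}(Q, C, \varepsilon, \tau) = \{\, S \in C : \Pr\big(d(Q, S) \le \varepsilon\big) \ge \tau \,\}
```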
State-of-the-Art • MUNICH – Repeated observation model • PROUD – Random variable model • DUST – Random variable model
MUNICH • Repeated observation model • Euclidean distance (Lp-norm) and Dynamic Time Warping (DTW)
MUNICH • Materialize uncertain sequences X and Y to all possible certain sequences • Define the set of distances between all possible sequences • Uncertain distance
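The materialization idea can be sketched naively in Python. This is an illustration under assumptions, not the MUNICH implementation: each uncertain series is a list of per-timestamp sample lists, all materializations are treated as equally likely, and the function name is hypothetical.

```python
import itertools
import math


def munich_naive_probability(X, Y, eps):
    """Fraction of (x, y) materialization pairs with Euclidean distance <= eps.

    X[i] and Y[i] hold the repeated observations at timestamp i.
    Assumption: every materialization is equally likely.
    """
    if len(X) != len(Y):
        raise ValueError("series must have the same number of timestamps")
    hits = 0
    total = 0
    # One certain sequence per choice of a sample at every timestamp.
    for xs in itertools.product(*X):
        for ys in itertools.product(*Y):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)))
            total += 1
            if d <= eps:
                hits += 1
    return hits / total
```

With s samples per timestamp and n timestamps, each series has s^n materializations, which is why this direct enumeration is only usable on toy inputs.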
MUNICH • Naïve computation: exponential computation cost (note the typo) [CAO Chen, DB Group, CSE, HKUST, 21/2/2011]
MUNICH • Lower bounding and upper bounding the distance/probability • Approximate the samples using minimum bounding intervals
MUNICH • Minimum bounding interval
MUNICH • Compute upper bound and lower bound of distances between all possible interval sequences
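A minimal sketch of interval-based bounds, assuming Euclidean distance and one minimum bounding interval per timestamp (the [min, max] of the samples). The function names are hypothetical, and MUNICH's actual approximation may group samples differently.

```python
import math


def bounding_intervals(X):
    """Minimum bounding interval (min, max) of the samples at each timestamp."""
    return [(min(samples), max(samples)) for samples in X]


def interval_distance_bounds(X, Y):
    """Lower and upper bound on the Euclidean distance between any
    materialization of X and any materialization of Y."""
    lo_sq = 0.0
    hi_sq = 0.0
    for (xl, xh), (yl, yh) in zip(bounding_intervals(X), bounding_intervals(Y)):
        # Smallest possible gap: 0 if the intervals overlap.
        gap = max(0.0, xl - yh, yl - xh)
        # Largest possible gap: between opposite interval ends.
        span = max(xh - yl, yh - xl)
        lo_sq += gap ** 2
        hi_sq += span ** 2
    return math.sqrt(lo_sq), math.sqrt(hi_sq)
```

Every materialized distance lies between the two returned values, so they can be compared against the query threshold without enumerating materializations.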
MUNICH • Recall uncertain distance and probabilistic range query • Compute lower bound and upper bound for Pr
MUNICH • Pruning based on lower and upper bound – True hit – True drop • Stepwise refinement
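The pruning decision can be sketched as a three-way test on the probability bounds. The function name, the return labels, and the exact comparison operators are assumptions chosen for illustration; `tau` is the probability threshold of the probabilistic range query.

```python
def classify_candidate(p_lower, p_upper, tau):
    """Decide a candidate's fate from bounds on Pr(distance <= eps).

    Returns "true hit" if the candidate certainly qualifies,
    "true drop" if it certainly does not, and "refine" otherwise.
    """
    if not 0.0 <= p_lower <= p_upper <= 1.0:
        raise ValueError("need 0 <= p_lower <= p_upper <= 1")
    if p_lower >= tau:
        return "true hit"   # even the pessimistic bound passes
    if p_upper < tau:
        return "true drop"  # even the optimistic bound fails
    return "refine"         # bounds straddle tau: tighten them and retry
```

Only candidates that come back as "refine" need tighter bounds, which is where the stepwise refinement comes in.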
PROUD • Pdf model and Euclidean distance • Probabilistic distance model
PROUD • Probabilistic distance model • The distance approaches a normal distribution when the number of time points is sufficiently large (central limit theorem)
PROUD • Recall probabilistic range query • CDF of normal distribution expressed as error function and compute • Compute normalized epsilon and test
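A sketch of the normal-approximation test in Python. The assumptions here go beyond the slides: observed values are means, errors are independent zero-mean normal with known standard deviations `sigma_x` and `sigma_y`, and the squared Euclidean distance is approximated as normal. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def proud_match_probability(x, y, sigma_x, sigma_y, eps):
    """Approximate Pr(dist(X, Y) <= eps) via the central limit theorem."""
    s2 = sigma_x ** 2 + sigma_y ** 2  # variance of each difference D_i
    if s2 <= 0.0:
        raise ValueError("at least one error standard deviation must be positive")
    mean = 0.0
    var = 0.0
    for a, b in zip(x, y):
        mu = a - b
        # Moments of D_i^2 for D_i ~ N(mu, s2)
        mean += mu * mu + s2
        var += 2.0 * s2 * s2 + 4.0 * mu * mu * s2
    eps_norm = (eps ** 2 - mean) / math.sqrt(var)  # normalized epsilon
    return normal_cdf(eps_norm)


def proud_range_test(x, y, sigma_x, sigma_y, eps, tau):
    """True if the approximate match probability reaches the threshold tau."""
    return proud_match_probability(x, y, sigma_x, sigma_y, eps) >= tau
```

Testing `normal_cdf(eps_norm) >= tau` is equivalent to comparing the normalized epsilon against the tau-quantile of the standard normal.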
DUST • Probability model • DUST similarity metric • Bayesian probability computation
DUST: A Generalized Notion of Similarity between Uncertain Time Series Smruti R. Sarangi and Karin Murthy IBM Research Labs, Bangalore, India
Resolving the Question • Which is closer to $T_1$: $T_2$ or $T_3$? – Euclidean distance (EUCL) and Dynamic Time Warping (DTW): $T_3$ – DUST: $T_2$ [Figure: $T_1$, $T_2$ and $T_3$ plotted as value over time] • $T_2$ should be closer to $T_1$ than $T_3$ – This is because it is possible that $T_2$ and $T_1$ are the same time series; $T_2$ just has some additional error. – $T_3$ and $T_1$ can never be the same time series because the last value has a very large divergence.
Extending Prior Work • Prior work: two time series are considered similar if $P(\mathrm{DIST}(T_1, T_2) \le \varepsilon) \ge \tau$, where $\mathrm{DIST}(T_1, T_2) = \sqrt{\sum_i \mathrm{dist}(T_1[i], T_2[i])^2}$ and $\mathrm{dist}(x, y) = |x - y|$ • Assumption: $P(\mathrm{DIST}(T_1, T_2) \le \varepsilon) = p(\mathrm{DIST}(T_1, T_2) = 0)\,\varepsilon$ (irrespective of the size of $\varepsilon$)
Some Algebra
$P(\mathrm{DIST}(T_1, T_2) \le \varepsilon) > P(\mathrm{DIST}(T_1, T_3) \le \varepsilon)$
$\approx\; p(\mathrm{DIST}(T_1, T_2) = 0) > p(\mathrm{DIST}(T_1, T_3) = 0)$
$\prod_i p(\mathrm{dist}(T_1[i], T_2[i]) = 0) > \prod_i p(\mathrm{dist}(T_1[i], T_3[i]) = 0)$
$\sum_i -\log p(\mathrm{dist}(T_1[i], T_2[i]) = 0) < \sum_i -\log p(\mathrm{dist}(T_1[i], T_3[i]) = 0)$
• Each term on the left is $-\log \phi(|T_1[i] - T_2[i]|)$, where $\phi(x) = p(\mathrm{dist}(0, x) = 0)$ • $p(\mathrm{dist}(x, y) = 0)$ depends only on $|x - y|$ (proved in the paper) • Definition: $\mathrm{dust}(x, y) = -\log(\phi(|x - y|)) + \log(\phi(0))$
DUST • Compute • Bayes Theorem • Require – Data distribution (uniform) – Error distribution
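A sketch of the pointwise dust definition as written on the slide, for one specific case that is assumed here: a uniform data distribution and independent zero-mean normal errors with standard deviations `sigma_x` and `sigma_y`. Under these assumptions $\phi(\Delta)$ is the density of $N(0, \sigma_x^2 + \sigma_y^2)$ at $\Delta$. The function names are hypothetical, and the published DUST definition may apply a further transformation to these values.

```python
import math


def log_phi_normal(delta, sigma_x, sigma_y):
    """log phi(delta) when phi is the N(0, sigma_x^2 + sigma_y^2) density.

    Working in log space avoids underflow for large |delta|.
    """
    var = sigma_x ** 2 + sigma_y ** 2
    if var <= 0.0:
        raise ValueError("at least one error standard deviation must be positive")
    return -delta * delta / (2.0 * var) - 0.5 * math.log(2.0 * math.pi * var)


def dust_point(x, y, sigma_x, sigma_y):
    """dust(x, y) = -log(phi(|x - y|)) + log(phi(0)), as on the slide."""
    return (-log_phi_normal(abs(x - y), sigma_x, sigma_y)
            + log_phi_normal(0.0, sigma_x, sigma_y))


def dust_sum(xs, ys, sigma_x, sigma_y):
    """Sum of pointwise dust values, mirroring the sum in the derivation."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    return sum(dust_point(a, b, sigma_x, sigma_y) for a, b in zip(xs, ys))
```

In this normal-error case the pointwise value reduces to $(x - y)^2 / (2(\sigma_x^2 + \sigma_y^2))$, so the ranking coincides with Euclidean distance; other error distributions give a different shape.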
Comparison • Common assumption: values at different timestamps are independent – Correlations neglected
Comparison
                       MUNICH                        PROUD                         DUST
Uncertainty modeling   Multiple observations         Random variable               Random variable
A priori knowledge     (none listed)                 Mean and standard deviation   Data distribution and error distribution
Distance metric        Euclidean, DTW                Euclidean                     DUST, Euclidean, DTW
Similarity queries     Probabilistic range queries   Probabilistic range queries   kNN queries
Experimental Study • Data – 17 real datasets from UCR: time series with exact values as ground truth – (not real) Perturbation with uniform, normal and exponential error distributions • Similarity matching: probabilistic range queries • Metric: F1 metric • Baseline: Euclidean distance
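The F1 metric for a single query can be sketched as below. It assumes the ground truth is the set of series returned by the same query on the exact (unperturbed) data; the function name and the convention of returning 0.0 when there are no true positives are choices made here for illustration.

```python
def f1_score(returned, relevant):
    """F1 of a query result against the ground-truth result set."""
    returned = set(returned)
    relevant = set(relevant)
    true_positives = len(returned & relevant)
    if true_positives == 0:
        return 0.0  # also covers empty result or empty ground truth
    precision = true_positives / len(returned)
    recall = true_positives / len(relevant)
    return 2.0 * precision * recall / (precision + recall)
```

For example, a result set that shares two of three series with a three-series ground truth has precision and recall of 2/3 each, so F1 is 2/3.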
Moving average filters • Uncertain moving average (UMA) – give less weight to observations with larger error standard deviation • Uncertain exponential moving average (UEMA) – give more weight to the nearest neighbors
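A sketch of both filters under stated assumptions: the window covers `w` points on each side, each observation is weighted by the inverse of its error standard deviation, UEMA additionally decays the weight exponentially with distance via a hypothetical parameter `lam`, and the weights are normalized to sum to one. The exact weighting and normalization in the paper may differ.

```python
import math


def uema(values, sigmas, w, lam):
    """Uncertain exponential moving average (illustrative weighting).

    Weight of neighbor j for position i: exp(-lam * |j - i|) / sigmas[j].
    """
    if len(values) != len(sigmas):
        raise ValueError("values and sigmas must have the same length")
    if any(s <= 0 for s in sigmas):
        raise ValueError("standard deviations must be positive")
    n = len(values)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - w), min(n - 1, i + w)  # clip window at the ends
        num = 0.0
        den = 0.0
        for j in range(lo, hi + 1):
            weight = math.exp(-lam * abs(j - i)) / sigmas[j]
            num += weight * values[j]
            den += weight
        smoothed.append(num / den)
    return smoothed


def uma(values, sigmas, w):
    """Uncertain moving average: UEMA without the distance decay."""
    return uema(values, sigmas, w, 0.0)
```

The filtered series can then be compared with plain Euclidean distance; with equal standard deviations `uma` reduces to an ordinary centered moving average.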
Discussion • Experiment and Analysis track paper • Good analytical and experimental survey • Unexpected results
Discussion • What’s realistic prior knowledge to assume? • How to model correlations between time points?