Creating Probabilistic Databases from Imprecise Time-Series Data
Saket Sathe, Hoyoung Jeung, Karl Aberer
EPFL, Switzerland
13th April, 2011
Outline

[Figure: motivating example. A raw_values table of imprecise positions is turned into a prob_view table of room probabilities. The probability distribution p(R) showing Alice's position is centred at μ; the 3σ area serves as a reasonable boundary, and the probability for a room is the integral of p(R) dR over, e.g., room 4 ∩ 3σ area. Axes: x, y; panels: time = 1, time = 2.]

raw_values
time  x    y
1     1.1  2.3
2     1.3  2.1
:     :    :

prob_view
time  room  probability
1     1     0.5
1     2     0.1
1     3     0.3
1     4     0.1
2     1     0.2
2     2     0.4
2     3     0.1
2     4     0.3

Dynamic Density Metrics
Approximating Gaussian distributions using σ-cache
Measure of Quality
Parameter setting under provable guarantees
Efficiently creating probabilistic views
Experiments
Problem Setting

[Figure: values plotted over time with ticks at t−H−1, t−1 and t. The window $S^H_{t-1}$ of length H ends at $r_{t-1}$, and the density $p_t(R_t)$ is drawn at time t around $r_t$.]

Dynamic Density Metric: Given $S^H_{t-1}$, the dynamic density metric infers time-dependent probability distributions $p_t(R_t)$ at time $t$, where $R_t$ is a random variable associated with $r_t$.
GARCH Metric

[Figure: values plotted over time with ticks at t−H−1, t−1 and t, showing the window $S^H_{t-1}$, the observed value $r_t$, the expected value $\hat{r}_t$, and the density $p_t(R_t) \sim N(\hat{r}_t, \hat{\sigma}_t^2)$ evaluated at $R_t = r_t$.]

$\hat{r}_t$ is modeled using an ARMA model.
$\hat{\sigma}_t^2$ is modeled using a GARCH model.
Thus $p_t(R_t)$ is $N(\hat{r}_t, \hat{\sigma}_t^2)$. We refer to this approach as ARMA-GARCH.
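The ARMA-GARCH idea can be sketched in a few lines. The sketch below is a deliberate simplification, not the authors' implementation: it uses an AR(1) mean model (a special case of ARMA) and a GARCH(1,1) variance recursion, and it assumes the coefficients `phi0`, `phi1`, `alpha0`, `alpha1`, `beta1` have already been fitted elsewhere. The function name and all parameter names are hypothetical.

```python
def arma_garch_step(window, phi0, phi1, alpha0, alpha1, beta1):
    """One-step forecast (r_hat, sigma2_hat) from a window of past values.

    Assumptions (not from the slides): AR(1) mean, GARCH(1,1) variance,
    coefficients already estimated.
    """
    # Residuals of the AR(1) mean model over the window.
    residuals = [
        window[i] - (phi0 + phi1 * window[i - 1])
        for i in range(1, len(window))
    ]
    # Start the variance recursion from the mean squared residual.
    variance = sum(a * a for a in residuals) / len(residuals)
    # GARCH(1,1): sigma^2_{i+1} = alpha0 + alpha1 * a_i^2 + beta1 * sigma^2_i
    for a in residuals:
        variance = alpha0 + alpha1 * a * a + beta1 * variance
    # Expected value for the next time step.
    r_hat = phi0 + phi1 * window[-1]
    return r_hat, variance


# Illustrative window and coefficients (made up for this example).
r_hat, sigma2_hat = arma_garch_step(
    [4.2, 5.9, 7.1, 7.9], phi0=0.5, phi1=0.95,
    alpha0=0.05, alpha1=0.2, beta1=0.7,
)
```

The returned pair plays the role of $(\hat{r}_t, \hat{\sigma}_t^2)$, the parameters of the Gaussian $p_t(R_t)$.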
Quality of Dynamic Density Metrics

Metric                        $\hat{r}_t$      $\hat{\sigma}_t^2$
ARMA-GARCH                    ARMA             GARCH
Uniform Thresholding (UT)     ARMA             u (user-specified)
Variable Thresholding (VT)    ARMA             sample variance of $S^H_{t-1}$
Kalman-GARCH                  Kalman Filter    GARCH

Problem: The true density $\hat{p}_t(R_t)$ is not observable.
Indirect Method: Suppose $p_1(R_1), \ldots, p_T(R_T)$ are the inferred densities and let $z_t = P(R_t \le r_t)$. Then $z_t$ is uniformly distributed on $(0, 1)$ when $p_t(R_t) = \hat{p}_t(R_t)$ [Diebold et al.].

$$d\{U_Z(z), Q_Z(z)\} = \sqrt{\sum_{x=0}^{1} \left(U_Z(x) - Q_Z(x)\right)^2}, \qquad (1)$$

where $U_Z(z)$ is the ideal uniform cdf on $(0, 1)$ and $Q_Z(z)$ is the observed cdf of $z_t$. We call $d\{U_Z(z), Q_Z(z)\}$ the density distance.
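A minimal sketch of the density distance, assuming Gaussian inferred densities and evaluating both cdfs on an evenly spaced grid over [0, 1]. The grid resolution `grid_points` is an assumption of this sketch (the slide does not state the step of the sum), and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def density_distance(observed, means, stds, grid_points=101):
    """Distance between the uniform cdf and the empirical cdf of z_t."""
    # Probability integral transform: z_t = P(R_t <= r_t) under p_t.
    z = [NormalDist(m, s).cdf(r) for r, m, s in zip(observed, means, stds)]
    total = 0.0
    for i in range(grid_points):
        x = i / (grid_points - 1)
        ideal = x  # U_Z(x): cdf of the uniform distribution on (0, 1)
        empirical = sum(1 for v in z if v <= x) / len(z)  # Q_Z(x)
        total += (ideal - empirical) ** 2
    return sqrt(total)
```

A smaller value indicates that the inferred densities are closer to the true ones. For example, values drawn from a standard normal score better against $N(0, 1)$ than against an over-dispersed $N(0, 25)$.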
Probabilistic View Generation

    CREATE VIEW prob_view AS
    DENSITY r OVER t
    OMEGA delta=2, n=2
    FROM raw_values
    WHERE t >= 1 AND t <= 3

[Figure: framework. A sensor feeds the raw_values table; the dynamic density metrics infer $p_t(R_t)$, i.e. $\hat{r}_t$ and $\hat{\sigma}_t$, for each raw value; the Ω-View builder, supported by the σ-cache, answers the user's probabilistic view generation query and produces prob_view.]

raw_values with inferred parameters
t   $r_t$   $\hat{r}_t$   $\hat{\sigma}_t$
1   4.2     4.0           0.3
2   5.9     6.0           3.2
3   7.1     7.0           2.9
4   7.9     7.7           0.2

prob_view
t   Ω            Λ
1   ω1 [2:4]     0.50
1   ω2 [0:2]     0.01
2   ω1 [4:6]     0.23
2   ω2 [2:4]     0.08
3   ω1 [5:7]     0.25
3   ω2 [3:5]     0.16
Problem: Large computational cost when the time interval and n are large and Δ is small (finer granularity).
Idea: Cache and reuse the computation of probability values from earlier times.
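To make the cost concrete, here is a sketch of view generation without any caching: every range probability is computed from the Gaussian cdf at every time step. It assumes each range has the form $[\hat{r}_t + \lambda\Delta,\ \hat{r}_t + (\lambda + 1)\Delta]$; the function name, the choice of λ indices, and the input values are illustrative assumptions rather than the authors' code.

```python
from statistics import NormalDist


def build_prob_view(inferred, delta, lambdas):
    """Return rows (t, (low, high), probability) for each time and range.

    inferred: iterable of (t, r_hat, sigma_hat) triples.
    lambdas:  range indices; which indices a query uses is an assumption here.
    """
    rows = []
    for t, r_hat, sigma_hat in inferred:
        dist = NormalDist(r_hat, sigma_hat)
        for lam in lambdas:
            low = r_hat + lam * delta
            high = r_hat + (lam + 1) * delta
            # Two cdf evaluations per range and per time step.
            rows.append((t, (low, high), dist.cdf(high) - dist.cdf(low)))
    return rows


view = build_prob_view(
    [(1, 4.0, 0.3), (2, 6.0, 3.2), (3, 7.0, 2.9)],
    delta=2,
    lambdas=[-1, 0],
)
```

The work grows with the number of time steps times the number of ranges, which is why reusing earlier results pays off for long intervals and fine granularity.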
Constraint-Aware Caching

Given: $p_t(R_t)$ and $p_{t'}(R_{t'})$ are Gaussian with $(\hat{r}_t, \hat{\sigma}_t^2)$ and $(\hat{r}_{t'}, \hat{\sigma}_{t'}^2)$.
Aim: Approximate values of $p_{t'}(R_{t'})$ by $p_t(R_t)$ when $t' > t$.

Distance constraint: guarantees that the maximum approximation error is upper bounded by the distance constraint when the cache is used.
Memory constraint: guarantees that the cache does not use more memory than that specified by the memory constraint.

[Figure: two Gaussians, $P_t(R_t; \hat{r}_t, \hat{\sigma}_t^2)$ and $P_{t'}(R_{t'}; \hat{r}_{t'}, \hat{\sigma}_t^2)$, with the same variance and different means. A range of width Δ is marked on each: $a = \hat{r}_t + \lambda\Delta$, $b = \hat{r}_t + (\lambda + 1)\Delta$ and $a' = \hat{r}_{t'} + \lambda\Delta$, $b' = \hat{r}_{t'} + (\lambda + 1)\Delta$. $\rho_\lambda$ remains unchanged.]
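The observation that $\rho_\lambda$ remains unchanged can be checked numerically: when ranges are defined relative to the mean, the probability of the λ-th range depends only on the standard deviation, not on the mean. The sketch below assumes Gaussian densities; the function name and the numbers are illustrative.

```python
from statistics import NormalDist


def range_probability(r_hat, sigma_hat, lam, delta):
    """Probability mass of [r_hat + lam*delta, r_hat + (lam+1)*delta]."""
    dist = NormalDist(r_hat, sigma_hat)
    a = r_hat + lam * delta
    b = r_hat + (lam + 1) * delta
    return dist.cdf(b) - dist.cdf(a)


# Same standard deviation, different means: the range probability is the same.
rho_t = range_probability(6.0, 3.2, lam=1, delta=2.0)
rho_t_prime = range_probability(7.0, 3.2, lam=1, delta=2.0)
```

This is why a cache only needs to be indexed by the standard deviation.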
Guaranteeing Distance Constraint

We use the Hellinger distance, denoted $H[\cdot, \cdot]$, as a distance measure, with $0 \le H \le 1$.

Theorem (Distance Constraint): Given a user-defined distance constraint $H'$, we guarantee that $H[p_t(R_t), p_{t'}(R_{t'})] \le H'$ if $\hat{\sigma}_{t'} \le d_s \cdot \hat{\sigma}_t$ and $\hat{\sigma}_{t'} > \hat{\sigma}_t$, where the parameter $d_s$ is chosen as any value satisfying

$$d_s \le \frac{1 + \sqrt{1 - \left(1 - H'^2\right)^4}}{\left(1 - H'^2\right)^2}.$$

We call $d_s$ the ratio threshold.

Example: Suppose $H' = 0.2$; then $d_s \le 1.5$. Choose, say, $d_s = 1.4$. Then if $\hat{\sigma}_{t'} / \hat{\sigma}_t \le d_s$, we have $H[p_t(R_t), p_{t'}(R_{t'})] \le 0.2$.
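The bound on the ratio threshold can be evaluated and checked against the closed-form Hellinger distance between two Gaussians. In this sketch the two densities are assumed to have equal means (ranges are taken relative to the mean), which is an assumption of the example; the function names are hypothetical.

```python
from math import exp, sqrt


def max_ratio_threshold(h_max):
    """Largest d_s allowed for a distance constraint 0 <= h_max < 1."""
    c = (1.0 - h_max ** 2) ** 2
    return (1.0 + sqrt(1.0 - c ** 2)) / c


def hellinger_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form Hellinger distance between two univariate Gaussians."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient of the two densities.
    bc = sqrt(2.0 * sigma1 * sigma2 / var_sum) * exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    return sqrt(max(0.0, 1.0 - bc))


bound = max_ratio_threshold(0.2)                       # about 1.506
distance = hellinger_gaussians(0.0, 1.0, 0.0, 1.4)     # ratio d_s = 1.4
```

With $H' = 0.2$ the bound evaluates to roughly 1.506, and a ratio of 1.4 gives a distance of about 0.17, inside the constraint.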
Initializing the σ-cache

Let $\max(\hat{\sigma}_t)$ and $\min(\hat{\sigma}_t)$ be the maximum and minimum standard deviations observed in a probabilistic view generation query.
Compute $Q$ such that $\max(\hat{\sigma}_t) = d_s^{Q} \cdot \min(\hat{\sigma}_t)$.
$\lceil Q \rceil$ gives us the maximum number of distributions that we should cache.

[Figure: cache memory. Rows of cached values for the standard deviations $d_s^{1} \cdot \min(\hat{\sigma}_t)$, $d_s^{2} \cdot \min(\hat{\sigma}_t)$, ..., $d_s^{Q} \cdot \min(\hat{\sigma}_t)$; the row width is labelled with n and Δ.]

Find $d_s^{q} \cdot \min(\hat{\sigma}_t)$ such that $d_s^{q} \cdot \min(\hat{\sigma}_t) \le \hat{\sigma}_{t'} < d_s^{q+1} \cdot \min(\hat{\sigma}_t)$.
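Both steps reduce to logarithms: $Q = \log(\max(\hat{\sigma}_t) / \min(\hat{\sigma}_t)) / \log d_s$, and the cached entry for a new $\hat{\sigma}_{t'}$ is found with a floor. The following sketch uses hypothetical function names and illustrative numbers; values that fall exactly on a boundary $d_s^{q} \cdot \min(\hat{\sigma}_t)$ may land one slot lower because of floating-point rounding.

```python
from math import ceil, floor, log


def cache_size(sigma_min, sigma_max, d_s):
    """Return (Q, ceil(Q)) with sigma_max = d_s**Q * sigma_min."""
    q = log(sigma_max / sigma_min) / log(d_s)
    return q, ceil(q)


def cache_slot(sigma, sigma_min, d_s):
    """Exponent q with d_s**q * sigma_min <= sigma < d_s**(q+1) * sigma_min."""
    return floor(log(sigma / sigma_min) / log(d_s))


# Illustrative values: standard deviations between 0.2 and 3.2, d_s = 1.4.
q_exact, q_max = cache_size(0.2, 3.2, 1.4)
slot = cache_slot(2.9, 0.2, 1.4)
```

For these numbers $Q$ is about 8.24, so at most 9 distributions are cached, and $\hat{\sigma}_{t'} = 2.9$ maps to the entry with exponent 7.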