A Probabilis+c Approach to Data Summariza+on Laurel Orr, Magdalena Balazinska, and Dan Suciu DB Research Day 2015 1
Original Database (All Flights in US) What are the most popular flights? Flights from Los Angeles FAST to San Diego Summary APPROXIMATE 2
Exis+ng Summariza+on Techniques Sampling Random, Stra+fied, Weighted,… Aggrega-on Online, Error Bounded (BlinkDB) AVG MIN MAX COUNT(*) 3
Aggrega-on Sampling Flights (origin, des+na+on, fl_+me, …) ~ 2.6 GB SELECT origin, COUNT(*) FROM Flights GROUP BY origin; Full Query Time: 20 sec SELECT * FROM Flights WHERE origin=‘SEATTLE, WA’ LIMIT 10; Full Query Time: 0.4 sec SELECT origin, COUNT(*) FROM Flights WHERE dest = ‘LAUREL, MS’ AND fl_time < 120 GROUP BY origin; Full Query Time: 30 sec 4
IDEA Find a compact, probabilis)c representa+on of our database Flights with high = Popular probability of existence By knowing the probability of rela+ons and tuples, we can answer queries probabilis+cally 5
The Simplest Summary Assume there is some concrete rela+on R(A, B), and you summarized R by its ac+ve domain and cardinality. Given this summary alone, what are the possible rela+ons R could have been (possible worlds of R)? 6
Possible World Seman+cs ac+ve domain A A A A A B a 1 a 1 a 2 a 2 4 a 1 b 1 a 1 a 2 a 1 a 2 a 2 b 2 X n = 2 B B B B slofed instance b 1 4 b 1 b 2 b 2 A B b 1 b 2 b 1 b 2 id1 Possible Instances = 16 id2 Pr( I ) = 1 X Pr( I ) = 1 16 I ∈ P W D set of all possible instances (stand for Possible WorlDs) 7
Possible World Seman+cs ac+ve domain A A A A A B a 1 a 1 a 2 a 2 4 a 1 b 1 a 1 a 2 a 1 a 2 a 2 b 2 X n = 2 B B B B slofed instance b 1 4 b 1 b 2 b 2 A B b 1 b 2 b 1 b 2 id1 Possible Instances = 16 id2 16 = 7 1 X Pr(( a 1 , b 1 )) = 16 Tuple Probability I ∈ P W D ( a 1 ,b 1 ) ∈ I 8
Adding Constraints probabilis+c instance ac+ve domain |σ R.A = a1 (R)| = 70 E [ | σ I.A = a 1 ( I ) | ] = 70 A B a 1 b 1 |σ R.A = a2 (R)| = 30 E [ | σ I.A = a 2 ( I ) | ] = 30 a 2 b 2 … . . . n = 100 E [ | σ I.A = a 1 ∧ I.B = b 1 ( I ) | ] = 40 |σ R.A = a1 ^ R.B = b1 (R)| = 40 X | σ I.A = a 1 ( I ) | Pr( I ) = 70 I ∈ P W D How can we solve for X | σ I.A = a 2 ( I ) | Pr( I ) = 30 Pr(I)? I ∈ P W D . . . X | σ I.A = a 1 ∧ I.B = b 1 ( I ) | Pr( I ) = 40 I ∈ P W D 9
Principle of Maximum Entropy The Principle of Maximum Entropy states that subject to prior data, the probability distribu+on which best represents the state of knowledge is the one that has the largest entropy In other words, you want to maximize X Pr ( I ) ∗ log( Pr ( I )) − I ∈ P W D over all possible worlds 10
More Formally R(A 1 , …, A m ), |R| = n D i = dis+nct domain of A i, Tup = {D 1 x D 2 x … x D m }, Φ = set of equality predicates φ | σ φ ( R ) | Pr ( I ) = P − n Y α φ φ ∈ Φ X Y P = α φ t ∈ T up φ ∈ Φ | φ ( t )= true all possible tuples in our ac+ve domain 11
To include constraints on each φ s R ( φ ) = | σ φ ( R ) | = E [ | σ φ ( I ) | ] We can show deriva+ve of P with s R ( φ ) = n α φ P α φ respect to α φ P To solve, maximize the poten+al func+on by gradient descent X ln( α φ ) s R ( φ ) − ln( P n ) Ψ = φ ∈ Φ 12
Query Transforma+on Aggregates: take expected value SELECT origin, COUNT(*) GROUP BY + COUNT(*) FROM Flights GROUP BY origin; For each origin o E[|σ origin=o (Flights)|] φ E [ | σ φ ( I ) | ] = n α φ P α φ P SELECT origin, E[| σ origin (Flights)|] An equa+on in terms of the α’s FROM Flights, alpha_origin,... we have calculated and stored WHERE origin=alphas.origin GROUP BY origin; 13
Op+miza+ons X Y P = α φ t ∈ T up φ ∈ Φ | φ ( t )= true all possible tuples in our ac+ve domain 1. Factorize P (solve 1D predicates independently) 2. Add relevant 2+D predicates (ex: [A = a1 ^ B = b1]) 3. Remove tuples that don’t exist X Y P ∗ = P − α φ t ∈ ( T up − R ) φ ∈ Φ | φ ( t )= true 4. Change Basis (for correla+ons) new afribute AB = f(A, B) (ex: AB = A – B) 14
Experiment with TPC-H SELECT order_date, ship_date, COUNT(*) FROM orders JOIN lineitem GROUP BY order_date, ship_date; Error = | Est − True | Est + True 15
Change Basis: order_date – ship_date 16
Conclusion • Introduced new way to summarize and approximately query massive datasets – Complements sampling and approximate aggrega+on • Allows fine grained control over which afributes and values get summarized • Encouraging preliminary results • S+ll need to befer address scalability and expand query language • Need to understand how best to choose sta+s+cs 17
Recommend
More recommend