Database Learning: Toward a Database that Becomes Smarter Every Time
Presented by: Huanyi Chen
Where does the data come from?
§ The real world
§ The entire dataset follows some underlying distribution
Income of a shop

# of Day    Income (CAD)
1           100
2           200
3           400
4           800
5           1600

[Chart: Income of a shop per day]
Income of a shop

# of Day    Income (CAD)
1           100
2           200
3           400
4           800
5           1600
6           ?

[Chart: Income of a shop per day]
Income of a shop
§ Income(n) = 50 * 2^n, (n = 1, 2, 3, …)
§ No database needed if we can find the underlying distribution

[Chart: Income of a shop per day, following Income(n) = 50 * 2^n]
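As a minimal illustration (not from the slides; the function name and the Python rendering are mine), once the underlying rule is known, any day can be answered without storing the table at all:

# Hypothetical sketch: if the rule Income(n) = 50 * 2**n is known,
# queries about any day can be answered without a stored table.
def income(day: int) -> int:
    return 50 * 2 ** day

print([income(n) for n in range(1, 6)])  # [100, 200, 400, 800, 1600] matches the stored data
print(income(6))                         # 3200: a prediction for the unseen day 6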
Which distribution do we care about?
§ The exact underlying distribution that generates the entire dataset and future unseen data? Not possible.
§ An exact distribution that generates the entire dataset but excludes future unseen data? Benefits nothing.
  § One can always build such a model by memorizing every value of a column, but that model cannot predict anything; we would still need to store future data to answer queries.
§ A plausible distribution that generates the entire dataset and future unseen data!
Mismatching data
§ A plausible distribution that generates the entire dataset and future unseen data cannot match every record in the dataset
§ Does not work when exact query results are needed
§ Works for approximate query processing (AQP)
Approximate Query Processing (AQP)
§ Trades accuracy for response time
§ Results are based on samples
§ Previous query results do not help future queries
§ Hence Database Learning: learning from past query answers!
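As a rough sketch of the AQP idea (assuming a uniform random sample and a simple AVG query; this is illustrative, not Verdict's actual code), an approximate average with an error estimate could look like:

import random
import statistics

# Hypothetical sketch of sampling-based AQP for AVG(income):
# answer from a small random sample, plus a standard-error estimate.
def approx_avg(values, sample_fraction=0.01, seed=0):
    random.seed(seed)
    k = max(2, int(len(values) * sample_fraction))
    sample = random.sample(values, k)
    est = statistics.mean(sample)
    # standard error of the sample mean (ignoring finite-population correction)
    err = statistics.stdev(sample) / (k ** 0.5)
    return est, err

values = [random.gauss(500, 100) for _ in range(100_000)]
est, err = approx_avg(values)
print(f"approx AVG = {est:.1f} +/- {err:.1f}")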
Database Learning Engine: Verdict
Target workflow
§ Improve future query answers by using previous query answers from an AQP engine
Verdict
§ A query is decomposed into possibly multiple query snippets
  § The answer of a snippet is a single scalar value
Verdict
§ A query is decomposed into possibly multiple query snippets
§ Verdict exploits potential correlations between snippet answers to infer the answer of a new snippet

# of Day    Income (CAD)
1           100
2           200
3           400
4           800
5           1600

old avg and new avg are correlated
Inference
§ Observations + Rules = Prediction

§ Shop income. Observations: 100, 200, 400, 800, 1600. Rule: Income(n) = 50 * 2^n. Prediction: 3200, 6400, …
§ Fibonacci. Observations: initial values 1, 1. Rule: F(n) = F(n-1) + F(n-2). Prediction: 2, 3, 5, 8, …
§ Verdict. Observations: past snippet answers and errors from AQP, plus the AQP answer for the new snippet. Rule: maximize the joint/conditional probability distribution function (pdf). Prediction: improved answer and error for the new snippet.
Inference: pdf
§ If we have the joint pdf f(θ_1 = a_1, …, θ_n = a_n, θ_{n+1} = a_{n+1}), then the prediction is the value of a_{n+1} that maximizes the conditional pdf f(θ_{n+1} = a_{n+1} | θ_1 = a_1, …, θ_n = a_n)
Inference: pdf
§ How to find the pdf? The maximum entropy (ME) principle
  § h(θ) = − ∫ f(θ) log f(θ) dθ, where θ = (θ_1, …, θ_{n+1})
§ The joint pdf maximizing this entropy differs depending on the kind of testable information given
  § Verdict uses the first- and second-order statistics of the random variables: means, variances, and covariances
Inference: pdf
[Slide: the maximum-entropy joint pdf under the given first- and second-order statistics; with mean and covariance constraints this is a multivariate normal distribution]
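For reference, this is the standard result (written in my own generic notation, not necessarily the slide's): given a fixed mean vector and covariance matrix, the maximum-entropy joint pdf is the multivariate normal density.

% Maximum-entropy density given mean vector \vec{\mu} and covariance matrix \Sigma
% (standard result; illustrative notation, not copied from the slide).
f(\vec{\theta}) \;=\;
  \frac{1}{(2\pi)^{(n+1)/2}\,|\Sigma|^{1/2}}
  \exp\!\Big( -\tfrac{1}{2}\,(\vec{\theta} - \vec{\mu})^{\top} \Sigma^{-1} (\vec{\theta} - \vec{\mu}) \Big),
\qquad \vec{\theta} = (\theta_1, \ldots, \theta_{n+1}).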
Inference: model-based answer and error
§ In general, computing this conditional pdf can be computationally expensive
Inference: model-based answer and error
§ However, the conditional pdf in Lemma 1 is inexpensive to compute: the result is another normal distribution, whose mean μ_{n+1} and variance σ²_{n+1} are given by the closed-form expressions on the slide
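As a sketch of the underlying linear-algebra step (standard Gaussian conditioning with numpy; the function and variable names are mine, and the numbers are made up, not Verdict's), conditioning the joint normal on the observed past answers yields the new snippet's mean and variance:

import numpy as np

# Hypothetical sketch: condition a joint multivariate normal over
# (past snippet answers theta_1..theta_n, new snippet theta_{n+1})
# on the observed past answers a_1..a_n.  Standard Gaussian conditioning.
def condition_new_snippet(mu, Sigma, observed):
    n = len(observed)                      # number of past snippets
    mu_past, mu_new = mu[:n], mu[n]
    S_pp = Sigma[:n, :n]                   # past-past covariance block
    S_np = Sigma[n, :n]                    # new-past covariance block
    S_nn = Sigma[n, n]                     # new-new variance
    w = np.linalg.solve(S_pp, observed - mu_past)
    cond_mean = mu_new + S_np @ w
    cond_var = S_nn - S_np @ np.linalg.solve(S_pp, S_np)
    return cond_mean, cond_var

# Toy usage with made-up numbers (two past snippets, one new snippet):
mu = np.array([400.0, 600.0, 500.0])
Sigma = np.array([[100.0,  60.0,  80.0],
                  [ 60.0, 120.0,  90.0],
                  [ 80.0,  90.0, 150.0]])
mean, var = condition_new_snippet(mu, Sigma, np.array([420.0, 590.0]))
print(mean, var)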
Inference: model-based answer and error
§ Model-based answer: θ̂_{n+1} = μ_{n+1}
§ Model-based error: β̂_{n+1} = σ_{n+1}
§ Improved answer and error
  § (θ̈_{n+1}, β̈_{n+1}) = (θ̂_{n+1}, β̂_{n+1}) if validation succeeds
  § (θ̈_{n+1}, β̈_{n+1}) = (θ_{n+1}, β_{n+1}) if validation fails (fall back to the AQP answer and error)
Inference: means, variances, and covariances
§ Means (of θ)
  § The arithmetic mean of the past query answers is used as the mean of each random variable θ_1, …, θ_n, θ_{n+1}
§ Variances and covariances ( Σ )
  § The covariance between two query snippet answers is computable from the covariances between the attribute values involved in computing those answers
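As an illustration of that last point (my own notation, assuming both snippets are AVG aggregates over tuple sets T_i and T_j), the covariance between two snippet answers reduces to covariances between the underlying attribute values by bilinearity:

% Covariance between two AVG-snippet answers, expressed via inter-tuple
% covariances of the aggregated attribute (illustrative notation, not the paper's).
\operatorname{Cov}(\theta_i, \theta_j)
  = \operatorname{Cov}\!\Big( \tfrac{1}{|T_i|}\sum_{s \in T_i} a_s,\;
                              \tfrac{1}{|T_j|}\sum_{t \in T_j} a_t \Big)
  = \frac{1}{|T_i|\,|T_j|} \sum_{s \in T_i} \sum_{t \in T_j} \operatorname{Cov}(a_s, a_t).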
Inference: means, variances, and covariances

# of Day    Income (CAD)
1           100
2           200
3           400
4           800
5           1600

old avg and new avg are correlated
Inference: means, variances, and covariances
Inter-tuple covariances

# of Day    Income (CAD)    Income (CAD)
1           100             100
2           200             200
3           400             400
4           800             800
5           1600            1600
Inference: means, variances, and covariances
§ Estimating the inter-tuple covariances
  § Analytical covariance functions
  § Squared exponential covariance functions: capable of approximating any continuous target function arbitrarily closely as the number of observations (here, query answers) increases
  § Allow computing the variances and covariances ( Σ ) efficiently
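As a sketch of a squared-exponential covariance function (the scale and length-scale parameters below are illustrative placeholders, not values fitted by Verdict), inter-tuple covariances over the day attribute could be modeled as:

import numpy as np

# Hypothetical sketch of a squared-exponential (SE) covariance function over
# a tuple attribute (here, the day number).  sigma2 and length_scale are
# placeholders, not Verdict's fitted parameters.
def se_cov(x1, x2, sigma2=1.0, length_scale=2.0):
    return sigma2 * np.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))

days = np.arange(1, 6)
# Covariance matrix over the five tuples: nearby days are more strongly correlated.
K = se_cov(days[:, None], days[None, :])
print(np.round(K, 3))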
Experiments
§ Up to 23× speedup for the same accuracy level
§ Small memory and computational overhead
Summary
§ An idea: Database Learning
  § Learning from past query answers
§ An implementation: Verdict
  § Given means, variances, and covariances
  § Apply the maximum entropy principle
  § Find a joint probability distribution function
  § Improve the answer and error by conditioning on past snippet answers
  § https://verdictdb.org
Q & A
§ Using testable information other than, or in addition to, means, variances, and covariances?
§ Are there other possible inference techniques?
§ Can we cut out the training phase?