Processing Forecasting Queries
Songyun Duan, Shivnath Babu
Duke University

Motivation
• Real-time forecasting of future events based on historical data is useful in many domains:
  - Proactive system management: if a performance problem is forecast, take corrective actions in advance to avoid it
  - Adaptive query processing
  - Inventory planning
  - Environmental monitoring
  - And many others
• Need a framework to process forecasting queries automatically and efficiently

Forecasting Queries
• Query syntax:
    Select X_i
    From D
    Forecast L
• $D(T, X_1, X_2, \ldots, X_n)$: historical time-series data up to timestamp $\tau$
• L is the lead time; the value of $X_i$ at time $\tau + L$ is the unknown (?)
• Denoted as Forecast(D, $X_i$, L)
• Sample data (columns T, X_1, …, X_i, …, X_n), values as shown on the slide:
    T = 1:      0.2  0.3  …  0.1  …
    T = 2:      0.4  1.3  …  2.2  …
    …
    T = τ:      0.8  1.3  …  0.1  …
    T = τ + L:  ?

An Example Forecasting Query
• Query:
    Select C
    From Usage
    Forecast 1 day
• Table: Usage
    Day   A   B   C
      9  35  25  17
     10  35  46  68
     11  13  46  16
     12  13  46  68
     13  35  46  68
     14  36  46  16
     15  35  25  16
     16  13  47  68
     17  12  25  16
     18               C = ?  (one lead time after day 17)

Example Query Processing: a Naïve Approach
• Add a class attribute C_1 (the value of C one day ahead) to each row of Usage:
    Day   A   B   C  C_1
      9  35  25  17   68
     10  35  46  68   16
     11  13  46  16   68
     12  13  46  68   68
     13  35  46  68   16
     14  36  46  16   16
     15  35  25  16   68
     16  13  47  68   16
     17  12  25  16    ?
• Learned model: $C_1 = 0.47 A + 1.18 B - 0.53 C$
• Forecast for day 18: 26.6

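As a quick check of the arithmetic, the regression printed on this slide can be applied to the day-17 row of Usage. This is only a sketch: the function name `naive_forecast` is made up for illustration, and the coefficients are copied from the slide rather than refitted.

```python
def naive_forecast(a: float, b: float, c: float) -> float:
    """Apply the slide's model C_1 = 0.47*A + 1.18*B - 0.53*C."""
    return 0.47 * a + 1.18 * b - 0.53 * c


# Day 17 of the Usage table: A = 12, B = 25, C = 16.
prediction = naive_forecast(12, 25, 16)
print(round(prediction, 2))
```

The result is about 26.66, which the slide shows as 26.6.
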
Example Query Processing
• Shifted attributes: A_{-2} (A two days earlier), B_{-1} (B one day earlier), C_1 (C one day ahead). The values below follow from the Usage table:
    Day  A_{-2}  B_{-1}  C_1
      9    --      --     68
     10    --      25     16
     11    35      46     68
     12    35      46     68
     13    13      46     16
     14    13      46     16
     15    35      46     68
     16    36      25     16
     17    35      47      ?
• Learned model: $C_1 = 1.24 A_{-2} + 0.3 B_{-1}$
• New forecast: 54 (previous prediction = 26.6)
• Slide annotation: Bayesian network

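The shifted columns can be derived mechanically from the Usage table. The sketch below assumes that Shift(X, δ) places in row t the value of X from row t + δ, so a negative δ looks back and a positive δ looks ahead. The helper name `shift` and the use of `None` for unavailable values are illustrative choices, not part of the slides.

```python
from typing import List, Optional


def shift(column: List[float], delta: int) -> List[Optional[float]]:
    """Row t of the result holds column[t + delta]; None where that row does not exist."""
    n = len(column)
    return [column[t + delta] if 0 <= t + delta < n else None for t in range(n)]


# Usage table from the slides, days 9 through 17.
days = list(range(9, 18))
A = [35, 35, 13, 13, 35, 36, 35, 13, 12]
B = [25, 46, 46, 46, 46, 46, 25, 47, 25]
C = [17, 68, 16, 68, 68, 16, 16, 68, 16]

# Each row is (Day, A_{-2}, B_{-1}, C_1).
rows = list(zip(days, shift(A, -2), shift(B, -1), shift(C, 1)))
for row in rows:
    print(row)
```

The last row, for day 17, is the tuple whose class attribute is still unknown.
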
Challenges
To process real-time forecasting queries:
• Challenge 1: generate a good processing strategy automatically and efficiently
  - Apply appropriate transformations to the data, e.g., shift, discretization, normalization, aggregation
  - Pick the right type of statistical model, e.g., multivariate linear regression (MLR), classification and regression tree (CART), Bayesian network (BN)
• Challenge 2: for continuous forecasting over streaming data, adapt processing strategies when necessary

Outline
• Space of execution plans
• Plan search algorithm
• Processing continuous forecasting queries
• Experimental evaluation
• Related work
• Summary

Execution Plan
Query: Forecast(D, $X_i$, L)
• Logical operators
  - Transformer: $D \Rightarrow D'$, e.g., Shift(X, δ)
  - Synopsis builder: $B(D, Z) \Rightarrow Syn(\{Y_1, \ldots, Y_n\} \rightarrow Z)$
  - Predictor: $P(Syn, u) \Rightarrow u.Z$
• Synopsis $Syn(\{Y_1, \ldots, Y_n\} \rightarrow Z)$, e.g., linear regression, regression tree
[Figure: time-series data D passes through transformers $T_1, \ldots, T_k$ to give D'; the synopsis builder produces a synopsis; the predictor uses the synopsis and the tuple u to produce the forecasting result.]

Sample Execution Plan
• Query:
    Select C
    From Usage
    Forecast 1 day
• Synopsis = multivariate linear regression (MLR): $C_1 = 1.24 A_{-2} + 0.3 B_{-1}$
• Operators in the plan:
  - Transformers on Usage: Shift(A, -2), Shift(B, -1), Shift(C, 1), and the projection $\pi_{(A_{-2}, B_{-1})}$, giving D'
  - MLR learner: builds the synopsis from D'
  - MLR predictor: applies the synopsis to u = (35, 47, ?) and returns 54
[Figure: the Usage table, the shifted table D', and the tuple u flowing through the operators above.]

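The plan on this slide can be traced end to end in a few lines. The sketch below is not the system's implementation. It assumes that the MLR learner is an ordinary least-squares fit without an intercept, trained on the rows where all shifted values exist (days 11 to 16); the slides do not state either detail. The helper names `shift` and `fit_two_variable_mlr` are hypothetical.

```python
from typing import List, Optional, Tuple


def shift(column: List[float], delta: int) -> List[Optional[float]]:
    """Row t of the result holds column[t + delta]; None where that row does not exist."""
    n = len(column)
    return [column[t + delta] if 0 <= t + delta < n else None for t in range(n)]


def fit_two_variable_mlr(rows: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Least-squares fit of z = w1*x + w2*y (no intercept) via the 2x2 normal equations."""
    sxx = sum(x * x for x, _, _ in rows)
    sxy = sum(x * y for x, y, _ in rows)
    syy = sum(y * y for _, y, _ in rows)
    sxz = sum(x * z for x, _, z in rows)
    syz = sum(y * z for _, y, z in rows)
    det = sxx * syy - sxy * sxy
    return (sxz * syy - sxy * syz) / det, (sxx * syz - sxy * sxz) / det


# Usage table, days 9 through 17.
A = [35, 35, 13, 13, 35, 36, 35, 13, 12]
B = [25, 46, 46, 46, 46, 46, 25, 47, 25]
C = [17, 68, 16, 68, 68, 16, 16, 68, 16]

# Transformers: Shift(A, -2), Shift(B, -1), Shift(C, 1), projected onto the shifted columns.
table = list(zip(shift(A, -2), shift(B, -1), shift(C, 1)))
train = [row for row in table if None not in row]  # complete rows only
u = table[-1]                                      # (35, 47, None): class value unknown

# Synopsis builder, then predictor.
w1, w2 = fit_two_variable_mlr(train)
forecast = w1 * u[0] + w2 * u[1]
print(round(w1, 2), round(w2, 2), round(forecast, 1))
```

Under these assumptions the fit gives coefficients near 1.14 and 0.30 and a forecast near 54.0, which agrees with the forecast of 54 on the slide. The slide prints the first coefficient as 1.24, so the slide's model may have been fitted differently; treat the exact coefficients as approximate.
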
Estimating Accuracy of a Plan
• Accuracy: how close are forecasting results to the "real values"?
• Given a dataset D and a plan P:
  - Split D into training and test data
  - Run the plan's transformers $T_1, \ldots, T_k$ and synopsis builder on the training data
  - Run the predictor on the test data and compare the actual values $a_1, \ldots, a_m$ with the predicted values $b_1, \ldots, b_m$
• Example accuracy metric: $RMSE = \sqrt{\frac{\sum_{i=1}^{m} (a_i - b_i)^2}{m}}$
• K-fold cross-validation to get an unbiased estimate

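A minimal sketch of this estimate is given below. It assumes a plan can be abstracted as a `fit` function that takes training inputs and targets and returns a `predict` function. That interface, the interleaved fold assignment, and the mean predictor used in the example are all illustrative stand-ins rather than anything specified in the slides.

```python
import math
from typing import Callable, List, Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error over m (actual, predicted) pairs."""
    m = len(actual)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / m)


def k_fold_rmse(
    xs: Sequence[float],
    ys: Sequence[float],
    k: int,
    fit: Callable[[List[float], List[float]], Callable[[float], float]],
) -> float:
    """Average RMSE over k folds; fold i holds rows i, i + k, i + 2k, ..."""
    folds = [list(range(i, len(xs), k)) for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(
            rmse([ys[i] for i in test_idx], [predict(xs[i]) for i in test_idx])
        )
    return sum(scores) / k


def fit_mean(train_x: List[float], train_y: List[float]) -> Callable[[float], float]:
    """Stand-in 'synopsis': always predicts the mean of the training targets."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 2.0, 4.0, 2.0, 4.0]
score = k_fold_rmse(xs, ys, k=3, fit=fit_mean)
print(score)
```

For time-series data, interleaved folds let training rows sit next to test rows in time, so a contiguous split may be a better fit in practice.
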
Find a Good Plan Quickly
• Optimization challenge: minimize the number of plans executed before finding a plan with high accuracy
• Efficient plan search to balance accuracy vs. running time
[Figure: accuracy of the best execution plan found so far (y-axis) against elapsed processing time from 0 to the lead time (x-axis), with curves for Algorithm 1 and Algorithm 2 and a marker where a reasonably good execution plan becomes available.]
• Simplified plan space to describe our algorithms
  - Two types of transformers: Shift and Project
  - One synopsis: Bayesian network (BN)

Fa's Plan Search (FPS) Algorithm
Query: Forecast(Usage, C, 1)
• Dataset (n = 3 attributes): A, B, C, with sample rows (1, 4, 7), (2, 5, 8), (3, 6, 9), …
• Apply Shift($X_i$, δ) for $-\Delta \le \delta < 0$ (here Δ = 2) to extend the data with A_{-1}, B_{-1}, C_{-1}, A_{-2}, B_{-2}, C_{-2}; C_1 is the class attribute
• Learn a synopsis and generate the plan?
  - Imagine n = 100, Δ = 90
  - Extended data has 9000+ attributes
  - Takes too much time to get a plan
• Attribute ranking
  - Linear-correlation-based
  - Entropy-based, e.g., information gain
• Ranked list (with respect to C_1): A_{-2}, B_{-1}, B_{-2}, A_{-1}, C, B, C_{-1}, A, C_{-2}

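The sketch below illustrates the linear-correlation-based ranking on the Usage data. It assumes that ranking means sorting the original and shifted attributes by the absolute Pearson correlation with the class attribute C_1, computed over the rows where both values exist. The slides name the ranking family but not this exact procedure, and all helper names are hypothetical. With n attributes and maximum shift Δ, the extended data has n(Δ + 1) attributes, which is 9100 for n = 100 and Δ = 90.

```python
import math
from typing import Dict, List, Optional

Column = List[Optional[float]]


def shift(column: Column, delta: int) -> Column:
    """Row t of the result holds column[t + delta]; None where that row does not exist."""
    n = len(column)
    return [column[t + delta] if 0 <= t + delta < n else None for t in range(n)]


def extend(columns: Dict[str, Column], max_shift: int) -> Dict[str, Column]:
    """Original attributes plus Shift(X, delta) for -max_shift <= delta < 0."""
    extended = dict(columns)
    for name, column in columns.items():
        for d in range(1, max_shift + 1):
            extended[f"{name}-{d}"] = shift(column, -d)
    return extended


def abs_pearson(xs: Column, ys: Column) -> float:
    """Absolute Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return abs(sxy) / math.sqrt(sxx * syy)


usage: Dict[str, Column] = {
    "A": [35, 35, 13, 13, 35, 36, 35, 13, 12],
    "B": [25, 46, 46, 46, 46, 46, 25, 47, 25],
    "C": [17, 68, 16, 68, 68, 16, 16, 68, 16],
}
class_attr = shift(usage["C"], 1)  # C_1, the value to forecast
extended = extend(usage, max_shift=2)
ranked = sorted(extended, key=lambda name: abs_pearson(extended[name], class_attr), reverse=True)
print(ranked)
```

The ranked list on the slide belongs to the slide's own toy data, so the order printed here need not match it.
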
Fa's Plan Search (FPS) Algorithm
• Shifted data (n = 3, Δ = 2): A, B, C, A_{-1}, B_{-1}, C_{-1}, A_{-2}, B_{-2}, C_{-2}, with class attribute C_1
• Ranked list: A_{-2}, B_{-1}, B_{-2}, A_{-1}, C, B, C_{-1}, A, C_{-2}
• Attribute selection
  - Fast Correlation-Based Filter (FCBF)
  - Correlation-based Feature Selection (CFS)
  - Wrapper
• Selected attribute sets (A_{-2}, B_{-1}) and (A_{-2}, C_{-1}, C_{-2}) give the candidate plans P1 and P2, with estimated accuracies Acc_1 and Acc_2:
  - Shift(A, -2), Shift(B, -1), $\pi_{A_{-2}, B_{-1}}$, BN predictor
  - Shift(A, -2), Shift(C, -1), Shift(C, -2), $\pi_{A_{-2}, C_{-1}, C_{-2}}$, BN predictor
• Stop …, otherwise increase Δ
• Do forecasting using the plan with highest accuracy

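The overall shape of such a search can be sketched as a loop that keeps the best plan found so far. This is a loose illustration, not the FPS algorithm: the chunking of the ranked list, the `select` and `accuracy` callables, the `good_enough` threshold used as the stopping rule, and the accuracy numbers in the example are all assumptions made for the sketch.

```python
from typing import Callable, Dict, Iterable, Optional, Sequence, Tuple

Plan = Tuple[str, ...]  # here a plan is identified by the attributes it projects onto


def search_plans(
    chunks: Iterable[Sequence[str]],
    select: Callable[[Sequence[str]], Plan],
    accuracy: Callable[[Plan], float],
    good_enough: float,
) -> Tuple[Optional[Plan], float]:
    """Build one plan per chunk of ranked attributes and keep the most accurate one."""
    best_plan: Optional[Plan] = None
    best_acc = float("-inf")
    for chunk in chunks:
        plan = select(chunk)   # stand-in for FCBF, CFS, or a wrapper
        acc = accuracy(plan)   # stand-in for cross-validated accuracy
        if acc > best_acc:
            best_plan, best_acc = plan, acc
        if best_acc >= good_enough:
            break              # assumed stopping rule
    return best_plan, best_acc


# Toy run: the chunks follow the slide's ranked list; the accuracies are made up.
ranked_chunks = [["A-2", "B-1", "B-2"], ["A-1", "C", "B"], ["C-1", "A", "C-2"]]
made_up_accuracy: Dict[Plan, float] = {
    ("A-2", "B-1"): 0.80,
    ("A-1", "C"): 0.90,
    ("C-1", "A"): 0.70,
}
best_plan, best_acc = search_plans(
    ranked_chunks,
    select=lambda chunk: tuple(chunk[:2]),
    accuracy=lambda plan: made_up_accuracy[plan],
    good_enough=0.85,
)
print(best_plan, best_acc)
```

Because the loop always holds a current best plan, it can be interrupted at any point and still return a usable answer.
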
Adaptive Fa's Plan Search (FPS-A)
• Continuous query: Forecast(S[W], $X_i$, L), over a window W of stream S with lead time L
• The ranked list and plans for S[W]
[Figure: ranked lists over the attributes A_{-2}, B_{-1}, A_{-1}, B, C_{-1}, A, C_{-2}, B_{-2}, C with class attribute C_1; attribute sets (A_{-2}, B_{-1}), (A_{-2}, C), (A_{-2}, C, C_{-1}), (A_{-2}, C_{-1}, C_{-2}); plan labels P0, P1, P2.]

Experimental Setting
• Target domain: system and database monitoring
• Datasets (#attributes 3~250, #instances 700~15000)
  - Aging dataset from a departmental cluster
    - Aging behavior: progressive degradation in performance
  - A real dataset parsed from logs of the 1998 World Cup web site
    - Periodic segments, characteristic of most popular web sites
  - 5 testbed datasets
    - Our testbed runs OLTP applications using MySQL
    - Simulated periodic workloads, aging behavior, and multiple resource contentions
  - 2 synthetic datasets: simulated complicated patterns to study the robustness of our algorithms

Multiple Chunks vs. One Chunk
• Accuracy metric = balanced accuracy: $BA = 1 - 0.5 \times \left( \frac{\#\text{false positives}}{\#\text{negatives}} + \frac{\#\text{false negatives}}{\#\text{positives}} \right)$
[Figure: BA of current best plan (y-axis, 0.5 to 1) against elapsed processing time in seconds (x-axis, 0 to 500), with curves for multiple chunks and one chunk. Testbed dataset, lead time = 25, n = 50+, Δ = 30.]

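The balanced accuracy formula translates directly into code. The sketch below assumes binary labels given as booleans, with `True` as the positive class; the function name and the choice to raise an error when a class is absent are illustrative decisions.

```python
from typing import Sequence


def balanced_accuracy(actual: Sequence[bool], predicted: Sequence[bool]) -> float:
    """BA = 1 - 0.5 * (false positives / negatives + false negatives / positives)."""
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("balanced accuracy needs both classes in the actual labels")
    false_pos = sum(1 for a, p in zip(actual, predicted) if not a and p)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return 1 - 0.5 * (false_pos / negatives + false_neg / positives)


actual = [True, True, False, False]
ba_one_miss = balanced_accuracy(actual, [True, False, False, False])
ba_always_positive = balanced_accuracy(actual, [True, True, True, True])
print(ba_one_miss, ba_always_positive)
```

A predictor that always answers with the same class scores 0.5 under this metric, which is why the y-axis of the accuracy plots starts near 0.5.
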
Synopsis Comparison
Each entry is BA / Time. Column groups, left to right: FPS(BN), FPS(CART), FPS(MLR), FPS(SVM), FPS(RF). Rows are listed as they appear in the slide text; for rows with fewer than five entries, the text does not show which cells are empty.
    Aging-real:         .71 / 62    .71 / 135   .64 / 36    0.51 / 1948
    FIFA-real:          .87 / 29    .85 / 37    .84 / 201
    Periodic-small-tb:  .84 / 45    .85 / 249   .80 / 130   .86 / 22339
    Multi-small-tb:     .91 / 53    .91 / 50    .85 / 19    .91 / 933
    Aging-variant-tb:   .82 / 14    .81 / 109   .80 / 24    .86 / 482    .85 / 3200
• FPS using BN or CART can achieve accuracy comparable to more sophisticated synopses
  - Lesson: it is more important to find the right transformations
• FPS using BN or CART has lower running time
• Thus, we use BN as the default synopsis

FPS vs. State-of-the-Art Synopsis
[Figure: BA of current best plan (y-axis, 0.5 to 1) against elapsed processing time in seconds on a log scale (x-axis, 10^0 to 10^3), with curves for FPS, RF-base, and RF-shifts. Synthetic dataset, lead time = 25, n = 3, Δ = 90.]

FPS-adaptive vs. FPS-nonadaptive
• Runtime overhead < 2% (lead time = 25, Δ = 90)
• Adaptability and convergence
[Figure: BA of current best plan (y-axis, 0.45 to 0.95) against number of tuples processed so far (x-axis, 0 to 5000), with curves for FPS-Adaptive and FPS-Nonadaptive.]