  1. ROBUST ONLINE AGGREGATION OF FORECASTS: APPLICATION TO ELECTRICITY LOAD FORECASTING. Pierre Gaillard. October 21, 2015, University of Copenhagen.

  2. The framework of this talk
Sequential prediction of arbitrary time series based on expert forecasts:
• a time series $y_1, \dots, y_n \in \mathbb{R}^d$ is to be predicted
• expert forecasts are available: e.g., given by some stochastic or machine-learning models (for us: black boxes)
At each forecasting instance $t = 1, \dots, n$:
• forecasting black box $k \in \{1, \dots, K\}$ provides a forecast $x_{k,t}$ of $y_t$
• we are asked to form a prediction $\hat y_t$ of $y_t$ with knowledge of
  ◦ the past observations $y_1, \dots, y_{t-1}$
  ◦ the current and past expert forecasts $(x_{k,s})_{s \le t,\ 1 \le k \le K}$
• we observe $y_t$

  3. The framework of this talk (continued)
At each forecasting instance $t = 1, \dots, n$:
• forecasting black box $k \in \{1, \dots, K\}$ provides a forecast $x_{k,t}$ of $y_t$
• typical solution: assign a weight $\hat p_{k,t}$ to each expert and predict
  $\hat y_t = \sum_{k=1}^{K} \hat p_{k,t}\, x_{k,t}$
• we observe $y_t$
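Not from the slides: a minimal Python sketch of this prediction loop, assuming a hypothetical `weight_update` rule that maps the past expert losses to a weight vector on the simplex.

```python
import numpy as np

def aggregate_online(y, X, weight_update):
    """Online aggregation loop (sketch): predict a convex combination of
    expert forecasts, observe the truth, then re-weight the experts.

    y : (n,) observations, revealed one at a time
    X : (n, K) array, X[t, k] = forecast x_{k,t} of expert k for y[t]
    weight_update : hypothetical rule mapping the (t, K) array of past
        expert losses to a weight vector on the simplex
    """
    n, K = X.shape
    p = np.full(K, 1.0 / K)              # uniform weights before any data
    preds = np.empty(n)
    for t in range(n):
        preds[t] = p @ X[t]              # hat{y}_t = sum_k p_{k,t} x_{k,t}
        past_losses = (X[: t + 1] - y[: t + 1, None]) ** 2  # square losses
        p = weight_update(past_losses)   # weights for the next round
    return preds
```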

  4. Evaluation criterion
We consider a convex loss function $\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, e.g., the square loss $\ell(x, y) = \|x - y\|^2$.
Goal: minimize our average loss
$\hat L_n = \frac{1}{n} \sum_{t=1}^{n} \ell(\hat y_t, y_t)$
Difficulty: no stochastic assumption on the time series,
- neither on the observations $(y_t)$
- nor on the expert forecasts $(x_{k,t})$
They are arbitrary and can be chosen by an adversary. If all experts are bad, good performance is hopeless ➥ relative criterion

  5. The regret: a relative criterion
We evaluate our performance relative to that of the experts:
$\underbrace{\frac{1}{n}\sum_{t=1}^{n} \hat\ell_t}_{\hat L_n \text{ (our performance)}} \;=\; \underbrace{\min_{k=1,\dots,K} \frac{1}{n}\sum_{t=1}^{n} \ell_{k,t}}_{L_n^\star \text{ (reference performance, approximation error)}} \;+\; \underbrace{\frac{1}{n}\sum_{t=1}^{n} \hat\ell_t \;-\; \min_{k=1,\dots,K} \frac{1}{n}\sum_{t=1}^{n} \ell_{k,t}}_{\mathrm{Reg}_n \text{ (average regret, estimation error)}}$
where $\hat\ell_t = \ell(\hat y_t, y_t)$ and $\ell_{k,t} = \ell(x_{k,t}, y_t)$.
Goal: perform almost as well as the best of the experts when $n \to \infty$:
$\limsup_{n \to \infty}\; \sup_{(y_t),\,(x_{k,t})} \mathrm{Reg}_n \le 0$
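As a concrete illustration (not in the slides), the average regret of a run is computed directly from the loss sequences:

```python
import numpy as np

def average_regret(agg_losses, expert_losses):
    """Average regret Reg_n = (1/n) sum_t lhat_t - min_k (1/n) sum_t l_{k,t}.

    agg_losses : (n,) losses hat{ell}_t of the aggregated forecasts
    expert_losses : (n, K) losses ell_{k,t} of the K experts
    """
    return np.mean(agg_losses) - np.min(np.mean(expert_losses, axis=0))
```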

  6. Best convex combination
A more ambitious approximation error:
$\min_{q \in \Delta_K} \frac{1}{n} \sum_{t=1}^{n} \ell\Big( \sum_{k=1}^{K} q_k\, x_{k,t},\, y_t \Big)$, where $\Delta_K = \Big\{ q \in \mathbb{R}_+^K : \sum_{k=1}^{K} q_k = 1 \Big\}$
If an expert provides inaccurate forecasts that compensate for the errors of the other expert forecasts, we should increase its weight.
➥ The gradient trick formalizes this idea.
Example for the square loss: replace the loss $(x_{k,t} - y_t)^2$ of expert $k$ by the linearized loss $(\hat y_t - y_t)(x_{k,t} - y_t)$, where $\hat y_t$ is our prediction.
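A hedged sketch of the gradient trick for the square loss: instead of feeding the aggregation rule each expert's own loss, feed it the linearized loss $\nabla\ell(\hat y_t, y_t) \cdot x_{k,t}$; regret against the best expert on these pseudo-losses then implies regret against the best convex combination on the true losses. The function name is illustrative.

```python
import numpy as np

def linearized_losses(x_t, y_t, y_hat_t):
    """Gradient trick for the square loss (sketch).

    The pseudo-loss of expert k at time t is the gradient of the loss
    at the aggregated prediction, applied to x_{k,t}:
        grad ell(hat{y}_t, y_t) * x_{k,t} = 2 (hat{y}_t - y_t) x_{k,t},
    which matches the slide's (hat{y}_t - y_t)(x_{k,t} - y_t) up to a
    factor 2 and a term that is the same for all experts.
    """
    return 2.0 * (y_hat_t - y_t) * x_t   # x_t is the (K,) forecast vector
```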

  7. Brief summary
A meta-statistical interpretation:
• expert forecasts are given by some statistical forecasting methods, each possibly tuned with a different given set of parameters; they may rely on some stochastic model
• these ensemble forecasts are then combined in a robust and deterministic manner
A trade-off: our final performance expresses these two parts:
$\hat L_n = L_n^\star + \mathrm{Reg}_n$

  8. Application: electricity load forecasting
Goal: one-day-ahead forecasting of the French electricity load.
Data characteristics:
• January 1, 2008 – August 31, 2011 as the training set
• September 1, 2011 – June 15, 2012 (excluding some special days) as the testing set
• electricity demand of EDF customers, at a half-hour step
• typical values: median = 43 496 MW, maximum = 78 922 MW
• three expert forecasters: GAM, CLR, KWF

  9. Data looks like...
[Figure: plot of the half-hourly electricity load data]

  10. Application: electricity load forecasting (continued)
Convex loss functions considered:
• square loss: $\ell(x, y) = (x - y)^2$ ➝ RMSE
• absolute percentage error: $\ell(x, y) = |x - y| / y$ ➝ MAPE
Operational constraint: one-day-ahead prediction at a half-hour step, i.e., 48 aggregated forecasts per day.
Expert forecasters:
• GAM: generalized additive models (see Wood 2006; Wood, Goude, Shaw 2014)
• CLR: curve linear regression (see Cho, Goude, Brossat, Yao 2013, 2014)
• KWF: functional wavelet-kernel approach (see Antoniadis, Paparoditis, Sapatinas 2006; Antoniadis, Brossat, Cugliari, Poggi 2012, 2013)
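For concreteness (not from the slides), the two evaluation metrics as a minimal sketch:

```python
import numpy as np

def rmse(y_hat, y):
    """Root mean squared error, in the units of the data (here MW)."""
    return np.sqrt(np.mean((y_hat - y) ** 2))

def mape(y_hat, y):
    """Mean absolute percentage error, as a fraction (x100 for percent)."""
    return np.mean(np.abs(y_hat - y) / y)
```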

  11. How good are our experts?
Losses: RMSE and MAPE on the testing set (with no warm-up period):
$\sqrt{\frac{1}{n}\sum_{t=1}^{n} (\hat y_t - y_t)^2}$ and $\frac{1}{n}\sum_{t=1}^{n} \frac{|\hat y_t - y_t|}{y_t}$
We look at the performance of the oracles:

            Uniform mean   Best forecaster   Best convex p   Best linear u
RMSE (MW)        725             744              629             629
MAPE (%)        1.18            1.29             1.06            1.06
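These oracles are computed in hindsight. A sketch of the best convex combination under the square loss, using a generic constrained optimizer (a simplex-projected solver would work as well); the function name is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def best_convex_oracle(y, X):
    """Best fixed convex weights in hindsight for the square loss:
    min over q in the simplex of (1/n) sum_t (q . X[t] - y[t])^2."""
    K = X.shape[1]
    avg_loss = lambda q: np.mean((X @ q - y) ** 2)
    res = minimize(avg_loss, np.full(K, 1.0 / K), method="SLSQP",
                   bounds=[(0.0, 1.0)] * K,
                   constraints={"type": "eq", "fun": lambda q: q.sum() - 1.0})
    return res.x, np.sqrt(res.fun)   # oracle weights and their RMSE
```

Dropping the nonnegativity bounds and the sum-to-one constraint would give, presumably, the "best linear u" oracle of the table.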

  12. A strategy to pick the convex weights
The exponentially weighted average forecaster (EWA)
Parameter: $\eta > 0$. Initialization: $\hat p_1 = (1/K, \dots, 1/K)$.
At each time step $t$, we assign to expert $k$ the weight
$\hat p_{k,t} = \frac{\exp\big( -\eta \sum_{s=1}^{t-1} \ell_{k,s} \big)}{\sum_{j=1}^{K} \exp\big( -\eta \sum_{s=1}^{t-1} \ell_{j,s} \big)}$
Performance: if the loss is convex and bounded by $B$,
$\mathrm{Reg}_n \stackrel{\text{def}}{=} \frac{1}{n}\sum_{t=1}^{n} \hat\ell_t - \min_k \frac{1}{n}\sum_{t=1}^{n} \ell_{k,t} \;\le\; \frac{\log K}{\eta n} + \frac{\eta B^2}{8} \;\le\; B\sqrt{\frac{\log K}{2n}} \quad \text{for } \eta = B^{-1}\sqrt{\frac{8 \log K}{n}}$

  13. A strategy to pick the convex weights (incremental form)
The exponentially weighted average forecaster (EWA)
Parameter: $\eta > 0$. Initialization: $\hat p_1 = (1/K, \dots, 1/K)$.
At each time step $t$, we assign to expert $k$ the weight (equivalent to the previous slide, computed incrementally)
$\hat p_{k,t} = \frac{\hat p_{k,t-1}\, e^{-\eta \ell_{k,t-1}}}{\sum_{j=1}^{K} \hat p_{j,t-1}\, e^{-\eta \ell_{j,t-1}}}$
Performance: if the loss is convex and bounded by $B$,
$\mathrm{Reg}_n \le \frac{\log K}{\eta n} + \frac{\eta B^2}{8} \le B\sqrt{\frac{\log K}{2n}} \quad \text{for } \eta = B^{-1}\sqrt{\frac{8 \log K}{n}}$
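A minimal sketch of EWA in this incremental form, here instantiated with the square loss (not from the slides):

```python
import numpy as np

def ewa_forecaster(y, X, eta):
    """Exponentially weighted average forecaster, incremental form:
    p_{k,t} proportional to p_{k,t-1} * exp(-eta * loss_{k,t-1})."""
    n, K = X.shape
    p = np.full(K, 1.0 / K)
    preds = np.empty(n)
    for t in range(n):
        preds[t] = p @ X[t]                  # aggregated prediction
        losses = (X[t] - y[t]) ** 2          # expert losses at time t
        p = p * np.exp(-eta * (losses - losses.min()))  # shift: stable,
        p /= p.sum()                         # leaves the ratio unchanged
    return preds
```

With the theoretical tuning of the slide, one would call, e.g., `ewa_forecaster(y, X, eta=np.sqrt(8 * np.log(K) / n) / B)`.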

  14. Proof: let's do some maths...
Lemma (Hoeffding). Let $X$ be a random variable taking values in $[0, B]$. Then for any $s \in \mathbb{R}$,
$\log \mathbb{E}\big[ e^{sX} \big] \le s\, \mathbb{E}[X] + \frac{s^2 B^2}{8}$
1. Upper-bound the instantaneous loss $\hat\ell_t$:
$\hat\ell_t = \ell(\hat p_t \cdot x_t, y_t) \le \hat p_t \cdot \ell(x_t, y_t)$   (by convexity)
$\le -\frac{1}{\eta} \log\Big( \sum_{k=1}^{K} \hat p_{k,t}\, e^{-\eta \ell_{k,t}} \Big) + \frac{\eta B^2}{8}$   (by Hoeffding)
$= \ell_{k,t} + \frac{1}{\eta} \log \frac{\hat p_{k,t+1}}{\hat p_{k,t}} + \frac{\eta B^2}{8}$   (by definition of $\hat p_{k,t+1} = \hat p_{k,t}\, e^{-\eta \ell_{k,t}} / \sum_j \hat p_{j,t}\, e^{-\eta \ell_{j,t}}$, for every $k$)
2. Sum over all $t$; the sum telescopes:
$\sum_{t=1}^{n} \big( \hat\ell_t - \ell_{k,t} \big) \le \frac{1}{\eta} \log \frac{\hat p_{k,n+1}}{\hat p_{k,1}} + \frac{\eta n B^2}{8} \le \frac{\log K}{\eta} + \frac{\eta n B^2}{8}$
since $\hat p_{k,n+1} \le 1$ and $\hat p_{k,1} = 1/K$; dividing by $n$ gives the bound.
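The bound is deterministic (it holds for any bounded loss sequence), so it can be sanity-checked numerically on the mixture loss $\hat p_t \cdot \ell_t$ that the proof actually controls. A sketch with arbitrary random losses, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, B = 2000, 5, 1.0
losses = rng.uniform(0.0, B, size=(n, K))        # arbitrary losses in [0, B]

eta = np.sqrt(8.0 * np.log(K) / n) / B           # theoretical tuning
cum = np.zeros(K)
mixture_loss = 0.0
for t in range(n):
    w = np.exp(-eta * (cum - cum.min()))         # EWA weights at time t
    w /= w.sum()
    mixture_loss += w @ losses[t]                # p_t . ell_t
    cum += losses[t]

regret = mixture_loss / n - cum.min() / n
bound = B * np.sqrt(np.log(K) / (2.0 * n))
print(f"regret = {regret:.4f} <= bound = {bound:.4f}")   # holds for any sequence
```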

  15. Calibration of η
Best theoretical value: $\eta^\star = B^{-1}\sqrt{\frac{8 \log K}{n}}$
Issue: $n$ and $B$ are not known in advance!
Solutions:
• "doubling trick"
• adaptive learning rate $\eta_t$ picked according to some theoretical value
• use several learning rates simultaneously...
• calibrate on a grid by choosing $\eta_t \in \arg\min_{\eta} \big\{ \text{loss of exp. weights with } \eta \text{ until time } t-1 \big\}$
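A sketch of the last bullet (the grid calibration), not from the slides: run EWA in parallel for every η in a grid and, at each step, follow the instance with the smallest cumulative loss so far. The function name is illustrative; the square loss is assumed.

```python
import numpy as np

def grid_calibrated(y, X, etas):
    """At each t, follow the eta in the grid whose EWA instance has the
    smallest cumulative loss on rounds 1..t-1 (ties broken arbitrarily)."""
    n, K = X.shape
    E = len(etas)
    cum_expert = np.zeros(K)                  # cumulative expert losses
    cum_loss = np.zeros(E)                    # cumulative loss of each EWA(eta)
    preds = np.empty(n)
    for t in range(n):
        # weights of all parallel EWA instances at time t, one row per eta
        w = np.exp(-np.outer(etas, cum_expert - cum_expert.min()))
        w /= w.sum(axis=1, keepdims=True)     # (E, K) weight matrix
        p_all = w @ X[t]                      # (E,) candidate predictions
        preds[t] = p_all[np.argmin(cum_loss)] # follow the best eta so far
        cum_loss += (p_all - y[t]) ** 2
        cum_expert += (X[t] - y[t]) ** 2
    return preds
```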

  16. Application to electricity load forecasting (continued)
Benchmark and oracles (RMSE, MW):

            Uniform mean   Best forecaster   Best convex p   Best linear u
RMSE (MW)        725             744              629             629

vs. aggregated forecasts with convex weights (RMSE, MW):
Exp. weights (best η for theory)        644
Exp. weights (best η on data)           644
Exp. weights (best η tuned on data)     625
ML-Poly (tuned according to theory)     626

  17. Evolution of the weights
No focus on a single member! The weights change significantly over time and do not converge, illustrating that the performance of the forecasters varies over time.

  18. Are all forecasters useful? Definitely yes!
Keeping all 3 forecasters vs. keeping only the best 2 (RMSE, MW):
Exp. weights   625 ➝ 644
ML-Poly        626 ➝ 646
Forecasters that currently receive little weight can come back later if needed.

  19. Conclusion
This was only a small glimpse into the work performed during my PhD at EDF R&D. I applied the method to many other data sets with good results ➝ universality of the method.
Here, with Olivier, we aim to work on:
• a huge number of experts ➝ sparse and efficient methods
• better calibration of the learning parameter to get faster rates
• lower bounds
• probabilistic forecasts by using the pinball loss
Thanks
