Data science and engineering for local weather forecasts
Nikhil R Podduturi
Data {Scientist, Engineer}
November 2016
Agenda
● About MeteoGroup
● Introduction to weather data
● Problem description
● Data science and weather forecasting
● Engineering
● Verification
● Results
● Questions
How many of you check weather forecasts frequently?
Weather data
1.5 TB/day
Types of data
Observations:
● WMO weather stations (e.g. surface, upper-air, ships, drifting buoys, aircraft)
● MeteoGroup measurement network
Satellite data
Radar data
User data
Numerical weather prediction (NWP) model data
Numerical weather prediction models
● Complex and multidimensional data
● 5 NWP models from different providers
● Data size per day: 0.5 TB
Data science and weather forecasting
Outcome
● Took 24 hours to produce a 24-hour forecast
● Grid interval: 736 km
● Poor results
MeteoGroup forecasting system
MeteoGroup forecasting system
3 years of NWP data + 3 years of observation data → Machine learning model → Trained model
Daily NWP data → Trained model → Forecasts
MeteoGroup forecasting system
● Written in Pascal
● Runs on an in-house high-performance computing cluster
Limitations
● Hard to maintain
● Not very transparent
● Limited scalability
Problem description
Next-generation forecasting system
● Cloud-based solution
● Transparent
● Scalable
● Improve forecasting accuracy
Baseline model
NWP data → Downscale to location → Interpolate missing values → Linear model
Outcome:
● Very fast
● Poor accuracy
● Multicollinearity
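The three baseline steps can be sketched in plain Python. This is a minimal illustration, not MeteoGroup's implementation: the slides do not say which downscaling or interpolation method was used, so bilinear interpolation on a grid cell, linear gap filling in time, and a one-predictor least-squares fit are all assumptions, and every function name is hypothetical.

```python
from statistics import mean


def downscale_bilinear(q00, q10, q01, q11, fx, fy):
    """Interpolate four grid-corner values to a point inside the cell.

    fx and fy are the fractional position (0..1) of the location
    between the west/east and south/north grid lines.
    """
    south = q00 * (1 - fx) + q10 * fx
    north = q01 * (1 - fx) + q11 * fx
    return south * (1 - fy) + north * fy


def fill_gaps(series):
    """Replace None values by linear interpolation between known neighbours.

    Gaps at either end copy the nearest known value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] * (1 - w) + series[right] * w
    return filled


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx
```

In this sketch the NWP value at a station would come from `downscale_bilinear`, gaps in the resulting time series would be closed with `fill_gaps`, and `fit_line` would map the NWP predictor onto the observed value.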
Iteration 1
● Address multicollinearity using feature selection
● Scale the features
NWP data → Downscale to location → Interpolate missing values → Scale features → Feature selection → Linear model
Outcome:
● Improved accuracy
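One simple way to combine scaling with a multicollinearity check is to standardize each feature and then drop any feature that is almost perfectly correlated with one already kept. The slides do not name the feature selection method actually used, so the correlation filter below is an assumption, and the threshold and function names are hypothetical.

```python
from math import sqrt


def standardize(column):
    """Scale a feature to zero mean and unit variance.

    Assumes the column is not constant (its variance is non-zero).
    """
    n = len(column)
    mu = sum(column) / n
    sd = sqrt(sum((v - mu) ** 2 for v in column) / n)
    return [(v - mu) / sd for v in column]


def pearson(a, b):
    """Pearson correlation, computed as the mean product of z-scores."""
    za, zb = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(za, zb)) / len(a)


def drop_collinear(features, threshold=0.95):
    """Keep a feature only if |r| with every kept feature is below threshold.

    `features` maps a feature name to its list of values; features are
    visited in insertion order, so earlier features win ties.
    """
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return list(kept)
```

A filter like this only catches pairwise correlation; it is a cheap first pass rather than a full treatment of multicollinearity.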
Iteration 2
● Model selection between linear and non-linear models
● Advanced feature selection
NWP data → Downscale to location → Interpolate missing values → Scale features → Advanced feature selection → Model selection (linear and non-linear models)
Outcome:
● On par with existing forecasting system
● Slow training
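Model selection of this kind can be sketched as fitting each candidate on training data and keeping the one with the lowest error on held-out data. The slides do not say which non-linear models or which selection criterion were used; the k-nearest-neighbour regressor and the validation mean absolute error below are stand-ins chosen for brevity, and all names are hypothetical.

```python
def fit_linear(xs, ys):
    """Least-squares line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k=2):
    """Non-linear stand-in: average the targets of the k nearest points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def validation_mae(predict, xs, ys):
    """Mean absolute error of a prediction function on held-out data."""
    return sum(abs(predict(x) - y) for x, y in zip(xs, ys)) / len(xs)


def select_model(fitters, train, valid):
    """Fit every candidate on `train`, score on `valid`, return the best."""
    scores = {
        name: validation_mae(fit(*train), *valid)
        for name, fit in fitters.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


FITTERS = {"linear": fit_linear, "knn": fit_knn}
```

Because every candidate has to be fitted before it can be scored, training cost grows with the number of candidates, which is consistent with the "slow training" outcome noted on the slide.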
Engineering to scale the product
Baseline model engineering (Scikit-learn, NumPy, Keras with TensorFlow)
Model engineering (Scikit-learn, NumPy, Keras with TensorFlow)
Good:
● Python ML ecosystem
● Familiarity among the team
● Test-driven and agile development
● Fail fast
Bad:
● Not scalable
47000 * 15 * 360 model runs
● 47000 locations
● 15 weather attributes (e.g. temperature, wind)
● 360 hours
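The product works out to 253.8 million model runs, which is why a single machine does not suffice. The sketch below computes that total and shows one possible way to enumerate the (location, attribute, hour) combinations lazily and cut them into batches for distribution. The batching scheme and all names are assumptions for illustration; the slides do not describe how the work was actually partitioned.

```python
from itertools import islice, product

N_LOCATIONS = 47_000
N_ATTRIBUTES = 15
N_HOURS = 360

# One model run per (location, weather attribute, hour) combination.
total_runs = N_LOCATIONS * N_ATTRIBUTES * N_HOURS


def run_keys(locations, attributes, hours):
    """Lazily yield every (location, attribute, hour) combination."""
    return product(locations, attributes, hours)


def batches(keys, size):
    """Group an iterable of keys into lists of at most `size` items."""
    iterator = iter(keys)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

For example, `batches(run_keys(range(2), ["temperature", "wind"], range(3)), 5)` splits 12 combinations into batches of 5, 5, and 2.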
Scaling with Apache Airflow
Apache Airflow
• By Airbnb
• Apache Incubator project since early 2016
Directed Acyclic Graph (DAG)
Components
• UI
• Scheduler
• Executor(s)
Apache Airflow DAG
● Hooks (connections)
● Operators (tasks)
● Schedule
● Dependencies
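The idea behind tasks and dependencies in a DAG can be shown without Airflow itself. The sketch below is not Airflow's API: it uses Python's standard `graphlib` module (Python 3.9+) to derive a valid execution order from declared dependencies, which is roughly the decision a scheduler has to make. The task names are hypothetical and do not come from the slides.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
# Task names are illustrative only.
DEPENDENCIES = {
    "fetch_nwp": set(),
    "fetch_observations": set(),
    "train_model": {"fetch_nwp", "fetch_observations"},
    "forecast": {"train_model"},
    "verify": {"forecast"},
}


def execution_order(dependencies):
    """Return the tasks in an order that respects every dependency.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Tasks with no path between them, such as the two fetch steps here, may run in parallel, which is what makes the DAG form useful for scaling out.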
Airflow and Mesos
(Diagram: continuous integration, Airflow scheduler, Mesos cluster, and AWS S3, linked by "deploy" and "persist" steps)
Verification
Model improvement cycle
Deploy DAG → Verify model → Improve DAG → (repeat)
Forecast verification
(Diagram: forecast engine, AWS S3 with models, JSON-LD)
Verification metrics
● Mean absolute error
● Root mean squared error
● Mean error
● Heidke skill score
● Equitable threat score
● Probability density functions
● Error percentiles
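The first five metrics have standard textbook definitions, sketched below in plain Python. The continuous scores take paired forecast and observation values; the two skill scores take the counts of a 2x2 contingency table for a yes/no event (hits, false alarms, misses, correct negatives). This is an illustration of the definitions, not the code of the verification service, and the function names are hypothetical.

```python
from math import sqrt


def mean_error(forecasts, observations):
    """Bias: the average of (forecast - observation)."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    return sum(errors) / len(errors)


def mean_absolute_error(forecasts, observations):
    """Average magnitude of the forecast error."""
    errors = [abs(f - o) for f, o in zip(forecasts, observations)]
    return sum(errors) / len(errors)


def root_mean_squared_error(forecasts, observations):
    """Square root of the mean squared error; penalizes large misses."""
    squared = [(f - o) ** 2 for f, o in zip(forecasts, observations)]
    return sqrt(sum(squared) / len(squared))


def heidke_skill_score(hits, false_alarms, misses, correct_negatives):
    """Fraction correct relative to chance, for a 2x2 contingency table."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    numerator = 2 * (a * d - b * c)
    denominator = (a + c) * (c + d) + (a + b) * (b + d)
    return numerator / denominator


def equitable_threat_score(hits, false_alarms, misses, correct_negatives):
    """Threat score with the hits expected by chance removed."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    n = a + b + c + d
    hits_random = (a + b) * (a + c) / n
    return (a - hits_random) / (a + b + c - hits_random)
```

Mean error shows systematic bias, while mean absolute error and root mean squared error measure overall error size, so reporting them together separates "consistently too warm" from "noisy but unbiased".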
Mean absolute error for different models (Temperature)
Probability distribution function for multiple models (Temperature)
Percentile graphs for each model (Temperature)
For a demo, please stop by the MeteoGroup booth
Results
Cloud-based solution
● AWS S3, EC2, ElastiCache
Transparent
● Verification microservice
Scalable
● Mesos cluster
● Training time cut from a month to approximately 5 hours
Improve forecasting accuracy
● On par or better
Improvements
● Hyperlocal
● AWS Lambda integration
● Iterate for more accuracy
Questions?
We are hiring!