Welcome to DS504/CS586: Big Data Analytics Application I Prof. Yanhua Li Time: 6:00pm–8:50pm R Location: KH 116 Fall 2017
• 16 critiques & Next Thur we have the last critique. – Already graded 4 of them. – Plan to grade 1-2 more.
• Grading – Projects (40%) • Project 1 (10%) • Project 2 (30%) – Final reports in the discussion forum (by 11:59pm 12/12 Tue); – Self-and-peer evaluation form for project 2 (by 11:59pm 12/12 Tue); – Written work (30%): • Critiques + Project reports (20%) • Quiz (10%, with 5% each) – Oral work (30%): • Presentation (project presentation + reading assignment presentation)
• Final Project Presentation – 20 minutes each group (including Q&A and transition) – Schedule: • 12/14 Thu – Last week presentation date for all 7 teams – We will have snacks and soda.
Next Class: Summary and Discussion • Review of the semester • Plus the last critique/review
[Figure: urban computing framework with four layers]
• Urban Sensing & Data Acquisition: Participatory Sensing, Crowd Sensing, Mobile Sensing
• Urban Data Management: Spatio-temporal index, streaming, trajectory, and graph data management, ...
• Urban Data Analytics: Data Mining, Machine Learning, Visualization
• Service Providing: Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...
Data sources: Human mobility, Meteorology, Road Networks, Air Quality, Social Media, Energy, POIs, Traffic
• Urban Computing, Social Network Analysis, Networking • Graph Mining, Data Clustering, Recommender systems • Indexing, Query Processing • Error Correction, Map-Matching • Representative data collection: Sampling
Urban Computing: concepts, methodologies, and applications. Zheng, Y., et al. ACM Transactions on Intelligent Systems and Technology.
Real-world problems are always messy • Multiple models • Key features • Data Sparsity
• What do we do to solve a classification/inference/prediction problem? – Data Cleaning – Feature selection – Inference model – Evaluation • An example of how to solve a real-world application problem
U-Air: When Urban Air Quality Meets Big Data Authors: Yu Zheng, Microsoft Research Asia
Background • Air quality – NO2, SO2 – Aerosols: PM2.5, PM10 • Why it matters – Healthcare – Pollution control and dispersal • Reality – Building a measurement station is not easy – A limited number of stations (poor coverage) Beijing only has 22 air quality monitor stations in its urban areas (50 km × 40 km) [Figure: air quality monitor station]
2PM, June 17, 2013
Challenges • Air quality varies by location non-linearly • Affected by many factors – Weather, traffic, land use… – Subtle to model with a clear formula
[Figure: distribution of the deviation of PM2.5 between stations S12 and S13, A) Beijing (8/24/2012 - 3/8/2013); x-axis: deviation of PM2.5 (0 to 480), y-axis: proportion (0.00 to 0.30); annotation: >35%]
We do not really know the air quality of a location without a monitoring station!
Challenges • Existing methods do not work well – Linear interpolation – Classical dispersion models • Gaussian Plume models and Operational Street Canyon models • Many parameters difficult to obtain: Vehicle emission rates, street geometry, the roughness coefficient of the urban surface… – Satellite remote sensing • Suffers from clouds • Does not reflect ground air quality • Varies with humidity, temperature, location, and season – Outsourced crowd sensing using portable devices • Limited to a few gases: CO2 and CO • Sensors for detecting aerosols are not portable: PM10, PM2.5 • A long sensing process, 1-2 hours [Figure: aerosol sensor; 30,000+ USD, 10 µg/m³, 202×85×168 (mm)]
Inferring Real-Time and Fine-Grained air quality throughout a city using Big Data Meteorology Traffic Human Mobility POIs Road networks Historical air quality data Real-time air quality reports
Applications • Location-based air quality awareness – Fine-grained pollution alert – Routing based on air quality • Deploying new monitoring stations • A step towards identifying the root cause of air pollution [Figure: map of monitoring stations S1 to S10]
Difficulties • 1. How to identify features from each kind of data source • 2. Incorporate multiple heterogeneous data sources into a learning model – Spatially-related data: POIs, road networks – Temporally-related data: traffic, meteorology, human mobility • 3. Data sparseness (little training data) – Limited number of stations – Many places to infer
Methodology Overview • Partition a city into disjoint grids • Extract features for each grid from its affecting region – Meteorological features – Traffic features – Human mobility features – POI features – Road network features • Co-training-based semi-supervised learning model for each pollutant – Predict the AQI labels – Data sparsity – Two classifiers
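The first step above, partitioning a city into disjoint grids, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 50 km × 40 km bounding box is taken from the Background slide, while the 1 km cell size, the planar kilometre coordinates, and the function names `grid_cell` and `grid_shape` are assumptions made for the example.

```python
import math

# Assumed bounding box (from the Background slide) and a hypothetical cell size.
WIDTH_KM = 50.0
HEIGHT_KM = 40.0
CELL_KM = 1.0

def grid_cell(x_km, y_km, cell_km=CELL_KM):
    """Map a point, given in km from the south-west corner, to its (row, col) cell."""
    if not (0 <= x_km < WIDTH_KM and 0 <= y_km < HEIGHT_KM):
        raise ValueError("point lies outside the bounding box")
    return int(y_km // cell_km), int(x_km // cell_km)

def grid_shape(cell_km=CELL_KM):
    """Number of (rows, cols) needed to cover the bounding box."""
    return math.ceil(HEIGHT_KM / cell_km), math.ceil(WIDTH_KM / cell_km)
```

Every point falls in exactly one cell, so the cells are disjoint by construction; features are then extracted per cell.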
Meteorological Features: F_m • Rainy, Sunny, Cloudy, Foggy • Wind speed • Temperature • Humidity • Barometer pressure [Figure: AQI of PM10 by level (Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy), August to Dec. 2012 in Beijing]
Traffic Features: F_t • Distribution of speed by time: F(v), over the ranges 0 ≤ v < 20, 20 ≤ v < 40, v ≥ 40 km/h • Expectation of speed: E(v) • Standard deviation of speed: D(v) GPS trajectories generated by over 30,000 taxis from August to Dec. 2012 in Beijing [Figure: E(v) vs. D(v), by AQI level (Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy)]
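The three traffic features can be computed from the speeds observed in a grid's affecting region during one time slot. The sketch below assumes the speeds are already available as a list in km/h; the function name `traffic_features` and the use of the population standard deviation are choices made for the example, not details given on the slide.

```python
import math

def traffic_features(speeds_kmh):
    """Return (F, E, D): speed distribution over three ranges, mean, standard deviation."""
    n = len(speeds_kmh)
    if n == 0:
        raise ValueError("need at least one speed sample")
    counts = [0, 0, 0]  # 0 <= v < 20, 20 <= v < 40, v >= 40
    for v in speeds_kmh:
        if v < 20:
            counts[0] += 1
        elif v < 40:
            counts[1] += 1
        else:
            counts[2] += 1
    distribution = [c / n for c in counts]
    mean = sum(speeds_kmh) / n
    variance = sum((v - mean) ** 2 for v in speeds_kmh) / n  # population variance
    return distribution, mean, math.sqrt(variance)
```

For example, the speeds 10, 30, 50 and 50 km/h give the distribution [0.25, 0.25, 0.5] and a mean of 35 km/h.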
Human Mobility Features: F_h • Human mobility implies – Traffic flow – Land use of a location – Function of a region (like residential or business areas) • Features: number of arrivals f_a and leavings (departures) f_l [Figure: f_a vs. f_l by AQI level (Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy); A) AQI of PM10, B) AQI of NO2] Parks vs factories
Extracting Traffic/Human Mobility Features • Offline spatio-temporal indexing • t_a: arrival time • Traj: trajectory ID • I_i: the index of the first GPS point (in the trajectory) entering a grid • I_o: the index of the last GPS point (in the trajectory) entering the grid
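One way to read the index above is that each grid keeps a time-ordered list of records (t_a, Traj, I_i, I_o), one per visit of a trajectory to that grid. The sketch below builds such an index under several assumptions: each GPS point has already been mapped to a grid cell, I_o is taken to be the last point of the visit that still lies in the grid, and the name `build_grid_index` and the input layout are hypothetical.

```python
from collections import defaultdict

def build_grid_index(trajectories):
    """Build cell -> sorted list of (t_a, traj_id, i_in, i_out) records.

    trajectories: dict mapping traj_id -> list of (timestamp, cell) in time order.
    """
    index = defaultdict(list)
    for traj_id, points in trajectories.items():
        start = 0
        for i in range(1, len(points) + 1):
            # Close the current visit when the cell changes or the trajectory ends.
            if i == len(points) or points[i][1] != points[start][1]:
                t_a, cell = points[start]
                index[cell].append((t_a, traj_id, start, i - 1))
                start = i
    for records in index.values():
        records.sort()  # order each cell's records by arrival time
    return dict(index)
```

With the records sorted by arrival time, the visits to a grid within a time window can be found without rescanning whole trajectories.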
POI Features: F_p • Why POI – Indicate the land use and the function of the region – The traffic patterns in the region • Features – Numbers of POIs over categories – Portion of vacant places – The changes in the number of POIs • Factories, shopping malls, hotels and real estate • Parks, decoration and furniture markets
Road Network Features: F_r • Why road networks – Have a strong correlation with traffic flows – A good complement to traffic modeling • Features – Total length of highways f_h – Total length of other (low-level) road segments f_r – The number of intersections f_s in the grid's affecting region
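The three road network features reduce to two sums and a count once the road segments and intersections inside a grid's affecting region are known. The sketch below assumes that selection has already been done; the input layout (pairs of length and a highway flag) and the name `road_features` are hypothetical.

```python
def road_features(segments, intersections):
    """Return (f_h, f_r, f_s) for one affecting region.

    segments: iterable of (length_km, is_highway) pairs inside the region.
    intersections: collection of intersection identifiers inside the region.
    """
    segments = list(segments)  # allow generators, since we iterate twice
    f_h = sum(length for length, is_highway in segments if is_highway)
    f_r = sum(length for length, is_highway in segments if not is_highway)
    f_s = len(intersections)
    return f_h, f_r, f_s
```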
Semi-Supervised Learning Model • Philosophy of the model – States of air quality • Temporal dependency in a location • Geo-correlation between locations – Generation of air pollutants • Emission from a location • Propagation among locations – Two sets of features • Spatially-related • Temporally-related
[Figure: locations s1 to s4 and l in geo-space at times t1, t2, ..., ti; legend: a location with AQI labels, a location to be inferred, temporal dependency, spatial correlation]
Co-Training • Spatial classifier (spatial features): Road networks F_r, POIs F_p • Temporal classifier (temporal features): Traffic F_t, Meteorology F_m, Human mobility F_h
Co-Training-Based Learning Model • Spatial classifier – Model the spatial correlation between AQI of different locations – Using spatially-related features – Based on a neural network • Input generation – Select n stations to pair with – Perform m rounds
[Figure: input generation feeding an ANN; symbols include F_p, F_r, ΔP_kx, ΔR_kx, d_kx, c_k, weights w, w' and biases b, b', b'']
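The general shape of co-training can be sketched as follows: two classifiers, each seeing a different feature view, take turns labeling the unlabeled example they are most confident about, and each new label is added to the shared training set. This is a generic illustration under loud assumptions: a nearest-centroid classifier stands in for both views (the slides use a neural network for the spatial classifier), confidence is a simple distance margin, at least one labeled example must exist, and all function names are hypothetical.

```python
def centroids(rows, labels):
    """Per-class mean vector of the labeled feature rows."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict(model, row):
    """Return (label, confidence); confidence is the margin between the two nearest centroids."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, centre)), label)
        for label, centre in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

def co_train(view_a, view_b, labels, rounds=5, per_round=1):
    """Fill in labels that are None; view_a and view_b are two feature views of the same items."""
    labels = list(labels)
    for _ in range(rounds):
        for view in (view_a, view_b):  # the two classifiers take turns
            known = [i for i, l in enumerate(labels) if l is not None]
            unknown = [i for i, l in enumerate(labels) if l is None]
            if not unknown:
                return labels
            model = centroids([view[i] for i in known], [labels[i] for i in known])
            scored = sorted(
                ((predict(model, view[i]), i) for i in unknown),
                key=lambda item: -item[0][1],  # most confident first
            )
            for (label, _), i in scored[:per_round]:
                labels[i] = label  # the new label is visible to the other view next turn
    return labels
```

Because each classifier only commits to its most confident predictions, a small labeled set (few stations) can be grown gradually using the many unlabeled locations.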
Artificial Neural Networks (ANN)
[Figure: black box with inputs X1, X2, X3 and output Y]
X1 X2 X3 | Y
 1  0  0 | 0
 1  0  1 | 1
 1  1  0 | 1
 1  1  1 | 1
 0  0  1 | 0
 0  1  0 | 0
 0  1  1 | 1
 0  0  0 | 0
Output Y is 1 if at least two of the three inputs are equal to 1.
Artificial Neural Networks (ANN)
[Figure: input nodes X1, X2, X3, each linked with weight 0.3 to an output node Σ with threshold t = 0.4, producing Y; same truth table as on the previous slide]
$Y = I(0.3X_1 + 0.3X_2 + 0.3X_3 - 0.4 > 0)$, where $I(z) = 1$ if $z$ is true and $0$ otherwise.
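A quick way to check that these weights and this threshold reproduce the "at least two of three" rule is to enumerate all eight inputs. The sketch below does that; the function names are chosen for the example.

```python
from itertools import product

def perceptron(x, weights=(0.3, 0.3, 0.3), threshold=0.4):
    """Y = I(sum_i w_i * x_i - t > 0) with the weights and threshold from the slide."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, x))
    return 1 if weighted_sum - threshold > 0 else 0

def at_least_two(x):
    """Target rule: 1 if at least two of the inputs equal 1."""
    return 1 if sum(x) >= 2 else 0

# Compare the perceptron with the rule on every binary input triple.
matches = all(perceptron(x) == at_least_two(x) for x in product((0, 1), repeat=3))
```

One active input gives 0.3 - 0.4 < 0, while two or three active inputs give at least 0.6 - 0.4 > 0, so `matches` evaluates to `True`.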
Artificial Neural Networks (ANN) • Model is an assembly of inter-connected nodes and weighted links • Output node sums up each of its input values according to the weights of its links • Compare output node against some threshold t
[Figure: input nodes X1, X2, X3 linked with weights w1, w2, w3 to an output node Σ with threshold t, producing Y]
Perceptron Model: $Y = I\left(\sum_i w_i X_i - t > 0\right)$ or $Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)$
General Structure of ANN • Input Layer (x1, ..., x5), Hidden Layer, Output Layer (y) • Neuron i: inputs I1, I2, I3 with weights w_i1, w_i2, w_i3 are summed into S_i; the activation function g(S_i), with threshold t, gives the output O_i • Training ANN means learning the weights of the neurons
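The layered structure can be sketched as a forward pass in which every neuron forms its weighted sum S_i and applies an activation function g. Two points are assumptions for the example, since the slide does not fix them: g is taken to be the logistic sigmoid, and the threshold t is subtracted from S_i before g is applied. The names `neuron` and `forward` are hypothetical.

```python
import math

def neuron(inputs, weights, threshold):
    """Output O_i = g(S_i - t), with S_i = sum_j w_ij * I_j and g the logistic sigmoid (assumed)."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(s - threshold)))

def forward(inputs, layers):
    """Propagate inputs through the layers.

    layers: list of layers; each layer is a list of (weights, threshold), one pair per neuron.
    """
    activations = list(inputs)
    for layer in layers:
        activations = [neuron(activations, w, t) for w, t in layer]
    return activations
```

Training then amounts to adjusting the `weights` (and thresholds) so that `forward` maps the training inputs to the desired outputs.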