DS504/CS586: Big Data Analytics: Data Pre-processing and Cleaning

Welcome to DS504/CS586: Big Data Analytics. Data Pre-processing and Cleaning. Prof. Yanhua Li. Time: 6:00pm – 8:50pm R. Location: AK 233. Spring 2018. Merged CS586 and DS504. Graded one review. Examples of Reviews/Critiques. Random selection.


  1. Welcome to DS504/CS586: Big Data Analytics
     Data Pre-processing and Cleaning
     Prof. Yanhua Li
     Time: 6:00pm – 8:50pm R
     Location: AK 233
     Spring 2018

  2. Merged CS586 and DS504. Graded one review. Examples of Reviews/Critiques. Random selection.

  3. The Data Equation: Oceans of Data. Praia de Forte, Brazil. Ocean Biodiversity Informatics, Hamburg

  4. Data Quality Dimensions
     • Accuracy
       – Errors in data
       – Example: “Jhn” vs. “John”
     • Currency
       – Lack of updated data
       – Example: Residence (Permanent) Address: outdated vs. up-to-date
     • Consistency
       – Discrepancies in the data
       – Example: ZIP code and city must be consistent
     • Completeness
       – Lack of data
       – Partial knowledge of the records in a table

  5. Geographic outliers - GIS
     Country, State, named district, etc.
     Gazetteer of Brazilian localities
     Ocean Biodiversity Informatics, Hamburg

  6. What do we mean by ‘Data Quality’?
     An essential or distinguishing characteristic necessary for data to be fit for use. (SDTS 02/92)
     The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data. (Chrisman, 1991)

  7. Loss of data quality?
     Loss of data quality can occur at many stages:
     • At the time of collection
     • During digitisation
     • During documentation
     • During storage and archiving
     • During analysis and manipulation
     • At time of presentation
     • And through the use to which they are put
     Don’t underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor. (Redman 2001)

  8. Data Cleaning
     • Data cleaning tasks
       – Accuracy: Smooth out noisy data
       – Currency: Update the records
       – Consistency: Correct inconsistent data
       – Completeness: Fill in missing values
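
Three of the cleaning tasks listed on this slide can be illustrated with a minimal Python sketch. The function names, the mean-fill strategy, the sliding-median smoother, and the `zip_to_city` lookup table are all illustrative assumptions; the slides do not prescribe a particular method.

```python
from statistics import mean, median


def fill_missing(values):
    """Completeness: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def median_smooth(values, window=3):
    """Accuracy: smooth noisy readings with a centered sliding median."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


def inconsistent_rows(rows, zip_to_city):
    """Consistency: return rows whose city disagrees with the ZIP lookup."""
    return [r for r in rows if zip_to_city.get(r["zip"]) != r["city"]]
```

A sliding median is used rather than a mean so that a single outlier does not leak into its neighbors; other smoothers would fit the slide's wording equally well.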

  9. Map matching

  10. Map-matching
      • Problem: (Sampled data)
        – GPS trajectory = a sequence of GPS locations with time stamps
        – Map a GPS trajectory onto a road network
        – A sequence of GPS points → a sequence of road segments

  11. Spatial Data
      [Figure: road network with road segments e1, e2, e3, e4; segment e3 is marked with its endpoints e3.start and e3.end]

  12. Map-Matching
      • Why it is important
        – A fundamental step in many transportation applications
          • Navigation and driving
          • Traffic analysis
          • Taxi dispatching and recommendations
        – Examples:
          • Find the vehicles passing Institute Road
          • Calculate the average travel time from WPI campus to MIT campus
          • When will Bus 3 arrive at the stop at Highland St & North Ashland St

  13. Map-Matching
      • Simple solution for high-sampling-rate data
        – Weighted distance
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009
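
For densely sampled data, a distance-based matcher can be sketched as follows. The slide does not define the weighting, so this sketch assumes plain, unweighted point-to-segment distance in planar coordinates. The `segments` dictionary keyed by segment id is a hypothetical data layout.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def match_nearest(trajectory, segments):
    """Map each GPS point to the id of its closest road segment."""
    return [
        min(segments, key=lambda sid: point_segment_distance(p, *segments[sid]))
        for p in trajectory
    ]
```

Each point is matched independently here, which is why this kind of approach tends to break down at low sampling rates, where the following slides bring in neighboring points and timing.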

  14. Map-Matching (for low sampling rate)
      • Why difficult?
      [Figure: three ambiguous cases: (a) Parallel roads, (b) Overpass, (c) Spur]

  15. Map-Matching
      • According to the additional information used
        – Geometric
        – Topological
        – Probabilistic
        – Advanced techniques
      • According to the range of sampling points
        – Local/incremental
        – Global
      Yu Zheng. Trajectory Data Mining: An Overview. ACM Transactions on Intelligent Systems and Technology, 6, 3, 2015.

  16. Map-matching
      • Insights
        – Consider both local and global information
        – Incorporate both spatial and temporal features
      [Figure: GPS points p_a, p_b, p_c]
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

  17. Map-matching framework
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

  18. Map-matching
      • Solution (incorporating spatial information)
        – (Observation Probability) Model local possibility:
          $$N(c_i^j) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i^j - \mu)^2}{2\sigma^2}}$$
        – (Transmission Probability) Considering context (global):
          $$V(c_{i-1}^t \to c_i^s) = \frac{d_{i-1 \to i}}{w_{(i-1,t) \to (i,s)}}$$
        – Spatial analysis function:
          $$F_s(c_{i-1}^t \to c_i^s) = V(c_{i-1}^t \to c_i^s) \ast N(c_i^s)$$
      [Figure: candidate points $c_i^1$, $c_i^2$, $c_i^3$ of GPS point $p_i$ on road segments $e_i^1$, $e_i^2$, $e_i^3$, with neighboring points $p_{i-1}$ and $p_{i+1}$]
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009
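
The spatial analysis function on this slide can be sketched in Python as below. It assumes the observation probability is a Gaussian over the distance between a GPS point and its candidate, and that the transmission probability is the straight-line distance between consecutive GPS points divided by the shortest-path length between their candidates. The defaults `mu=0.0` and `sigma=20.0` are illustrative assumptions, not values given on the slide.

```python
import math


def observation_probability(dist, mu=0.0, sigma=20.0):
    """N(c): Gaussian density of the distance from a GPS point to a candidate."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((dist - mu) ** 2) / (2.0 * sigma ** 2))


def transmission_probability(euclid_dist, shortest_path_len):
    """V: straight-line distance between two GPS points over the
    shortest-path length between their candidates on the road network."""
    if shortest_path_len <= 0:
        raise ValueError("shortest_path_len must be positive")
    return euclid_dist / shortest_path_len


def spatial_score(cand_dist, euclid_dist, shortest_path_len, mu=0.0, sigma=20.0):
    """F_s = V * N for one transition between candidates."""
    return (transmission_probability(euclid_dist, shortest_path_len)
            * observation_probability(cand_dist, mu, sigma))
```

Candidates that require a long detour on the road network relative to the straight-line distance receive a small V, so a candidate that is close to the GPS point can still score poorly.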

  19. Map-matching
      • Solution (Cosine Similarity)
      • Temporal analysis function (considering temporal information)
      • Shortest path is used.
        $$F_t(c_{i-1}^t \to c_i^s) = \frac{\sum_{u=1}^{k} \left( e'_u.v \times \bar{v}_{(i-1,t) \to (i,s)} \right)}{\sqrt{\sum_{u=1}^{k} (e'_u.v)^2} \times \sqrt{\sum_{u=1}^{k} \bar{v}_{(i-1,t) \to (i,s)}^2}}$$
        $$\bar{v}_{(i-1,t) \to (i,s)} = \frac{\sum_{u=1}^{k} l_u}{\Delta t_{i-1 \to i}}$$
      [Figure: GPS points $P_{i-1}$ and $P_i$ near a highway and a service road]
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009
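
The temporal analysis function can be sketched as a cosine similarity between two vectors of length k: the per-segment speed values along the shortest path, and the average speed repeated k times. This sketch reads `e'_u.v` as the speed limit of the u-th segment on the path and `l_u` as that segment's length; both readings are assumptions, since the slide does not spell them out.

```python
import math


def average_speed(segment_lengths, delta_t):
    """Average speed along the shortest path: total length over elapsed time."""
    return sum(segment_lengths) / delta_t


def temporal_score(speed_limits, segment_lengths, delta_t):
    """F_t: cosine similarity between the per-segment speed limits and the
    average speed repeated once per segment."""
    v_bar = average_speed(segment_lengths, delta_t)
    numerator = sum(v * v_bar for v in speed_limits)
    denominator = (math.sqrt(sum(v * v for v in speed_limits))
                   * math.sqrt(len(speed_limits) * v_bar * v_bar))
    if denominator == 0:  # zero speed or empty path: similarity undefined
        return 0.0
    return numerator / denominator
```

For example, a path whose segments all share one speed limit scores close to 1, while a path mixing highway and service-road limits scores lower.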

  20. Map-matching
      • Aggregating
        – Spatial and temporal information
        – Local and global information
      • Dynamic programming
        [Figure: candidate graph linking P1's candidates, P2's candidates, ..., Pn's candidates]
      • Spatio-temporal function
        $$F(c_{i-1}^t \to c_i^s) = F_s(c_{i-1}^t \to c_i^s) \ast F_t(c_{i-1}^t \to c_i^s), \quad 2 \le i \le n$$
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

  21. Map-matching
      • Path Selection
        $$F(P_c) = N(c_1^{s_1}) + \sum_{i=2}^{n} F(c_{i-1}^{s_{i-1}} \to c_i^{s_i})$$
        $$P = \arg\max_{P_c} F(P_c)$$
      • Dynamic programming
        [Figure: candidate graph linking P1's candidates, P2's candidates, ..., Pn's candidates]
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009
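
Path selection over the candidate graph can be sketched as a Viterbi-style dynamic program that maximizes the sum of the first candidate's observation score and the transition scores. The callables `obs` and `score`, and the list-of-lists layout of `candidates`, are hypothetical interfaces chosen for this sketch. Candidate ids are assumed unique within each GPS point.

```python
def best_path(candidates, obs, score):
    """Return (path, total) maximizing obs(c_1) + sum of score(i, prev, cur).

    candidates[i] lists the candidate ids of the i-th GPS point;
    obs(c) scores a candidate of the first point;
    score(i, prev, cur) scores the edge from prev (point i-1) to cur (point i).
    """
    best = {c: obs(c) for c in candidates[0]}  # best total ending at each candidate
    back = []                                  # one back-pointer table per step
    for i in range(1, len(candidates)):
        new_best, pointers = {}, {}
        for cur in candidates[i]:
            prev = max(best, key=lambda p: best[p] + score(i, p, cur))
            new_best[cur] = best[prev] + score(i, prev, cur)
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final candidate.
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

With n points and at most m candidates per point, this evaluates on the order of n·m² transitions instead of enumerating all mⁿ candidate paths.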

  22. Map-matching Example
      • Path Selection

  23. Map-matching framework
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

  24. Localized ST-Matching Strategy
      • Path Selection
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

  25. Evaluations
      $$A_N = \frac{\#\,\text{correctly matched road segments}}{\#\,\text{all road segments}}$$
      $$A_L = \frac{\sum \text{length of matched road segments}}{\text{length of the trajectory}}$$
      Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009
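
The two accuracy measures can be computed as in the sketch below. It makes two assumptions the slide leaves open: only correctly matched segments count toward the length-based measure, and "length of the trajectory" means the total length of the ground-truth route. The argument names and the `lengths` dictionary are hypothetical.

```python
def accuracy_by_number(matched, truth):
    """A_N: share of ground-truth road segments that were matched correctly."""
    matched_set = set(matched)
    correct = sum(1 for seg in truth if seg in matched_set)
    return correct / len(truth)


def accuracy_by_length(matched, truth, lengths):
    """A_L: length of correctly matched segments over the ground-truth length."""
    matched_set = set(matched)
    correct_len = sum(lengths[seg] for seg in truth if seg in matched_set)
    total_len = sum(lengths[seg] for seg in truth)
    return correct_len / total_len
```

The length-based measure weights long segments more heavily, so missing one long highway segment costs more than missing a short connector.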

  26. Homework assignment

  27. Project 1
      Example: USPS https://ribbs.usps.gov/intelligentmail_package/documents/tech_guides/PUB199IMPBImpGuide.pdf
      Project 1 Proposals

  28. Next Class: Data Management
      • Do assigned readings before class
        – Be prepared, read and review required readings on your own in advance!
        – Do literature survey: find and read related papers if any
        – Bring your questions to the class and look for answers during the class.
      • Submit reviews/critiques
        – In Canvas before class
        – Bring 2 hardcopies to the class
        – Hand in one copy, and keep one copy with you.
        – Review Writing: http://users.wpi.edu/~yli15/courses/DS504Fall16/Critiques.html
      • Attend in-class discussions
        – Please ask and answer questions in (and out of) class!
        – Let’s try to make the class interactive and fun!
