
Analyzing Big Data From Complex Systems: Smart Cards in Urban - PowerPoint PPT Presentation



  1. Analyzing Big Data From Complex Systems: Smart Cards in Urban Transportation Networks Soong Moon Kang School of Management University College London smkang@ucl.ac.uk The Institute for Korean Regional Studies Seoul National University September 6, 2016

  2. Transport for London (TfL) Oyster Card (image credit: Wikicommons) • Introduced in 2003 • By June 2012: - More than 43 million cards issued - Used by more than 80% of all public transport

  3. Agenda: • Study 1: Patterns of Urban Movement • Study 2: Predicting Traffic Volumes and Estimating Effects of Disruptions • Study 3: Extensions of the Study on the Patterns of Urban Movement • Study 4: Extensions of the Study on the Effects of Disruptions • Discussions

  4. Study 1: Patterns of Urban Movement • “Structure of Urban Movements: Polycentric Activity and Entangled Hierarchical Flows”, PLoS ONE, January 7, 2011, 6(1): e15923 (with Camille Roth, Michael Batty and Marc Barthélémy)

  5. Data: • March 31, 2008 to April 6, 2008 (1 week) - 11.22 million journeys (trips) - 2.03 million individual users (IDs) • Information for each ID: - time and location of tap-in and tap-out → individual movements

  6. Descriptives: The distribution of travel distances can be fitted with a negative binomial function. [Figure: distribution of distances between stations compared with distribution of journeys; annotated value: 9.28 km]

  7. Descriptives: Travel propensity: actual flow (w_ij) vs. random flow • random simulation (given in- and out-flow at stations) → null model of randomized journeys
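One common way to build a null model of this kind is to keep every station's total out-flow and in-flow fixed and spread journeys proportionally, so that the expected flow from i to j is out_i × in_j / total. Whether the study used exactly this construction is an assumption; the function name and the 3-station matrix below are hypothetical and only illustrate the idea.

```python
def expected_random_flows(flows):
    """Expected flow between every pair of stations under a null model that
    keeps each station's total out-flow and in-flow fixed (an assumption)."""
    n = len(flows)
    out_flow = [sum(row) for row in flows]  # journeys starting at station i
    in_flow = [sum(flows[i][j] for i in range(n)) for j in range(n)]  # ending at j
    total = sum(out_flow)
    return [[out_flow[i] * in_flow[j] / total for j in range(n)] for i in range(n)]


# Hypothetical flow matrix: w[i][j] = journeys from station i to station j.
w = [[0, 30, 10],
     [20, 0, 40],
     [5, 15, 0]]

w_null = expected_random_flows(w)

# Travel propensity for one pair: a ratio above 1 means more journeys
# than the randomized null model would expect.
propensity_0_1 = w[0][1] / w_null[0][1]
```

Note that this simple version also assigns some expected flow to the diagonal (journeys from a station to itself), which a more careful null model would exclude.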

  8. Descriptives: Flow distribution: normalized histogram of flows of individuals • power law with exponent ≈ 1.3 → strong heterogeneity of individual movements • w_ij: flow of passengers between stations i and j

  9. Descriptives: Distribution of total flows: Zipf plot for morning peak hours (7am – 10am) • Exponential decay → most of the total flow is concentrated in a few stations
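A Zipf plot ranks stations by total flow and plots flow against rank. The sketch below only prepares the ranked table and the cumulative share of flow, which is what shows the concentration in a few stations. The station names and counts are invented for illustration.

```python
def zipf_ranking(total_flows):
    """Rank stations by total flow (largest first) and track the cumulative
    share of the overall flow captured up to each rank."""
    ranked = sorted(total_flows.items(), key=lambda kv: kv[1], reverse=True)
    grand_total = sum(total_flows.values())
    rows = []
    cumulative = 0
    for rank, (station, flow) in enumerate(ranked, start=1):
        cumulative += flow
        rows.append((rank, station, flow, cumulative / grand_total))
    return rows


# Hypothetical morning-peak totals per station.
flows = {"A": 500, "B": 300, "C": 120, "D": 50, "E": 30}

for rank, station, flow, share in zipf_ranking(flows):
    print(rank, station, flow, round(share, 2))
```

In this toy data the top two of five stations already account for 80% of the flow.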

  10. Polycenters: Identifying polycenters: 1. Arrange stations in decreasing order of inflow → centers are defined in decreasing order of importance 2. Account for geographical proximity → aggregate all stations within a given distance (1,500 meters) into the defined center 3. Continue until a large percentage of the total flow is captured (60% of total flow)
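The three steps above can be sketched as a greedy pass over the stations. Several details are assumptions made for the sketch, not taken from the slides: coordinates are planar and in meters, proximity is measured to the first (seed) station of each center, and the loop stops once the stations processed so far cover the target share of inflow. The function name and the sample stations are hypothetical.

```python
import math


def find_polycenters(stations, radius_m=1500.0, flow_share=0.60):
    """stations: list of (name, x_m, y_m, inflow) tuples.
    Returns centers as dicts with a seed position, member names and inflow."""
    total = sum(s[3] for s in stations)
    ordered = sorted(stations, key=lambda s: s[3], reverse=True)  # step 1
    centers = []
    captured = 0.0
    for name, x, y, inflow in ordered:
        if captured >= flow_share * total:  # step 3: enough flow captured
            break
        for center in centers:  # step 2: merge into a nearby center
            seed_x, seed_y = center["seed"]
            if math.hypot(x - seed_x, y - seed_y) <= radius_m:
                center["members"].append(name)
                center["inflow"] += inflow
                break
        else:  # no center within the radius: this station seeds a new one
            centers.append({"seed": (x, y), "members": [name], "inflow": inflow})
        captured += inflow
    return centers


# Hypothetical stations on a line (x in meters, y = 0).
stations = [("A", 0, 0, 300), ("B", 1000, 0, 200), ("C", 5000, 0, 250),
            ("D", 5500, 0, 100), ("E", 20000, 0, 150)]
centers = find_polycenters(stations)
```

With this data, A and B merge into one center, C forms a second, and the pass stops before reaching D and E because 75% of the inflow is already covered.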

  11. Polycenters: Hierarchical organization

  12. Polycenters: Northern Stations, West End, Western Stations, City, Docklands, West London, Museums, Parliament, Mid-Town, Government

  13. Polycenters: Anisotropy - Use the random simulation from travel propensity to study the relative orientation of incoming flow - anisotropy → if no bias, fully isotropic (= 1)

  14. Polycenters:

  15. Structure of Flows: How flows from single stations (sources) go to centers - squares: sources (single stations) - grey: 20% of total inflow - circles: centers - red: 40% of total inflow

  16. Structure of Flows: Proportion of links going from sources to centers, by group (Group I, Group II, Group III) • For more than 80% of the sources, the most important link (1st link) connects to a center of Group I • For more than 80% of the sources, the least important link (10th link) connects to a center of Group III.

  17. Study 1: Patterns of Urban Movement • Contributions: - application of complex systems analytical tools to a novel dataset - a new approach to determine polycenters - attempt to model the hierarchical nature of urban movements • Limitations: - exploratory - naive

  18. Study 2: Predicting Traffic Volumes and Estimating Effects of Disruptions • “Predicting Traffic Volumes and Estimating the Effects of Shocks in Massive Transportation Systems”, Proceedings of the National Academy of Sciences (PNAS), May 5, 2015, 112(18): 5643–5648 (with Ricardo Silva and Edoardo M. Airoldi) → Introducing statistical analysis into complex systems

  19. Data: • February 2011 to February 2012 - 70 weekdays and 25 weekend days - 211 million journeys (trips) - 10.7 million individual users (IDs) → 1.71 journeys per user per day → 1.76 million users per day → 3 million journeys per day - 374 stations open during the period (underground + overground + DLR)
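The per-day figures can be cross-checked with simple arithmetic. The check below assumes that the 211 million journeys are spread over the 70 weekdays; that reading is an inference from the numbers, not something the slide states.

```python
total_journeys = 211e6            # journeys in the dataset
weekdays = 70                     # assumed divisor (weekdays only)
users_per_day = 1.76e6
journeys_per_user_per_day = 1.71

# Two independent routes to "about 3 million journeys per day".
journeys_per_day_from_total = total_journeys / weekdays
journeys_per_day_from_users = users_per_day * journeys_per_user_per_day

print(round(journeys_per_day_from_total / 1e6, 2))  # million journeys per day
print(round(journeys_per_day_from_users / 1e6, 2))  # million journeys per day
```

Both routes give roughly 3.01 million, which matches the slide. Dividing by all 95 days instead would give about 2.2 million, so the weekday reading fits better.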

  20. Data: • Weekdays only

  21. Statistical Model: Basic Idea:

  22. Statistical Model: Basic Idea:

  23. Statistical Model: Basic Idea:

  24. Statistical Model: Basic Idea: [Diagram components: Smart Card Data; Network Structure Data; Disruption Logs; Passenger Route Surveys; “Natural Regime” Model; “Disruption” Model]

  25. “Natural Regime” Model: [Diagram components: Smart Card Data; Network Structure Data; Disruption Logs; Passenger Route Surveys; “Natural Regime” Model; “Disruption” Model]

  26. “Natural Regime” Model: Basic Idea:

  27. “Natural Regime” Model: Assessment: - Fivefold cross-validation (i.e., 14 days of test data for each fold): test whether the fine-grained model with 374×374 ≈ 140,000 components overfits compared with the fully aggregated (black-box) models, and under which conditions it does better
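With 70 weekdays, five folds give 14 test days each. The sketch below shows one way to produce such splits, assuming contiguous blocks of days; how the study actually assigned days to folds is not stated on the slide, and the function name is hypothetical.

```python
def fivefold_splits(days, n_folds=5):
    """Yield (train_days, test_days) pairs using contiguous blocks of days.
    If len(days) is not divisible by n_folds, leftover days stay in training."""
    fold_size = len(days) // n_folds
    for k in range(n_folds):
        start, stop = k * fold_size, (k + 1) * fold_size
        test = days[start:stop]
        train = days[:start] + days[stop:]
        yield train, test


days = list(range(70))  # indices standing in for the 70 weekdays
splits = list(fivefold_splits(days))

for train, test in splits:
    print(len(train), len(test))  # 56 training days, 14 test days per fold
```

Each model would be fitted on the 56 training days and scored on the 14 held-out days, and the scores averaged over the five folds.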

  28. “Disruption” Model: [Diagram components: Smart Card Data; Network Structure Data; Disruption Logs; Passenger Route Surveys; “Natural Regime” Model; “Disruption” Model]

  29. “Disruption” Model: Basic Model:

  30. “Disruption” Model: Results: Average number of exits per minute at Victoria LU station on Tuesday, January 17, 2012. The blue curve represents the 1-min-ahead prediction under the natural regime using the tracking model. Given a disruption from 6:00 PM to 7:00 PM between Victoria station and Brixton station on the Victoria line: - blue horizontal line: the average expected exit rate given by the tracking model under the natural regime, - red horizontal line: the average observed exit count, and - black horizontal line: the prediction given by the disruption model

  31. “Disruption” Model: Assessment: (A) Relative errors for line segment events. The absolute error of the tracking model for line segment disruptions varies from 3.0 (all stations) to 12.2 (stations with 85 tap-outs per minute or more) persons per minute. (B) Relative errors for station events. The absolute error varies from 3.5 (all stations) to 10.5 (stations with 75 tap-outs per minute or more) persons per minute.
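The slide does not define its error measures. As a hedged illustration, the sketch below computes a mean absolute error in persons per minute for two models and then expresses the disruption model's error relative to the tracking model's. Reading "relative error" as this ratio is an assumption, and all numbers are invented.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute gap between predicted and observed exits per minute."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical exits per minute at three stations during a disruption.
observed = [40, 50, 60]
tracking_prediction = [50, 58, 66]    # natural-regime (tracking) model
disruption_prediction = [42, 49, 63]  # disruption model

mae_tracking = mean_absolute_error(tracking_prediction, observed)
mae_disruption = mean_absolute_error(disruption_prediction, observed)

# Assumed definition: error of the disruption model as a fraction of the
# tracking model's error (below 1 means the disruption model does better).
relative_error = mae_disruption / mae_tracking
```

In this toy case the tracking model is off by 8 persons per minute on average and the disruption model by 2, giving a ratio of 0.25.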

  32. Station Sensitivity Index: How sensitive stations are to line closures: Red dots: top 10% by number of tap-outs

  33. Study 2: Predicting Traffic Volumes and Estimating Effects of Disruptions • Contributions: - application of statistical and machine learning techniques to complex systems - good model to describe and predict the effects of disruptions • Limitation: - simplistic

  34. Study 3: Extensions of the Study on the Patterns of Urban Movement • with Michael Batty, Hae Ran Shin, Ricardo Silva and Chen Zhong → Introducing statistical analysis into the study of urban movement patterns

  35. Study 3a: Passenger Travel Distributions Basic Idea:

  36. Study 3a: Passenger Travel Distributions Basic Idea: [Figure: frequency vs. distance histograms for Station A and Station B]

  37. Study 3a: Passenger Travel Distributions Basic Idea: [Figure: frequency vs. distance histograms for Station A and Station B]

  38. Study 3a: Passenger Travel Distributions Some Research Questions: - Do travel distributions of the passengers entering specific stations reveal a more generic pattern? → “local” versus “global” - If a generic pattern exists, how does it relate to the urban geography? → “center” versus “periphery”

  39. Study 3b: Passenger Travel Distributions and Geographic Socio-Economic Characteristics Basic Idea: - Correlate passenger travel distributions with geographic socio-economic characteristics such as income, education, age, employment and family composition.

  40. Study 3: Extensions of the Study on the Patterns of Urban Movement • Data: → London and Seoul → Major challenges: - Only one day of data from Seoul - Fine-grained socio-economic data for Seoul

  41. Study 4: Extensions of the Study on the Effects of Disruptions • with Ricardo Silva → Refining the statistical analyses → Ultimate goal: real-time assessment of effects of disruptions system-wide

  42. Study 4a: Probabilistic and Causal Approaches Basic Idea:

  43. Study 4a: Probabilistic and Causal Approaches Basic Ideas: - provide a full probabilistic model of movement inside the subway network system - estimate the distribution (instead of only the expectation) of travel times, link loads and exit numbers given a disruption → causal inference

  44. Study 4b: Passenger-level modeling Basic Ideas: - model by taking into account the behaviour of individual travellers, instead of aggregated counts - collect fine-grained passenger movement data using mobile apps

  45. Discussion
