Effective Scalable and Integrative Geocoding for Massive Address Datasets

  1. Effective Scalable and Integrative Geocoding for Massive Address Datasets
     Department of Computer Science, Stony Brook University
     Sina Rashidian, Xinyu Dong, Amogh Avadhni, Prachi Poddar, Fusheng Wang
     November 2017

  2. INTRODUCTION
     Sections: Introduction, Background, Open Data Sources, Integrative Geocoding, Results

  3. Motivations
     • Spatial big data analysis is increasing every day
       – Accessibility of large-scale open data
     • Public health studies
       – Health data is widely accessible through government open data initiatives, geo-crowdsourcing, and social media
       – Low-resolution spatial data, e.g., county or zip code, is used more often
       – Lack of high-resolution spatial data
       – Lack of efficient and scalable methods

  4. Problems
     • Sensitivity of data, for instance:
       – Patients' privacy in public health studies
       – Protected Health Information (PHI)
       – Health Insurance Portability and Accountability Act (HIPAA)
     • High cost of commercial geocoding
       – Geocoding 20M records ≈ 10K USD
     • Limited scalability of current geocoding systems
       – Daily or monthly transaction limits, e.g., 1M per month or 100K per day
       – Geocoding 20M records at 100K per day → 200 days!
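
The arithmetic behind the cost and quota figures on this slide can be checked with a short sketch. The function name `days_to_geocode` is hypothetical and used only for illustration; the figures (20M records, 10K USD, 100K per day, 1M per month) are the ones quoted on the slide.

```python
import math

def days_to_geocode(total_records: int, quota_per_period: int) -> int:
    """Whole quota periods needed to push all records through a fixed quota."""
    return math.ceil(total_records / quota_per_period)

records = 20_000_000

# Daily quota of 100K records, as quoted on the slide.
print(days_to_geocode(records, 100_000))    # 200 (days)

# Monthly quota of 1M records, as quoted on the slide.
print(days_to_geocode(records, 1_000_000))  # 20 (months)

# Implied commercial price per record at about 10K USD for 20M records.
print(10_000 / records)                     # 0.0005 (USD per record)
```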

  5. Goal
     • A geocoding system with the following features:
       – Free of charge → Suitable for academic usage
       – Scalable and fast → Supports a high volume of input data
       – Accurate and robust → Results must be reliable
       – Local → Respects data sensitivity, such as patients' privacy
     • Challenges
       – Lack of a free, complete, accurate, and reliable reference
       – Free data sources can be noisy and incomplete
       – Different data sources do not share the same set of features
       – When there are multiple possible answers, which one is better?

  6. BACKGROUND

  7. Classic Geocoding Model
     • Consists of two major parts:
       – Parsing
       – Searching
     • Fixed scoring system based only on text similarity
     • Improvements are based on techniques
     Diagram, Parsing: Raw Address ("12-11 North Stony Brook Road, Stony Brook, NY, 11794") → Tokenizing → Tokenized Address ("12, 11, North, Stony Brook, Rd, Stony Brook, NY, 11794") → Cleaning → Clean Tokens ("12, 11, n, stony brook, rd, stony brook, ny, 11794")
     Diagram, Searching: Clean Tokens → Build Query → DataBase → Is valid? → Answer
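
The parsing stage above (tokenizing, then cleaning) can be illustrated with a minimal sketch. This is not the presenters' parser: the `ABBREVIATIONS` table and the function names are assumptions made for illustration, and the sketch splits word by word, so it does not keep multi-word names such as "Stony Brook" together the way the slide's example does.

```python
import re

# Hypothetical abbreviation table; a real parser would rely on a complete
# list of street suffixes and directionals.
ABBREVIATIONS = {
    "north": "n", "south": "s", "east": "e", "west": "w",
    "road": "rd", "street": "st", "avenue": "ave",
}

def tokenize(raw_address: str) -> list[str]:
    """Split a raw address on commas, hyphens, and whitespace."""
    return [t for t in re.split(r"[,\-\s]+", raw_address) if t]

def clean(tokens: list[str]) -> list[str]:
    """Lowercase each token and replace known words with abbreviations."""
    lowered = [t.lower() for t in tokens]
    return [ABBREVIATIONS.get(t, t) for t in lowered]

raw = "12-11 North Stony Brook Road, Stony Brook, NY, 11794"
print(clean(tokenize(raw)))
# ['12', '11', 'n', 'stony', 'brook', 'rd', 'stony', 'brook', 'ny', '11794']
```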

  8. DATA SOURCES

  9. Data Sources
     • Linear-Based
       – Topologically Integrated Geographic Encoding and Referencing (TIGER)
       – Cons: Missing city information, based on address ranges → Interpolation error
     • Polygon/Point-Based
       – Tax Parcels
       – New York Street and Address Maintenance Program (SAM)
       – OpenStreetMap (OSM)
       – OpenAddresses
       – Cons: Incompleteness, partial information, messiness

  10. INTEGRATIVE GEOCODING

  11. Integrative Geocoding Model
      • Consists of three major parts:
        – Parsing
        – Oriented Searching
        – Intelligent Selection
      • Parallel processing approach
      • Training section as a preprocessing task
      • Effective and scalable integrative geocoder (EaserGeocoder)

  12. Oriented Searching
      1. Generate queries based on each dataset's characteristics
         – E.g., TIGER does not have city name
         – Increasing efficiency
         – More accurate results
      2. Search Database
      3. Relaxed Search
         – Finding nearest ones instead of exact match
         – Expanding the scope iteratively
      Diagram (one branch per source: TIGER, OpenAddresses, Tax Parcels, SAM): Generate Specific Query → Search Engine Query → Acceptable? → Candidate Result, with a Relax step looping back to the query; the diagram also contains a "Close enough?" check.
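
The three steps above (source-specific query, search, relaxed search) can be sketched on toy data. Everything here is an assumption made for illustration: the in-memory `SOURCES` records, the field lists, the function names, and the relaxation rule, which widens a house-number tolerance step by step. The slides do not specify how EaserGeocoder relaxes its queries.

```python
# Hypothetical toy reference data. Each source lists the fields it can match on;
# TIGER is given no "city" field, following the slide's example.
SOURCES = {
    "TIGER": {
        "fields": ("street", "zip"),
        "records": [{"number": 21, "street": "airport rd", "zip": "13901"}],
    },
    "OpenAddresses": {
        "fields": ("street", "city", "zip"),
        "records": [{"number": 23, "street": "airport rd",
                     "city": "binghamton", "zip": "13901"}],
    },
}

def build_query(address, fields):
    """Step 1: keep only the fields this source actually has."""
    return {f: address[f] for f in fields}

def search(records, query, number, tolerance):
    """Step 2: match query fields exactly and house number within tolerance."""
    hits = [r for r in records
            if all(r[k] == v for k, v in query.items())
            and abs(r["number"] - number) <= tolerance]
    return sorted(hits, key=lambda r: abs(r["number"] - number))

def oriented_search(address, max_tolerance=8):
    """Step 3: relax the search iteratively until a candidate is found."""
    candidates = {}
    for name, source in SOURCES.items():
        query = build_query(address, source["fields"])
        tolerance = 0
        while tolerance <= max_tolerance:
            hits = search(source["records"], query, address["number"], tolerance)
            if hits:
                candidates[name] = (hits[0], tolerance)
                break
            tolerance = 2 if tolerance == 0 else tolerance * 2  # widen scope
    return candidates

address = {"number": 21, "street": "airport rd",
           "city": "binghamton", "zip": "13901"}
for name, (record, tolerance) in oriented_search(address).items():
    print(name, record["number"], "tolerance:", tolerance)
```

In this toy run the TIGER record matches exactly (tolerance 0), while the OpenAddresses record is only found after one relaxation step (tolerance 2). Each source contributes at most one candidate, which is then passed on to answer selection.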

  13. Intelligent Answer Selection – Case Study
      • "21 Airport Rd, Binghamton, NY 13901"
        – Perfect match in TIGER, spatial error ~450 m
        – Partial match in OpenAddresses, spatial error ~350 m
      • "510 Main St, Oneida, NY 13421"
        – Perfect match in TIGER, spatial error ~50 m
        – Partial match in OpenAddresses, spatial error ~30 km!
      • The partial matches are similar: only the zip code is missing in both cases
      • Preset rules for choosing the better reference lead to non-optimal or even wrong answers
      • The goal is to choose the optimal answer in both cases!

  14. Intelligent Answer Selection
      • Why is text similarity alone not enough?
        1. Each source could be more accurate in one specific region
        2. Implicit factors such as population density
      • Machine learning based approach
      • Gradient tree boosting
        – Learning small predictive models
        – Decision trees
        – Learning a model for predicting the best one

  15. Classification
      • Originally binary classification
        – Classify based on an acceptance threshold
      • Treats all correct candidates the same
        – Some candidates are more correct!
        – Considering spatial error
      • Multi-class classifier
        – 3 classes between 0 and the threshold
        – Choosing nearest class
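
A minimal sketch of the multi-class labelling idea follows. It makes several assumptions that the slide does not state: the classes are equal-width bins, the threshold is the 400 m value used for the accuracy evaluation, and one extra class marks rejected candidates. The names `error_class` and `choose_best` are hypothetical. Labels are computed here from known spatial errors, as they would be when building training data; at prediction time a trained model would supply the class for each candidate.

```python
def error_class(spatial_error_m, threshold_m=400.0, n_classes=3):
    """Map a spatial error to a class label.

    Labels 0..n_classes-1 are equal-width bins between 0 and the threshold
    (0 is nearest); label n_classes means the candidate is rejected.
    """
    if spatial_error_m < 0:
        raise ValueError("spatial error must be non-negative")
    if spatial_error_m >= threshold_m:
        return n_classes
    width = threshold_m / n_classes
    return min(int(spatial_error_m // width), n_classes - 1)

def choose_best(classes_by_source):
    """Pick the source whose candidate falls in the nearest (lowest) class."""
    return min(classes_by_source, key=classes_by_source.get)

# Spatial errors quoted in the case-study slide.
binghamton = {"TIGER": error_class(450), "OpenAddresses": error_class(350)}
oneida = {"TIGER": error_class(50), "OpenAddresses": error_class(30_000)}

print(choose_best(binghamton))  # OpenAddresses
print(choose_best(oneida))      # TIGER
```

With graded classes the two case-study addresses resolve to different sources, which a single preset rule such as "always prefer the perfect text match" cannot do.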

  16. RESULTS

  17. Accuracy

      | Type | Name | Google % | Here % | MapQuest % | Consolidation % |
      |---|---|---|---|---|---|
      | Commercial | Google | - | 97.18 | 96.55 | 99.10 |
      | Commercial | Here | 94.88 | - | 97.17 | 99.26 |
      | Commercial | MapQuest | 94.52 | 97.16 | - | 98.82 |
      | Commercial | GeoServices | 94.68 | 95.30 | 94.71 | 96.43 |
      | Non-Commercial | EaserGeocoder | 96.46 | 97.12 | 95.82 | 97.93 |
      | Non-Commercial | Nominatim | 54.74 | 54.15 | 54.08 | 55.68 |
      | Non-Commercial | Geonames | 82.65 | 83.20 | 83.53 | 84.40 |
      | Non-Commercial | DataSciToolkit | 89.05 | 89.50 | 89.52 | 90.71 |

      • 18,890 residential addresses crawled from one real estate website
      • Accuracy threshold defined as 400 meters
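
One way to compute a threshold-based accuracy figure like those above is sketched here. The slide only states the 400 m threshold; the use of the haversine great-circle distance, the mean Earth radius, and the function names are assumptions made for illustration, not the presenters' evaluation code.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def accuracy_percent(results, references, threshold_m=400.0):
    """Share of results lying within threshold_m of their reference point."""
    within = sum(haversine_m(*res, *ref) <= threshold_m
                 for res, ref in zip(results, references))
    return 100.0 * within / len(references)

# Toy coordinates: offsets of 0, ~111 m, ~222 m, and ~1.1 km in latitude.
references = [(40.0, -73.0)] * 4
results = [(40.0, -73.0), (40.001, -73.0), (40.002, -73.0), (40.01, -73.0)]
print(accuracy_percent(results, references))  # 75.0
```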

  18. Spatial Error

      | Type | Name | Google (m) | Here (m) | MapQuest (m) | Consolidation (m) |
      |---|---|---|---|---|---|
      | Commercial | Google | - | 24.29 | 55.52 | 18.37 |
      | Commercial | Here | 24.29 | - | 55.20 | 17.67 |
      | Commercial | MapQuest | 55.85 | 55.20 | - | 35.86 |
      | Commercial | GeoServices | 35.75 | 31.08 | 51.16 | 29.02 |
      | Non-Commercial | EaserGeocoder | 31.08 | 26.85 | 53.30 | 26.90 |
      | Non-Commercial | Nominatim | 65.55 | 63.88 | 57.98 | 58.09 |
      | Non-Commercial | Geonames | 93.38 | 91.78 | 75.32 | 83.58 |
      | Non-Commercial | DataSciToolkit | 87.42 | 85.47 | 71.36 | 77.39 |

  19. Spatial Accuracy Variation - Histogram (number of addresses = 18,890)

  20. Scalability Tests

  21. EaserGeocoder
      • The system is available from: http://bmidb.cs.stonybrook.edu/easergeocoder/
      • Completely free for academic usage!

  22. Summary
      • Introduction
        – Problem, Motivation, Goal
      • Background
        – Classic Geocoding Model
      • Open Data Sources
        – Linear-Based, Point-Based, Community Contributed
      • Integrative Geocoding
        – Integrative Geocoding Model, Oriented Searching, Intelligent Answer Selection
      • Results
        – Accuracy, Spatial Error, Spatial Accuracy Variation, Scalability

  23. Thanks a lot for your attention

  24. Any Questions?

  25. EXTRA MATERIALS

  26. Overview - Integrative Geocoding Model
      • Utilizing multiple free open reference sources
        – Maximizing coverage and accuracy
        – Paying no cost for data
      • Searching each data source based on its characteristics
        – Sources are left unchanged, except for preliminary data cleaning and standardization
        – Extracting candidates, the most similar matches from the sources
      • Choosing the best answer among the candidates
        – By using machine learning techniques

  27. Traditional Geocoding Method
      • Street Network Map as the source
      • Interpolation methods for estimating the location
      • Accuracy depends on:
        – Density
        – Estimation error
      • The most common method in geocoding systems

  28. Linear-Based Dataset
      • Topologically Integrated Geographic Encoding and Referencing (TIGER)
      • Street Network Map
      • Vast coverage and reliable
      • Cons:
        – Does not contain exact building locations → addresses are placed by linear interpolation
        – Parcel homogeneity
        – Offset from beginning and end
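
Linear interpolation over an address range can be sketched as follows. This is a simplified illustration under stated assumptions: a single straight street segment, one address range for the whole segment, and house numbers spread evenly along it. Real TIGER data uses polyline geometry with separate left-side and right-side ranges, and the function name `interpolate_address` is hypothetical.

```python
def interpolate_address(number, range_from, range_to, start, end):
    """Estimate a (lat, lon) for a house number on a straight street segment.

    start and end are the (lat, lon) endpoints where range_from and range_to
    are assumed to sit; numbers are assumed to be evenly spaced between them.
    """
    low, high = sorted((range_from, range_to))
    if not low <= number <= high:
        raise ValueError("house number outside the segment's address range")
    if range_to == range_from:
        t = 0.5  # single-number range: place it at the midpoint
    else:
        t = (number - range_from) / (range_to - range_from)
    lat = start[0] + t * (end[0] - start[0])
    lon = start[1] + t * (end[1] - start[1])
    return lat, lon

# Hypothetical segment covering house numbers 1 to 99.
print(interpolate_address(50, 1, 99, (40.0, -73.0), (40.001, -73.0)))
```

The even-spacing assumption is where the listed drawbacks enter: parcels are rarely uniform in size, and the first and last buildings usually sit some distance in from the segment's endpoints.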
