Effective Scalable and Integrative Geocoding for Massive Address Datasets
Department of Computer Science, Stony Brook University
Sina Rashidian, Xinyu Dong, Amogh Avadhni, Prachi Poddar, Fusheng Wang
November 2017
INTRODUCTION Open Data Integrative Introduction 2/23 Results Background Sources Geocoding
Motivations
• Spatial big data analysis is increasing every day
  – Accessibility of large-scale open data
• Public health studies
  – Health data is widely accessible through government open data initiatives, geo-crowdsourcing, and social media
  – Low-resolution spatial data (e.g., county or ZIP code) is used more often
  – Lack of high-resolution spatial data
  – Lack of efficient and scalable methods
Problems
• Sensitivity of data, for instance:
  – Patients’ privacy in public health studies
  – Protected Health Information (PHI)
  – Health Insurance Portability and Accountability Act (HIPAA)
• High cost of commercial geocoding
  – Geocoding 20M records ~ 10K USD
• Limited scalability of current geocoding systems
  – Daily/monthly transaction limits: 1M per month or 100K per day
  – Geocoding 20M records → 200 days!
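The 200-day estimate follows from dividing the workload by the daily quota. A minimal sketch of that arithmetic, using the figures from the slide; the function names are illustrative and not part of any geocoding service:

```python
import math

def days_to_geocode(records: int, daily_quota: int) -> int:
    """Whole days needed to push `records` through a per-day transaction cap."""
    return math.ceil(records / daily_quota)

def months_to_geocode(records: int, monthly_quota: int) -> int:
    """Whole months needed under a per-month transaction cap."""
    return math.ceil(records / monthly_quota)

print(days_to_geocode(20_000_000, 100_000))      # 200
print(months_to_geocode(20_000_000, 1_000_000))  # 20
```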
Goal
• Geocoding system with the following features:
  – Free of charge → Suitable for academic usage
  – Scalable and fast → Supports a high volume of input data
  – Accurate and robust → Results must be reliable
  – Local → Respects data sensitivity, such as patients’ privacy
• Challenges
  – Lack of a free, complete, accurate, and reliable reference
  – Free data sources can be noisy and incomplete
  – Different data sources do not share the same set of features
  – When there are multiple possible answers, which one is better?
BACKGROUND
Classic Geocoding Model
• Consists of two major parts:
  – Parsing
  – Searching
• Fixed scoring system based on only text similarity
• Improvements are based on techniques
[Figure: pipeline diagram]
Parsing: Raw Address (12-11 North Stony Brook Road, Stony Brook, NY, 11794) → Tokenizing → Tokenized Address (12, 11, North, Stony Brook, Rd, Stony Brook, NY, 11794) → Cleaning → Clean Tokens (12, 11, n, stony brook, rd, stony brook, ny, 11794)
Searching: Clean Tokens → Build Query → DataBase → Is valid? → Answer
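The parsing stage can be sketched in a few lines. This is a simplified illustration, not the parser used in the talk: the abbreviation tables are hypothetical and tiny, and unlike the slide's example it does not group multi-word names such as "stony brook" into a single token.

```python
import re
from typing import List

# Hypothetical abbreviation tables; real parsers use far larger ones.
SUFFIXES = {"road": "rd", "street": "st", "avenue": "ave"}
DIRECTIONS = {"north": "n", "south": "s", "east": "e", "west": "w"}

def tokenize(raw: str) -> List[str]:
    """Split a raw address on whitespace, commas, and hyphens."""
    return [t for t in re.split(r"[\s,\-]+", raw.strip()) if t]

def clean(tokens: List[str]) -> List[str]:
    """Lowercase every token and abbreviate known suffixes and directions."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        cleaned.append(SUFFIXES.get(token, DIRECTIONS.get(token, token)))
    return cleaned

raw = "12-11 North Stony Brook Road, Stony Brook, NY, 11794"
print(clean(tokenize(raw)))
```

The cleaned tokens would then feed the query-building step of the searching stage.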
DATA SOURCES
Data Sources
• Linear-Based
  – Topologically Integrated Geographic Encoding and Referencing (TIGER)
  – Cons: Missing city information, based on address ranges → Interpolation error
• Polygon/Point-Based
  – Tax Parcels
  – New York Street and Address Maintenance Program (SAM)
  – OpenStreetMap (OSM)
  – OpenAddresses
  – Cons: Incompleteness, partial information, messiness
INTEGRATIVE GEOCODING
Integrative Geocoding Model
• Consists of three major parts:
  – Parsing
  – Oriented Searching
  – Intelligent Selection
• Parallel processing approach
• Training section as a preprocessing task
• Effective and scalable integrative geocoder (EaserGeocoder)
Oriented Searching
1. Generate queries based on each dataset’s characteristics
  – E.g., TIGER does not have city name
  – Increasing efficiency
  – More accurate results
2. Search Database
3. Relaxed Search
  – Finding nearest ones instead of exact match
  – Expanding the scope iteratively
[Figure: flowchart with one lane per source (TIGER, OpenAddresses, Tax Parcels, SAM). Box labels: Generate Specific Query, Search Engine Query, Acceptable?, Relax, Close enough?, Candidate Result]
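One possible reading of these three steps, sketched over in-memory records. Everything here is an assumption made for illustration: the field sets in `SOURCE_FIELDS` (only the missing city name in TIGER comes from the slide), the exact-match lookup standing in for the search engine, and relaxation implemented as dropping one constraint at a time.

```python
from typing import Dict, List, Tuple

# Hypothetical field inventory per source; TIGER lacks the city name.
SOURCE_FIELDS = {
    "TIGER": {"number", "street", "zip"},
    "OpenAddresses": {"number", "street", "city", "zip"},
}

def build_query(parsed: Dict[str, str], source: str) -> Dict[str, str]:
    """Step 1: keep only the fields that this source can actually match."""
    return {k: v for k, v in parsed.items() if k in SOURCE_FIELDS[source]}

def relaxed_search(records: List[dict], query: Dict[str, str],
                   relax_order: List[str]) -> Tuple[List[dict], int]:
    """Steps 2 and 3: search, then drop constraints one by one until a hit.

    Returns the matching records and the number of relaxation steps used.
    """
    fields = dict(query)
    for step in range(len(relax_order) + 1):
        hits = [r for r in records
                if all(r.get(k) == v for k, v in fields.items())]
        if hits:
            return hits, step
        if step < len(relax_order):
            fields.pop(relax_order[step], None)  # widen the scope
    return [], len(relax_order)

parsed = {"number": "510", "street": "main st", "city": "oneida", "zip": "13421"}
records = [{"number": "510", "street": "main st", "city": "oneida"}]  # no zip
query = build_query(parsed, "OpenAddresses")
print(relaxed_search(records, query, relax_order=["zip", "city"]))
```

The number of relaxation steps is worth keeping alongside each candidate, since a later selection stage can use it as a signal of match quality.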
Intelligent Answer Selection – Case Study
• “21 Airport Rd, Binghamton, NY 13901”
  – Perfect match in TIGER, spatial error ~450m
  – Partial match in OpenAddresses, spatial error ~350m
• “510 Main St, Oneida, NY 13421”
  – Perfect match in TIGER, spatial error ~50m
  – Partial match in OpenAddresses, spatial error ~30km!
• The partial matches are similar: only the ZIP code is missing in both cases
• Preset rules for choosing the better reference lead to non-optimal or even wrong answers
• The state of the art is to choose between both of them optimally!
Intelligent Answer Selection
• Why is text similarity alone not enough?
  1. Each source could be more accurate in one specific region
  2. Implicit factors such as population density
• Machine learning based approach
• Gradient tree boosting
  – Learning small predictive models
  – Decision trees
  – Learning a model for predicting the best one
Classification
• Originally binary classification
  – Classify based on an acceptance threshold
• Treats all correct candidates the same
  – Some candidates are more correct!
  – Considering spatial error
• Multi-class classifier
  – 3 classes between 0 and the threshold
  – Choosing the nearest class
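A minimal sketch of how spatial error could be turned into class labels and how the nearest class would then be chosen. Several details are assumptions, not taken from the talk: the 400 m threshold, the equal-width bins, and the extra reject class for errors above the threshold. In the real system the classes would be predicted by the trained model; here they are derived from known errors to show the labeling only.

```python
from typing import Dict

def error_class(error_m: float, threshold_m: float = 400.0, n_bins: int = 3) -> int:
    """Map a spatial error to a class: 0 is best, n_bins means rejected."""
    if error_m < 0:
        raise ValueError("spatial error cannot be negative")
    if error_m > threshold_m:
        return n_bins  # beyond the acceptance threshold
    width = threshold_m / n_bins  # equal-width bins (assumption)
    return min(int(error_m // width), n_bins - 1)

def pick_best(classes: Dict[str, int]) -> str:
    """Choose the candidate whose class is nearest to zero error."""
    return min(classes, key=classes.get)

# The two addresses from the case study, with their reported errors in meters.
binghamton = {"TIGER": error_class(450), "OpenAddresses": error_class(350)}
oneida = {"TIGER": error_class(50), "OpenAddresses": error_class(30_000)}
print(pick_best(binghamton))  # OpenAddresses
print(pick_best(oneida))      # TIGER
```

With graded classes, neither a fixed "always prefer TIGER" rule nor a fixed "always prefer OpenAddresses" rule gets both case-study addresses right, while per-candidate selection does.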
RESULTS
Accuracy
Type            Name            Google %  Here %  MapQuest %  Consolidation %
Commercial      Google          -         97.18   96.55       99.10
Commercial      Here            94.88     -       97.17       99.26
Commercial      MapQuest        94.52     97.16   -           98.82
Commercial      GeoServices     94.68     95.30   94.71       96.43
Non-Commercial  EaserGeocoder   96.46     97.12   95.82       97.93
Non-Commercial  Nominatim       54.74     54.15   54.08       55.68
Non-Commercial  Geonames        82.65     83.20   83.53       84.40
Non-Commercial  DataSciToolkit  89.05     89.50   89.52       90.71
• 18,890 residential addresses crawled from one real estate website
• Defined 400 meters as the threshold for accuracy
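A threshold-based accuracy figure of this kind can be computed as the share of addresses whose geocoded point falls within 400 m of the reference point. The sketch below assumes great-circle (haversine) distance and a spherical Earth; the talk does not state which distance measure was used, and the function names and sample coordinates are illustrative.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation

Point = Tuple[float, float]  # (latitude, longitude) in degrees

def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in meters."""
    lat1, lat2 = math.radians(a[0]), math.radians(b[0])
    dlat = lat2 - lat1
    dlon = math.radians(b[1] - a[1])
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def accuracy_pct(pairs: List[Tuple[Point, Point]], threshold_m: float = 400.0) -> float:
    """Percentage of (geocoded, reference) pairs within the threshold."""
    hits = sum(haversine_m(g, r) <= threshold_m for g, r in pairs)
    return 100.0 * hits / len(pairs)

pairs = [
    ((40.000, -73.0), (40.001, -73.0)),  # roughly 111 m apart: counts as correct
    ((40.000, -73.0), (40.010, -73.0)),  # roughly 1.1 km apart: counts as wrong
]
print(accuracy_pct(pairs))
```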
Spatial Error
Type            Name            Google (m)  Here (m)  MapQuest (m)  Consolidation (m)
Commercial      Google          -           24.29     55.52         18.37
Commercial      Here            24.29       -         55.20         17.67
Commercial      MapQuest        55.85       55.20     -             35.86
Commercial      GeoServices     35.75       31.08     51.16         29.02
Non-Commercial  EaserGeocoder   31.08       26.85     53.30         26.90
Non-Commercial  Nominatim       65.55       63.88     57.98         58.09
Non-Commercial  Geonames        93.38       91.78     75.32         83.58
Non-Commercial  DataSciToolkit  87.42       85.47     71.36         77.39
Spatial Accuracy Variation - Histogram
[Figure: histogram of spatial accuracy variation; number of addresses = 18,890]
Scalability Tests
EaserGeocoder
• The system is available from: http://bmidb.cs.stonybrook.edu/easergeocoder/
• Completely free for academic usage!
Summary
• Introduction
  – Problem, Motivation, Goal
• Background
  – Classic Geocoding Model
• Open Data Sources
  – Linear-Based, Point-Based, Community Contributed
• Integrative Geocoding
  – Integrative Geocoding Model, Oriented Searching, Intelligent Answer Selection
• Results
  – Accuracy, Spatial Error, Spatial Accuracy Variation, Scalability
Thanks a lot for your attention
Any Questions?
EXTRA MATERIALS
Overview - Integrative Geocoding Model
• Utilizing multiple free open reference sources
  – Maximizing coverage and accuracy
  – Paying no cost for data
• Searching each data source based on its characteristics
  – Sources are used unchanged, except for preliminary data cleaning and standardization
  – Extracting candidates: the most similar matches from each source
• Choosing the best answer among the candidates
  – By using machine learning techniques
Traditional Geocoding Method
• Street Network Map as the source
• Interpolation methods for estimating the location
• Accuracy depends on:
  – Density
  – Estimation error
• The most common method in geocoding systems
Linear-Based Dataset
• Topologically Integrated Geographic Encoding and Referencing (TIGER)
• Street Network Map
• Vast coverage and reliable
• Cons:
  ▪ Does not contain exact building locations → addresses are placed by linear interpolation
  ▪ Parcel homogeneity
  ▪ Offset from beginning and end
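Linear interpolation along a street segment can be sketched as follows. This is a deliberately simplified illustration: it treats the segment as a straight line between two endpoints, ignores odd/even street sides and any offset from the segment ends, and uses hypothetical parameter names rather than actual TIGER attribute names.

```python
from typing import Tuple

Point = Tuple[float, float]

def interpolate_address(number: int, from_no: int, to_no: int,
                        start: Point, end: Point) -> Point:
    """Place a house number proportionally between a segment's endpoints."""
    if to_no == from_no:
        t = 0.5  # single-number range: use the midpoint
    else:
        t = (number - from_no) / (to_no - from_no)
    t = max(0.0, min(1.0, t))  # clamp numbers that fall outside the range
    return (start[0] + t * (end[0] - start[0]),
            start[1] + t * (end[1] - start[1]))

# House number 50 on a segment covering 0..100 lands at the midpoint.
print(interpolate_address(50, 0, 100, (0.0, 0.0), (0.0, 1.0)))  # (0.0, 0.5)
```

The estimate is only as good as the assumption that buildings are evenly spaced along the segment, which is where parcel homogeneity and end offsets introduce error.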