RISIS / Working with geographical data
UPEM geocoding and clustering methods applied to the EUPRO FP3 subdataset
Lionel Villard, Michel Revollo
10/09/2015
Goals / Sources / Geocoding / Filtering / Clustering / Boundaries / Naming / Further challenges

Main goals
Analyze the geographical distribution of FP3 addresses and measure aggregation effects by identifying the geographical areas where a high density of activity takes place.
1/ Selection of attributes, cleaning step and external data
The addresses were already clean and needed no further treatment.
External data added: two-letter country code and English country name (ISO 3166-1 alpha-2 standard).
We chose to use two different attributes:
sAddress_orig: complete addresses, possibly including building names, postal codes, cities and countries
  19 710 objects
  5 without address (excluded)
  4% with only a country in the address (excluded)
sCity and ISO country names: 95.8% with a city name
We also tried postal codes, but the batchgeocode geocoding engine did not handle them accurately.
2/ Automatic geocoding step
Results were retrieved automatically from the batchgeocode web application, in two passes:
  Complete addresses
  Cities with ISO country names
Returned information:
  Cleaned address
  Longitude and latitude coordinates
  Accuracy of the coordinates
Examples of results for addresses (figure not reproduced)
Examples of results for cities (figure not reproduced)
3/ Filtering accuracies and sources
Results from geocoding the complete addresses take priority over results from geocoding the cities, because they are more precise (e.g. building level).

Accuracy   % geocoded addr.   Accuracy label
1           0%                Country level
2           0%                Region (state, province, prefecture, etc.) level
3          51%                Sub-region (county, municipality, etc.) level
4           0%                Town (city, village) level
5          13%                Post code (zip code) level
6           0%                Street level
7           0%                Intersection level
8           6%                Address level
9          37%                Premise (building name, property name, shopping center, etc.) level
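The prioritization above can be sketched as a simple fallback rule. This is only an illustration, not the authors' implementation: the `GeocodeResult` type, the `pick_result` function and the `min_accuracy` cut-off of 3 are assumptions (the slides only show that no retained result has an accuracy below 3).

```python
from typing import NamedTuple, Optional


class GeocodeResult(NamedTuple):
    """Hypothetical container for one geocoder answer."""
    source: str     # "address" or "city"
    lat: float
    lon: float
    accuracy: int   # 1 (country level) .. 9 (premise level)


def pick_result(address_result: Optional[GeocodeResult],
                city_result: Optional[GeocodeResult],
                min_accuracy: int = 3) -> Optional[GeocodeResult]:
    """Prefer the full-address result; fall back to the city result.

    A result is usable only if its accuracy is at least min_accuracy
    (assumed threshold, not stated in the slides).
    """
    for candidate in (address_result, city_result):
        if candidate is not None and candidate.accuracy >= min_accuracy:
            return candidate
    return None
```

With this rule a record geocoded at premise level from its full address is never replaced by a coarser city-level position.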
Sources of addresses, accuracies and geocoded addresses

Source      Accuracy   Geocoded addresses   Share by accuracy
addresses   3           8018                 47%
addresses   5           2469                 15%
addresses   8           1106                  7%
addresses   9           5390                 32%
Total                  16983                100%
cities      3           1445                 96%
cities      5             19                  1%
cities      8              2                  0%
cities      9             37                  2%
Total                   1503                100%

Total geocoded addresses: (16983 + 1503) / 19715 = 93.8%
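The shares and the overall coverage in this table can be re-derived from the counts. The snippet below is a small check using only the figures shown on the slide; the variable and function names are illustrative.

```python
# Counts per accuracy level, copied from the table above.
address_counts = {3: 8018, 5: 2469, 8: 1106, 9: 5390}
city_counts = {3: 1445, 5: 19, 8: 2, 9: 37}
TOTAL_RECORDS = 19715


def shares(counts):
    """Percentage of each accuracy level, rounded to whole percent."""
    total = sum(counts.values())
    return {level: round(100 * n / total) for level, n in counts.items()}


def coverage(*count_tables, total_records=TOTAL_RECORDS):
    """Share of all records that received coordinates, in percent."""
    geocoded = sum(sum(t.values()) for t in count_tables)
    return round(100 * geocoded / total_records, 1)


address_shares = shares(address_counts)
city_shares = shares(city_counts)
overall = coverage(address_counts, city_counts)
```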
Top 10: geocoded addresses per country

Top   iso_ctry_code_alpha2   TotalNbAddr   Geocoded Addr   % Geocoded
1     FR                     3174          2989            94.2%
2     DE                     2966          2800            94.4%
3     GB                     2610          2518            96.5%
4     IT                     2036          1957            96.1%
5     ES                     1587          1460            92.0%
6     NL                     1370          1291            94.2%
7     BE                     1023           920            89.9%
8     GR                      923           857            92.8%
9     DK                      764           719            94.1%
10    PT                      652           613            94.0%
4/ Clustering step
A method based on a combination of two sequential approaches.
1/ Identification of the initial clusters with a density-based algorithm (DBSCAN, 1996) that is able to identify the areas where activities are concentrated. The clusters are defined by two parameters fixed before the calculation: all points of a cluster are surrounded by at least X points in a circle with a diameter of Y km.
Question addressed: where are the areas in which activity is most intense?
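To make the two parameters concrete, here is a minimal pure-Python sketch of DBSCAN on latitude/longitude points. It is not the code used for the study. Assumptions: distances are great-circle (haversine) distances, `eps_km` is a search radius rather than a diameter, and `min_pts` counts the point itself, as in the usual formulation. Note that in DBSCAN only core points are guaranteed to have `min_pts` neighbours; border points are attached to a cluster without meeting that condition.

```python
import math

EARTH_RADIUS_KM = 6371.0
NOISE = -1


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force O(n) scan; includes i itself.
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joined, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep growing
        cluster_id += 1
    return labels
```

The brute-force neighbour search is quadratic in the number of points, which is acceptable for a sketch; a spatial index would be the natural choice for a dataset of this size.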
2/ In a second step, we compare two different dimensions of the relation between the initial clusters:
2.1 How intense are the relations between the initial clusters (less than 100 km between the centroids)? RI / Relative Interconnectivity
2.2 Will the final cluster have a collaboration profile similar to that of the two initial clusters taken separately (to avoid large variations in the density of links in the final cluster)? RC / Relative Closeness (not relevant in our prototype: there are no relations between addresses)
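The 100 km condition on centroids can be illustrated as a pre-selection of candidate cluster pairs, before any RI or RC measure is computed. This is a sketch under assumptions: the centroid is taken as the plain mean of latitudes and longitudes (reasonable only for small clusters away from the antimeridian), and the names `centroid` and `candidate_pairs` are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def centroid(points):
    """Arithmetic mean of (lat, lon) pairs (assumed centroid definition)."""
    lats, lons = zip(*points)
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def candidate_pairs(clusters, max_km=100.0):
    """Pairs of cluster ids whose centroids are less than max_km apart.

    clusters maps a cluster id to its list of (lat, lon) points.
    """
    centres = {cid: centroid(pts) for cid, pts in clusters.items()}
    return [(a, b) for a, b in combinations(sorted(centres), 2)
            if haversine_km(centres[a], centres[b]) < max_km]
```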
5/ Drawing cluster boundaries
Main goal: convert points grouped by a unique cluster key into areas delimited by boundaries.
Method: Minimum Convex Polygons (MCP, or convex hull) computed with the Geospatial Modelling Environment software (Hawthorne Beyer, 2014).
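For readers without access to that software, the convex hull of each cluster can be sketched with Andrew's monotone chain algorithm. This is a substitute illustration, not what Geospatial Modelling Environment does internally. Assumptions: points are (x, y) pairs treated as planar coordinates (for example projected coordinates, or longitude/latitude over a small area), and `hulls_by_cluster` is a hypothetical helper name.

```python
from collections import defaultdict


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hulls_by_cluster(labelled_points):
    """Group (cluster_key, (x, y)) records and compute one hull per key."""
    groups = defaultdict(list)
    for key, xy in labelled_points:
        groups[key].append(xy)
    return {key: convex_hull(pts) for key, pts in groups.items()}
```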
6/ Naming step
Main goals:
  finding a relevant name for each cluster (a readable, easily understandable name that does not depend on the data)
  identifying the core cities of the clusters
Method: geographical intersection of two layers
  Populated Places: layer of points for cities produced by the Natural Earth project (Fourth Edition, Oct. 2009-2012, mainly members of the North American Cartographic Information Society); many capitals, major cities and towns, plus a sampling of smaller towns in sparsely inhabited regions
  Cluster shapes: layer of shapes for clusters
Geoprocessing: intersection
Input layers: all the 7323 cities with population (2012); all cluster shapes
Result: selected cities with population inside cluster shapes
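The intersection of the two layers amounts to a point-in-polygon test for each city against each cluster shape. Below is a minimal sketch for convex shapes, followed by a naming rule that orders the matching cities by population. Assumptions: shapes are convex polygons given in counter-clockwise order, cities are dictionaries with hypothetical keys `name`, `xy` and `population`, and `max_names` (how many city names are kept) is an illustrative parameter that the slides do not specify.

```python
def point_in_convex_polygon(point, hull):
    """True if point lies inside or on a convex polygon (CCW vertices)."""
    n = len(hull)
    if n < 3:
        return False
    for i in range(n):
        a, b = hull[i], hull[(i + 1) % n]
        # A negative cross product means the point is right of edge a->b,
        # i.e. outside a counter-clockwise polygon.
        if (b[0] - a[0]) * (point[1] - a[1]) - (b[1] - a[1]) * (point[0] - a[0]) < 0:
            return False
    return True


def cluster_name(hull, cities, max_names=3):
    """Join the most populated cities inside the shape with ' / '."""
    inside = [c for c in cities if point_in_convex_polygon(c["xy"], hull)]
    inside.sort(key=lambda c: c["population"], reverse=True)
    return " / ".join(c["name"] for c in inside[:max_names])
```

Because the name is derived from a reference layer of cities rather than from the addresses themselves, it stays stable when the underlying dataset changes.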
Examples of cluster names
The cluster's name is built with a popularity criterion: the names of the core cities are ordered by population.

IdClusterD   ClustAddr   ClustName
1            1034        Athens / Piraievs
2             362        Lisbon
3              77        Valencia
4             466        Madrid
5             120        Thessaloniki
6             197        Barcelona
7             562        Rome / Vatican City
8             121        Toulouse
9             112        Montpellier
10             75        Pisa
11             97        Florence
12             88        Genoa
13             82        Bologna
14            150        Turin
15            159        Grenoble
16            308        Milan
17            130        Lyon
18            272        Munich
19             79        Vienna
20           2552        Paris / Versailles

Main organisations in proportion of addresses in the clusters

IdClusterD   ClustAddr   ClustName                      stOrg                                                             NbOrgAdd   Pc
42           1662        Kobenhavn / Malmo / Roskilde   Technical University of Denmark - Danmarks Tekniske                97         5.84%
42           1662        Kobenhavn / Malmo / Roskilde   University of Copenhagen - Koebenhavns Universitet (KU)            91         5.48%
25           1110        Brussels / Namur               Katholieke Universiteit Leuven                                    108         9.73%
25           1110        Brussels / Namur               Universite catholique de Louvain                                   73         6.58%
1            1034        Athens / Piraievs              National Technical University of Athens (NTUA)                     87         8.41%
7             562        Rome / Vatican City            Universitá di Roma La Sapienza, University of Rome La Sapienza     40         7.12%
30            531        Essen / Wuppertal              Ruhr-Universität Bochum                                            29         5.46%
4             466        Madrid                         UPM Universidad Politecnica de Madrid/Madrid Polytechnical         55        11.80%
4             466        Madrid                         CSIC - Consejo Superior de Investigaciones Cientificas/Higher      52        11.16%
4             466        Madrid                         UCM Universidad Complutense de Madrid                              49        10.52%
See the maps!
Two thresholds have been tested for the minimal number of addresses within 25 km needed to begin a cluster: 75 addresses and 100 addresses.

Proportion of addresses inside and outside clusters

                   Threshold 100      Threshold 75
Inside clusters     9298   50.3%      10446   56.5%
Outside clusters    9187   49.7%       8039   43.5%
Total              18485              18485
7/ Further work
Quality check: are there differences (in distance) between geocoded cities and geocoded addresses?
Analytical dimensions: combining this geographical information with other attributes (temporal, sectoral...) to analyse the geographical dynamics of FP projects.
Merging close clusters: with relations between addresses, we would be able to compare close clusters and merge them if they have similar characteristics (in terms of relations).
Generalisation: applying this process to all FPs.