
Variability of country names and identifiers in datasets: Reconciling practical and cultural perspectives

International Cartographic Conference, Dresden, August 2013. Laura Kostanski | Sara-Jane Farmer | Rob Atkinson


  1. Variability of country names and identifiers in datasets – Reconciling practical and cultural perspectives
     International Cartographic Conference, Dresden
     Laura Kostanski | Sara-Jane Farmer | Rob Atkinson
     August 2013
     GOVERNMENT AND COMMERCIAL SERVICES THEME

  2. Today’s Presentation
     • Overview
     • Cultural Reasons for Multiple Country Names
     • Impact of Cultural Reasons
     • Multiple Country Name Datasets
     • Reconciling Information
     • Spatial Identifier Reference Framework (SIRF) Approach

  3. Overview
     • There are multiple country name datasets in use
       • e.g. ISO 3166, UNSTATS, Alexandria Digital Library, CIA Fact Book, UN-FAO
     • Multiple stakeholders in creation and use of data using these names
       • e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups
     • Time spent accessing and reconciling data is costly and delays production of results from analysis
     • The same issues apply to most, perhaps all, identifiers of spatial objects
     • Preview of how we might tackle this problem

  4. Context
     CSIRO. UNSDI Gazetteer for Social Protection in Indonesia

  5. Data Analysis
     Utopia Way Inc. investigated files in the data.un.org dataset … and identified significant issues with country name alignments and mismatches.
     Country names were discovered in multiple fields, such as:
     • country of birth
     • country of citizenship
     • country or area
     • country or territory
     • country or territory of asylum or residence
     • country or territory of origin
     • reference area
     An automated matching process was set up to explore the extent of the issue. In all, 21,195,188 rows of data were analysed.

  6. Common “Errors”
     Index error: Examples
     • Withdrawn countries with no ISO3166 code: “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People's Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
     • Abbreviation: “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
     • Added markers: “+” added to the end of region names, to differentiate them from country names. “MDG_” added to region names, e.g. “MDG_Southern Asia”.
     • Capitalisation: “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”.
     • Brackets “()” or “[]” instead of commas: “Virgin Islands (British)” for “British Virgin Islands”.
     • Standards confusion: The ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
     • Use of familiar names: Brunei, Ivory Coast, China, Libya.
     • Issues with character translation: Cote d'Ivoire, Åland Islands, Curaçao, Réunion.
     • Misspellings: Double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
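
Several of the error classes listed on slide 6 (abbreviations, added markers, capitalisation, brackets versus commas, character translation, stray spaces) can be reduced by normalising labels before comparing them. The following is a minimal Python sketch of that idea, not the matching process used in the study. The function names `normalise` and `match`, and the four-entry `REFERENCE` table, are assumptions made for illustration.

```python
import re
import unicodedata

# Abbreviations named on the slide, mapped to their expanded forms.
ABBREVIATIONS = {
    "rep.": "republic",
    "st.": "saint",
    "is.": "island",
    "isds": "islands",
    "&": "and",
}


def normalise(label: str) -> str:
    """Reduce a country or region label to a comparison key."""
    # Remove the region markers: a leading "MDG_" and a trailing "+".
    label = re.sub(r"^MDG_", "", label.strip()).rstrip("+")
    # Fold accents so "Côte d'Ivoire" and "Cote d'Ivoire" compare equal.
    label = unicodedata.normalize("NFKD", label)
    label = "".join(ch for ch in label if not unicodedata.combining(ch))
    # Ignore capitalisation differences.
    label = label.lower()
    # Treat brackets and commas alike, as plain separators.
    label = re.sub(r"[()\[\],]", " ", label)
    # split() drops double and trailing spaces; expand abbreviations per token.
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in label.split())


# Hypothetical reference subset (name -> ISO 3166-1 alpha-2 code), for illustration only.
REFERENCE = {
    "Yemen": "YE",
    "Saint Lucia": "LC",
    "Virgin Islands, British": "VG",
    "Réunion": "RE",
}
INDEX = {normalise(name): code for name, code in REFERENCE.items()}


def match(label: str):
    """Return the code for a label, or None when normalisation alone cannot match it."""
    return INDEX.get(normalise(label))
```

With this sketch, `match("YEMEN")` and `match("St. Lucia")` find their codes, while `match("British Virgin Islands")` returns `None`: reordered and familiar names need an explicit alias list, which normalisation cannot supply.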

  7. Long names, short names

  8. Data sets providing country names
     Organisation: Name of Data Set
     • United Nations Statistics Division: Country and Region Codes for Statistical Use
     • Working Group on Country Names, United Nations Group of Experts on Geographic Names: List of Country Names
     • Terminology Section, Department for General Assembly and Conference Management: Multilingual Terminology Database (UNTERM)
     • International Organization for Standardization (ISO): ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3)
     • Food and Agriculture Organisation of the United Nations: Global Administrative Unit Layers (GAUL)
     • United Nations Geospatial Information Working Group (UNGIWG): Second Administrative Level Boundaries (SALB)
     • National Geospatial Intelligence Agency: Federal Information Processing Standard (FIPS) 10-4: Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
     • NATO: Standards Agreement (STANAG) 1059

  9. Two Aspects of Country Name Datasets
     1: Development of datasets
        Why is there a proliferation of country name sources?
        • Cultural issues
        • Development practices
     2: Usage
        How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) assist in reducing the ambiguity associated with multiple, heterogeneous country name sources?
        • Can we do better? What do we need to do it?

  10. Cultural Issues
      • Toponyms provide communities with identity (Toponymic Identity is both reflected and reinforced)
      • Country names are the highest-order toponyms
      • Problems are similar at lower levels, compounded by scale (size of problem) and higher rates of change (e.g. electoral boundaries, urban growth)

  11. Endonym/Exonym
      Above and beyond an individual’s attachment to the endonym of their country, other languages often use multiple exonyms for it.
      e.g. Deutschland (endonym) = Germany or Allemagne (exonyms)
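
One way to handle endonyms and exonyms is to store every label against a single identifier and look names up through a reverse index. The sketch below assumes ISO 3166-1 alpha-2 codes as the identifier and language tags as keys. The `LABELS` table and the `build_reverse_index` helper are hypothetical and are not part of any dataset named in the slides.

```python
# Labels for one country, keyed by language tag. "de" holds the endonym,
# the other entries are exonyms taken from the slide's example.
LABELS = {
    "DE": {"de": "Deutschland", "en": "Germany", "fr": "Allemagne"},
}


def build_reverse_index(labels):
    """Map every known label, case-insensitively, to its country code."""
    index = {}
    for code, by_language in labels.items():
        for name in by_language.values():
            index[name.casefold()] = code
    return index


NAME_TO_CODE = build_reverse_index(LABELS)
```

Looking up `"allemagne"`, `"germany"` or `"deutschland"` in `NAME_TO_CODE` then yields the same code, so records written in different languages can be grouped together.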

  12. Other Cultural Country Naming Considerations
      • Formal/Informal naming applications (particularly prevalent in the social media world, e.g. ‘Oz’ for Australia)
      • Political/Non-Political Usage, e.g. ‘Commonwealth of Australia’
      • Change over time, e.g. Czechoslovakia
      • Non-standardised international conventions, e.g. Saint or St? The or none?

  13. The Impact
      All of these cultural mores affect the ability of people and organisations to record country name information in a standardised, transparent manner. Thus, there exists a proliferation of country name lists which are officially promoted by international agencies. This impact is then intensified in usage.

  14. Options
      Suggested improvements to the indices and standards include:
      1. Improve access to source data
         a. Make the UN’s regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
         b. Make the UN’s economic status list available as a csv file online.
      2. Lobby to improve content
         a. ISO to create a region (Africa, West Africa, North America etc.) code standard.
         b. ISO to correct inconsistencies in the ISO countries list (e.g. “republic” not “Republic” in Bolivia’s name).
      3. Policy
         a. Make a definitive statement about which GIS naming standard (ISO, UNSTATS etc.) UN online development data should attempt to adhere to.
      4. Better citation mechanisms
         a. Standardised metadata and identifiers that “resolve”, i.e. link back to data
         b. Shared infrastructure to link all the information together
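
Option 1a asks for assignment and withdrawal dates because a name such as Czechoslovakia is only a valid match for data from certain years. The sketch below shows one possible date-aware lookup in Python. The `NameRecord` type, the `resolve` function and the years in `RECORDS` are illustrative placeholders and are not taken from the UN or ISO lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NameRecord:
    name: str
    code: str
    assigned: int              # first year in which the code applies
    withdrawn: Optional[int]   # first year in which it no longer applies; None if current


# Illustrative rows only; the years are placeholders, not authoritative dates.
RECORDS = [
    NameRecord("Czechoslovakia", "CS", 1974, 1993),
    NameRecord("Czech Republic", "CZ", 1993, None),
    NameRecord("Slovakia", "SK", 1993, None),
]


def resolve(name: str, year: int) -> Optional[str]:
    """Return the code that was valid for this name in the given year, if any."""
    for rec in RECORDS:
        if rec.name != name:
            continue
        still_valid = rec.withdrawn is None or year < rec.withdrawn
        if rec.assigned <= year and still_valid:
            return rec.code
    return None
```

Under these placeholder dates, `resolve("Czechoslovakia", 1985)` finds a code while `resolve("Czechoslovakia", 2000)` does not, which is the distinction a list without withdrawal dates cannot make.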

  15. Spatial Identifier Reference Framework
      CSIRO has been working with stakeholders including the UN, national agencies and others on a set of standards and infrastructure services to support discovering and linking multiple sources of spatial references.
      This is being presented in more detail in:
      6D.3 Spatial Identifier Reference Framework (SIRF): Realising the potential of SDI Using Spatial Identifiers to Link Multiple Information Systems (#633)
      Paul Box 1, Robert Atkinson 1, Laura Kostanski 2
      S6-D-SDI, Tuesday, August 27, 2013, 04:30 p.m. - 05:45 p.m., Room: Conference Level - C1

  16. One real world feature: a bus station
      Represented in multiple systems using different names, and classified and represented in different ways. Currently systems are disconnected and difficult to integrate.
      Sources: BIG (National Gazetteer of Indonesia); Department of Transport (Bus Terminals)
      Records (Identifier, Feature Type, Footprint):
      • Merak, Terminal, Polygon
      • Merak, Stasiun Bis, Transport, Point
      [Diagram: Spatial Identifier Reference Framework. The gazetteer entries “Merak, Stasiun Bis” and “Merak”, from the gazetteers “Terminus Dataset” and “Gazetir Indonesia”, are joined by a “same as” link (based on the same feature in different gazetteers). The entries are used in web applications and other online resources, shown as linked resources: Navigation application, Online Public Transport Map Application, Passenger Travel Stats.]
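
The “same as” links between gazetteer entries on slide 16 can be thought of as equivalence classes over (gazetteer, identifier) pairs. The Python sketch below uses a small union-find structure to hold such links. It is an illustration of the idea only, not SIRF's implementation, and the `SameAsIndex` class is hypothetical. Which identifier belongs to which gazetteer is also an assumption, since the flattened slide text does not make the pairing certain.

```python
class SameAsIndex:
    """Groups gazetteer entries that refer to the same real-world feature."""

    def __init__(self):
        self._parent = {}

    def _find(self, entry):
        # Register unseen entries as their own group, then walk up to the root.
        self._parent.setdefault(entry, entry)
        while self._parent[entry] != entry:
            # Path halving keeps later lookups short.
            self._parent[entry] = self._parent[self._parent[entry]]
            entry = self._parent[entry]
        return entry

    def link(self, a, b):
        """Record that entries a and b describe the same feature."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_feature(self, a, b):
        return self._find(a) == self._find(b)

    def equivalents(self, entry):
        """All known entries linked, directly or indirectly, to this one."""
        root = self._find(entry)
        return sorted(e for e in list(self._parent) if self._find(e) == root)


# Assumed pairing of identifiers to gazetteers, for illustration.
GAZETIR_ENTRY = ("Gazetir Indonesia", "Merak")
TERMINUS_ENTRY = ("Terminus Dataset", "Merak, Stasiun Bis")

links = SameAsIndex()
links.link(GAZETIR_ENTRY, TERMINUS_ENTRY)
```

An application holding only one of the two identifiers can call `links.equivalents(...)` to find the other and follow it to the resources that use it.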

  17. Identifiers
      This is the “tricky part”. Let's start with the practical implication…
      Catchment   ExtractionRate   Storage
      1123343     730              300
      Catchment   Area      Boundary Geometry
      1123343     33535.4   151.3344, -35.330…….
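
The practical implication on slide 17 is that two tables held by different systems can only be combined if both use the same catchment identifier. A minimal Python sketch of that join follows. The dictionary layout and the `join_on_identifier` helper are assumptions, and the geometry column is left out because its value is truncated on the slide.

```python
# Values copied from the slide; keys are the shared catchment identifier.
usage = {"1123343": {"extraction_rate": 730, "storage": 300}}
boundaries = {"1123343": {"area": 33535.4}}


def join_on_identifier(left, right):
    """Merge the rows that share an identifier; rows without a partner are dropped."""
    shared = left.keys() & right.keys()
    return {key: {**left[key], **right[key]} for key in shared}


combined = join_on_identifier(usage, boundaries)
```

If one system wrote the identifier differently (for example with a prefix or as a name), `shared` would be empty and the join would silently return nothing, which is why the identifier is the tricky part.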

  18. “Distributed” references
      Catchment   ExtractionRate   Storage
      1123343     730              300
      [Diagram labels between the two tables: “How to ask for this entity”, “Internet”, “How to deliver this entity”]
      Catchment   Area      Boundary Geometry
      1123343     33535.4   151.3344, -35.330…….
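
Slide 18 asks how an entity held elsewhere can be asked for and delivered over the Internet. One common answer is to turn the local identifier into an HTTP URI. The Python sketch below shows only the minting and parsing steps. The base address `http://example.org/id/` and the functions `mint_uri` and `parse_uri` are hypothetical and do not describe SIRF's actual URI scheme.

```python
from urllib.parse import quote, unquote, urlsplit

BASE = "http://example.org/id/"  # hypothetical base address, for illustration only


def mint_uri(dataset: str, local_id: str) -> str:
    """Build a URI from a dataset name and the identifier used inside it."""
    # safe="" percent-encodes "/" as well, so each part stays one path segment.
    return f"{BASE}{quote(dataset, safe='')}/{quote(local_id, safe='')}"


def parse_uri(uri: str):
    """Recover the (dataset, local_id) pair from a URI minted above."""
    path = urlsplit(uri).path
    dataset, local_id = path[len("/id/"):].split("/", 1)
    return unquote(dataset), unquote(local_id)
```

Under this assumed scheme, `mint_uri("catchment", "1123343")` gives a single address that either table could cite, and identifiers containing commas or spaces, such as “Merak, Stasiun Bis”, survive the round trip through `parse_uri`.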

  19. One real world feature: a bus station
      [Diagram: the bus station figure from slide 16, repeated with the added labels “SDI resource access”, “Provenance”, “URI”, “Describe”, “Discover” and “Link”.]
