GeoDISCO: Encyclopedic Geographical Discourse in France from the Enlightenment to Wikipedia D. Vigier, T. Joliveau, L. Moncla, K. McDonough, A. Brenon ludovic.moncla@liris.cnrs.fr GIR’19
What is this project about?
The GéoDisco Project asks: how does geographical discourse develop in French encyclopedias between the 18th century and today?
1. Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers, par une Société de Gens de lettres (1751-1772), edited by Diderot and d’Alembert, hereafter EDDA (https://artfl-project.uchicago.edu/)
2. Encyclopedia Universalis (2018 digital edition)
3. French Wikipedia (July 2018)
Overview
Three main axes
1. Named Entity Recognition and Classification in the EDDA
   - Improving NER with a linguistic approach
2. Toponym Disambiguation
   - Building a network of relations between toponyms
3. Extracting explicit locations in EDDA
   - Extracting and interpreting geographical coordinates from EDDA articles
Improving NER with a linguistic approach
Corpus analysis using the TXM platform (http://textometrie.ens-lyon.fr/)
• The goal is to find specific patterns in order to improve the PERDIDO NER rules
Methodology
• Identify the most frequent proper nouns in the geography subcorpus, based on the POS tagging (TreeTagger)
• Manually select the 30 most frequent occurrences for several types of entities: country, city, region, person
• For each list, compute the co-occurrences, ordered by specificity score
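The co-occurrence step can be illustrated with a minimal sketch. It only counts raw frequencies at fixed positions around a list of entity names; it does not compute the TXM specificity score, and the function name, the token list, and the target names are all assumptions made for illustration.

```python
from collections import Counter

def positional_cooccurrences(tokens, targets, offsets=(-2, -1, 1, 2)):
    """Count which words appear at each offset around the target words."""
    wanted = {t.lower() for t in targets}
    counts = {offset: Counter() for offset in offsets}
    for i, token in enumerate(tokens):
        if token.lower() not in wanted:
            continue
        for offset in offsets:
            j = i + offset
            if 0 <= j < len(tokens):
                counts[offset][tokens[j].lower()] += 1
    return counts

# Toy sentence, invented for illustration (not taken from EDDA).
tokens = "Lyon , ville de France , à 5 lieues de Vienne".split()
counts = positional_cooccurrences(tokens, {"France", "Vienne"})
print(counts[-1].most_common(3))
```

Ranking these counts by a specificity measure, as TXM does, would be a separate step on top of this tally.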
Improving NER with a linguistic approach
Most important co-occurrences, by position relative to the entity (Place: country, city, region; Person)
• Position -1: country: de, en, le; city: de, à, dans; region: de, en, le; person: par, de, selon, sous, suivant
• Position -2: ville, cour, coutume, France, saint, roi, de, dans, bourg, parlement, prévot, ..., duc, comte, ..., empereur, pape, ..., royaume, rivière, roi, ...
• Position +1: punctuation mark (for every entity type), numeric value, et, I, II, IV, ...
• Position +2: dans, au, capitale, Sicile, géographie, numeric value, royaume, ..., Valais, Baptiste, ...
Recurring cues: prepositions, lists of nouns, ...
Toponym Disambiguation and Historical Texts: Challenges
There is no gazetteer for the 18th-century world
• Modern or historical gazetteers contain a lot of noise
• Typical ranking solutions (e.g. by population) do not apply well in these cases
• EDDA’s complex structure means that one geography article may refer to more than one place
• Sometimes it is impossible to match a toponym to a record in any resource
  - articles explicitly refuse to pin down a location for a toponym
  - there is no existing gazetteer record for a place that was nonetheless documented in the past
We propose to use toponym relations and attributes internal to the corpus of documents itself for toponym resolution
Building the Network
EDDA contains 20.7 million words in 44 632 entries, among them 14 457 articles classified as 'geographie'
Nodes
• get the list of headwords from the article metadata
• normalize headwords
  - handle prepositions, punctuation marks, and/or alternate names or spellings
  - e.g. 'Brassaw, ou Gronstat', 'Adiazzo, Adiazze ou Ajaccio'
Edges
• an edge is a relationship between two nodes
• extract place names with a custom version of the Perdido geoparser
• a new edge is created between the current node and each node corresponding to a toponym found in the article content
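The node and edge steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the toponym lists are given directly instead of coming from the Perdido geoparser, the first variant of a headword serves as node identifier, and the names normalize_headword and build_network are hypothetical.

```python
import re
from collections import defaultdict

def normalize_headword(headword):
    """Split a headword into lowercase name variants (commas and 'ou' as separators)."""
    parts = re.split(r"\s*,\s*(?:ou\s+)?|\s+ou\s+", headword.strip())
    cleaned = [p.strip(" .'’").lower() for p in parts]
    return [p for p in cleaned if p]

def build_network(articles):
    """articles maps a headword to the toponyms found in its text.

    Returns {source node: set of target nodes}. Using the first variant of
    the headword as node id is a simplification made for this sketch.
    """
    variant_to_nodes = defaultdict(set)
    for headword in articles:
        variants = normalize_headword(headword)
        for variant in variants:
            variant_to_nodes[variant].add(variants[0])
    edges = defaultdict(set)
    for headword, toponyms in articles.items():
        variants = normalize_headword(headword)
        if not variants:
            continue
        source = variants[0]
        for toponym in toponyms:
            # One edge per node that the toponym may refer to.
            for target in variant_to_nodes.get(toponym.lower(), ()):
                if target != source:
                    edges[source].add(target)
    return edges

# Toy input: headwords follow the slide, toponym lists are invented.
articles = {
    "Brassaw, ou Gronstat": ["Transylvanie"],
    "Transylvanie": ["Gronstat", "Hongrie"],
    "Hongrie": [],
}
edges = build_network(articles)
print(dict(edges))
```

Mapping every spelling variant to its node lets a mention of 'Gronstat' create an edge to the 'brassaw' node.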
In-degree centrality
Rank  Node       Score
1     france     0.1130
2     italie     0.0853
3     allemagne  0.0814
4     afrique    0.0481
5     espagne    0.0462
6     naples     0.0211
7     pologne    0.0199
8     paris      0.0183
9     océan      0.0161
10    perse      0.0158
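For reference, a minimal sketch of one common definition of in-degree centrality: the number of incoming edges divided by the number of other nodes. Whether the scores above use exactly this normalization is an assumption, and the toy graph is invented.

```python
def in_degree_centrality(edges):
    """edges maps a source node to the set of nodes it cites."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    in_degree = {node: 0 for node in nodes}
    for targets in edges.values():
        for target in targets:
            in_degree[target] += 1
    if len(nodes) < 2:
        return {node: 0.0 for node in nodes}
    # Normalize by the maximum possible number of incoming edges.
    return {node: degree / (len(nodes) - 1) for node, degree in in_degree.items()}

# Toy citation network, invented for illustration.
toy_edges = {
    "paris": {"france"},
    "lyon": {"france", "paris"},
    "france": {"europe"},
}
scores = in_degree_centrality(toy_edges)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```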
Betweenness centrality
Rank  Node              Score
1     mer méditerranée  0.0373
2     france            0.0223
3     allemagne         0.0223
4     natolie           0.0220
5     monde             0.0136
6     italie            0.0131
7     lycie             0.0129
8     lycus             0.0126
9     issus             0.0122
10    europe            0.0102
Using the Network for Disambiguation
Our hypothesis is that the quantitative citation network reveals qualitative relations.
• We compute an ego-centered network for the toponym to disambiguate
• We compute the betweenness centrality of the nodes in this ego-network
• The node with the highest value is selected as the most related
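The three steps above can be sketched with the standard library only. Several choices here are assumptions rather than details given on the slide: the network is treated as undirected, the ego-network has radius 1, the ego itself is excluded from the candidates, and the toy edges are invented. Betweenness is computed with Brandes' algorithm and left unnormalized.

```python
from collections import deque

def undirected(edge_list):
    """Build a symmetric adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edge_list:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def ego_network(adj, ego):
    """Keep the ego, its direct neighbors, and the edges among them."""
    keep = {ego} | adj[ego]
    return {node: adj[node] & keep for node in keep}

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies from the farthest nodes back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # each pair was counted twice

def most_related(adj, ego):
    """Return the non-ego node with the highest betweenness in the ego-network."""
    scores = betweenness(ego_network(adj, ego))
    candidates = {v: b for v, b in scores.items() if v != ego}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Toy edges, invented for illustration.
adj = undirected([
    ("aziruth", "egypte"), ("aziruth", "nil"), ("aziruth", "caire"),
    ("egypte", "nil"), ("egypte", "caire"), ("egypte", "afrique"),
])
print(most_related(adj, "aziruth"))
```

Excluding the ego matters in this sketch: in a radius-1 ego-network the ego is adjacent to every other node, so it would often win the ranking itself.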
Using the Network for Disambiguation
83 out of 100 responses are correct
For a city, the method returns the name of the country to which it belongs
- aziruth → egypte
- cezimbra → portugal
- ...
For a country, it returns the name of a neighboring location
- pérou → egypte
- vénézuéla → grenade (la nouvelle)
- ...
In 18% of the correct answers (15 out of 83), the returned name is not present in the content of the article
- isaurie → natolie
- salé → mer méditerranée
- walcheren → flessingue
- ...
Classification of nodes
Class         Nodes
city          6 378
unclassified  5 041
hydronym      1 193
country       1 033
mountain        174
Extracting explicit locations in EDDA
Some articles of EDDA are explicitly located
Extracting Geographic coordinates in EDDA
Many kinds of location expressions
• By absolute geographic coordinates
  - Long. 22. 30. latit. 45. 33
• By distances to other locations
  - à 5 lieues au midi & au dessous de Lyon, à 15 au nord-ouest de Grenoble, & à 108 au sud-est de Paris.
• By spatial relations
  - On a line: sur le bord oriental du Rhône
  - Within an area: province de France
  - Adjacent to another entity: bornée à l’occident par le Rhône
• By logical relations
  - Grenoble en est la capitale
• ...
Extracting Geographic coordinates in EDDA
Problems and constraints
• Irregular expressions
• References to different kinds of entities
  - Point: pair of coordinates
  - Area: 4 latitude and longitude references
• Several pairs of coordinates
  - More than one place in the article
  - Several suppositions of one location according to different sources or authors
Examples
- long. 62. 50. lat. 3. 28.
- Lat. 42. 8. long.. 67. 35.
- Long. 36. 4. lat. 40. 48. (D. J.)
- Long. 40. 5. latit. 62. 6.
- Lat. 14. 20 - 16. 15. long. 58. 30 - 59.
- Lat. 42 degrés, 20 minutes long. 306 degrés, 50 & quelques minutes.
- Lat. 37 degrés long. 27 & demi
- long. 135. 20. lat. mérid.
- Long. 18 d. 26 ’. 6". lat. 48 d. 57 ’. 43". CONCHITE
- entre le 32 & le 41 de long. & le 10 & le 20 de lat
- à 12 d. de long . & à 33. de latit
- la long. à 103. 50. & la lat. à 26
- Long. 110 d. & lat. 46. 45. selon Uluhbeg; & long . 116. & lat. 45. selon Nassiredden.
- Abulféda lui donne 78 d. 4 ’. de long. . elle a, selon quelques - uns, 43 d. 30 ’. de latit. septentrionale. (GIUND)
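A minimal sketch of how the most regular form (keyword, degrees, optional minutes, separated by dots) could be matched with a regular expression. It is an illustration only, not the project's extraction rules: the pattern and function name are assumptions, the second number is read as minutes, and ranges, "d." notations, spelled-out degrees, and multiple locations per article are not handled.

```python
import re

# Keyword (long..., lat...), optional dots, degrees, then optional ". minutes".
# Ordinary words such as "longue" followed by a number would also match.
COORD = re.compile(
    r"\b(long|lat)[a-zé]*\.*\s*(\d{1,3})(?:\.\s*(\d{1,2}))?",
    re.IGNORECASE,
)

def extract_coordinates(text):
    """Return the first longitude and latitude found, in decimal degrees."""
    coords = {}
    for match in COORD.finditer(text):
        key = "longitude" if match.group(1).lower() == "long" else "latitude"
        degrees = int(match.group(2))
        minutes = int(match.group(3) or 0)
        coords.setdefault(key, degrees + minutes / 60)  # keep the first mention
    return coords

for sample in ["Long. 22. 30. latit. 45. 33", "Lat. 42. 8. long.. 67. 35."]:
    print(sample, "->", extract_coordinates(sample))
```

The values are returned as written in the article: no hemisphere or prime meridian is applied at this stage.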
Looking for textual patterns
Comparison of two methods
Interactive exploration + manual extraction
• CQL queries (TXM)
  - [word="Longitude"]|[word="Longit"]|[word="Long"]|[word="longitude"]|[word="longit"]|[word="long"]
  - [word="Latitude"]|[word="Latit"]|[word="Lat"]|[word="latitude"]|[word="latit"]|[word="lat"]
• Laborious and tedious rearrangement of the results
• A useful knowledge of the different cases found in EDDA
Automatic annotation of the most frequent recurring patterns
• Automatic retrieval
• Misses some specific cases
  - situé entre le 45 & 47 degré de long. & entre le 15 & 23 degré de lat.
  - sous le troisième degré de long et sous le 20e de lat
Georeferencing the place names
• Still at the very beginning
  - 4702 articles have coordinates (merging the results of the 2 methods)
• Shortcomings
  - Longitude missing (sometimes)
  - North or South indication missing for latitude (quite often)
• Prime meridian?
  - Officially in France since Richelieu: the Ferro Meridian, in fact located by Delisle at 20° west of the Paris Meridian
  - d’Alembert, in the EDDA articles Latitude and Méridien: authors sometimes use a local meridian
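As a worked example of the prime meridian issue, the sketch below converts a longitude counted eastward from the Ferro Meridian to a Greenwich-based longitude. It assumes that Ferro lies exactly 20° west of Paris, that Paris lies about 2°20′14″ east of Greenwich, and that the article really uses Ferro; articles based on a local meridian would need a different offset. The function name is hypothetical.

```python
# Assumed offsets, in decimal degrees.
PARIS_EAST_OF_GREENWICH = 2 + 20 / 60 + 14 / 3600  # about 2.3372
FERRO_WEST_OF_PARIS = 20.0

def ferro_to_greenwich(lon_east_of_ferro):
    """Convert a longitude counted eastward from Ferro (0-360) to [-180, 180)."""
    lon = lon_east_of_ferro - FERRO_WEST_OF_PARIS + PARIS_EAST_OF_GREENWICH
    return (lon + 180) % 360 - 180

# Paris is 20 degrees east of Ferro by definition.
print(round(ferro_to_greenwich(20.0), 4))
# "long. 306 degrés, 50" from the examples, read as 306 degrees 50 minutes.
print(round(ferro_to_greenwich(306 + 50 / 60), 4))
```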
https://arcg.is/1STjfW
Thank you for your attention
CONTACT
Ludovic Moncla: ludovic.moncla@liris.cnrs.fr
Thierry Joliveau: thierry.joliveau@univ-st-etienne.fr