An Etymological Approach to Cross-Language Orthographic Similarity


  1. An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian
     Alina Maria Ciobanu, Liviu P. Dinu
     University of Bucharest, Center for Computational Linguistics
     http://nlp.unibuc.ro
     EMNLP 2014

  2. Overview
     • Orthographic similarity: motivation and approach
     • Identifying language relationships
     • Computing degrees of similarity
     • Results on 3 Romanian corpora from different historical periods
     • Results on Europarl (Romanian subcorpus)
     • Conclusions and future work
     Alina Maria Ciobanu, Liviu P. Dinu | An Etymological Approach to Cross-Language Orthographic Similarity | 2

  3. Language similarity
     • The similarity of natural languages is a fairly vague notion: both linguists and non-linguists have intuitions about which languages are more similar to which others [McMahon and McMahon, 2003].
     • Four types of similarity: typological, morphological, syntactic, lexical [Homola and Kubon, 2006].
     • It is necessary to develop quantitative and computational methods in this field [McMahon and McMahon, 2003].

  4. Applications
     • Linguistic phylogeny reconstruction [Alekseyenko et al., 2012; Barbançon et al., 2013].
     • Machine translation [Koppel and Ordan, 2011].
     • Language acquisition [Benati and VanPatten, 2011].
     • Language intelligibility assessment [Gooskens et al., 2008].

  5. Our approach
     • A language L1 is closer to a language L2 when texts written in L2 are more easily understood by speakers of L1 who have no prior knowledge of L2.
     • When people read a text in a foreign language, they first identify the words that resemble words from their native language.
     • Two types of related words:
       • Word-etymon pairs, e.g. victorie (ro.) and its etymon victoria (lat.)
       • Cognate pairs, e.g. victorie (ro.) and vittoria (it.), which share the etymon victoria (lat.)

  6. Orthographic similarity
     • Some pairs of related words are closer than others.
     • Word-etymon pairs: lună (ro.), luna (lat.) vs. bătrân (ro.), veteranus (lat.)
     • Cognate pairs: vânt (ro.), vent (fr.) vs. castel (ro.), château (fr.)

  7. Algorithm and methodology
     Input: corpus C in L1
     1. Text processing
        1.1. Remove stop words
        1.2. Lemmatize
     2. Language relationships identification
        2.1. Detect etymologies
        2.2. Identify cognates
        2.3. Cluster by language families
     3. Language similarity computation
        3.1. Measure word distances
        3.2. Compute degrees of similarity
     Output: similarity hierarchy for L1

  8. Similarity method
     Definition. Given a string distance Δ, we define the distance between languages L1 and L2 (with frequency support from corpus C in L1) as follows:
     $$\Delta(L_1, L_2) = 1 - \frac{N_{lingua}}{N_{words}} + \frac{\sum_{i=1}^{N_{lingua}} \Delta(w_i, x_i)}{N_{words}} \qquad (1)$$
     Definition. The similarity between L1 and L2 is:
     $$Sim(L_1, L_2) = 1 - \Delta(L_1, L_2) \qquad (2)$$
     where |C| = N_words and |Lingua| = N_lingua.
     [Figure: words w of C(L1) paired with words x of Lingua(L2), either through etymology or as cognates (N_lingua pairs); the remaining N_words − N_lingua entries are paired with λ.]
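Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `related_pairs` argument and the `toy_distance` stand-in for Δ are hypothetical, and whether pairs are counted per token or per type is left to the caller. Note that Eq. (1) equals (N_words − N_lingua + ΣΔ) / N_words, so every word without a counterpart in L2 contributes the maximum distance 1.

```python
from itertools import zip_longest

def language_distance(related_pairs, n_words, word_distance):
    """Eq. (1): 1 - N_lingua / N_words + sum(word_distance(w, x)) / N_words."""
    n_lingua = len(related_pairs)
    total = sum(word_distance(w, x) for w, x in related_pairs)
    return 1 - n_lingua / n_words + total / n_words

def language_similarity(related_pairs, n_words, word_distance):
    """Eq. (2): Sim = 1 - distance."""
    return 1 - language_distance(related_pairs, n_words, word_distance)

def toy_distance(w, x):
    """Hypothetical stand-in for the string distance: share of differing positions."""
    mismatches = sum(a != b for a, b in zip_longest(w, x))
    return mismatches / max(len(w), len(x))

# Toy corpus of 4 words, 2 of which have a related word in L2.
pairs = [("victorie", "victoria"), ("luna", "luna")]
print(language_distance(pairs, 4, toy_distance))    # 0.53125
print(language_similarity(pairs, 4, toy_distance))  # 0.46875
```

Passing the word distance as a parameter keeps the language-level formula independent of the orthographic metric that is plugged in.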

  9. Etymology detection
     • We extract etymologies from electronic dictionaries.
     Pattern: <abbr class="abbrev" title="limba language name">language abbreviation</abbr> <b>etymon</b>
     Entry: <b>capitol</b> <abbr class="abbrev" title="limba italiana">it.</abbr> <b>capitolo</b> <abbr class="abbrev" title="limba latina">lat.</abbr> <b>capitulum</b>
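The pattern on this slide maps naturally onto a regular expression. The sketch below is an illustration only: the regex and the function name `extract_etymologies` are assumptions, and it presumes that dictionary entries use exactly the markup of the slide's example, which real dictionary pages may not.

```python
import re

# Assumed markup, copied from the slide's example entry:
# <abbr class="abbrev" title="limba NAME">ABBREVIATION</abbr> <b>ETYMON</b>
ETYMON_RE = re.compile(
    r'<abbr class="abbrev" title="limba ([^"]+)">\s*([^<]+?)\s*</abbr>\s*'
    r'<b>\s*([^<]+?)\s*</b>'
)

def extract_etymologies(entry_html):
    """Return (language name, abbreviation, etymon) triples found in an entry."""
    return ETYMON_RE.findall(entry_html)

entry = (
    '<b>capitol</b> '
    '<abbr class="abbrev" title="limba italiana">it.</abbr> <b>capitolo</b> '
    '<abbr class="abbrev" title="limba latina">lat.</abbr> <b>capitulum</b>'
)
print(extract_etymologies(entry))
# [('italiana', 'it.', 'capitolo'), ('latina', 'lat.', 'capitulum')]
```

The headword `<b>capitol</b>` is skipped because it is not preceded by an `<abbr>` language tag.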

  13. Cognate identification
      Input: word w in L1.
      1. Determine the etymologies and etymons of w using L1 dictionaries.
      2. If w has an L2 etymology, with etymon e: output the word-etymon pair (w, e).
      3. Otherwise, translate w into L2 with Google Translate, obtaining t.
      4. Determine the etymologies and etymons of t using L2 dictionaries.
      5. If w and t have a common etymology and ancestor: output the cognate pair (w, t); otherwise output Ø.
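The decision flow of this slide can be written as a single function. This is a sketch under assumptions: `find_related`, the two etymology dictionaries and the `translate` callable are hypothetical stand-ins for the dictionaries and the translation service, and "common etymology and ancestor" is read here as sharing the same (language, etymon) pair.

```python
def find_related(word, l2, etym_l1, etym_l2, translate):
    """Return ('etymon', w, e), ('cognate', w, t), or None.

    etym_l1 / etym_l2 map a word to a set of (language, etymon) pairs;
    translate maps an L1 word to its L2 translation, or None.
    All three are hypothetical stand-ins, not real services.
    """
    origins = etym_l1.get(word, set())
    # Step 1: does the word have an L2 etymology?
    for lang, etymon in sorted(origins):
        if lang == l2:
            return ("etymon", word, etymon)
    # Step 2: otherwise translate it and compare etymologies.
    t = translate(word)
    if t is None:
        return None
    if origins & etym_l2.get(t, set()):
        return ("cognate", word, t)
    return None

# Toy data based on the example from slide 5.
etym_ro = {"victorie": {("lat", "victoria")}}
etym_it = {"vittoria": {("lat", "victoria")}}
ro_to_it = {"victorie": "vittoria"}.get

print(find_related("victorie", "it", etym_ro, etym_it, ro_to_it))
# ('cognate', 'victorie', 'vittoria')
print(find_related("victorie", "lat", etym_ro, {}, lambda w: None))
# ('etymon', 'victorie', 'victoria')
```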

  14. Orthographic metrics
      • We use string similarity metrics to compute the orthographic similarity between related words.
      • Many metrics have been used so far, but none is known to be the most appropriate for a given task.
      • We therefore use three orthographic metrics and compare their results.

  15. Orthographic metrics
      The edit distance:
      $$\Delta(w_i, w_j) = \frac{LD(w_i, w_j)}{\max(|w_i|, |w_j|)} \qquad (3)$$
      where LD(w_i, w_j) is the number of operations required to transform w_i into w_j.
      The longest common subsequence ratio:
      $$\Delta(w_i, w_j) = \frac{LCS(w_i, w_j)}{\max(|w_i|, |w_j|)} \qquad (4)$$
      where LCS(w_i, w_j) is the length of the longest common subsequence of w_i and w_j. As written, this ratio grows with similarity (it is 1 for identical words), so its complement, 1 minus the ratio, is what behaves as a distance.
      The rank distance: given two rankings L_1 = (x_1, x_2, ..., x_n) and L_2 = (y_1, y_2, ..., y_n), with alphabets V(L_1) and V(L_2), the rank distance is defined as follows (here L_1 and L_2 denote rankings, not languages):
      $$\Delta(L_1, L_2) = \sum_{x \in V(L_1) \cap V(L_2)} |ord(x|L_1) - ord(x|L_2)| + \sum_{x \in V(L_1) \setminus V(L_2)} ord(x|L_1) + \sum_{x \in V(L_2) \setminus V(L_1)} ord(x|L_2) \qquad (5)$$
      where ord(x|L) is the rank of x in ranking L, in a Borda sense. To extend the distance to words, we index each character with a number equal to the number of its previous occurrences in the given word. For normalization, we divide the rank distance by the maximum possible value between w_i and w_j: |w_i|(|w_i| + 1)/2 + |w_j|(|w_j| + 1)/2.
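The three metrics can be tried out with the following sketch, which assumes non-empty words. The function names are hypothetical, and one convention is an assumption rather than something the slide states: the Borda rank of the character at 1-based position p in a word of length n is taken to be n − p + 1, so earlier characters rank higher.

```python
from collections import Counter

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def borda_ranks(word):
    """Map (char, number of previous occurrences) to an assumed Borda rank n - p + 1."""
    seen = Counter()
    ranks = {}
    for pos, ch in enumerate(word):
        ranks[(ch, seen[ch])] = len(word) - pos
        seen[ch] += 1
    return ranks

def rank_distance(a, b):
    """Eq. (5): a missing element counts as rank 0, which yields the two extra sums."""
    ra, rb = borda_ranks(a), borda_ranks(b)
    return sum(abs(ra.get(k, 0) - rb.get(k, 0)) for k in ra.keys() | rb.keys())

def normalized_edit(a, b):
    return levenshtein(a, b) / max(len(a), len(b))      # Eq. (3)

def lcs_ratio(a, b):
    return lcs_length(a, b) / max(len(a), len(b))       # Eq. (4)

def normalized_rank(a, b):
    max_value = len(a) * (len(a) + 1) / 2 + len(b) * (len(b) + 1) / 2
    return rank_distance(a, b) / max_value

print(normalized_edit("victorie", "vittoria"))  # 0.25
print(lcs_ratio("victorie", "vittoria"))        # 0.75
print(rank_distance("victorie", "vittoria"))    # 14
```

For the cognate pair from slide 5, the edit distance sees two substitutions out of eight characters, while the LCS ratio keeps the six shared characters "vitori".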

  16. Application: Romanian
      • Romanian is a Romance language, surrounded by Slavic languages.
      • Its communication with the Romance kernel was difficult.
      • Its position in the Romance family is controversial, either isolated or more integrated within the group [McMahon and McMahon, 2003].

  17. Datasets
      • 17th and 18th century: Romanian chronicles. (Chronicles)
      • 19th century: the publishing works of the Romanian poet Mihai Eminescu. (Eminescu)
      • 21st century: the parliamentary debates held in the Romanian Parliament. (Parliament)
      • The basic Romanian lexicon. (RVR)

      Dataset      #words (token)  #words (type)  #stop words (token)  #stop words (type)  #lemmas (type)
      Parliament       22,469,290        162,399           14,451,178                 214          40,065
      Eminescu            870,828         65,742              565,396                 212          21,456
      Chronicles          253,786         28,936              170,582                 193           8,189
      RVR                   2,464          2,464                  124                 124           2,252
