An Etymological Approach to Cross-Language Orthographic Similarity

An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian
Alina Maria Ciobanu, Liviu P. Dinu
University of Bucharest, Center for Computational Linguistics
http://nlp.unibuc.ro
EMNLP 2014

Overview
• Orthographic similarity: motivation and approach
• Identifying language relationships
• Computing degrees of similarity
• Results on 3 Romanian corpora from different historical periods
• Results on Europarl (Romanian subcorpus)
• Conclusions and future work

Alina Maria Ciobanu, Liviu P. Dinu | An Etymological Approach to Cross-Language Orthographic Similarity | 2

Language similarity
• The similarity of natural languages is a fairly vague notion: both linguists and non-linguists have intuitions about which languages are more similar to which others [McMahon and McMahon, 2003].
• Four types of similarity: typological, morphological, syntactic, lexical [Homola and Kubon, 2006].
• It is necessary to develop quantitative and computational methods in this field [McMahon and McMahon, 2003].

Applications
• Linguistic phylogeny reconstruction [Alekseyenko et al, 2012; Barbançon et al, 2013].
• Machine translation [Koppel and Ordan, 2011].
• Language acquisition [Benati and VanPatten, 2011].
• Language intelligibility assessment [Gooskens et al, 2008].

Our approach
• A language L1 is closer to a language L2 when texts written in L2 are more easily understood by speakers of L1 without prior knowledge of L2.
• When people read a text in a foreign language, they first identify the words which resemble words from their native language.
• Two types of related words:
  • Word-etymon pairs
  • Cognate pairs

[Figure: victoria (lat.) is the etymon of both victorie (ro.) and vittoria (it.); victorie (ro.) and vittoria (it.) are cognates.]

Orthographic similarity
• Some pairs of related words are closer than others.
• Word-etymon pairs: lună (ro.), luna (lat.) vs. bătrân (ro.), veteranus (lat.)
• Cognate pairs: vânt (ro.), vent (fr.) vs. castel (ro.), château (fr.)

Algorithm and methodology
Input: corpus C in L1
1. Text processing
   1.1. Remove stop words
   1.2. Lemmatize
2. Language relationships identification
   2.1. Detect etymologies
   2.2. Identify cognates
   2.3. Cluster by language families
3. Language similarity computation
   3.1. Measure word distances
   3.2. Compute degrees of similarity
Output: similarity hierarchy for L1

Similarity method

Definition. Given a string distance ∆, we define the distance between languages L1 and L2 (with frequency support from corpus C in L1) as follows:

$$\Delta(L_1, L_2) = 1 - \frac{N_{lingua}}{N_{words}} + \frac{\sum_{i=1}^{N_{lingua}} \Delta(w_i, x_i)}{N_{words}} \qquad (1)$$

Definition. The similarity between L1 and L2 is:

$$Sim(L_1, L_2) = 1 - \Delta(L_1, L_2) \qquad (2)$$

Here |C| = N_words and |Lingua| = N_lingua.

[Figure: words w of the corpus C(L1) are linked to words x of Lingua(L2) through etymology or cognate relationships; the remaining N_words - N_lingua corpus words have no counterpart (λ).]
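As a rough illustration of Equations (1) and (2), the sketch below computes the language distance from a list of per-pair string distances. The function names and the input layout (one distance value per corpus word that has a related word in L2) are assumptions made for this example, not part of the original slides.

```python
def language_distance(pair_distances, n_words):
    """Equation (1): distance between L1 and L2.

    pair_distances: hypothetical input, one string distance in [0, 1] for each
        corpus word that has an etymon or cognate in L2 (so its length is N_lingua).
    n_words: total number of words considered in the corpus (N_words).
    """
    n_lingua = len(pair_distances)
    if n_words <= 0 or n_lingua > n_words:
        raise ValueError("need 0 <= N_lingua <= N_words and N_words > 0")
    # Unrelated words contribute a distance of 1 each; related words
    # contribute their string distance to the paired word.
    return 1 - n_lingua / n_words + sum(pair_distances) / n_words


def language_similarity(pair_distances, n_words):
    """Equation (2): similarity is one minus the distance."""
    return 1 - language_distance(pair_distances, n_words)


# Toy numbers only: 10 corpus words, 3 of which are related to L2.
example_distance = language_distance([0.0, 0.25, 0.5], 10)
example_similarity = language_similarity([0.0, 0.25, 0.5], 10)
```

With no related words the distance is 1, and it decreases as more corpus words have close counterparts in L2.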

Etymology detection
• We extract etymologies from electronic dictionaries.

Pattern: <abbr class="abbrev" title="limba language name">language abbreviation</abbr> <b>etymon</b>

Entry: <b>capitol</b> <abbr class="abbrev" title="limba italiana">it.</abbr> <b>capitolo</b> <abbr class="abbrev" title="limba latina">lat.</abbr> <b>capitulum</b>
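The pattern above can be matched with a regular expression. The following is a minimal sketch, assuming the dictionary markup looks exactly like the sample entry for "capitol"; the regular expression and the function name `extract_etymologies` are illustrative and not taken from the authors' implementation.

```python
import re

# Assumed markup: <abbr class="abbrev" title="limba NAME">ABBR</abbr> <b>ETYMON</b>
ETYMON_RE = re.compile(
    r'<abbr class="abbrev" title="limba (?P<language>[^"]+)">\s*(?P<abbr>[^<]+?)\s*</abbr>'
    r'\s*<b>\s*(?P<etymon>[^<]+?)\s*</b>'
)


def extract_etymologies(entry_html):
    """Return (language name, abbreviation, etymon) triples found in one entry."""
    return [
        (m.group("language"), m.group("abbr"), m.group("etymon"))
        for m in ETYMON_RE.finditer(entry_html)
    ]


# The sample entry shown on the slide.
entry = (
    '<b>capitol</b> '
    '<abbr class="abbrev" title="limba italiana">it.</abbr> <b>capitolo</b> '
    '<abbr class="abbrev" title="limba latina">lat.</abbr> <b>capitulum</b>'
)
etymologies = extract_etymologies(entry)
```

The headword itself (`<b>capitol</b>`) is skipped because it is not preceded by a language abbreviation. Real dictionary pages may vary in attributes and whitespace, in which case an HTML parser would be more robust than a regular expression.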

Cognate identification
[Flowchart, reconstructed as steps]
1. Input: word w in L1.
2. Determine etymologies and etymons for w (L1 dictionaries).
3. Does w have an L2 etymology?
   • YES: output the word-etymon pair (w, e), where e is the etymon of w.
   • NO: translate w in L2 => t (Google Translate).
4. Determine etymologies and etymons for t (L2 dictionaries).
5. Do w and t have a common etymology and ancestor?
   • YES: output the cognate pair (w, t).
   • NO: Ø.
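The decision flow can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary lookups and the translation step are passed in as plain functions, the toy data reuse the victoria / victorie / vittoria example from the "Our approach" slide, and "common etymology and ancestor" is interpreted as sharing at least one (language, etymon) pair.

```python
def find_related(word, l2, etymologies_l1, translate, etymologies_l2):
    """Return ("etymon", w, e), ("cognate", w, t), or None.

    etymologies_l1 / etymologies_l2: hypothetical lookups, word -> list of
        (language abbreviation, etymon) pairs.
    translate: hypothetical lookup, (word, target language) -> translation or None.
    """
    ety_w = etymologies_l1(word)
    # Step 3: w has an etymon in L2, so (w, e) is a word-etymon pair.
    for language, etymon in ety_w:
        if language == l2:
            return ("etymon", word, etymon)
    # Otherwise translate w into L2 and compare etymologies.
    t = translate(word, l2)
    if t is None:
        return None
    ety_t = etymologies_l2(t)
    # Step 5: a shared (language, etymon) pair marks w and t as cognates.
    if set(ety_w) & set(ety_t):
        return ("cognate", word, t)
    return None


# Toy stand-ins for the dictionaries and the translation service.
RO_ETYMOLOGIES = {"victorie": [("lat.", "victoria")]}
IT_ETYMOLOGIES = {"vittoria": [("lat.", "victoria")]}
TRANSLATIONS = {("victorie", "it."): "vittoria"}


def lookup_ro(word):
    return RO_ETYMOLOGIES.get(word, [])


def lookup_it(word):
    return IT_ETYMOLOGIES.get(word, [])


def toy_translate(word, target):
    return TRANSLATIONS.get((word, target))
```

Passing the lookups as arguments keeps the sketch independent of any particular dictionary or translation service.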

Orthographic metrics
• We use string similarity metrics to compute the orthographic similarity between related words.
• Many methods have been used so far, but we cannot say which is the most appropriate for a given task.
• We use three orthographic metrics and compare their results.

Orthographic metrics

The edit distance

$$\Delta(w_i, w_j) = \frac{LD(w_i, w_j)}{\max(|w_i|, |w_j|)} \qquad (3)$$

where LD(w_i, w_j) is the number of operations required to transform w_i into w_j.

The longest common subsequence ratio

$$\Delta(w_i, w_j) = \frac{LCS(w_i, w_j)}{\max(|w_i|, |w_j|)} \qquad (4)$$

where LCS(w_i, w_j) is the length of the longest common subsequence of w_i and w_j.

The rank distance

Given two rankings L1 = (x_1, x_2, ..., x_n) and L2 = (y_1, y_2, ..., y_n), and V(L1), V(L2) their alphabets, the rank distance is defined as follows:

$$\Delta(L_1, L_2) = \sum_{x \in V(L_1) \cap V(L_2)} |ord(x|L_1) - ord(x|L_2)| + \sum_{x \in V(L_1) \setminus V(L_2)} ord(x|L_1) + \sum_{x \in V(L_2) \setminus V(L_1)} ord(x|L_2) \qquad (5)$$

where ord(x|L) is the rank of x in ranking L, in a Borda sense. To extend the distance to words, we index each character with a number equal to the number of its previous occurrences in the given word. For normalization, we divide the rank distance by the maximum possible value between w_i and w_j: |w_i|(|w_i| + 1)/2 + |w_j|(|w_j| + 1)/2.
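The three metrics can be sketched as follows. Several choices here are assumptions rather than details given on the slide: LD is taken to be the standard Levenshtein distance with unit costs, Equation (4) is returned as written (a ratio that grows with similarity, so using 1 minus the ratio as a distance would be a further assumption), and the Borda rank gives the first character of a word of length n the score n and the last character the score 1.

```python
def levenshtein(a, b):
    """Number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance_norm(a, b):
    """Equation (3); two empty strings are treated as identical."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def lcs_ratio(a, b):
    """Equation (4) as written on the slide."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def borda_ranks(word):
    """Map each indexed character (char, previous occurrences) to its Borda rank."""
    seen, ranks = {}, {}
    for position, ch in enumerate(word):
        k = seen.get(ch, 0)
        seen[ch] = k + 1
        ranks[(ch, k)] = len(word) - position  # assumed convention: first char gets n
    return ranks


def rank_distance(a, b):
    """Equation (5); a symbol missing from one word counts with rank 0 there."""
    ra, rb = borda_ranks(a), borda_ranks(b)
    return sum(abs(ra.get(x, 0) - rb.get(x, 0)) for x in set(ra) | set(rb))


def rank_distance_norm(a, b):
    """Rank distance divided by its maximum possible value for these lengths."""
    maximum = len(a) * (len(a) + 1) / 2 + len(b) * (len(b) + 1) / 2
    return rank_distance(a, b) / maximum if maximum else 0.0
```

For example, `edit_distance_norm("lună", "luna")` compares one of the word-etymon pairs from the earlier slide: a single substitution over four characters.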

Application: Romanian
• Romanian is a Romance language, surrounded by Slavic languages.
• Its communication with the Romance kernel was difficult.
• Its position in the Romance family is controversial, either isolated or more integrated within the group [McMahon and McMahon, 2003].

Datasets
• 17th and 18th century: Romanian chronicles. (Chronicles)
• 19th century: the publishing works of the Romanian poet Mihai Eminescu. (Eminescu)
• 21st century: the parliamentary debates held in the Romanian Parliament. (Parliament)
• The basic Romanian lexicon. (RVR)

Dataset      #words (token)   #words (type)   #stop words (token)   #stop words (type)   #lemmas (type)
Parliament   22,469,290       162,399         14,451,178            214                  40,065
Eminescu     870,828          65,742          565,396               212                  21,456
Chronicles   253,786          28,936          170,582               193                  8,189
RVR          2,464            2,464           124                   124                  2,252
