LexStat: Automatic Detection of Cognates in Multilingual Wordlists

Outline: Keys to the Past, Identification of Cognates, LexStat, Evaluation

The Comparative Method: Basic Procedure

1. Compile an initial list of putative cognate sets.
2. Extract an initial list of putative sets of sound correspondences from the initial cognate list.
3. Refine the cognate list and the correspondence list by
   a. adding and deleting cognate sets from the cognate list, depending on whether they are consistent with the correspondence list or not, and
   b. adding and deleting correspondence sets from the correspondence list, depending on whether they are consistent with the cognate list or not.
4. Finish when the results are satisfying enough.
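The iterative character of this procedure can be illustrated with a small Python sketch. It is only a toy model under strong assumptions: two languages, words that are already aligned position by position, and two hypothetical parameters (`min_count`, `min_share`) that stand in for the linguist's judgment of "consistency". The word forms are simplified, and the last entry is invented as a non-cognate control.

```python
from collections import Counter

def correspondences(pairs, cognates, min_count=2):
    """Step 2: keep segment pairs that recur in at least `min_count` cognate sets."""
    counts = Counter()
    for concept in cognates:
        word_a, word_b = pairs[concept]
        counts.update(set(zip(word_a, word_b)))  # count each pair once per concept
    return {pair for pair, n in counts.items() if n >= min_count}

def consistent(word_pair, corrs, min_share=0.3):
    """A word pair is kept if enough of its segment pairs are known correspondences."""
    word_a, word_b = word_pair
    hits = sum((x, y) in corrs for x, y in zip(word_a, word_b))
    return hits / len(word_a) >= min_share

def comparative_method(pairs, max_rounds=10):
    cognates = set(pairs)                      # step 1: every pair is a putative cognate
    corrs = correspondences(pairs, cognates)   # step 2: initial correspondence list
    for _ in range(max_rounds):                # step 3: refine both lists in turn
        new_cognates = {c for c in pairs if consistent(pairs[c], corrs)}
        new_corrs = correspondences(pairs, new_cognates)
        if new_cognates == cognates and new_corrs == corrs:
            break                              # step 4: nothing changes any more
        cognates, corrs = new_cognates, new_corrs
    return cognates, corrs

# Toy data (simplified forms; "control" is invented for illustration).
PAIRS = {
    "tooth":   (["ʦ", "aː", "n"], ["t", "ɑ", "n"]),
    "ten":     (["ʦ", "eː", "n"], ["t", "iː", "n"]),
    "tongue":  (["ʦ", "ʊ", "ŋ"], ["t", "ɔ", "ŋ"]),
    "control": (["k", "a", "p"], ["m", "i", "l"]),
}

cognates, corrs = comparative_method(PAIRS)
print(sorted(cognates), sorted(corrs))
```

In this toy run the control pair is dropped after the first refinement round, because none of its segment pairs recurs elsewhere in the list.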

The Comparative Method: Language-Specific Similarity Measure

Sequence similarity is determined on the basis of systematic sound correspondences, as opposed to similarity based on surface resemblances of phonetic segments.

Lass (1997) calls this notion of similarity genotypic, as opposed to a phenotypic notion of similarity.

The most crucial aspect of correspondence-based similarity is that it is language-specific: genotypic similarity is never defined in general terms but always with respect to the language systems which are being compared.

Meaning     German           Dutch           English
“tooth”     Zahn [ʦaːn]      tand [tɑnt]     tooth [tʊːθ]
“ten”       zehn [ʦeːn]      tien [tiːn]     ten [tɛn]
“tongue”    Zunge [ʦʊŋə]     tong [tɔŋ]      tongue [tʌŋ]

Meaning     Shanghai        Beijing         Guangzhou
“nine”      [ʨiɤ³⁵]         [ʨiou²¹⁴]       [kɐu³⁵]
“today”     [ʨiŋ⁵⁵ʦɔ²¹]     [ʨiɚ⁵⁵]         [kɐm⁵³jɐt²]
“rooster”   [koŋ⁵⁵ʨi²¹]     [kuŋ⁵⁵ʨi⁵⁵]     [kɐi⁵⁵koŋ⁵⁵]
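To make the language-specific view concrete, the following Python sketch counts which initial segments correspond in the Germanic examples. The hand tokenization of the forms and the restriction to word-initial position are assumptions of this sketch, not part of the original slides.

```python
from collections import Counter

# Forms from the Germanic examples, tokenized by hand (an assumption of this sketch).
WORDS = {
    "tooth": {
        "German": ["ʦ", "aː", "n"],
        "Dutch": ["t", "ɑ", "n", "t"],
        "English": ["t", "ʊː", "θ"],
    },
    "ten": {
        "German": ["ʦ", "eː", "n"],
        "Dutch": ["t", "iː", "n"],
        "English": ["t", "ɛ", "n"],
    },
    "tongue": {
        "German": ["ʦ", "ʊ", "ŋ", "ə"],
        "Dutch": ["t", "ɔ", "ŋ"],
        "English": ["t", "ʌ", "ŋ"],
    },
}

def initial_correspondences(words, lang_a, lang_b):
    """Count how often each pair of word-initial segments co-occurs."""
    return Counter(
        (forms[lang_a][0], forms[lang_b][0]) for forms in words.values()
    )

german_dutch = initial_correspondences(WORDS, "German", "Dutch")
dutch_english = initial_correspondences(WORDS, "Dutch", "English")
print(german_dutch)
print(dutch_english)
```

German [ʦ] and Dutch [t] differ on the surface, yet they recur in every item, which is exactly the kind of regularity a language-specific measure should reward for this particular language pair.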

Automatic Approaches: Alignment Analyses

In alignment analyses, sequences are arranged in a matrix in such a way that corresponding elements occur in the same column, while empty cells resulting from non-corresponding elements are filled with gap symbols.

t   ɔ   x   t   ə   r
d   ɔː  -   t   ə   r
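A minimal sketch of how such a matrix can be computed is the classic Needleman-Wunsch algorithm for global pairwise alignment. The scoring values and the helper `same_quality`, which ignores vowel length when comparing two segments, are assumptions chosen for this illustration and are not taken from the slides.

```python
def same_quality(a, b):
    """Assumed scorer: segments match if they are identical apart from length."""
    return 1 if a.rstrip("ː") == b.rstrip("ː") else -1

def align(seq_a, seq_b, score=same_quality, gap=-1):
    """Global pairwise alignment (Needleman-Wunsch) of two token lists."""
    n, m = len(seq_a), len(seq_b)
    # Fill the dynamic-programming matrix.
    matrix = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        matrix[i][0] = i * gap
    for j in range(1, m + 1):
        matrix[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                matrix[i - 1][j] + gap,
                matrix[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                matrix[i][j] == matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and matrix[i][j] == matrix[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1]

row_a, row_b = align(["t", "ɔ", "x", "t", "ə", "r"], ["d", "ɔː", "t", "ə", "r"])
print(" ".join(row_a))
print(" ".join(row_b))
```

With plain identity scoring, several alignments of these two words would tie, which is one reason why the choice of scoring function matters so much for phonetic alignment.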

Automatic Approaches: Sound Classes

Sounds which often occur in correspondence relations in genetically related languages can be clustered into classes (types). It is assumed “that phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986: 35).

Figure (recoverable content only): the sounds g, p, k, b, ʧ, ʤ, v, f, ʒ, t, ʃ, d, s, z, θ, ð are grouped into the classes K, P, T, and S.
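The idea can be sketched in Python as a simple lookup table. The slide text does not preserve which sound belongs to which class, so the assignment below (by rough place and manner of articulation) is an assumption made for illustration only. It should not be read as Dolgopolsky's table or as the model used in LingPy. The placeholder class "V" for all other segments is likewise hypothetical.

```python
# Assumed, illustrative class assignment; not taken from Dolgopolsky (1986).
CLASS_MEMBERS = {
    "K": ["k", "g"],
    "P": ["p", "b", "f", "v"],
    "T": ["t", "d", "θ", "ð"],
    "S": ["s", "z", "ʃ", "ʒ", "ʧ", "ʤ"],
}
SOUND_TO_CLASS = {
    sound: label for label, sounds in CLASS_MEMBERS.items() for sound in sounds
}

def to_classes(tokens, other="V"):
    """Convert a tokenized word to a sound-class string.

    Segments outside the table (here: vowels) get the placeholder `other`.
    """
    return "".join(SOUND_TO_CLASS.get(token, other) for token in tokens)

english_thick = to_classes(["θ", "ɪ", "k"])
german_dick = to_classes(["d", "ɪ", "k"])
print(english_thick, german_dick)
```

English "thick" and German "dick" differ in their initial consonant, but both reduce to the same class string, so a class-based comparison treats them as matching.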

Automatic Approaches: Sound-Class-Based Alignment (SCA)

Sound classes and alignment analyses can be easily combined by representing phonetic sequences internally as sound classes and comparing the sound classes with traditional alignment algorithms.

Step                       First sequence         Second sequence
INPUT                      tɔxtər                 dɔːtər
TOKENIZATION               t, ɔ, x, t, ə, r       d, ɔː, t, ə, r
CONVERSION (to classes)    t ɔ x … → T O G …      d ɔː t … → T O T …
ALIGNMENT                  T O G T E R            T O - T E R
CONVERSION (to sounds)     T O G … → t ɔ x …      T O - … → d ɔː - …
OUTPUT                     t ɔ x t ə r            d ɔː - t ə r
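The four steps of this workflow can be sketched in a few lines of Python. The class table contains only the assignments visible in the example and is not a complete model. For the alignment step, this sketch uses `difflib.SequenceMatcher` from the standard library as a stand-in for a proper alignment algorithm such as Needleman-Wunsch, so it illustrates the data flow rather than the actual SCA implementation.

```python
from difflib import SequenceMatcher

# Class assignments read off the example above; not a complete sound-class model.
SOUND_CLASSES = {"t": "T", "d": "T", "ɔ": "O", "ɔː": "O", "x": "G", "ə": "E", "r": "R"}

def tokenize(word):
    """Split an IPA string into segments, attaching the length mark to its vowel."""
    tokens = []
    for char in word:
        if char == "ː" and tokens:
            tokens[-1] += char
        else:
            tokens.append(char)
    return tokens

def sca_align(word_a, word_b):
    """Tokenize, convert to classes, align the class strings, convert back."""
    tok_a, tok_b = tokenize(word_a), tokenize(word_b)
    cls_a = [SOUND_CLASSES[t] for t in tok_a]
    cls_b = [SOUND_CLASSES[t] for t in tok_b]
    out_a, out_b = [], []
    matcher = SequenceMatcher(None, cls_a, cls_b, autojunk=False)
    # Each opcode covers a span of both sequences; pad the shorter span with gaps.
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        span_a, span_b = tok_a[i1:i2], tok_b[j1:j2]
        width = max(len(span_a), len(span_b))
        out_a += span_a + ["-"] * (width - len(span_a))
        out_b += span_b + ["-"] * (width - len(span_b))
    return out_a, out_b

aligned_a, aligned_b = sca_align("tɔxtər", "dɔːtər")
print(" ".join(aligned_a))
print(" ".join(aligned_b))
```

Because the comparison runs on the class strings, [t] and [d] as well as [ɔ] and [ɔː] count as identical, and only the segment [x] is left without a partner.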

Traditional vs. Automatic Approaches: Similarity

Almost all current automatic approaches are based on a language-independent similarity measure, while the comparative method applies a language-specific one. Such approaches will therefore yield the same scores for phenotypically identical sequences, regardless of the language systems they belong to.

LexStat

LexStat: Working Procedure

   Sequence Input: sequences are read from specifically formatted input files.
1. Sequence Conversion: sequences are converted to sound classes and prosodic profiles.
2. Scoring-Scheme Creation: using a permutation method, language-specific scoring schemes are determined.
3. Distance Calculation: based on the language-specific scoring scheme, pairwise distances between sequences are calculated.
4. Sequence Clustering: sequences are clustered into cognate sets whose average distance is below a certain threshold.
   Sequence Output: information regarding sequence clustering is written to file using a specific format.
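The clustering step can be illustrated with a small flat average-linkage procedure in Python: clusters are merged as long as their average pairwise distance stays under the threshold. This is a sketch of the general technique, not LexStat's own code. The distance values and the threshold of 0.6 are invented for the example.

```python
from itertools import combinations

def flat_cluster(items, dist, threshold):
    """Merge clusters by average linkage until no pair is closer than `threshold`."""
    clusters = [[item] for item in items]

    def average(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (i, j), best = min(
            (((i, j), average(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda entry: entry[1],
        )
        if best >= threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Hypothetical distances between the five words for "know" (values are invented).
WORDS = ["kɛnən", "nəʊ", "çɛna", "vɪsən", "veːta"]
DIST = {
    frozenset(("kɛnən", "nəʊ")): 0.40, frozenset(("kɛnən", "çɛna")): 0.30,
    frozenset(("nəʊ", "çɛna")): 0.45, frozenset(("vɪsən", "veːta")): 0.20,
    frozenset(("kɛnən", "vɪsən")): 0.80, frozenset(("kɛnən", "veːta")): 0.90,
    frozenset(("nəʊ", "vɪsən")): 0.85, frozenset(("nəʊ", "veːta")): 0.90,
    frozenset(("çɛna", "vɪsən")): 0.80, frozenset(("çɛna", "veːta")): 0.85,
}

clusters = flat_cluster(WORDS, DIST, threshold=0.6)
print(clusters)
```

Lowering the threshold makes the procedure more conservative and yields more, smaller cognate sets. Raising it has the opposite effect.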

LexStat: Implementation

LexStat is implemented as part of the LingPy Python library (see http://lingulist.de/lingpy) for automatic tasks in historical linguistics.

The current release of LingPy (lingpy-1.0) provides methods for pairwise and multiple sequence alignment (SCA), automatic cognate detection (LexStat), and plotting routines (see the online documentation for details).

LexStat can be invoked from the Python shell or inside Python scripts (examples are given in the online documentation).

LexStat: Input and Output

Input:

ID   Items    German   English   Swedish
1    hand     hant     hænd      hand
2    woman    fraʊ     wʊmən     kvina
3    know     kɛnən    nəʊ       çɛna
3    know     vɪsən    -         veːta
…    …        …        …         …

Output (each COG column holds the cognate-set ID of the entry to its left):

ID   Items    German   COG   English   COG   Swedish   COG
1    hand     hant     1     hænd      1     hand      1
2    woman    fraʊ     2     wʊmən     3     kvina     4
3    know     kɛnən    5     nəʊ       5     çɛna      5
3    know     vɪsən    6     -         0     veːta     6
…    …        …        …     …         …     …         …
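Reading cognate sets off the output is a matter of grouping entries by their COG value, as the following Python sketch shows. The in-memory row structure is a hypothetical stand-in for the real file format. Treating COG 0 as the marker for a missing entry is an assumption based on the one gap in the example.

```python
from collections import defaultdict

# The example output rows as (concept, {language: (form, cognate ID)}).
ROWS = [
    ("hand", {"German": ("hant", 1), "English": ("hænd", 1), "Swedish": ("hand", 1)}),
    ("woman", {"German": ("fraʊ", 2), "English": ("wʊmən", 3), "Swedish": ("kvina", 4)}),
    ("know", {"German": ("kɛnən", 5), "English": ("nəʊ", 5), "Swedish": ("çɛna", 5)}),
    ("know", {"German": ("vɪsən", 6), "English": ("-", 0), "Swedish": ("veːta", 6)}),
]

def cognate_sets(rows):
    """Group all entries by cognate ID, skipping missing entries (assumed ID 0)."""
    sets = defaultdict(list)
    for _concept, entries in rows:
        for language, (form, cog_id) in entries.items():
            if cog_id != 0:
                sets[cog_id].append((language, form))
    return dict(sets)

sets = cognate_sets(ROWS)
for cog_id, members in sorted(sets.items()):
    print(cog_id, members)
```

In this example the three words for "woman" end up in three singleton sets, while the words for "know" fall into two sets that cut across the two rows of the same concept.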

LexStat: Internal Representation of Sequences

Sound Classes and Prosodic Context

- All sequences are internally represented as sound classes, the default model being the one proposed in List (forthcoming).
- All sequences are also represented by prosodic strings which indicate the prosodic environment (initial, ascending, maximum, descending, final) of each phonetic segment (List 2012).
- The information regarding sound classes and prosodic context is combined, and each input sequence is further represented as a sequence of tuples, consisting of the sound class and the prosodic environment of the respective phonetic segment.
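The combined representation can be sketched in Python as follows. The sonority values, the rule for deriving the five environments from them, and the small class table are all assumptions made for this illustration. The actual models are the ones described in List (forthcoming) and List (2012).

```python
# Illustrative tables; both are assumptions, not the models used by LexStat.
SOUND_CLASSES = {"t": "T", "d": "T", "ɔ": "O", "ɔː": "O", "x": "G", "ə": "E", "r": "R"}
SONORITY = {"t": 1, "d": 1, "x": 3, "r": 5, "ɔ": 7, "ɔː": 7, "ə": 7}
VOWEL = 7  # sonority value assumed for vowels (the peaks of the profile)

def prosodic_string(tokens):
    """Label each segment by its position in the sonority profile of the word."""
    sonority = [SONORITY[token] for token in tokens]
    labels = []
    for i, value in enumerate(sonority):
        if i == 0:
            labels.append("initial")
        elif i == len(sonority) - 1:
            labels.append("final")
        elif value == VOWEL:
            labels.append("maximum")
        elif sonority[i + 1] > value:
            labels.append("ascending")   # sonority rises towards the next peak
        else:
            labels.append("descending")  # sonority falls after the last peak
    return labels

def represent(tokens):
    """Pair the sound class of every segment with its prosodic environment."""
    classes = [SOUND_CLASSES[token] for token in tokens]
    return list(zip(classes, prosodic_string(tokens)))

representation = represent(["t", "ɔ", "x", "t", "ə", "r"])
for pair in representation:
    print(pair)
```

Keeping the two kinds of information together in one tuple means that the same sound class can receive different scores depending on where in the word it occurs.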

Keys to the Past Identification of Cognates LexStat Evaluation . . . . . . . . . . . . . . . . . . . . Scoring-Scheme Creation . Attested Distribution . . . carry out global and pairwise alignment analyses of all sequence pairs occuring in the same semantic slot store all corresponding segments that occur in sequences whose distance is beyond a certain threshold . . . . . . Creation of the Expected Distribution . . . shuffle the wordlists repeatedly and carry out global and pairwise alignment analyses of all sequence pairs in the randomly shuffled wordlists store all corresponding segments average the results . . . . . . Calculation of Similarity Scores . . . Calculation of log-odds scores from the distributions. . . . . . 18 / 28

Scoring-Scheme Creation

English   German     Att.   Exp.   Score
#[t,d]    #[t,d]      3.0   1.24     6.3
#[t,d]    #[ʦ]        3.0   0.38     6.0
#[t,d]    #[ʃ,s,z]    1.0   1.99    -1.5
#[θ,ð]    #[t,d]      7.0   0.72     6.3
#[θ,ð]    #[ʦ]        0.0   0.25    -1.5
#[θ,ð]    #[s,z]      0.0   1.33     0.5
[t,d]$    [t,d]$     21.0   8.86     6.3
[t,d]$    [ʦ]$        3.0   1.62     3.9
[t,d]$    [ʃ,s]$      6.0   5.30     1.5
[θ,ð]$    [t,d]$      4.0   1.14     4.8
[θ,ð]$    [ʦ]$        0.0   0.20    -1.5
[θ,ð]$    [ʃ,s]$      0.0   0.80     0.5

Language   Initial          Final
English    town  [taʊn]     hot   [hɔt]
German     Zaun  [ʦaun]     heiß  [haɪs]
English    thorn [θɔːn]     mouth [maʊθ]
German     Dorn  [dɔrn]     Mund  [mʊnt]
English    dale  [deɪl]     head  [hɛd]
German     Tal   [taːl]     Hut   [huːt]

Sequence Clustering

                  Ger.   Eng.   Dan.   Swe.   Dut.   Nor.
Ger. [frau]       0.00   0.95   0.81   0.70   0.34   1.00
Eng. [wʊmən]      0.95   0.00   0.78   0.90   0.80   0.80
Dan. [kvenə]      0.81   0.78   0.00   0.17   0.96   0.13
Swe. [kvinːa]     0.70   0.90   0.17   0.00   0.86   0.10
Dut. [vrɑuʋ]      0.34   0.80   0.96   0.86   0.00   0.89
Nor. [kʋinə]      1.00   0.80   0.13   0.10   0.89   0.00
Clusters          1      2      3      3      1      3
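The step from the distance matrix to the cluster row can be illustrated with a flat agglomerative clustering. The slides do not name the linkage criterion or the cut-off, so the average linkage and the threshold of 0.6 used here are assumptions, as is the function name `flat_cluster`; the distances are the ones from the slide.

```python
from itertools import combinations

LANGUAGES = ["Ger.", "Eng.", "Dan.", "Swe.", "Dut.", "Nor."]
DIST = [
    [0.00, 0.95, 0.81, 0.70, 0.34, 1.00],
    [0.95, 0.00, 0.78, 0.90, 0.80, 0.80],
    [0.81, 0.78, 0.00, 0.17, 0.96, 0.13],
    [0.70, 0.90, 0.17, 0.00, 0.86, 0.10],
    [0.34, 0.80, 0.96, 0.86, 0.00, 0.89],
    [1.00, 0.80, 0.13, 0.10, 0.89, 0.00],
]


def flat_cluster(matrix, threshold):
    """Average-linkage agglomerative clustering, stopped at `threshold`."""
    clusters = [[i] for i in range(len(matrix))]

    def avg_dist(a, b):
        # mean distance over all cross-cluster pairs
        return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # find the closest pair of clusters
        (x, y), best = min(
            (((a, b), avg_dist(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best >= threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]

    # number the clusters in order of first appearance
    labels = [0] * len(matrix)
    next_id = 1
    for i in range(len(matrix)):
        if labels[i] == 0:
            for member in next(c for c in clusters if i in c):
                labels[member] = next_id
            next_id += 1
    return labels


labels = flat_cluster(DIST, threshold=0.6)
```

With these assumed settings the merges are Swe.+Nor., then Dan., then Ger.+Dut.; every remaining average distance exceeds 0.6, so English stays on its own.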

Evaluation

v o l - d e m o r t
v - l a d i m i r -
v a l - d e m a r -

Gold Standard

File   Family      Languages   Items   Entries   Source
GER    Germanic            7     110       814   Starostin (2008)
ROM    Romance             5     110       589   Starostin (2008)
SLV    Slavic              4     110       454   Starostin (2008)
PIE    Indo-Eur.          18     110      2057   Starostin (2008)
OUG    Uralic             21     110      2055   Starostin (2008)
BAI    Bai                 9     110      1028   Wang (2006)
SIN    Sinitic             9     180      1614   Hóu (2004)
KSL    varia               8     200      1600   Kessler (2001)
JAP    Japonic            10     200      1986   Shirō (1973)

Evaluation Measures

Set Comparison
  Precision, Recall, and F-Score are calculated by comparing the cognate sets proposed by the method with the cognate sets in the gold standard (see Bergsma & Kondrak 2007).

Pair Comparison
  Pair comparison is based on a pairwise comparison of all decisions present in the test set and the gold standard.
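A pairwise comparison of this kind can be sketched as below. The sketch covers only the pair-level view, not the set-based measure of Bergsma & Kondrak (2007); the function name `pair_scores`, the returned keys, and the test-set labels are hypothetical, and reading "identical pairs" as the share of matching cognate/non-cognate decisions is an assumption.

```python
from itertools import combinations


def pair_scores(gold, test):
    """Compare cognate decisions for every word pair in one semantic slot.

    `gold` and `test` are lists of cluster labels, one label per word.
    """
    tp = fp = fn = same = total = 0
    for i, j in combinations(range(len(gold)), 2):
        in_gold = gold[i] == gold[j]   # cognate according to the gold standard
        in_test = test[i] == test[j]   # cognate according to the method
        tp += in_gold and in_test
        fp += in_test and not in_gold
        fn += in_gold and not in_test
        same += in_gold == in_test
        total += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "identical": same / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


gold = [1, 2, 3, 3, 1, 3]   # cluster labels used as a gold standard
test = [1, 2, 3, 3, 4, 3]   # hypothetical output that misses one cognate pair
result = pair_scores(gold, test)
```

In this toy case the method proposes three cognate pairs, all of them correct, and misses one of the four gold pairs, so precision is 1.0 and recall is 0.75.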

Tests

  - Sound Classes – matching sound classes without alignment (based on Turchin et al. 2010)
  - Simple Alignment – normalized edit-distance (Levenshtein 1966)
  - SCA – language-independent distance scores derived from sound-class-based alignment analyses (List 2012)
  - LexStat – language-specific distance scores
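For the Simple Alignment baseline, a normalized edit distance can be sketched as follows. The slides do not say how the distance is normalized; dividing by the length of the longer sequence is an assumption here, as are the function names.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # match or substitution
            ))
        previous = current
    return previous[-1]


def ned(a, b):
    """Edit distance divided by the length of the longer sequence (assumed)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Both functions accept any sequence, so a transcription can be passed as a list of segments rather than a raw string when one sound is written with several characters (for example a long consonant).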

General Results

Score             LexStat   SCA    Simple Alm.   Sound Cl.
Identical Pairs   0.85      0.82   0.76          0.74
Precision         0.59      0.51   0.39          0.39
Recall            0.68      0.57   0.47          0.55
F-Score           0.63      0.55   0.42          0.46

[Figure: General Results. Scores per dataset (SLV, KSL, GER, BAI, SIN, PIE, ROM, JAP, OUG) for LexStat, SCA, NED, and Turchin; vertical axis from 0.6 to 1.0.]

Specific Results

  - Pairwise decisions were extracted from the KSL dataset and compared with the Gold Standard.
  - 72 borrowings were explicitly marked along with their source by Kessler (2001).
  - 83 chance resemblances were determined automatically by taking non-cognate word pairs with an NED score less than 0.6.

                      LexStat   SCA   Simple Alm.   Sound Cl.
Borrowings            50%       61%   49%           53%
Chance Resemblances   17%       42%   89%           31%

LexStat: Automatic Detection of Cognates in Multilingual Wordlists
Johann-Mattis List, Institute for Romance
