Introduction to Dialectometry II
Wilbert Heeringa
German Academic Exchange Service – DAAD
University of Bielefeld, Faculty of Linguistics and Literary Studies
Frisian Academy
Abidjan, December 19–23, 2016
Topics
• Validation of distance measures
• Consistency of distance measures
• Quality of classifications
• Cluster algorithms
• Fuzzy clustering
• Cophenetic multidimensional scaling maps
• Reference point maps
Validation of distance measures
Experiment
• In Norway everybody speaks dialect; there is no standard language.
• In the period 1999–2002 Jørn Almberg and Kristian Skarbø recorded about 50 Norwegian dialects.
• The fable ‘The North Wind and the Sun’ was taken as a basis.
• This text was also used in IPA handbooks published in 1949 and 1999.
• Speakers were asked to translate the text and to read it aloud.
• Audio files and transcriptions are available at: http://www.ling.hf.ntnu.no/nos/
Experiment
• Perception experiment carried out in the spring of 2000 by Charlotte Gooskens.
• 15 recordings of 15 dialects were used.
• In each of the 15 locations, a group of 16 to 27 high school pupils listened to all 15 texts.
• The texts were presented in a randomized order.
[Figure: map of Norway. The geographic distribution of the 15 Norwegian dialects: Bodø, Verdal, Bjugn, Stjørdal, Fræna, Trondheim, Herøy, Lesja, Lillehammer, Bergen, Bø, Borre, Halden, Larvik, Time.]
Experiment
• Task: for each text, each pupil rates the distance between the corresponding dialect and their own dialect.
• Scale from 1 (similar to own dialect) to 10 (not similar to own dialect).
• Final result: a 15 × 15 perceptual distance matrix.
Experiment
Listeners      Be   Bj  Bod   Bø  Bor   Fr   Ha   He   La   Le   Li   St   Ti   Tr   Ve
Bergen        1.7  9.0  8.2  8.0  7.7  7.7  8.2  6.9  8.0  8.9  8.5  8.4  4.8  8.5  8.0
Bjugn         9.1  3.4  6.4  8.2  9.2  5.8  8.3  8.0  8.4  7.3  9.1  2.2  8.0  3.3  2.8
Bodø          8.7  7.9  1.5  8.3  8.3  6.6  7.9  7.8  7.3  8.0  8.7  6.6  8.1  6.2  6.3
Bø            8.1  7.8  7.5  1.0  7.7  8.1  4.9  7.8  5.3  6.0  5.1  7.1  6.3  8.2  8.6
Borre         6.1  8.8  7.8  6.5  1.7  8.5  1.8  7.5  1.6  7.5  2.0  7.2  7.5  8.5  9.1
Fræna         9.0  7.5  7.1  8.4  8.8  3.1  8.1  7.8  8.5  7.2  9.0  6.6  7.4  6.1  7.6
Halden        7.0  8.2  8.0  6.8  4.0  8.1  2.8  7.9  2.8  6.6  3.0  7.4  7.0  8.0  8.3
Herøy         8.6  9.3  8.4  8.5  9.1  7.0  8.6  1.2  9.3  9.3  9.4  8.5  7.5  7.5  8.2
Larvik        7.4  8.7  7.6  4.0  4.0  7.7  3.2  5.6  3.4  7.1  4.6  8.2  6.8  8.3  7.5
Lesja         8.5  7.6  7.8  7.4  8.2  7.3  7.6  7.7  7.6  1.0  7.1  6.9  7.2  7.7  8.2
Lillehammer   6.7  8.3  8.1  6.2  4.4  8.0  3.1  7.5  4.1  7.3  2.7  7.6  6.8  8.7  8.1
Stjørdal      8.7  3.7  6.8  7.7  8.1  6.0  7.5  7.7  8.3  7.1  8.3  2.0  7.7  3.8  3.4
Time          7.0  9.3  8.4  8.1  8.4  8.3  8.0  7.2  8.2  9.1  8.8  8.8  1.8  8.8  9.0
Trondheim     7.8  5.8  6.7  7.5  6.4  7.3  6.0  7.1  5.9  7.9  6.3  4.4  7.6  3.3  6.8
Verdal        8.8  3.4  6.4  8.2  8.4  5.7  7.2  7.9  7.9  7.4  8.4  1.8  7.9  3.1  2.6
Perceptual distances among 15 Norwegian dialect varieties. Row names represent listener groups, column names represent dialect speakers. Columns follow the same order as the rows (Bod = Bodø, Bor = Borre).
Average perceptual distances between 15 Norwegian dialects. Darker lines connect closer points, lighter lines more remote ones. Distance pairs A – B / B – A are averaged.
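The averaging of A – B and B – A pairs can be sketched as follows. This is a minimal illustration, not the procedure actually used for the map: it takes only the Bergen, Bjugn and Bodø ratings from the perceptual distance table, and the function name `symmetrize` is an assumption introduced here.

```python
# Listener-by-speaker ratings for three of the 15 dialects, copied from the
# perceptual distance table (rows = listener groups, columns = speakers).
names = ["Bergen", "Bjugn", "Bodø"]
ratings = [
    [1.7, 9.0, 8.2],
    [9.1, 3.4, 6.4],
    [8.7, 7.9, 1.5],
]


def symmetrize(matrix):
    """Average each pair (A-B, B-A) so that the matrix becomes symmetric.

    `symmetrize` is a hypothetical helper name, not taken from the slides.
    """
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)]
            for i in range(n)]


averaged = symmetrize(ratings)
for name, row in zip(names, averaged):
    print(name, [round(value, 2) for value in row])
```

The diagonal is left unchanged: listeners do not rate the recording from their own location as exactly 1, so the self-distances stay as measured.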
Experiment
• Using the transcriptions we measure lexical distances and pronunciation distances among the 15 local dialect varieties.
• Each dialect text usually consists of 58 different words.
• Validation: How well do the dialectometric distances correlate with the perceptual distances?
Correlations (1)
lexical                      r      expl. var.
relative difference value    0.27    7%
weighted difference value    0.37   14%

pronunciation aggregate      r      expl. var.
Levenshtein (1)              0.71   50%
Levenshtein (2)              0.70   49%
Levenshtein (3)              0.67   45%
Levenshtein PMI (1)          0.71   50%
Levenshtein PMI (3)          0.67   45%
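The explained variance column follows from the correlation column as r² × 100. A minimal sketch that recomputes it from the r values reported on the slide (the dictionary layout and the function name `explained_variance` are assumptions made for illustration):

```python
# Correlations r as reported on the slide (measure name -> r).
correlations = {
    "relative difference value": 0.27,
    "weighted difference value": 0.37,
    "Levenshtein (1)": 0.71,
    "Levenshtein (2)": 0.70,
    "Levenshtein (3)": 0.67,
}


def explained_variance(r):
    """Percentage of variance explained: r squared times 100, rounded."""
    return round(r * r * 100)


for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}, explained variance = {explained_variance(r)}%")
```

The rounded results agree with the percentages in the table.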
Correlations (2)
• In the measurements binary weighting is used. Suprasegmentals and diacritics are ignored.
• No difference in correlation between ‘classic’ Levenshtein and PMI Levenshtein, but alignments made by PMI Levenshtein are better, see Wieling, Prokić and Nerbonne (2009).
Left: perceptual distances. Right: lexical weighted difference value distances. Darker lines connect closer points, lighter lines more remote ones. r = 0.37.
Left: perceptual distances. Right: non-normalized Levenshtein distances. Darker lines connect closer points, lighter lines more remote ones. r = 0.71.
Consistency of distance measures
Consistency
• How many items do we need for dialect comparison? Rule of thumb: 100 items (Goebl).
• In order to answer this question more precisely, measure the degree to which different words in the data set give the same signal of linguistic relationships between the dialects: measure Cronbach’s alpha.
• Example: measure Levenshtein distance between three dialects using four words. In this example we normalize Levenshtein distances per word pair.
[Figure: Levenshtein distances between pronunciations of the word seen in Grouw, Haarlem and Almelo.]
[Figure: Levenshtein distances between pronunciations of the word hart in Grouw, Haarlem and Almelo.]
[Figure: Levenshtein distances between pronunciations of the word son in Grouw, Haarlem and Almelo.]
[Figure: Levenshtein distances between pronunciations of the word house in Grouw, Haarlem and Almelo.]
Consistency
• General pattern: Haarlem and Almelo are linguistically relatively close to each other and relatively distant from Grouw.
• Normalized Levenshtein distances (in %) between the three local dialects:
                       seen   hart   son   house
Grouw vs. Haarlem       71     25    100     75
Grouw vs. Almelo        83     25     75     33
Haarlem vs. Almelo      60     20     50     50
• Using the values in the columns, the words are correlated with each other.
Consistency
• Correlations between words:
                      r      n
seen vs. hart        0.85    3
seen vs. son         0.48    3
seen vs. house      -0.43    3
hart vs. son         0.87    3
hart vs. house       0.11    3
son vs. house        0.59    3
• The average inter-correlation r̄ is 0.41.
Consistency
• Cronbach’s α can be written as a function of the number of words and the average inter-correlation among the words:
$$\alpha = \frac{n_w \times \bar{r}}{1 + (n_w - 1) \times \bar{r}}$$
where $n_w$ is the number of words, which is 4 in our example.
• Calculation:
$$\alpha = \frac{4 \times 0.41}{1 + (4 - 1) \times 0.41} = 0.74$$
• If all words have the same geographic distribution of variants the value of Cronbach’s alpha is 1, if there is no consistency between the words in the data set the value is 0.
• A generally accepted threshold for consistency of the data is 0.70.
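The worked example can be reproduced with a short script. This is a sketch under stated assumptions: the distances are the per-word values for the three dialect pairs given in the example, the formula is the one on this slide, and the helper names `pearson` and `cronbach_alpha` are introduced here for illustration.

```python
from itertools import combinations
from math import sqrt

# Normalized Levenshtein distances per word, copied from the example.
# Order of the three values: Grouw-Haarlem, Grouw-Almelo, Haarlem-Almelo.
distances = {
    "seen": [71, 83, 60],
    "hart": [25, 25, 20],
    "son": [100, 75, 50],
    "house": [75, 33, 50],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def cronbach_alpha(items):
    """Return (alpha, mean inter-correlation) using the slide's formula."""
    pairs = list(combinations(items.values(), 2))
    mean_r = sum(pearson(a, b) for a, b in pairs) / len(pairs)
    n_w = len(items)
    alpha = (n_w * mean_r) / (1 + (n_w - 1) * mean_r)
    return alpha, mean_r


alpha, mean_r = cronbach_alpha(distances)
print(f"average inter-correlation = {mean_r:.2f}, alpha = {alpha:.2f}")
```

Rounded to two decimals, the script gives the same average inter-correlation (0.41) and α (0.74) as the slides.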
Consistency
• In general: the more items are included, the higher Cronbach’s alpha.
• If the Cronbach’s alpha value is very low, add more items!
Consistency
[Figure: two plots of Cronbach’s alpha (y-axis) against number of words (x-axis).]
Left: Cronbach’s α values for random subsets of 2 through 107 words (lexical weighted difference values) and 360 local dialects. From 86 words on α is always higher than 0.70. For 107 words α is equal to 0.75.
Right: Cronbach’s α values for random subsets of 2 through 125 words (Levenshtein distance) and 360 local dialects. From 13 words on α is always higher than 0.70. For 125 words α is equal to 0.97.
Quality of classifications
Quality of classifications
• For clustering compare cophenetic distances to original distances.
• For multidimensional scaling compare interpoint multidimensional scaling distances to original distances.
Cophenetic distances
• In a dendrogram the distances between clusters are represented by the length of the branches.
[Figure: dendrogram of Grouw, Delft, Haarlem, Hattem and Lochem; distance axis from 0 to 40.]
• Cophenetic distance: distance between two local dialects as found in the dendrogram.
• To find it, take the shortest path between the two local dialects in the dendrogram; the cophenetic distance is the longest distance covered in one direction along that path, i.e. the level at which the two dialects are first joined.
Cophenetic distances
            Grouw   Haarlem   Delft   Hattem   Lochem
Grouw         0       44       44      44       44
Haarlem      44        0       16      36.25    36.25
Delft        44       16        0      36.25    36.25
Hattem       44       36.25    36.25    0       20
Lochem       44       36.25    36.25   20        0
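A minimal sketch of how such a table can be derived from a dendrogram. The nested-tuple representation and the function names `leaves` and `cophenetic` are assumptions made for illustration; the merge levels (16, 20, 36.25, 44) are read off the table of cophenetic distances.

```python
from itertools import product

# Dendrogram as nested (level, left, right) tuples; leaves are dialect names.
# The merge levels are taken from the cophenetic distance table.
tree = (44,
        "Grouw",
        (36.25,
         (16, "Haarlem", "Delft"),
         (20, "Hattem", "Lochem")))


def leaves(node):
    """List all dialect names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cophenetic(node, table=None):
    """Fill table[(a, b)] with the level at which a and b are first joined."""
    if table is None:
        table = {}
    if isinstance(node, str):
        table[(node, node)] = 0
        return table
    level, left, right = node
    cophenetic(left, table)
    cophenetic(right, table)
    # Every pair split across the two branches is joined at this node's level.
    for a, b in product(leaves(left), leaves(right)):
        table[(a, b)] = table[(b, a)] = level
    return table


coph = cophenetic(tree)
for a in leaves(tree):
    print(a, [coph[(a, b)] for b in leaves(tree)])
```

The printed rows follow the leaf order of the tree (Grouw, Haarlem, Delft, Hattem, Lochem) and match the table.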
Cophenetic distances
• Cophenetic correlation coefficient: measure of how faithfully the pairwise distances between local dialects as suggested by the dendrogram preserve the original pairwise distances.
• Correlate the pairwise cophenetic distances with the original pairwise distances: r = 0.99
• The amount of variance in the original distances explained by the cophenetic distances is r² × 100 = 97.6% (computed from the unrounded r ≈ 0.988).
Interpoint multidimensional scaling distances
• With multidimensional scaling the five local dialects are plotted in two-dimensional space so that the distances are preserved as well as possible:
[Figure: two-dimensional plot of Grouw, Lochem, Hattem, Haarlem and Delft; x-axis: first dimension, y-axis: second dimension.]
Interpoint multidimensional scaling distances
• We can calculate interpoint distances between the local dialects:
[Figure: two-dimensional plot of Grouw, Lochem, Hattem, Haarlem and Delft; x-axis: first dimension, y-axis: second dimension.]
• Distance between Grouw (-24, 21) and Hattem (18, 6):
$$\sqrt{(-24 - 18)^2 + (21 - 6)^2} = 44.6$$
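The interpoint distance is an ordinary Euclidean distance between plot coordinates. A minimal sketch using the two coordinates given for Grouw and Hattem (the function name `interpoint_distance` is an assumption introduced here):

```python
import math

# Coordinates in the two-dimensional MDS plot, as given on the slide.
grouw = (-24, 21)
hattem = (18, 6)


def interpoint_distance(p, q):
    """Euclidean distance between two points in the MDS plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


d = interpoint_distance(grouw, hattem)
print(round(d, 1))
```

Computing this for every pair of dialects gives the interpoint distance matrix that is then compared with the original distances.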