Sign Clustering and Topic Extraction in Proto-Elamite
Logan Born (1), Kate Kelley (2), Nishant Kambhatla (1), Carolyn Chen (1), Anoop Sarkar (1)
(1) Natural Language Laboratory, School of Computing Science, Simon Fraser University
(2) Department of Classical, Near Eastern, and Religious Studies, University of British Columbia
7 June 2019
Outline
Introduction to Proto-Elamite
Experiments
  Sign Clustering
  n-Gram Frequency
  LDA Topic Modeling
Summary
References
Introduction
Proto-Elamite Overview
&P008016 = MDP 06, 217
#atf: lang qpc
@tablet
@obverse
1. |M218+M218| , # header
2. M056~f M288 , 1(N14) 3(N01)
3. |M054+M384~i+M054~i| M365 , 5(N01)
4. M111~e , 4(N14) 1(N01) 3(N39B)
5. M365 , 1(N14) 3(N01)
6. M075~g , 1(N14) 3(N01)
7. M387~l M348 , 1(N14) 3(N01)
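The layout of the entry lines in this transcription (a line number, non-numeric signs, a comma, then a numeric notation) can be read mechanically. The following is a minimal sketch, not the authors' code: `parse_entry` and the `NUMERAL` pattern are hypothetical names, and the sketch assumes clean lines like those on the slide, with `~` marking variants and `#` introducing a trailing annotation.

```python
import re

# Assumed shape of a numeral token, e.g. "1(N14)" or "3(N39B)".
NUMERAL = re.compile(r"^\d+\(N\d+[A-Z]?\)$")


def parse_entry(line):
    """Split one entry line into (line_number, signs, numerals)."""
    # Drop a trailing annotation such as "# header" (assumption: "#" is
    # not used for anything else on an entry line).
    line = line.split("#", 1)[0]
    number, _, rest = line.partition(". ")
    text, _, numeric = rest.partition(",")
    signs = text.split()
    numerals = [tok for tok in numeric.split() if NUMERAL.match(tok)]
    return number.strip(), signs, numerals


if __name__ == "__main__":
    print(parse_entry("2. M056~f M288 , 1(N14) 3(N01)"))
    print(parse_entry("1. |M218+M218| , # header"))
```

Metadata lines such as `#atf: lang qpc` or `@obverse` are not entries and would need to be filtered out before calling this function.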
[Figure: Proto-Elamite and Proto-Cuneiform sign forms compared for N08A, N01, N14, N34, N48, N45, N50]
Proto-Elamite Data
◮ Corpus transcribed by CDLI
◮ 1399 texts containing ≥ 1 readable non-numeric sign
◮ Average tablet length is 27 signs (10 non-numeric)
◮ 1623 sign types
  ◮ 49 numeric
  ◮ 287 basic non-numeric
  ◮ 1087 variants
  ◮ 249 complex graphemes
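The four categories of sign type can be approximated from the sign names alone. The sketch below is an illustration under stated assumptions rather than the procedure used to obtain the counts on this slide: it assumes that numeric signs start with `N`, that variants carry a `~` suffix, and that complex graphemes are wrapped in `|...|`. The names `bare_sign`, `sign_category`, and `type_counts` are hypothetical.

```python
import re
from collections import Counter

# A counted numeral such as "3(N01)"; group 1 is the sign itself ("N01").
COUNTED = re.compile(r"\d+\((.+)\)")


def bare_sign(token):
    """Strip a repetition count, so "3(N01)" becomes "N01"."""
    match = COUNTED.fullmatch(token)
    return match.group(1) if match else token


def sign_category(token):
    """Classify a token by its name (naming conventions are assumptions)."""
    sign = bare_sign(token)
    if sign.startswith("|") and sign.endswith("|"):
        return "complex grapheme"
    if sign.startswith("N"):
        return "numeric"
    if "~" in sign:
        return "variant"
    return "basic"


def type_counts(tokens):
    """Count distinct sign types per category."""
    types = {bare_sign(token) for token in tokens}
    return Counter(sign_category(sign) for sign in types)


if __name__ == "__main__":
    tokens = ["|M218+M218|", "M056~f", "M288", "1(N14)", "3(N01)", "M365", "5(N01)"]
    print(type_counts(tokens))
```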
Data Exploration in Proto-Elamite
◮ Goal: Extract information to assist human decipherment experts
◮ Hierarchical clustering of signs
◮ n-gram frequencies
◮ LDA topic modelling
Contributions
◮ Rediscover results from manual investigation of the corpus
◮ Highlight novel patterns to inform future decipherment attempts
◮ Provide code for other groups to work with proto-Elamite
Sign Clustering
Sign Clustering Methodology
Goal:
◮ Group signs with similar distributions.
Three different clustering techniques:
◮ Co-occurrence vectors (left and right neighbors)
◮ Hidden Markov Model (HMM) emission probabilities
◮ Brown clustering
Reduce impact of noise by finding common groupings across all three techniques.
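Two of these steps can be sketched briefly: building left/right neighbor vectors for each sign, and keeping only the groupings that all three techniques agree on. This is a minimal illustration, not the authors' implementation. The slides do not say how common groupings were found, so `consensus_pairs` assumes one simple definition (a pair of signs counts only if every clustering puts both in the same cluster). The HMM and Brown clusterings are taken as given inputs, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def neighbor_vectors(entries):
    """Map each sign to counts of its left ("L", x) and right ("R", x) neighbors."""
    vectors = defaultdict(Counter)
    for entry in entries:
        # Pad so that entry-initial and entry-final positions are also counted.
        padded = ["<s>"] + list(entry) + ["</s>"]
        for left, sign, right in zip(padded, padded[1:], padded[2:]):
            vectors[sign][("L", left)] += 1
            vectors[sign][("R", right)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def consensus_pairs(*clusterings):
    """Return the sign pairs that every clustering places in the same cluster."""
    def co_clustered(clusters):
        return {frozenset(pair)
                for cluster in clusters
                for pair in combinations(sorted(cluster), 2)}

    pair_sets = [co_clustered(clusters) for clusters in clusterings]
    return set.intersection(*pair_sets) if pair_sets else set()


if __name__ == "__main__":
    entries = [["M1", "M9", "M2"], ["M1", "M8", "M2"], ["M3", "M9", "M1"]]
    vecs = neighbor_vectors(entries)
    print(cosine(vecs["M9"], vecs["M8"]))
```

The similarity scores from `cosine` could then feed a hierarchical clustering routine; that step is omitted here.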
Sign Clustering Results
Rediscover results from manual work:
◮ Groups variants believed to have similar/identical function
◮ Groups “syllabic” signs (Dahl 2019, Desset 2016, Meriggi 1971)
[Figure: clusters under the Neighbor, HMM, and Brown methods]
Novel grouping: signs resembling numerals or written with a rounded stylus.
[Figure: clusters under the Neighbor, HMM, and Brown methods]
n-Gram Frequency
n-Gram Frequency Methodology
Goal:
◮ Identify important (i.e. frequently repeated) signs and phrases.
◮ See signs in wider context.
Did not count n-grams containing numeric signs.
◮ Want to focus on undeciphered signs.
◮ Do not want n-grams spanning multiple entries.
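The counting rule described here can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' code: each entry is assumed to be a list of tokens, numerals are assumed to look like `1(N14)` or a bare `N14`, and `is_numeric` and `count_ngrams` are hypothetical names.

```python
from collections import Counter


def is_numeric(token):
    # Assumption: numerals appear as "1(N14)" or as a bare sign such as "N14".
    return "(N" in token or token.startswith("N")


def count_ngrams(entries, n):
    """Count n-grams of non-numeric signs, never crossing an entry boundary."""
    counts = Counter()
    for entry in entries:  # one entry at a time, so no n-gram spans two entries
        signs = list(entry)
        for i in range(len(signs) - n + 1):
            gram = tuple(signs[i:i + n])
            if any(is_numeric(token) for token in gram):
                continue  # skip any n-gram that contains a numeric sign
            counts[gram] += 1
    return counts


if __name__ == "__main__":
    entries = [
        ["M056~f", "M288", "1(N14)", "3(N01)"],
        ["M365", "1(N14)"],
        ["M056~f", "M288", "5(N01)"],
    ]
    print(count_ngrams(entries, 2).most_common())
```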
n-Gram Frequency Results
Can group n-grams with low edit distance:
M305 M388 M240 M097~h M004 M218
M305 M388 M146 M097~h M004 M218
M305 M388 M347 M097~h M004 M218
Highlighted signs (M240, M146, M347: the one position that differs) may...
◮ Qualify M388?
  ◮ Identifying specific classes of individual
◮ Form series of names built on M097~h M004 M218?
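Grouping n-grams by edit distance can be sketched with a standard Levenshtein distance computed over whole signs rather than characters. The grouping step below is a simple greedy merge and the threshold `max_distance=1` is an assumption; the slides do not state which threshold or grouping procedure was used. `edit_distance` and `group_similar` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sign sequences (one sign = one symbol)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute x -> y
        prev = curr
    return prev[-1]


def group_similar(ngrams, max_distance=1):
    """Merge n-grams into groups linked by pairs within max_distance edits."""
    groups = []
    for gram in ngrams:
        matches = [g for g in groups
                   if any(edit_distance(gram, other) <= max_distance for other in g)]
        merged = [gram]
        for g in matches:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups


if __name__ == "__main__":
    grams = [
        ("M305", "M388", "M240", "M097~h", "M004", "M218"),
        ("M305", "M388", "M146", "M097~h", "M004", "M218"),
        ("M305", "M388", "M347", "M097~h", "M004", "M218"),
    ]
    for group in group_similar(grams):
        print(group)
```

With the three n-grams from the slide, every pair differs by a single substitution, so all three fall into one group.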