
Sign Clustering and Topic Extraction in Proto-Elamite



  1. Sign Clustering and Topic Extraction in Proto-Elamite
     Logan Born (1), Kate Kelley (2), Nishant Kambhatla (1), Carolyn Chen (1), Anoop Sarkar (1)
     (1) Natural Language Laboratory, School of Computing Science, Simon Fraser University
     (2) Department of Classical, Near Eastern, and Religious Studies, University of British Columbia
     7 June 2019
     1 / 37

  2. Outline
     Introduction to Proto-Elamite
     Experiments
       Sign Clustering
       n-Gram Frequency
       LDA Topic Modeling
     Summary
     References

  3. Introduction

  4. Proto-Elamite Overview

  5. Proto-Elamite Overview

         &P008016 = MDP 06, 217
         #atf: lang qpc
         @tablet
         @obverse
         1. |M218+M218| , # header
         2. M056~f M288 , 1(N14) 3(N01)
         3. |M054+M384~i+M054~i| M365 , 5(N01)
         4. M111~e , 4(N14) 1(N01) 3(N39B)
         5. M365 , 1(N14) 3(N01)
         6. M075~g , 1(N14) 3(N01)
         7. M387~l M348 , 1(N14) 3(N01)
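To make the layout of the transliteration above concrete, here is a minimal Python sketch that splits each numbered line into its signs and its numeric notation. It assumes, based only on this one sample, that a comma separates the signs from the numerals, that numerals are written as `count(Nxx)`, and that `#` starts an annotation. The names `parse_entries`, `ENTRY`, and `NUMERAL` are hypothetical and are not taken from the authors' code.

```python
import re

# The transliteration shown on the slide.
ATF = """\
&P008016 = MDP 06, 217
#atf: lang qpc
@tablet
@obverse
1. |M218+M218| , # header
2. M056~f M288 , 1(N14) 3(N01)
3. |M054+M384~i+M054~i| M365 , 5(N01)
4. M111~e , 4(N14) 1(N01) 3(N39B)
5. M365 , 1(N14) 3(N01)
6. M075~g , 1(N14) 3(N01)
7. M387~l M348 , 1(N14) 3(N01)
"""

ENTRY = re.compile(r"^(\d+)\.\s+(.*)$")      # numbered content lines
NUMERAL = re.compile(r"^(\d+)\((N\w+)\)$")   # e.g. 3(N01) or 3(N39B)


def parse_entries(atf):
    """Return one dict per numbered line: line number, signs, numerals."""
    entries = []
    for line in atf.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # skip the &, # and @ metadata lines
        body = match.group(2).split("#", 1)[0]  # drop trailing annotation
        text, _, notation = body.partition(",")  # assumed separator
        numerals = []
        for token in notation.split():
            num = NUMERAL.match(token)
            if num:
                numerals.append((int(num.group(1)), num.group(2)))
        entries.append({
            "line": int(match.group(1)),
            "signs": text.split(),
            "numerals": numerals,
        })
    return entries


entries = parse_entries(ATF)
for entry in entries:
    print(entry["line"], entry["signs"], entry["numerals"])
```

Complex graphemes such as `|M054+M384~i+M054~i|` stay as single tokens here because they contain no spaces; splitting them into components would be a separate step.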


  11. Proto-Elamite Overview
      [Figure: sign comparison with columns Proto-Elamite and Proto-Cuneiform; signs N08A, N01, N14, N34, N48, N45, N50]

  12. Proto-Elamite Overview


  20. Proto-Elamite Data
      ◮ Corpus transcribed by CDLI
      ◮ 1399 texts containing ≥ 1 readable non-numeric sign
      ◮ Average tablet length is 27 signs (10 non-numeric)
      ◮ 1623 sign types
      ◮ 49 numeric
      ◮ 287 basic non-numeric
      ◮ 1087 variants
      ◮ 249 complex graphemes
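A breakdown like the one above can be approximated from sign names alone. The sketch below is illustrative only: it assumes the naming conventions visible in the sample transliteration (names starting with `N` are numeric, `|...|` marks a complex grapheme, `~` marks a variant) and it is not the procedure the authors used to obtain their counts. The function names `sign_category` and `type_counts` are hypothetical.

```python
from collections import Counter


def sign_category(sign):
    """Classify a sign name using conventions assumed from the sample text."""
    if sign.startswith("|") and sign.endswith("|"):
        return "complex grapheme"
    if sign.startswith("N"):
        return "numeric"
    if "~" in sign:
        return "variant"
    return "basic non-numeric"


def type_counts(signs):
    """Count distinct sign types (not tokens) per category."""
    return Counter(sign_category(s) for s in set(signs))


# Sign tokens taken from the sample tablet; repeated tokens count once.
sample = ["|M218+M218|", "M056~f", "M288", "N14", "N01",
          "M365", "N01", "M365", "M111~e", "N39B"]
counts = type_counts(sample)
print(dict(counts))
```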


  24. Data Exploration in Proto-Elamite
      ◮ Goal: Extract information to assist human decipherment experts
      ◮ Hierarchical clustering of signs
      ◮ n-gram frequencies
      ◮ LDA topic modeling


  27. Contributions
      ◮ Rediscover results from manual investigation of the corpus
      ◮ Highlight novel patterns to inform future decipherment attempts
      ◮ Provide code for other groups to work with proto-Elamite

  28. Sign Clustering


  34. Sign Clustering Methodology
      Goal:
      ◮ Group signs with similar distributions.
      Three different clustering techniques:
      ◮ Co-occurrence vectors (left and right neighbors)
      ◮ Hidden Markov Model (HMM) emission probabilities
      ◮ Brown clustering
      Reduce impact of noise by finding common groupings across all three techniques.
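The first technique and the final consensus step can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it builds left/right neighbor count vectors, compares them with cosine similarity (the slides do not name the similarity measure, so that choice is an assumption), and keeps only sign pairs that every clustering places together. The HMM and Brown clusterings are not implemented; they appear only as hypothetical inputs. Signs are placeholder letters rather than real corpus data.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def neighbor_vectors(entries):
    """Map each sign to counts of its left ("L") and right ("R") neighbors."""
    vecs = defaultdict(Counter)
    for signs in entries:
        padded = ["<s>"] + list(signs) + ["</s>"]  # entry boundary markers
        for i in range(1, len(padded) - 1):
            vecs[padded[i]][("L", padded[i - 1])] += 1
            vecs[padded[i]][("R", padded[i + 1])] += 1
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def co_clustered_pairs(clusters):
    """Unordered sign pairs that share a cluster within one clustering."""
    return {frozenset(pair)
            for cluster in clusters
            for pair in combinations(sorted(cluster), 2)}


def consensus_pairs(*clusterings):
    """Sign pairs grouped together by every clustering passed in."""
    return set.intersection(*(co_clustered_pairs(c) for c in clusterings))


# Toy entries: X and Y occur in identical contexts, Z does not.
toy = [["A", "X", "B"], ["A", "Y", "B"], ["C", "Z"]]
vecs = neighbor_vectors(toy)
print(cosine(vecs["X"], vecs["Y"]), cosine(vecs["X"], vecs["Z"]))

# Hypothetical outputs of the three techniques for signs X, Y, Z.
neighbor = [{"X", "Y"}, {"Z"}]
hmm = [{"X", "Y", "Z"}]
brown = [{"X", "Y"}, {"Z"}]
print(consensus_pairs(neighbor, hmm, brown))
```

Intersecting co-clustered pairs is one simple way to read "common groupings across all three techniques"; a pair survives only if no technique separates it.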

  35. Sign Clustering Results
      Rediscover results from manual work:
      ◮ Groups variants believed to have similar/identical function
      ◮ Groups “syllabic” signs (Dahl 2019, Desset 2016, Meriggi 1971)
      [Figure panels: Neighbor, HMM, Brown]


  38. Sign Clustering Results
      Novel grouping: signs resembling numerals or written with rounded stylus.
      [Figure panels: Neighbor, HMM, Brown]

  39. n-Gram Frequency


  43. n-Gram Frequency Methodology
      Goal:
      ◮ Identify important (i.e. frequently repeated) signs and phrases.
      ◮ See signs in wider context.
      Did not count n-grams containing numeric signs.
      ◮ Want to focus on undeciphered signs.
      ◮ Do not want n-grams spanning multiple entries.
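The two counting constraints above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: each entry is a separate list of tokens, so n-grams never cross entry boundaries, and a token is treated as numeric if it starts with `N` or contains `(N`, a convention inferred from the sample transliteration. The names `is_numeric` and `count_ngrams` are hypothetical.

```python
from collections import Counter


def is_numeric(sign):
    """Assumed convention: numeric tokens look like 'N14' or '3(N01)'."""
    return sign.startswith("N") or "(N" in sign


def count_ngrams(entries, n):
    """Count n-grams inside each entry, skipping any with a numeric sign."""
    counts = Counter()
    for signs in entries:  # one list per entry, so no n-gram spans entries
        for i in range(len(signs) - n + 1):
            gram = tuple(signs[i:i + n])
            if not any(is_numeric(s) for s in gram):
                counts[gram] += 1
    return counts


# Three entries from the sample tablet, signs followed by their numerals.
entries = [
    ["M056~f", "M288", "1(N14)", "3(N01)"],
    ["M365", "1(N14)", "3(N01)"],
    ["M387~l", "M348", "1(N14)", "3(N01)"],
]
bigrams = count_ngrams(entries, 2)
print(bigrams)
```

Because every entry here ends in numerals, no bigram such as `M288 M365` is produced across the first two entries.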


  47. n-Gram Frequency Results
      Can group n-grams with low edit distance:
      M305 M388 M240 M097~h M004 M218
      M305 M388 M146 M097~h M004 M218
      M305 M388 M347 M097~h M004 M218
      Highlighted signs (M240, M146, M347, the only position where the sequences differ) may...
      ◮ Qualify M388?
        ◮ Identifying specific classes of individual
      ◮ Form series of names built on M097~h M004 M218?
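Grouping n-grams by edit distance can be sketched with a standard Levenshtein distance computed over sign tokens instead of characters. The greedy grouping rule below (a sequence joins a group only if it is within `max_dist` of every member) is an assumption made for illustration; the slides do not say how the groups were formed. The names `edit_distance` and `group_similar` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of sign tokens."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def group_similar(ngrams, max_dist=1):
    """Greedily group n-grams whose pairwise distance is at most max_dist."""
    groups = []
    for gram in ngrams:
        for group in groups:
            if all(edit_distance(gram, other) <= max_dist for other in group):
                group.append(gram)
                break
        else:
            groups.append([gram])  # no compatible group found
    return groups


# The three sequences from the slide.
a = tuple("M305 M388 M240 M097~h M004 M218".split())
b = tuple("M305 M388 M146 M097~h M004 M218".split())
c = tuple("M305 M388 M347 M097~h M004 M218".split())

groups = group_similar([a, b, c])
print(edit_distance(a, b), len(groups))
```

Each pair of these sequences differs by a single substitution in the third position, so all three land in one group.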
