Natural Language Processing Diachronics Dan Klein UC Berkeley - PowerPoint PPT Presentation


  1. 12/1/2014 Natural Language Processing: Diachronics. Dan Klein, UC Berkeley. Includes joint work with Alex Bouchard-Côté, Tom Griffiths, and David Hall.

  2. The Task

  3. Lexical Reconstruction. Latin: focus. French: feu; Spanish: fuego; Italian: fuoco; Portuguese: fogo.

  4. Tree of Languages. We assume the phylogeny is known. Much work in biology, e.g. work by Warnow, Felsenstein, Steele… Also in linguistics, e.g. Warnow et al., Gray and Atkinson… http://andromeda.rutgers.edu/~jlynch/language.html

  5. Evolution through Sound Changes. Latin camera /kamera/ became French chambre /ʃambʁ/. Deletion: /e/, /a/. Change: /k/ .. /tʃ/ .. /ʃ/. Insertion: /b/. Eng. camera is from Latin, “camera obscura”; Eng. chamber is from Old Fr., before the initial /t/ dropped.

  6. Changes are Systematic. camera /kamera/ → camra /kamra/ and numerus /numerus/ → numrus /numrus/, both by the rule e → _.

  7. Changes are Contextual. camera /kamera/ → camra /kamra/ by the rule e → _ / after stress.

  8. Changes Have Structure. camra /kamra/ → cambra /kambra/ by the rule _ → b / m_r, generalized as _ → [stop x] / [nasal x]_r.
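The three rule slides above can be read as ordered, context-sensitive rewrite rules. The following is a minimal sketch of that reading, not the model from the talk. It assumes a made-up encoding in which an apostrophe precedes the stressed vowel, and the function names and the nasal-to-stop table are hypothetical.

```python
import re

# Assumed for illustration: the voiced stop sharing a place of articulation
# with each nasal, standing in for "[stop x]" and "[nasal x]".
HOMORGANIC_STOP = {"m": "b", "n": "d"}


def delete_post_tonic_e(word: str) -> str:
    """e -> _ / after stress: drop the first 'e' following the stressed syllable.

    Stress is marked by an apostrophe before the stressed vowel (an assumed
    encoding, e.g. "k'amera").
    """
    return re.sub(r"('[aeiou][^aeiou]+)e", r"\1", word, count=1)


def insert_stop(word: str) -> str:
    """_ -> [stop x] / [nasal x]_r: insert a homorganic stop between a nasal and r."""
    return re.sub(
        r"([mn])(?=r)",
        lambda m: m.group(1) + HOMORGANIC_STOP[m.group(1)],
        word,
    )


# Rules apply in order, each to the output of the previous one.
RULES = [delete_post_tonic_e, insert_stop]


def apply_rules(word: str, rules=RULES) -> str:
    for rule in rules:
        word = rule(word)
    return word


if __name__ == "__main__":
    for w in ["k'amera", "n'umerus"]:
        print(w, "->", apply_rules(w))
```

Because the same rule functions run on every word, a change learned from one word applies across the lexicon, which is the sense of "systematic" used on the slides.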

  9. Changes are Systematic. English Great Vowel Shift (simplified!): “time” = teem shifted to “time” = taim. [Vowel chart: i, e, a]

  10. Diachronic Evidence. Appendix Probi [ca 300]: “tonitru non tonotru”. Yahoo! Answers [ca 2000]: “tonight not tonite”.

  11. Synchronic (Comparative) Evidence. Key idea: changes occur uniformly across the lexicon.

  12. The Data

  13. The Data. Data sets. Small: Romance (French, Italian, Portuguese, Spanish); 2344 words; complete cognate sets (FR, IT, PT, ES); target: (Vulgar) Latin.

  14. The Data. Data sets. Small: Romance (French, Italian, Portuguese, Spanish); 2344 words; complete cognate sets (FR, IT, PT, ES); target: (Vulgar) Latin. Large: Austronesian; 637 languages; 140K words; incomplete cognate sets; target: Proto-Austronesian.

  15. Austronesian

  16. Austronesian Examples. From the Austronesian Basic Vocabulary Database.

  17. The Model

  18. Simple Model: Single Characters [cf. Felsenstein 81]. [Figure: nodes labeled with the single characters G and C]
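For the single-character case, the likelihood of the observed leaves on a known tree can be computed bottom-up in the style of Felsenstein's pruning recursion. The sketch below is an illustration under strong assumptions that are not from the slides: a two-symbol alphabet, one shared change probability `p_change` on every branch, a uniform prior at the root, and trees encoded as nested tuples with strings at the leaves.

```python
ALPHABET = ("C", "G")


def transition(parent: str, child: str, p_change: float) -> float:
    """P(child symbol | parent symbol) on one branch (two-symbol alphabet)."""
    return 1.0 - p_change if parent == child else p_change


def prune(node, p_change: float) -> dict:
    """Return {symbol: P(observed leaves below node | node carries symbol)}."""
    if isinstance(node, str):  # leaf: the character is observed
        return {s: 1.0 if s == node else 0.0 for s in ALPHABET}
    result = {s: 1.0 for s in ALPHABET}
    for child in node:  # internal node: a tuple of subtrees
        child_lik = prune(child, p_change)
        for s in ALPHABET:
            # Sum out the child's symbol, then multiply across children.
            result[s] *= sum(
                transition(s, t, p_change) * child_lik[t] for t in ALPHABET
            )
    return result


def likelihood(tree, p_change: float) -> float:
    """Marginal likelihood of the leaves under a uniform root prior."""
    root = prune(tree, p_change)
    return sum(root[s] / len(ALPHABET) for s in ALPHABET)


if __name__ == "__main__":
    tree = (("G", "G"), "C")  # hypothetical three-leaf tree
    print(likelihood(tree, 0.1))
```

The recursion is cheap because each node carries only one symbol; the later slides note that inference becomes hard once the variables are whole strings.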

  19. Changes are Systematic. [Figure: trees for /fokus/ (intermediate form /fogo/; leaves /fwɔko/, /fweɣo/, /fogo/) and for /kentrum/ (intermediate form /sentro/; leaves /tʃɛntro/, /sentro/, /sentro/)]

  20. Parameters are Branch-Specific. [Figure: tree for focus with one parameter set per branch (IB, IT, ES, PT): LA /fokus/ → IB /fogo/ → IT fuoco /fwɔko/, ES fuego /fweɣo/, PT fogo /fogo/] [Bouchard-Côté, Griffiths, Klein, 07]

  21. Edits are Contextual, Structured. [Figure: on the IT branch, /fokus/ → /fwɔko/, with the edit o → w ɔ in the context # f]

  22. Inference

  23. Learning: Objective. [Figure: tree /fokus/ → /fogo/ → /fwɔko/, /fweɣo/, /fogo/, with variables labeled z and w]

  24. Learning: EM. M-Step: find parameters which fit (expected) sound change counts. Easy: gradient ascent on theta. E-Step: find (expected) change counts given parameters. Hard: variables are string-valued. [Figure: tree /fokus/ → /fogo/ → /fwɔko/, /fweɣo/, /fogo/]
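The "easy" M-step can be illustrated with a toy log-linear model fit by gradient ascent. This sketch assumes one weight per outcome and made-up expected counts; the model in the talk uses contextual, structured features, which this example does not attempt to reproduce. With one weight per outcome the optimum is just the normalized counts, so the example mainly shows the shape of the update.

```python
import math


def softmax(theta: dict) -> dict:
    """Turn unnormalized log-weights into a probability distribution."""
    m = max(theta.values())
    exp = {k: math.exp(v - m) for k, v in theta.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}


def m_step(expected_counts: dict, steps: int = 200, lr: float = 1.0) -> dict:
    """Gradient ascent on theta for the expected log-likelihood
    sum_k count_k * log p_k(theta), scaled by the total count."""
    theta = {k: 0.0 for k in expected_counts}
    total = sum(expected_counts.values())
    for _ in range(steps):
        probs = softmax(theta)
        for k in theta:
            # d/d theta_k of the scaled objective: count_k / total - p_k
            theta[k] += lr * (expected_counts[k] / total - probs[k])
    return softmax(theta)


if __name__ == "__main__":
    # Hypothetical expected counts of what happens to /k/ on one branch.
    counts = {"k->k": 6.0, "k->ch": 3.0, "k->s": 1.0}
    print(m_step(counts))
```

Gradient ascent is used here rather than the closed form because, once features are shared across outcomes and contexts, no closed-form maximizer is available in general.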

  25. Computing Expectations. Standard approach, e.g. [Holmes 2001]: Gibbs sampling each sequence. [Figure: ‘grass’] [Holmes 01; Bouchard-Côté, Griffiths, Klein 07]

  26. A Gibbs Sampler ‘grass’

  27. A Gibbs Sampler ‘grass’

  28. A Gibbs Sampler ‘grass’

  29. Getting Stuck. How could we jump to a state where the liquids /r/ and /l/ have a common ancestor?

  30. Getting Stuck

  31. Efficient Sampling: Vertical Slices. [Figure panels: Single Sequence Resampling; Ancestry Resampling] [Bouchard-Côté, Griffiths, Klein, 08]

  32. Results

  33. Results: Romance

  34. Learned Rules / Mutations

  35. Learned Rules / Mutations

  36. Results: Austronesian

  37. Examples: Austronesian [Bouchard-Côté, Hall, Griffiths, Klein, 13]

  38. Result: More Languages Help. Distance from Blust [1993] reconstructions. [Plot axes: mean edit distance; number of modern languages used]
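The evaluation metric named on this slide, mean edit distance between system reconstructions and reference reconstructions, can be sketched as follows. This assumes plain Levenshtein distance with unit costs over characters; the slides do not say which variant or which symbol segmentation was used, and the word pairs below are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(
                min(
                    prev[j] + 1,             # delete x
                    cur[j - 1] + 1,          # insert y
                    prev[j - 1] + (x != y),  # substitute, or match for free
                )
            )
        prev = cur
    return prev[-1]


def mean_edit_distance(pairs) -> float:
    """Average distance over (reconstruction, reference) pairs."""
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    pairs = [("fogo", "fogo"), ("fokus", "fogo")]  # illustrative pairs
    print(mean_edit_distance(pairs))
```

A lower mean distance means the reconstructions are closer to the reference forms.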

  39. Visualization: Learned Universals. *The model did not have features encoding natural classes.

  40. Regularity and Functional Load. In a language, some pairs of sounds are more contrastive than others (higher functional load). Example: English p/d versus t/th. High load, p/d: pot/dot, pin/din, dress/press, pew/dew, ... Low load, t/th: thin/tin.

  41. Functional Load: Timeline. 1955: Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]. 1967: previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67] (based on 4 languages, as noted by [Hockett, 67; Surendran et al., 06]). Our approach: we reexamined the question with two orders of magnitude more data [Bouchard-Côté, Hall, Griffiths, Klein, 13].

  42. Regularity and Functional Load. Data: only 4 languages from the Austronesian data. Each dot is a sound change identified by the system. [Plot axes: merger posterior probability; functional load as computed by [King, 67]]

  43. Regularity and Functional Load. Data: all 637 languages from the Austronesian data. [Plot axes: merger posterior probability; functional load as computed by [King, 67]]

  44. Extensions

  45. Cognate Detection. [Figure: ‘fire’; word forms /fwɔko/, /vɛrbo/, /tʃɛntro/, /sentro/, /berβo/, /fweɣo/, /vɛrbo/, /fogo/, /sɛntro/] [Hall and Klein, 11]

  46. Grammar Induction. Avg rel gain: 29%. [Bar chart over Portuguese, Swedish, Chinese, Spanish, Slovene, English, Danish, Dutch; tree labels GL, IE, G, RM, WG, NG] [Berg-Kirkpatrick and Klein, 07]

  47. Language Diversity. Why are the languages of the world so similar? Universal grammar answer: hardware constraints. Common source answer: not much time has passed. [Rafferty, Griffiths, and Klein, 09]
