

  1. Probabilistic Spelling Correction • CE-324: Modern Information Retrieval, Sharif University of Technology • M. Soleymani, Fall 2016 • Most slides have been adapted from Profs. Manning, Nayak & Raghavan's lectures (CS-276, Stanford)

  2. Applications of spelling correction

  3. Spelling Tasks • Spelling Error Detection • Spelling Error Correction: • Autocorrect: hte → the • Suggest a correction • Suggestion lists

  4. Types of spelling errors • Non-word errors: graffe → giraffe • Real-word errors: • Typographical errors: three → there • Cognitive errors (homophones): piece → peace, too → two, your → you're • Real-word correction almost always needs to be context-sensitive

  5. Spelling correction steps • For each word w, generate a candidate set: • Find candidate words with similar pronunciations • Find candidate words with similar spellings • Choose the best candidate • By "weighted edit distance" or the "noisy channel" approach • Context-sensitive, so we have to consider whether the surrounding words "make sense": "Flying form Heathrow to LAX" → "Flying from Heathrow to LAX"
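
A common way to generate candidates by spelling similarity is to enumerate every string within one edit of the typed word and keep those that are dictionary words. The sketch below is adapted from Peter Norvig's well-known spelling corrector (the name edits1 is his; everything else is standard Python):

```python
import string

def edits1(word):
    """All strings one edit (insert/delete/substitute/transpose)
    away from `word`; candidates are those found in a dictionary."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print("the" in edits1("hte"))  # True: one transposition away
```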

  6. Candidate Testing: Damerau-Levenshtein edit distance • Minimal edit distance between two strings, where edits are: • Insertion • Deletion • Substitution • Transposition of two adjacent letters
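
A minimal dynamic-programming sketch of this distance (the restricted variant, in which a transposed pair is not edited again; function and variable names are ours, not from the slides):

```python
def damerau_levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, substitutions, and
    adjacent transpositions needed to turn s into t (slide 6)."""
    m, n = len(s), len(t)
    # d[i][j] = distance between prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            # transposition of two adjacent letters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

assert damerau_levenshtein("acress", "actress") == 1  # one insertion
assert damerau_levenshtein("hte", "the") == 1         # one transposition
```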


  8. Noisy channel intuition

  9. Noisy channel • We see an observation x of a misspelled word • Find the correct word w: ŵ = argmax_w P(w | x) = argmax_w P(x | w) P(w) / P(x) = argmax_w P(x | w) P(w)
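
As a sketch, the decision rule in code, assuming a candidate generator and the two probability models described on the next slides (all three names are placeholders for what follows):

```python
def correct(x, candidates, channel_prob, lang_prob):
    """Noisy-channel correction: among the candidates for the observed
    string x, pick the word w maximizing P(x | w) * P(w)."""
    return max(candidates(x), key=lambda w: channel_prob(x, w) * lang_prob(w))
```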

  10. Language Model • Take a big supply of words with T tokens: P(w) = C(w) / T, where C(w) = # occurrences of w • Supply of words: your document collection • In other applications you can take the supply to be typed queries (suitably filtered), when a static dictionary is inadequate
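
A minimal sketch of this unigram estimate (the toy corpus and names are ours):

```python
from collections import Counter

def unigram_lm(tokens):
    """P(w) = C(w) / T over a corpus of T tokens (slide 10)."""
    counts, T = Counter(tokens), len(tokens)
    return lambda w: counts[w] / T

P = unigram_lm("two of the same the end".split())
print(P("the"))  # 2/6
```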

  11. Unigram prior probability • Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

  12. Channel model probability • Error model probability, edit probability • Misspelled word x = x_1 x_2 … x_m • Correct word w = w_1 w_2 … w_n • P(x|w) = probability of the edit (deletion/insertion/substitution/transposition)

  13. Calculating P(x|w) • Still a research question • Can be estimated in some simple ways, e.g., a confusion matrix: • A square 26 × 26 table that records how many times one letter was incorrectly used instead of another • Usually there are four confusion matrices: deletion, insertion, substitution, and transposition

  14. Computing error probability: Confusion matrix • del[x,y]: count(xy typed as x) • ins[x,y]: count(x typed as xy) • sub[x,y]: count(y typed as x) • trans[x,y]: count(xy typed as yx) • Insertion and deletion are conditioned on the previous character

  15. Confusion matrix for substitution • The cell [o,e] in a substitution confusion matrix gives the count of times that e was substituted for o

  16. Channel model • Following Kernighan, Church & Gale (1990), for a single edit at position i: • P(x|w) = del[w_{i-1}, w_i] / count(w_{i-1} w_i) for a deletion • ins[w_{i-1}, x_i] / count(w_{i-1}) for an insertion • sub[x_i, w_i] / count(w_i) for a substitution • trans[w_i, w_{i+1}] / count(w_i w_{i+1}) for a transposition

  17. Smoothing probabilities: Add-1 smoothing • With an |A|-character alphabet, add 1 to every confusion count and |A| to the denominator, e.g. P(x|w) = (sub[x_i, w_i] + 1) / (count(w_i) + |A|) for a substitution
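
A sketch of slides 14–17 in code, assuming the four confusion matrices and the character/bigram counts have already been collected from a corpus of errors (the dict layout and all names here are our assumptions):

```python
ALPHABET = 26  # |A|

def edit_channel_prob(edit, confusion, counts):
    """Add-1-smoothed P(x | w) for a single edit.

    edit      -- ('del', prev, deleted), ('ins', prev, inserted),
                 ('sub', typed, intended), or ('trans', first, second)
    confusion -- the four matrices, e.g. confusion['sub'][('a', 'e')]
    counts    -- character and character-bigram counts from the corpus
    """
    kind, a, b = edit
    numer = confusion[kind].get((a, b), 0) + 1
    # deletion/transposition are conditioned on a character bigram;
    # insertion on the previous character, substitution on the intended one
    denom_key = a + b if kind in ('del', 'trans') else (a if kind == 'ins' else b)
    return numer / (counts.get(denom_key, 0) + ALPHABET)
```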

  18. Channel model for acress • Candidate corrections within one edit of the typo acress: actress, cress, caress, access, across, acres


  21. Noisy channel for real-word spell correction • Given a sentence w_1, w_2, w_3, …, w_n • Generate a set of candidates for each word w_i (including the word itself): • Candidates(w_1) = {w_1, w'_1, w''_1, w'''_1, …} • Candidates(w_2) = {w_2, w'_2, w''_2, w'''_2, …} • Candidates(w_n) = {w_n, w'_n, w''_n, w'''_n, …} • Choose the sequence W that maximizes P(W)
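
For real words the candidate set must contain the typed word itself; a one-line sketch reusing edits1 from above (the dictionary argument is our assumption):

```python
def candidates(w, dictionary):
    """Candidate set for w: the word itself plus every dictionary
    word within one edit (slide 21)."""
    return {w} | {c for c in edits1(w) if c in dictionary}
```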

  22. Incorporating context words: Context-sensitive spelling correction • Determining whether actress or across is appropriate requires looking at the context of use • A bigram language model conditions the probability of a word on (just) the previous word: P(w_1 … w_n) = P(w_1) P(w_2|w_1) … P(w_n|w_{n-1})

  23. Incorporating context words • For unigram counts, P(w_i) is always non-zero if our dictionary is derived from the document collection • This won't be true of P(w_i|w_{i-1}); we need to smooth • Add-1 smoothing on this conditional distribution • Or interpolate a unigram and a bigram: P_li(w_i|w_{i-1}) = λ P(w_i) + (1 − λ) P(w_i|w_{i-1})
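
A small sketch of the interpolated bigram model (the mixing weight λ is a free parameter one would tune on held-out data; names are ours):

```python
from collections import Counter

def bigram_lm(tokens, lam=0.5):
    """P_li(w_i | w_{i-1}) = lam * P(w_i) + (1 - lam) * P(w_i | w_{i-1})."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    T = len(tokens)

    def prob(prev, w):
        p_uni = unigrams[w] / T
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        return lam * p_uni + (1 - lam) * p_bi

    return prob
```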

  24. Using a bigram language model

  25. Using a bigram language model

  26. Noisy channel for real-word spell correction

  27. Noisy channel for real-word spell correction

  28. Simplification: One error per sentence • Out of all possible candidate sentences, consider only those that differ from the typed sentence in a single word, as sketched below
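
A sketch of the whole real-word pipeline under this simplification, wiring together the helpers assumed earlier (here candidates is a one-argument callable, e.g. a closure over the earlier candidates() and a dictionary; channel_prob takes typed and intended words, and lm_score scores a word sequence — all names are ours):

```python
def correct_sentence(words, candidates, channel_prob, lm_score):
    """Try every sentence that differs from `words` in at most one word,
    scored by P(W) * prod_i P(typed_i | chosen_i); the no-error
    probability P(w|w) from slide 30 covers unchanged words."""
    def score(sentence):
        p = lm_score(sentence)
        for typed, chosen in zip(words, sentence):
            p *= channel_prob(typed, chosen)
        return p

    best, best_score = list(words), score(words)
    for i, w in enumerate(words):
        for c in candidates(w):
            trial = list(words)
            trial[i] = c
            s = score(trial)
            if s > best_score:
                best, best_score = trial, s
    return best
```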

  29. Where to get the probabilities • Language model: • Unigram • Bigram • Channel model: • Same as for non-word spelling correction • Plus we need the probability of no error, P(w|w)

  30. Probability of no error • What is the channel probability for a correctly typed word? P("the" | "the") • If you have a big corpus, you can estimate this percentage of correctly typed words • But this value depends strongly on the application
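
A trivial sketch of how the no-error probability plugs into the channel model (the value 0.95 is an arbitrary placeholder, not from the slides; as noted above, the right value is application-dependent):

```python
def channel_with_no_error(typed, intended, edit_channel, p_no_error=0.95):
    """P(typed | intended): a constant for a correctly typed word,
    otherwise the edit-based channel model from earlier."""
    return p_no_error if typed == intended else edit_channel(typed, intended)
```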

  31. Peter Norvig's "thew" example • For the typed string thew, the language model strongly favors the frequent word the, while the channel model favors leaving the rare but real word thew unchanged

  32. Improvements to channel model • Allow richer edits (Brill and Moore 2000): ent → ant, ph → f, le → al • Incorporate pronunciation into the channel (Toutanova and Moore 2002) • Incorporate the device into the channel • Not all Android phones need the same error model • But spell correction may be done at the system level
