Probabilistic Spelling Correction
CE-324: Modern Information Retrieval, Sharif University of Technology
M. Soleymani, Fall 2016
Most slides have been adapted from the lectures of Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Applications of spelling correction
Spelling tasks
Spelling error detection
Spelling error correction:
  Autocorrect: hte → the
  Suggest a correction
  Suggestion lists
Types of spelling errors
Non-word errors
  graffe → giraffe
Real-word errors
  Typographical errors: three → there
  Cognitive errors (homophones): piece → peace, too → two, your → you're
Real-word correction almost always needs to be context sensitive
Spelling correction steps
For each word w, generate a candidate set:
  Find candidate words with similar pronunciations
  Find candidate words with similar spellings
Choose the best candidate
  By the "weighted edit distance" or "noisy channel" approach
  Context-sensitive, so we have to consider whether the surrounding words "make sense"
  "Flying form Heathrow to LAX" → "Flying from Heathrow to LAX"
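The candidate-generation step can be sketched as follows. This is a minimal illustration that covers only the "similar spellings" case, by enumerating every string one edit away and keeping those found in a vocabulary. The function names `edits1` and `candidates`, the lowercase ASCII alphabet, and the one-edit limit are assumptions made for the example, not part of the slides.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return all strings one insertion, deletion, substitution,
    or adjacent transposition away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutions = [left + c + right[1:]
                     for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + substitutions + inserts)


def candidates(word, vocabulary):
    """Known words within one edit of `word`, plus `word` itself if known."""
    found = {w for w in edits1(word) if w in vocabulary}
    if word in vocabulary:
        found.add(word)
    return found
```

Keeping the typed word itself in the candidate set matters for real-word errors, where the typed word is in the dictionary but may still be wrong in context.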
Candidate testing: Damerau-Levenshtein edit distance
Minimal edit distance between two strings, where edits are:
  Insertion
  Deletion
  Substitution
  Transposition of two adjacent letters
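The distance above can be computed with dynamic programming. The sketch below implements the restricted variant usually called optimal string alignment, in which no substring is edited more than once; it assumes unit cost for all four edit types. The function name `osa_distance` is chosen for the example.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution, and
    transposition of two adjacent letters, each costing 1
    (optimal string alignment variant of Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: the last two letters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this definition, "hte" and "the" are one edit apart (a transposition), as are "three" and "there".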
Noisy channel intuition
Noisy channel
We see an observation x of a misspelled word
Find the correct word w
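Writing x for the observed (misspelled) word and w for the intended word, as in the channel model notation P(x|w), the search for the correct word can be stated with Bayes' rule. The vocabulary symbol V is introduced here for the example; the denominator P(x) is dropped because it does not depend on w.

```latex
\hat{w} = \arg\max_{w \in V} P(w \mid x)
        = \arg\max_{w \in V} \frac{P(x \mid w)\, P(w)}{P(x)}
        = \arg\max_{w \in V} P(x \mid w)\, P(w)
```

The two factors are the channel (error) model P(x|w) and the language model prior P(w).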
Language model
Take a big supply of words with T tokens:
  P(w) = C(w) / T, where C(w) = number of occurrences of w
Supply of words = your document collection
In other applications, you can take the supply to be typed queries (suitably filtered), when a static dictionary is inadequate
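The unigram estimate P(w) = C(w) / T can be sketched directly from a token list. The function name `unigram_model` and the use of a plain list of tokens as the "supply of words" are assumptions for the example.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities: P(w) = C(w) / T."""
    counts = Counter(tokens)      # C(w) for every word w
    total = sum(counts.values())  # T, the number of tokens
    return {word: count / total for word, count in counts.items()}
```

Words that never occur in the supply get no entry at all, which is why the choice of supply (document collection or query log) determines the effective dictionary.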
Unigram prior probability
Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)
Channel model probability
Also called error model probability or edit probability
Misspelled word x = x1, x2, x3, …, xm
Correct word w = w1, w2, w3, …, wn
P(x|w) = probability of the edit (deletion/insertion/substitution/transposition)
Calculating P(x|w)
Still a research question; it can be estimated.
Some simple ways exist, e.g., confusion matrices.
A confusion matrix is a square 26 × 26 table that records how many times one letter was incorrectly used instead of another.
Usually there are four confusion matrices: deletion, insertion, substitution, and transposition.
Computing error probability: confusion matrix
  del[x,y]: count(xy typed as x)
  ins[x,y]: count(x typed as xy)
  sub[x,y]: count(y typed as x)
  trans[x,y]: count(xy typed as yx)
Insertion and deletion are conditioned on the previous character
Confusion matrix for substitution
The cell [o,e] in a substitution confusion matrix would give the count of times that e was substituted for o.
Channel model
Smoothing probabilities: add-1 smoothing
Alphabet of |A| characters
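Add-1 smoothing for one edit type can be sketched as below. The sketch assumes the unsmoothed substitution estimate is sub[x,w] / count[w], which the surviving slide text does not state, so the smoothed form (sub[x,w] + 1) / (count[w] + |A|) is an assumption as well. The function name and the dictionary-based count tables are hypothetical.

```python
def smoothed_sub_prob(sub_counts, char_counts, typed, intended, alphabet_size=26):
    """Add-1 smoothed probability that `intended` was typed as `typed`.

    sub_counts maps (typed, intended) -> sub[typed, intended], following
    the convention sub[x,y] = count(y typed as x).
    char_counts maps a character to how often it occurs in correct text.
    Assumed form: (sub[x,w] + 1) / (count[w] + |A|).
    """
    numerator = sub_counts.get((typed, intended), 0) + 1
    denominator = char_counts.get(intended, 0) + alphabet_size
    return numerator / denominator
```

The effect is that a substitution never observed in the training errors still receives a small non-zero probability instead of ruling a candidate out entirely.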
Channel model for acress
Noisy channel for real-word spell correction
Given a sentence w1, w2, w3, …, wn
Generate a set of candidates for each word wi
  Candidate(w1) = {w1, w'1, w''1, w'''1, …}
  Candidate(w2) = {w2, w'2, w''2, w'''2, …}
  Candidate(wn) = {wn, w'n, w''n, w'''n, …}
Choose the sequence W that maximizes P(W)
Incorporating context words: context-sensitive spelling correction
Determining whether actress or across is appropriate will require looking at the context of use
A bigram language model conditions the probability of a word on (just) the previous word
  P(w_1 … w_n) = P(w_1) P(w_2 | w_1) … P(w_n | w_{n-1})
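The bigram factorization of a sentence probability can be sketched with unsmoothed maximum-likelihood counts. The function names are chosen for the example, sentences are assumed to be non-empty lists of tokens, and P(w_k | w_{k-1}) is approximated as C(w_{k-1} w_k) / C(w_{k-1}).

```python
from collections import Counter


def train_bigram(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """P(w_1 ... w_n) = P(w_1) * product of P(w_k | w_{k-1}), unsmoothed."""
    total = sum(unigrams.values())
    prob = unigrams[sentence[0]] / total
    for prev, word in zip(sentence, sentence[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob
```

Any bigram that never occurred in training drives the whole product to zero, which is the problem that smoothing addresses.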
Incorporating context words
For unigram counts, P(w_k) is always non-zero if our dictionary is derived from the document collection
This won't be true of P(w_k | w_{k-1}), so we need to smooth
  One option is add-1 smoothing on this conditional distribution
  Another is to interpolate a unigram and a bigram model
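Interpolating a unigram and a bigram estimate can be sketched as a weighted sum. The exact formula did not survive in the slide text, so the form λ·P_uni(w_k) + (1 − λ)·P_bi(w_k | w_{k-1}), the parameter name `lam`, and its default of 0.5 are assumptions for illustration.

```python
def interpolated_prob(word, prev, unigrams, bigrams, lam=0.5):
    """Linear interpolation of unigram and bigram estimates (assumed form):
    lam * P_uni(word) + (1 - lam) * P_bi(word | prev).

    unigrams maps word -> count; bigrams maps (prev, word) -> count.
    """
    total = sum(unigrams.values())
    p_uni = unigrams.get(word, 0) / total
    prev_count = unigrams.get(prev, 0)
    # Fall back to 0 for the bigram term when the history was never seen.
    p_bi = bigrams.get((prev, word), 0) / prev_count if prev_count else 0.0
    return lam * p_uni + (1 - lam) * p_bi
```

Because every dictionary word has a non-zero unigram probability, the interpolated value stays non-zero even for word pairs never seen together.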
Using a bigram language model
Simplification: One error per sentence
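Under the one-error-per-sentence simplification, the search space shrinks to the original sentence plus every variant with a single word replaced. The sketch below enumerates those variants and picks the highest-scoring one. The function names, the dictionary of per-word candidates, and the caller-supplied `score` function (for example, a language model probability combined with channel probabilities) are assumptions for the example.

```python
def one_error_sequences(sentence, candidates):
    """Yield the original sentence, then every variant in which
    exactly one word is replaced by one of its candidates."""
    words = list(sentence)
    yield words
    for i, word in enumerate(words):
        for alternative in candidates.get(word, ()):
            if alternative != word:
                yield words[:i] + [alternative] + words[i + 1:]


def best_sequence(sentence, candidates, score):
    """Return the sequence W with the highest score among the
    original sentence and its one-replacement variants."""
    return max(one_error_sequences(sentence, candidates), key=score)
```

The number of sequences to score grows with the total number of candidates across the sentence, rather than with the product of the candidate set sizes.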
Where to get the probabilities
Language model
  Unigram
  Bigram
Channel model
  Same as for non-word spelling correction
  Plus we need a probability for no error, P(w|w)
Probability of no error
What is the channel probability for a correctly typed word?
  P("the" | "the")
If you have a big corpus, you can estimate this percent correct
But this value depends strongly on the application
Peter Norvig's "thew" example
Improvements to channel model
Allow richer edits (Brill and Moore 2000)
  ent → ant
  ph → f
  le → al
Incorporate pronunciation into channel (Toutanova and Moore 2002)
Incorporate device into channel
  Not all Android phones need have the same error model
  But spell correction may be done at the system level