
Language and Computers: Writers’ Aids



  1. Language and Computers: Writers’ Aids
     Based on Dickinson, Brew, & Meurers (2013). Indiana University, Fall 2013.
     Outline: Introduction; Error causes; Keyboard mistypings; Phonetic errors; Knowledge problems; Challenges; Tokenization; Inflection; Productivity; Non-word error detection; Dictionaries; N-gram analysis; Isolated-word error correction; Rule-based methods; Similarity key techniques; Probabilistic methods; Minimum edit distance; Error correction for web queries; Grammar correction; Syntax and Computing; Grammar correction rules; Caveat emptor

  2. Why people care about spelling
     ◮ Misspellings can cause misunderstandings.
     ◮ Standard spelling makes it easy to organize words & text:
        ◮ e.g., Without standard spelling, how would you look up things in a lexicon or thesaurus?
        ◮ e.g., Optical character recognition (OCR) software can use knowledge about standard spelling to recognize scanned words even for hardly legible input.
     ◮ Standard spelling makes it possible to provide a single text, accessible to a wide range of readers (different backgrounds, speaking different dialects, etc.).
     ◮ Using standard spelling can make a good impression in social interaction.

  3. How are spell checkers used?
     ◮ interactive spelling checkers = spell checker detects errors as you type.
        ◮ It may or may not make suggestions for correction.
        ◮ It needs a “real-time” response (i.e., must be fast).
        ◮ It is up to the human to decide if the spell checker is right or wrong, and so we may not require 100% accuracy (especially with a list of choices).
     ◮ automatic spelling correctors = spell checker runs on a whole document, finds errors, and corrects them.
        ◮ A much more difficult task.
        ◮ A human may or may not proofread the results later.

  4. Detection vs. Correction
     ◮ There are two distinct tasks:
        ◮ error detection = simply find the misspelled words
        ◮ error correction = correct the misspelled words
           ◮ e.g., It might be easy to tell that ater is a misspelled word, but what is the correct word? water? later? after?
     ◮ Note that detection is a prerequisite for correction.
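
The difference between the two tasks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `LEXICON` and the function names are made up for this example, and generating candidates within one edit is just one possible way to propose corrections, not necessarily the method the slides go on to present.

```python
import string

# Toy lexicon (an assumption for illustration, not a real dictionary).
LEXICON = {"water", "later", "after", "the", "fuzz"}

def is_misspelled(word, lexicon=LEXICON):
    """Detection: flag any word that is not in the lexicon."""
    return word.lower() not in lexicon

def one_edit_candidates(word, lexicon=LEXICON):
    """Correction candidates: lexicon words reachable from `word` by one
    deletion, transposition, substitution, or insertion of a letter."""
    word = word.lower()
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    edits = deletes | transposes | substitutes | inserts
    return sorted((edits & lexicon) - {word})

print(is_misspelled("ater"))        # detection alone is a yes/no answer
print(one_edit_candidates("ater"))  # correction must choose among several words
```

With this toy lexicon, detection gives a single answer for ater, while correction still has to pick between after, later, and water.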

  5. Error causes: Keyboard mistypings
     Space bar issues
     ◮ run-on errors = two separate words become one
        ◮ e.g., the fuzz becomes thefuzz
     ◮ split errors = one word becomes two separate items
        ◮ e.g., equalization becomes equali zation
     ◮ Note that the resulting items might still be words!
        ◮ e.g., a tollway becomes atoll way
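
A minimal sketch of how space bar errors might be checked against a word list. The `LEXICON` contents and both helper names are assumptions made for this example only.

```python
# Toy lexicon (an assumption for illustration).
LEXICON = {"the", "fuzz", "equalization", "a", "tollway", "atoll", "way"}

def split_run_on(token, lexicon=LEXICON):
    """Return every way to split `token` into two lexicon words."""
    return [(token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in lexicon and token[i:] in lexicon]

def join_split(left, right, lexicon=LEXICON):
    """Return the joined word if two adjacent tokens form a lexicon word,
    otherwise None."""
    joined = left + right
    return joined if joined in lexicon else None

print(split_run_on("thefuzz"))         # one plausible repair of a run-on
print(join_split("equali", "zation"))  # repair of a split error
print(split_run_on("atollway"))        # two competing splits
```

The last call shows why the atoll way case is hard: both atoll and way are in the lexicon, so a lookup alone raises no alarm, and the string atollway can be split in two different valid ways.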

  6. Error causes: Keyboard mistypings (cont.)
     Keyboard proximity
     ◮ e.g., Jack becomes Hack since h and j are next to each other on a typical American keyboard
     Physical similarity
     ◮ similarity of shape, e.g., mistaking two physically similar letters when typing up something handwritten
        ◮ e.g., tight for fight
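
Keyboard proximity can be modeled with a simple table of key rows. The sketch below assumes a QWERTY layout and only considers left and right neighbors on the same row. The names `QWERTY_ROWS`, `horizontal_neighbors`, and `is_adjacent_key_slip` are hypothetical.

```python
# Letter rows of a QWERTY keyboard (assumed layout).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def horizontal_neighbors(ch):
    """Keys immediately to the left and right of `ch` on the same row."""
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return set(row[max(i - 1, 0):i] + row[i + 1:i + 2])
    return set()

def is_adjacent_key_slip(typed, intended):
    """True if the words differ in exactly one position and the typed
    letter sits next to the intended letter on the keyboard."""
    if len(typed) != len(intended):
        return False
    diffs = [(t, c) for t, c in zip(typed.lower(), intended.lower()) if t != c]
    return len(diffs) == 1 and diffs[0][0] in horizontal_neighbors(diffs[0][1])

print(is_adjacent_key_slip("Hack", "Jack"))    # h and j are neighbors
print(is_adjacent_key_slip("tight", "fight"))  # t and f are not on the same row
```

The tight / fight pair is not explained by this table, which fits the slide's point that physical similarity of letter shapes is a separate cause from keyboard proximity.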

  7. Error causes: Phonetic errors
     phonetic errors = errors based on the sounds of a language (not necessarily on the letters)
     ◮ homophones = two words which sound the same
        ◮ e.g., red / read (past tense), cite / site / sight, they’re / their / there
     ◮ letter/word substitution: replacing a letter (or sequence of letters) with a similar-sounding one
        ◮ e.g., John kracked his nuckles. instead of John cracked his knuckles.
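
Because every homophone is itself a real word, a dictionary lookup cannot flag it. One simple way to at least surface the alternatives is a hand-built table of confusion sets, sketched below. The table contains only the examples from the slide, and the function name is an assumption.

```python
# Confusion sets taken from the slide's examples (far from complete).
CONFUSION_SETS = [
    {"red", "read"},
    {"cite", "site", "sight"},
    {"they're", "their", "there"},
]

def homophone_alternatives(word):
    """Return the other members of the word's confusion set, sorted,
    or an empty list if the word is in no set."""
    w = word.lower()
    for group in CONFUSION_SETS:
        if w in group:
            return sorted(group - {w})
    return []

print(homophone_alternatives("site"))
print(homophone_alternatives("Their"))
print(homophone_alternatives("knuckles"))
```

Choosing among the alternatives needs the surrounding context, which a table like this does not provide.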

  8. Error causes: Knowledge problems
     ◮ not knowing a word and guessing its spelling (can be phonetic)
        ◮ e.g., sientist
     ◮ not knowing a rule and guessing it
        ◮ e.g., Do we double a consonant for ing words? jog → joging, joke → jokking
     ◮ knowing something is odd about the spelling, but guessing the wrong thing
        ◮ e.g., typing siscors for the non-regular scissors
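
The consonant-doubling question can be made concrete with a rough heuristic. The sketch below is a deliberate simplification and not a full account of English spelling: it ignores stress, so it handles short verbs such as jog and joke but wrongly doubles the final letter of longer verbs such as visit. The function name `add_ing` is an assumption.

```python
VOWELS = set("aeiou")

def add_ing(verb):
    """Attach -ing using two simplified spelling rules (a heuristic only)."""
    # Rule 1: drop a final 'e' that follows a consonant (joke -> joking).
    if len(verb) > 2 and verb.endswith("e") and verb[-2] not in VOWELS:
        return verb[:-1] + "ing"
    # Rule 2: double a final consonant in a consonant-vowel-consonant
    # ending (jog -> jogging), except for w, x, and y.
    if (len(verb) >= 3
            and verb[-1] not in VOWELS and verb[-1] not in "wxy"
            and verb[-2] in VOWELS
            and verb[-3] not in VOWELS):
        return verb + verb[-1] + "ing"
    # Otherwise just append the suffix (read -> reading).
    return verb + "ing"

for verb in ["jog", "joke", "read", "see"]:
    print(verb, "->", add_ing(verb))
```

A writer who applies the two rules to the wrong verbs produces exactly the errors on the slide: joging skips the doubling rule, and jokking applies it where the final e should be dropped instead.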

  9. Challenges & Techniques for spelling correction
     Before we turn to how we detect spelling errors, we’ll look briefly at three issues:
     ◮ Tokenization: What is a word?
     ◮ Inflection: How are some words related?
     ◮ Productivity of language: How many words are there?
     How we handle these issues determines how we build a dictionary.
     And then we’ll turn to the techniques used:
     ◮ Non-word error detection
     ◮ Isolated-word error correction
     ◮ Context-dependent word error detection and correction → grammar correction

  10. Tokenization
      Intuitively a “word” is simply whatever is between two spaces, but this is not always so clear.
      ◮ contractions = two words combined into one
         ◮ e.g., can’t, he’s, John’s [car] (vs. his car)
      ◮ multi-token words = (arguably) a single word with a space in it
         ◮ e.g., New York, in spite of, deja vu
      ◮ hyphens (note: can be ambiguous if a hyphen ends a line)
         ◮ Some are always a single word: e-mail, co-operate
         ◮ Others are two words combined into one: Columbus-based, sound-change
      ◮ Abbreviations: may stand for multiple words
         ◮ e.g., etc. = et cetera, ATM = Automated Teller Machine
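
A small tokenizer makes these decisions explicit. The sketch below uses Python's standard `re` module and is one possible set of choices, not a reference implementation: it keeps contractions and hyphenated forms as single tokens, keeps the period on listed abbreviations, and merges listed two-token words. The lists `MULTI_TOKEN_WORDS` and `ABBREVIATIONS` are hypothetical and tiny, and three-token items such as in spite of are not handled.

```python
import re

# Hypothetical resources for illustration; real systems use much larger lists.
MULTI_TOKEN_WORDS = {("new", "york"), ("deja", "vu")}
ABBREVIATIONS = {"etc."}

# A word (optionally with internal hyphens or apostrophes and an optional
# trailing period), or any single non-letter, non-space character.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:[-'’][A-Za-z]+)*\.?|[^\sA-Za-z]")

def tokenize(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        # Split off a trailing period unless the token is a known abbreviation.
        if len(tok) > 1 and tok.endswith(".") and tok.lower() not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    # Merge adjacent tokens that form a listed multi-token word.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MULTI_TOKEN_WORDS:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize("He can't e-mail New York, etc."))
print(tokenize("It works."))
```

Each choice here is debatable: a different application might split can't into two tokens or treat Columbus-based as two words, which is the slide's point that "word" is not a clear-cut notion.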
