
The Computer and Natural Language (Ling 445/515)



  1. The Computer and Natural Language (Ling 445/515)
     Writers’ aids (Spelling and Grammar Correction)
     Markus Dickinson, Dept. of Linguistics, Indiana
     Autumn 2010
     Outline (Computers and Language: Writers’ aids): Introduction; Error causes; Keyboard mistypings; Phonetic errors; Knowledge problems; Challenges; Tokenization; Inflection; Productivity; Non-word error detection; Dictionaries; N-gram analysis; Isolated-word error correction; Rule-based methods; Similarity key techniques; Probabilistic methods; Minimum edit distance; Error correction for web queries; Grammar correction; Syntax and Computing; Grammar correction rules; Caveat emptor
     1 / 74

  2. Why people care about spelling
     ◮ Misspellings can cause misunderstandings.
     ◮ Standard spelling makes it easy to organize words & text:
        ◮ e.g., Without standard spelling, how would you look up things in a lexicon or thesaurus?
        ◮ e.g., Optical character recognition (OCR) software can use knowledge about standard spelling to recognize scanned words even for hardly legible input.
     ◮ Standard spelling makes it possible to provide a single text, accessible to a wide range of readers (different backgrounds, speaking different dialects, etc.).
     ◮ Using standard spelling can make a good impression in social interaction.

  3. How are spell checkers used?
     ◮ interactive spelling checkers = spell checker detects errors as you type.
        ◮ It may or may not make suggestions for correction.
        ◮ It needs a “real-time” response (i.e., it must be fast).
        ◮ It is up to the human to decide if the spell checker is right or wrong, and so we may not require 100% accuracy (especially with a list of choices).
     ◮ automatic spelling correctors = spell checker runs on a whole document, finds errors, and corrects them.
        ◮ A much more difficult task.
        ◮ A human may or may not proofread the results later.

  4. Detection vs. Correction
     ◮ There are two distinct tasks:
        ◮ error detection = simply find the misspelled words
        ◮ error correction = correct the misspelled words
     ◮ e.g., It might be easy to tell that ater is a misspelled word, but what is the correct word? water? later? after?
     ◮ Note that detection is a prerequisite for correction.
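
The split between detection and correction on the ater example can be sketched in Python as follows. This is only a minimal illustration, not the method the slides go on to present: `LEXICON` is a hypothetical toy word list, and the correction step assumes candidates lie within one single-character edit (insertion, deletion, substitution, or adjacent transposition) of the input.

```python
import string

# Hypothetical toy lexicon; a real checker would load a full word list.
LEXICON = {"water", "later", "after", "the", "is", "cold"}


def is_misspelled(word, lexicon=LEXICON):
    """Detection: flag any token that is not in the lexicon."""
    return word.lower() not in lexicon


def candidates(word, lexicon=LEXICON):
    """Correction: lexicon words exactly one single-character edit away."""
    word = word.lower()
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    edits = set()
    for left, right in splits:
        if right:
            edits.add(left + right[1:])                          # deletion
            edits.update(left + c + right[1:] for c in letters)  # substitution
        if len(right) > 1:
            edits.add(left + right[1] + right[0] + right[2:])    # transposition
        edits.update(left + c + right for c in letters)          # insertion
    return sorted(edits & lexicon)


print(is_misspelled("ater"))  # True
print(candidates("ater"))     # ['after', 'later', 'water']
```

Detection returns a single yes/no answer, while correction returns three equally close candidates, which is why correction needs extra evidence (context, frequencies, or a human choice).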

  5. Error causes: Keyboard mistypings
     Space bar issues
     ◮ run-on errors = two separate words become one
        ◮ e.g., the fuzz becomes thefuzz
     ◮ split errors = one word becomes two separate items
        ◮ e.g., equalization becomes equali zation
     ◮ Note that the resulting items might still be words!
        ◮ e.g., a tollway becomes atoll way
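
A rough Python sketch of how space bar errors interact with a dictionary lookup is given here. The lexicon and both helper functions are hypothetical and exist only for illustration. A run-on can be undone by trying every split point, and a split error by joining neighbouring tokens, but atoll way passes unnoticed because both pieces are real words.

```python
# Hypothetical toy lexicon for illustration only.
LEXICON = {"the", "fuzz", "equalization", "a", "tollway", "atoll", "way"}


def split_run_on(token, lexicon=LEXICON):
    """Return every way to split a token into two lexicon words."""
    return [(token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in lexicon and token[i:] in lexicon]


def join_split(tokens, lexicon=LEXICON):
    """Return (index, joined word) for adjacent tokens that form a lexicon word."""
    return [(i, tokens[i] + tokens[i + 1])
            for i in range(len(tokens) - 1)
            if tokens[i] + tokens[i + 1] in lexicon]


print(split_run_on("thefuzz"))            # [('the', 'fuzz')]
print(join_split(["equali", "zation"]))   # [(0, 'equalization')]

# Both tokens of "atoll way" are lexicon words, so a non-word check sees no error.
print(all(t in LEXICON for t in ["atoll", "way"]))  # True
```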

  6. Error causes: Keyboard mistypings (cont.)
     Keyboard proximity
     ◮ e.g., Jack becomes Hack since h and j are next to each other on a typical American keyboard
     Physical similarity
     ◮ similarity of shape, e.g., mistaking two physically similar letters when typing up something handwritten
        ◮ e.g., tight for fight
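
Keyboard proximity can be modelled with a table of neighbouring keys. The Python sketch below makes simplifying assumptions: it covers only the three letter rows of a US QWERTY layout and treats only left and right neighbours as adjacent, ignoring diagonal ones. The names `ROWS`, `build_neighbors`, and `proximity_variants` are hypothetical.

```python
# Letter rows of a US QWERTY layout; only horizontal neighbours are modelled.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows=ROWS):
    """Map each key to the set of keys directly to its left and right."""
    neighbors = {}
    for row in rows:
        for i, key in enumerate(row):
            adjacent = set()
            if i > 0:
                adjacent.add(row[i - 1])
            if i < len(row) - 1:
                adjacent.add(row[i + 1])
            neighbors[key] = adjacent
    return neighbors


NEIGHBORS = build_neighbors()


def proximity_variants(word, neighbors=NEIGHBORS):
    """All strings formed by replacing one letter with an adjacent key."""
    word = word.lower()
    variants = set()
    for i, ch in enumerate(word):
        for alt in neighbors.get(ch, ()):
            variants.add(word[:i] + alt + word[i + 1:])
    return variants


print("jack" in proximity_variants("hack"))  # True
```

A corrector could use such a table to rank substitutions between neighbouring keys above substitutions between distant ones.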

  7. Error causes: Phonetic errors
     phonetic errors = errors based on the sounds of a language (not necessarily on the letters)
     ◮ homophones = two words which sound the same
        ◮ e.g., red / read (past tense), cite / site / sight, they’re / their / there
     ◮ Spoonerisms = switching two letters/sounds around
        ◮ e.g., It’s a tavy grain with biscuit wheels.
     ◮ letter/word substitution: replacing a letter (or sequence of letters) with a similar-sounding one
        ◮ e.g., John kracked his nuckles. instead of John cracked his knuckles.

  8. Error causes: Knowledge problems
     ◮ not knowing a word and guessing its spelling (can be phonetic)
        ◮ e.g., sientist
     ◮ not knowing a rule and guessing it
        ◮ e.g., Do we double a consonant for ing words?
           jog → joging
           joke → jokking
     ◮ knowing something is odd about the spelling, but guessing the wrong thing
        ◮ e.g., typing siscors for the non-regular scissors

  9. Challenges & Techniques for spelling correction
     Before we turn to how we detect spelling errors, we’ll look briefly at three issues:
     ◮ Tokenization: What is a word?
     ◮ Inflection: How are some words related?
     ◮ Productivity of language: How many words are there?
     How we handle these issues determines how we build a dictionary.
     And then we’ll turn to the techniques used:
     ◮ Non-word error detection
     ◮ Isolated-word error correction
     ◮ Context-dependent word error detection and correction → grammar correction

  10. Tokenization
     Intuitively a “word” is simply whatever is between two spaces, but this is not always so clear.
     ◮ contractions = two words combined into one
        ◮ e.g., can’t, he’s, John’s [car] (vs. his car)
     ◮ multi-token words = (arguably) a single word with a space in it
        ◮ e.g., New York, in spite of, deja vu
     ◮ hyphens (note: can be ambiguous if a hyphen ends a line)
        ◮ Some are always a single word: e-mail, co-operate
        ◮ Others are two words combined into one: Columbus-based, sound-change
     ◮ Abbreviations: may stand for multiple words
        ◮ e.g., etc. = et cetera, ATM = Automated Teller Machine
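
To see why “whatever is between two spaces” falls short, the Python sketch below compares a naive whitespace split with a simple regular-expression tokenizer. The example sentence and the `TOKEN` pattern are assumptions made for illustration; the pattern keeps word-internal apostrophes and hyphens together and splits off all other punctuation, which is just one of many possible design choices.

```python
import re

text = "He can't find John's e-mail about New York, etc."

# Naive approach: a token is whatever sits between spaces.
naive = text.split()

# Regex sketch: letters with optional internal apostrophes or hyphens,
# otherwise any single non-space, non-letter character as its own token.
TOKEN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|[^\sA-Za-z]")
tokens = TOKEN.findall(text)

print(naive)
# ['He', "can't", 'find', "John's", 'e-mail', 'about', 'New', 'York,', 'etc.']
print(tokens)
# ['He', "can't", 'find', "John's", 'e-mail', 'about', 'New', 'York', ',', 'etc', '.']
```

Neither version is fully satisfying: the naive split glues the comma onto York, while the regex version strips the period that belongs to the abbreviation etc., and both leave New York as two tokens. Handling those cases needs extra knowledge, such as lists of abbreviations and multi-token words.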
