Linguistics 384: Language and Computers
Topic 4: Writer’s aids (Spelling and Grammar Correction)

Scott Martin ∗
Dept. of Linguistics, OSU
Winter 2008

∗ The course was created by Chris Brew, Markus Dickinson and Detmar Meurers.

Outline
  Introduction
  Error causes
    Keyboard mistypings
    Phonetic errors
    Knowledge problems
  Difficult issues
    Tokenization
    Inflection
    Productivity
  Non-word error detection
    Dictionaries
    N-gram analysis
  Isolated-word error correction
    Rule-based methods
    Similarity key techniques
    Probabilistic methods
    Minimum edit distance
  Grammar correction
    Syntax
    Computing with Syntax
    Grammar correction rules
  Caveat emptor

Who cares about spelling?

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

(See http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/ for the story behind this supposed research report.)

A dtcoor has aimttded the magltheuansr of a taegene cceanr ptinaet who deid aetfr a haptosil durg bednlur.

Why people care about spelling

◮ Misspellings can cause misunderstandings and real-life problems:
  ◮ For example:
    ◮ Did you see her god yesterday? It’s a big golden retriever.
    ◮ This will be a fee [free] concert.
    ◮ 1991 Bell Atlantic & Pacific Bell telephone network outages were partly caused by a typographical error: A 6 in a line of computer code was supposed to be a D. “That one error caused the equipment and software to fail under an avalanche of computer-generated messages.” (Wall Street Journal, Nov. 25, 1991)

Why people care about spelling (cont.)

◮ Standard spelling makes it easy to organize words and text:
  ◮ e.g., Without standard spelling, how would you look up things in a lexicon or thesaurus?
  ◮ e.g., Optical character recognition software can use knowledge about standard spelling to recognize scanned words even for hardly legible input.
◮ Standard spelling makes it possible to provide a single text, which is accessible to a wide range of readers (different backgrounds, speaking different dialects, etc.).
◮ Using standard spelling is associated with being well-educated, i.e., is used to make a good impression in social interaction.

How are spell checkers used?

◮ interactive spelling checkers = spell checker detects errors as you type.
  ◮ It may or may not make suggestions for correction.
  ◮ Requires a “real-time” response (i.e., must be fast)
  ◮ It is up to the human to decide if the spell checker is right or wrong.
  ◮ If there is a list of choices, we may not require 100% accuracy in the corrected word
◮ automatic spelling correctors = spell checker runs on a whole document, finds errors, and corrects them
  ◮ A much more difficult task.
  ◮ A human may or may not proofread the results later.

Detection vs. Correction

◮ There are two distinct tasks:
  ◮ error detection = simply find the misspelled words
  ◮ error correction = correct the misspelled words
◮ e.g., It might be easy to tell that ater is a misspelled word, but what is the correct word? water? later? after?
⇒ Which of the two tasks we need depends on what we want to do with the results.
Note, though, that detection is a prerequisite for correction.

What causes errors?

◮ Keyboard mistypings
◮ Phonetic errors
◮ Knowledge problems

Keyboard mistypings

Keyboard proximity
◮ e.g., Jack becomes Hack since h and j are next to each other on a typical American keyboard

Physical similarity
◮ similarity of shape, e.g., mistaking two physically similar letters when typing up something handwritten
  ◮ e.g., tight for fight

Keyboard mistypings (cont.)

Space bar issues
◮ run-on errors = two separate words become one
  ◮ e.g., the fuzz becomes thefuzz
◮ split errors = one word becomes two separate items
  ◮ e.g., equalization becomes equali zation

Note that the resulting items might still be words!
◮ e.g., a tollway becomes atoll way
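
The difference between detection and correction, and the way space bar errors can hide from a dictionary check, can be illustrated with a small sketch. It is only an illustration under stated assumptions: the tiny LEXICON and the function names is_error, insertion_candidates and run_on_candidates are invented for this example and do not come from the course materials, and single-letter insertion is just one of many possible ways to propose corrections.

```python
import string

# Toy lexicon: an assumption for illustration, not a real dictionary.
LEXICON = {"water", "later", "after", "jack", "hack", "the", "fuzz",
           "a", "tollway", "atoll", "way", "equalization"}


def is_error(word):
    """Error detection: flag any word that is not in the lexicon."""
    return word.lower() not in LEXICON


def insertion_candidates(word):
    """Error correction (one simple strategy): return the lexicon words
    that can be reached by inserting a single letter into the word."""
    word = word.lower()
    found = set()
    for i in range(len(word) + 1):
        for ch in string.ascii_lowercase:
            candidate = word[:i] + ch + word[i:]
            if candidate in LEXICON:
                found.add(candidate)
    return sorted(found)


def run_on_candidates(word):
    """Possible repairs for a run-on error: every way of splitting the
    string into two pieces that are both lexicon words."""
    word = word.lower()
    return [(word[:i], word[i:]) for i in range(1, len(word))
            if word[:i] in LEXICON and word[i:] in LEXICON]


if __name__ == "__main__":
    print(is_error("ater"))               # detection
    print(insertion_candidates("ater"))   # correction candidates
    print(run_on_candidates("thefuzz"))   # run-on repair
    print(run_on_candidates("atollway"))  # ambiguous repair
    print(is_error("Hack"))               # real-word error goes undetected
```

With this toy lexicon, ater is detected as an error, but correction yields three equally plausible candidates (after, later, water), so choosing among them needs more information than the lexicon provides. Hack and atoll way pass the detection step because every item is itself a word, which is why the slides note that the resulting items might still be words.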