Mitsubishi Electric Research Laboratories, Cambridge


  1. MITSUBISHI ELECTRIC RESEARCH LABORATORIES, CAMBRIDGE RESEARCH CENTER

Deterministic Part-Of-Speech Tagging with Finite State Transducers

Emmanuel Roche and Yves Schabes
Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139
e-mail: roche@merl.com and schabes@merl.com

TR-94-07. Version 3.0, March 1995

Abstract

Stochastic approaches to natural language processing have often been preferred to rule-based approaches because of their robustness and their automatic training capabilities. This was the case for part-of-speech tagging until Brill showed how state-of-the-art part-of-speech tagging can be achieved with a rule-based tagger by inferring rules from a training corpus. However, current implementations of the rule-based tagger run more slowly than previous approaches. In this paper, we present a finite-state tagger inspired by the rule-based tagger which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in a deterministic finite-state machine. This result is achieved by encoding the application of the rules found in the tagger as a non-deterministic finite-state transducer and then turning it into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger whose speed is dominated by the access time of mass storage devices. We then generalize the techniques to the class of transformation-based systems.

Published in Computational Linguistics, June 1995, 21(2), 227-253.

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories of Cambridge, Massachusetts; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, 1995
201 Broadway, Cambridge, Massachusetts 02139

  2. Revision history.
       1. Version 1.0, May 2nd 1994.
       2. Version 1.1, June 16th 1994.
       3. Version 1.2, June 22nd 1994.
       4. Version 1.3, July 27th 1994.
       5. Version 1.4, July 1994.
       6. Version 2.0, December 9th 1994.
       7. This version is Revision 3.0 of Date: 95/03.

  3. 1 Introduction

Finite-state devices have important applications to many areas of computer science, including pattern matching, databases and compiler technology. Although their linguistic adequacy to natural language processing has been questioned in the past (Chomsky, 1964), there has recently been a dramatic renewal of interest in the application of finite-state devices to several aspects of natural language processing. This renewal of interest is due to the speed and the compactness of finite-state representations. This efficiency is explained by two properties: finite-state devices can be made deterministic, and they can be turned into a minimal form. Such representations have been successfully applied to different aspects of natural language processing, such as morphological analysis and generation (Karttunen, Kaplan, and Zaenen, 1992; Clemenceau, 1993), parsing (Roche, 1993; Tapanainen and Voutilainen, 1993), phonology (Laporte, 1993; Kaplan and Kay, 1994) and speech recognition (Pereira, Riley, and Sproat, 1994). Although finite-state machines have been used for part-of-speech tagging (Tapanainen and Voutilainen, 1993; Silberztein, 1993), none of these approaches has the same flexibility as stochastic techniques. Unlike stochastic approaches to part-of-speech tagging (Church, 1988; Kupiec, 1992; Cutting et al., 1992; Merialdo, 1990; DeRose, 1988; Weischedel et al., 1993), up to now the knowledge found in finite-state taggers has been handcrafted and cannot be automatically acquired.

Recently, Brill (1992) described a rule-based tagger which performs as well as taggers based upon probabilistic models and which overcomes the limitations common in rule-based approaches to language processing: it is robust and the rules are automatically acquired. In addition, the tagger requires drastically less space than stochastic taggers.
However, current implementations of Brill's tagger are considerably slower than the ones based on probabilistic models since it may require RCn elementary steps to tag an input of n words with R rules requiring at most C tokens of context.

Although the speed of current part-of-speech taggers is acceptable for interactive systems where a sentence at a time is being processed, it is not adequate for applications where large bodies of text need to be tagged, such as in information retrieval, indexing applications and grammar checking systems. Furthermore, the space required for part-of-speech taggers is also an issue in commercial personal computer applications such as grammar checking systems.

  4. In addition, part-of-speech taggers are often being coupled with a syntactic analysis module. Usually these two modules are written in different frameworks, making it very difficult to integrate interactions between the two modules.

In this paper, we design a tagger that requires n steps to tag a sentence of length n, independent of the number of rules and the length of the context they require. The tagger is represented by a finite-state transducer, a framework which can also be the basis for syntactic analysis. This finite-state tagger will also be found useful combined with other language components since it can be naturally extended by composing it with finite-state transducers which could encode other aspects of natural language syntax.

Relying on algorithms and formal characterization described in later sections, we explain how each rule in Brill's tagger can be viewed as a non-deterministic finite-state transducer. We also show how the application of all rules in Brill's tagger is achieved by composing each of these non-deterministic transducers and why non-determinism arises in this transducer. We then prove the correctness of the general algorithm for determinizing (whenever possible) finite-state transducers and we successfully apply this algorithm to the previously obtained non-deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in this deterministic finite-state machine. We also show how the lexicon used by the tagger can be optimally encoded using a finite-state machine.

The techniques used for the construction of the finite-state tagger are then formalized and mathematically proven correct.
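
The contrast drawn above, between applying R rules one after another and following a single path in a deterministic machine, can be illustrated with a small Python sketch. It is not the construction developed in this paper. It assumes a hypothetical rule format (src, dst, prev), meaning "change tag src to dst when the previous tag is prev", and it matches each rule against the tags as they stood before that rule fired. Because such rules look only at left context, each one compiles directly into a deterministic two-state transducer, and a cascade of them can be merged by a plain product construction. Rules with right context lead to the non-deterministic transducers and the determinization algorithm discussed in the text, which this sketch does not attempt. All names, rules and tags below are illustrative assumptions.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Rule = Tuple[str, str, str]  # (src, dst, prev): hypothetical rule format
Table = Dict[Tuple[Hashable, str], Tuple[str, Hashable]]
Machine = Tuple[Hashable, Table]  # (start state, transition table)


def apply_rules_naive(tags: Sequence[str], rules: Sequence[Rule]) -> List[str]:
    """Reference behaviour: one full pass over the tags for every rule."""
    current = list(tags)
    for src, dst, prev in rules:
        before = current
        current = [
            dst if i > 0 and tag == src and before[i - 1] == prev else tag
            for i, tag in enumerate(before)
        ]
    return current


def compile_rule(rule: Rule, tagset: Sequence[str]) -> Machine:
    """Two-state machine: state 1 means 'the previous input tag was prev'."""
    src, dst, prev = rule
    table: Table = {}
    for state in (0, 1):
        for tag in tagset:
            out = dst if state == 1 and tag == src else tag
            table[(state, tag)] = (out, 1 if tag == prev else 0)
    return 0, table


def compose(first: Machine, second: Machine, tagset: Sequence[str]) -> Machine:
    """Product machine that feeds the output of `first` into `second`."""
    (start1, table1), (start2, table2) = first, second
    start = (start1, start2)
    table: Table = {}
    todo, seen = [start], {start}
    while todo:
        state = todo.pop()
        for tag in tagset:
            mid, next1 = table1[(state[0], tag)]
            out, next2 = table2[(state[1], mid)]
            nxt = (next1, next2)
            table[(state, tag)] = (out, nxt)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, table


def compile_rules(rules: Sequence[Rule], tagset: Sequence[str]) -> Machine:
    """Merge the whole rule cascade into one deterministic machine."""
    machine = compile_rule(rules[0], tagset)
    for rule in rules[1:]:
        machine = compose(machine, compile_rule(rule, tagset), tagset)
    return machine


def run_transducer(machine: Machine, tags: Sequence[str]) -> List[str]:
    """Exactly one table lookup per input tag, whatever the number of rules."""
    state, table = machine
    output = []
    for tag in tags:
        out, state = table[(state, tag)]
        output.append(out)
    return output


if __name__ == "__main__":
    TAGSET = ["np", "vbn", "vbd", "nn", "vb", "to"]  # made-up tag inventory
    RULES = [("vbn", "vbd", "np"), ("nn", "vb", "vbd")]  # made-up rules
    tags = ["np", "vbn", "nn", "to", "vbn"]
    machine = compile_rules(RULES, TAGSET)
    print(run_transducer(machine, tags) == apply_rules_naive(tags, RULES))
```

The naive function makes one pass per rule, so its cost grows with the number of rules, whereas `run_transducer` performs n lookups for n tags once the rules have been compiled. The price is paid at compile time and in the size of the table, which the product construction can enlarge as rules are added.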
We introduce a proof of soundness and completeness with a worst-case complexity analysis for an algorithm for determinizing finite-state transducers. We conclude by proving how the method can be applied to the class of transformation-based error-driven systems.

2 Overview of Brill's Tagger

Brill's tagger is comprised of three parts, each of which is inferred from a training corpus: a lexical tagger, an unknown word tagger and a contextual tagger. For purposes of exposition, we will postpone the discussion of the
