Appears in: Carl Weir (ed.), Statistically-Based Natural Language Processing Techniques: Papers from the 1992 Workshop, pp. 20-27. Technical Report W-92-01, AAAI Press, Menlo Park, 1992.

A Probabilistic Parser and Its Application

Mark A. Jones
AT&T Bell Laboratories
600 Mountain Avenue, Rm. 2B-435
Murray Hill, NJ 07974-0636
jones@research.att.com

Jason M. Eisner
Emmanuel College, Cambridge
Cambridge CB2 3AP England
jme14@phoenix.cambridge.ac.uk

Abstract

We describe a general approach to the probabilistic parsing of context-free grammars. The method integrates context-sensitive statistical knowledge of various types (e.g., syntactic and semantic) and can be trained incrementally from a bracketed corpus. We introduce a variant of the GHR context-free recognition algorithm, and explain how to adapt it for efficient probabilistic parsing. In split-corpus testing on a real-world corpus of sentences from software testing documents, with 20 possible parses for a sentence of average length, the system finds and identifies the correct parse in 96% of the sentences for which it finds any parse, while producing only 1.03 parses per sentence for those sentences. Significantly, this success rate would be only 79% without the semantic statistics.

Introduction

In constrained domains, natural language processing can often provide leverage. At AT&T, for instance, NL technology can potentially help automate many aspects of software development. A typical example occurs in the software testing area. Here 250,000 English sentences specify the operational tests for a telephone switching system. The challenge is to extract at least the surface content of this highly referential, naturally occurring text, as a first step in automating the largely manual testing process. The sentences vary in length and complexity, ranging from short sentences such as "Station B3 goes onhook" to 50 word sentences containing parentheticals, subordinate clauses, and conjunction. Fortunately the discourse is reasonably well focused: a large but finite number of telephonic concepts enter into a limited set of logical relationships. Such focus is characteristic of many sublanguages with practical importance (e.g., medical records).

We desire to press forward to NL techniques that are robust, that do not need complete grammars in advance, and that can be trained from existing corpora of sample sentences. Our approach to this problem grew out of earlier work [Jones et al 1991] on correcting the output of optical character recognition (OCR) systems. We were amazed at how much correction was possible using only low-level statistical knowledge about English (e.g., the frequency of digrams like "pa") and about common OCR mistakes (e.g., reporting "c" for "e"). As many as 90% of incorrect words could be fixed within the telephony sublanguage domain, and 70-80% for broader samples of English. Naturally we wondered whether more sophisticated uses of statistical knowledge could aid in such tasks as the one described above. The recent literature also reflects an increasing interest in statistical training methods for many NL tasks, including parsing [Jelinek and Lafferty 1991, Magerman and Marcus 1991, Bobrow 1991, Magerman and Weir 1992, Black, Jelinek, et al 1992], part of speech tagging [Church 1988], and corpora alignment [Dagan et al 1991, Gale and Church 1991].

Simply stated, we seek to build a parser that can construct accurate syntactic and semantic analyses for the sentences of a given language. The parser should know little or nothing about the target language, save what it can discover statistically from a representative corpus of analyzed sentences. When only unanalyzed sentences are available, a practical approach is to parse a small set of sentences by hand, to get started, and then to use the parser itself as a tool to suggest analyses (or partial analyses) for further sentences. A similar "bootstrapping" approach is found in [Simmons 1990]. The precise grammatical theory we use to hand-analyze sentences should not be crucial, so long as it is applied consistently and is not unduly large.

Parsing Algorithms

Following [Graham et al 1980], we adopt the following notation. An arbitrary context-free grammar is given by $G = (V, \Sigma, P, S)$, where $V$ is the vocabulary of all symbols, $\Sigma$ is the set of terminal symbols, $P$ is the set of rewrite rules, and $S$ is the start symbol. For an input sentence $w = a_1 a_2 \ldots a_n$, let $w_{i,j}$ denote the substring $a_{i+1} \ldots a_j$ and $w_i = w_{0,i}$ denote the prefix of length $i$. We use Greek letters ($\alpha, \beta, \ldots$) to denote