Gábor Csernyi Department of English Linguistics University of Debrecen gabor.csernyi@arts.unideb.hu http://ieas.unideb.hu/~csernyi
The architecture of the Xerox Linguistic Environment (XLE) follows a modular pattern: tokenizer(s); morphological analyzer(s); a lexicon, in the form of files containing lexical entries; a grammar, in the form of files comprising the grammatical rules (with functional annotations).
Different modules are responsible for the different (sub)tasks of parsing:
tokenization (tokenizer) → morphological analysis (morphological analyzer) → lexical lookup (lexicon) → chart parsing (syntax, semantics)
Relevant commands in XLE:
tokens
analyze-string <word-form>
parse "<(category:) string>"
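By way of illustration, a session with these commands might look as follows (the prompt, word form, and morphological tags are hypothetical and depend on the grammar and FSTs actually loaded):

```
% analyze-string házak
ház+N+Pl+Nom            illustrative output: stem plus tags from the FST
% parse "NP: a fiú"     parse the string as a phrase of category NP
% parse "a fiú eszik"   no category prefix: the ROOTCAT is assumed
% tokens                inspect the token lattice of the last parsed string
```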
DEMO HUNGARIAN CONFIG (1.0)
CHARACTERENCODING utf-8.
ROOTCAT ROOT.
FILES .
LEXENTRIES (DEMO HUNGARIAN).
RULES (DEMO HUNGARIAN).
TEMPLATES (DEMO HUNGARIAN).
GOVERNABLERELATIONS SUBJ OBJ OBJ2 OBL OBL-?+ COMP XCOMP.
SEMANTICFUNCTIONS ADJUNCT TOPIC.
NONDISTRIBUTIVES NUM PERS.
EPSILON e.
OPTIMALITYORDER NOGOOD.
----
The first line is special: it specifies the version of the grammar (DEMO); the language (HUNGARIAN); that this is the configuration file (CONFIG); and the XLE version number (1.0).
Other parts of the config file:
FILES: all the (external) files needed for parsing (and also generation): tokenizers, morphological FSTs and other transducers, the lexicon(s), the grammar file(s);
LEXENTRIES: the list of the lexicons (if there is more than one, order might be important);
RULES: the grammar (if the rules are structured into different files, the headers of each should be listed here);
TEMPLATES: reference to the template file;
GOVERNABLERELATIONS: a list of grammatical functions;
SEMANTICFUNCTIONS: attributes whose values are required to contain a PRED;
NONDISTRIBUTIVES: attributes that do not distribute under coordination;
EPSILON: a category that is not overt in the c-structure;
OPTIMALITYORDER: ranking of optimality constraints;
----: the configuration file is closed with four dashes.
The lexical entries, the morphology (if it exists, together with the tokenizer(s)), the rules and the templates sections all follow the same pattern: they start with a header (by which they can be called in the appropriate sections (LEXENTRIES, RULES, TEMPLATES, MORPHOLOGY) of the configuration file), and they are terminated with four dashes. These sections can be placed either in the configuration file itself, or they can be stored in separate files (e.g. a lexicon for nouns, a lexicon for verbs, etc.). In the latter case, the files must also be listed in the FILES section.
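Schematically, a rules section stored in its own file might look like this (the header matches the version and language of the demo grammar above; the single rule is just a placeholder):

```
DEMO HUNGARIAN RULES (1.0)

S --> NP VP.

----
```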
General form of rules:
category --> category1: schemata1;
             category2: schemata2;
             ... .
A simple rule: S --> NP VP. Each rule is terminated with a period.
Assigning grammatical functions in the rules (see the schemata above):
S --> NP: (^ SUBJ)=!
          (! CASE)=nom;
      VP.
When schemata are given and order is important, a semicolon must follow (or a period, if it ends the rule).
Optionality can be expressed with parentheses surrounding the optional element in the rule. When order is "not important" regarding the categories (on the right-hand side of the rule), they are separated (together with their schemata) by a comma.
VP --> V: ^=!,
       (NP: (^ OBJ)=!).
Disjunction: alternatives are separated by | and enclosed in curly brackets.
NP --> {(D) N | PRON}.
The Kleene star, "attached" to a category on the right-hand side of the rule, accounts for zero or more repetitions.
VP --> V: ^=!;
       (NP: (^ OBJ)=!)
       PP*.
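Putting these devices together, a single hypothetical rule can combine functional annotation, optionality, and the Kleene star (the acc case constraint is added here for illustration):

```
VP --> V: ^=!;
       (NP: (^ OBJ)=!
            (! CASE)=acc)
       PP*.
```

Here the verb is the head (^=!), the object NP is optional and must bear accusative case, and any number of PPs may follow.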
General form of a lexical entry:
word Category1 Morphcode1 Schemata1;
     Category2 Morphcode2 Schemata2;
     ... .
Here, Category is the category of the word; Morphcode tells XLE whether the analyses of the word are provided by the morphological analyzer (» the morphcode is XLE) or only by the lexical entry itself (» the morphcode is *); Schemata are similar to those in the grammar file.
Example:
eszik V * { (^ PRED)='eszik <(^ SUBJ)>'
          | (^ PRED)='eszik <(^ SUBJ) (^ OBJ)>'
            (^ OBJ CASE)=acc }
          (^ SUBJ CASE)=nom
          (^ SUBJ NUM)=sg
          (^ SUBJ PERS)=3
          (^ TNS-ASP TENSE)=pres
          (^ TNS-ASP MOOD)=indicative
          (^ TNS-ASP PROG)=+.
Multiple entries (for the same form) use special tags.
Related to whole entries:
ETC: extension of the previous entry;
ONLY: keep only this entry.
Placed in front of subentries:
+: add a new subentry;
-: remove a subentry;
!: override a subentry;
=: keep a subentry.
Example. The base entry:
baa N XLE @(NOUN baa);
    V XLE @(VERB baa);
    A XLE @(ADJ baa).
In some other, later part:
baa +P XLE @(PREP baa);
    =N;
    ONLY.
=> the effective entry:
baa N XLE @(NOUN baa);
    P XLE @(PREP baa).
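The @(NOUN baa) notation above is a template call; a matching templates section could be sketched as follows (the template bodies here are illustrative, not taken from an actual grammar):

```
DEMO HUNGARIAN TEMPLATES (1.0)

NOUN(P) = (^ PRED)='P'.

VERB(P) = (^ PRED)='P<(^ SUBJ)>'.

----
```

When the entry is looked up, the argument (here the stem baa) is substituted for the parameter P, so @(NOUN baa) expands to (^ PRED)='baa'.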
Finite-state tools: xfst, lexc
Finite-state transducers:
non-deterministic;
possible additional functions (Kaplan and Newman 1997):
normalization: removing additional white space;
editing: removing tags from annotated text;
capitalization handling: upper-case, lower-case;
contraction handling;
compound-word isolation / multiword-expression recognition;
using more than one transducer: through composition.
Finite-state transducers (other external analyzers are also possible, provided their output is properly mapped to what XLE expects at this level) => effective in terms of speed.
Non-deterministic: multiple analyses where possible.
Stemming; morphological features as tags (e.g. +Nom).
Composition of morphological FSTs is possible (also: union or priority union).
Guessers (Kaplan et al. 2004).
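A toy lexc source for such a morphological FST might look like this (the tag inventory and forms are illustrative only; a real Hungarian analyzer is far richer):

```
Multichar_Symbols +N +Sg +Pl +Nom +Acc

LEXICON Root
ház N ;

LEXICON N
+N+Sg+Nom:0   # ;
+N+Pl+Nom:ak  # ;
+N+Sg+Acc:at  # ;
```

Compiled with lexc (or read into xfst), this yields a transducer relating surface házak to the analysis ház+N+Pl+Nom, which can then be composed with a tokenizer and referenced in the MORPHOLOGY section.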
References:
Butt, Miriam, King, Tracy H., Niño, María-Eugenia and Segond, Frédérique. 1999. A Grammar Writer's Cookbook. Stanford: CSLI Publications.
Kaplan, Ronald M. and Newman, Paula S. 1997. Lexical Resource Reconciliation in the Xerox Linguistic Environment. In ACL/EACL'97 Workshop on Computational Environments for Grammar Development and Linguistic Engineering, 54-61.
Kaplan, Ronald M., Maxwell, John T., King, Tracy H. and Crouch, Richard. 2004. Integrating Finite-state Technology with Deep LFG Grammars. In Proceedings of the ESSLLI 2004 Workshop on Combining Shallow and Deep Processing for NLP.