Dependency-Based Hybrid Syntactic Analysis for Languages with a Rather Free Word Order. Guntis Bārzdiņš, Normunds Grūzītis, Gunta Nešpore and Baiba Saulīte. Institute of Mathematics and Computer Science, University of Latvia. NODALIDA 2007, Tartu, May 25-26, 2007
SemTi-Kamols Project • Initial intention: – Integration of Latvian and the latest semantic web technologies • Natural language is a challenge and a good measure for advanced semantic web development – Ontology based natural language processing • Text meaning representation (TMR) • Current course: – Grammatical analysis of a raw text • Creation of a morphologically and syntactically annotated corpus • Chunking of Latvian web pages – Model building from a controlled text (TMR)
The Problem • To develop a grammar model that facilitates TMR • Top-down approach – What is the sense of a word? • Bottom-up approach – Conveyance of the meaning from the surface to the model
[Figure: ontology nodes T, EVENT, OBJECT, ..., CHANGE_LOCATION and the words move, run; labels: syntactic structures, semantic structures]
Dependency Grammars • Bottom-up approach by drawing functional links between words • Good for inflective languages: declensions, free word order • Facilitates analysis via a relatively small set of rules • Convenience (linguistic tradition) & flexibility • Direct mapping to semantic structures: verb / predicate / event
[Figure: two dependency trees over the same three words. Tree 1: vīrs [Noun, Nom] (the man) paņēma [Verb] (took) grāmatu [Noun, Sg, Acc] (the book). Tree 2, reordered: grāmatu (the book) paņēma (is taken) vīrs (by the man). In both trees vīrs is linked to the verb as subject / agent and grāmatu as object / object.]
subject: [Noun,Nom] -> [Verb] (semantic role: agent)
object: [Noun,Sg,Acc] -> [Verb] (semantic role: object)
Hybrid Grammars • Constituency-based approaches with dependency elements – Head-driven phrase structure grammar
S → NP {GEN,NUM,nom} VP {GEN,NUM}
NP → Det Adj {GEN,NUM,CASE} Noun {GEN,NUM,CASE}
VP → Verb NP {acc}
– TIGER annotation scheme • Nodes = constituents, edges = functions • Discontinuous constituents • Dependency-based approaches with constituency elements – Our hybrid parsing method • Good for rather free word order languages with analytical word forms
x-Words • The concept of an “x-word” is, in a sense, the core idea • It acts as a glue between the two worlds due to its dual nature – A non-terminal symbol in the phrase structure grammar, which during parsing substitutes all the entities forming this constituent – A regular word that can act as a head for depending words and/or as a dependent of another head word according to the dependency grammar
([_,[v,aux,Tense,Nr,_]], [_,[v,aux,past,Nr,_]], [_,[v,main,Ø,Ø,Trans]]) → [x-verb,[v,main,Tense,Nr,Trans,perf]]
ir bijis jādod: ‘have had to give’
• Simple sentence structure + dependencies & agreements
Implementation
• A list of “simple” word forms along with their morphological features (A-Table) • Acquired on-the-fly via morphological analysis
A-Table (Word | Morphological Features):
vasarā | [n,f,sg,loc]
var | [v,aux,present,pl,trans]
peldēties | [v,m,inf,0,intrans]
• A list of multi-word patterns (X-Table) • x-Words can be nested in other x-Words • Explicit and implicit constituents may interleave
X-Table (x-Word | Morphology | Constituents):
x-coord | ... | ...
x-prep | ... | ...
x-verb | ... | ...
• A list of possible head-dependant pairs (B-Table) • Simple word and x-word heads/dependants are treated equally
B-Table (Function | Head | Dependant):
modifier | [_,{v,m}] | [_,{n,loc}]
subject | [x-verb,{v,m,Nr}] | [_,{n,Nr,nom}]
attribute | [_,{n}] | [_,{n,gen}]
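The B-Table lookup described on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set-based feature encoding, the `"word"` kind for simple words, the `NUMBER_VALUES` range for `Nr`, the features given to dodas, and the function names are all assumptions made for illustration. Only the three B-Table rows and the A-Table entries come from the slide.

```python
# Hypothetical encoding of the A-Table and B-Table rows shown on the slide.
# Feature bundles are plain sets; "Nr" is treated as an agreement variable.
NUMBER_VALUES = {"sg", "pl"}  # assumed value range for Nr

# A-Table: word form -> (kind, features); "word" marks a simple word (assumed).
A_TABLE = {
    "vasarā": ("word", {"n", "f", "sg", "loc"}),
    "var": ("word", {"v", "aux", "present", "pl", "trans"}),
    "peldēties": ("word", {"v", "m", "inf", "0", "intrans"}),
}

# B-Table rows: (function, head kind, head features, dep. kind, dep. features)
B_TABLE = [
    ("modifier", "_", {"v", "m"}, "_", {"n", "loc"}),
    ("subject", "x-verb", {"v", "m", "Nr"}, "_", {"n", "Nr", "nom"}),
    ("attribute", "_", {"n"}, "_", {"n", "gen"}),
]


def side_matches(item, want_kind, want_feats):
    """True if item has the wanted kind ("_" = any) and all constant features."""
    kind, feats = item
    constants = want_feats - {"Nr"}
    return want_kind in ("_", kind) and constants <= feats


def possible_functions(head, dependant):
    """List the B-Table functions under which dependant may attach to head."""
    found = []
    for function, h_kind, h_feats, d_kind, d_feats in B_TABLE:
        if not side_matches(head, h_kind, h_feats):
            continue
        if not side_matches(dependant, d_kind, d_feats):
            continue
        if "Nr" in h_feats and "Nr" in d_feats:
            # Agreement: both sides must carry the same number value.
            if not (head[1] & dependant[1] & NUMBER_VALUES):
                continue
        found.append(function)
    return found


# Usage: the features of "dodas" are assumed, not taken from the slide.
dodas = ("word", {"v", "m", "present", "pl"})
print(possible_functions(dodas, A_TABLE["vasarā"]))  # ['modifier']
```

Because simple words and x-words share the same `(kind, features)` shape, the lookup treats both the same way, which mirrors the last bullet of the slide.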
An Example Parse Tree
[Figure: hybrid parse tree of the sentence below. Node types: Verb, Noun, Adj, Adv, Prep, Conj, Pron, Modal, Comma, x-Prep, x-Coord, x-Verb, x-Sub. Dependency labels: subject, modifier, attribute. Semantic labels: AGENT, PLACE, TIME, ATTRIBUTE, OWNER.]
Vasarā lieli un mazi bērni dodas uz Baltijas jūru, kur viņi var peldēties.
(In summer) (big) (and) (small) (kids) (are going) (to) (the Baltic) (sea), (where) (they) (can) (swim)
Graphical Notation – Nested Boxes
Computational Challenge • DG: high computational complexity • PSG: low computational complexity • Hybrid structure parser: extremely high computational complexity • The parsing algorithm and the grammar are incomplete – Exponential complexity relative to the length of a sentence – In complex real-life sentences only fragments (chunks) can be fully parsed • The task is to find the longest parseable chunks – Run the parser on all sub-sequences of the sentence – x-Words are devices that cancel out substrings • Chunking reveals the non-parseable fragments
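The chunking idea on this slide can be sketched as follows. This is a simplified illustration, not the project's chunker: it assumes sub-sequences are contiguous spans, covers the sentence greedily from left to right, and takes the parser as a hypothetical `can_parse` predicate supplied by the caller. The toy predicate in the usage lines is likewise made up.

```python
from typing import Callable, List, Sequence, Tuple

Chunk = Tuple[int, int, bool]  # (start, end, parsed?) with end exclusive


def longest_chunks(tokens: Sequence[str],
                   can_parse: Callable[[Sequence[str]], bool]) -> List[Chunk]:
    """Cover the sentence left to right with the longest parseable spans.

    Tokens that start no parseable span are emitted as unparsed
    single-token chunks, which exposes the non-parseable fragments.
    """
    chunks: List[Chunk] = []
    i, n = 0, len(tokens)
    while i < n:
        # Try the longest span starting at i first, then shorter ones.
        for j in range(n, i, -1):
            if can_parse(tokens[i:j]):
                chunks.append((i, j, True))
                i = j
                break
        else:
            chunks.append((i, i + 1, False))
            i += 1
    return chunks


# Toy stand-in for the parser: a span "parses" if every token is known.
known = {"a", "b", "c"}
print(longest_chunks(["a", "b", "?", "c"], lambda span: all(t in known for t in span)))
# [(0, 2, True), (2, 3, False), (3, 4, True)]
```

In the worst case the loop calls the parser on a quadratic number of spans, each of which may itself be exponentially expensive to parse, which is consistent with the complexity concern raised on the slide.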
Syntactic Phenomena & x-Words (1) • Types of analytical forms of a predicate that are currently described in our grammar: – perfect tenses and moods – passive voice – semantic modifiers – nominal and adverbial predicates
[Figure: persiks (Noun, a peach) is the subject of an x-Verb formed by ir (Aux, is) and sulīgs (Adj, juicy); ļoti (Adv, very) is a modifier. Sentence: persiks ir ļoti sulīgs.]
([_,[v,aux,Tense,Nr,Prs]], [_,[adj,Gen,Nr,nom]]) → [x-verb,[v,m,Tense,Nr,Prs,Gen,nom]]
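The x-verb pattern on this slide can be read as a small unification rule: shared variables such as `Nr` enforce agreement between the constituents and are then inherited by the resulting x-word. The sketch below is one possible reading, not the authors' parser. It assumes that capitalised symbols are variables, that `_` matches anything, that the kind slot of each constituent can be ignored, and that the concrete feature values given to ir and sulīgs are plausible placeholders.

```python
def is_variable(symbol):
    # Assumption: capitalised symbols (Tense, Nr, Prs, Gen) are variables.
    return symbol[:1].isupper()


def unify(pattern, features, bindings):
    """Match one feature list against a pattern; return new bindings or None."""
    if len(pattern) != len(features):
        return None
    bindings = dict(bindings)
    for want, have in zip(pattern, features):
        if want == "_":
            continue  # anonymous slot, matches anything
        if is_variable(want):
            # Bind on first use; later uses must agree with the bound value.
            if bindings.setdefault(want, have) != have:
                return None
        elif want != have:
            return None
    return bindings


def reduce_to_x_word(lhs, rhs, constituents):
    """Apply one pattern: return the x-word's features, or None on mismatch."""
    if len(lhs) != len(constituents):
        return None
    bindings = {}
    for pattern, features in zip(lhs, constituents):
        bindings = unify(pattern, features, bindings)
        if bindings is None:
            return None
    return [bindings.get(symbol, symbol) for symbol in rhs]


# The nominal-predicate pattern from the slide (kind slots omitted).
LHS = [["v", "aux", "Tense", "Nr", "Prs"], ["adj", "Gen", "Nr", "nom"]]
RHS = ["v", "m", "Tense", "Nr", "Prs", "Gen", "nom"]

ir = ["v", "aux", "present", "sg", "3"]   # assumed feature values
suligs = ["adj", "m", "sg", "nom"]        # assumed feature values
print(reduce_to_x_word(LHS, RHS, [ir, suligs]))
# ['v', 'm', 'present', 'sg', '3', 'm', 'nom']
```

An adjective in the plural would fail to unify with the singular auxiliary, so no x-verb would be built. The interleaved adverb ļoti is not part of the pattern and would attach as an ordinary dependant.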
Syntactic Phenomena & x-Words (2) • Prepositional phrases: preposition + nomen in an appropriate form • The nomen may be involved in other dependencies
[Figure: skatīties (Verb, to look) has an x-Prep as modifier; the x-Prep is formed by pa (Prep, through) and logu (Noun, a window); istabas (Noun, of room) is an attribute of logu, and gaišas (Adj, a light) is an attribute of istabas. Phrase: skatīties pa gaišas istabas logu.]
Syntactic Phenomena & x-Words (3) • Coordinated parts of a sentence can be regarded as an x-word, as they have the same syntactic role • Morphological features are in agreement, and thus can be inherited with no loss of information
[Figure: the x-Coord formed by sēž (Verb, is sitting), un (Conj, and) and lasa (Verb, reading) has meitene (Noun, a girl) as subject and grāmatu (Noun, a book) as object. Sentence: meitene sēž un lasa grāmatu.]
Syntactic Phenomena & x-Words (4) • Subordinate clauses are seen as x-words as well – They link to the principal clauses as single parts of a sentence (both syntactically and semantically) – Typically they are dependants of a single (x-)word: • noun → attributive clause • verb → object clause • verb → modifier clause • Coordinated clauses – Could be joined under an artificial node, analogous to coordinated parts of a sentence – However, semantically each clause is treated as a separate sentence • Both types of clauses can be expanded to a full-fledged simple sentence structure • Extremely high computational complexity
Discontinuous Constituents • One of the main issues in dealing with a phrase-structure grammar • Non-projective parse trees are a very rare phenomenon in dependency grammars – A dependency grammar is essentially not based on constituents – At the root there is a verb to which all the other syntactic primitives are connected • Written text & neutral word order vs. speech • Discontinuous x-words are implicitly covered by the natural interleaving of dependants within them – Dependants that linearly stand inside an x-word may not be connected to the x-word as a whole, only to a particular constituent of it
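Non-projectivity in a dependency tree shows up as crossing arcs, which can be detected with a short check. The sketch below is a generic illustration and not part of the described parser. The encoding is an assumption: `heads[d]` holds the index of the head of token `d`, and `-1` marks the root, whose artificial arc is ignored.

```python
def crossing_arcs(heads):
    """Return pairs of dependency arcs that cross each other.

    heads[d] is the index of the head of token d, or -1 for the root.
    Each arc is stored as (left index, right index).
    """
    arcs = [tuple(sorted((h, d))) for d, h in enumerate(heads) if h >= 0]
    crossings = []
    for i, a in enumerate(arcs):
        for b in arcs[i + 1:]:
            # Two arcs cross if exactly one endpoint of one arc lies
            # strictly inside the span of the other.
            if a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]:
                crossings.append((a, b))
    return crossings


# Nested arcs only: no crossings.
print(crossing_arcs([-1, 4, 3, 4, 0]))  # []
# Token 0 depends on 2 and token 1 depends on 3: the two arcs cross.
print(crossing_arcs([2, 3, -1, 2]))     # [((0, 2), (1, 3))]
```

An empty result means no arcs cross. A full projectivity test would also consider the position of the root, which this sketch leaves out.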
Evaluation • Performance and complexity aspects have not yet been considered in depth • The method has been implemented in a running parser of Latvian • Currently we have formalized ~450 patterns of x-words and ~200 dependency rules – The table of morphological features for each sentence is built on-the-fly – A significant amount of work is still pending to achieve nearly complete coverage of syntax – Consistency of the set of rules – Chunking facilitates debugging and improvement of the grammar • If a parse tree is syntactically valid there is no reason to reject it (as long as selectional restrictions are not considered) • Morphological analysis → syntax parsing → syntactic and lexical valences → model builder
Future Plans • Coverage of the grammar • Quantitative: – Run the chunker on all Latvian texts available • Raw-text corpus (~100 million running words) • Web corpus (~4 GB) • Exploitation of the BalticGRID infrastructure • Average chunk size statistics for various classes of documents – Use results to iteratively... • create a partially annotated Latvian text corpus • improve the chunker and the grammar • Qualitative: – Sofie treebank (just started)