 
Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing
Workshop on Statistical Parsing of Morphologically-Rich Languages
12th International Conference on Parsing Technologies
Jinho D. Choi & Martha Palmer
University of Colorado at Boulder
October 6th, 2011
choijd@colorado.edu
Thursday, October 6, 2011
Dependency Parsing in Korean
• Why dependency parsing in Korean?
  - Korean is a flexible word order language.
[Figure: constituent trees for two word orders of the same sentence. The SOV construction "She still him loved" has the structure S(NP-SBJ, VP(AP, VP(NP-OBJ, VP))). The scrambled order "Him she still *T* loved" fronts the object as NP-OBJ-1 and leaves the trace *T*. Both orders carry the same dependency labels: SBJ, ADV, OBJ.]
Dependency Parsing in Korean
• Why dependency parsing in Korean?
  - Korean is a flexible word order language.
  - Rich morphology makes dependency parsing easier.
    • 그녀 + 는: She + auxiliary particle
    • 그 + 를: He + objective case marker
[Figure: dependency tree for "She still him loved" with the labels SBJ, ADV, and OBJ.]
Dependency Parsing in Korean
• Statistical dependency parsing in Korean
  - Sufficiently large training data is required.
    • Not much training data is available for Korean dependency parsing.
• Constituent Treebanks in Korean
  - Penn Korean Treebank: 15K sentences.
  - KAIST Treebank: 30K sentences.
  - Sejong Treebank: 60K sentences.
    • The most recent and largest Treebank in Korean.
    • Contains Penn Treebank style constituent trees.
Sejong Treebank
• Phrase structure
  - Including phrase tags, POS tags, and function tags.
  - Each token can be broken into several morphemes.
[Figure: constituent tree S(NP-SBJ, VP(AP, VP(NP-OBJ, VP))) over the tokens "She still him loved", with the morpheme analysis of each token:]
  She:   NP + JX
  still: MAG
  him:   NP + JKO
  loved: NNG + XSV + EP + EF
Tokens are mostly separated by white spaces.
Sejong Treebank

Phrase-level tags                Function tags
  S     Sentence                   SBJ   Subject
  Q     Quotative clause           OBJ   Object
  NP    Noun phrase                CMP   Complement
  VP    Verb phrase                MOD   Noun modifier
  VNP   Copula phrase              AJT   Predicate modifier
  AP    Adverb phrase              CNJ   Conjunctive
  DP    Adnoun phrase              INT   Vocative
  IP    Interjection phrase        PRN   Parenthetical

POS tags
  NNG   General noun          MM    Adnoun               EP    Prefinal EM          JX   Auxiliary PR
  NNP   Proper noun           MAG   General adverb       EF    Final EM             JC   Conjunctive PR
  NNB   Bound noun            MAJ   Conjunctive adverb   EC    Conjunctive EM       IC   Interjection
  NP    Pronoun               JKS   Subjective CP        ETN   Nominalizing EM      SN   Number
  NR    Numeral               JKC   Complemental CP      ETM   Adnominalizing EM    SL   Foreign word
  VV    Verb                  JKG   Adnominal CP         XPN   Noun prefix          SH   Chinese word
  VA    Adjective             JKO   Objective CP         XSN   Noun DS              NF   Noun-like word
  VX    Auxiliary predicate   JKB   Adverbial CP         XSV   Verb DS              NV   Predicate-like word
  VCP   Copula                JKV   Vocative CP          XSA   Adjective DS         NA   Unknown word
  VCN   Negation adjective    JKQ   Quotative CP         XR    Base morpheme        SF, SP, SS, SE, SO, SW   Punctuation

Abbreviations: PR = particle, CP = case particle, EM = ending marker, DS = derivational suffix.
Dependency Conversion
• Conversion steps
  - Find the head of each phrase using head-percolation rules.
    • All other nodes in the phrase become dependents of the head.
  - Re-direct dependencies for empty categories.
    • Empty categories are not annotated in the Sejong Treebank.
    • Skipping this step generates only projective dependency trees.
  - Label (automatically generated) dependencies.
• Special cases
  - Coordination, nested function tags.
Dependency Conversion
• Head-percolation rules
  - Derived by analyzing each phrase in the Sejong Treebank.
  - Korean is a head-final language.

  Phrase   Direction   Head priority
  S        r           VP;VNP;S;NP|AP;Q;*
  Q        l           S|VP|VNP|NP;Q;*
  NP       r           NP;S;VP;VNP;AP;*
  VP       r           VP;VNP;NP;S;IP;*
  VNP      r           VNP;NP;S;*
  AP       r           AP;VP;NP;S;*
  DP       r           DP;VP;*
  IP       r           IP;VNP;*
  X|L|R    r           *

  - No rules to find the head morpheme of each token.
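The rule table can be turned into a small head finder. The sketch below is not the authors' implementation; it assumes the usual reading of head-percolation rules, which the slide does not spell out: "r"/"l" means scanning children from the right/left, ";" separates priority groups (earlier groups win), "|" joins tags of equal priority, and "*" matches any tag. The function names and the stripping of function tags (e.g. NP-SBJ to NP) are illustrative choices.

```python
# Head-percolation rules transcribed from the slide: phrase -> (direction, priorities).
HEAD_RULES = {
    "S":   ("r", "VP;VNP;S;NP|AP;Q;*"),
    "Q":   ("l", "S|VP|VNP|NP;Q;*"),
    "NP":  ("r", "NP;S;VP;VNP;AP;*"),
    "VP":  ("r", "VP;VNP;NP;S;IP;*"),
    "VNP": ("r", "VNP;NP;S;*"),
    "AP":  ("r", "AP;VP;NP;S;*"),
    "DP":  ("r", "DP;VP;*"),
    "IP":  ("r", "IP;VNP;*"),
    "X":   ("r", "*"),
    "L":   ("r", "*"),
    "R":   ("r", "*"),
}


def find_head(parent_tag, child_tags):
    """Return the index of the head child of a phrase.

    Assumed semantics: try each ';'-separated priority group in turn and
    return the first child, in scan order, whose phrase tag is in the group.
    """
    direction, priorities = HEAD_RULES[parent_tag.split("-")[0]]
    # Strip function tags and indices: "NP-OBJ-1" -> "NP".
    tags = [tag.split("-")[0] for tag in child_tags]
    order = list(range(len(tags)))
    if direction == "r":
        order.reverse()
    for group in priorities.split(";"):
        allowed = group.split("|")
        for i in order:
            if "*" in allowed or tags[i] in allowed:
                return i
    raise ValueError("phrase has no children")


def head_and_dependents(parent_tag, child_tags):
    """All children other than the head become dependents of the head."""
    head = find_head(parent_tag, child_tags)
    dependents = [i for i in range(len(child_tags)) if i != head]
    return head, dependents
```

For the example tree used throughout the slides, `head_and_dependents("S", ["NP-SBJ", "VP"])` picks the VP as head and makes NP-SBJ its dependent, which is consistent with Korean being head-final.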
Dependency Conversion
• Dependency labels
  - Labels retained from the function tags.
  - Labels inferred from constituent relations.

Algorithm 1: Getting inferred labels.
  input:  (c, p), where c is a dependent of p.
  output: A dependency label l for the dependency c ← p.
  begin
    if   p = root              then ROOT → l
    elif c.pos = AP            then ADV  → l
    elif p.pos = AP            then AMOD → l
    elif p.pos = DP            then DMOD → l
    elif p.pos = NP            then NMOD → l
    elif p.pos = VP|VNP|IP     then VMOD → l
    else                            DEP  → l
  end

[Figure: constituent tree S(NP-SBJ, VP(AP, VP(NP-OBJ, VP))) for "She still him loved", converted to a dependency tree with the labels SBJ, ADV, and OBJ.]
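Algorithm 1 translates almost line by line into code. The sketch below follows the branch order on the slide; the function names and the wrapper `dependency_label` are illustrative. In particular, the wrapper assumes that a function tag on the dependent, when present, takes precedence over the inferred label, which the slide implies but does not state.

```python
def inferred_label(c_pos, p_pos, p_is_root=False):
    """Infer a dependency label for dependent c with head p (Algorithm 1)."""
    if p_is_root:
        return "ROOT"
    if c_pos == "AP":
        return "ADV"
    if p_pos == "AP":
        return "AMOD"
    if p_pos == "DP":
        return "DMOD"
    if p_pos == "NP":
        return "NMOD"
    if p_pos in ("VP", "VNP", "IP"):
        return "VMOD"
    return "DEP"


def dependency_label(c_tag, p_tag, p_is_root=False):
    """Retain the function tag if there is one, otherwise infer a label.

    Assumption: tags look like "NP-SBJ" (phrase tag, then function tag).
    """
    c_parts = c_tag.split("-")
    if len(c_parts) > 1:
        return c_parts[1]  # label retained from the function tag
    return inferred_label(c_parts[0], p_tag.split("-")[0], p_is_root)
```

On the example sentence, `dependency_label("NP-SBJ", "VP")` keeps SBJ from the function tag, while `dependency_label("AP", "VP")` falls through to Algorithm 1 and yields ADV.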
Dependency Conversion
• Coordination
  - Previous conjuncts become dependents of the following conjuncts.
• Nested function tag
  - Nodes with nested f-tags become the heads of the phrases.
[Figure: constituent tree S(NP-SBJ(NP-CNJ, NP-SBJ(NP-CNJ, NP-SBJ)), VP(NP-OBJ, VP)) over the tokens "I_and he_and she home left", converted to a dependency tree with the labels CNJ, CNJ, OBJ, and SBJ.]
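The coordination rule amounts to chaining the conjuncts left to right. The following is a minimal sketch under that reading; `chain_conjuncts` is a hypothetical helper, and arcs are represented as (dependent, head, label) tuples purely for illustration.

```python
def chain_conjuncts(conjuncts, label="CNJ"):
    """Attach each conjunct to the conjunct that follows it.

    Returns the list of (dependent, head, label) arcs and the last conjunct,
    which is left to attach to the rest of the sentence.
    """
    if not conjuncts:
        raise ValueError("at least one conjunct is required")
    arcs = [
        (conjuncts[i], conjuncts[i + 1], label)
        for i in range(len(conjuncts) - 1)
    ]
    return arcs, conjuncts[-1]


# Example from the slide: "I_and he_and she home left".
arcs, phrase_head = chain_conjuncts(["I_and", "he_and", "she"])
```

Here `arcs` holds the two CNJ dependencies and `phrase_head` is "she", the conjunct that carries the nested SBJ tag and attaches to "left".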
Dependency Parsing
• Dependency parsing algorithm
  - Transition-based, non-projective parsing algorithm.
    • Choi & Palmer, 2011.
  - Performs transitions from both projective and non-projective dependency parsing algorithms selectively.
    • Linear time parsing speed in practice for non-projective trees.
• Machine learning algorithm
  - LIBLINEAR L2-regularized L1-loss support vector classification.
Jinho D. Choi & Martha Palmer. 2011. Getting the Most out of Transition-based Dependency Parsing. In Proceedings of ACL:HLT'11.
Dependency Parsing
• Feature selection
  - Each token consists of multiple morphemes (up to 21).
  - POS tag feature of each token?
    • (NNG & XSV & EP & EF & SF) vs. (NNG | XSV | EP | EF | SF)
    • Sparse information vs. lack of information. Happy medium?
[Example tokens and their morpheme analyses:]
  Nakrang_Princess:  NNP + NNG + JX              (Nakrang + Princess + JX)
  Hodong_Prince:     NNP + NNG + JKO             (Hodong + Prince + JKO)
  Loved.:            NNG + XSV + EP + EF + SF    (Love + XSV + EP + EF + .)
Dependency Parsing
• Morpheme selection

  FS   The first morpheme
  LS   The last morpheme before JO|DS|EM
  JK   Particles (J* in Table 1)
  DS   Derivational suffixes (XS* in Table 1)
  EM   Ending markers (E* in Table 1)
  PY   The last punctuation, only if there is no other morpheme followed by the punctuation

[Example: morphemes selected from each token:]
  Nakrang + Princess + JX     NNP + NNG + JX              →  NNP, NNG, JX
  Hodong + Prince + JKO       NNP + NNG + JKO             →  NNP, NNG, JKO
  Love + XSV + EP + EF + .    NNG + XSV + EP + EF + SF    →  NNG, XSV, EF, SF
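The selection table can be sketched as a function over a token's morpheme list. This is an illustration rather than the authors' code, and it makes assumptions the slide leaves open: when a token has several particles, suffixes, or endings, the last one of each kind is chosen (which reproduces EF rather than EP in the example); LS is only defined when a particle, suffix, or ending exists; and morphemes are given as (form, POS) pairs with English glosses standing in for the Korean forms.

```python
PUNCTUATION = {"SF", "SP", "SS", "SE", "SO", "SW"}
AFFIX_PREFIXES = ("J", "XS", "E")  # particles, derivational suffixes, ending markers


def select_morphemes(morphemes):
    """Map each selection role (FS, LS, JK, DS, EM, PY) to a morpheme index."""
    roles = {}
    if not morphemes:
        return roles
    roles["FS"] = 0
    # LS: the last morpheme before the first particle/suffix/ending, if any.
    affixes = [i for i, (_, pos) in enumerate(morphemes)
               if pos.startswith(AFFIX_PREFIXES)]
    if affixes and affixes[0] > 0:
        roles["LS"] = affixes[0] - 1
    # JK, DS, EM: assumed to be the last morpheme of each kind.
    for role, prefix in (("JK", "J"), ("DS", "XS"), ("EM", "E")):
        hits = [i for i, (_, pos) in enumerate(morphemes) if pos.startswith(prefix)]
        if hits:
            roles[role] = hits[-1]
    # PY: punctuation counts only if nothing follows it in the token.
    if morphemes[-1][1] in PUNCTUATION:
        roles["PY"] = len(morphemes) - 1
    return roles


def selected_tags(morphemes):
    """POS tags of the selected morphemes, in token order, without repeats."""
    indices = sorted(set(select_morphemes(morphemes).values()))
    return [morphemes[i][1] for i in indices]


nakrang = [("Nakrang", "NNP"), ("Princess", "NNG"), ("AUX", "JX")]
loved = [("Love", "NNG"), ("XSV", "XSV"), ("EP", "EP"), ("EF", "EF"), (".", "SF")]
```

With these inputs, `selected_tags(nakrang)` and `selected_tags(loved)` reproduce the first and third rows of the example: FS and LS coincide for "Love", and the prefinal ending EP is dropped in favor of EF.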
Dependency Parsing
• Feature extraction
  - Extract features using only important morphemes.
    • Individual POS tag features of the 1st and 3rd tokens.
      : NNP_1, NNG_1, JK_1, NNG_3, XSV_3, EF_3
    • Joined features of POS tags between the 1st and 3rd tokens.
      : NNP_1_NNG_3, NNP_1_XSV_3, NNP_1_EF_3, JK_1_NNG_3, JK_1_XSV_3
  - Tokens used: w_i, w_j, w_{i±1}, w_{j±1}

[Example: morphemes selected from each token:]
  1st token:  Nakrang + Princess + JX     NNP + NNG + JX              →  NNP, NNG, JX
  2nd token:  Hodong + Prince + JKO       NNP + NNG + JKO             →  NNP, NNG, JKO
  3rd token:  Love + XSV + EP + EF + .    NNG + XSV + EP + EF + SF    →  NNG, XSV, EF, SF
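A rough sketch of how such feature strings could be generated from the selected tags follows. The function names and the string format are illustrative, not the parser's actual feature templates. The slide lists only some of the joined features; this sketch enumerates every pair, which is an assumption.

```python
from itertools import product


def individual_features(tags, position):
    """One feature per selected POS tag, suffixed with the token position."""
    return [f"{tag}{position}" for tag in tags]


def joined_features(tags_a, position_a, tags_b, position_b):
    """One feature per pair of selected POS tags from two tokens."""
    return [
        f"{a}{position_a}_{b}{position_b}"
        for a, b in product(tags_a, tags_b)
    ]


# Selected tags as written on the slide for the 1st and 3rd tokens.
first = ["NNP", "NNG", "JK"]
third = ["NNG", "XSV", "EF"]

singles = individual_features(first, 1) + individual_features(third, 3)
pairs = joined_features(first, 1, third, 3)
```

Joining only the selected morphemes keeps the number of pair features small (at most a handful per token pair) compared with joining every morpheme of a token that may contain up to 21 of them.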
Experiments
• Corpora
  - Dependency trees converted from the Sejong Treebank.
  - Consists of 20 sources in 6 genres.
    • Newspaper (NP), Magazine (MZ), Fiction (FI), Memoir (ME), Informative Book (IB), and Educational Cartoon (EC).
  - Evaluation sets are very diverse compared to training sets.
    • Ensures the robustness of our parsing models.

# of sentences in each set (T: training, D: development, E: evaluation)
       NP      MZ      FI       ME      IB      EC
  T    8,060   6,713   15,646   5,053   7,983   1,548
  D    2,048   -       2,174    -       1,307   -
  E    2,048   -       2,175    -       1,308   -