Joint Word Segmentation and POS-Tagging Using a Single Perceptron
Yue Zhang and Stephen Clark
Oxford University Computing Laboratory
June 5, 2008
Introduction to Chinese POS-tagging
• Chinese sentences are written as character sequences:
    我喜欢读书 (I like reading)
• Word segmentation is a necessary step before POS-tagging:
    Input:   我喜欢读书        (Ilikereading)
    Segment: 我 喜欢 读书      (I like reading)
    Tag:     我/PN 喜欢/V 读书/N  (I/PN like/V reading/N)
• The traditional approach treats word segmentation and POS-tagging as two separate steps
Two observations
• Segmentation errors propagate to the POS-tagging step:
    Input:   我喜欢读书        (Ilikereading)
    Segment: 我喜 欢 读书      (Ili ke reading)
    Tag:     我喜/N 欢/V 读书/N  (Ili/N ke/V reading/N)
• POS information helps to improve segmentation:
    the same character sequence may be segmented either as CD M N or as CD JJ, and a string of digit characters may form one long CD word or a sequence of separate CD words; knowing the POS tags helps choose the correct segmentation
Joint segmentation and tagging
• The observations lead to the solution of joint segmentation and POS-tagging:
    Input:  我喜欢读书          (Ilikereading)
    Output: 我/PN 喜欢/V 读书/N  (I/PN like/V reading/N)
• Consider segmentation and POS information simultaneously
• The most appropriate output is chosen from all possible segmented and tagged outputs
Challenges
• How to evaluate the correctness of outputs (the model)
• How to perform decoding: choose the best from all possible outputs
    Difficulty: the large combined search space, O(2^(n-1) T^n). Depending on the feature set, dynamic programming can be inefficient too (which is the case for this paper).
• How to automatically train the parameters of the model
    Challenge: training features for segmentation and POS-tagging simultaneously.
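As a rough, hypothetical illustration of this blow-up (the numbers below are not from the slides), for an input of n characters there are 2^(n-1) possible segmentations, and each of the at most n resulting words can take one of T tags:

```python
# Hypothetical back-of-the-envelope calculation of the joint search space
# O(2^(n-1) * T^n): 2^(n-1) ways to segment n characters, and in the worst
# case each of up to n words takes one of T tags.
def joint_space_upper_bound(n, T):
    return (2 ** (n - 1)) * (T ** n)

# Even a 10-character sentence with a 30-tag tagset yields an astronomically
# large space, motivating beam search over exact inference.
print(joint_space_upper_bound(10, 30))
```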
Existing solutions: Ng and Low (2004)
• The model: maps joint segmentation and POS-tagging to a character-tagging problem, assigning two types of tags to each character to indicate segmentation and POS information respectively:
    我/s_PN 喜/b_V 欢/e_V 读/b_N 书/e_N
    我 (I)  喜欢 (like)  读书 (reading)
  where s marks a single-character word, and b/e mark word-beginning and word-ending characters
• Decoding: beam search
• Training: maximum entropy model for sequence labeling
Existing solutions: Shi and Wang (2007)
• The model: takes the N-best outputs from the word segmentor and passes them to a separate POS-tagger, ranking candidates by the overall probability score from the segmentor and tagger
• Decoding: A* for word segmentation and dynamic programming for tagging
• Training: conditional random fields for sequence labeling
Existing solutions: discussion
A potential disadvantage of both models above is the restricted interaction between segmentation and POS information.
• For the character-based method, whole-word information is not explicitly associated with POS tags
• For the reranking method, interaction is limited to the N-best output list from the word segmentor
Our proposed model
The motivation is to impose no restriction on the interaction between word and POS information during processing.
• The model: a linear model with both word segmentation and POS-tagging features
• Decoding: a multiple-beam search algorithm
• Training: the generalized perceptron
The baseline
• Word segmentor from our previous research (Zhang and Clark, 2007)
• The perceptron POS-tagger from Collins (2002)
The baseline word segmentor
• Linear model trained by the generalized perceptron
• Features are extracted from a word bigram context, and encompass both word and character information
• Standard beam search decoder
Features from the baseline segmentor
1. word w
2. word bigram w1 w2
3. single-character word w
4. a word of length l with starting character c
5. a word of length l with ending character c
6. space-separated characters c1 and c2
7. character bigram c1 c2 in any word
8. the first / last characters c1 / c2 of any word
9. word w immediately before character c
10. character c immediately before word w
11. the starting characters c1 and c2 of two consecutive words
12. the ending characters c1 and c2 of two consecutive words
13. a word of length l with previous word w
14. a word of length l with next word w
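A few of these features might be extracted from a word bigram as follows (a minimal sketch; the feature-string encoding and function name are assumptions, not the authors' implementation):

```python
# A sketch of extracting some of the segmentation features above from a
# word bigram (w1, w2); the string encoding is an assumed illustration.
def segmentor_features(w1, w2):
    feats = [
        "word=" + w2,                             # 1: word
        "bigram=" + w1 + "_" + w2,                # 2: word bigram
        "len+start=%d_%s" % (len(w2), w2[0]),     # 4: length with starting char
        "len+end=%d_%s" % (len(w2), w2[-1]),      # 5: length with ending char
        "first/last=" + w2[0] + "_" + w2[-1],     # 8: first / last characters
        "starts=" + w1[0] + "_" + w2[0],          # 11: starting chars of both words
        "ends=" + w1[-1] + "_" + w2[-1],          # 12: ending chars of both words
    ]
    if len(w2) == 1:
        feats.append("single=" + w2)              # 3: single-character word
    return feats

print(segmentor_features("喜欢", "读书"))
```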
The baseline POS-tagger
• Linear model trained by the generalized perceptron
• Features redefined for Chinese, including tag trigrams
• Standard beam search decoder
Features from the baseline POS-tagger
1-3. tag t with word w; tag bigram t1 t2; tag trigram t1 t2 t3
4. tag t followed by word w
5. word w followed by tag t
6. word w with tag t and previous character c
7. word w with tag t and next character c
8. tag t on single-character word w in character trigram c1 w c2
9. tag t on a word starting with char c
10. tag t on a word ending with char c
11. tag t on a word containing char c in the middle
12. tag t on a word starting with char c0 and containing char c
13. tag t on a word ending with char c0 and containing char c
14. tag t on a word containing repeated char cc
15. tag t on a word starting with character category g
16. tag t on a word ending with character category g
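A few of the tag-side features might be extracted like this (again a sketch with an assumed encoding, for word w with tag t, the previous tag t1 and the tag before that, t2):

```python
# A sketch of some of the POS-tagging features above; the feature-string
# encoding is an assumed illustration, not the authors' implementation.
def tagger_features(w, t, t1, t2):
    return [
        "wt=" + w + "_" + t,                # 1: tag with word
        "tt=" + t1 + "_" + t,               # 2: tag bigram
        "ttt=" + t2 + "_" + t1 + "_" + t,   # 3: tag trigram
        "start=" + w[0] + "_" + t,          # 9: tag on word-initial character
        "end=" + w[-1] + "_" + t,           # 10: tag on word-final character
    ]

print(tagger_features("读书", "N", "V", "PN"))
```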
The joint segmentor and POS-tagger
• Linear model trained by the generalized perceptron
• Features are the union of the baseline segmentor and tagger features
• Multiple-beam search decoder
The joint segmentor and POS-tagger
• Formulation of the joint segmentation and tagging problem: given an input sentence x, the output F(x) satisfies
    F(x) = arg max_{y ∈ GEN(x)} Score(y)
• The model (denoting the global feature vector for y by Φ(y)):
    Score(y) = Φ(y) · w
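Concretely, with sparse features, Score(y) is just the dot product of the feature counts of a candidate with the weight vector (a minimal sketch; the feature strings and weights below are made-up examples):

```python
from collections import Counter

# Score(y) = Phi(y) . w : sum, over the candidate's features, of
# feature count times feature weight (missing features score 0).
def score(features, weights):
    phi = Counter(features)  # Phi(y): feature -> count
    return sum(weights.get(f, 0.0) * n for f, n in phi.items())

w = {"wt=like_V": 1.5, "tt=PN_V": 0.5}
print(score(["wt=like_V", "tt=PN_V", "unseen"], w))  # 2.0
```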
The joint segmentor and POS-tagger
Inputs: training examples (x_i, y_i)
Initialization: set w = 0
Algorithm:
    for t = 1..T, i = 1..N:
        calculate z_i = arg max_{y ∈ GEN(x_i)} Φ(y) · w
        if z_i ≠ y_i:
            w = w + Φ(y_i) - Φ(z_i)
Outputs: w
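The update loop above can be sketched as runnable code (GEN and Phi here are hypothetical toy stand-ins; in the real system GEN enumerates segmented and tagged candidates and Phi extracts the feature sets listed on the earlier slides):

```python
from collections import Counter

# Generalized perceptron (Collins, 2002): decode each example with the
# current weights; on a mistake, add the gold features and subtract the
# features of the wrongly predicted candidate.
def train_perceptron(examples, GEN, Phi, T=10):
    w = Counter()  # weight vector, initially all zeros
    for _ in range(T):
        for x, y in examples:
            # z = arg max over GEN(x) of Phi(candidate) . w
            z = max(GEN(x), key=lambda c: sum(w[f] for f in Phi(c)))
            if z != y:
                for f in Phi(y):
                    w[f] += 1
                for f in Phi(z):
                    w[f] -= 1
    return w
```

On a toy task where each candidate is its own single feature, e.g. `train_perceptron([("x", "good")], lambda x: ["bad", "good"], lambda c: [c], T=2)`, the first pass mispredicts "bad" and the update makes "good" win thereafter.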
The joint segmentor and POS-tagger
• The decoding algorithm is one of the biggest challenges
    – Exact inference would be very slow even with dynamic programming
    – Standard beam search gave inferior accuracy
• A multiple-beam search decoding algorithm
    – Each character in the input sentence is given an agenda, recording the best segmented and POS-tagged partial candidates ending with that character
    – The input sentence is processed incrementally, character by character
    – When each character is processed, all possible words ending with that character are considered, each combined with the partial candidates on the agenda of the character just before the word, to form new partial candidates
    – The system returns the best item from the last agenda
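A minimal sketch of this multiple-beam decoder (the beam width, maximum word length, and scoring function below are placeholder assumptions; the real scorer is the perceptron-trained linear model):

```python
# Multiple-beam search: agendas[i] holds the best partial candidates
# covering chars[:i]; a candidate is a list of (word, tag) pairs.
def decode(chars, tags, score, beam=16, max_word=4):
    agendas = [[[]]] + [[] for _ in chars]  # agenda 0: the empty candidate
    for end in range(1, len(chars) + 1):
        cands = []
        # consider every word ending at position `end` ...
        for start in range(max(0, end - max_word), end):
            word = "".join(chars[start:end])
            # ... combined with the candidates ending just before the word
            for prev in agendas[start]:
                for t in tags:
                    cands.append(prev + [(word, t)])
        # keep only the best `beam` candidates on this agenda
        cands.sort(key=score, reverse=True)
        agendas[end] = cands[:beam]
    return agendas[-1][0]  # best item from the last agenda
```

With a toy scorer that rewards candidates with more words, `decode(list("abc"), ["T"], len)` segments the input into three single-character words.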
The joint segmentor and POS-tagger
(Worked example over five input characters A B C D E, with one agenda per character and two tags T1, T2. Processing A fills the first agenda with A/T1 and A/T2; processing B considers the words B and AB, giving candidates such as A/T1 B/T1 and AB/T2; processing C considers the words C, BC and ABC, combining each with the agenda just before the word's start, e.g. A/T2 B/T2 C/T1, A/T2 BC/T1 and ABC/T2; at each step only the highest-scoring candidates are kept on the agenda.)
Optimization techniques
• The tag dictionary
    – Frequent words
    – Closed-set tags
• The maximum word length recorded for each tag
• Among candidates in the same context, only the best is stored
• All of the above information is updated online
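The bookkeeping behind the first two pruning techniques might look like this (a sketch under assumed data structures; in particular, the unknown-word fallback below is an illustrative assumption, and the real system restricts tags only for frequent words and closed-set tags):

```python
from collections import defaultdict

# Online records used to prune the tag candidates tried during decoding:
# which tags each word was seen with, and the longest word seen per tag.
class TagDict:
    def __init__(self):
        self.word_tags = defaultdict(set)  # word -> tags seen in training
        self.max_len = defaultdict(int)    # tag  -> longest word seen with it

    def update(self, word, tag):
        # called online as gold examples are processed
        self.word_tags[word].add(tag)
        self.max_len[tag] = max(self.max_len[tag], len(word))

    def candidate_tags(self, word, all_tags):
        if word in self.word_tags:
            # known word: only try the tags it was actually seen with
            return sorted(self.word_tags[word])
        # unknown word: skip tags never seen on a word this long
        return [t for t in all_tags if self.max_len[t] >= len(word)]
```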
Experiments
• Experimental data: Chinese Treebank 4
• Test set: 10-fold cross-validation on Chinese Treebank 3
• Development set: the rest of the data, used to determine the number of training iterations, analyse the influence of various factors, and plot the distribution of typical errors
The learning curves
(Two plots: F-score against the number of training iterations, from 1 to 10, with F-scores in the ranges 0.86-0.90 and 0.88-0.92 respectively.)