Exploiting Syntactic Structure for Language Modeling

Ciprian Chelba, Frederick Jelinek (PowerPoint presentation)



  1. Hierarchical Structure in Natural Language
  [Figure: parse tree for "the contract ended with a loss of 7 cents", with lexicalized constituents ended_S, ended_VP, with_PP, loss_NP, of_PP, cents_NP, contract_NP over the tagged words the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS]
  - Words are hierarchically organized in syntactic constituents (tree structure)
  - Part-of-Speech (POS) and Non-Terminal (NT) tags identify the type of constituent
  - Lexicalized annotation of intermediate nodes in the tree
  Identifying the syntactic structure: parsing
  ✔ Automatic parsing of natural language text is an area of active research

  2. Exploiting Syntactic Structure for Language Modeling
  Ciprian Chelba, Frederick Jelinek
  - Hierarchical Structure in Natural Language
  - Speech Recognition: Statistical Approach
  - Basic Language Modeling:
    - Measures for Language Model Quality
    - Current Approaches to Language Modeling
  - A Structured Language Model:
    - Language Model Requirements
    - Word and Structure Generation
    - Research Issues
    - Model Performance: Perplexity results on UPenn-Treebank
    - Model Performance: Perplexity and WER results on WSJ/SWB/BN
  - Any Future for the Structured Language Model?
    - Richer Syntactic Dependencies
    - Syntactic Structure Portability
    - Information Extraction from Text

  3. Speech Recognition: Statistical Approach
  [Figure: source-channel model. The speaker's mind produces the word string W, the speech producer and acoustic processor turn it into acoustic data A, and the linguistic decoder outputs the hypothesis \hat{W}; the acoustic channel sits between the speaker and the speech recognizer.]
  \hat{W} = argmax_W P(W | A) = argmax_W P(A | W) · P(W)
  - P(A | W): acoustic model (channel probability)
  - P(W): language model (source probability)
  - search for the most likely word string \hat{W}
  ✔ Due to the large vocabulary size (tens of thousands of words) an exhaustive search is intractable.
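In practice the decision rule above is applied to a pruned set of hypotheses rather than by exhaustive search. A minimal sketch of rescoring an N-best list, assuming made-up acoustic and language-model log-scores and an illustrative lm_weight parameter (neither is from the slides):

```python
# Rescore an N-best list with the source-channel rule
#   W* = argmax_W P(A | W) * P(W),
# working in log space. All scores below are invented for illustration.
nbest = [
    # (hypothesis, log P(A|W) from acoustic model, log P(W) from language model)
    ("the contract ended with a loss of seven cents",  -1520.3, -42.1),
    ("the contract ended with the loss of seven cents", -1521.0, -44.8),
    ("a contract ended with a loss of seven cents",     -1519.9, -47.5),
]

def total_score(acoustic_logp, lm_logp, lm_weight=1.0):
    # In practice the LM log-probability is scaled by a tuned weight;
    # 1.0 here is only a placeholder.
    return acoustic_logp + lm_weight * lm_logp

best_hyp = max(nbest, key=lambda h: total_score(h[1], h[2]))
print(best_hyp[0])
```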

  4. Basic Language Modeling
  Estimate the source probability P(W), W = w_1 w_2 ... w_n, from a training corpus: millions of words of text chosen for its similarity to the expected utterances.
  Parametric conditional models:
  P_Θ(w_i | w_1 ... w_{i-1}), Θ ∈ parameter space, w_i ∈ V (source alphabet, i.e. the vocabulary)
  ✔ Source Modeling Problem
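A minimal sketch of the estimation problem for the trigram case, using plain relative frequencies over a toy corpus (not the authors' code; real training uses millions of words plus the smoothing on slide 6):

```python
from collections import defaultdict

# Maximum-likelihood (relative frequency) estimate of P(w_i | w_{i-2}, w_{i-1})
# from a toy corpus.
corpus = "the contract ended with a loss of seven cents".split()

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for w2, w1, w in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w2, w1, w)] += 1
    bigram_counts[(w2, w1)] += 1

def p_ml(w, w2, w1):
    # Zero for unseen trigrams, which is exactly the sparseness problem
    # smoothing has to address.
    denom = bigram_counts[(w2, w1)]
    return trigram_counts[(w2, w1, w)] / denom if denom else 0.0

print(p_ml("a", "ended", "with"))  # 1.0 in this toy corpus
```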

  5. Measures for Language Model Quality
  Word Error Rate (WER)
  TRN: UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER HUGE AREAS
  HYP: UPSTATE NEW YORK SOMEWHERE UH ALL ALL THE HUGE AREAS
  1 deletion (UP), 2 substitutions (OVER → ALL, OVER → ALL) and 1 insertion (THE): 4 errors per 10 words in the transcription; WER = 40%
  Evaluating WER reduction is computationally expensive.
  Perplexity (PPL)
  PPL(M) = exp( -(1/N) · Σ_{i=1}^{N} ln P_M(w_i | w_1 ... w_{i-1}) )
  ✔ different from maximum likelihood estimation: the test data is not seen during the model estimation process;
  ✔ good models are smooth: P_M(w_i | w_1 ... w_{i-1}) > ε
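A minimal sketch of the perplexity computation as defined above; `model_prob` is a hypothetical callable standing in for whatever model M is being evaluated:

```python
import math

def perplexity(words, model_prob):
    # PPL(M) = exp(-1/N * sum_i ln P_M(w_i | w_1 ... w_{i-1}))
    total_log_prob = 0.0
    for i, w in enumerate(words):
        p = model_prob(w, words[:i])  # P_M(w_i | history); must be > 0 (smooth model)
        total_log_prob += math.log(p)
    return math.exp(-total_log_prob / len(words))

# Example with a fake uniform model over a 10,000-word vocabulary:
print(perplexity("the contract ended".split(), lambda w, hist: 1.0 / 10000))  # 10000.0
```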

  6. Current Approaches to Language Modeling
  Assume a Markov source of order n; equivalence classification of a given context:
  [w_1 ... w_{i-1}] = w_{i-n+1} ... w_{i-1} = h
  Data sparseness for the 3-gram model P(w_i | w_{i-2}, w_{i-1}):
  - approx. 70% of the trigrams in the training data have been seen once
  - the rate of new (unseen) trigrams in test data relative to those observed in a training corpus of 38 million words is 32% for a 20,000-word vocabulary
  Smoothing: recursive linear interpolation among relative frequency estimates f_k(·), k = 0 ... n, of different orders, using a recursive mixing scheme:
  P_n(u | z_1, ..., z_n) = λ(z_1, ..., z_n) · P_{n-1}(u | z_1, ..., z_{n-1}) + (1 - λ(z_1, ..., z_n)) · f_n(u | z_1, ..., z_n)
  P_{-1}(u) = uniform(V)
  Parameters: Θ = { λ(z_1, ..., z_n); count(u | z_1, ..., z_n), ∀ (u | z_1, ..., z_n) ∈ T }
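A minimal sketch of the recursive interpolation scheme above (not the authors' implementation); `rel_freq` and `lam` are hypothetical callables standing in for the relative-frequency tables f_k and the context-dependent weights λ, which are normally bucketed by context count and estimated on held-out data:

```python
def interpolated_prob(u, context, rel_freq, lam, vocab_size):
    """P_n(u | z_1..z_n) = lam(z_1..z_n) * P_{n-1}(u | z_1..z_{n-1})
                         + (1 - lam(z_1..z_n)) * f_n(u | z_1..z_n),
    with P_{-1}(u) = 1/|V| (uniform)."""
    if not context:
        # Base of the recursion: mix the unigram relative frequency with
        # the uniform distribution P_{-1}.
        return lam(()) * (1.0 / vocab_size) + (1.0 - lam(())) * rel_freq(u, ())
    lower = interpolated_prob(u, context[:-1], rel_freq, lam, vocab_size)
    return lam(context) * lower + (1.0 - lam(context)) * rel_freq(u, context)

# Usage for a trigram context (ordering as in the formula above):
# p = interpolated_prob("cents", ("z1", "z2"), rel_freq, lam, 20000)
```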

  7. Exploiting Syntactic Structure for Language Modeling
  - Hierarchical Structure in Natural Language
  - Speech Recognition: Statistical Approach
  - Basic Language Modeling
  ☞ A Structured Language Model:
    - Language Model Requirements
    - Word and Structure Generation
    - Research Issues:
      - Model Component Parameterization
      - Pruning Method
      - Word Level Probability Assignment
      - Model Statistics Reestimation
    - Model Performance: Perplexity results on UPenn-Treebank
    - Model Performance: Perplexity and WER results on WSJ/SWB/BN

  8. A Structured Language Model
  - Generalize trigram modeling (local) by taking advantage of sentence structure (influence by more distant past)
  - Use exposed heads h (words w and their corresponding non-terminal tags l) for prediction:
    P(w_i | T_i) = P(w_i | h_{-2}(T_i), h_{-1}(T_i))
    where T_i is the partial hidden structure, with head assignment, provided to W_i
  [Figure: partial parse after "... a loss of 7 cents", with exposed heads among ended_VP', with_PP, loss_NP, of_PP, cents_NP, contract_NP over the tagged words the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS]
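A minimal sketch of the prediction step: condition the next-word distribution on the two exposed heads of the partial parse instead of the two previous words. The Head class, the table lookup and the example heads are illustrative assumptions, not the authors' parameterization:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Head:
    word: str   # headword of the constituent, e.g. "loss"
    tag: str    # non-terminal tag of the constituent, e.g. "NP"

def p_next_word(w, h_minus2, h_minus1, head_table):
    # P(w_i | h_{-2}(T_i), h_{-1}(T_i)); head_table is assumed to hold
    # (smoothed) probabilities estimated from parsed training data.
    return head_table.get((h_minus2, h_minus1, w), 0.0)

# Hypothetical exposed heads for one partial parse of the running example:
h2, h1 = Head("loss", "NP"), Head("of", "PP")
print(p_next_word("cents", h2, h1, {(h2, h1, "cents"): 0.2}))  # 0.2
```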

  9. Language Model Requirements
  - Model must operate left-to-right: P(w_i | w_1 ... w_{i-1})
  - In hypothesizing hidden structure, the model can use only the word prefix W_i, i.e. not the complete sentence w_0 ... w_i ... w_{n+1} as all conventional parsers do!
  - Model complexity must be limited; even the trigram model faces critical data sparseness problems
  - Model will assign joint probability to sequences of words and hidden parse structure: P(W_i, T_i)

  10.-13. [Figure sequence: successive snapshots of the partial parse of "the contract ended with a loss of 7 cents" (constituents ended_VP', with_PP, loss_NP, of_PP, cents_NP, contract_NP over the tagged words the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS), built by the action sequence:
  ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...]

  14. [Figure: the three model components acting on the partial parse: the PREDICTOR (predict word), the TAGGER (tag word) and the PARSER (adjoin_{left,right}, null), shown over the same example parse and action sequence: ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...]

  15. Word and Structure Generation
  P(T_{n+1}, W_{n+1}) = Π_{i=1}^{n+1} P(w_i | h_{-2}, h_{-1}) · P(g_i | w_i, h_{-1}.tag, h_{-2}.tag) · P(T_i | w_i, g_i, T_{i-1})
  (the three factors are the predictor, tagger and parser probabilities, respectively)
  - The predictor generates the next word w_i with probability P(w_i = v | h_{-2}, h_{-1})
  - The tagger attaches tag g_i to the most recently generated word w_i with probability P(g_i | w_i, h_{-1}.tag, h_{-2}.tag)
  - The parser builds the partial parse T_i from T_{i-1}, w_i and g_i in a series of moves ending with null, where a parser move a is made with probability P(a | h_{-2}, h_{-1});
    a ∈ { (adjoin-left, NTtag), (adjoin-right, NTtag), null }
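A minimal sketch of how the factorization above turns a joint (word, tag, parser-move) derivation into a log-probability. The three component models are passed in as hypothetical callables, and the running parse context with its exposed_heads/shift/apply methods is an assumed interface, not the authors' data structure:

```python
import math

def joint_log_prob(positions, start_ctx, p_predictor, p_tagger, p_parser):
    """positions: list of (word, tag, parser_moves) per sentence position;
    the running context tracks the partial parse T and exposes the two most
    recent heads h_{-2}, h_{-1}."""
    ctx, logp = start_ctx, 0.0
    for word, tag, moves in positions:
        h2, h1 = ctx.exposed_heads()
        logp += math.log(p_predictor(word, h2, h1))            # predictor
        logp += math.log(p_tagger(tag, word, h1.tag, h2.tag))  # tagger
        ctx = ctx.shift(word, tag)                             # append the tagged word
        for move in moves:                                     # parser moves, ending with null
            h2, h1 = ctx.exposed_heads()
            logp += math.log(p_parser(move, h2, h1))
            ctx = ctx.apply(move)
    return logp
```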

  16. Research Issues
  - Model component parameterization: equivalence classifications for the model components
    P(w_i = v | h_{-2}, h_{-1}), P(g_i | w_i, h_{-1}.tag, h_{-2}.tag), P(a | h_{-2}, h_{-1})
  - Huge number of hidden parses: prune them by discarding the unlikely ones
  - Word level probability assignment: calculate P(w_i | w_1 ... w_{i-1})
  - Model statistics estimation: unsupervised algorithm for maximizing P(W) (minimizing perplexity)
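A minimal sketch of word-level probability assignment: since the parse structure is hidden, P(w_i | w_1 ... w_{i-1}) is obtained by mixing the predictor over the parses that survive pruning, each weighted by its normalized joint probability with the word prefix. The weighting and the parse interface are plausible assumptions, not necessarily the exact scheme used by the authors:

```python
def word_level_prob(word, surviving_parses, p_predictor):
    """surviving_parses: list of (parse, joint_prob) pairs, where joint_prob
    approximates P(W_{i-1}, T) and parse.exposed_heads() (hypothetical
    interface) returns the exposed heads h_{-2}, h_{-1}."""
    total = sum(jp for _, jp in surviving_parses)
    prob = 0.0
    for parse, jp in surviving_parses:
        h2, h1 = parse.exposed_heads()
        prob += (jp / total) * p_predictor(word, h2, h1)
    return prob
```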

  17. Pruning Method
  Number of parses T_k for a given word prefix W_k: |{T_k}| ~ O(2^k);
  prune most parses without discarding the most likely ones for a given sentence.
  Synchronous Multi-Stack Pruning Algorithm
  - the hypotheses are ranked according to ln P(W_k, T_k)
  - each stack contains partial parses constructed by the same number of parser operations
  The width of the pruning is controlled by:
  - maximum number of stack entries
  - log-probability threshold
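A minimal sketch of the stack organization and the two pruning controls; the Hypothesis fields and the default threshold values are illustrative placeholders, not the authors' settings:

```python
from collections import defaultdict

def prune(hypotheses, max_stack_entries=10, log_prob_threshold=6.9):
    """hypotheses: iterable of objects with .num_parser_ops and .log_prob
    (= ln P(W_k, T_k)). Hypotheses are binned into stacks by the number of
    parser operations used, then pruned per stack by size and by distance
    from the best log-probability in that stack."""
    stacks = defaultdict(list)
    for h in hypotheses:
        stacks[h.num_parser_ops].append(h)

    survivors = []
    for stack in stacks.values():
        stack.sort(key=lambda h: h.log_prob, reverse=True)
        best = stack[0].log_prob
        survivors.extend(h for h in stack[:max_stack_entries]
                         if best - h.log_prob <= log_prob_threshold)
    return survivors
```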
