A semi-automatic structure learning method for language modeling

Vitor Pera
Faculdade de Engenharia da Universidade do Porto (FEUP)
September 11, 2019
Outline

• Linguistic Classes Prediction Model (LCPM)
• LCPM’s Structure Learning Method
• Preliminary Results
• Conclusions
• References
Linguistic Classes Prediction Model (LCPM)

• Multiclass-dependent N-gram ( M > N > 1 )

  P(\omega_t \mid \omega_{1:t-1}) = \sum_{c_t \in C(\omega_t)} P(\omega_t \mid c_t, \omega_{1:t-1}) \, P(c_t \mid \omega_{1:t-1})
                                  \approx \sum_{c_t \in C(\omega_t)} P(\omega_t \mid c_t, \omega_{t-N+1:t-1}) \, P(c_t \mid c_{t-M+1:t-1})

• LCPM (FLM formalism): with c \leftrightarrow f^{1:K},

  P(c_t \mid c_{t-M+1:t-1}) \;\longrightarrow\; P(f_t^{1:K} \mid f_{t-M+1:t-1}^{1:K})

• LCPM structure learning (goal)
  • accurate and simple
  • two-step method
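A minimal Python sketch of the class-marginalized word probability above. The lookup functions classes_of, p_word_given_class, and p_class_seq are illustrative assumptions (e.g., smoothed count-based estimators), not part of the paper:

    def lcpm_word_prob(word, word_hist, class_hist, classes_of,
                       p_word_given_class, p_class_seq, N, M):
        """P(w_t | history): marginalize over the classes C(w_t) admissible for w_t."""
        total = 0.0
        for c in classes_of(word):
            # P(w_t | c_t, w_{t-N+1:t-1}) * P(c_t | c_{t-M+1:t-1})
            total += (p_word_given_class(word, c, tuple(word_hist[-(N - 1):]))
                      * p_class_seq(c, tuple(class_hist[-(M - 1):])))
        return total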
LCPM’s Structure Learning Method - Step 1: Intro

• Given
  • the need for an LCPM to compute P(f_t^{1:K} \mid f_{t-M+1:t-1}^{1:K}) (factors not yet known)
  • common knowledge of linguistics
  • full knowledge of the specific language interface
• Solve (non-automatically)
  • Which linguistic features to use?
  • Which linguistic features exhibit some special statistical-independence property?
LCPM’s Structure Learning Method - Step 1: Procedure

1. Choose the linguistic features ( → f^{1:K} )
   • informative for modeling P(\omega_t \mid f_t^{1:K}, \omega_{t-N+1:t-1})
   • adequate to the data resources (annotation and robustness)
2. Make the (credible) assumption: f_t^n is statistically independent of any other factors, given its own history, iff 1 ≤ n ≤ J (accordingly, split f^{1:K} → f^{1:J} ++ f^{J+1:K}, with 1 ≤ J < K)

LCPM factorization:

  P(f_t^{1:K} \mid f_{t-M+1:t-1}^{1:K}) = \Big[ \prod_{i=1}^{J} P(f_t^i \mid f_{t-M+1:t-1}^i) \Big] \underbrace{P(f_t^{J+1:K} \mid f_t^{1:J}, f_{t-M+1:t-1}^{1:K})}_{\text{Step 2}}
LCPM’s Structure Learning Method - Step 1: Example

Given some application and a corpus annotated with multiple tags:

1. Admit that the following tags are judged the most appropriate:
   • part-of-speech (POS)
   • semantic tag (ST)
   • gender inflection (GI)
2. Assume that, of these three LFs, only ST can be predicted based solely on its own history:
   • ST → f^1
   • (POS, GI) → f^{2:3}

This yields the LCPM approximation:

  P(f_t^{1:3} \mid f_{t-M+1:t-1}^{1:3}) \approx P(f_t^1 \mid f_{t-M+1:t-1}^1) \, P(f_t^{2:3} \mid f_t^1, f_{t-M+1:t-1}^{1:3})
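In code form, this example factorization is just two component models chained by a product. A minimal sketch, assuming the two estimators p_st and p_pos_gi (e.g., smoothed FLM backoff models) are given; both names are illustrative:

    def lcpm_example_prob(st, pos_gi, st_hist, factor_hist, p_st, p_pos_gi):
        """P(f^{1:3}_t | history) ~= P(ST_t | ST history)
                                     * P((POS,GI)_t | ST_t, full factor history)."""
        return p_st(st, st_hist) * p_pos_gi(pos_gi, st, factor_hist)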
LCPM’s Structure Learning Method - Step 2: Intro

• The goal is to learn the structure of a statistical model to compute P(f_t^{J+1:K} \mid f_t^{1:J}, f_{t-M+1:t-1}^{1:K}); more precisely ...
• Determine automatically Z \subset f_{t-M+1:t-1}^{1:K} such that
  • |Z| is fixed and |Z| \ll |f_{t-M+1:t-1}^{1:K}| (robustness constraint)
  • P(f_t^{J+1:K} \mid f_t^{1:J}, Z) approximates the original conditional probabilities according to information-theory-based criteria

Notation simplification (hereafter):

  Y = f_t^{J+1:K}; \quad X = f_t^{1:J}; \quad Z \subset W = f_{t-M+1:t-1}^{1:K} \;\;\rightarrow\;\; P(Y \mid X, Z)
LCPM’s SL Method - Step 2: Rules to determine Z

• Information-theoretic measures
  • conditional entropy, H(Y|X)
  • conditional mutual information (CMI), I(Y;Z|X)
  • cross-context conditional mutual information (CCCMI), I_{X_l}(Y;Z|X_m)
• Possible/experimented rules ( → P(Y|X,Z) with Z ⊂ W )
  • To discard Z*: if I(Y;Z^*|X) < \eta \, H(Y|X), then Z* is non-relevant
  • To determine Z*:

    Z^* = \underset{Z \subset W,\; |Z| = \zeta}{\operatorname{argmax}} \; I(Y;Z|X)
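A minimal numpy sketch of these two rules, assuming each candidate Z comes with an estimated joint probability table p_xyz[x, y, z]; the function names and the dict-of-candidates interface are illustrative, and the exhaustive search over subsets of size ζ is abstracted as a set of precomputed candidates:

    import numpy as np

    def cond_mutual_info(p_xyz):
        """I(Y;Z|X) = sum_{x,y,z} p(x,y,z) log[ p(x,y,z) p(x) / (p(x,y) p(x,z)) ]."""
        p_x = p_xyz.sum(axis=(1, 2), keepdims=True)
        p_xy = p_xyz.sum(axis=2, keepdims=True)
        p_xz = p_xyz.sum(axis=1, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.log2(p_xyz * p_x / (p_xy * p_xz))
        mask = p_xyz > 0
        return float((p_xyz[mask] * log_ratio[mask]).sum())

    def cond_entropy(p_xy):
        """H(Y|X) = -sum_{x,y} p(x,y) log p(y|x)."""
        p_x = p_xy.sum(axis=1, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            log_cond = np.log2(p_xy / p_x)
        mask = p_xy > 0
        return float(-(p_xy[mask] * log_cond[mask]).sum())

    def select_z(candidates, eta):
        """candidates: dict mapping a candidate Z's name to its table p_xyz.
        Discard non-relevant candidates, then take the CMI argmax."""
        relevant = {name: p for name, p in candidates.items()
                    if cond_mutual_info(p) >= eta * cond_entropy(p.sum(axis=2))}
        return max(relevant, key=lambda name: cond_mutual_info(relevant[name]))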
LCPM’s SL Method - Step 2: Rules to determine Z (cont.)

• Rule to determine Z* using the “Utility” measure N_\lambda:

  Z^* = \underset{Z \subset W,\; |Z| = \zeta}{\operatorname{argmax}} \; N_\lambda(Y;Z|X), \quad 0 < \lambda \le 1

where N_\lambda(Y;Z|X) represents

  \sum_{X_m} P(X_m) \Big[ I(Y;Z|X_m) - \lambda \sum_{X_l \ne X_m} P(X_l) \, I_{X_l}(Y;Z|X_m) \Big]

and I_{X_l}(Y;Z|X_m) represents

  \sum_{Y} \sum_{Z} P(Y,Z|X_l) \log \frac{P(Y,Z|X_m)}{P(Y|X_m) \, P(Z|X_m)}
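Continuing the numpy sketch above (same p_xyz table convention), a hedged implementation of the CCCMI and of N_λ. Note that I_{X_m}(Y;Z|X_m) reduces to the per-context mutual information I(Y;Z|X_m), which the utility function reuses:

    def cross_context_cmi(p_xyz, l, m):
        """I_{X_l}(Y;Z|X_m): expectation under P(Y,Z|X_l) of the
        pointwise log-ratio evaluated in context X_m."""
        p_yz_l = p_xyz[l] / p_xyz[l].sum()
        p_yz_m = p_xyz[m] / p_xyz[m].sum()
        p_y_m = p_yz_m.sum(axis=1, keepdims=True)
        p_z_m = p_yz_m.sum(axis=0, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.log2(p_yz_m / (p_y_m * p_z_m))
        mask = (p_yz_l > 0) & (p_yz_m > 0)
        return float((p_yz_l[mask] * log_ratio[mask]).sum())

    def utility(p_xyz, lam):
        """N_lambda(Y;Z|X): per-context MI rewarded, cross-context
        disagreement penalized with weight lam."""
        p_x = p_xyz.sum(axis=(1, 2))
        contexts = range(p_xyz.shape[0])
        return sum(p_x[m] * (cross_context_cmi(p_xyz, m, m)
                             - lam * sum(p_x[l] * cross_context_cmi(p_xyz, l, m)
                                         for l in contexts if l != m))
                   for m in contexts)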
LCPM’s SL Method - Step 2: Example

Problem: choose Z_1 or Z_2 to model P(Y|X,Z), with X ∈ {F,S}, Y ∈ {A,B,U}, Z_1 ∈ {C,D,V}, Z_2 ∈ {E,F,W}

Data: P(X=F) = P(X=S)

“Utility” & solutions:

  N_0(Y;Z_1|X) < N_0(Y;Z_2|X) (near equality) ∴ λ = 0 ⇒ choose Z_2
  N_1(Y;Z_1|X) > N_1(Y;Z_2|X) ∴ λ = 1 ⇒ choose Z_1
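For illustration only, the decision procedure would look like the driver below, continuing the sketch above. The joint tables here are random placeholders (the slide's actual distributions are not reproduced), so the printed choices need not match the example's:

    rng = np.random.default_rng(0)

    def random_joint(shape):
        t = rng.random(shape)
        return t / t.sum()

    p_x_y_z1 = random_joint((2, 3, 3))  # axes: X in {F,S}, Y in {A,B,U}, Z1 in {C,D,V}
    p_x_y_z2 = random_joint((2, 3, 3))  # axes: X in {F,S}, Y in {A,B,U}, Z2 in {E,F,W}

    for lam in (0.0, 1.0):
        pick = "Z1" if utility(p_x_y_z1, lam) > utility(p_x_y_z2, lam) else "Z2"
        print(f"lambda = {lam}: choose {pick}")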