Discriminative Models

Joakim Nivre
Uppsala University
Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se
1. Generative and Discriminative Models
2. Log-Linear Models
3. Local Discriminative Models
4. Global Discriminative Models
5. Reranking
Generative and Discriminative Models

A generative statistical model defines the joint probability P(x, y) of input x and output y.

◮ Pros:
  ◮ Learning problems have closed-form solutions
  ◮ Related probabilities can be derived:
    ◮ Conditionalization: P(y | x) = P(x, y) / P(x)
    ◮ Marginalization: P(x) = Σ_y P(x, y)
◮ Cons:
  ◮ Rigid independence assumptions (or intractable parsing)
  ◮ Indirect modeling of the parsing problem
Generative and Discriminative Models

A discriminative statistical model defines the conditional probability P(y | x) of output y given input x.

◮ Pros:
  ◮ No rigid independence assumptions
  ◮ More direct modeling of the parsing problem
◮ Cons:
  ◮ Learning problems require numerical approximation
  ◮ Related probabilities cannot be derived:
    ◮ No way to compute P(x, y) from P(y | x)
    ◮ No way to compute P(x) or P(y) from P(y | x)
Generative and Discriminative Models

Two classes of discriminative models:

◮ Conditional models:
  ◮ Explicitly model the conditional probability P(y | x)
  ◮ Used in the mapping X → Y: argmax_y P(y | x)
◮ Purely discriminative models:
  ◮ Directly optimize the mapping X → Y
  ◮ No explicit model of the conditional probability P(y | x)
Log-Linear Models

P(y | x) = exp(Σ_{i=1..k} f_i(x, y) · w_i) / Σ_{y′ ∈ GEN(x)} exp(Σ_{i=1..k} f_i(x, y′) · w_i)

◮ f_i(x, y) = feature function
◮ w_i = feature weight
◮ exp(Σ_{i=1..k} f_i(x, y) · w_i) > 0
◮ exp(Σ_{i=1..k} f_i(x, y) · w_i) ≤ Σ_{y′ ∈ GEN(x)} exp(Σ_{i=1..k} f_i(x, y′) · w_i)
◮ 0 ≤ P(y | x) ≤ 1
◮ Σ_{y′} P(y′ | x) = 1
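As an illustration, the conditional probability above can be computed directly as a softmax over feature scores. The following is a minimal sketch; the function names and the callable interface for features and GEN are my own, not from the slides:

```python
import math

def log_linear_prob(x, y, gen, features, weights):
    """P(y | x) under a log-linear model.

    features(x, y) returns the list of k feature values f_i(x, y);
    weights is the corresponding list of k weights w_i;
    gen(x) enumerates the candidate set GEN(x).
    (Illustrative interface, assumed here.)
    """
    def score(cand):
        # linear score Σ_i f_i(x, cand) · w_i
        return sum(f * w for f, w in zip(features(x, cand), weights))

    # normalizer: sum of exponentiated scores over GEN(x)
    z = sum(math.exp(score(c)) for c in gen(x))
    return math.exp(score(y)) / z
```

Because the normalizer sums over all of GEN(x), the probabilities of the candidates sum to one by construction.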
Log-Linear Models

y* = argmax_y P(y | x)
   = argmax_y exp(Σ_{i=1..k} f_i(x, y) · w_i) / Σ_{y′ ∈ GEN(x)} exp(Σ_{i=1..k} f_i(x, y′) · w_i)
   = argmax_y exp(Σ_{i=1..k} f_i(x, y) · w_i)
   = argmax_y Σ_{i=1..k} f_i(x, y) · w_i
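The derivation shows that decoding needs only the linear score: the normalizer is constant over GEN(x) and exp is monotone, so both can be dropped. A minimal sketch (hypothetical interface, same conventions as above):

```python
import math

def score(x, y, features, weights):
    # linear score Σ_i f_i(x, y) · w_i
    return sum(f * w for f, w in zip(features(x, y), weights))

def predict(x, gen, features, weights):
    # argmax over the raw linear score — no exp, no normalization needed,
    # since both are monotone/constant over GEN(x)
    return max(gen(x), key=lambda y: score(x, y, features, weights))
```

The argmax of the unnormalized score is guaranteed to coincide with the argmax of the full conditional probability.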
Local Discriminative Models

P(y | x) = Π_{i=1..m} P(d_i | Φ(d_1, …, d_{i−1}, x))

P(d_i | Φ(d_1, …, d_{i−1}, x)) = exp(Σ_{j=1..k} f_j(Φ(d_1, …, d_{i−1}, x), d_i) · w_j) / Σ_{d′ ∈ GEN(x)} exp(Σ_{j=1..k} f_j(Φ(d_1, …, d_{i−1}, x), d′) · w_j)

◮ Conditional model over local decisions
◮ Pros: unconstrained features, efficient learning/decoding
◮ Cons: approximate search (beam search or similar)
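Because the model conditions each decision on the full history, exact search is intractable and decoding typically uses a beam. The following sketch keeps the highest-scoring partial decision histories; the interface (next_decisions, apply, local_score) is assumed for illustration, not taken from the slides:

```python
def beam_search(x, init, next_decisions, apply, is_final, local_score, beam=4):
    """Approximate decoding for a locally normalized model.

    local_score(history, d, x) is the log score of decision d given the
    history of earlier decisions (hypothetical interface).
    """
    agenda = [(0.0, init)]  # (cumulative log score, state)
    while not all(is_final(state) for _, state in agenda):
        expanded = []
        for logp, state in agenda:
            if is_final(state):
                expanded.append((logp, state))
                continue
            for d in next_decisions(state, x):
                expanded.append((logp + local_score(state, d, x), apply(state, d)))
        # keep only the `beam` highest-scoring partial histories
        agenda = sorted(expanded, key=lambda t: -t[0])[:beam]
    return max(agenda, key=lambda t: t[0])[1]
```

With beam=1 this degenerates to greedy decoding; wider beams trade speed for a better approximation of the argmax.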
Global Discriminative Models

P(y | x) = exp(Σ_{i=1..k} f_i(x, y) · w_i) / Σ_{y′ ∈ GEN(x)} exp(Σ_{i=1..k} f_i(x, y′) · w_i)

◮ Conditional model over global structure
◮ Factorization for efficient inference (dynamic programming)
◮ Pros: exact learning/decoding
◮ Cons: only local features, less efficient
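For sequence-shaped outputs where the features factor over adjacent decisions, exact decoding by dynamic programming is the classic Viterbi recursion. A sketch with an assumed score interface (not from the slides): score(i, prev, t, x) is the local log-score contribution at position i, which is exactly the factorization that makes dynamic programming possible.

```python
def viterbi(x, tags, score):
    """Exact decoding when features factor over adjacent decisions."""
    n = len(x)
    # best[i][t] = best score of any tag sequence for x[:i+1] ending in t
    best = [{t: score(0, None, t, x) for t in tags}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score(i, p, t, x))
            best[i][t] = best[i - 1][prev] + score(i, prev, t, x)
            back[i][t] = prev
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: best[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

Restricting features to adjacent decisions is what buys exactness here — the "only local features" limitation the slide lists as a con.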
Reranking

P(y | x) = exp(Σ_{i=1..k} f_i(x, y) · w_i) / Σ_{y′ ∈ GEN_n(x)} exp(Σ_{i=1..k} f_i(x, y′) · w_i)

◮ Conditional model over global structure
◮ GEN_n(x) = n-best list for efficient inference
◮ Pros: unconstrained features, (almost) exact learning/decoding
◮ Cons: can be inefficient
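A reranker scores each candidate in the n-best list with the rich feature set; one common setup (assumed here, not stated on the slide) also includes the base model's log probability as an extra term. Since the normalizer is shared over GEN_n(x), the argmax again reduces to the raw score:

```python
def rerank(x, nbest, features, weights, base_weight=1.0):
    """Pick the best candidate from an n-best list GEN_n(x).

    nbest is a list of (y, base_log_prob) pairs from a base parser;
    base_weight scales the base model's contribution (assumed setup).
    """
    def score(y, base_lp):
        return base_weight * base_lp + sum(
            f * w for f, w in zip(features(x, y), weights))

    return max(nbest, key=lambda pair: score(*pair))[0]
```

Because features see whole candidate structures, they are unconstrained — but the result is only as good as the best candidate the base model put on the list.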
On Exact and Approximate Methods

What if the objective function we want to maximize is not efficiently computable in our favorite model?

1. Use a simpler model (e.g., restrict feature scope)
2. Use approximate inference (e.g., beam search or reranking)
3. Use another objective function (e.g., labeled recall)

Which strategy works best is (usually) an empirical question!