

1. Learning and Reasoning With Incomplete Data: Foundations and Algorithms
Manfred Jaeger, Machine Intelligence Group, Aalborg University
Tutorial, UAI 2010

2. Outline
Part 1 (Coarsened At Random): Introduction; Coarse Data; The CAR Assumption
Part 2 (CAR Models): Testing CAR; Support Analysis; Canonical Models
Part 3 (Learning Without CAR): AI&M and EM; Statistical CAR Tests

3. Key References (Introduction)
1. D. Rubin, Inference and Missing Data. Biometrika 63, 1976.
2. D.F. Heitjan and D. Rubin, Ignorability and Coarse Data. Ann. Stats. 19, 1991.
3. R.D. Gill, M.J. van der Laan and J.M. Robins, Coarsening at Random: Characterizations, Conjectures, Counter-Examples. Proc. 1st Seattle Symposium in Biostatistics, 1997.
4. P.D. Grünwald and J.Y. Halpern, Updating Probabilities. JAIR 19, 2003.
5. M. Jaeger, Ignorability for Categorical Data. Ann. Stats. 33, 2005.
6. M. Jaeger, Ignorability in Statistical and Probabilistic Inference. JAIR 24, 2005.
7. M. Jaeger, The AI&M Procedure for Learning from Incomplete Data. UAI 2006.
8. M. Jaeger, On Testing the Missing at Random Assumption. ECML 2006.
9. R.D. Gill and P.D. Grünwald, An Algorithmic and a Geometric Characterization of Coarsening at Random. Ann. Stats. 36, 2008.

4. Learning from Incomplete Data (Introduction)
Partially observed sequence of 10 coin tosses: h, t, ?, h, ?, h, ?, h, t, ?
"Face-value" likelihood function for estimating the probability of heads:
L(θ) = P_θ(data) = ∏_{i=1}^{10} P_θ(d_i) = θ^4 · (1 − θ)^2 · 1^4
Maximized by θ = 2/3. Is this correct if "?" means: not reported because . . .
◮ . . . the coin rolled off the table?
◮ . . . one observer does not know whether "harp" is heads or tails of the Irish Euro?
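To make the face-value computation concrete, here is a minimal Python sketch (not part of the slides) that scores each "?" as probability 1 and recovers θ = 2/3 by a simple grid search; the function name and the grid are illustrative.

```python
import numpy as np

# Partially observed coin tosses; "?" is an unreported outcome.
data = ["h", "t", "?", "h", "?", "h", "?", "h", "t", "?"]

def face_value_likelihood(theta, data):
    # Each "?" contributes P(heads) + P(tails) = 1, so only the six
    # observed tosses (4 heads, 2 tails) influence the estimate.
    probs = {"h": theta, "t": 1.0 - theta, "?": 1.0}
    return float(np.prod([probs[d] for d in data]))

thetas = np.linspace(0.0, 1.0, 1001)
best = thetas[np.argmax([face_value_likelihood(t, data) for t in thetas])]
print(best)  # ~0.667, i.e. 4 heads out of 6 observed tosses
```

Whether this estimate is trustworthy depends on why the "?" entries are missing, which is exactly the question the slide raises.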

5. Inference by Conditioning (Introduction)
The famous Monty Hall problem.
Argument for staying with the chosen door:
P(prize = 1 | prize ≠ 2) = P(prize = 1) / P(prize ∈ {1, 3}) = 1/2
Argument for switching to door 3:
"door 3 'inherits' the probability mass of door 2, and thus P(prize = 3) = 2/3"
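The contrast between the two arguments can be checked with a small simulation (not from the slides): conditioning has to be applied to the observation "the host opened door 2", not merely to the event "the prize is not behind door 2". The sketch below assumes the standard protocol in which the contestant picks door 1 and the host opens a goat door, choosing at random when he has a choice.

```python
import random

random.seed(0)
n, opened_2, stay_wins, switch_wins = 100_000, 0, 0, 0
for _ in range(n):
    prize = random.randint(1, 3)          # contestant has chosen door 1
    if prize == 1:
        host = random.choice([2, 3])      # host may open either goat door
    else:
        host = 2 if prize == 3 else 3     # host must avoid the prize door
    if host != 2:
        continue                          # condition on the actual observation
    opened_2 += 1
    stay_wins += (prize == 1)
    switch_wins += (prize == 3)

print(stay_wins / opened_2)    # ~1/3: P(prize = 1 | host opened door 2)
print(switch_wins / opened_2)  # ~2/3: P(prize = 3 | host opened door 2)
```

Conditioning only on the event prize ≠ 2, ignoring how the observation was generated, yields the 1/2 of the first argument.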

6. The Common Problem (Introduction)
Can we identify "X is observed" with "X has happened"?
◮ Coin tossing example: X: either h or t
◮ Monty Hall: X: goat behind door 2

7. Outline (Coarse Data)
Part 1 (Coarsened At Random): Introduction; Coarse Data; The CAR Assumption
Part 2 (CAR Models): Testing CAR; Support Analysis; Canonical Models
Part 3 (Learning Without CAR): AI&M and EM; Statistical CAR Tests

8. Missing Values and Coarse Data (Coarse Data)
Data set with missing values:

        X1      X2      X3
d1      true    ?       high
d2      false   false   ?
d3      true    ?       medium

Other types of incompleteness:
◮ Partly observed values: X3 ≠ high
◮ Constraints on multiple variables: X1 = true or X2 = true
Coarse data model [2]: incomplete observations can correspond to any subset of complete observations
◮ More general than missing values
◮ Same as partial information in probability updating (cf. prize ∈ {1, 3})
◮ Simplifies theoretical analysis
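One way to picture the coarse data view is to map each incomplete record to the set of complete states consistent with it. A small illustrative sketch; the variable domains are made up to match the table above.

```python
from itertools import product

# Hypothetical attribute domains matching the table above.
domains = {"X1": [True, False], "X2": [True, False], "X3": ["low", "medium", "high"]}

def completions(record):
    """All complete states compatible with a record; None marks a missing value."""
    values = [[record[v]] if record[v] is not None else domains[v] for v in domains]
    return set(product(*values))

d1 = {"X1": True, "X2": None, "X3": "high"}
print(completions(d1))
# {(True, True, 'high'), (True, False, 'high')} -- a subset of the state space,
# exactly the coarse observation corresponding to the missing X2.
```

A constraint such as "X1 = true or X2 = true" is just another subset of the state space, which is what makes the coarse data model more general than plain missing values.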

9. Coarse Data Model (Coarse Data)
◮ Finite set of states (possible worlds): W = {x1, . . . , xn}
◮ Complete data variable X with values in W, governed by a distribution P_θ (θ ∈ Θ).
◮ Incomplete data variable Y with values in 2^W, governed by a conditional distribution P_λ(· | X) (λ ∈ Λ).
Example with W = {x1, x2, x3}. P_θ and P_λ(Y = U | X = x):

        P_θ    {x1,x2,x3}  {x1,x2}  {x1,x3}  {x2,x3}  {x1}  {x2}  {x3}
x1      0.4    0.5         0.1      0.2      0        0.2   0     0
x2      0.4    1.0         0        0        0        0     0     0
x3      0.2    0           0        0.5      0.5      0     0     0

Joint P_{θ,λ}(X = x, Y = U) = P_θ(x) · P_λ(U | x):

        {x1,x2,x3}  {x1,x2}  {x1,x3}  {x2,x3}  {x1}  {x2}  {x3}
x1      0.2         0.04     0.08     0        0.08  0     0
x2      0.4         0        0        0        0     0     0
x3      0           0        0.1      0.1      0     0     0

The slide highlights the difference between the event X ∈ {x2, x3}, whose probability is read off P_θ, and the observation Y = {x2, x3}, whose probability is read off the joint.
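The joint distribution in the table is obtained by multiplying P_θ and P_λ; the snippet below reproduces the numbers on this slide (a sketch, with the subset labelling read off the reconstructed table).

```python
F = frozenset

p_theta = {"x1": 0.4, "x2": 0.4, "x3": 0.2}
p_lambda = {  # P(Y = U | X = x); every U with positive probability contains x
    "x1": {F({"x1", "x2", "x3"}): 0.5, F({"x1", "x2"}): 0.1, F({"x1", "x3"}): 0.2, F({"x1"}): 0.2},
    "x2": {F({"x1", "x2", "x3"}): 1.0},
    "x3": {F({"x1", "x3"}): 0.5, F({"x2", "x3"}): 0.5},
}

# Joint P_{theta,lambda}(X = x, Y = U) = P_theta(x) * P_lambda(U | x)
joint = {(x, U): p_theta[x] * q for x, row in p_lambda.items() for U, q in row.items()}

U = F({"x2", "x3"})
print(sum(p for (x, V), p in joint.items() if V == U))  # P(Y = {x2,x3}) = 0.1
print(sum(p_theta[x] for x in U))                       # P(X in {x2,x3}) = 0.6
```

The gap between the two printed numbers illustrates the earlier point that "X is observed to lie in U" and "X lies in U" are different events.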

10. Outline (The CAR Assumption)
Part 1 (Coarsened At Random): Introduction; Coarse Data; The CAR Assumption
Part 2 (CAR Models): Testing CAR; Support Analysis; Canonical Models
Part 3 (Learning Without CAR): AI&M and EM; Statistical CAR Tests

11. Learning from Coarse Data (The CAR Assumption)
Data: observations of Y: U = U_1, U_2, . . . , U_N, with U_i ∈ 2^W.
From the correct to the face-value likelihood:
L(θ, λ | U) = ∏_i P_{θ,λ}(Y = U_i)
            = ∏_i ∑_{x ∈ U_i} P_{θ,λ}(Y = U_i, X = x)
            = ∏_i ∑_{x ∈ U_i} P_θ(X = x) P_λ(Y = U_i | X = x)
            = ∏_i P_λ(Y = U_i | X ∈ U_i) ∑_{x ∈ U_i} P_θ(X = x)    [assumption: P_λ(Y = U_i | X = x) is constant for x ∈ U_i]
            = ∏_i P_λ(Y = U_i | X ∈ U_i) P_θ(U_i)
Profile likelihood: max_λ L(θ, λ | U) ∼ ∏_i P_θ(U_i)    (the face-value likelihood)
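In code, the face-value likelihood simply scores each coarse observation U_i by P_θ(X ∈ U_i); a minimal sketch with illustrative inputs:

```python
import math

def face_value_loglik(p_theta, observations):
    # log prod_i P_theta(X in U_i) = sum_i log sum_{x in U_i} P_theta(x)
    return sum(math.log(sum(p_theta[x] for x in U)) for U in observations)

p_theta = {"x1": 0.4, "x2": 0.4, "x3": 0.2}
observations = [{"x1"}, {"x1", "x2"}, {"x2", "x3"}, {"x1", "x2", "x3"}]
print(face_value_loglik(p_theta, observations))
```

Under the CAR assumption this is the objective that EM maximizes; without CAR it can be misleading, which is what Part 3 of the tutorial addresses.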

12. Inference by Conditioning (The CAR Assumption)
Observation: a value of Y: U ∈ 2^W. Updating to the posterior belief:
P_{θ,λ}(X = x | Y = U) = P_θ(X = x) P_λ(Y = U | X = x) / P_{θ,λ}(Y = U)
                       = P_θ(X = x) P_λ(Y = U | X ∈ U) / P_{θ,λ}(Y = U)    [assumption: P_λ(Y = U | X = x) is constant for x ∈ U]
                       = P_θ(X = x) P_{θ,λ}(X ∈ U | Y = U) / P_θ(X ∈ U)
                       = P_θ(X = x | X ∈ U)    [since X ∈ Y always, P_{θ,λ}(X ∈ U | Y = U) = 1]
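The conditioning rule that this derivation licenses is just renormalization within the observed set; a small sketch:

```python
def condition(p_theta, U):
    # Under CAR: P(X = x | Y = U) = P_theta(x) / P_theta(X in U) for x in U.
    mass = sum(p_theta[x] for x in U)
    return {x: p_theta[x] / mass for x in U}

p_theta = {"x1": 0.4, "x2": 0.4, "x3": 0.2}
print(condition(p_theta, {"x2", "x3"}))  # {'x2': 0.666..., 'x3': 0.333...}
```

Without CAR, the correct posterior also depends on the coarsening mechanism P_λ(Y = U | X = x), as the Monty Hall example shows.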

13. Essential CAR (The CAR Assumption)
Data (an observation) is coarsened at random (CAR) [1,2] if
    for all U: P_λ(Y = U | X = x) is constant for x ∈ U    (e-CAR)
The CAR assumption justifies
◮ learning by maximization of the face-value likelihood (EM algorithm)
◮ belief updating by conditioning
Is that it? . . . not quite. What does (e-CAR) mean:
◮ for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U}, or
◮ for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U, P_θ(X = x) > 0}?
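The condition can be checked directly on a finite coarsening table; in the sketch below (the helper name is ours) the optional support argument switches between the two readings discussed on this slide: all of U, or only its positive-probability states.

```python
def is_car(p_lambda, support=None, tol=1e-9):
    """p_lambda[x][U] = P(Y = U | X = x); missing entries count as 0.
    CAR holds if, for every U, these values agree for all x in U
    (restricted to `support` if one is given)."""
    sets = {U for row in p_lambda.values() for U in row}
    for U in sets:
        xs = [x for x in U if support is None or x in support]
        vals = [p_lambda.get(x, {}).get(U, 0.0) for x in xs]
        if vals and max(vals) - min(vals) > tol:
            return False
    return True

# Example: an observation {x1, x2} reported with probability 0.3 regardless of
# whether x1 or x2 is the true state satisfies the condition.
F = frozenset
print(is_car({"x1": {F({"x1", "x2"}): 0.3}, "x2": {F({"x1", "x2"}): 0.3}}))  # True
```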

14. Conditioning and Weak CAR (The CAR Assumption)
In the justification for conditioning:
P_θ(X = x) P_λ(Y = U | X = x) / P_{θ,λ}(Y = U) = P_θ(X = x) P_λ(Y = U | X ∈ U) / P_{θ,λ}(Y = U)
Needed: for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U, P_θ(X = x) > 0}    (w-CAR)

15. Profile Likelihood and Strong CAR (The CAR Assumption)
In the derivation of the face-value likelihood:
max_λ L(θ, λ | U) = max_λ ∏_i ∑_{x ∈ U_i} P_θ(X = x) P_λ(Y = U_i | X = x)
                  = max_λ ∏_i P_λ(Y = U_i | X ∈ U_i) P_θ(U_i)
                  ≈ ∏_i P_θ(U_i)
◮ This holds only if the domain of the λ-maximization is independent of θ ("parameter distinctness" [1])
◮ In particular, the domain of the λ-maximization must not depend on support(P_θ)
◮ If we assume only weak CAR, then the domain of the λ-maximization does depend on support(P_θ)
◮ Needed: for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U}    (s-CAR)

16. Examples (The CAR Assumption)
Strong CAR (first column: P_θ(x); remaining columns: P_λ(Y = U | X = x), where an entry is 0 whenever x ∉ U):

        P_θ    {x1}   {x2}   {x3}   {x1,x2}   {x1,x3}   {x2,x3}   {x1,x2,x3}
x1      0.4    0.3    0      0      0.2       0.4       0         0.1
x2      0.0    0      0.7    0      0.2       0         0         0.1
x3      0.6    0      0      0.5    0         0.4       0         0.1

For every U the entries for x ∈ U agree, so s-CAR holds.

Weak CAR, not strong CAR (P_θ(x2) = 0, so the coarsening probabilities for x2 are not constrained by w-CAR and are omitted):

        P_θ    {x1}   {x3}   {x1,x2}   {x1,x3}   {x2,x3}   {x1,x2,x3}
x1      0.4    0.1    0      0.6       0.2       0         0.1
x3      0.6    0      0.2    0         0.2       0.5       0.1

On the support {x1, x3} the entries agree for every U, so w-CAR holds; but no s-CAR completion exists: it would require P_λ({x1,x2} | x2) = 0.6, P_λ({x2,x3} | x2) = 0.5 and P_λ({x1,x2,x3} | x2) = 0.1, which sum to more than 1.
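Running the is_car helper from the e-CAR slide over the two tables above (re-entered here so the snippet is self-contained; the subset labels follow the reconstructed layout, so treat them as our reading of the slide):

```python
def is_car(p_lambda, support=None, tol=1e-9):
    sets = {U for row in p_lambda.values() for U in row}
    for U in sets:
        vals = [p_lambda.get(x, {}).get(U, 0.0) for x in U if support is None or x in support]
        if vals and max(vals) - min(vals) > tol:
            return False
    return True

F = frozenset
strong = {
    "x1": {F({"x1"}): 0.3, F({"x1", "x2"}): 0.2, F({"x1", "x3"}): 0.4, F({"x1", "x2", "x3"}): 0.1},
    "x2": {F({"x2"}): 0.7, F({"x1", "x2"}): 0.2, F({"x2", "x3"}): 0.0, F({"x1", "x2", "x3"}): 0.1},
    "x3": {F({"x3"}): 0.5, F({"x1", "x3"}): 0.4, F({"x2", "x3"}): 0.0, F({"x1", "x2", "x3"}): 0.1},
}
weak = {  # x2 has probability 0, so its row is left out
    "x1": {F({"x1"}): 0.1, F({"x1", "x2"}): 0.6, F({"x1", "x3"}): 0.2, F({"x1", "x2", "x3"}): 0.1},
    "x3": {F({"x3"}): 0.2, F({"x1", "x3"}): 0.2, F({"x2", "x3"}): 0.5, F({"x1", "x2", "x3"}): 0.1},
}
print(is_car(strong))                      # True: s-CAR (constant on all of U)
print(is_car(weak, support={"x1", "x3"}))  # True: w-CAR on the support
print(is_car(weak))                        # False: not s-CAR as given
```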

17. Example: Data (The CAR Assumption)
State space with 4 states (the joint values of two binary variables A, B), a parametric model, and empirical probabilities of the coarse observations in 13 data points:

Observed Y:       A      A↔B    ¬A↔B   {¬AB}
Empirical prob.:  6/13   3/13   3/13   1/13

Here "A" stands for the set {AB, A¬B}, "A↔B" for {AB, ¬A¬B}, "¬A↔B" for {A¬B, ¬AB}, and "{¬AB}" for the singleton.

Parametric model (A and B independent with P(A) = a, P(B) = b):
P(AB) = ab    P(A¬B) = a(1−b)    P(¬AB) = (1−a)b    P(¬A¬B) = (1−a)(1−b)

18. Example: Face-Value Likelihood (The CAR Assumption)
Face-value likelihood function for the parameters a, b:
[3D surface plot of the face-value likelihood over (a, b)]
Maximum at (a, b) ≈ (0.845, 0.636)
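The face-value likelihood on this slide can be maximized numerically; a short grid search (our sketch, using the observation sets and counts reconstructed from the data slide: "A" ×6, "A↔B" ×3, "¬A↔B" ×3, "{¬AB}" ×1):

```python
import numpy as np

a, b = np.meshgrid(np.linspace(1e-3, 1 - 1e-3, 999), np.linspace(1e-3, 1 - 1e-3, 999))
# State probabilities under the product model P(A) = a, P(B) = b.
p_AB, p_AnB, p_nAB, p_nAnB = a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)
# Face-value likelihood: each coarse observation U contributes P_theta(X in U)^count.
L = (p_AB + p_AnB) ** 6 * (p_AB + p_nAnB) ** 3 * (p_AnB + p_nAB) ** 3 * p_nAB
i, j = np.unravel_index(np.argmax(L), L.shape)
print(a[i, j], b[i, j])  # approximately (0.845, 0.636), as on the slide
```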

19. Example: Learned Distribution (The CAR Assumption)
Distribution learned under the s-CAR assumption. Left column: learned P_θ on W; remaining columns: the s-CAR coarsening parameters λi for each observed set (a dash marks sets that do not contain the state):

         P_θ     A     A↔B   ¬A↔B  {¬AB}
AB       0.54    λ1    λ2    -     -
A¬B      0.31    λ1    -     λ3    -
¬AB      0.10    -     -     λ3    λ4
¬A¬B     0.05    -     λ2    -     -

(Empirical distribution of Y, as before: A: 6/13, A↔B: 3/13, ¬A↔B: 3/13, {¬AB}: 1/13.)

Question: are there s-CAR parameters λ defining a joint distribution of X, Y with the learned marginal on W and the observed empirical distribution on 2^W?
No: ¬A¬B has positive learned probability and A↔B is the only observed set containing it, so λ2 = 1; by s-CAR the AB row then places all its mass on A↔B, forcing λ1 = 0, and hence P(Y = A) = 0 ≠ 6/13.

20. Example: w-CAR Likelihood (The CAR Assumption)
The profile likelihood under w-CAR differs from the face-value likelihood by constants that depend only on the support of P_θ [5]:
[3D surface plot of the w-CAR profile likelihood over (a, b)]
Maximum at (a, b) = (9/13, 1.0)

21. Example: Learned Distribution (The CAR Assumption)
Distribution learned under the w-CAR assumption. Left column: learned P_θ; remaining columns: coarsening probabilities P_λ(Y = U | X = x) (a dash marks sets that do not contain the state; rows with P_θ = 0 need no coarsening probabilities):

         P_θ     A     A↔B   ¬A↔B  {¬AB}
AB       9/13    2/3   1/3   -     -
A¬B      0.0     -     -     -     -
¬AB      4/13    -     -     3/4   1/4
¬A¬B     0.0     -     -     -     -

Question: are there w-CAR parameters λ defining a joint distribution of X, Y with the learned marginal on W and the observed empirical distribution on 2^W? Yes!
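That the w-CAR solution is consistent can be verified by pushing the learned marginal through the coarsening probabilities and recovering the empirical distribution of Y exactly; a check in exact arithmetic, with set labels following the reconstruction above:

```python
from fractions import Fraction as Fr

p_theta = {"AB": Fr(9, 13), "AnB": Fr(0), "nAB": Fr(4, 13), "nAnB": Fr(0)}
p_lambda = {  # coarsening probabilities for the two states in the support
    "AB":  {"A": Fr(2, 3), "A<->B": Fr(1, 3)},
    "nAB": {"nA<->B": Fr(3, 4), "{nAB}": Fr(1, 4)},
}

p_y = {}
for x, row in p_lambda.items():
    for U, q in row.items():
        p_y[U] = p_y.get(U, Fr(0)) + p_theta[x] * q

for U, p in p_y.items():
    print(U, p)  # A: 6/13, A<->B: 3/13, nA<->B: 3/13, {nAB}: 1/13
```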

22. Example: Summary (The CAR Assumption)
The following were jointly inconsistent:
◮ the observed empirical distribution of Y
◮ the distribution of X learned under the s-CAR assumption
◮ the s-CAR assumption
Jointly consistent were:
◮ the observed empirical distribution of Y
◮ the distribution of X learned under the w-CAR assumption
◮ the w-CAR assumption

23. CAR is everything? (The CAR Assumption)
Gill, van der Laan, Robins [3]: "CAR is everything". That is: for every distribution P of Y there exists a joint distribution of X, Y such that
◮ the marginal for Y is P
◮ the joint is s-CAR
For the empirical distribution of the example (A: 6/13, A↔B: 3/13, ¬A↔B: 3/13, {¬AB}: 1/13), one such joint is:

         P_θ     A      A↔B    ¬A↔B   {¬AB}
AB       7/14    7/13   6/13   -      -
A¬B      5/14    7/13   -      6/13   -
¬AB      2/14    -      -      6/13   7/13
¬A¬B     0       -      6/13   -      -

Each observed set receives the same coarsening probability from every state it contains, so the joint is s-CAR, and its marginal for Y is exactly the empirical distribution. (The remaining probability 7/13 in the ¬A¬B row goes to an unobserved set containing ¬A¬B; since P_θ(¬A¬B) = 0 this does not affect the distribution of Y.)
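The construction on this slide can be verified the same way as before: the induced distribution of Y equals the empirical one. A sketch in exact arithmetic, with the set labels from the reconstruction:

```python
from fractions import Fraction as Fr

p_theta = {"AB": Fr(7, 14), "AnB": Fr(5, 14), "nAB": Fr(2, 14), "nAnB": Fr(0)}
obs = {  # observed set -> (compatible states, common coarsening probability)
    "A":      ({"AB", "AnB"},  Fr(7, 13)),
    "A<->B":  ({"AB", "nAnB"}, Fr(6, 13)),
    "nA<->B": ({"AnB", "nAB"}, Fr(6, 13)),
    "{nAB}":  ({"nAB"},        Fr(7, 13)),
}

p_y = {U: lam * sum((p_theta[x] for x in states), Fr(0)) for U, (states, lam) in obs.items()}
for U, p in p_y.items():
    print(U, p)  # A: 6/13, A<->B: 3/13, nA<->B: 3/13, {nAB}: 1/13
```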
