Learning and Reasoning With Incomplete Data: Foundations and Algorithms
Manfred Jaeger
Machine Intelligence Group, Aalborg University
Tutorial, UAI 2010
Outline
Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests
Key References
[1] D. Rubin. Inference and Missing Data. Biometrika 63, 1976.
[2] D.F. Heitjan and D. Rubin. Ignorability and Coarse Data. Ann. Stats. 19, 1991.
[3] R.D. Gill, M.J. van der Laan and J.M. Robins. Coarsening at Random: Characterizations, Conjectures, Counter-Examples. Proc. 1st Seattle Symposium in Biostatistics, 1997.
[4] P.D. Grünwald and J.Y. Halpern. Updating Probabilities. JAIR 19, 2003.
[5] M. Jaeger. Ignorability for Categorical Data. Ann. Stats. 33, 2005.
[6] M. Jaeger. Ignorability in Statistical and Probabilistic Inference. JAIR 24, 2005.
[7] M. Jaeger. The AI&M Procedure for Learning from Incomplete Data. UAI 2006.
[8] M. Jaeger. On Testing the Missing at Random Assumption. ECML 2006.
[9] R.D. Gill and P.D. Grünwald. An Algorithmic and a Geometric Characterization of Coarsening at Random. Ann. Stats. 36, 2008.
Learning from Incomplete Data
Partially observed sequence of 10 coin tosses (heads/tails):
    h, t, ?, h, ?, h, ?, h, t, ?
"Face-value" likelihood function for estimating the probability θ of heads:
    L(θ) = P_θ(data) = ∏_{i=1}^{10} P_θ(d_i) = θ^4 · (1 − θ)^2 · 1^4
Maximized by θ = 2/3.
Is this correct if "?" means: not reported because . . .
◮ . . . the coin rolled off the table?
◮ . . . one observer does not know whether "harp" is heads or tails of the Irish Euro?
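As a quick numeric check, the following sketch (plain Python; the grid resolution and variable names are my own choices, not from the tutorial) evaluates the face-value likelihood of this sequence and recovers θ = 2/3:

```python
# Face-value likelihood for the partially observed coin-toss sequence.
# Observations: 'h', 't', or None for an unreported toss ("?").
data = ['h', 't', None, 'h', None, 'h', None, 'h', 't', None]

def face_value_likelihood(theta, data):
    """Product of P_theta(d_i); an unreported toss contributes the factor 1."""
    l = 1.0
    for d in data:
        if d == 'h':
            l *= theta
        elif d == 't':
            l *= 1.0 - theta
        # d is None: P_theta(Y = {h, t}) = 1 under the face-value treatment
    return l

# Grid search over theta (resolution 0.001 is an arbitrary choice).
thetas = [i / 1000.0 for i in range(1001)]
best = max(thetas, key=lambda t: face_value_likelihood(t, data))
print(best)  # approximately 0.667 = 2/3
```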
Inference by Conditioning
The famous Monty Hall problem: the contestant has chosen door 1; the host opens door 2, revealing a goat.
Argument for staying with the chosen door:
    P(prize = 1 | prize ≠ 2) = P(prize = 1) / P(prize ∈ {1, 3}) = (1/3) / (2/3) = 1/2
Argument for switching to door 3:
    "door 3 'inherits' the probability mass of door 2, and thus P(prize = 3) = 2/3"
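The gap between the two arguments shows up in a small simulation (a sketch, assuming the standard protocol: the contestant picks door 1, the host then opens an unchosen door hiding a goat, chosen uniformly when he has a choice):

```python
# Conditioning on the observation "host opened door 2" vs. conditioning only
# on the event "prize != 2".
import random

N = 200_000
opened2 = opened2_prize1 = 0   # trials where the host opened door 2
not2 = not2_prize1 = 0         # trials where the event "prize != 2" happened

for _ in range(N):
    prize = random.randint(1, 3)
    opened = random.choice([d for d in (2, 3) if d != prize])
    if opened == 2:
        opened2 += 1
        opened2_prize1 += (prize == 1)
    if prize != 2:
        not2 += 1
        not2_prize1 += (prize == 1)

print(opened2_prize1 / opened2)  # ~1/3: P(prize = 1 | host opened door 2) -> switching wins 2/3
print(not2_prize1 / not2)        # ~1/2: P(prize = 1 | prize != 2), the naive conditioning
```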
The Common Problem
Can we identify
    "X is observed"   ∼   "X has happened" ?
Coin tossing example: X: either h or t
Monty Hall: X: goat behind door 2
Outline (current section: Coarse Data)
Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests
Missing Values and Coarse Data
Data set with missing values:
         X1     X2     X3
    d1   true   ?      high
    d2   false  false  ?
    d3   true   ?      medium
Other types of incompleteness:
◮ Partly observed values: X3 ≠ high
◮ Constraints on multiple variables: X1 = true or X2 = true
Coarse data model [2]: incomplete observations can correspond to any subset of complete observations
◮ More general than missing values
◮ Same as partial information in probability updating (cf. prize ∈ {1, 3})
◮ Simplifies theoretical analysis
Coarse Data Model
◮ Finite set of states (possible worlds): W = {x1, . . . , xn}
◮ Complete data variable X with values in W, governed by distribution P_θ (θ ∈ Θ)
◮ Incomplete data variable Y with values in 2^W, governed by conditional distribution P_λ(· | X) (λ ∈ Λ)
Example with W = {x1, x2, x3}:
P_θ on the X space:   x1: 0.4   x2: 0.4   x3: 0.2
P_λ(Y = U | X = x) (rows: x; columns: U; "-": x ∉ U):
          {x1,x2,x3}  {x1,x2}  {x1,x3}  {x2,x3}  {x1}  {x2}  {x3}
    x1       0.5        0.1      0.2       -      0.2    -     -
    x2       1.0         0        -        0       -     0     -
    x3        0          -       0.5      0.5      -     -     0
Joint P_{θ,λ}(X = x, Y = U) = P_θ(x) · P_λ(U | x):
          {x1,x2,x3}  {x1,x2}  {x1,x3}  {x2,x3}  {x1}  {x2}  {x3}
    x1       0.2        0.04     0.08      -     0.08    -     -
    x2       0.4         0        -        0       -     0     -
    x3        0          -       0.1      0.1      -     -     0
Note the difference between the event X ∈ {x2, x3} (probability 0.6) and the observation Y = {x2, x3} (probability 0.1).
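A minimal sketch of this model, using the example numbers above (variable names are my own):

```python
# Coarse data model on W = {x1, x2, x3}: the joint is
# P(X = x, Y = U) = P_theta(x) * P_lambda(U | x).
from itertools import chain, combinations

W = ('x1', 'x2', 'x3')
P_theta = {'x1': 0.4, 'x2': 0.4, 'x3': 0.2}

# P_lambda(Y = U | X = x); only sets U containing x may get positive mass.
P_lambda = {
    'x1': {('x1', 'x2', 'x3'): 0.5, ('x1', 'x2'): 0.1, ('x1', 'x3'): 0.2, ('x1',): 0.2},
    'x2': {('x1', 'x2', 'x3'): 1.0},
    'x3': {('x1', 'x3'): 0.5, ('x2', 'x3'): 0.5},
}

def joint(x, U):
    return P_theta[x] * P_lambda[x].get(U, 0.0)

# Marginal distribution of the observation Y over all nonempty subsets of W.
subsets = [s for s in chain.from_iterable(combinations(W, r) for r in (1, 2, 3))]
P_Y = {U: sum(joint(x, U) for x in W if x in U) for U in subsets}

print(P_Y[('x2', 'x3')])              # 0.1: probability of observing Y = {x2, x3}
print(P_theta['x2'] + P_theta['x3'])  # 0.6: probability of the event X in {x2, x3}
```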
Outline (current section: The CAR Assumption)
Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests
Learning from Coarse Data
Data: observations of Y:   U = U_1, U_2, . . . , U_N,   U_i ∈ 2^W.
From correct to face-value likelihood:
    L(θ, λ | U) = ∏_i P_{θ,λ}(Y = U_i)
                = ∏_i Σ_{x ∈ U_i} P_{θ,λ}(Y = U_i, X = x)
                = ∏_i Σ_{x ∈ U_i} P_θ(X = x) · P_λ(Y = U_i | X = x)
                = ∏_i P_λ(Y = U_i | X ∈ U_i) · Σ_{x ∈ U_i} P_θ(X = x)    [assumption: P_λ(Y = U_i | X = x) constant for x ∈ U_i]
                = ∏_i P_λ(Y = U_i | X ∈ U_i) · P_θ(U_i)
Profile likelihood:   max_λ L(θ, λ | U)  ∼  ∏_i P_θ(U_i)   (the face-value likelihood)
Inference by Conditioning
Observation: value of Y:   U ∈ 2^W.
Updating to posterior belief:
    P_{θ,λ}(X = x | Y = U) = P_θ(X = x) · P_λ(Y = U | X = x) / P_{θ,λ}(Y = U)
                           = P_θ(X = x) · P_λ(Y = U | X ∈ U) / P_{θ,λ}(Y = U)    [assumption: P_λ(Y = U | X = x) constant for x ∈ U]
                           = P_θ(X = x) · P_{θ,λ}(X ∈ U | Y = U) / P_θ(X ∈ U)
                           = P_θ(X = x | X ∈ U)
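To see what the constancy assumption buys, here is a small check reusing the (non-CAR) example from the coarse data model slide (a sketch; names are my own): for the observation U = {x1, x3}, the true posterior and naive conditioning disagree.

```python
# True posterior P(X = x | Y = U) vs. naive conditioning P_theta(X = x | X in U),
# using the non-CAR example tables from the coarse data model slide.
P_theta = {'x1': 0.4, 'x2': 0.4, 'x3': 0.2}
P_lambda = {  # P(Y = U | X = x), only the entries needed for U = {x1, x3}
    ('x1', 'x3'): {'x1': 0.2, 'x3': 0.5},
}

U = ('x1', 'x3')
P_Y_U = sum(P_theta[x] * P_lambda[U][x] for x in U)

posterior = {x: P_theta[x] * P_lambda[U][x] / P_Y_U for x in U}
conditioned = {x: P_theta[x] / sum(P_theta[y] for y in U) for x in U}

print(posterior)    # {'x1': 0.444..., 'x3': 0.555...}
print(conditioned)  # {'x1': 0.666..., 'x3': 0.333...}
# They differ because P(Y = U | X = x) is not constant on U (0.2 vs. 0.5), i.e. not CAR.
```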
Essential CAR
Data (observation) is coarsened at random (CAR) [1,2] if
    for all U: P_λ(Y = U | X = x) is constant for x ∈ U    (e-CAR)
The CAR assumption justifies
◮ learning by maximization of the face-value likelihood (EM algorithm)
◮ belief updating by conditioning
Is that it? . . . not quite . . . what does (e-CAR) mean:
◮ for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U}, or
◮ for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U, P_θ(X = x) > 0} ?
Conditioning and Weak CAR
In the justification for conditioning we used:
    P_θ(X = x) · P_λ(Y = U | X = x) / P_{θ,λ}(Y = U) = P_θ(X = x) · P_λ(Y = U | X ∈ U) / P_{θ,λ}(Y = U)
Needed:
    for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U, P_θ(X = x) > 0}    (w-CAR)
Profile Likelihood and Strong CAR
In the derivation of the face-value likelihood:
    max_λ L(θ, λ | U) = max_λ ∏_i Σ_{x ∈ U_i} P_θ(X = x) · P_λ(Y = U_i | X = x)
                      = max_λ ∏_i P_λ(Y = U_i | X ∈ U_i) · P_θ(U_i)
                      ≈ ∏_i P_θ(U_i)
◮ This is only valid if the domain of the λ-maximization is independent of θ ("parameter distinctness" [1])
◮ In particular, the domain of the λ-maximization must not depend on support(P_θ)
◮ If we assume only weak CAR, then the domain of the λ-maximization does depend on support(P_θ)
◮ Needed:
    for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U}    (s-CAR)
Examples
(W = {x1, x2, x3}; left column: P_θ; remaining columns: P_λ(Y = U | X = x), with {x} the singleton of the row's state; "-" marks cells with x ∉ U, and in the second table the x2 row is left unspecified.)
Strong CAR:
        P_θ  |  {x}   {x1,x2}  {x1,x3}  {x2,x3}  {x1,x2,x3}
  x1    0.4  |  0.3     0.2      0.4       -        0.1
  x2    0.0  |  0.7     0.2       -        0        0.1
  x3    0.6  |  0.5      -       0.4       0        0.1
Weak CAR, not strong CAR (constancy required only on support(P_θ) = {x1, x3}):
        P_θ  |  {x}   {x1,x2}  {x1,x3}  {x2,x3}  {x1,x2,x3}
  x1    0.4  |  0.1     0.6      0.2       -        0.1
  x2    0.0  |   -       -        -        -         -
  x3    0.6  |  0.2      -       0.2      0.5       0.1
In the second table P_λ is constant on the support {x1, x3} for every set (0.2 on {x1,x3}, 0.1 on W), but it cannot be completed to strong CAR: the x2 row would need 0.6 on {x1,x2}, 0.5 on {x2,x3} and 0.1 on W, which already exceeds 1.
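Such checks can be mechanized. The helper below is my own sketch (not from the tutorial): s-CAR requires P_λ(Y = U | X = x) to be constant over all x ∈ U, w-CAR only over those x ∈ U in the support of P_θ; it is applied to the second table above.

```python
# Check s-CAR / w-CAR for a coarsening mechanism given as a dict
# {x: {U: P(Y=U | X=x)}}, where each U is a frozenset of states.
def is_car(P_lambda, states, support=None, tol=1e-9):
    """s-CAR if support is None (constancy on all of U), else w-CAR w.r.t. support."""
    relevant = set(states) if support is None else set(support)
    sets = {U for row in P_lambda.values() for U in row}
    for U in sets:
        vals = [P_lambda.get(x, {}).get(U, 0.0) for x in U if x in relevant]
        if vals and max(vals) - min(vals) > tol:
            return False
    return True

W = ['x1', 'x2', 'x3']
f = frozenset
# The "weak CAR, not strong CAR" table above (unspecified x2 row treated as zero).
weak = {
    'x1': {f({'x1'}): 0.1, f({'x1', 'x2'}): 0.6, f({'x1', 'x3'}): 0.2, f(W): 0.1},
    'x3': {f({'x3'}): 0.2, f({'x1', 'x3'}): 0.2, f({'x2', 'x3'}): 0.5, f(W): 0.1},
}
print(is_car(weak, W))                        # False: not s-CAR (and no completion of x2 could be)
print(is_car(weak, W, support=['x1', 'x3']))  # True: w-CAR on the support {x1, x3}
```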
Example: Data
State space with 4 states, W = {AB, A¬B, ¬AB, ¬A¬B}, and parametric model (A and B independent):
    AB:    ab            A¬B:    a(1 − b)
    ¬AB:   (1 − a)b      ¬A¬B:   (1 − a)(1 − b)
Observed values of Y, with empirical probabilities from 13 observations:
    A = {AB, A¬B}:            6/13
    A↔B = {AB, ¬A¬B}:         3/13
    ¬(A↔B) = {A¬B, ¬AB}:      3/13
    ¬AB (the single state):   1/13
Example: Face-Value Likelihood
Face-value likelihood function for the parameters a, b:
[3D surface plot of the face-value likelihood L(a, b) over (a, b) ∈ [0, 1]²]
Maximum at (a, b) ≈ (0.845, 0.636).
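As a numeric check, the following sketch (my own; it assumes the observed sets and counts given on the data slide and uses a crude grid search) reproduces the reported maximum:

```python
# Face-value likelihood for the 4-state example: counts 6, 3, 3, 1 for the
# observed sets A, A<->B, not(A<->B), and the single state (not A, B).
import itertools
import math

def face_value_loglik(a, b):
    p_A    = a                          # P(AB) + P(A, not B) = ab + a(1-b)
    p_iff  = a * b + (1 - a) * (1 - b)  # P(A <-> B)
    p_niff = a * (1 - b) + (1 - a) * b  # P(not (A <-> B))
    p_nAB  = (1 - a) * b                # P(not A, B)
    return 6 * math.log(p_A) + 3 * math.log(p_iff) + 3 * math.log(p_niff) + math.log(p_nAB)

grid = [i / 1000.0 for i in range(1, 1000)]   # avoid 0 and 1 (log of zero)
a_hat, b_hat = max(itertools.product(grid, grid),
                   key=lambda ab: face_value_loglik(*ab))
print(a_hat, b_hat)   # approximately (0.845, 0.636)
```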
Example: Learned Distribution
Distribution learned under the s-CAR assumption (the face-value maximum (a, b) ≈ (0.845, 0.636)), together with the s-CAR parameters λ that would be required (columns: the observed sets, with their empirical probabilities):
                A      A↔B    ¬(A↔B)   ¬AB
               6/13    3/13    3/13    1/13
    AB    0.54  λ1      λ2      -       -
    A¬B   0.31  λ1      -      λ3       -
    ¬AB   0.10  -       -      λ3      λ4
    ¬A¬B  0.05  -       λ2      -       -
Question: are there s-CAR λ-parameters defining a joint distribution of X, Y with the learned marginal on W and the observed empirical distribution on 2^W?
No: λ2 = 1 (¬A¬B has positive probability and lies only in A↔B) ⇒ λ1 = 0 ⇒ P(Y = A) = 0 ≠ 6/13.
Example: w-CAR Likelihood
The profile likelihood under w-CAR differs from the face-value likelihood by set-of-support specific constants [5].
[3D surface plot of the w-CAR profile likelihood over (a, b) ∈ [0, 1]²]
Maximum at (a, b) = (9/13, 1.0).
Example: Learned Distribution
Distribution learned under the w-CAR assumption, with w-CAR parameters λ (columns: the observed sets, with their empirical probabilities):
                A      A↔B    ¬(A↔B)   ¬AB
               6/13    3/13    3/13    1/13
    AB    9/13  2/3    1/3      -       -
    A¬B   0.0    -      -       -       -
    ¬AB   4/13   -      -      3/4     1/4
    ¬A¬B  0.0    -      -       -       -
Question: are there w-CAR λ-parameters defining a joint distribution of X, Y with the learned marginal on W and the observed empirical distribution on 2^W?
Yes!
Example: Summary
The following were jointly inconsistent:
◮ the observed empirical distribution of Y
◮ the distribution of X learned under the s-CAR assumption
◮ the s-CAR assumption
Jointly consistent were:
◮ the observed empirical distribution of Y
◮ the distribution of X learned under the w-CAR assumption
◮ the w-CAR assumption
CAR is Everything?
Gill, van der Laan, Robins [3]: "CAR is everything".
That is: for every distribution P of Y there exists a joint distribution of X, Y such that
◮ the marginal for Y is P
◮ the joint is s-CAR
For the example (left column: the X-marginal; remaining columns: the s-CAR parameters P_λ(Y = U | X = x)):
                A      A↔B    ¬(A↔B)   ¬AB
               6/13    3/13    3/13    1/13
    AB    7/14  7/13   6/13     -       -
    A¬B   5/14  7/13    -      6/13     -
    ¬AB   2/14   -      -      6/13    7/13
    ¬A¬B   0     -     6/13     -       -
(The remaining 7/13 of the ¬A¬B row goes to the singleton {¬A¬B}, which has probability 0 and therefore does not affect the marginal of Y.)
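The construction can be verified mechanically. The sketch below (my own; the state and set names are ad hoc) checks that the joint above is s-CAR, that each row is a distribution, and that its Y-marginal reproduces the empirical probabilities 6/13, 3/13, 3/13, 1/13:

```python
# Verify the "CAR is everything" construction for the 4-state example.
from fractions import Fraction as F

# Observed sets: A, A<->B, not(A<->B), and the single state (not A, B).
A, iff, niff, nAB_set = (
    frozenset({'AB', 'AnB'}), frozenset({'AB', 'nAnB'}),
    frozenset({'AnB', 'nAB'}), frozenset({'nAB'}),
)
singleton_nAnB = frozenset({'nAnB'})   # extra set so the nAnB row sums to 1

P_X = {'AB': F(7, 14), 'AnB': F(5, 14), 'nAB': F(2, 14), 'nAnB': F(0)}
P_lambda = {
    'AB':   {A: F(7, 13), iff: F(6, 13)},
    'AnB':  {A: F(7, 13), niff: F(6, 13)},
    'nAB':  {niff: F(6, 13), nAB_set: F(7, 13)},
    'nAnB': {iff: F(6, 13), singleton_nAnB: F(7, 13)},
}

# 1) Each row of P_lambda is a distribution.
assert all(sum(row.values()) == 1 for row in P_lambda.values())
# 2) s-CAR: P(Y = U | X = x) is the same for every x in U (zero-probability states included).
sets = {U for row in P_lambda.values() for U in row}
for U in sets:
    assert len({P_lambda[x].get(U, F(0)) for x in U}) == 1
# 3) The Y-marginal reproduces the empirical distribution.
P_Y = {U: sum(P_X[x] * P_lambda[x].get(U, F(0)) for x in U) for U in sets}
assert [P_Y[U] for U in (A, iff, niff, nAB_set)] == [F(6, 13), F(3, 13), F(3, 13), F(1, 13)]
print("s-CAR joint reproduces the empirical Y-distribution")
```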