Probabilistic Graphical Models
Learning with partial observations
Siamak Ravanbakhsh, Fall 2019
Learning objectives
- different types of missing data
- learning with missing data and hidden variables: directed models, undirected models
- develop an intuition for expectation maximization and its variational interpretation
Two settings for partial observations
- missing data: each instance in $\mathcal{D}$ is missing some values
- hidden variables: variables that are never observed

why model hidden variables? hidden variables can act as original causes, mediating variables, or effects; latent variable models, where the observations share a common (latent) cause, are widely used in machine learning. (image credit: Murphy's book)
Missing data
observation mechanism:
1. generate the data point $X = [X_1, \ldots, X_D]$
2. decide which values to observe: $O_X = [1, 0, \ldots, 0, 1]$ (1 = observe, 0 = hide)
we observe $X_o$ while $X_h$ is missing ($X = [X_o; X_h]$)

missing completely at random (MCAR): $P(X, O_X) = P(X)\, P(O_X)$
thumb-tack example:
- throw to generate the value: $p(x) = \theta^x (1-\theta)^{1-x}$
- throw again to decide show/hide: $p(o) = \psi^o (1-\psi)^{1-o}$
(a small simulation sketch follows)
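A minimal simulation of this MCAR observation mechanism, assuming a Bernoulli thumb-tack; the values of theta, psi, and the sample size below are arbitrary illustration choices, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
theta, psi, M = 0.3, 0.6, 10_000          # assumed parameter values, for illustration

x = rng.binomial(1, theta, size=M)        # step 1: generate the data point x ~ Bernoulli(theta)
o = rng.binomial(1, psi, size=M)          # step 2: decide independently whether to observe it (MCAR)
x_obs = np.where(o == 1, x, -1)           # -1 marks a hidden value; we only ever see x_obs and o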
Learning with MCAR
missing completely at random (MCAR): $P(X, O_X) = P(X)\, P(O_X)$
- throw to generate the value: $p(x) = \theta^x (1-\theta)^{1-x}$
- throw to decide show/hide: $p(o) = \psi^o (1-\psi)^{1-o}$

objective: learn a model for $X$ from the data $\mathcal{D} = \{x_o^{(1)}, \ldots, x_o^{(M)}\}$, where each $x_o$ may include values for a different subset of the variables.
since $P(X, O_X) = P(X)\, P(O_X)$, we can ignore the observation patterns and optimize
$\ell(\mathcal{D}, \theta) = \sum_{x_o \in \mathcal{D}} \log \sum_{x_h} p(x_o, x_h)$
(a sketch of the resulting estimate for the thumb-tack follows)
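Continuing the simulation above, a sketch of why the observation pattern can be ignored under MCAR: a hidden flip contributes $\sum_{x_h} p(x_h) = 1$ to the likelihood, so the maximizer of $\ell(\mathcal{D}, \theta)$ is simply the mean of the observed flips.

# MLE of theta under MCAR: hidden instances contribute log 1 = 0 to the
# objective, so only the observed flips enter the likelihood.
theta_hat = x_obs[o == 1].mean()
print(theta_hat)                          # close to the theta used in the simulation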
A more general criterion
missing at random (MAR): $O_X \perp X_h \mid X_o$
if there is information about the observation pattern $O_X$ in $X_h$, then it is also in $X_o$.

example: throw the thumb-tack twice, $X = [X_1, X_2]$; $X_1$ is always shown, and $X_2$ is hidden if $X_1 = 1$.
this is missing at random, but not missing completely at random, since the observation pattern depends on the observed $X_1$.

there is no "extra" information in the observation pattern, so we can ignore it and optimize
$\ell(\mathcal{D}, \theta) = \sum_{x_o \in \mathcal{D}} \log \sum_{x_h} p(x_o, x_h)$
(a small simulation of this example follows)
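A small simulation of this MAR example, assuming both throws share the same parameter theta (the numbers below are placeholders): the pattern $O_{X_2}$ depends only on the observed $X_1$, so maximizing the observed-data likelihood still recovers theta.

import numpy as np

rng = np.random.default_rng(1)
theta, M = 0.3, 10_000                     # assumed parameter value and sample size
x1 = rng.binomial(1, theta, size=M)        # first throw, always observed
x2 = rng.binomial(1, theta, size=M)        # second throw
observed2 = (x1 == 0)                      # hide X_2 exactly when X_1 = 1 (MAR, not MCAR)

# maximizing sum_{x_o} log sum_{x_h} p(x_o, x_h): hidden X_2's contribute log 1 = 0,
# so the MLE of theta is the mean over all observed flips
theta_hat = np.concatenate([x1, x2[observed2]]).mean()
print(theta_hat)                           # close to the theta used in the simulation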
Likelihood function: the marginal likelihood for partial observations

fully observed data:
- directed: the likelihood decomposes
- undirected: does not decompose, but it is concave

partially observed data: the log-likelihood
$\ell(\mathcal{D}, \theta) = \sum_{x_o \in \mathcal{D}} \log \sum_{x_h} p(x_o, x_h)$
sums, inside the log, the likelihoods for each single assignment to the latent variables;
it does not decompose and is no longer concave.
Likelihood function: example
the marginal likelihood for a directed model $x \to y$, $x \to z$

fully observed case decomposes:
$\ell(\mathcal{D}, \theta) = \sum_{(x,y,z) \in \mathcal{D}} \log p(x, y, z) = \sum_{x \in \mathcal{D}} \log p(x) + \sum_{(x,y) \in \mathcal{D}} \log p(y \mid x) + \sum_{(x,z) \in \mathcal{D}} \log p(z \mid x)$

if $x$ is always missing (e.g., in a latent variable model):
$\ell(\mathcal{D}, \theta) = \sum_{(y,z) \in \mathcal{D}} \log \sum_x p(x)\, p(y \mid x)\, p(z \mid x)$
the sum over $x$ sits inside the log, so we cannot decompose it! (a small numeric sketch follows)
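A numeric sketch of this non-decomposing objective for a tiny version of the model $x \to y$, $x \to z$ with binary variables; the CPT values and the data below are made up for illustration.

import numpy as np

# hypothetical CPTs for binary x, y, z (placeholder numbers)
p_x = np.array([0.6, 0.4])                           # p(x)
p_y_x = np.array([[0.9, 0.1],                        # p(y | x), rows indexed by x
                  [0.2, 0.8]])
p_z_x = np.array([[0.7, 0.3],                        # p(z | x), rows indexed by x
                  [0.4, 0.6]])

def marginal_loglik(data_yz):
    """sum_{(y,z) in D} log sum_x p(x) p(y|x) p(z|x); the sum over the hidden x
    sits inside the log, so the objective does not split into one term per CPT."""
    ll = 0.0
    for y, z in data_yz:
        ll += np.log(np.sum(p_x * p_y_x[:, y] * p_z_x[:, z]))
    return ll

print(marginal_loglik([(0, 0), (1, 1), (0, 1)]))     # made-up observations of (y, z)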
Parameter learning with missing data

directed models:
- option 1: obtain the gradient of the marginal likelihood
- option 2: expectation maximization (EM) and its variational interpretation

undirected models:
- obtain the gradient of the marginal likelihood
- EM is not a good option here

all of these options need inference at each step of learning
(a minimal EM sketch for the example above follows)
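For the latent variable example above ($x$ always hidden), a minimal EM sketch; it reuses the CPT arrays from the previous snippet, and the data are placeholders. Note that each E-step is an inference call, which is the point of the remark above.

import numpy as np

def em_step(data_yz, p_x, p_y_x, p_z_x):
    """one EM update: the E-step computes p(x | y, z) for every instance (inference),
    the M-step re-estimates each CPT from the resulting expected counts."""
    nx = np.zeros(2)                    # expected counts for x
    nxy = np.zeros((2, 2))              # expected counts for (x, y)
    nxz = np.zeros((2, 2))              # expected counts for (x, z)
    for y, z in data_yz:
        post = p_x * p_y_x[:, y] * p_z_x[:, z]
        post /= post.sum()              # E-step: posterior p(x | y, z)
        nx += post
        nxy[:, y] += post
        nxz[:, z] += post
    # M-step: normalize the expected counts into new CPTs
    return nx / nx.sum(), nxy / nx[:, None], nxz / nx[:, None]

# usage: iterate until the marginal log-likelihood stops improving
# p_x, p_y_x, p_z_x = em_step(data_yz, p_x, p_y_x, p_z_x)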
Gradient of the marginal likelihood (directed models)

example: $a \to c \leftarrow b$, $c \to d$, with $b$ and $c$ hidden
log marginal likelihood:
$\ell(\mathcal{D}) = \sum_{(a,d) \in \mathcal{D}} \log \sum_{b,c} p(a)\, p(b)\, p(c \mid a, b)\, p(d \mid c)$

take the derivative with respect to a CPT entry $p(d' \mid c')$:
$\frac{\partial}{\partial p(d' \mid c')} \ell(\mathcal{D}) = \sum_{(a,d) \in \mathcal{D}} \frac{1}{p(d' \mid c')}\, p(d', c' \mid a, d)$
we need inference to compute $p(d', c' \mid a, d)$.

what happens to this expression if every variable is observed?
Gradient of the marginal likelihood (directed models)

for a Bayesian network with CPTs:
$\frac{\partial}{\partial p(x_i \mid pa_{x_i})} \ell(\mathcal{D}) = \sum_{x_o \in \mathcal{D}} \frac{1}{p(x_i \mid pa_{x_i})}\, p(x_i, pa_{x_i} \mid x_o)$
for each specific assignment $(x_i, pa_{x_i})$; this requires running inference for each observation.

a technical issue: the gradient is always non-negative, and nothing enforces the constraint $\sum_{x_i} p(x_i \mid pa_{x_i}) = 1$;
reparametrize (e.g., using softmax) or use Lagrange multipliers.
(a sketch of this gradient for the example above follows)
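A sketch of this gradient for the four-variable example above (observed $a$ and $d$, hidden $b$ and $c$), with the hidden variables summed out by brute-force enumeration; the CPT values and data are placeholders. In practice one would push this through a softmax reparametrization of the CPT (or add Lagrange multipliers) before taking gradient steps, as noted above.

import numpy as np

# hypothetical CPTs for binary a, b, c, d (placeholder numbers)
p_a = np.array([0.5, 0.5])                           # p(a)
p_b = np.array([0.7, 0.3])                           # p(b)
p_c_ab = np.array([[[0.9, 0.1], [0.6, 0.4]],         # p(c | a, b), indexed [a, b, c]
                   [[0.3, 0.7], [0.2, 0.8]]])
p_d_c = np.array([[0.8, 0.2],                        # p(d | c), indexed [c, d]
                  [0.1, 0.9]])

def grad_wrt_p_d_c(data_ad):
    """gradient of the log marginal likelihood w.r.t. each CPT entry p(d'|c'):
    sum over (a, d) in the data of p(d', c' | a, d) / p(d' | c'); the hidden b, c
    are summed out by enumeration, which is the per-observation inference step."""
    grad = np.zeros_like(p_d_c)
    for a, d in data_ad:
        # unnormalized joint over the hidden (b, c) given the observed (a, d)
        joint_bc = p_a[a] * p_b[:, None] * p_c_ab[a] * p_d_c[:, d][None, :]
        p_c_given_ad = joint_bc.sum(axis=0) / joint_bc.sum()   # p(c | a, d)
        # only entries with d' = d contribute, since p(d', c' | a, d) = 0 otherwise
        grad[:, d] += p_c_given_ad / p_d_c[:, d]
    return grad

print(grad_wrt_p_d_c([(0, 1), (1, 0), (1, 1)]))      # made-up observations of (a, d)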