
Probabilistic Graphical Models: Learning with Partial Observations - PowerPoint PPT Presentation



  1. Probabilistic Graphical Models: Learning with partial observations. Siamak Ravanbakhsh, Fall 2019.

  2. Learning objectives: different types of missing data; learning with missing data and hidden variables in directed and undirected models; developing an intuition for expectation maximization and its variational interpretation.

  3. Two settings for partial observations. Missing data: each instance in D is missing some values.

  4. Two settings for partial observations. Missing data: each instance in D is missing some values. Hidden variables: variables that are never observed. Why model hidden variables? A hidden variable can play the role of an original cause, a mediating variable, or an effect (image credit: Murphy's book).

  5. Two settings for partial observations. Missing data: each instance in D is missing some values. Hidden variables: variables that are never observed; they can play the role of an original cause, a mediating variable, or an effect. Latent variable models, in which the observations have a common hidden cause, are widely used in machine learning (image credit: Murphy's book).

  6. Missing data. Observation mechanism: first generate the data point X = [X_1, …, X_D], then decide which values to observe, O_X = [1, 0, …, 0, 1] (1 = observe, 0 = hide).

  7. Missing data. Observation mechanism: first generate the data point X = [X_1, …, X_D], then decide which values to observe, O_X = [1, 0, …, 0, 1] (1 = observe, 0 = hide). We observe X_o while X_h is missing (X = [X_o; X_h]).

  8. Missing data. Observation mechanism: generate X = [X_1, …, X_D], then decide the values to observe, O_X = [1, 0, …, 0, 1]; we observe X_o while X_h is missing (X = [X_o; X_h]). Missing completely at random (MCAR): P(X, O_X) = P(X) P(O_X). Thumbtack example: throw a thumbtack to generate the value, p(x) = θ^x (1 − θ)^{1−x}.

  9. Missing data. Observation mechanism: generate X = [X_1, …, X_D], then decide the values to observe, O_X = [1, 0, …, 0, 1]; we observe X_o while X_h is missing (X = [X_o; X_h]). Missing completely at random (MCAR): P(X, O_X) = P(X) P(O_X). Thumbtack example: throw a thumbtack to generate the value, p(x) = θ^x (1 − θ)^{1−x}, and throw a second one to decide show/hide, p(o) = ψ^o (1 − ψ)^{1−o}.
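
  As a quick illustration of the MCAR mechanism on this slide, the following minimal sketch draws the value and the show/hide decision from two independent Bernoullis; it assumes numpy, and the names sample_mcar, theta, and psi are illustrative rather than from the slides:

      import numpy as np

      rng = np.random.default_rng(0)

      def sample_mcar(theta, psi, n):
          # generate the data points: x ~ Bernoulli(theta), i.e. p(x) = theta^x (1 - theta)^(1 - x)
          x = rng.binomial(1, theta, size=n)
          # decide show/hide independently of x: o ~ Bernoulli(psi)
          o = rng.binomial(1, psi, size=n)
          # keep x where o == 1, mark it missing (NaN) where o == 0
          x_obs = np.where(o == 1, x.astype(float), np.nan)
          return x, o, x_obs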

  10. Learning with MCAR. Missing completely at random (MCAR): P(X, O_X) = P(X) P(O_X). Thumbtack example: throw to generate, p(x) = θ^x (1 − θ)^{1−x}; throw to decide show/hide, p(o) = ψ^o (1 − ψ)^{1−o}.

  11. Learning with MCAR. Missing completely at random (MCAR): P(X, O_X) = P(X) P(O_X); p(x) = θ^x (1 − θ)^{1−x}, p(o) = ψ^o (1 − ψ)^{1−o}. Objective: learn a model for X from the data D = {x_o^{(1)}, …, x_o^{(M)}}, where each x_o may include values for a different subset of variables.

  12. Learning with MCAR. Missing completely at random (MCAR): P(X, O_X) = P(X) P(O_X); p(x) = θ^x (1 − θ)^{1−x}, p(o) = ψ^o (1 − ψ)^{1−o}. Objective: learn a model for X from the data D = {x_o^{(1)}, …, x_o^{(M)}}, where each x_o may include values for a different subset of variables. Since P(X, O_X) = P(X) P(O_X), we can ignore the observation patterns and optimize ℓ(D, θ) = Σ_{x_o ∈ D} log Σ_{x_h} p(x_o, x_h).
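
  Continuing the earlier thumbtack sketch, under MCAR the hidden entries sum out of ℓ(D, θ), so the maximum-likelihood estimate of θ uses only the observed throws (mle_theta_mcar is an illustrative name, not from the slides):

      def mle_theta_mcar(x_obs):
          # under MCAR the observation pattern factors out of the likelihood, so
          # maximizing sum_m log sum_{x_h} p(x_o, x_h) over theta reduces to the
          # usual MLE computed from the observed throws only
          observed = x_obs[~np.isnan(x_obs)]
          return observed.mean()

      # usage with the earlier sketch:
      # x, o, x_obs = sample_mcar(theta=0.3, psi=0.7, n=10_000)
      # mle_theta_mcar(x_obs)  # should be close to 0.3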

  13. A more general criterion. Missing at random (MAR): O_X ⊥ X_h ∣ X_o; if there is information about the observation pattern O_X in X_h, then it is also in X_o.

  14. A more general criterion. Missing at random (MAR): O_X ⊥ X_h ∣ X_o; if there is information about the observation pattern O_X in X_h, then it is also in X_o. Example: throw the thumbtack twice, X = [X_1, X_2], and always show X_1. Hiding X_2 whenever X_1 = 1 (and showing it otherwise) is missing at random but not missing completely at random, since the observation pattern depends on the observed X_1.

  15. A more general criterion. Missing at random (MAR): O_X ⊥ X_h ∣ X_o; if there is information about the observation pattern O_X in X_h, then it is also in X_o. Example: throw the thumbtack twice, X = [X_1, X_2], and always show X_1; hiding X_2 whenever X_1 = 1 is missing at random (though not missing completely at random). There is no "extra" information in the observation pattern, so ignore it and optimize ℓ(D, θ) = Σ_{x_o ∈ D} log Σ_{x_h} p(x_o, x_h).
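
  The two-throw MAR example can be simulated in the same hedged style as the earlier sketch (sample_mar is an illustrative name; it reuses the numpy rng defined above):

      def sample_mar(theta, n):
          # throw the thumbtack twice per instance: X = [X1, X2]
          x1 = rng.binomial(1, theta, size=n)
          x2 = rng.binomial(1, theta, size=n)
          # MAR mechanism from the slide: hide X2 whenever X1 == 1, otherwise show it;
          # the pattern depends only on the always-observed X1
          x2_obs = np.where(x1 == 1, np.nan, x2.astype(float))
          return x1, x2_obs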

  16. Likelihood function: marginal for partial observations. With fully observed data: in directed models the likelihood decomposes; in undirected models it does not decompose, but it is concave.

  17. Likelihood function: marginal for partial observations. With fully observed data: in directed models the likelihood decomposes; in undirected models it does not decompose, but it is concave. With partially observed data: it does not decompose and is no longer concave.

  18. Likelihood function: marginal for partial observations. With fully observed data: in directed models the likelihood decomposes; in undirected models it does not decompose, but it is concave. With partially observed data: ℓ(D, θ) = Σ_{x_o ∈ D} log Σ_{x_h} p(x_o, x_h), where p(x_o, x_h) is the likelihood for a single assignment to the latent variables; this does not decompose and is no longer concave.

  19. Likelihood function: example. Marginal for a directed model with structure x → y, x → z. The fully observed case decomposes: ℓ(D, θ) = Σ_{(x,y,z) ∈ D} log p(x, y, z) = Σ_x log p(x) + Σ_{(x,y)} log p(y ∣ x) + Σ_{(x,z)} log p(z ∣ x).

  20. Likelihood function: example. Marginal for a directed model with structure x → y, x → z. The fully observed case decomposes: ℓ(D, θ) = Σ_{(x,y,z) ∈ D} log p(x, y, z) = Σ_x log p(x) + Σ_{(x,y)} log p(y ∣ x) + Σ_{(x,z)} log p(z ∣ x). If x is always missing (e.g., in a latent variable model), then ℓ(D, θ) = Σ_{(y,z) ∈ D} log Σ_x p(x) p(y ∣ x) p(z ∣ x), and we cannot decompose it!
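
  A small sketch of this marginal log-likelihood for tabular CPDs makes the point concrete: the hidden x is summed inside the log, so the three log terms can no longer be separated (marginal_loglik and the table shapes below are illustrative assumptions):

      def marginal_loglik(p_x, p_y_given_x, p_z_given_x, data_yz):
          # p_x: shape (K,), p_y_given_x: shape (K, |Y|), p_z_given_x: shape (K, |Z|)
          ll = 0.0
          for y, z in data_yz:
              # the hidden x is summed out inside the log, which is what
              # blocks the decomposition into separate terms per CPD
              ll += np.log(np.sum(p_x * p_y_given_x[:, y] * p_z_given_x[:, z]))
          return ll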

  21. Parameter learning with missing data. Directed models: option 1, obtain the gradient of the marginal likelihood; option 2, expectation maximization (EM) and its variational interpretation.

  22. Parameter learning with missing data. Directed models: option 1, obtain the gradient of the marginal likelihood; option 2, expectation maximization (EM) and its variational interpretation. Undirected models: obtain the gradient of the marginal likelihood; EM is not a good option here.

  23. Parameter learning with missing data. Directed models: option 1, obtain the gradient of the marginal likelihood; option 2, expectation maximization (EM) and its variational interpretation. Undirected models: obtain the gradient of the marginal likelihood; EM is not a good option here. All of these options need inference at each step of learning.
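
  Purely as a preview of option 2 (EM itself is developed later), one E/M iteration for the earlier x → y, x → z example could look like the following sketch, assuming tabular CPDs; the names are illustrative, and the E-step is the per-instance inference mentioned on this slide:

      def em_step(p_x, p_y_given_x, p_z_given_x, data_yz):
          # E-step: inference, i.e. the posterior over the hidden x for each observed (y, z)
          K = p_x.shape[0]
          counts_x = np.zeros(K)
          counts_xy = np.zeros_like(p_y_given_x)
          counts_xz = np.zeros_like(p_z_given_x)
          for y, z in data_yz:
              joint = p_x * p_y_given_x[:, y] * p_z_given_x[:, z]  # p(x, y, z) for all x
              post = joint / joint.sum()                           # p(x | y, z)
              counts_x += post
              counts_xy[:, y] += post
              counts_xz[:, z] += post
          # M-step: re-estimate the tables from the expected counts
          new_p_x = counts_x / counts_x.sum()
          new_p_y_given_x = counts_xy / counts_xy.sum(axis=1, keepdims=True)
          new_p_z_given_x = counts_xz / counts_xz.sum(axis=1, keepdims=True)
          return new_p_x, new_p_y_given_x, new_p_z_given_x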

  24. Gradient of the marginal likelihood (directed models). Example: with b and c hidden, the log marginal likelihood is ℓ(D) = Σ_{(a,d) ∈ D} log Σ_{b,c} p(a) p(b) p(c ∣ a, b) p(d ∣ c).

  25. Gradient of the marginal likelihood (directed models). Example: with b and c hidden, ℓ(D) = Σ_{(a,d) ∈ D} log Σ_{b,c} p(a) p(b) p(c ∣ a, b) p(d ∣ c). Taking the derivative: ∂ℓ(D)/∂p(d′ ∣ c′) = Σ_{(a,d) ∈ D} (1/p(d′ ∣ c′)) p(d′, c′ ∣ a, d); we need inference to compute p(d′, c′ ∣ a, d).

  26. Gradient of the marginal likelihood (directed models). Example: with b and c hidden, ℓ(D) = Σ_{(a,d) ∈ D} log Σ_{b,c} p(a) p(b) p(c ∣ a, b) p(d ∣ c). Taking the derivative: ∂ℓ(D)/∂p(d′ ∣ c′) = Σ_{(a,d) ∈ D} (1/p(d′ ∣ c′)) p(d′, c′ ∣ a, d); we need inference to compute p(d′, c′ ∣ a, d). What happens to this expression if every variable is observed?
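
  One way to sketch an answer to the question on this slide: if a, b, c, d are all observed, inference is trivial and p(d′, c′ ∣ a, b, c, d) = 1[c = c′, d = d′], so the gradient reduces to Σ_{(a,b,c,d) ∈ D} 1[c = c′, d = d′] / p(d′ ∣ c′) = M[c′, d′] / p(d′ ∣ c′), where M[c′, d′] counts the instances with c = c′ and d = d′; these counts are exactly the sufficient statistics of fully observed maximum likelihood.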

  27. Gradient of the marginal likelihood (directed models). For a Bayesian network with CPTs, ∂ℓ(D)/∂p(x_i ∣ pa_{x_i}) = Σ_{x_o ∈ D} (1/p(x_i ∣ pa_{x_i})) p(x_i, pa_{x_i} ∣ x_o), where (x_i, pa_{x_i}) is some specific assignment; this requires running inference for each observation.

  28. Gradient of the marginal likelihood (directed models). For a Bayesian network with CPTs, ∂ℓ(D)/∂p(x_i ∣ pa_{x_i}) = Σ_{x_o ∈ D} (1/p(x_i ∣ pa_{x_i})) p(x_i, pa_{x_i} ∣ x_o), where (x_i, pa_{x_i}) is some specific assignment; this requires running inference for each observation. A technical issue: this gradient is always non-negative, and nothing enforces the constraint Σ_x p(x ∣ pa_x) = 1, so we either reparametrize (e.g., using a softmax) or use Lagrange multipliers.
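
  For the softmax option, a minimal sketch of the reparametrization (cpt_from_logits is an illustrative name; it assumes numpy and one row of logits per parent assignment):

      def cpt_from_logits(logits):
          # reparametrize each CPT row p(. | pa_x) with a softmax so that
          # sum_x p(x | pa_x) = 1 holds by construction; unconstrained gradient
          # steps on the logits then combine the formula above with the chain rule
          z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
          e = np.exp(z)
          return e / e.sum(axis=-1, keepdims=True)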
