
Prior-Driven Cluster Allocation in Bayesian Mixture Models



  1. Prior-Driven Cluster Allocation in Bayesian Mixture Models
     Sally Paganin (sally.paganin@berkeley.edu)
     JSM 2020, August 03, 2020

  2. Amy Herring (Duke University), David Dunson (Duke University), Andrew Olshan (UNC at Chapel Hill)

  3. Introduction
     Clustering is one of the canonical data analysis goals in statistics.
     • Distance-based methods: rely on a distance metric between data points
     • Model-based clustering: relies on discrete mixture models
     Bayesian perspective: allows prior information to be incorporated.

  4. Introduction
     What if we have prior information on the clustering itself?

  5. Introduction
     Motivating application: birth defects data
     • Relate exposure factors to the risk of developing a defect
     • Prior information is available (biology, experts' judgments)
     We aim to provide methods that facilitate data-adaptive clustering, using both the information in the data and external knowledge.

  6. National Birth Defects Prevention Study (http://www.nbdps.org/)
     • Population-based case-control study
       - 300 controls / 100 cases per year since 1997
       - monthly number of controls ∝ number of births in the previous year
     • Cases (37 major birth defects)
       - Birth defects surveillance system + clinical geneticist review
       - Cases with known etiology were excluded
     • Controls
       - Non-malformed live births
       - Birth certificates or hospital delivery records
     • Data collection
       - CATI (computer-assisted telephone interview, English/Spanish) within 24 months

  7. National Birth Defects Prevention Study
     We focus on the congenital heart defects (CHD), which are problems in the structure of the heart that are present at birth.

  8. Congenital Heart Defects
     Clinical importance: a priority in public health
       - most frequent class of defects
       - high impact on pediatric mortality
     Statistical relevance: a challenge in birth defects modeling
       - Most defects are too rare for individual study
       - It is difficult to determine how best to group birth defects

  9. Congenital Heart Defects
     Experts have provided a mechanistic classification of the defects, which
       - relies on biological knowledge and embryologic development
       - translates into a prior guess $c_0$ for the clustering

  10. Set partitions
      A set partition $c$ of the set $[n] = \{1, \dots, n\}$ is a collection of non-empty disjoint subsets $\{B_1, B_2, \dots, B_K\}$ such that $\bigcup_{i=1}^{K} B_i = [n]$.
      • Number of partitions of $[n]$ into $k$ blocks: the Stirling numbers of the second kind, $S(n, k)$
      • Total number of set partitions: the Bell number $B_n = \sum_{k=1}^{n} S(n, k)$

  11. Set partitions
      • Configuration $\lambda = \{|B_1|, \dots, |B_K|\}$: the sequence of block cardinalities
        - it identifies an integer partition, a collection of positive integers $\{\lambda_1, \dots, \lambda_K\}$ such that $\sum_{i=1}^{K} \lambda_i = n$
      [Figure: the integer partitions of 5, labelled 11111, 2111, 311, 221, 41, 32, 5]
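The counting facts on the set-partition slides can be checked numerically. The sketch below is an illustration only: the function names `stirling2`, `bell` and `integer_partitions` are hypothetical helpers, not part of the talk. It uses the standard recurrence $S(n, k) = S(n-1, k-1) + k\,S(n-1, k)$ and lists the possible configurations for $n = 5$.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of partitions of [n] into exactly k non-empty blocks."""
    if n == k:
        return 1  # includes the convention S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # Element n either opens a new block or joins one of the k existing blocks.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def bell(n):
    """Total number of set partitions of [n]."""
    return sum(stirling2(n, k) for k in range(n + 1))


def integer_partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in integer_partitions(n - first, first):
            yield (first,) + rest


if __name__ == "__main__":
    print([stirling2(5, k) for k in range(1, 6)])  # partitions of [5] by number of blocks
    print(bell(5))                                 # all set partitions of [5]
    print(list(integer_partitions(5)))             # the possible configurations
```

For $n = 5$ this gives 52 set partitions but only 7 configurations, which is why grouping partitions by configuration is a substantial reduction.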

  12. Modeling birth defects
      • $i = 1, \dots, N$ heart defects, $j = 1, \dots, n_i$ observations
      • $y_{ij} = 1$ if observation $j$ has birth defect $i$, while $y_{ij} = 0$ if it is a control
      • $x_{ij}^T = (x_{ij1}, \dots, x_{ijp})$: observed values of $p$ dichotomous variables
      Grouped logistic regression:
      $$y_{ij} \sim \mathrm{Ber}(\pi_{ij}), \quad \mathrm{logit}(\pi_{ij}) = \alpha_i + x_{ij}^T \beta_{c_i}, \quad j = 1, \dots, n_i,$$
      $$\alpha_i \sim N(a_0, \tau_0^{-1}), \quad \beta_{c_i} \mid c \sim N_p(b, Q), \quad i = 1, \dots, N.$$
      Bayesian framework: assign a prior probability $p(c)$ to the partition
        - Exchangeable Partition Probability Function (EPPF)
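To make the likelihood part of the grouped logistic regression concrete, here is a minimal sketch of the Bernoulli log-likelihood for given intercepts, cluster labels and cluster-specific coefficients. The function names, the nested-list data layout and the toy values are assumptions made for illustration; the priors on $\alpha_i$ and $\beta_{c_i}$ are not included.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def log_likelihood(y, x, alpha, beta, c):
    """Bernoulli log-likelihood of the grouped logistic regression.

    y[i][j]  : 0/1 outcome of observation j for defect i
    x[i][j]  : list of p binary covariates for that observation
    alpha[i] : defect-specific intercept
    beta[k]  : coefficient vector shared by all defects in cluster k
    c[i]     : cluster label of defect i
    """
    total = 0.0
    for i, (y_i, x_i) in enumerate(zip(y, x)):
        coef = beta[c[i]]  # defects in the same cluster share coefficients
        for y_ij, x_ij in zip(y_i, x_i):
            eta = alpha[i] + sum(xv * bv for xv, bv in zip(x_ij, coef))
            # log Ber(y | pi) with logit(pi) = eta equals y * eta - log(1 + e^eta)
            total += y_ij * eta - softplus(eta)
    return total


if __name__ == "__main__":
    # Hypothetical toy data: N = 2 defects, 2 observations each, p = 2 covariates.
    y = [[1, 0], [0, 1]]
    x = [[[1, 0], [0, 1]], [[1, 1], [0, 0]]]
    alpha = [0.0, 0.0]
    beta = {0: [0.0, 0.0]}
    c = [0, 0]  # both defects allocated to the same cluster
    print(log_likelihood(y, x, alpha, beta, c))
```

With all parameters at zero every observation has $\pi_{ij} = 1/2$, so the toy call returns $4 \log(1/2)$, which is a quick sanity check on the implementation.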

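  13. Dirichlet Process: $p(c) \propto \prod_{i=1}^{K} (|B_i| - 1)!$ (for a general concentration parameter $\alpha$, the weight carries an extra factor $\alpha^K$)
      Uniform distribution: $p(c) \propto 1 / B_N$
      Pitman-Yor Process: $p(c) \propto \prod_{i=1}^{K} (1 - \sigma)_{|B_i|}$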
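For small $N$ these priors can be tabulated by brute force. The sketch below enumerates all partitions of $\{1, \dots, 5\}$ and normalizes the Dirichlet process weight $\alpha^K \prod_i (|B_i| - 1)!$. The helpers `set_partitions`, `dp_weight` and `eppf_table` are hypothetical names, and exhaustive enumeration is only feasible for very small $N$.

```python
from math import factorial


def set_partitions(n):
    """Yield every partition of {1, ..., n} as a tuple of blocks (tuples)."""
    def grow(i, blocks):
        if i > n:
            yield tuple(tuple(b) for b in blocks)
            return
        for b in blocks:       # place element i in an existing block
            b.append(i)
            yield from grow(i + 1, blocks)
            b.pop()
        blocks.append([i])     # or let element i open a new block
        yield from grow(i + 1, blocks)
        blocks.pop()
    yield from grow(1, [])


def dp_weight(blocks, alpha=1.0):
    """Unnormalized Dirichlet process EPPF: alpha^K * prod_i (|B_i| - 1)!."""
    w = alpha ** len(blocks)
    for b in blocks:
        w *= factorial(len(b) - 1)
    return w


def eppf_table(n, weight):
    """Return {partition: probability} by normalizing over all partitions."""
    parts = list(set_partitions(n))
    weights = [weight(c) for c in parts]
    total = sum(weights)
    return {c: w / total for c, w in zip(parts, weights)}


if __name__ == "__main__":
    dp = eppf_table(5, dp_weight)            # alpha = 1
    uniform = eppf_table(5, lambda c: 1.0)   # every partition equally likely
    print(len(dp))
    print(dp[((1, 2, 3, 4, 5),)], uniform[((1, 2, 3, 4, 5),)])
```

Under the Dirichlet process with $\alpha = 1$ the single-block partition is far more likely than under the uniform prior, which illustrates how the baseline EPPF favors particular configurations before any centering is applied.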

  14. How to account for $c_0$?
      Base idea: penalize a baseline EPPF in order to center the prior distribution on the given partition $c_0$:
      $$p(c \mid c_0, \psi) \propto p_0(c) \exp\{-\psi \, d(c, c_0)\} \tag{1}$$
      • $p_0(c)$: a baseline distribution (EPPF) on $\Pi_N$
      • $d(c, c_0)$: a suitable distance between partitions
        - ideally a metric on the set partitions lattice
      • $\psi$: penalization parameter controlling the centering
        - $\psi = 0$: $p(c \mid c_0, \psi) = p_0(c)$
        - $\psi \to \infty$: $p(c \mid c_0, \psi) \to \delta_{c_0}$

  15. How to account for $c_0$?
      Choice of the distance: Variation of Information [Meila (2007)]
      • $\mathrm{VI}(c, c') = -H(c) - H(c') + 2H(c \wedge c')$
      • $H(\cdot)$: information entropy
      • a metric on the set partition lattice
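The Variation of Information formula is easy to evaluate directly. The following is a minimal sketch, assuming base-2 logarithms for the entropy (the base only rescales the distance) and representing each partition as a list of blocks. `entropy`, `meet` and `vi` are hypothetical helper names.

```python
from math import log2


def entropy(blocks, n):
    """Entropy H(c) of a partition whose blocks have relative sizes |B| / n."""
    return -sum(len(b) / n * log2(len(b) / n) for b in blocks)


def meet(c1, c2):
    """Meet of two partitions: all non-empty pairwise block intersections."""
    out = []
    for a in c1:
        for b in c2:
            common = set(a) & set(b)
            if common:
                out.append(common)
    return out


def vi(c1, c2):
    """VI(c, c') = -H(c) - H(c') + 2 H(c ^ c')."""
    n = sum(len(b) for b in c1)
    return 2 * entropy(meet(c1, c2), n) - entropy(c1, n) - entropy(c2, n)


if __name__ == "__main__":
    one_block = [[1, 2, 3, 4, 5]]
    singletons = [[1], [2], [3], [4], [5]]
    c0 = [[1, 2], [3, 4], [5]]
    print(vi(c0, c0))                 # distance from a partition to itself
    print(vi(one_block, singletons))  # the two extremes of the lattice
    print(vi(c0, one_block), vi(c0, singletons))
```

The distance between the single-block partition and the all-singletons partition equals $\log_2 N$, the largest value VI can take for $N$ elements, so distances from $c_0$ in the prior (1) stay on a bounded scale.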

  16. Centered Partition Processes
      Define sets of partitions with distance $\delta_l$ from $c_0$ and configuration $\lambda_m$:
      $$s_{lm}(c_0) = \{c \in \Pi_N : d(c, c_0) = \delta_l, \ \Lambda(c) = \lambda_m\}$$
      for $l = 0, \dots, L$ and $m = 1, \dots, M$.
      Centered Partition Processes, analytic form:
      $$p(c \mid c_0, \psi) = \frac{g(\lambda_m) \, e^{-\psi \delta_l}}{\sum_{u=0}^{L} \sum_{v=1}^{M} |s_{uv}(c_0)| \, g(\lambda_v) \, e^{-\psi \delta_u}}, \quad \text{for } c \in s_{lm}(c_0)$$
      • $g(\cdot)$: a function of the configuration $\Lambda(c)$
        - e.g. Uniform: $g(\Lambda(c)) = 1$; DP: $g(\Lambda(c)) = \alpha^K \prod_{j=1}^{K} \Gamma(\lambda_j)$
      • $|\cdot|$: cardinality of the set $s_{lm}(c_0)$, not analytically tractable
        - but the prior can nonetheless be used in Bayesian models relying on Monte Carlo methods
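For a handful of elements, the cardinalities $|s_{lm}(c_0)|$ can be obtained by exhaustive enumeration, which gives a way to see the analytic form at work. The sketch below is illustrative only: it assumes VI with base-2 logarithms as the distance, rounds distances to group partitions into the sets $s_{lm}(c_0)$, and uses hypothetical function names. It does not scale beyond small $N$ and is not the Monte Carlo approach mentioned on the slide.

```python
from collections import Counter
from math import exp, gamma, log2, prod


def set_partitions(n):
    """Yield every partition of {1, ..., n} as a tuple of blocks (tuples)."""
    def grow(i, blocks):
        if i > n:
            yield tuple(tuple(b) for b in blocks)
            return
        for b in blocks:       # place element i in an existing block
            b.append(i)
            yield from grow(i + 1, blocks)
            b.pop()
        blocks.append([i])     # or let element i open a new block
        yield from grow(i + 1, blocks)
        blocks.pop()
    yield from grow(1, [])


def entropy(sizes, n):
    return -sum(s / n * log2(s / n) for s in sizes)


def vi(c1, c2):
    """Variation of Information between two partitions of the same set."""
    n = sum(len(b) for b in c1)
    meet = [len(set(a) & set(b)) for a in c1 for b in c2]
    meet = [s for s in meet if s > 0]
    return (2 * entropy(meet, n)
            - entropy([len(b) for b in c1], n)
            - entropy([len(b) for b in c2], n))


def g_uniform(lam):
    return 1.0


def g_dp(lam, alpha=1.0):
    """DP choice of g: alpha^K * prod_j Gamma(lambda_j)."""
    return alpha ** len(lam) * prod(gamma(l) for l in lam)


def cp_prior(c0, psi, g=g_uniform):
    """Return {partition: p(c | c0, psi)} using the sets s_lm(c0)."""
    n = sum(len(b) for b in c0)
    rows = []
    for c in set_partitions(n):
        delta = round(vi(c, c0), 10)  # distance level delta_l
        lam = tuple(sorted((len(b) for b in c), reverse=True))  # configuration
        rows.append((c, delta, lam))
    sizes = Counter((delta, lam) for _, delta, lam in rows)  # |s_lm(c0)|
    norm = sum(count * g(lam) * exp(-psi * delta)
               for (delta, lam), count in sizes.items())
    return {c: g(lam) * exp(-psi * delta) / norm for c, delta, lam in rows}


if __name__ == "__main__":
    c0 = ((1, 2), (3, 4), (5,))
    for psi in (0.0, 1.0, 5.0):
        prior = cp_prior(c0, psi, g=g_uniform)
        print(psi, prior[c0])
```

With the uniform baseline, the probability of $c_0$ starts at $1/52$ for $\psi = 0$ and grows with $\psi$, matching the two limiting cases of the penalized prior. Passing `g=g_dp` instead combines the centering with the Dirichlet process preference for certain configurations.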

  17. CP Process - Uniform EPPF
      [Figure: two panels, $c_0 = \{1, 2, 3, 4, 5\}$ and $c_0 = \{1, 2\}\{3, 4\}\{5\}$]

  18. CP Process - DP EPPF ($\alpha = 1$)
      [Figure: two panels, $c_0 = \{1, 2, 3, 4, 5\}$ and $c_0 = \{1, 2\}\{3, 4\}\{5\}$]
