Nonparametric Graph Estimation, Han Liu, Department of Operations Research and Financial Engineering (PowerPoint presentation)



  • Nonparametric Graph Estimation. Han Liu, Department of Operations Research and Financial Engineering, Princeton University.

  • Acknowledgement: Fang Han (JHU Biostats), John Lafferty (Chicago CS/Stats), Larry Wasserman (CMU Stats/ML), Tuo Zhao (JHU CS). http://www.princeton.edu/~hanliu

  • High Dimensional Data Analysis: the dimensionality d increases with the sample size n, and the total error decomposes as Approximation Error + Estimation Error + Computing Error. This setting is well studied under linear and Gaussian models; this talk shows that a little nonparametricity goes a long way.

  • Graph Estimation Problem: infer conditional independence from observational data. Given $d$ variables $X_1, \dots, X_d$ and $n$ samples $x^1, \dots, x^n$ collected in an $n \times d$ data matrix, estimate a graph $G = (V, E)$ such that $(X_i, X_j) \notin E \Leftrightarrow X_i \perp X_j \mid \text{the rest}$. Applications: density estimation, computing, visualization...

  • Desired Statistical Properties, characterized by different criteria. Persistency: $\mathrm{Risk}(\hat f) - \mathrm{Risk}(f^o) = o_P(1)$, where $f^o$ is the oracle estimator in the model class $\mathcal{F}$. Consistency: $\mathrm{Distance}(\hat f, f^*) = o_P(1)$, where $f^*$ is the true function. Sparsistency: $P(\mathrm{graph}(\hat f) \neq \mathrm{graph}(f^*)) = o(1)$. Minimax optimality.

  • Outline: Nonparanormal; Forest Density Estimation; Summary.

  • Gaussian Graphical Models: $X \sim N_d(\mu, \Sigma)$ with $\Omega = \Sigma^{-1}$; then $\Omega_{jk} = 0 \Leftrightarrow X_j \perp X_k \mid \text{the rest}$ (Lauritzen 96). glasso, the graphical lasso (Yuan and Lin 06, Banerjee 08, Friedman et al. 08), solves
    $\min_{\Omega \succ 0} \; \mathrm{tr}(\hat S \Omega) - \log |\Omega| + \lambda \sum_{j,k} |\Omega_{jk}|$,
    the negative Gaussian log-likelihood (with sample covariance $\hat S$) plus an $L_1$ regularization. An alternative is neighborhood selection (Meinshausen and Buhlmann 06).
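As a concrete illustration of the glasso objective on this slide, the sketch below simulates Gaussian data from a chain-structured sparse precision matrix and recovers its support with scikit-learn's GraphicalLasso; the dimensions, seed, and penalty level are made-up choices, not values from the talk:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Chain-structured sparse precision matrix (positive definite).
d = 5
Omega = np.eye(d)
for i in range(d - 1):
    Omega[i, i + 1] = Omega[i + 1, i] = 0.4
Sigma = np.linalg.inv(Omega)

# Simulate Gaussian data and fit the L1-penalized likelihood.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=500)
model = GraphicalLasso(alpha=0.05).fit(X)
Omega_hat = model.precision_

# Estimated graph = support of the off-diagonal entries of Omega_hat.
edges = {(i, j) for i in range(d) for j in range(i + 1, d)
         if abs(Omega_hat[i, j]) > 1e-4}
```

The nonzero off-diagonal entries of the estimated precision matrix give the estimated edge set, exactly the equivalence $\Omega_{jk} \neq 0 \Leftrightarrow (j,k) \in E$ stated above.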

  • Gaussian Graphical Models: CLIME, the constrained $L_1$-minimization method (Cai et al. 2011), solves
    $\min_{\Omega} \; \sum_{j,k} |\Omega_{jk}|$ subject to $\|\hat S \Omega - I\|_{\max} \leq \lambda$.
    Related: gDantzig, the graphical Dantzig selector (Yuan 2010).
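CLIME decouples into one small linear program per column of $\Omega$, so it can be sketched with scipy's generic LP solver; this is an illustrative implementation under the standard $\omega = u - v$ splitting, not the authors' released code, and the identity-matrix call at the end is a toy input:

```python
import numpy as np
from scipy.optimize import linprog

def clime(S, lam):
    """Column-wise CLIME: min ||omega||_1  s.t.  ||S omega - e_j||_inf <= lam.
    Writing omega = u - v with u, v >= 0 turns each column into an LP."""
    d = S.shape[0]
    Omega = np.zeros((d, d))
    c = np.ones(2 * d)                       # objective: sum(u) + sum(v)
    A = np.block([[S, -S], [-S, S]])         # encodes |S(u - v) - e_j| <= lam
    for j in range(d):
        e = np.zeros(d)
        e[j] = 1.0
        b = np.concatenate([lam + e, lam - e])
        res = linprog(c, A_ub=A, b_ub=b)     # default bounds give u, v >= 0
        Omega[:, j] = res.x[:d] - res.x[d:]
    # Symmetrize by keeping the entry with the smaller magnitude.
    return np.where(np.abs(Omega) <= np.abs(Omega.T), Omega, Omega.T)

Omega_hat = clime(np.eye(3), lam=0.1)
```

With $\hat S = I$ and $\lambda = 0.1$ the solution shrinks each diagonal entry to $1 - \lambda = 0.9$ and leaves the off-diagonals at zero, which is a handy sanity check.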

  • Computation and Theory. Computing is scalable up to thousands of dimensions: glasso (Hastie et al.; language: Fortran; scalability: d < 3000; speed: very fast) and huge (Zhao and Liu; language: C; scalability: d < 6000; speed: about 3x faster). Theory covers persistency, consistency, sparsistency, optimal rates, etc.; the key result for the analysis is
    $\|\hat S - \Sigma\|_{\max} = O_P\left(\sqrt{\log d / n}\right)$,
    where $\hat S$ is the sample covariance and $\Sigma$ the population covariance.

  • Many Real Data are non-Gaussian. [Figure: normal Q-Q plot (sample vs. theoretical quantiles) of one typical gene from the Arabidopsis data (Wille et al. 04; n = 118, d = 39), showing clear departure from normality.] Can we relax the Gaussian assumption without losing statistical and computational efficiency?

  • The Nonparanormal: Gaussian ⇒ Gaussian copula (nonparanormal). Definition (Liu, Lafferty, Wasserman 09): a random vector $X = (X_1, \dots, X_d)$ is nonparanormal, written $X \sim NPN_d(\Sigma, \{f_j\}_{j=1}^d)$, in case $f(X) = (f_1(X_1), \dots, f_d(X_d)) \sim N_d(0, \Sigma)$. Here the $f_j$'s are strictly monotone and $\mathrm{diag}(\Sigma) = 1$. Taking $f_j(t) = (t - \mu_j)/\sigma_j$ recovers arbitrary Gaussian distributions.
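To make the definition concrete, one can sample from a nonparanormal by pushing a latent Gaussian vector through strictly monotone maps $g_j = f_j^{-1}$; the choice $g_j = \exp$ below (log-normal marginals) is purely illustrative. Since monotone transforms preserve ranks, Spearman's rho of $X$ equals that of the latent $Z$:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Latent Gaussian with unit-diagonal correlation matrix Sigma.
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)

# Strictly monotone marginal transform g_j = f_j^{-1} (here: exp),
# giving non-Gaussian (log-normal) marginals with a Gaussian copula.
X = np.exp(Z)

# Ranks are unchanged, so Spearman's rho is identical for Z and X.
rho_Z = spearmanr(Z[:, 0], Z[:, 1])[0]
rho_X = spearmanr(X[:, 0], X[:, 1])[0]
```

This rank invariance is exactly what the estimation procedure on the later slides exploits.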

  • Visualization: bivariate nonparanormal densities with different transformations.

  • Basic Properties: the graph is encoded in the inverse correlation matrix. Let $X \sim NPN_d(\Sigma, \{f_j\}_{j=1}^d)$ and $\Omega = \Sigma^{-1}$; then
    $p_X(x) = (2\pi)^{-d/2} |\Omega|^{1/2} \exp\left\{ -\tfrac{1}{2} f(x)^\top \Omega f(x) \right\} \prod_{j=1}^d f_j'(x_j)$
    $\Rightarrow \Omega_{ij} = 0 \Leftrightarrow X_i \perp X_j \mid \text{the rest}$. The likelihood is not jointly convex; how to estimate the parameters?

  • Estimating Transformation Functions: directly estimate $\{f_j\}_{j=1}^d$ without worrying about $\Omega$. Since $f_j$ is strictly monotone and $f_j(X_j) \sim N(0,1)$, the CDF of $X_j$ satisfies
    $F_j(t) = P(X_j \leq t) = P(f_j(X_j) \leq f_j(t)) = \Phi(f_j(t)) \;\Rightarrow\; f_j(t) = \Phi^{-1}(F_j(t))$.
    Normal-score transformation: estimate $F_j$ by $\hat F_j(t) = \frac{1}{n+1} \sum_{i=1}^n I(x_j^i \leq t)$.
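The normal-score transformation above is only a few lines of code; this sketch uses scipy's rankdata to evaluate the scaled empirical CDF $\hat F_j$ at the data points (the exponential test input is an arbitrary skewed example):

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_score(x):
    """f_hat(t) = Phi^{-1}(F_hat(t)) with the scaled empirical CDF
    F_hat(t) = (1/(n+1)) * sum_i 1{x^i <= t}, evaluated at the data."""
    n = len(x)
    F_hat = rankdata(x, method="max") / (n + 1)   # rank_i = #{x^k <= x^i}
    return norm.ppf(F_hat)

# A heavily skewed sample is mapped to standard normal scores.
rng = np.random.default_rng(2)
x = rng.exponential(size=1000)
z = normal_score(x)
```

The $n + 1$ denominator (rather than $n$) keeps $\hat F_j$ strictly inside $(0, 1)$, so $\Phi^{-1}$ never blows up at the largest observation.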

  • Estimating the Inverse Correlation Matrix. Nonparanormal algorithm (Liu, Han, Lafferty, Wasserman 12). Step 1: calculate the Spearman's rank correlation coefficient matrix $\hat R^\rho$. Step 2: transform $\hat R^\rho$ into $\hat \Sigma^\rho$ according to
    $(*) \quad \hat \Sigma^\rho_{jk} = 2 \sin\left( \frac{\pi}{6} \hat R^\rho_{jk} \right)$;
    $\hat \Sigma^\rho$ provides a good estimate of $\Sigma$. Step 3: plug $\hat \Sigma^\rho$ into glasso / CLIME / gDantzig to get $\hat \Omega^\rho$ and the graph. The same procedure was independently proposed by (Xue and Zou 12).
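A minimal sketch of the three steps, assuming glasso as the Step 3 plug-in (scikit-learn's graphical_lasso); the simulated chain graph, exp transform, and tuning parameter are illustrative choices, not values from the talk:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.covariance import graphical_lasso

def npn_graph(X, alpha=0.05):
    # Step 1: Spearman's rank correlation coefficient matrix R_hat.
    R = spearmanr(X)[0]
    # Step 2: bridge to the latent correlation: Sigma_jk = 2 sin(pi R_jk / 6).
    S = 2.0 * np.sin(np.pi * R / 6.0)
    np.fill_diagonal(S, 1.0)
    # Step 3: plug Sigma_hat into glasso (CLIME / gDantzig would also work).
    _, Omega = graphical_lasso(S, alpha=alpha)
    return Omega

# Nonparanormal data: exponentiate latent chain-Gaussian samples.
rng = np.random.default_rng(3)
d = 4
Omega0 = np.eye(d)
for i in range(d - 1):
    Omega0[i, i + 1] = Omega0[i + 1, i] = 0.45
Z = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Omega0), size=400)
X = np.exp(Z)                      # non-Gaussian marginals, same graph
Omega_hat = npn_graph(X)
```

Because only ranks enter Step 1, the procedure is identical whether it sees $Z$ or any monotone transform of it; the nonlinearity of the marginals costs nothing.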

  • Nonparanormal Theory. Theorem (Liu, Han, Lafferty, Wasserman 12): let $X \sim NPN_d(\Sigma, f)$ and $\Omega = \Sigma^{-1}$. Under any conditions on $\Sigma$ and $\Omega$ that secure the consistency and sparsistency of glasso / CLIME / gDantzig under the Gaussian model, the nonparanormal estimator is also consistent and sparsistent, with exactly the same parametric rates of convergence. ⇒ The nonparanormal is a safe replacement for the Gaussian model.

  • Proof of the Theorem. The key is to show that $\|\hat \Sigma^\rho - \Sigma\|_{\max} = O_P\left(\sqrt{\log d / n}\right)$. For the Gaussian distribution, Kruskal (1948) shows the identity
    $\Sigma_{jk} = 2 \sin\left( \frac{\pi}{6} R^\rho_{jk} \right)$
    linking the population Pearson correlation coefficient $\Sigma_{jk}$ to the population Spearman rank coefficient $R^\rho_{jk}$; since Spearman's rho is invariant under monotone transformations, the identity also holds for the nonparanormal distribution. Therefore
    $\|\hat \Sigma^\rho - \Sigma\|_{\max} \lesssim \|\hat R^\rho - R^\rho\|_{\max} = O_P\left(\sqrt{\log d / n}\right)$
    by the theory of U-statistics.
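Kruskal's identity is easy to check numerically: for bivariate Gaussian samples, $2 \sin(\frac{\pi}{6} \hat\rho_s)$ should track the Pearson correlation. The correlation value and sample size below are arbitrary:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(4)
r = 0.7                                    # population Pearson correlation
Z = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=200_000)

rho_s = spearmanr(Z[:, 0], Z[:, 1])[0]     # sample Spearman rank correlation
bridge = 2.0 * np.sin(np.pi * rho_s / 6.0) # Kruskal's bridge function
pearson = pearsonr(Z[:, 0], Z[:, 1])[0]    # sample Pearson correlation
```

Both `bridge` and `pearson` land close to the population value 0.7, which is the content of the identity.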

  • Empirical Results: for non-Gaussian data, the nonparanormal clearly outperforms glasso. Sample $x^i \sim NPN_d(\Sigma, f)$ with $n = 200$, $d = 40$, and transformations $f_j$. [Figure: true graph vs. glasso and nonparanormal estimates, with false negatives (FN) and false positives (FP) marked.] Oracle graph: pick the best tuning parameter along the path.

  • Nonparanormal: Efficiency Loss. For Gaussian data, the nonparanormal loses almost no efficiency. Computationally there is no extra cost; statistically, sampling $x^1, \dots, x^n \sim N_d(0, \Sigma)$ with $n = 80$ and $d = 100$, the ROC curves for graph recovery (1-FN vs. 1-FP) nearly coincide: almost no efficiency loss.

  • Arabidopsis Data: the nonparanormal behaves differently from glasso on the Arabidopsis data. [Figure: estimated graphs at three regularization levels $\lambda_1, \lambda_2, \lambda_3$; glasso vs. nonparanormal vs. their difference.] The regularization paths are different, and the estimated transformations $\hat f_j$ are highly nonlinear (e.g., for the gene MECPS); the nonlinear transformations cause the graph differences.

  • Scientific Implications: cross-pathway interactions? The nonparanormal and glasso graphs differ on connections between the MVA and MEP pathways (genes HMGR1, HMGR2, MECPS). Whether such cross-pathway interactions exist is still open in the current biological literature (Hou et al. 2010).

  • Tradeoff. The nonparanormal allows unrestricted graphs and more flexible distributions, but what if the true distribution is not nonparanormal? Next: trade structural flexibility for greater nonparametricity.

  • Forest Densities: Gaussian copula ⇒ fully nonparametric distribution. A forest $F = (V, E_F)$ is an acyclic graph. A distribution is supported on a forest $F = (V, E_F)$ if
    $p_F(x) = \prod_{(i,j) \in E_F} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \cdot \prod_{k \in V} p(x_k)$.
    The forest density estimator plugs in estimates $\hat p(x_i, x_j)$, $\hat p(x_k)$ and an estimated forest $\hat F = (V, E_{\hat F})$. Advantages: visualization, computing, distributional flexibility, inference.

  • Some Previous Work: most existing work on forests is for discrete distributions, e.g., Chow and Liu (1968), Bach and Jordan (2003), Chechetka and Guestrin (2007), Tan et al. (2010). Our focus: statistical properties in high dimensions.

  • Estimation: find the best forest with at most $k$ edges,
    $F^{(k)} = \mathrm{argmin}_F \; KL\left( p(x) \,\|\, p_F(x) \right)$ subject to $|E_F| \leq k$,
    the projection of the true density $p(x)$ onto forests. This is equivalent to a maximum weight forest problem (Kruskal 56):
    $F^{(k)} = \mathrm{argmax}_F \sum_{(i,j) \in E_F} I(p_{ij})$ subject to $|E_F| \leq k$,
    with mutual information $I(p_{ij}) = \int p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \, dx_i \, dx_j$. In practice, plug in clipped KDEs $\hat p(x_i, x_j)$, $\hat p(x_k)$.
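A simple plug-in estimate of the edge weights $I(p_{ij})$ can be sketched with a 2-D histogram, a cruder stand-in for the clipped KDE on this slide; bin count, sample size, and correlation are arbitrary choices:

```python
import numpy as np

def mutual_information(x, y, bins=20):
    """Histogram plug-in estimate of I(p_ij): discretize (x, y),
    then compute the discrete MI of the binned joint distribution."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                       # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y
    mask = pxy > 0                         # avoid log(0) on empty cells
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(5)
z = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=50_000)
dependent = mutual_information(z[:, 0], z[:, 1])
independent = mutual_information(rng.normal(size=50_000),
                                 rng.normal(size=50_000))
```

Strongly dependent pairs get a large weight while independent pairs get a weight near zero, which is what drives the edge ranking in the forest algorithm.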

  • Forest Density Estimation Algorithm. 1. Sort edges according to the empirical mutual information $I(\hat p_{ij})$. 2. Greedily pick edges such that no cycles are formed. 3. Output the obtained forest after $k$ edges have been added.
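The three steps above are Kruskal's greedy algorithm with a union-find structure for the cycle check; a sketch, with a made-up weight matrix standing in for the empirical mutual informations:

```python
import numpy as np

def maximum_weight_forest(I_hat, k):
    """Greedy forest: scan edges by decreasing (estimated) mutual
    information, skip any edge that would close a cycle, stop at k edges."""
    d = I_hat.shape[0]
    parent = list(range(d))                 # union-find over vertices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    candidates = sorted(((I_hat[i, j], i, j)
                         for i in range(d) for j in range(i + 1, d)),
                        reverse=True)       # Step 1: sort by weight
    forest = []
    for _, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:                        # Step 2: accept iff no cycle
            parent[ri] = rj
            forest.append((i, j))
            if len(forest) == k:            # Step 3: stop after k edges
                break
    return forest

# Hypothetical pairwise mutual-information matrix (upper triangle used).
W = np.zeros((4, 4))
W[0, 1], W[0, 2], W[1, 2], W[2, 3] = 3.0, 2.5, 2.0, 1.0
forest = maximum_weight_forest(W, k=3)
```

On this toy input the edge (1, 2) is rejected because (0, 1) and (0, 2) already connect those vertices, so the greedy pass returns the acyclic set {(0, 1), (0, 2), (2, 3)}.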

  • Assumptions for Forest Graph Estimation. (A1) The bivariate marginals $p(x_j, x_k)$ belong to a 2nd-order Hölder class. (A2) $p(x)$ has bounded support (e.g., $[0,1]^d$) and $\kappa_1 \leq \min_{j,k} p(x_j, x_k) \leq \max_{j,k} p(x_j, x_k) \leq \kappa_2$. (A3) $p(x_j, x_k)$ has vanishing partial derivatives on the boundaries. (A4) For a "crucial" set of edges, the mutual informations are distinct enough from each other, to secure enough signal-to-noise ratio for correct structure recovery (Tan, Anandkumar, Willsky 11).

  • Forest Density Estimation Theory. Let $\mathcal{P}(k)$ denote the densities supported by forests with at most $k$ edges, and let $F^{(k)} = \mathrm{argmin}_{F: |E_F| \leq k} KL\left( p(x) \,\|\, p_F(x) \right)$, with oracle density estimator $p_{F^{(k)}}$ and forest estimator $\hat p_{\hat F^{(k)}}$ of the true density. Theorem, oracle sparsistency (Liu et al. 12): for graph estimation, if $\frac{\log d}{n} \to 0$ (a parametric scaling) and the 1-d and 2-d KDEs use the same undersmoothed bandwidth $h \asymp n^{-1/4}$, then
    $\sup_k P\left( \hat F^{(k)} \neq F^{(k)} \right) = o(1)$.