High-dimensional graphical model selection: Practical and information-theoretic limits

Martin Wainwright
Departments of Statistics and EECS, UC Berkeley, California, USA

Based on joint work with: John Lafferty (CMU), Pradeep Ravikumar (UC Berkeley), and Prasad Santhanam (University of Hawaii)

Supported by grants from the National Science Foundation and a Sloan Foundation Fellowship
Introduction

• classical asymptotic theory of statistical inference:
  – number of observations n → +∞
  – model dimension p stays fixed
• not suitable for many modern applications:
  – { images, signals, systems, networks } frequently large (p ≈ 10^3 − 10^8)
  – function/surface estimation: enforces limit p → +∞
  – interesting consequences: might have p = Θ(n) or even p ≫ n
• curse of dimensionality: frequently impossible to obtain consistent procedures unless p/n → 0
• can be saved by a lower effective dimensionality, due to some form of complexity constraint:
  – sparse vectors
  – { sparse, structured, low-rank } matrices
  – structured regression functions
  – graphical models (Markov random fields)
What are graphical models?

• Markov random field: random vector (X_1, ..., X_p) with distribution factoring according to a graph G = (V, E)
• Hammersley–Clifford theorem: (X_1, ..., X_p) being Markov w.r.t. G implies a factorization over the cliques of G; for the graph shown, with cliques A, B, C, D:

      P(x_1, ..., x_p) ∝ exp{ θ_A(x_A) + θ_B(x_B) + θ_C(x_C) + θ_D(x_D) }.

• studied/used in various fields: spatial statistics, language modeling, computational biology, computer vision, statistical physics, ...
Graphical model selection

• let G = (V, E) be an undirected graph on p = |V| vertices
• pairwise Markov random field: family of probability distributions

      P(x_1, ..., x_p; θ) = (1/Z(θ)) exp{ Σ_{(s,t) ∈ E} ⟨θ_st, φ_st(x_s, x_t)⟩ }

• graph selection: given n independent and identically distributed (i.i.d.) samples of X = (X_1, ..., X_p), identify the underlying graph structure
• complexity constraint: restrict to the subset G_{d,p} of graphs with maximum degree d
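Illustrative aside (not from the slides): a minimal Python sketch of such a pairwise model, using an Ising parameterization with φ_st(x_s, x_t) = x_s x_t on spins {−1, +1}. The function names and the brute-force computation of Z(θ) are assumptions for illustration; exact enumeration is only feasible for very small p.

    import itertools
    import numpy as np

    def ising_log_potential(x, theta):
        # Unnormalized log-probability: sum over pairs of theta[s, t] * x_s * x_t
        p = len(x)
        return sum(theta[s, t] * x[s] * x[t]
                   for s in range(p) for t in range(s + 1, p))

    def ising_distribution(theta):
        # Exact distribution over {-1,+1}^p by enumerating all 2^p states (tiny p only);
        # the normalization below plays the role of 1/Z(theta).
        p = theta.shape[0]
        states = list(itertools.product([-1, 1], repeat=p))
        weights = np.array([np.exp(ising_log_potential(x, theta)) for x in states])
        return states, weights / weights.sum()

    # Example: 4-node chain graph 1-2-3-4 with all edge weights 0.6
    p = 4
    theta = np.zeros((p, p))
    for s in range(p - 1):
        theta[s, s + 1] = 0.6
    states, probs = ising_distribution(theta)
    print(probs.sum())  # 1.0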
Illustration: Voting behavior of US senators

Graphical model fit to the voting records of US senators (Banerjee, El Ghaoui, & d'Aspremont, 2008).
Some issues in high-dimensional inference

Consider some fixed loss function, and a fixed level δ of error.

Limitations of tractable algorithms: given particular (polynomial-time) algorithms,
• for what sample sizes n do they succeed/fail to achieve error δ?
• given a collection of methods, when does more computation reduce the minimum # samples needed?

Information-theoretic limitations: viewing data collection as communication from nature → statistician,
• what are the fundamental limitations of the problem (Shannon capacity)?
• when are known (polynomial-time) methods optimal?
• when are there gaps between polynomial-time methods and optimal methods?
Previous/on-going work on graph selection

• exact solution for trees (Chow & Liu, 1967)
• local testing-based approaches (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
• methods for Gaussian MRFs:
  – ℓ1-regularized neighborhood regression (e.g., Meinshausen & Buhlmann, 2005; Wainwright, 2006; Zhao, 2006)
  – ℓ1-regularized log-determinant (e.g., Yuan & Lin, 2006; d'Aspremont et al., 2007; Friedman, 2008; Ravikumar et al., 2008)
• methods for discrete MRFs:
  – neighborhood-based search method (Bresler, Mossel & Sly, 2008)
  – ℓ1-regularized logistic regression (Ravikumar et al., 2006, 2008)
  – pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)
• information-theoretic approaches:
  – information-theoretic limitations (Santhanam & Wainwright, 2008)
Markov property and neighborhood structure

• Markov properties encode neighborhood structure:

      (X_r | X_{V \ r})  =_d  (X_r | X_{N(r)})

  the left side conditions on the full graph, the right side only on the Markov blanket; in the figure, N(r) = {s, t, u, v, w}, so X_r is independent of the rest of the graph given (X_s, X_t, X_u, X_v, X_w).
• basis of the pseudolikelihood method (Besag, 1974)
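For an Ising model, this conditional distribution has an explicit logistic form, which is what makes the regression approach on the next slide natural. A minimal sketch (illustrative, not from the slides; assumes spins in {−1, +1} and a symmetric pairwise weight matrix theta):

    import numpy as np

    def conditional_prob_ising(r, x, theta):
        # P(X_r = +1 | x_{\r}) for an Ising model with pairwise weights theta.
        # Entries theta[r, t] are zero for non-neighbors t, so the conditional
        # depends on x only through the Markov blanket N(r).
        field = sum(theta[r, t] * x[t] for t in range(len(x)) if t != r)
        return 1.0 / (1.0 + np.exp(-2.0 * field))  # logistic (sigmoid) form

    # Example: node r = 0 with neighbors 1 and 2
    theta = np.zeros((4, 4))
    theta[0, 1] = theta[1, 0] = 0.8
    theta[0, 2] = theta[2, 0] = -0.5
    x = np.array([+1, +1, -1, +1])
    print(conditional_prob_ising(0, x, theta))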
Practical method via neighborhood regression

Observation: recovering the graph G is equivalent to recovering the neighborhood set N(r) for all r ∈ V.

Method: given n i.i.d. samples {X^(1), ..., X^(n)}, perform logistic regression of each node X_r on X_{\r} := {X_t, t ≠ r} to estimate the neighborhood structure N̂(r).

1. For each node r ∈ V, perform ℓ1-regularized logistic regression of X_r on the remaining variables X_{\r}:

      θ̂[r] := arg min_{θ ∈ R^{p−1}}  { (1/n) Σ_{i=1}^n f(θ; X^{(i)}_{\r})   +   ρ_n ‖θ‖_1 }
                                          (logistic likelihood)      (regularization)

2. Estimate the local neighborhood N̂(r) as the support (non-zero entries) of the regression vector θ̂[r].

3. Combine the neighborhood estimates in a consistent manner (AND or OR rule); see the sketch below.
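A minimal Python sketch of steps 1–3, using scikit-learn's ℓ1-penalized logistic regression as the per-node solver. The function name, the 0/1 recoding of the ±1 spins, and the conversion of the regularization weight ρ_n to scikit-learn's C = 1/(n ρ_n) convention are illustrative assumptions, not part of the talk.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def neighborhood_select(X, rho_n):
        # l1-regularized logistic regression of each node on all others.
        # X: (n, p) array of +/-1 samples; rho_n: regularization weight.
        # Returns estimated adjacency matrices under the AND and OR combination rules.
        n, p = X.shape
        support = np.zeros((p, p), dtype=bool)
        for r in range(p):
            y = (X[:, r] > 0).astype(int)                 # recode node r as 0/1 labels
            Z = np.delete(X, r, axis=1)                   # remaining variables X_{\r}
            # sklearn minimizes C * sum(losses) + ||theta||_1, so C ~ 1/(n * rho_n)
            clf = LogisticRegression(penalty="l1", C=1.0 / (n * rho_n),
                                     solver="liblinear", fit_intercept=False)
            clf.fit(Z, y)
            coef = clf.coef_.ravel()
            nbrs = np.delete(np.arange(p), r)[np.abs(coef) > 1e-8]
            support[r, nbrs] = True                       # estimated neighborhood N_hat(r)
        A_and = support & support.T                       # keep edge if both endpoints agree
        A_or = support | support.T                        # keep edge if either endpoint includes it
        return A_and, A_or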
High-dimensional analysis

• classical analysis: dimension p fixed, sample size n → +∞
• high-dimensional analysis: allow the dimension p, the sample size n, and the maximum degree d to increase at arbitrary rates
• take n i.i.d. samples from the MRF defined by G_{p,d}
• study the probability of success as a function of the three parameters:

      Success(n, p, d) = P[Method recovers graph G_{p,d} from n samples]

• theory is non-asymptotic: explicit probabilities for finite (n, p, d)
Empirical behavior: Unrescaled plots

[Figure: probability of success versus the raw sample size n for a star graph with a linear fraction of neighbors; curves for p = 64, 100, 225.]
Empirical behavior: Appropriately rescaled

[Figure: probability of success versus the control parameter T_LR(n, p, d) for a star graph with a linear fraction of neighbors; curves for p = 64, 100, 225.]
Sufficient conditions for consistent model selection

• graph sequences G_{p,d} = (V, E) with p vertices and maximum degree d
• draw n i.i.d. samples, and analyze the probability of success indexed by (n, p, d)

Theorem (RavWaiLaf06, RavWaiLaf08): Suppose the rescaled sample size satisfies

      T_LR(n, p, d) := n / (d^3 log p)  >  T*_crit

and the regularization parameter satisfies ρ_n ≥ c_1 τ √(log p / n). Then with probability greater than 1 − 2 exp(−c_2 (τ − 2) log p) → 1:

(a) For each node r ∈ V, the ℓ1-regularized logistic convex program has a unique solution. (Non-trivial since p ≫ n ⇒ the objective is not strictly convex.)
(b) The estimated sign neighborhood N̂_±(r) correctly excludes all edges not in the true neighborhood.
(c) If the minimum edge weight satisfies θ_min ≥ c_3 τ √(d^2 log p / n), the method selects the correct signed neighborhood.
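An illustrative computation of the two rescalings appearing in the theorem: the control parameter T_LR(n, p, d) used in the plots, and the minimum edge-weight scale from part (c). The constants c_3 and τ and the degree values below are placeholders, since the theorem fixes them only up to unspecified constants.

    import numpy as np

    def control_parameter(n, p, d):
        # Rescaled sample size T_LR(n, p, d) = n / (d^3 log p)
        return n / (d ** 3 * np.log(p))

    def theta_min_scale(n, p, d, c3=1.0, tau=2.0):
        # Minimum edge-weight scale c3 * tau * sqrt(d^2 log p / n); c3, tau are placeholders
        return c3 * tau * np.sqrt(d ** 2 * np.log(p) / n)

    for p, d in [(64, 6), (100, 10), (225, 15)]:   # degrees chosen only for illustration
        for n in [200, 400, 800]:
            print(f"p={p:3d} d={d:2d} n={n:4d}  "
                  f"T_LR={control_parameter(n, p, d):.3f}  "
                  f"theta_min scale={theta_min_scale(n, p, d):.3f}")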
Some challenges in distinguishing graphs

[Figure: two four-node examples on vertices A, B, C, D illustrating "guilt by association" and "hidden interactions".]

Conditions on the Fisher information matrix Q* = E[∇² f(θ*; X)]:

A1. Bounded eigenspectra: λ(Q*_SS) ∈ [C_min, C_max].

A2. Mutual incoherence: there exists a ν ∈ (0, 1] such that

      ||| Q*_{S^c S} (Q*_SS)^{-1} |||_{∞,∞} ≤ 1 − ν,

where |||A|||_{∞,∞} := max_i Σ_j |A_ij|.
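A small numerical sketch of how conditions A1 and A2 can be checked for a given matrix Q* and support set S. The matrix below and the constants C_min, C_max are made-up inputs for illustration.

    import numpy as np

    def check_conditions(Q, S, C_min=0.1, C_max=10.0):
        # Check bounded eigenspectrum (A1) and mutual incoherence (A2) for support set S
        S = np.asarray(S)
        Sc = np.setdiff1d(np.arange(Q.shape[0]), S)
        Q_SS = Q[np.ix_(S, S)]
        Q_ScS = Q[np.ix_(Sc, S)]
        eigs = np.linalg.eigvalsh(Q_SS)
        a1 = (eigs.min() >= C_min) and (eigs.max() <= C_max)
        M = Q_ScS @ np.linalg.inv(Q_SS)
        incoherence = np.abs(M).sum(axis=1).max()   # |||Q_{S^c S} (Q_SS)^{-1}|||_{inf,inf}
        a2 = incoherence < 1.0                      # i.e. <= 1 - nu for some nu in (0, 1]
        return a1, a2, eigs, incoherence

    # Made-up symmetric positive-definite Q* with support S = {0, 1}
    Q = np.array([[1.0, 0.3, 0.2, 0.1],
                  [0.3, 1.0, 0.1, 0.2],
                  [0.2, 0.1, 1.0, 0.0],
                  [0.1, 0.2, 0.0, 1.0]])
    print(check_conditions(Q, [0, 1]))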
Proof sketch: Primal-dual certificate

• a proof technique, not a practical algorithm!
• construct a candidate primal-dual pair (θ̂, ẑ) ∈ R^{p−1} × R^{p−1}:

(A) For a fixed node r with S = N(r), solve the restricted program

      θ̂ = arg min_{θ ∈ R^{p−1}, θ_{S^c} = 0}  { (1/n) Σ_{i=1}^n f(θ; X^{(i)}_{\r}) + ρ_n ‖θ‖_1 },

    thereby obtaining the candidate solution θ̂ = (θ̂_S, 0_{S^c}).

(B) Choose ẑ_S ∈ R^{|S|} as an element of the subdifferential ∂‖θ̂_S‖_1.

(C) Using the optimality conditions of the original convex program, solve for ẑ_{S^c} and check whether or not strict dual feasibility |ẑ_j| < 1 holds for all j ∈ S^c.

Lemma: The full convex program recovers the neighborhood ⇐⇒ the primal-dual witness construction succeeds.
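Although the construction is a proof device rather than an algorithm, steps (A)–(C) can be carried out numerically for a single node. A sketch under illustrative assumptions (0/1 labels, scikit-learn used for the restricted program, and the same ρ_n → C = 1/(n ρ_n) conversion as in the earlier sketch); the witness_check name is made up.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def witness_check(Z, y, S, rho_n):
        # Primal-dual witness check for one node: solve the S-restricted l1 logistic
        # program, then verify strict dual feasibility |z_j| < 1 for all j outside S.
        # Z: (n, p-1) matrix of the other variables, y: 0/1 labels for node r,
        # S: indices of the true neighborhood within the columns of Z.
        n, q = Z.shape
        Sc = np.setdiff1d(np.arange(q), S)

        # (A) restricted program: l1-logistic regression using only the columns in S
        clf = LogisticRegression(penalty="l1", C=1.0 / (n * rho_n),
                                 solver="liblinear", fit_intercept=False)
        clf.fit(Z[:, S], y)
        theta_hat = np.zeros(q)
        theta_hat[S] = clf.coef_.ravel()            # candidate (theta_S, 0_{S^c})

        # (C) dual variables from the stationarity condition of the *full* program:
        #     grad f(theta_hat) + rho_n * z = 0   =>   z = -grad f(theta_hat) / rho_n
        mu = 1.0 / (1.0 + np.exp(-Z @ theta_hat))   # fitted probabilities
        grad = Z.T @ (mu - y) / n
        z_Sc = -grad[Sc] / rho_n
        return np.max(np.abs(z_Sc)) < 1.0, np.max(np.abs(z_Sc))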
Information-theoretic limits on graph selection

• thus far: we have exhibited a particular polynomial-time method that can recover the structure if

      n > Ω(d^3 log(p − d))

• but... is this a "good" result?
• are there polynomial-time methods that can do better?
• information theory can answer the question: is there an exponential-time method that can do better?
  (Santhanam & Wainwright, 2008)
Graph selection as channel coding

• graphical model selection is an unorthodox channel coding problem:
  – nature sends a "codeword" G ∈ G_{d,p} := { graphs on p vertices, max. degree d }
  – the channel generates n samples:  G → P(X | G) → X^(1), ..., X^(n)
• decoding problem: use the observations {X^(1), ..., X^(n)} to correctly distinguish the codeword
• channel capacity for graph decoding: balance between
  – log number of models: log |M(p, d)| = Θ( pd log(p/d) )
  – relative distinguishability of different models
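To get a feel for the "log number of models" term, here is a crude count (an illustration, not the exact cardinality used in the analysis): choosing d potential neighbors independently for each of the p nodes gives roughly C(p−1, d)^p configurations, whose logarithm grows like pd log(p/d), matching the rate on the slide up to constants.

    import math

    def log_crude_count(p, d):
        # log of C(p-1, d)^p: independent per-node neighborhood choices (in nats)
        return p * (math.lgamma(p) - math.lgamma(d + 1) - math.lgamma(p - d))

    def slide_rate(p, d):
        # The Theta(p d log(p/d)) scaling quoted on the slide (constant factor omitted)
        return p * d * math.log(p / d)

    for p, d in [(100, 3), (1000, 3), (1000, 10)]:
        print(p, d, round(log_crude_count(p, d)), round(slide_rate(p, d)))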
Necessary conditions for graph recovery

• take Ising models P_θ(G) from the family G_{d,p}(λ, ω):
  – graphs with p nodes and maximum degree d
  – parameters |θ_st| ≥ λ for all edges (s, t)
  – maximum neighborhood weight ω = max_{s ∈ V} Σ_{t ∈ N(s)} |θ_st|
• take n i.i.d. observations, and study the probability of success in terms of (n, p, d)

Theorem (Santhanam & W., 2008): If the sample size is bounded as

      n ≤ max{ log p / (2λ tanh(λ)),  (d/8) log(p/(8d)),  exp(ω/2) λ d log(pd) / (16 sinh(λ)) },

then the probability of error of any algorithm over G_{d,p}(λ, ω) is at least 1/2.