 
The Search For Structure
or
The Relationship Between Structure and Prediction

June 2012
Larry Wasserman
Dept of Statistics and Machine Learning Department
Carnegie Mellon University
The Search For Structure

Searching For Structure
⇓
choose tuning parameters for structure finding
⇓
converting structure finding into prediction
⇓
conformal inference (distribution free prediction)
The Three Lectures

1. The Search For Structure. (Today).
2. Manifolds and Filaments.
3. Undirected Graphs.
Collaborators

• Xi Chen • Chris Genovese • Haijie Gu • Anupam Gupta • John Lafferty • Jing Lei • Han Liu • Pradeep Ravikumar • Marco Perone-Pacifico • Isabella Verdinelli • Min Xu • Aarti Singh • Martin Azizyan • Sivaraman Balakrishnan • Don Sheehy • Mladen Kolar • Alessandro Rinaldo • And ...
Outline

1. Prediction is easy, finding structure is hard.
2. Examples.
3. Using prediction to find structure: (minimax) conformal prediction.
(4. Using structure to help with prediction: (minimax) semisupervised inference.)
The Three Eras of Statistics and Machine Learning

1. PALEOZOIC: parameter estimation
   (a) mle
   (b) confidence intervals, etc.
2. MESOZOIC: prediction
   (a) classification
   (b) regression
   (c) SVM etc
3. CENOZOIC: the search for structure
   (a) graphical models
   (b) manifolds
   (c) matrix factorization
Prediction is “Easy.” Example 1: Nonparametric Regression

Let $(X_1, Y_1), \ldots, (X_{2n}, Y_{2n}) \sim P$. Split the data into training and test. Let $\{\hat m_h : h \in \mathcal{H}\}$ be estimates of $m(x) = E(Y \mid X = x)$ from the training data. Choose $\hat h$ to minimize
$$\frac{1}{n} \sum_{i \in \text{test}} \bigl(Y_i - \hat m_h(X_i)\bigr)^2 .$$
Then*
$$\text{Risk}(\hat m_{\hat h}) \le c_1 \, \text{Risk}(\hat m_*) + c_2 \, \frac{\log |\mathcal{H}|}{n} .$$
* See Gyorfi et al, for example.
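The data-splitting step can be sketched in a few lines. This is only an illustration under stated assumptions: the estimator $\hat m_h$ is taken to be a Nadaraya-Watson smoother with a Gaussian kernel (the slide does not say which estimator is used), and the names `nw_estimate`, `select_bandwidth`, the simulated data and the bandwidth grid are all hypothetical.

```python
import numpy as np

def nw_estimate(x_train, y_train, x_eval, h):
    """Nadaraya-Watson estimate of m(x) = E[Y | X = x] with a Gaussian kernel."""
    diffs = (x_eval[:, None] - x_train[None, :]) / h
    weights = np.exp(-0.5 * diffs ** 2)
    # Guard against division by zero if every weight underflows.
    denom = np.maximum(weights.sum(axis=1), 1e-300)
    return (weights @ y_train) / denom

def select_bandwidth(x, y, bandwidths, rng):
    """Split the data in half; return the h with the smallest test-half squared error."""
    idx = rng.permutation(len(x))
    half = len(x) // 2
    train, test = idx[:half], idx[half:]
    errors = np.array([
        np.mean((y[test] - nw_estimate(x[train], y[train], x[test], h)) ** 2)
        for h in bandwidths
    ])
    return bandwidths[int(np.argmin(errors))], errors

# Simulated example (assumed data, not from the lecture).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=400)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=400)
grid = np.array([0.01, 0.02, 0.05, 0.1, 0.2, 0.5])
h_hat, test_errors = select_bandwidth(x, y, grid, rng)
```

The only tuning decision is the finite set $\mathcal{H}$ of candidate bandwidths, whose size enters the bound through $\log |\mathcal{H}|$.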
Prediction is “Easy.” Example 2: The Lasso.

Let $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$ where $X_i \in \mathbb{R}^d$. Let $\hat\beta$ minimize
$$\sum_{i=1}^n (Y_i - X_i^T \beta)^2 \quad \text{s.t. } \|\beta\|_1 \le L \quad \text{(the lasso).}$$
Then, w.h.p.*
$$R(\hat\beta) \le R(\beta_*) + O\!\left( \sqrt{\frac{L^4 \log d}{n}} \right)$$
where $\beta_*$ minimizes $\text{Risk}(\beta)$ subject to $\|\beta\|_1 \le L$. Choose $L$ by cross-validation.
* See Greenshtein and Ritov 2004
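A minimal sketch of the constrained form of the lasso, for readers who want to see the $\|\beta\|_1 \le L$ constraint handled directly. The solver here is projected gradient descent with the standard sort-based projection onto the $\ell_1$ ball; that choice, the function names, the iteration count and the simulated data are assumptions made for illustration, not the method used in the lecture.

```python
import numpy as np

def project_onto_l1_ball(v, radius):
    """Euclidean projection of v onto {b : ||b||_1 <= radius}, radius > 0."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    magnitudes = np.sort(np.abs(v))[::-1]           # decreasing order
    cumulative = np.cumsum(magnitudes)
    ranks = np.arange(1, len(v) + 1)
    # Last rank at which the entry still exceeds the soft-threshold level.
    rho = np.nonzero(magnitudes * ranks > cumulative - radius)[0][-1]
    theta = (cumulative[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def constrained_lasso(X, y, L, n_iter=500):
    """Projected gradient descent for min ||y - X b||^2 subject to ||b||_1 <= L."""
    # Step size 1 / (Lipschitz constant of the gradient) = 1 / (2 * sigma_max(X)^2).
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = -2.0 * X.T @ (y - X @ beta)
        beta = project_onto_l1_ball(beta - step * gradient, L)
    return beta

# Simulated example (assumed data): sparse truth with ||beta||_1 = 3.5.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
true_beta = np.zeros(20)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + 0.1 * rng.normal(size=100)
beta_hat = constrained_lasso(X, y, L=3.0)
```

In practice $L$ would be chosen by cross-validation, as the slide says; the value `L=3.0` above is fixed only to keep the example short.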
Prediction is “Easy.” Example 3: SpAM. Sparse Additive Models*

$$Y = m(X) + \epsilon, \qquad m(X) = \sum_{j=1}^d s_j(X_j) .$$
Choose $\hat s_1, \ldots, \hat s_d$ to minimize
$$\sum_{i=1}^n \Bigl( Y_i - \sum_j s_j(X_{ij}) \Bigr)^2$$
subject to $s_j$ being smooth and $\sum_j \|s_j\| \le L$. Here $X_{ij}$ denotes the $j$-th coordinate of $X_i$.
* Ravikumar, Lafferty, Liu and Wasserman 2009
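Prediction is “Easy.” Example 3: SpAM.

Choose $L$ by minimizing generalized cross-validation
$$\frac{\frac{1}{n} \text{RSS}}{\bigl(1 - \text{df}(L)/n\bigr)^2} .$$
If $d \le e^{n^\xi}$ for $\xi < 1$ then
$$\text{Risk}(\hat m) - \text{Risk}(m_*) = O_P\!\left( \left( \frac{1}{n} \right)^{\frac{1-\xi}{2}} \right) .$$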
Prediction

Prediction is easy because:
1. Goal is clear.
2. Tuning parameters can be selected by cross-validation, data-splitting etc.

Important: The results on data-splitting give distribution-free guarantees. This is a goal we want to emulate.
Structure Finding

Examples:
-clustering
-curve clustering
-manifolds
-graphs
-graph-valued regression
(Details about graphs and manifolds in lectures 2 and 3.)

In this talk, we will show how prediction helps finding structure.
Clustering

Despite many, many years of research and many, many papers, there does not seem to be a consensus on how to choose tuning parameters.
• k-means: choose k.
• Density-based clustering: choose bandwidth h.
• Hierarchical clustering: choice of merging rule.
• Spectral clustering: many parameters.
Clustering

Various suggestions include:
• stability
• hypothesis testing
• information-theoretic
• others

I’ll (tentatively) propose an alternative.
Example of Our Results: Distribution Free Curve Clustering

[Figure: curve clustering example.]
Relating Structure to Prediction

Our approach (Lei, Rinaldo, Robins, Wasserman) is to convert a structure-finding problem into a prediction problem.

Example: Density estimation ⟹ conformal prediction.

Conformal prediction is due to Vovk et al.

Rest of talk:
-explain conformal prediction
-minimax theory for conformal prediction (briefly)
-using conformal prediction to guide structure finding
Conformal Inference

A theory of distribution free prediction. See: Vovk, Gammerman and Shafer (2005) + many papers by Vovk and co-workers. (See also Phil Dawid’s work on prequential inference.)

Our contribution: marrying conformal inference with traditional statistical theory (minimax theory) and extending some of the techniques:
-Lei, Robins and Wasserman (arXiv:1111.1418)
-Lei, Wasserman (arXiv:1203.5422)
-Lei, Rinaldo, Wasserman (submitted to NIPS)
-Lei, Robins and Wasserman (in progress)
(Batch) Conformal Prediction

Observe $Y_1, \ldots, Y_n \sim P$. Construct $C_n \equiv C_n(Y_1, \ldots, Y_n)$ such that
$$\mathbb{P}(Y_{n+1} \in C_n) \ge 1 - \alpha \quad \text{for all } P \text{ and all } n.$$
Here the probability is taken under the product measure, $\mathbb{P} \equiv P^{n+1}$.

See Vovk et al for the general (sequential) theory. We are only concerned with the batch version. We will also be concerned with minimax optimality (efficiency).
(Batch) Conformal Prediction

1. Observe $Y_1, \ldots, Y_n \sim P$ where $Y_i \in \mathbb{R}^d$.
2. Choose any fixed $y \in \mathbb{R}^d$.
3. Let $\text{aug}(y) = (Y_1, \ldots, Y_n, y)$.
4. Compute conformity scores $\sigma_1(y), \ldots, \sigma_{n+1}(y)$.
5. Under $H_0 : Y_{n+1} = y$, the ranks are uniform. The p-value is
$$\pi(y) = \frac{\sum_{i=1}^{n+1} I\bigl(\sigma_i(y) \le \sigma_{n+1}(y)\bigr)}{n+1} .$$
6. Invert the test: $C_n = \{ y : \pi(y) \ge \alpha \}$.
Conformity Scores

Use $\text{aug}(y) = (Y_1, \ldots, Y_n, y)$ to construct a function $g$. Compute
$$\sigma_i(y) = \begin{cases} g(Y_i) & i = 1, \ldots, n \\ g(y) & i = n+1 . \end{cases}$$
Example: $\sigma_i = -|Y_i - \bar Y(y)|$ where
$$\bar Y(y) = \frac{y + \sum_{i=1}^n Y_i}{n+1} .$$
* In certain cases, we need to use $\sigma_i = g_i(Y_i)$ where $g_i$ is built from $\text{aug}(y) - \{Y_i\}$. More on this later.
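The batch procedure with the example score $\sigma_i = -|Y_i - \bar Y(y)|$ can be sketched as follows. This is an illustrative one-dimensional version: the function names, the candidate grid and the simulated sample are assumptions, and the prediction set is only evaluated on that grid rather than over all of $\mathbb{R}$.

```python
import numpy as np

def conformal_p_value(sample, y):
    """p-value pi(y) using the score sigma_i = -|Y_i - mean(aug(y))|."""
    aug = np.append(sample, y)            # aug(y) = (Y_1, ..., Y_n, y)
    center = aug.mean()                   # augmented mean: (y + sum Y_i) / (n + 1)
    scores = -np.abs(aug - center)        # sigma_1(y), ..., sigma_{n+1}(y)
    # Fraction of scores no larger than the score of the candidate y.
    return np.mean(scores <= scores[-1])

def conformal_set(sample, candidates, alpha):
    """Candidates y with pi(y) >= alpha, i.e. C_n restricted to the grid."""
    p_values = np.array([conformal_p_value(sample, y) for y in candidates])
    return candidates[p_values >= alpha]

# Simulated example (assumed data).
rng = np.random.default_rng(2)
sample = rng.normal(size=99)
candidates = np.linspace(-6.0, 6.0, 241)
region = conformal_set(sample, candidates, alpha=0.1)
```

Note that the scores are recomputed for every candidate $y$, because the augmented mean changes with $y$. That recomputation is what makes the ranks exchangeable under $H_0$.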
(Batch) Conformal Prediction

When $H_0 : Y_{n+1} = y$ is true, the ranks of the $\sigma_i$’s are uniform. It follows that, for any $P$ and any $n$,
$$\mathbb{P}(Y_{n+1} \in C_n) \equiv P^{n+1}(Y_{n+1} \in C_n) \ge 1 - \alpha .$$
This is true, finite sample, distribution-free prediction.

But what is the best conformity score?
Oracle

Best (smallest) prediction set or Oracle:
$$C^* = \{ y : p(y) > \lambda \}$$
where $\lambda$ is such that $P(C^*) = 1 - \alpha$.

The form of $C^*$ suggests using an estimate $\hat p$ of $p$ to define a conformity score. And this leads to a method for level set density clustering.
Loss Function

Loss function:
$$L(C) = \mu(C \,\Delta\, C^*)$$
where $A \,\Delta\, B = (A \cap B^c) \cup (A^c \cap B)$ and $\mu$ is Lebesgue measure.

Minimax risk:
$$\inf_{C \in \Gamma_n} \sup_{P \in \mathcal{P}} E_P\bigl[ \mu(C \,\Delta\, C^*) \bigr]$$
where $\Gamma_n$ denotes all $1 - \alpha$ prediction regions.
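To make the loss concrete, here is a small numerical sketch that approximates $\mu(C \,\Delta\, C^*)$ on a grid. It assumes a standard normal $P$ in one dimension with $\alpha = 0.1$, so the oracle set is approximately the interval $(-1.645, 1.645)$, and it compares the oracle with a hypothetical competitor shifted by 0.1. The function name and the grid are illustrative choices.

```python
import numpy as np

def symmetric_difference_measure(mask_a, mask_b, spacing):
    """Grid approximation of the Lebesgue measure of A symmetric-difference B."""
    # XOR marks grid points that lie in exactly one of the two sets.
    return spacing * np.count_nonzero(mask_a ^ mask_b)

grid = np.linspace(-5.0, 5.0, 10001)
spacing = grid[1] - grid[0]

# Oracle level set for N(0, 1) at alpha = 0.1 (0.95 quantile is about 1.645).
oracle = np.abs(grid) < 1.645
# A hypothetical prediction set of the same length, shifted by 0.1.
shifted = np.abs(grid - 0.1) < 1.645

loss = symmetric_difference_measure(shifted, oracle, spacing)
```

Shifting an interval by 0.1 removes a strip of width 0.1 on one side and adds one on the other, so `loss` comes out close to 0.2 up to grid error.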
Kernel Conformity

Define the augmented kernel density estimator
$$\hat p_h^{\,y}(u) = \frac{1}{n+1} \sum_{i=1}^n \frac{1}{h^d} K\!\left( \frac{\|u - Y_i\|}{h} \right) + \frac{1}{n+1} \frac{1}{h^d} K\!\left( \frac{\|u - y\|}{h} \right) .$$
Let
$$\sigma_i(y) = \hat p_h^{\,y}(Y_i), \qquad \sigma_{n+1}(y) = \hat p_h^{\,y}(y),$$
$$\pi(y) = \frac{\sum_{i=1}^{n+1} I\bigl(\sigma_i(y) \le \sigma_{n+1}(y)\bigr)}{n+1},$$
$$C_n = \{ y : \pi(y) \ge \alpha \} .$$
Then $\mathbb{P}(Y_{n+1} \in C_n) \ge 1 - \alpha$ for all $P$ and $n$.
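A minimal sketch of the kernel conformity score in one dimension ($d = 1$), assuming a Gaussian kernel for $K$. The kernel choice, the bandwidth value, the function names and the simulated sample are illustrative assumptions; the slide does not fix any of them.

```python
import numpy as np

def gaussian_kernel(t):
    """Standard Gaussian kernel K(t)."""
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def kernel_conformal_p_value(sample, y, h):
    """p-value pi(y) with scores given by the augmented KDE (d = 1)."""
    aug = np.append(sample, y)                        # aug(y) = (Y_1, ..., Y_n, y)
    dists = np.abs(aug[:, None] - aug[None, :])       # pairwise distances
    # Augmented KDE evaluated at each point of aug(y): the sum runs over all
    # n + 1 points, so the candidate y contributes a kernel bump of its own.
    scores = gaussian_kernel(dists / h).sum(axis=1) / (len(aug) * h)
    return np.mean(scores <= scores[-1])

# Simulated example (assumed data).
rng = np.random.default_rng(3)
sample = rng.normal(size=200)
h, alpha = 0.3, 0.1
p_center = kernel_conformal_p_value(sample, 0.0, h)   # candidate near the mode
p_far = kernel_conformal_p_value(sample, 50.0, h)     # candidate far from the data
in_region = p_center >= alpha
```

A candidate far from all data receives only its own kernel bump, so it has the lowest score and the smallest possible p-value, $1/(n+1)$.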
Helpful Approximation

$C_n$ is not a density level set. Also, it is expensive to compute. However, $C_n \subset C_n^+$ where
$$C_n^+ = \{ y : \hat p_h(y) > c_n \}, \qquad c_n = \hat p_h(Y_{(n\alpha)}) - \frac{K(0)}{n h^d},$$
and $Y_{(1)}, Y_{(2)}, \ldots$ are ordered by increasing estimated density, $\hat p_h(Y_{(1)}) \le \hat p_h(Y_{(2)}) \le \cdots$, so that $Y_{(n\alpha)}$ sits near the lower $\alpha$ quantile of the density values.

The set $C_n^+$ involves no augmentation but still satisfies $\mathbb{P}(Y_{n+1} \in C_n^+) \ge 1 - \alpha$. Its connected components are the density clusters.
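The approximating level set and its connected components can be sketched in one dimension as follows. Several things here are assumptions made for illustration: a Gaussian kernel, the rank $\lfloor n\alpha \rfloor$ (at least 1) for $Y_{(n\alpha)}$, sample points ranked from lowest to highest estimated density, and a finite grid on which the set is evaluated. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def kde(sample, points, h):
    """Ordinary (non-augmented) kernel density estimate in one dimension."""
    scaled = (points[:, None] - sample[None, :]) / h
    return gaussian_kernel(scaled).sum(axis=1) / (len(sample) * h)

def level_set_threshold(sample, h, alpha):
    """c_n = p_h(Y_(n alpha)) - K(0) / (n h), ranking points from lowest density up."""
    n = len(sample)
    densities = np.sort(kde(sample, sample, h))     # increasing order
    k = max(int(np.floor(n * alpha)), 1)            # 1-based rank, at least 1
    return densities[k - 1] - gaussian_kernel(0.0) / (n * h)

def connected_components_1d(grid, mask):
    """Return (left, right) endpoints of each run of True values in mask."""
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2] - 1
    return [(grid[s], grid[e]) for s, e in zip(starts, ends)]

# Simulated two-cluster example (assumed data).
rng = np.random.default_rng(4)
sample = np.concatenate([rng.normal(-3.0, 0.5, 150), rng.normal(3.0, 0.5, 150)])
h, alpha = 0.3, 0.1
c_n = level_set_threshold(sample, h, alpha)
grid = np.linspace(-6.0, 6.0, 601)
clusters = connected_components_1d(grid, kde(sample, grid, h) > c_n)
```

Only one density estimate is needed here, instead of one augmented estimate per candidate $y$, which is why this set is much cheaper to compute than $C_n$.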
Optimality

Assuming Hölder-$\beta$ smoothness, if $h_n \asymp (\log n / n)^{\frac{1}{2\beta + d}}$ then (with high probability)
$$\mu(C_n \,\Delta\, C^*) \lesssim \left( \frac{\log n}{n} \right)^{\frac{\beta}{2\beta + d}} .$$
The same holds for $C_n^+$. This rate is minimax optimal: w.h.p.
$$\inf_C \sup_{P \in \mathcal{P}} L(C) \gtrsim \left( \frac{\log n}{n} \right)^{\frac{\beta}{2\beta + d}}$$
where the infimum is over all level $1 - \alpha$ prediction sets.

Note: the minimax result requires smoothness assumptions; the finite sample distribution free guarantee does not.

Note: the rate for the alternative loss $L(C) = \mu(C) - \mu(C^*)$ is faster.
Data-Driven Bandwidth

Each bandwidth $h$ yields a conformal prediction region $C_{n,h}$. Choose $h$ to minimize $\mu(C_{n,h})$. (With some adjustments, this still has finite sample validity.)

[Figure: $\mu(\hat C)$, $\mu(\tilde C^-)$ and $\mu(\tilde C^+)$ plotted against $\log_2(h/h_n)$.]
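A rough sketch of bandwidth selection by minimizing the size of the conformal region, in one dimension with a Gaussian kernel. The Lebesgue measure is approximated by counting grid points, and the adjustments needed to retain finite sample validity after selecting $h$ are not implemented here. The function names, grid, candidate bandwidths and simulated data are all assumptions.

```python
import numpy as np

def kernel_conformal_p_value(sample, y, h):
    """Conformal p-value with augmented Gaussian-kernel density scores (d = 1)."""
    aug = np.append(sample, y)
    scaled = (aug[:, None] - aug[None, :]) / h
    # The kernel's normalizing constant is omitted: it does not change the ranks.
    scores = np.exp(-0.5 * scaled ** 2).sum(axis=1)
    return np.mean(scores <= scores[-1])

def conformal_set_measure(sample, grid, h, alpha):
    """Grid approximation of mu(C_{n,h}): spacing times number of accepted points."""
    p_values = np.array([kernel_conformal_p_value(sample, y, h) for y in grid])
    return (grid[1] - grid[0]) * np.count_nonzero(p_values >= alpha)

def select_bandwidth_by_measure(sample, grid, bandwidths, alpha):
    """Return the bandwidth with the smallest approximate region size."""
    measures = np.array([conformal_set_measure(sample, grid, h, alpha)
                         for h in bandwidths])
    return bandwidths[int(np.argmin(measures))], measures

# Simulated example (assumed data).
rng = np.random.default_rng(5)
sample = rng.normal(size=100)
grid = np.linspace(-5.0, 5.0, 121)
bandwidths = np.array([0.1, 0.3, 1.0, 3.0])
h_best, measures = select_bandwidth_by_measure(sample, grid, bandwidths, alpha=0.1)
```

The criterion needs no held-out data and no smoothness assumption: a smaller region at the same coverage level is simply a better prediction set.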
[Figure: Lebesgue measure versus bandwidth, with additional panels.]
Level Set Clustering

To summarize so far:
-choose tuning parameters by minimizing size of conformal prediction region
-leads to optimized density clusters
-and the resulting set has a finite-sample prediction property
2d Example

[Figure: two panels with axes y(1) and y(2). Left panel legend: optimal set, outer bound, inner bound, conformal set, data points. Right panel legend: data points in region, data points not in region, convex hull of data points in region.]

Left: conformal. Right: Data-depth method (Li and Liu 2008). The conformal method is 1,000 times faster.