Seeking Interpretable Models for High Dimensional Data
Bin Yu
Statistics Department, EECS Department
University of California, Berkeley
http://www.stat.berkeley.edu/~binyu
Characteristics of Modern Data Sets
Goal: efficient use of data for:
– Prediction
– Interpretation (proxy: sparsity)
Larger number of variables:
– Number of variables (p) in data sets is large
– Sample sizes (n) have not increased at the same pace
Scientific opportunities:
– New findings in different scientific fields
Today’s Talk
Understanding early visual cortex area V1 through fMRI
Occam’s Razor
Lasso: linear regression and Gaussian graphical models
Discovering compressive property of V1 through shared non-linear sparsity
Future work
Understanding visual pathway
Gallant Lab at UCB is a leading vision lab.
[Figures: Da Vinci (1452-1519); mapping of different visual cortex areas, Polyak (1957); small left middle grey area: V1]
Understanding visual pathway through fMRI
One goal at Gallant Lab: understand how natural images relate to fMRI signals
Gallant Lab in Nature News
Published online 5 March 2008 | Nature | doi:10.1038/news.2008.650
Mind-reading with a brain scan
Brain activity can be decoded using magnetic resonance imaging.
Kerri Smith
Scientists have developed a way of 'decoding' someone's brain activity to determine what they are looking at. "The problem is analogous to the classic 'pick a card, any card' magic trick," says Jack Gallant, a neuroscientist at the University of California in Berkeley, who led the study.
Stimuli
Natural image stimuli
Stimulus to fMRI response
Natural image stimuli drawn randomly from a database of 11,499 images
Experiment designed so that responses from different presentations are nearly independent
fMRI response is pre-processed and roughly Gaussian
Gabor Wavelet Pyramid
Features
“Neural” (fMRI) encoding for visual cortex V1
Predictor: p = 10,921 features of an image
Response: (preprocessed) fMRI signal at a voxel
n = 1750 samples
Goal: understanding the human visual system
– interpretable (sparse) model desired
– good prediction is necessary
Minimization of an empirical loss (e.g. L2) leads to an ill-posed computational problem and bad prediction.
Linear Encoding Model by Gallant Lab
Data
– X: p = 10,921 dimensions (features)
– Y: fMRI signal
– n = 1750 training samples
Separate linear model for each voxel via e-L2boosting (or Lasso)
Fitted model tested on 120 validation samples
– Performance measured by correlation
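A minimal sketch of this per-voxel fitting pipeline, with cross-validated Lasso standing in for the lab's e-L2boosting; the array names and shapes are illustrative assumptions, not the lab's code.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_voxel_encoding(X_train, Y_train, X_val, Y_val):
    """X_*: (n_samples, p) feature matrices; Y_*: (n_samples, n_voxels) fMRI responses."""
    n_voxels = Y_train.shape[1]
    val_cc = np.zeros(n_voxels)
    for v in range(n_voxels):
        # one sparse linear model per voxel, tuning parameter chosen by cross-validation
        model = LassoCV(cv=5).fit(X_train, Y_train[:, v])
        pred = model.predict(X_val)
        # predictive performance on held-out samples measured by correlation (cc)
        val_cc[v] = np.corrcoef(pred, Y_val[:, v])[0, 1]
    return val_cc
```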
Modeling “history” at Gallant Lab
Prediction on the validation set is the benchmark
Methods tried: neural nets, SVMs, e-L2boosting (Lasso)
Among models with similar predictions, simpler (sparser) models by e-L2boosting are preferred for interpretation
This practice reflects a general trend in statistical machine learning: moving from prediction to simpler/sparser models for interpretation, faster computation or data transmission.
Occam’s Razor
14th-century English logician and Franciscan friar, William of Ockham
Principle of Parsimony: Entities must not be multiplied beyond necessity. (Wikipedia)
Occam’s Razor via Model Selection in Linear Regression
• Maximum likelihood (ML) is least squares (LS) under the Gaussian assumption
• There are 2^p submodels
• ML always favors the largest submodel, the one with all predictors
• The largest model often gives bad prediction when p is large
Model Selection Criteria
Akaike (1973, 1974) and Mallows' Cp use an estimated prediction error to choose a model:
$\hat{k} = \arg\min_k \{\, \mathrm{RSS}(k) + 2\sigma^2 k \,\}$
Schwarz (1978) proposed BIC:
$\hat{k} = \arg\min_k \{\, \mathrm{RSS}(k) + \sigma^2 k \log n \,\}$
Both are penalized LS with a penalty proportional to the model size k.
Rissanen's Minimum Description Length (MDL) principle gives rise to many different criteria. The two-part code leads to BIC.
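For small p these criteria can be applied by exhaustive search over all 2^p submodels; a minimal sketch below scores each submodel by a penalized Gaussian log-likelihood (sigma^2 profiled out), with AIC and BIC penalties. Function and variable names are illustrative.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, criterion="BIC"):
    """Exhaustive search over all 2^p submodels, scored by penalized LS (AIC or BIC)."""
    n, p = X.shape
    best_score, best_S = np.inf, ()
    for k in range(p + 1):
        for S in combinations(range(p), k):
            if k == 0:
                rss = np.sum(y ** 2)                      # empty model: no predictors
            else:
                Xs = X[:, S]
                beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                rss = np.sum((y - Xs @ beta) ** 2)
            # Gaussian log-likelihood up to constants is n*log(rss/n);
            # AIC penalizes by 2k, BIC by k*log(n)
            penalty = 2 * k if criterion == "AIC" else np.log(n) * k
            score = n * np.log(rss / n) + penalty
            if score < best_score:
                best_score, best_S = score, S
    return best_S
```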
Model Selection for the image-fMRI problem
For the linear encoding model, the number of submodels is $2^{10,921}$
Combinatorial search: too expensive and often not necessary
A recent alternative: continuous embedding into a convex optimization problem through L1-penalized LS (Lasso), a third-generation computational method in statistics or machine learning.
Lasso: L1-norm as a penalty
The L1 penalty on coefficients $\beta = (\beta_1, \dots, \beta_p)$ is $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$
Used initially with L2 loss:
$\hat\beta(\lambda) = \arg\min_\beta \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$
– Signal processing: Basis Pursuit (Chen & Donoho, 1994)
– Statistics: Non-Negative Garrote (Breiman, 1995)
– Statistics: LASSO (Tibshirani, 1996)
Properties of Lasso
– Sparsity (variable selection) and regularization
– Convexity (convex relaxation of the L0 penalty)
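A minimal coordinate-descent sketch of this L1-penalized least-squares problem, with the sparsity coming from soft-thresholding one coefficient at a time. This illustrates the objective only (it is not the path algorithms on the next slide) and assumes no column of X is identically zero.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ b                               # current residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                 # remove feature j's contribution
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]                 # add back the updated contribution
    return b
```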
Lasso: computation and evaluation
The “right” tuning parameter $\lambda$ is unknown, so a “path” in $\lambda$ is needed (discretized or continuous)
Initially: a quadratic program (QP) solved for each $\lambda$ on a grid
Later: path-following algorithms such as
– homotopy by Osborne et al (2000)
– LARS by Efron et al (2004)
Theoretical studies: much recent work on Lasso in terms of
– L2 prediction error
– L2 error of parameter estimates
– model selection consistency
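A short usage sketch of the LARS path via scikit-learn's implementation of Efron et al.'s algorithm; the simulated design and sparse truth below are illustrative.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
beta = np.zeros(50)
beta[:5] = [3, -2, 1.5, 1, -1]                      # sparse true coefficient vector
y = X @ beta + rng.standard_normal(200)

# entire regularization path in one pass
alphas, active, coefs = lars_path(X, y, method="lasso")
# coefs[:, k] is the coefficient vector at penalty level alphas[k];
# a tuning parameter can then be chosen by cross-validation on this grid.
print(coefs.shape, alphas.shape)
```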
Model Selection Consistency of Lasso
Set-up: linear regression model $Y = X\beta + \epsilon$ with n observations, p predictors, and a true $\beta$ supported on s predictors
Assume (A): $C_n = \tfrac{1}{n} X^{T} X \to C$, a positive definite matrix
Knight and Fu (2000) showed L2 estimation consistency under (A).
Model Selection Consistency of Lasso
p small, n large (Zhao and Yu, 2006): assume (A) and a suitable rate for the regularization parameter. Then roughly*
Irrepresentable condition ⇔ model selection consistency
Population version of the condition, with C partitioned by the s relevant and (p-s) irrelevant predictors (a 1 by (p-s) vector):
$\big\| \mathrm{sign}(\beta_{(1)})^{T} C_{11}^{-1} C_{12} \big\|_\infty \le 1$
* Some ambiguity when equality holds.
Related work: Tropp (06), Meinshausen and Buhlmann (06), Zou (06), Wainwright (06)
Irrepresentable condition (s=2, p=3): geometry
[Figure: geometry of the condition for designs with correlation r = 0.4 and r = 0.6]
Model Selection Consistency of Lasso
Consistency also holds for s and p growing with n, assuming
– the irrepresentable condition
– bounds on the max and min eigenvalues of the design matrix
– the smallest nonzero coefficient bounded away from zero
Gaussian noise (Wainwright, 06): p may grow exponentially with n
Finite 2k-th moment noise (Zhao & Y, 06): p may grow polynomially with n
Consistency of Lasso for Model Selection
Interpretation of the condition:
– Regress each irrelevant predictor on the relevant predictors. If the L1 norm of the (signed) regression coefficients (*) is
– larger than 1, Lasso cannot distinguish the irrelevant predictor from the relevant predictors for some parameter values;
– smaller than 1, Lasso can distinguish the irrelevant predictor from the relevant predictors.
Sufficient conditions (verifiable):
– Constant correlation
– Power-decay correlation
– Bounded correlation*
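The population condition is easy to check numerically. Below is a minimal sketch for a constant-correlation design; the design parameters (p, s, r) are illustrative assumptions.

```python
import numpy as np

def irrepresentable_value(C, support, sign_beta):
    """Return max_j |(C_21 C_11^{-1} sign(beta_1))_j| over irrelevant predictors j."""
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(C.shape[0]), S)
    C11 = C[np.ix_(S, S)]                           # relevant-relevant block
    C21 = C[np.ix_(Sc, S)]                          # irrelevant-relevant block
    return np.max(np.abs(C21 @ np.linalg.solve(C11, sign_beta)))

p, s, r = 10, 3, 0.4                                # constant correlation r
C = (1 - r) * np.eye(p) + r * np.ones((p, p))
val = irrepresentable_value(C, support=range(s), sign_beta=np.ones(s))
print(val, val < 1)   # 0.667 < 1: the condition holds for this design
```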
Sparse covariance estimation via L1 penalty
Banerjee, El Ghaoui, d'Aspremont (08)
L1 penalized log Gaussian likelihood
Given n iid observations of $X \sim N(\mu, \Sigma)$ with sample covariance $\hat\Sigma$, Banerjee, El Ghaoui, d'Aspremont (08) estimate the concentration (inverse covariance) matrix by
$\hat\Theta = \arg\min_{\Theta \succ 0} \; \mathrm{tr}(\hat\Sigma\,\Theta) - \log\det\Theta + \lambda \|\Theta\|_1,$
solved by a block descent algorithm.
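A brief sketch using scikit-learn's GraphicalLasso solver for this penalized likelihood (a later implementation in the same family as, but not identical to, Banerjee et al.'s block descent); the chain-graph precision matrix and penalty level are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
# sparse precision matrix for a chain graph on 5 nodes
Theta = np.eye(5) + 0.4 * (np.eye(5, k=1) + np.eye(5, k=-1))
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Theta), size=500)

model = GraphicalLasso(alpha=0.1).fit(X)
Theta_hat = model.precision_          # sparse estimate of the inverse covariance
print(np.round(Theta_hat, 2))
```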
Model selection consistency
Ravikumar, Wainwright, Raskutti, Yu (08) give sufficient conditions for model selection consistency.
Hessian of the log-det term: $\Gamma^{*} = \Sigma^{*} \otimes \Sigma^{*}$
"Model complexity" K: built from operator-norm bounds on $\Sigma^{*}$ and on $(\Gamma^{*}_{SS})^{-1}$ (S the true edge set), together with the maximum node degree d.
Model selection consistency (Ravikumar et al, 08)
Assume the irrepresentable condition $\|\Gamma^{*}_{S^{c}S}(\Gamma^{*}_{SS})^{-1}\|_\infty \le 1 - \alpha$ holds for some $\alpha \in (0, 1]$, and either
1. X is sub-Gaussian and the effective sample size $n/(d^{2}\log p)$ is sufficiently large, or
2. X has finite 4m-th moments and n grows polynomially in p and d.
Then, with high probability as n tends to infinity, the correct model (edge set) is chosen.
Success prob's dependence on n and p (Gaussian)
[Figure: probability of success vs raw sample size n (left) and vs rescaled sample size n/log p (right) for several values of p; edge covariances $\Sigma^{*}_{ij} = 0.1$; each point is an average over 100 trials]
Curves stack up in the second plot, so that n/log p controls model selection.
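A sketch of the kind of simulation behind such curves: for a chain graph, estimate the probability of exact edge-set recovery as a function of n, averaged over repeated trials. The graph size, edge strength, penalty level, and sample-size grid are illustrative choices, not those of the original study.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def success_prob(p, n, n_trials=20, alpha=0.1, rng=np.random.default_rng(2)):
    """Fraction of trials in which the graphical lasso recovers the chain-graph support."""
    Theta = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))   # chain graph
    Sigma = np.linalg.inv(Theta)
    true_support = np.abs(Theta) > 1e-8
    hits = 0
    for _ in range(n_trials):
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        Theta_hat = GraphicalLasso(alpha=alpha).fit(X).precision_
        hits += np.array_equal(np.abs(Theta_hat) > 1e-8, true_support)
    return hits / n_trials

for n in (50, 100, 200, 400, 800):
    print(n, success_prob(p=30, n=n))
```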
Success prob's dependence on “model complexity” K and n
[Figure: chain graph with p = 120 nodes; curves from left to right have increasing values of K]
Models with larger K thus require more samples n for the same probability of success.
Back to the image-fMRI problem: linear sparse encoding model on complex “cells”
Gallant Lab's approach: separate linear model for each voxel, $Y = Xb + e$
Model fitting via e-L2boosting, with stopping by CV
– X: p = 10,921 dimensions (features or complex “cells”)
– n = 1750 training samples
Fitted model tested on 120 validation samples (not used in fitting)
Performance measured by correlation (cc)
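A minimal sketch of componentwise e-L2boosting (L2 boosting with a small step size eps): at each step the single feature that best fits the current residual is selected and its coefficient is nudged toward the least-squares fit; early stopping (here a fixed step count, in practice chosen by CV) yields a sparse linear model. The step size and step count are illustrative.

```python
import numpy as np

def e_l2boost(X, y, eps=0.01, n_steps=500):
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                                   # current residual
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_steps):
        corr = X.T @ r
        scores = corr ** 2 / col_sq                # RSS reduction of each single-feature fit
        j = np.argmax(scores)                      # best-fitting single feature
        step = eps * corr[j] / col_sq[j]           # shrunken least-squares update
        b[j] += step
        r -= step * X[:, j]
    return b
```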
Adding nonlinearity via Sparse Additive Models
Additive models (Hastie and Tibshirani, 1990):
$Y_i = \sum_{j=1}^{p} f_j(X_{ij}) + \epsilon_i, \quad i = 1, \dots, n$
Sparse: $f_j \equiv 0$ for most j
High dimensional: p >>> n
SpAM (Sparse Additive Models) by Ravikumar, Lafferty, Liu, Wasserman (2007)
Related work: COSSO, Lin and Zhang (2006)
Sparse Additive Models (SpAM) (Ravikumar, Lafferty, Liu and Wasserman, 07)
Sparse Backfitting
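A compact sketch of the SpAM sparse backfitting update (soft-thresholding the norm of a smoothed partial residual), with a simple Gaussian-kernel smoother standing in for the paper's generic nonparametric smoother; the bandwidth and penalty level are illustrative assumptions.

```python
import numpy as np

def kernel_smooth(x, r, h=0.3):
    """Nadaraya-Watson smoother of residuals r against a single covariate x."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (W @ r) / W.sum(axis=1)

def spam_backfit(X, y, lam, h=0.3, n_iter=20):
    n, p = X.shape
    F = np.zeros((n, p))                                 # fitted components f_j(X_ij)
    for _ in range(n_iter):
        for j in range(p):
            R = y - y.mean() - F.sum(axis=1) + F[:, j]   # partial residual for feature j
            P = kernel_smooth(X[:, j], R, h)             # smoothed partial residual
            s = np.sqrt(np.mean(P ** 2))                 # estimated component norm
            if s <= lam:
                F[:, j] = 0.0                            # component thresholded to zero
            else:
                F[:, j] = (1 - lam / s) * P              # shrink the whole component
                F[:, j] -= F[:, j].mean()                # center
    return F
```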