Stay on path: PCA along graph paths
Megasthenis Asteris, Anastasios Kyrillidis, Alexandros Dimakis (Electrical and Computer Engineering)
Han-Gyol Yi, Bharath Chandrasekaran (Communication Sciences and Disorders)
Sparse PCA
• Data: $n$ observations / datapoints $y_1, \dots, y_n$ on $p$ variables.
• Goal: find a new variable (feature) that captures most of the variance.
• Empirical cov. matrix: $\widehat{\Sigma} = \frac{1}{n} \cdot \sum_{i=1}^{n} y_i y_i^\top$.
• Direction of maximum variance: maximize $x^\top \widehat{\Sigma} x$ subject to $\|x\|_2 = 1$.
• Sparse direction of maximum variance: additionally require $\|x\|_0 = k$. NP-Hard.
[Figure: datapoints on axes $x_1$, $x_2$ with the direction of maximum variance.]
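The quantities on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names and the toy data are assumptions, the covariance is the uncentered form from the slide (so the data are assumed zero-mean), and the exhaustive search over supports is only viable for tiny $p$ because the sparse problem is NP-hard.

```python
import itertools
import numpy as np

def empirical_covariance(Y):
    """Y has shape (n, p), one observation per row. Returns (1/n) * sum_i y_i y_i^T."""
    n = Y.shape[0]
    return (Y.T @ Y) / n

def leading_direction(S):
    """Unit-norm maximizer of x^T S x for symmetric S, with the attained value."""
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues come back in ascending order
    return eigvecs[:, -1], eigvals[-1]

def sparse_direction_bruteforce(S, k):
    """Exhaustive k-sparse PCA: try every support of size k (toy sizes only)."""
    p = S.shape[0]
    best_val, best_x = -np.inf, None
    for support in itertools.combinations(range(p), k):
        idx = list(support)
        v, val = leading_direction(S[np.ix_(idx, idx)])
        if val > best_val:
            best_val = val
            best_x = np.zeros(p)
            best_x[idx] = v
    return best_x, best_val

# Toy data (hypothetical): variables 0 and 1 share a strong common component.
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 6))
Y[:, :2] += 3.0 * rng.standard_normal((200, 1))

S = empirical_covariance(Y)
x_dense, var_dense = leading_direction(S)
x_sparse, var_sparse = sparse_direction_bruteforce(S, k=2)
print("support of sparse direction:", np.flatnonzero(x_sparse))
```

The sparse direction can never explain more variance than the dense one, since it maximizes the same objective over a smaller feasible set.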
Sparse PCA
Why sparsity?
• [ Engineer ] Extracted feature is more interpretable; it depends on only a few original variables.
• [ Statistician ] Recovery of "true" PC in high dimensions; # observations << # variables.
More structure…?
• More interpretable.
• Better sample complexity.
• E.g. wavelets of natural images, block structures, periodical neuronal spikes, … [Baraniuk et al., 2008; Kyrillidis et al., 2014; Friedman et al., 2010, …]
• Structured sparse PCA [Jenatton et al., 2010]
  - Sparsity-inducing norm
  - 2D grid, rectangular nonzero patterns
[ PCA On Graph Paths ]
Problem Definition
• Structure captured by an underlying graph: directed, acyclic, with a source $S$ and a target $T$; the variables $x_1, \dots, x_p$ sit on its vertices.
• Active variables on an $S \leadsto T$ path.
• Graph Path PCA.
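To make the feasible set concrete, here is a minimal sketch that solves the path-constrained problem by brute force on a toy DAG: enumerate every S-to-T path and keep the one whose principal submatrix of the covariance has the largest top eigenvalue. The toy graph, the covariance, and the names `st_paths` and `graph_path_pca_bruteforce` are illustrative assumptions rather than anything from the slides. The sketch lets the active variables be any subset of a path's vertices, and the number of paths can grow exponentially, so this is only for small examples.

```python
import numpy as np

def st_paths(adj, s, t):
    """Yield every s -> t path of a DAG given as {vertex: [successors]}."""
    stack = [(s, [s])]
    while stack:
        v, path = stack.pop()
        if v == t:
            yield path
            continue
        for nxt in adj.get(v, []):
            stack.append((nxt, path + [nxt]))

def graph_path_pca_bruteforce(S, adj, s, t):
    """Maximize x^T S x over unit-norm x supported on some s -> t path."""
    best_val, best_x, best_path = -np.inf, None, None
    for path in st_paths(adj, s, t):
        vals, vecs = np.linalg.eigh(S[np.ix_(path, path)])
        if vals[-1] > best_val:
            best_val, best_path = vals[-1], path
            best_x = np.zeros(S.shape[0])
            best_x[path] = vecs[:, -1]
    return best_x, best_val, best_path

# Toy DAG (hypothetical): vertex 0 is S, vertex 5 is T, two layers {1, 2} and {3, 4}.
adj = {0: [1, 2], 1: [3, 4], 2: [3, 4], 3: [5], 4: [5]}

# Covariance with a spike supported on the path 0 -> 2 -> 3 -> 5.
x_true = np.zeros(6)
x_true[[0, 2, 3, 5]] = 0.5
S = np.eye(6) + 2.0 * np.outer(x_true, x_true)

x_hat, val, path = graph_path_pca_bruteforce(S, adj, s=0, t=5)
print("selected path:", path)
```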
Motivation 1: Neuroscience
- Variables: "voxels" (points in the brain)
- Measurements: blood-oxygen levels
[Figure: graph with source S and target T.]
Motivation 2: Finance
- Variables: stocks divided in sectors
- Measurements: prices over time
- Goal: find subset that explains variance, 1 stock/sector
- Example sectors: BANKS (Chase, BofA, UBS), ENERGY (Chevron, Shell)
[Figure: sectors placed between source S and target T.]
[ Statistical Analysis ]
Data model
• $(p, k, d)$-layer graph $G$: source vertex $S$, target vertex $T$, and $k$ layers of $\frac{p-2}{k}$ vertices each; in & out degree $= d$.
• Spike along a path: samples $y_i = \sqrt{\beta} \cdot u_i \cdot x_\star + z_i$, where the signal $x_\star$ is supported on a path of $G$ and $z_i$ is Gaussian noise (i.i.d.).
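A minimal sketch of this data model follows. Several details are assumptions made for illustration and are not stated on the slide: the vertex numbering, the circular wiring between consecutive layers, equal-magnitude signal entries on the path (including S and T), standard normal $u_i$, and unit-variance noise. The names `layer_graph`, `random_st_path`, and `sample_spiked` are hypothetical.

```python
import numpy as np

def layer_graph(p, k, d):
    """Successor lists for a (p, k, d)-layer graph on vertices 0..p-1.

    Vertex 0 is S and vertex p-1 is T; the rest form k layers of m = (p-2)/k
    vertices. Assumed wiring: vertex j of a layer points to vertices
    j, j+1, ..., j+d-1 (mod m) of the next layer, so vertices between
    consecutive layers have out-degree d and in-degree d. S points to the
    whole first layer and the whole last layer points to T.
    """
    if (p - 2) % k != 0:
        raise ValueError("p - 2 must be divisible by k")
    m = (p - 2) // k
    if d > m:
        raise ValueError("d cannot exceed the layer size")

    def layer(l):
        return [1 + l * m + j for j in range(m)]

    adj = {0: layer(0), p - 1: []}
    for l in range(k):
        for j, v in enumerate(layer(l)):
            if l == k - 1:
                adj[v] = [p - 1]
            else:
                nxt = layer(l + 1)
                adj[v] = [nxt[(j + r) % m] for r in range(d)]
    return adj

def random_st_path(adj, s, t, rng):
    """Random walk from s until t is reached (always terminates on this graph)."""
    path = [s]
    while path[-1] != t:
        succ = adj[path[-1]]
        path.append(succ[rng.integers(len(succ))])
    return path

def sample_spiked(n, x_star, beta, rng):
    """Rows are y_i = sqrt(beta) * u_i * x_star + z_i with u_i ~ N(0,1), z_i ~ N(0,I)."""
    u = rng.standard_normal((n, 1))
    Z = rng.standard_normal((n, x_star.size))
    return np.sqrt(beta) * u * x_star + Z

p, k, d = 42, 4, 3  # toy sizes, chosen so that (p - 2) / k is an integer
rng = np.random.default_rng(0)
adj = layer_graph(p, k, d)
path = random_st_path(adj, 0, p - 1, rng)
x_star = np.zeros(p)
x_star[path] = 1.0 / np.sqrt(len(path))  # unit-norm signal on the path
Y = sample_spiked(500, x_star, beta=2.0, rng=rng)
```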
Bounds
[ Theorem 1 ]
• $G$: $(p, k, d)$-layer graph (known).
• $x_\star$ (unknown): signal support on st-path of $G$.
• Observe sequence $y_1, \dots, y_n$ of i.i.d. samples from $N(0, \beta \cdot x_\star x_\star^\top + I)$; $\widehat{\Sigma} \to \hat{x}$ (NP-HARD).
• Then $n = O\left(\log\frac{p}{k} + k \log d\right)$ samples suffice for recovery, vs $\Omega\left(k \log\frac{p}{k}\right)$ for sparse PCA.
[ Theorem 2 ] That many samples are also necessary.
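To get a feel for the two rates, the expressions inside $O(\cdot)$ and $\Omega(\cdot)$ can be evaluated directly. This is only a rough sketch: the hidden constants are dropped, so the numbers are not sample counts and only their relative scaling is meaningful. The function names are hypothetical, and $p = 1000$, $k = 50$ are the sizes used in the synthetic experiment.

```python
import math

def graph_path_rate(p, k, d):
    """log(p/k) + k*log(d): the expression inside O(.) for Graph Path PCA."""
    return math.log(p / k) + k * math.log(d)

def sparse_pca_rate(p, k):
    """k*log(p/k): the expression inside Omega(.) for sparse PCA."""
    return k * math.log(p / k)

p, k = 1000, 50
for d in (2, 5, 10, 20):
    print(f"d={d:2d}  graph path: {graph_path_rate(p, k, d):7.1f}"
          f"  sparse PCA: {sparse_pca_rate(p, k):7.1f}")
```

The gap is largest for small $d$ and closes as $d$ approaches $p/k$, which fits the intuition that a graph with densely connected layers carries little side information.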
Algorithms
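Algorithm 1
A Power Method-based approach: power iteration with projection step.
• Input: init $x_0$; $i \leftarrow 0$
• $w_i \leftarrow \widehat{\Sigma} x_i$
• $x_{i+1} \leftarrow$ projection of $w_i$ on $X(G)$ (projection step)
• End? If not, $i \leftarrow i + 1$ and repeat; otherwise output $\hat{x} \leftarrow x_{i+1}$.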
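The loop can be sketched generically, with the projection passed in as a function. This is a minimal sketch, not the authors' implementation: the stopping rule, the tolerance, and the names are assumptions. For the demo the projection is plain normalization onto the unit sphere, which reduces the scheme to the ordinary power method; a graph-aware projection onto $X(G)$ would be swapped in for `project`.

```python
import numpy as np

def projected_power_method(S, project, x0, max_iter=500, tol=1e-10):
    """Iterate w <- S x, x <- project(w) until the iterates stop moving."""
    x = project(x0)
    for _ in range(max_iter):
        w = S @ x
        x_next = project(w)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x

def project_unit_sphere(w):
    """Stand-in projection: Euclidean projection onto the unit sphere."""
    return w / np.linalg.norm(w)

# Toy covariance (hypothetical): identity plus a rank-one spike along v_top.
rng = np.random.default_rng(0)
v_top = rng.standard_normal(8)
v_top /= np.linalg.norm(v_top)
S = np.eye(8) + 4.0 * np.outer(v_top, v_top)

x_hat = projected_power_method(S, project_unit_sphere, x0=np.ones(8))
print("alignment with the spike:", abs(x_hat @ v_top))
```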
[ Projection Step ]
Project a p-dimensional $w$ on $X(G)$: $\arg\min_{x \in X(G)} \|x - w\|_2$.
• The objective is rewritten in two steps: first due to the constraints, then due to Cauchy-Schwarz.
• Result: longest (weighted) path problem on $G$, with $G$ acyclic; special weights!
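One way to read the two annotations, assuming $X(G)$ contains the unit-norm vectors whose support lies on an S-to-T path: because $\|x\|_2 = 1$, minimizing $\|x - w\|_2$ is the same as maximizing $w^\top x$; by Cauchy-Schwarz, for a fixed path the best $x$ is $w$ restricted to that path and normalized, and the value attained is the norm of that restriction. The best path therefore maximizes the sum of $w_v^2$ over its vertices, a longest-path problem with vertex weights $w_v^2$ that dynamic programming solves on a DAG. The sketch below follows that reading; the function name, the toy graph, and the requirement that a topological order is supplied are assumptions, and it presumes that $T$ is reachable from $S$ and that $w$ is not identically zero on the selected path.

```python
import numpy as np

def project_on_paths(w, adj, topo_order, s, t):
    """Unit-norm x supported on the s -> t path that maximizes the sum of w_v^2."""
    score = {v: -np.inf for v in topo_order}
    parent = {}
    score[s] = w[s] ** 2
    # Longest-path dynamic program: visit vertices in topological order.
    for v in topo_order:
        if score[v] == -np.inf:
            continue  # not reachable from s
        for nxt in adj.get(v, []):
            cand = score[v] + w[nxt] ** 2
            if cand > score[nxt]:
                score[nxt] = cand
                parent[nxt] = v
    # Walk back from t to recover the best path.
    path = [t]
    while path[-1] != s:
        path.append(parent[path[-1]])
    path.reverse()
    # Keep w on the path, zero elsewhere, then normalize.
    x = np.zeros(len(w), dtype=float)
    x[path] = w[path]
    return x / np.linalg.norm(x), path

# Toy DAG (hypothetical): vertex 0 is S, vertex 5 is T, two layers {1, 2} and {3, 4}.
adj = {0: [1, 2], 1: [3, 4], 2: [3, 4], 3: [5], 4: [5]}
w = np.array([0.1, 0.2, -0.9, 0.5, -0.6, 0.3])

x_proj, best_path = project_on_paths(w, adj, topo_order=[0, 1, 2, 3, 4, 5], s=0, t=5)
print("path:", best_path)
```

Squaring the entries makes the weights nonnegative and independent of sign, while the signs of $w$ are preserved in the returned vector.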
[ Experiments ]
Synthetic
Data generated according to the (p,k,d)-layer graph model (p=1000, k=50, d=10, 100 MC iterations).
[Figure: estimation error $\|\hat{x}\hat{x}^\top - x_\star x_\star^\top\|_F$ (axis from 0 to 1.4) versus samples $n$ (1000 to 5000), for Trunc. Power M., Span. k-sparse, Graph Power M., and Low-D Sampling.]
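The error on the vertical axis compares rank-one projectors rather than the vectors themselves, so it does not depend on the sign of the estimate. A minimal sketch of the metric follows; the function name is hypothetical. For unit-norm vectors it equals $\sqrt{2\,(1 - (\hat{x}^\top x_\star)^2)}$, so it ranges from 0 (perfect recovery up to sign) to $\sqrt{2}$ (orthogonal estimate).

```python
import numpy as np

def projector_distance(x_hat, x_star):
    """Frobenius norm of x_hat x_hat^T - x_star x_star^T."""
    diff = np.outer(x_hat, x_hat) - np.outer(x_star, x_star)
    return np.linalg.norm(diff, ord="fro")

e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
mixed = (e1 + e2) / np.sqrt(2.0)

print(projector_distance(e1, -e1))    # sign flip: no error
print(projector_distance(e1, e2))     # orthogonal: largest possible error
print(projector_distance(mixed, e1))  # partial overlap
```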
Neuroscience
• Resting state fMRI dataset.*
• 111 regions of interest (ROIs) (variables), extracted based on Harvard-Oxford Atlas [Desikan et al., 2006].
• Graph extracted based on Euclidean distances between center of mass of ROIs.
Identified core neural components of the brain's memory network.
*[Human Connectome Project, WU-Minn Consortium]