Randomized algorithms for the approximation of matrices Luis Rademacher The Ohio State University Computer Science and Engineering (joint work with Amit Deshpande, Santosh Vempala, Grant Wang)
Two topics • Low-rank matrix approximation (PCA). • Subset selection: Approximate a matrix using another matrix whose columns lie in the span of a few columns of the original matrix.
Motivating example: DNA microarray
• [Drineas, Mahoney] Unsupervised feature selection for classification
– Data: table of gene expressions (features) vs. patients
– Categories: cancer types
– Feature selection criterion: leverage scores (importance of a given feature in determining the top principal components)
• Empirically: leverage scores are correlated with "information gain", a supervised measure of influence. Somewhat unexpected.
• Leads to clear separation (clusters) from the selected features.
In matrix form:
• A is an m × n matrix, m patients, n genes (features); find A ≈ CX, where the columns of C are a few columns of A (so X = C^+ A, with C^+ the pseudoinverse of C).
• They prove error bounds when the columns of C are selected at random according to leverage scores (importance sampling).
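A minimal numpy sketch of this setup (my illustration, not the authors' code): leverage scores of the columns computed from the top-k right singular vectors, then A ≈ CX with X = C^+ A for a sampled column set. The data sizes and the number of sampled columns below are arbitrary.

```python
import numpy as np

def leverage_scores(A, k):
    """Leverage score of each column of A w.r.t. its top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                       # k x n: top-k right singular vectors as rows
    return (Vk ** 2).sum(axis=0)         # length-n vector of scores; they sum to k

def column_subset_approx(A, cols):
    """Approximate A by C X with C = the chosen columns of A and X = pinv(C) A."""
    C = A[:, cols]
    return C @ (np.linalg.pinv(C) @ A)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))       # toy data: 50 patients x 200 genes
k = 5
p = leverage_scores(A, k)
cols = rng.choice(A.shape[1], size=2 * k, replace=False, p=p / p.sum())
err = np.linalg.norm(A - column_subset_approx(A, cols), "fro")
```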
(P1) Matrix approximation
• Given an m-by-n matrix A, find a low-rank approximation …
• … for some norm:
– ‖A‖_F² = Σ_{ij} A_{ij}² (Frobenius)
– ‖A‖_2 = σ_max(A) = max_x ‖Ax‖ / ‖x‖ (spectral)
Geometric view
• Given points in ℝ^n, find a subspace close to them.
• Error: the Frobenius norm corresponds to the sum of squared distances.
Classical solution
• Best rank-k approximation A_k in ‖·‖_F and ‖·‖_2:
– Top k terms of the singular value decomposition (SVD): if A = Σ_i σ_i u_i v_i^T then A_k = Σ_{i=1}^k σ_i u_i v_i^T
• Best k-dim. subspace: rowspan(A_k), i.e.
– span of the top k eigenvectors of A^T A.
• Leads to an iterative algorithm. Essentially, in time O(mn²).
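A minimal numpy sketch of this classical solution (truncated SVD; the matrix sizes are arbitrary):

```python
import numpy as np

def best_rank_k(A, k):
    """Classical best rank-k approximation A_k: keep the top k terms of the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 40))
A5 = best_rank_k(A, 5)
# A5 is optimal among all rank-5 matrices in both the Frobenius and the spectral norm.
```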
Want algorithm • With better error/time trade-off. • Efficient for very large data: – Nearly linear time – Pass efficient: if data does not fit in main memory, algorithm should not need random access, but only a few sequential passes. • Subspace equal to or contained in the span of a few rows (actual rows are more informative than arbitrary linear combinations).
Idea [Frieze Kannan Vempala]
• Sampling rows. Uniform sampling does not work (e.g., a matrix with a single non-zero entry).
• By "importance": sample s rows, each independently with probability proportional to squared length.
[FKV] Theorem 1. Let S be a sample of k/ε rows, where P(row i is picked) ∝ ‖A_i‖². Then the span of S contains the rows of a matrix Ã of rank k for which
E(‖A − Ã‖_F²) ≤ ‖A − A_k‖_F² + ε ‖A‖_F².
This can be turned into an efficient algorithm: 2 passes, complexity O(kmn/ε), doing the SVD in the span of S, which is fast (to compute Ã, n is replaced by k/ε).
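A minimal numpy sketch in the spirit of this scheme: sample rows by squared length, then do the SVD inside their span. The sample size s and matrix sizes below are illustrative, and the sketch does not reproduce the theorem's exact constants.

```python
import numpy as np

def length_squared_sample(A, s, rng):
    """Sample s row indices i.i.d. with P(i) proportional to ||A_i||^2."""
    p = (A ** 2).sum(axis=1)
    return rng.choice(A.shape[0], size=s, replace=True, p=p / p.sum())

def fkv_style_approx(A, k, s, rng):
    """Rank-k matrix whose row span lies inside the span of s length-squared-sampled rows."""
    rows = length_squared_sample(A, s, rng)
    Q, _ = np.linalg.qr(A[rows].T)       # n x s orthonormal basis of span of sampled rows
    B = A @ Q                            # m x s: coordinates of each row of A in that span
    U, sv, Vt = np.linalg.svd(B, full_matrices=False)
    # Best rank-k approximation of A restricted to the sampled span; cheap because s is small.
    return (U[:, :k] * sv[:k]) @ Vt[:k, :] @ Q.T

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 80))
A_tilde = fkv_style_approx(A, k=5, s=50, rng=rng)
```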
One drawback of [FKV]
• Additive error can be large (say, if the matrix is nearly low rank). Prefer relative error, something like
‖A − Ã‖_F² ≤ (1 + ε) ‖A − A_k‖_F².
Several ways:
• [Har-Peled '06] (first linear-time relative approximation)
• [Sarlos '06]: Random projection of the rows onto an O(k/ε)-dim. subspace. Then SVD.
• [Deshpande R Vempala Wang '06] [Deshpande Vempala '06]: Volume sampling (rough approximation) + adaptive sampling.
Some more relevant work
• [Papadimitriou Raghavan Tamaki Vempala '98]: Introduced random projection for matrix approximation.
• [Achlioptas McSherry '01] [Clarkson Woodruff '09]: One-pass algorithms.
• [Woolfe Liberty Rokhlin Tygert '08] [Rokhlin Szlam Tygert '09]: Random projection + power iteration to get very fast practical algorithms. Read the survey [Halko Martinsson Tropp '09].
• D'Aspremont, Drineas, Ipsen, Mahoney, Muthukrishnan, …
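A rough numpy sketch of the random projection + power iteration template surveyed in [Halko Martinsson Tropp '09]; the oversampling and iteration counts below are ad hoc illustrative choices, not the tuned ones from that literature.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, power_iters=2, rng=None):
    """Randomized SVD sketch: random projection plus a few power iterations."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    d = k + oversample
    Y = A @ rng.standard_normal((n, d))       # sketch the column space of A
    for _ in range(power_iters):              # power iterations sharpen the sketch
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                    # m x d orthonormal basis of the sketch
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U_small[:, :k], s[:k], Vt[:k, :]
```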
(P2) Algorithmic problems: volume sampling and subset selection
• Given an m-by-n matrix, pick a set of k rows at random with probability proportional to the squared volume of the k-simplex spanned by them and the origin [DRVW] (equivalently, the squared volume of the parallelepiped determined by them).
Volume sampling
• Let S be a k-subset of rows of A
– [k! vol(conv(0, A_S))]² = vol*(A_S)² = det(A_S A_S^T)   (*)
(here vol*(A_S) is the volume of the parallelepiped spanned by the rows of A_S)
– Volume sampling for A is equivalent to: pick the k-by-k principal minor "S,S" of AA^T with probability proportional to det(A_S A_S^T).
– For (*): complete A_S to a square matrix B by adding orthonormal rows, orthogonal to span(A_S). Then
vol*(A_S)² = (det B)² = det(BB^T) = det( [ A_S A_S^T  0 ; 0  I ] ) = det(A_S A_S^T)
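A small numpy check of the completion argument for (*); the sizes and the subset S are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 8, 5, 3
A = rng.standard_normal((m, n))
S = [0, 3, 6]                          # any k-subset of rows
A_S = A[S]                             # k x n

# Complete A_S to a square matrix B by adding n-k orthonormal rows
# orthogonal to span(A_S), as in the argument for (*).
_, _, Vt = np.linalg.svd(A_S)          # full SVD: Vt is n x n
B = np.vstack([A_S, Vt[k:]])           # n x n

lhs = np.linalg.det(B) ** 2
rhs = np.linalg.det(A_S @ A_S.T)
print(np.isclose(lhs, rhs))            # True: vol*(A_S)^2 = det(A_S A_S^T)
```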
Original motivation:
• Relative error low-rank matrix approximation [DRVW]:
– S: k-subset of rows according to volume sampling
– A_k: best rank-k approximation, given by principal components (SVD)
– π_S: projection of the rows onto rowspan(A_S)
– E_S(‖A − π_S(A)‖_F²) ≤ (k + 1) ‖A − A_k‖_F²
• Factor "k+1" is best possible [DRVW].
• Interesting existential result (there exist k rows…). Algorithm?
• Led to a linear-time, pass-efficient algorithm for (1+ε) relative approximation of A_k in the span of O*(k/ε) rows [DV].
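For concreteness, a numpy sketch of π_S and of the error ratio that the bound above controls in expectation; a single, arbitrarily chosen S as below carries no guarantee.

```python
import numpy as np

def proj_onto_rowspan(A, S):
    """pi_S(A): project every row of A onto the row span of A_S."""
    Q, _ = np.linalg.qr(A[S].T)          # n x |S| orthonormal basis of rowspan(A_S)
    return A @ Q @ Q.T

def frob_err_sq(A, S):
    return np.linalg.norm(A - proj_onto_rowspan(A, S), "fro") ** 2

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 10))
k, S = 3, [0, 1, 2]
s = np.linalg.svd(A, compute_uv=False)
best_k_err_sq = (s[k:] ** 2).sum()       # ||A - A_k||_F^2
ratio = frob_err_sq(A, S) / best_k_err_sq
# Under volume sampling of S, E[ratio] <= k + 1.
```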
Where does volume sampling come from? • No self-respecting architect leaves the scaffolding in place after completing the building. Gauss?
Where does volume sampling come from?
• Idea:
– For picking k out of k + 1 points, the k with maximum volume are optimal.
– For picking 1 out of m, random according to squared length is better than maximum length.
– For k out of m, this suggests volume sampling.
Where does volume sampling come from?
• Why does the algebra work? Idea:
– When picking 1 out of m rows at random according to squared length, the expected error is a sum of squares of areas of triangles:
E[error] = Σ_s ( ‖A_s‖² / Σ_t ‖A_t‖² ) · Σ_i d(A_i, span(A_s))²
– This sum corresponds to a certain coefficient of the characteristic polynomial of AA^T.
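A small numeric check of this identity on a toy matrix, written with squared parallelogram areas (four times the squared triangle areas): the expected error equals the sum over pairs of squared parallelogram areas divided by ‖A‖_F².

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))
m = A.shape[0]
row_norms_sq = (A ** 2).sum(axis=1)
fro_sq = row_norms_sq.sum()

def dist_sq_to_line(x, v):
    """Squared distance from x to span{v}."""
    return (x @ x) - (x @ v) ** 2 / (v @ v)

# Expected error when picking one row s with P(s) proportional to ||A_s||^2.
expected_err = sum(
    (row_norms_sq[s] / fro_sq) * sum(dist_sq_to_line(A[i], A[s]) for i in range(m))
    for s in range(m)
)

def par_area_sq(u, v):
    """Squared area of the parallelogram spanned by u and v (Gram determinant)."""
    return (u @ u) * (v @ v) - (u @ v) ** 2

pair_sum = 2 * sum(par_area_sq(A[s], A[t]) for s, t in combinations(range(m), 2)) / fro_sq
print(np.isclose(expected_err, pair_sum))     # True
```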
Later motivation [BDM, …]
• (Row/column) subset selection. A refinement of principal component analysis: given a matrix A,
– PCA: find the k-dim subspace V that minimizes ‖A − π_V(A)‖_F²
– Subset selection: find V spanned by k rows of A.
• Seemingly harder, combinatorial flavor. (π_V projects the rows onto V.)
Why subset selection?
• PCA unsatisfactory:
– top components are linear combinations of rows (all rows, generically). Many applications prefer individual, most relevant rows, e.g.:
• feature selection in machine learning
• linear regression using only the most relevant independent variables
• out of thousands of genes, find a few that explain a disease
Known results
• [Deshpande-Vempala] Polytime k!-approximation to volume sampling, by adaptive sampling (sketched in code below):
– pick a row with probability proportional to squared length
– project all rows orthogonal to it
– repeat
• Implies, for a random k-subset S with that distribution:
E_S(‖A − π_S(A)‖_F²) ≤ (k + 1)! ‖A − A_k‖_F²
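A minimal numpy rendering of the three adaptive-sampling steps above (my own sketch under the stated description, not the authors' code):

```python
import numpy as np

def adaptive_sample(A, k, rng):
    """Adaptive sampling: pick a row with P proportional to squared residual length,
    project all rows orthogonal to it, repeat k times."""
    E = A.astype(float).copy()            # residual matrix
    picked = []
    for _ in range(k):
        p = (E ** 2).sum(axis=1)
        i = rng.choice(len(p), p=p / p.sum())
        picked.append(i)
        v = E[i] / np.linalg.norm(E[i])   # direction of the chosen residual row
        E = E - np.outer(E @ v, v)        # project every row orthogonal to v
    return picked

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 12))
S = adaptive_sample(A, 3, rng)
```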
Known results
• [Boutsidis, Drineas, Mahoney] Polytime randomized algorithm to find a k-subset S with
‖A − π_S(A)‖_F² ≤ O(k² log k) ‖A − A_k‖_F²
• [Gu-Eisenstat] Deterministic algorithm:
‖A − π_S(A)‖_2² ≤ (1 + f² k(n − k)) ‖A − A_k‖_2²
in time O((m + n log_f n) n²).
• Spectral norm: ‖A‖_2 = sup_{x ∈ ℝ^n} ‖Ax‖ / ‖x‖
Known results
• Remember, volume sampling is equivalent to sampling a k-by-k principal minor "S,S" of AA^T with probability proportional to det(A_S A_S^T)   (*)
• [Goreinov, Tyrtyshnikov, Zamarashkin] Maximizing (*) over S is good for subset selection.
• [Çivril, Magdon-Ismail] [see also Koutis '06] But maximizing is NP-hard, even approximately to within an exponential factor.
Results
• Volume sampling: polytime exact algorithm, O(mn log n) arithmetic operations (some ideas appear earlier in [Hough Krishnapur Peres Virág]).
• Implies an algorithm with the optimal approximation factor for subset selection under the Frobenius norm. Can be derandomized by the method of conditional expectations, in O(mn log n).
• (1+ε)-approximations to the previous two algorithms in nearly linear time, using a volume-preserving random projection [M Z].
Results
• Observation: a bound in Frobenius norm easily implies a bound in spectral norm:
‖A − π_S(A)‖_2² ≤ ‖A − π_S(A)‖_F² ≤ (k + 1) ‖A − A_k‖_F² ≤ (k + 1)(n − k) ‖A − A_k‖_2²
using ‖A‖_2² = σ_max², ‖A‖_F² = Σ_i σ_i², where σ_max = σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_n ≥ 0 are the singular values of A.
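The first and last steps of this chain hold for any matrix and any S; only the middle (k+1) step needs volume sampling, in expectation. A quick numeric confirmation of the two unconditional steps on toy matrices:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((20, 8))
n, k = A.shape[1], 3
s = np.linalg.svd(A, compute_uv=False)     # s[0] >= ... >= s[n-1] >= 0

# ||A - A_k||_F^2 is the sum of the n-k squared tail singular values,
# each at most s[k]^2 = ||A - A_k||_2^2, which gives the (n-k) factor.
print((s[k:] ** 2).sum() <= (n - k) * s[k] ** 2)                            # True

# For any matrix M, ||M||_2^2 <= ||M||_F^2 (top singular value vs. all of them).
M = rng.standard_normal((7, 5))
print(np.linalg.norm(M, 2) ** 2 <= np.linalg.norm(M, "fro") ** 2 + 1e-12)   # True
```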
Comparison for subset selection
Find S s.t. ‖A − π_S(A)‖² ≤ ? · ‖A − A_k‖² (Frobenius and spectral norms, both squared). Time assumes m > n; ω: exponent of matrix multiplication; D = deterministic, R = randomized.
• [D R V W]: Frobenius k+1; existential.
• [Deshpande Vempala]: Frobenius (k+1)!; time kmn; R.
• [Gu Eisenstat]: spectral 1 + k(n−k); existential.
• [Gu Eisenstat]: spectral 1 + f²k(n−k); time (m + n log_f n)n²; D.
• [Boutsidis Drineas Mahoney]: Frobenius k² log k; spectral k²(n−k) log k (F implies spectral); time mn²; R.
• [Deshpande R]: Frobenius k+1 (optimal); spectral (k+1)(n−k); time kmn log n; D.
• [Deshpande R]: Frobenius (1+ε)(k+1); spectral (1+ε)(k+1)(n−k); time O*(mnk²/ε² + mk^{2ω+1}/ε^{2ω}); R.
Proofs: volume sampling
• Want (w.l.o.g.) a k-tuple S of rows of the m-by-n matrix A with probability
det(A_S A_S^T) / Σ_{S' ∈ [m]^k} det(A_{S'} A_{S'}^T)
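A brute-force numpy illustration of this target distribution, over unordered subsets (tuples with repeated rows have determinant zero, so the two views agree up to relabeling). This is exponential time and only meant to define the distribution, not the polynomial-time algorithm discussed above.

```python
import numpy as np
from itertools import combinations

def brute_force_volume_sample(A, k, rng):
    """Sample a k-subset S of rows with P(S) proportional to det(A_S A_S^T)."""
    subsets = list(combinations(range(A.shape[0]), k))
    weights = np.array([np.linalg.det(A[list(S)] @ A[list(S)].T) for S in subsets])
    probs = weights / weights.sum()
    return subsets[rng.choice(len(subsets), p=probs)]

rng = np.random.default_rng(7)
A = rng.standard_normal((10, 6))
print(brute_force_volume_sample(A, 3, rng))
```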