k-means and k-medians under dimension reduction


  1. k-means and k-medians under dimension reduction. Yury Makarychev, TTIC; Konstantin Makarychev, Northwestern; Ilya Razenshteyn, Microsoft Research. Simons Institute, November 2, 2018

  2. Euclidean k-means and k-medians. Given a set of points X in ℝ^m, partition X into k clusters C_1, …, C_k and find a "center" c_j for each C_j so as to minimize the cost: Σ_{j=1}^{k} Σ_{u∈C_j} ‖u − c_j‖ (k-median) or Σ_{j=1}^{k} Σ_{u∈C_j} ‖u − c_j‖² (k-means).
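
Both objectives fit one formula, differing only in the exponent on the distance. A minimal numpy sketch (the function and variable names here are illustrative, not from the talk):

```python
import numpy as np

def clustering_cost(X, labels, centers, p=2):
    # Sum of ||u - c_j||^p over all points u, where c_j is the center of
    # u's cluster: p=2 gives the k-means cost, p=1 the k-medians cost.
    dists = np.linalg.norm(X - centers[labels], axis=1)
    return np.sum(dists ** p)
```

For k-means the optimal center of a cluster is its centroid; for k-medians it is the geometric median, which has no closed form.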

  3. Dimension Reduction. A dimension reduction φ: ℝ^m → ℝ^d is a random map that preserves distances within a factor of 1 + ε with probability at least 1 − δ: (1/(1 + ε)) ‖u − v‖ ≤ ‖φ(u) − φ(v)‖ ≤ (1 + ε) ‖u − v‖. [Johnson–Lindenstrauss '84] There exists a random linear dimension reduction with d = O(log(1/δ)/ε²). [Larsen, Nelson '17] This dependence of d on ε and δ is optimal.
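
As a concrete illustration, here is a Gaussian random projection with the JL target dimension; the constant inside O(·) is unspecified in the lemma, so setting it to 1 below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

eps, delta, m = 0.2, 0.01, 10_000
d = int(np.ceil(np.log(1 / delta) / eps**2))   # d = O(log(1/delta)/eps^2), constant omitted

# Entries N(0, 1/d), so squared lengths are preserved in expectation.
phi = rng.normal(scale=1 / np.sqrt(d), size=(d, m))

u, v = rng.normal(size=m), rng.normal(size=m)
ratio = np.linalg.norm(phi @ (u - v)) / np.linalg.norm(u - v)
print(ratio)   # typically close to 1; JL puts it in [1/(1+eps), 1+eps] whp
```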

  4. Dimension Reduction. JL preserves all distances between points in X whp when d = Ω(log|X|/ε²). Numerous applications in computer science. Dimension reduction constructions: • [JL '84] Project onto a random d-dimensional subspace • [Indyk, Motwani '98] Apply a random Gaussian matrix • [Achlioptas '03] Apply a random matrix with ±1 entries • [Ailon, Chazelle '06] Fast JL transform
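
For instance, the Achlioptas construction needs only random signs; a hedged sketch (the 1/√d scaling is my choice of normalization, not from the slides):

```python
import numpy as np

def sign_jl_matrix(d, m, rng):
    # Random +/-1 entries scaled by 1/sqrt(d) preserve distances like a
    # Gaussian matrix, but are cheaper to generate and to multiply by.
    return rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)

phi = sign_jl_matrix(200, 10_000, np.random.default_rng(4))
```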

  5. k-means under dimension reduction [Boutsidis, Zouzias, Drineas '10]. Apply a dimension reduction φ to our dataset X; cluster φ(X) in dimension d.
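
In code, this pipeline is just "project, then cluster". The sketch below uses scikit-learn's KMeans purely for illustration (any k-means solver would do, and the sizes are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5000))   # n points in R^m
k, d = 10, 200                      # target dimension d << m

phi = rng.normal(scale=1 / np.sqrt(d), size=(d, X.shape[1]))
labels = KMeans(n_clusters=k, n_init=10).fit_predict(X @ phi.T)
# 'labels' is then used as a clustering of the original high-dimensional X.
```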

  6. k-means under dimension reduction. Want: the optimal clusterings of X and φ(X) have approximately the same cost. Even better: the cost of every clustering is approximately preserved. For what dimension d can we get this?

  7. k-means under dimension reduction
     Result                                   d                  distortion
     Folklore                                 ~log n / ε²        1 + ε
     Boutsidis, Zouzias, Drineas '10          ~k / ε²            2 + ε
     Cohen, Elder, Musco, Musco, Persu '15    ~k / ε²            1 + ε
                                              ~log k / ε²        9 + ε
     MMR '18                                  ~log(k/ε) / ε²     1 + ε
     Lower bound                              ~log k / ε²        1 + ε

  8. k-medians under dimension reduction
     Result                d                  distortion
     Prior work            —                  —
     Kirszbraun Thm ⇒      ~log n / ε²        1 + ε
     MMR '18               ~log(k/ε) / ε²     1 + ε
     Lower bound           ~log k / ε²        1 + ε

  9. Plan. k-means: • Challenges • Warm-up: d ~ log n / ε² • Special case: "distortions" are everywhere sparse • Remove outliers: the general case → the special case • Outliers. k-medians: • Overview of our approach

  10. Our result for k-means. Let X ⊂ ℝ^m and let φ: ℝ^m → ℝ^d be a random dimension reduction with d ≥ c log(k/(εδ)) / ε². Then with probability at least 1 − δ: (1 − ε) cost(𝒞) ≤ cost(φ(𝒞)) ≤ (1 + ε) cost(𝒞) for every clustering 𝒞 = (C_1, …, C_k) of X.
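
To get a feel for the bound, here is a toy evaluation (the absolute constant c is unspecified in the theorem; c = 1 below is a placeholder):

```python
import numpy as np

k, eps, delta, c = 100, 0.1, 0.01, 1.0
d = c * np.log(k / (eps * delta)) / eps**2
print(int(np.ceil(d)))   # ~1152: independent of n and of the ambient dimension m
```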

  11. Challenges. Let 𝒞* be the optimal k-means clustering. Easy: cost(𝒞*) ≈ cost(φ(𝒞*)) with probability 1 − δ. Hard: prove that there is no other clustering 𝒞′ s.t. cost(φ(𝒞′)) < (1 − ε) cost(𝒞*), since there are exponentially many clusterings 𝒞′ (so we can't use the union bound).

  12. Warm-up. Consider a clustering 𝒞 = (C_1, …, C_k). Write the cost in terms of pairwise distances: cost(𝒞) = Σ_{j=1}^{k} (1 / (2|C_j|)) Σ_{u,v∈C_j} ‖u − v‖². If all distances ‖u − v‖ are preserved within 1 + ε, then cost(𝒞) is preserved within 1 + ε. So it is sufficient to have d ~ log n / ε².
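
The identity behind the warm-up is the standard variance decomposition: the centroid cost of a cluster equals its sum of pairwise squared distances divided by 2|C_j|. A quick numerical check (illustrative code, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.normal(size=(100, 5))                       # one cluster C_j

centroid_cost = np.sum((C - C.mean(axis=0)) ** 2)   # sum_u ||u - centroid||^2
pairwise = np.sum((C[:, None, :] - C[None, :, :]) ** 2)
print(np.isclose(centroid_cost, pairwise / (2 * len(C))))   # True
```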

  13. Problem & Notation. Now assume that 𝒞 = (C_1, …, C_k) is a random clustering that depends on φ. Want to prove: cost(𝒞) ≈ cost(φ(𝒞)) whp. The distance between u and v is (1 + ε)-preserved or distorted depending on whether ‖φ(u) − φ(v)‖ ≈ ‖u − v‖ within a factor of 1 + ε. Think of δ = poly(1/k, ε) as sufficiently small.

  14. Distortion graph. Connect u and v with an edge if the distance between them is distorted. + Every edge is present with probability at most δ. − Edges are not independent. − 𝒞 depends on the set of edges. − There may be high-degree vertices. − All distances within a cluster may be distorted.
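
Concretely, the distortion graph can be read off from φ; a brute-force sketch (quadratic in n, for illustration only):

```python
import numpy as np

def distortion_graph(X, phi, eps):
    # Edge (i, j) iff the distance between X[i] and X[j] is NOT
    # (1+eps)-preserved by the map x -> phi @ x.
    edges = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            orig = np.linalg.norm(X[i] - X[j])
            proj = np.linalg.norm(phi @ (X[i] - X[j]))
            if not (orig / (1 + eps) <= proj <= (1 + eps) * orig):
                edges.append((i, j))
    return edges
```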

  15. Cost of a cluster. The cost of C_j is (1 / (2|C_j|)) Σ_{u,v∈C_j} ‖u − v‖². + Terms for non-edges (u, v) are (1 + ε)-preserved: ‖u − v‖ ≈ ‖φ(u) − φ(v)‖. − Need to prove that Σ_{u,v∈C_j, (u,v)∈E} ‖u − v‖² = Σ_{u,v∈C_j, (u,v)∈E} ‖φ(u) − φ(v)‖² ± ε′ cost(𝒞).

  16. Everywhere-sparse edges. Assume every u ∈ C_j is connected to at most a θ fraction of all v in C_j (where θ ≪ ε).

  17. Everywhere-sparse edges. + Terms for non-edges (u, v) are (1 + ε)-preserved. + The contribution of terms for edges is small: for an edge (u, v) and any w ∈ C_j, ‖u − v‖ ≤ ‖u − w‖ + ‖w − v‖, so ‖u − v‖² ≤ 2(‖u − w‖² + ‖w − v‖²) [using (a + b)² ≤ 2a² + 2b²].

  18. Everywhere-sparse edges. ‖u − v‖² ≤ 2(‖u − w‖² + ‖w − v‖²). • Replace the term for every edge with the two terms ‖u − w‖², ‖w − v‖² for a random w ∈ C_j. • Each term is used at most 2θ times, in expectation. Hence Σ_{u,v∈C_j, (u,v)∈E} ‖u − v‖² ≤ 4θ Σ_{u,v∈C_j} ‖u − v‖².

  19. Everywhere-sparse edges. Σ_{u,v∈C_j} ‖u − v‖² ≈ Σ_{u,v∈C_j, (u,v)∉E} ‖u − v‖² ≈ Σ_{u,v∈C_j, (u,v)∉E} ‖φ(u) − φ(v)‖² ≈ Σ_{u,v∈C_j} ‖φ(u) − φ(v)‖².

  20. Everywhere-sparse edges. The same chain of approximations handles every cluster, which proves the special case. But edges are not necessarily everywhere sparse!

  21. Outliers. Want: remove "outliers" so that, in the remaining set X′, edges are everywhere sparse in every cluster.

  22. (1 − θ) non-distorted core. Want: remove "outliers" so that, in the remaining set X′, edges are everywhere sparse in every cluster.

  23. (1 − θ) non-distorted core. Find a subset X′ ⊂ X (which depends on 𝒞) s.t. • Edges are sparse in the resulting clusters: every u ∈ C_j ∩ X′ is connected to at most a θ fraction of all v in C_j ∩ X′. • Outliers are rare: for every u, Pr(u ∉ X′) ≤ θ.

  24. All clusters are large. Assume all clusters have size ~n/k. Let θ = δ^{1/4}. Outliers = all vertices of degree at least ~θn/k. Every vertex has degree at most δn in expectation. By Markov, Pr(u is an outlier) ≤ δk/θ ≤ θ. We remove θn ≪ n/k vertices in total, so all clusters still have size ~n/k. This crucially uses that all clusters are large!
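
In the equal-size-clusters case, the outlier-removal step is just a degree cutoff in the distortion graph; a sketch under that assumption (the exact threshold constant is left open on the slide):

```python
import numpy as np

def non_outliers(edges, n, k, theta):
    # Keep vertices whose degree in the distortion graph is at most ~theta*n/k;
    # by Markov's inequality each vertex survives with probability >= 1 - theta.
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return [v for v in range(n) if deg[v] <= theta * n / k]
```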

  25. Main Combinatorial Lemma. Idea: assign "weights" to vertices so that all clusters have large weight. • There is a measure μ on X and a random set R s.t. μ(x) ≥ 1/|C_j ∖ R| for x ∈ C_j ∖ R (always) • μ(X) ≤ 4k³/θ² • Pr(x ∈ R) ≤ θ. All clusters C_j ∖ R are "large" w.r.t. the measure μ, so we can apply a variant of the previous argument.

  26. Edges incident on outliers. Need to take care of edges incident on outliers. Say, u is an outlier and v is not. Consider a fixed optimal clustering C*_1, …, C*_k of X. Let c* be the optimal center for u. (Diagram: points u, v and the center c*.)

  27. Edges incident on outliers. By the triangle inequality, ‖u − v‖ = ‖v − c*‖ ± ‖c* − u‖ and ‖φ(u) − φ(v)‖ = ‖φ(v) − φ(c*)‖ ± ‖φ(c*) − φ(u)‖. We may assume that the distances between non-outliers and the optimal centers are (1 + ε)-preserved, so ‖v − c*‖ ≈ ‖φ(v) − φ(c*)‖.

  28. Edges incident on outliers. With the same decomposition, the term ‖c* − u‖ is controlled in expectation: 𝔼[Σ_{u∉X′} ‖c*_u − u‖²] ≤ θ Σ_{u∈X} ‖c*_u − u‖² = θ · OPT (where c*_u denotes the optimal center serving u).

  29. Edges incident on outliers. Taking care of the remaining term ‖φ(c*) − φ(u)‖ is a bit more difficult. QED

  30. k-medians under dimension reduction

  31. k-medians. − There is no formula for the cost of the clustering in terms of pairwise distances. − Not obvious even when d ~ log n (when all pairwise distances are approximately preserved). [This case was asked about by Ravi Kannan in a tutorial @ Simons.] + The Kirszbraun Theorem ⇒ the d ~ log n case. + We prove a Robust Kirszbraun Theorem. Our methods for k-means + Robust Kirszbraun ⇒ d ~ log k for k-medians.
