STOC 2015
Learning Arbitrary Statistical Mixtures of Discrete Distributions
Jian Li (Tsinghua), Yuval Rabani (HUJI), Leonard J. Schulman (Caltech), Chaitanya Swamy (Waterloo)
lijian83@mail.tsinghua.edu.cn
Outline
- Problem Definition
- Related Work
- Our Results
- The Coin Problem
- Higher Dimension
- Conclusion
Problem Definition
- Δ_n = { x ∈ ℝ^n_+ : ||x||_1 = 1 }, so each point in Δ_n is a prob. distr. over [n]
- θ is a prob. distr. over Δ_n (unknown to us): a mixture of discrete distributions
- Goal: learn θ, i.e., output θ̂ whose transportation distance (in L1) to θ is at most ε: Tran_1(θ, θ̂) ≤ ε
- A k-snapshot sample (k: snapshot #):
  - Take a sample point x ∼ θ (x ∈ Δ_n); we don't get to observe x directly
  - Take k i.i.d. samples s_1, s_2, …, s_k from x; we observe s_1, s_2, …, s_k, called a k-snapshot sample
- Questions: How large does the snapshot # k need to be in order to learn θ? How many k-snapshot samples do we need to learn θ?
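To make the sampling model concrete, here is a minimal simulation sketch (my own illustration with a hypothetical 2-spike mixture, not anything from the paper) of drawing k-snapshot samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-spike mixture over Delta_n with n = 4:
# each row of `spikes` is a distribution over [n]; `weights` are the mixing probabilities.
spikes = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 0.7]])
weights = np.array([0.5, 0.5])

def k_snapshot(k):
    """Draw one k-snapshot sample: pick x ~ theta, then take k i.i.d. draws from x."""
    x = spikes[rng.choice(len(weights), p=weights)]  # the hidden distribution x is never observed
    return rng.choice(len(x), size=k, p=x)           # we only observe s_1, ..., s_k

print([k_snapshot(3) for _ in range(5)])  # five 3-snapshot samples, each a length-3 array of letters
```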
Related Work
- Mixtures of Gaussians: a large body of work; goal is to learn the parameters
  - Only need 1-snapshot samples there; k-snapshot (k > 1) samples are necessary for mixtures of discrete distributions
- Topic models: θ is a mixture of topics (each topic is a distribution over words)
  - How a document is generated: sample a topic x ∼ θ (x ∈ Δ_n), then use x to generate a document of size k (a document is a k-snapshot sample)
  - Various assumptions: LSI / separability [Papadimitriou, Raghavan, Tamaki, Vempala '00]; LDA [Blei, Ng, Jordan '03]; anchor words [Arora, Ge, Moitra '12] (snapshot # = 2); linearly independent topics [Anandkumar, Foster, Hsu, Kakade, Liu '12] (snapshot # = O(1)); several others
- Collaborative filtering: L1 condition number [Kleinberg, Sandler '08]
Transportation Distance
- Also known as the earth mover's distance, Kantorovich-Rubinstein distance, or Wasserstein distance
- Tran_1(P, Q): a distance between two probability distributions P and Q
- If we want to turn P into Q, it is the cost of the optimal transportation plan T, i.e., ∫ ||x − T(x)||_1 dP
- E.g., in the discrete case, it is the optimum of the following LP:
  min Σ_{i,j} z_{ij} ||x_i − x_j||_1  s.t.  Σ_j z_{ij} = P(x_i) for all i,  Σ_i z_{ij} = Q(x_j) for all j,  z ≥ 0
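As a concrete (and entirely standard) illustration of the discrete LP above, a small solver sketch using scipy.optimize.linprog; the points, weights, and the L1 ground metric here are hypothetical examples:

```python
import numpy as np
from scipy.optimize import linprog

def tran1(points_p, p, points_q, q):
    """Transportation (earth mover's) distance between discrete distributions
    p (on points_p) and q (on points_q), with L1 ground metric."""
    m, n = len(p), len(q)
    # cost[i, j] = ||x_i - y_j||_1
    cost = np.abs(points_p[:, None, :] - points_q[None, :, :]).sum(axis=2)
    # Equality constraints: row sums of the plan equal p, column sums equal q.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

points_p = np.array([[0.0], [1.0]])     # two points on the line
points_q = np.array([[0.25], [0.75]])
print(tran1(points_p, np.array([0.5, 0.5]), points_q, np.array([0.5, 0.5])))  # 0.25
```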
Our Results
The coin problem: 1 dimension
- A mixture θ defined over [0,1]
- If the mixture θ is a k-spike distribution (k different coins), k-snapshot (k > 1) samples are required
- E.g., the following mixtures all look identical from 1-snapshot samples (each produces heads with overall probability 1/2):
  - (H 0, T 1) w.p. 0.5 and (H 1, T 0) w.p. 0.5
  - (H 0.1, T 0.9) w.p. 0.5 and (H 0.9, T 0.1) w.p. 0.5
  - ……
  - (H 0.5, T 0.5) w.p. 1
Our Results
The coin problem: 1 dimension
- A mixture θ defined over [0,1]; θ is a k-spike distribution
- Lower bound [Rabani, Schulman, Swamy '14]: to guarantee Tran_1(θ, θ̂) ≤ O(1/k),
  (1) (2k−1)-snapshot samples are necessary, and
  (2) exp(Ω(k)) (2k−1)-snapshot samples are needed
- Our result: a nearly matching upper bound: (k/ε)^{O(k)} log(1/δ) (2k−1)-snapshot samples suffice to achieve Tran_1(θ, θ̂) ≤ ε (w.p. 1 − δ)
Our Results
The coin problem: 1 dimension
- A mixture θ over [0,1]; θ is arbitrary (may even be continuous)
- Lower bound [Rabani, Schulman, Swamy '14]: still applies (rewritten slightly):
  - We can use K-snapshot samples
  - We need exp(Ω(K)) K-snapshot samples to make Tran_1(θ, θ̂) ≤ O(1/K)
- Our result: a nearly matching upper bound: using exp(O(K)) K-snapshot samples, we can recover θ̂ s.t. Tran_1(θ, θ̂) ≤ O(1/K)
- A tight tradeoff between K and the transportation distance
Our Results
Higher dimension
- A mixture θ over Δ_n; assumption: θ is a k-spike distribution (think of k as very small, k << n)
- Our result: using poly(n) 1- and 2-snapshot samples and (k/ε)^{O(k^2)} (2k−1)-snapshot samples, we can obtain a mixture θ̂ s.t. Tran_1(θ, θ̂) ≤ ε
- Note this is the L1 distance, which is harder than L2
Our Results
Higher dimension
- A mixture θ over Δ_n; assumption: θ is a k-spike distribution (think of k as very small, k << n)
- Why L1 distance? For P, Q ∈ Δ_n, d_TV(P, Q) = ||P − Q||_1
- E.g., (1/n, …, 1/n) and (0, …, 0, 2/n, …, 2/n) are two very different distributions, but their L2 distance is small (1/√n)
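A quick numeric check of this example (a minimal sketch, with n = 10,000):

```python
import numpy as np

n = 10_000
p = np.full(n, 1.0 / n)                      # uniform over [n]
q = np.concatenate([np.zeros(n // 2),        # uniform over the second half of [n]
                    np.full(n // 2, 2.0 / n)])
print(np.abs(p - q).sum())            # L1 distance = 1.0: the distributions are very different
print(np.sqrt(((p - q) ** 2).sum()))  # L2 distance = 0.01 = 1/sqrt(n): tiny
```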
Our Results
Higher dimension
- A mixture θ over Δ_n; assumption: θ is an arbitrary distribution supported on a k-dim slice of Δ_n (again, think k << n)
  [Figure: a 2-dim slice in the simplex Δ_4, with vertices (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1)]
- Our result: using poly(n) 1- and 2-snapshot samples and (k/ε)^{O(k)} K-snapshot samples (K = poly(k, 1/ε)), we can obtain a mixture θ̂ s.t. Tran_1(θ, θ̂) ≤ ε
The Coin Problem
- A mixture θ of coins (even a continuous one) over [0,1]
- Consider a K-snapshot sample: the probability of seeing exactly i heads is the Bernstein-basis moment E_{x∼θ}[ C(K,i) x^i (1−x)^{K−i} ]
- Using enough K-snapshot samples, we can estimate these K+1 moments, and hence E_θ[p] for any degree-≤K polynomial p (with error depending on the size of p's Bernstein coefficients)
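A minimal simulation sketch (my own illustration, with a hypothetical 2-coin mixture) of the statistic a K-snapshot sample exposes: the empirical histogram of head counts estimates exactly these Bernstein-basis moments.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)

# Hypothetical 2-spike coin mixture: bias 0.2 or 0.8, each with probability 1/2.
biases, weights = np.array([0.2, 0.8]), np.array([0.5, 0.5])

K, num_samples = 5, 200_000
x = rng.choice(biases, size=num_samples, p=weights)  # hidden coin biases (never observed)
heads = rng.binomial(K, x)                           # observed: number of heads in each K-snapshot

emp = np.bincount(heads, minlength=K + 1) / num_samples  # empirical Pr[i heads], i = 0..K

# True Bernstein moments E_theta[ C(K,i) x^i (1-x)^(K-i) ] for comparison.
true = np.array([sum(w * comb(K, i) * b**i * (1 - b)**(K - i)
                     for b, w in zip(biases, weights)) for i in range(K + 1)])
print(np.round(emp, 4))
print(np.round(true, 4))
```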
The Coin Problem
- A simple but useful lemma: if |E_θ[f] − E_θ̂[f]| ≤ ε for every 1-Lipschitz f, then Tran_1(θ, θ̂) ≤ ε
- Proof based on the dual formulation (Kantorovich & Rubinstein): Tran_1(P, Q) = sup { E_P[f] − E_Q[f] : f with |f(x) − f(y)| ≤ ||x − y|| }
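For intuition, a small numeric check (my own, on hypothetical distributions): on the line, Tran_1 equals the integral of the CDF difference, and any 1-Lipschitz test function f gives a lower bound E_P[f] − E_Q[f] ≤ Tran_1(P, Q).

```python
import numpy as np

# Two hypothetical discrete distributions on a grid in [0, 1].
grid = np.linspace(0, 1, 101)
p = np.exp(-((grid - 0.3) ** 2) / 0.01); p /= p.sum()
q = np.exp(-((grid - 0.6) ** 2) / 0.02); q /= q.sum()

# On the line, Tran_1(P, Q) = integral of |F_P - F_Q|.
tran = np.sum(np.abs(np.cumsum(p) - np.cumsum(q))[:-1] * np.diff(grid))

# f(x) = -x is 1-Lipschitz, so E_P[f] - E_Q[f] is a lower bound on Tran_1(P, Q).
f = -grid
print(tran, np.dot(f, p) - np.dot(f, q))  # the second number is at most the first
```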
The Coin Problem
- If we want to make Tran_1(θ, θ̂) ≤ ε, by the lemma it suffices to (approximately) match E_θ[f] for every 1-Lipschitz f on [0,1]
- Approach: approximate f by a degree-K polynomial with Bernstein coefficients bounded by C and uniform error μ; its expectation under θ is determined by the estimated Bernstein moments
- Require poly(C/ε) samples
The Coin Problem
- What C and μ can we achieve?
- Well known in approximation theory (e.g., Rivlin '03): Bernstein polynomial approximation gives C = O(1) and μ = O(1/√K) for 1-Lipschitz functions
- So, with poly(K) K-snapshot samples, Tran_1 = O(1/√K)
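A small experiment (my own sketch, not from the paper) of the Bernstein bound on the 1-Lipschitz function f(x) = |x − 1/2|: the coefficients f(i/K) of the degree-K Bernstein polynomial are bounded by a constant, and the uniform error decays like 1/√K.

```python
import numpy as np
from math import comb

def bernstein_approx(f, K, xs):
    """Evaluate the degree-K Bernstein polynomial sum_i f(i/K) C(K,i) x^i (1-x)^(K-i) at xs."""
    i = np.arange(K + 1)
    coeffs = f(i / K)                                       # coefficients, bounded by max |f|
    binom = np.array([comb(K, j) for j in i], dtype=float)
    basis = binom * xs[:, None] ** i * (1 - xs[:, None]) ** (K - i)
    return basis @ coeffs

f = lambda x: np.abs(x - 0.5)   # a 1-Lipschitz test function
xs = np.linspace(0, 1, 1001)
for K in [10, 40, 160, 640]:
    err = np.max(np.abs(bernstein_approx(f, K, xs) - f(xs)))
    print(K, err, err * np.sqrt(K))  # err * sqrt(K) stays roughly constant
```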
The Coin Problem
- Jackson's theorem: Chebyshev polynomials achieve μ = O(1/K), but after the change of basis the coefficient bound becomes C = exp(O(K))
- So, with exp(K) K-snapshot samples, Tran_1 = O(1/K)
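For comparison, a sketch (my own, using NumPy's Chebyshev routines) of the Jackson-type rate: interpolating the same f at Chebyshev nodes gives uniform error on the order of 1/K, while converting the result out of the Chebyshev basis (here, to monomials, as a stand-in for the paper's change of basis) shows the exponential coefficient blow-up behind the exp(K) sample bound.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

f = lambda x: np.abs(x - 0.5)   # the same 1-Lipschitz test function on [0, 1]
xs = np.linspace(0, 1, 2001)

for K in [10, 40, 160]:
    # Interpolate f at K+1 Chebyshev nodes, working in the variable t = 2x - 1 on [-1, 1].
    t_nodes = np.cos(np.pi * (np.arange(K + 1) + 0.5) / (K + 1))
    coeffs = C.chebfit(t_nodes, f((t_nodes + 1) / 2), K)   # Chebyshev series coefficients
    err = np.max(np.abs(C.chebval(2 * xs - 1, coeffs) - f(xs)))
    mono = C.cheb2poly(coeffs)                             # change of basis to monomials in t
    print(K, err * K, np.max(np.abs(mono)))  # err*K stays bounded; coefficients grow exponentially
```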
High Dimensional Case
- A mixture θ over Δ_n; θ is a k-spike distribution supported on a k-dim slice A of Δ_n (k << n)
  [Figure: a 2-dim slice in the simplex Δ_4, with vertices (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1)]
- Outline:
  - Step 1: Reduce the learning problem from n dimensions to k dimensions (we don't want the snapshot # to depend on n)
  - Step 2: Learn the projected mixture in the k-dim subspace (require Tran_2 ≤ ε; the snapshot # depends only on k and ε)
  - Step 3: Project back to Δ_n
High Dimensional Case
Step 1: From n dimensions to k dimensions
- Existing approach: apply SVD/PCA/eigendecomposition to the second-moment matrix, then take the subspace spanned by the first few eigenvectors. Does NOT work!
- Reason: we want Tran_1(θ, θ̂) ≤ ε, and L1 is not rotationally invariant. So it may happen that, within the chosen subspace, a relatively large error is tolerable in some directions but only a very small error is tolerable in others
- Implication: in the reduced k-dim learning problem, we would have to be very accurate in some directions, which is possible only by making the snapshot # depend on n
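For context, a sketch (my own, with hypothetical parameters) of how the second-moment matrix M = E_θ[x xᵀ] mentioned above can be estimated from 2-snapshot samples: conditioned on x, the two letters of a 2-snapshot are independent draws from x, so E[e_{s1} e_{s2}ᵀ] = E_θ[x xᵀ].

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 2-spike mixture over Delta_n with n = 6.
n = 6
spikes = rng.dirichlet(np.ones(n), size=2)
weights = np.array([0.4, 0.6])
M_true = (weights[:, None, None] * spikes[:, :, None] * spikes[:, None, :]).sum(axis=0)

num = 300_000
x = spikes[rng.choice(2, size=num, p=weights)]                 # hidden distributions
s1 = (rng.random((num, 1)) < x.cumsum(axis=1)).argmax(axis=1)  # first letter of each 2-snapshot
s2 = (rng.random((num, 1)) < x.cumsum(axis=1)).argmax(axis=1)  # second letter
M_hat = np.zeros((n, n))
np.add.at(M_hat, (s1, s2), 1.0 / num)                          # empirical E[e_{s1} e_{s2}^T]
print(np.abs(M_hat - M_true).max())                            # small estimation error
```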
High Dimensional Case
Step 1: From n dimensions to k dimensions
- What we do: find a k'-dim (k' < k) subspace B in which the L1 ball is almost spherical, and such that the supporting slice A is close to B in the L1 metric
High Dimensional Case
Step 1: From n dimensions to k dimensions (sketch)
1. Put θ in an isotropic position (by deleting and splitting letters)
2. Compute the John ellipsoid of an appropriate polytope and take the first few (normalized) principal axes
High Dimensional Case
Step 2: Learn the projected mixture in the k-dim subspace (sketch)
(1) Project onto a net of 1-dim directions
(2) Learn each 1-d projection (a coin-problem instance)
(3) Assemble the 1-d projections using an LP
- Similar to a geometric tomography question
- The analysis uses Fourier decomposition and a multidimensional version of Jackson's theorem
(A small sketch of the setup for steps (1)-(2) follows below.)
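A minimal sketch (my own, with hypothetical inputs) of the setup behind steps (1)-(2): build a crude net of unit directions in the k-dim subspace and form the 1-d projections. This only illustrates the geometry; in the actual algorithm each projected 1-d mixture is learned from snapshot statistics via the coin-problem machinery rather than from directly observed points, and step (3)'s assembling LP is omitted here.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 3

# Hypothetical k-dim representatives of the mixture's support (step 1's output, say).
reduced_points = rng.normal(size=(1000, k))

# (1) A crude net of unit directions (random here; an explicit epsilon-net would be
#     used for the actual guarantee).
directions = rng.normal(size=(50, k))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# (2) Projecting onto each direction u gives a one-dimensional problem: the values
#     <u, point> for a fixed u form the 1-d mixture to be learned.
projections = reduced_points @ directions.T   # shape (1000, 50): one column per direction
print(projections.shape)
```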