Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality
Mark Braverman, Ankit Garg, Tengyu Ma, Huy Nguyen, David Woodruff
DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms, Aug 28, 2015
Distributed mean estimation
Statistical estimation meets "Big Data": distributed storage and processing, with a small dataset on each machine.
• Unknown parameter θ.
• Inputs to the machines: i.i.d. data points ∼ D_θ.
• The machines communicate via a shared blackboard and output an estimator θ̂.
Objectives:
• Low communication B = |Π|.
• Small loss R = E‖θ̂ − θ‖₂².
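Not from the deck: a minimal runnable sketch of this setup, with a hypothetical baseline protocol in which every machine writes its full local mean on the blackboard (about d numbers each) and the coordinator averages.

```python
import numpy as np

# Minimal sketch of the setup: m machines, each holding n i.i.d. samples
# from N(theta, sigma^2 I_d). Hypothetical baseline protocol: each machine
# posts its local mean; the estimator is the grand average.
rng = np.random.default_rng(0)
d, m, n, sigma = 100, 20, 50, 1.0
theta = np.zeros(d)
theta[:5] = 1.0                                  # e.g., a 5-sparse mean

data = rng.normal(theta, sigma, size=(m, n, d))  # data[i] lives on machine i
local_means = data.mean(axis=1)                  # one message per machine
theta_hat = local_means.mean(axis=0)             # the coordinator's estimate

loss = float(np.sum((theta_hat - theta) ** 2))
print(f"squared loss = {loss:.4f}; dense rate sigma^2*d/(m*n) = {sigma**2 * d / (m*n):.4f}")
```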
Distributed sparse Gaussian mean estimation
Goal: estimate θ = (θ₁, …, θ_d).
• Ambient dimension d.
• Sparsity parameter k: ‖θ‖₀ ≤ k.
• Number of machines m.
• Each machine holds n samples.
• Standard deviation σ.
• Thus each sample is a vector X⁽ᵗ⁾ ∼ (N(θ₁, σ²), …, N(θ_d, σ²)) ∈ ℝᵈ.
Goal: estimate θ = (θ₁, …, θ_d). A higher value makes estimation:
• Ambient dimension d: harder.
• Sparsity parameter k (‖θ‖₀ ≤ k): harder.
• Number of machines m: easier*.
• Samples per machine n: easier.
• Standard deviation σ: harder.
• As before, each sample is a vector X⁽ᵗ⁾ ∼ (N(θ₁, σ²), …, N(θ_d, σ²)) ∈ ℝᵈ.
Distributed sparse Gaussian mean estimation
• Main result: if the communication cost is |Π| = B, then
  R ≥ Ω( max( σ²kd/(nB), σ²k/(mn) ) ),
  where the second term is the statistical limit.
• Notation: d = dimension, k = sparsity, m = number of machines, n = samples per machine, σ = standard deviation, R = squared loss.
• Tight up to a log d factor [GMN14]; up to a constant factor in the dense case.
• For optimal performance, B ≳ md (not kd) is needed! (See the worked check below.)
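A worked one-line check of the last bullet (my arithmetic, not a slide from the deck): the communication-limited term drops below the statistical limit exactly when B ≳ md.

```latex
% Communication B suffices for the optimal rate exactly when the
% communication-limited term is at most the statistical limit:
\[
  \frac{\sigma^2 k d}{n B} \;\lesssim\; \frac{\sigma^2 k}{m n}
  \quad\Longleftrightarrow\quad
  B \;\gtrsim\; m d .
\]
% Note that the sparsity k cancels: matching the centralized rate costs
% communication proportional to the full dimension d, per machine.
```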
Prior work (partial list)
• [Zhang-Duchi-Jordan-Wainwright'13]: the case k = 1 with general communication; the dense case for simultaneous-message protocols.
• [Shamir'14]: implies the result for k = 1 in a restricted communication model.
• [Duchi-Jordan-Wainwright-Zhang'14, Garg-Ma-Nguyen'14]: the dense case (up to logarithmic factors).
• A lot of recent work on communication-efficient distributed learning.
Reduction from Gaussian mean detection
• Recall the target bound: R ≥ Ω( max( σ²kd/(nB), σ²k/(mn) ) ).
• Gaussian mean detection:
  - A one-dimensional problem.
  - Goal: distinguish between T₀ = N(0, σ²) and T₁ = N(δ, σ²).
  - Each player gets n samples.
• Assume, toward a contradiction, that R ≪ max( σ²kd/(nB), σ²k/(mn) ).
• Distinguish between T₀ = N(0, σ²) and T₁ = N(δ, σ²); each player gets n samples.
• Theorem: if we can attain R ≤ (1/16) kδ² in the estimation problem using B communication, then we can solve the detection problem at ∼ B/d min-information cost (embedding sketched below).
• Using δ² ≪ σ²d/(Bn), we get detection using I ≪ σ²/(nδ²) min-information cost, which the next slides rule out.
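A schematic sketch of the embedding step (my illustration, not the paper's construction, which handles all k sparse coordinates and charges the protocol only ∼ B/d information via a direct-sum argument): plant the one-dimensional detection samples in a uniformly random coordinate, fill the rest with pure noise, and test the estimator's value at the planted coordinate. Here `estimate` stands in for an arbitrary black-box protocol and is just naive averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n, sigma, delta = 64, 30, 10, 1.0, 0.6

def estimate(machine_data):
    """Stand-in for a black-box distributed estimator (here: averaging)."""
    return machine_data.mean(axis=(0, 1))

def detect(one_dim_samples):
    """Embed an m x n detection instance into coordinate j of a
    d-dimensional estimation instance; other coordinates are pure noise."""
    j = rng.integers(d)
    X = rng.normal(0.0, sigma, size=(m, n, d))
    X[:, :, j] = one_dim_samples
    theta_hat = estimate(X)
    return int(abs(theta_hat[j]) > delta / 2)  # accept T_1 if coord j looks shifted

h0 = rng.normal(0.0, sigma, size=(m, n))    # samples under T_0
h1 = rng.normal(delta, sigma, size=(m, n))  # samples under T_1
print("under T_0 ->", detect(h0), "  under T_1 ->", detect(h1))
```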
The detection problem
• Distinguish between T₀ = N(0, 1) and T₁ = N(δ, 1).
• Each player gets n samples.
• Want this to be impossible using I ≪ 1/(nδ²) min-information cost.
The detection problem
• Since the sample mean is a sufficient statistic, replace each player's n samples by their average: distinguish between T₀ = N(0, 1/n) and T₁ = N(δ, 1/n), with each player getting one sample.
• Want this to be impossible using I ≪ 1/(nδ²) min-information cost.
The detection problem
• By scaling everything by √n (and replacing δ with δ√n):
• Distinguish between T₀ = N(0, 1) and T₁ = N(δ, 1).
• Each player gets one sample.
• Want this to be impossible using I ≪ 1/δ² min-information cost.
• Tight (for m large enough; otherwise the task is impossible).
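A quick numerical sanity check (mine, not from the talk) of the two reduction steps: averaging n unit-variance samples and rescaling by √n turns "n samples, shift δ" into "one sample, shift δ√n", which is why the budget 1/(nδ²) becomes 1/(δ√n)² after the substitution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta = 25, 0.1
samples = rng.normal(delta, 1.0, size=(100_000, n))  # many trials of n samples
z = np.sqrt(n) * samples.mean(axis=1)                # one rescaled sample per trial
print(f"mean(z) = {z.mean():.3f}  (expect delta*sqrt(n) = {delta*np.sqrt(n):.3f})")
print(f"std(z)  = {z.std():.3f}  (expect 1.000)")
```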
Information cost
T_v = N(vδ, 1); each player i draws Xᵢ ∼ T_V, and the players communicate over a blackboard, producing transcript Π.
IC(π) := I(Π; X₁X₂…Xₘ)
Min-information cost
T_v = N(vδ, 1); X₁, …, Xₘ ∼ T_V; blackboard transcript Π.
minIC(π) := min_{v∈{0,1}} I(Π; X₁X₂…Xₘ | V = v)
Min-information cost
minIC(π) := min_{v∈{0,1}} I(Π; X₁X₂…Xₘ | V = v)
• We will want this quantity to be Ω(1/δ²).
• Warning: it is not the same thing as I(Π; X₁X₂…Xₘ | V) = E_{v∼V} I(Π; X₁X₂…Xₘ | V = v), because one case can be much smaller than the other (toy example below).
• In our case, the need to use minIC instead of IC arises because of the sparsity.
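A toy example (not from the talk) of how far apart the minimum over v and the average over v can be. The one-player "protocol" below announces its input bit only when V = 1, so I(Π; X | V=1) = 1 bit while I(Π; X | V=0) = 0 bits: the average is 1/2 bit, but the minimum is 0.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits, for a joint pmf given as a 2-D array."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

# X is a fair coin, independent of V. Given V=1 the transcript is Pi = X;
# given V=0 the transcript is a fixed "silent" symbol.
joint_v1 = np.array([[0.5, 0.0],
                     [0.0, 0.5]])   # rows: Pi in {0,1}; cols: X in {0,1}
joint_v0 = np.array([[0.5, 0.5]])   # single "silent" transcript row
print("I(Pi;X|V=1) =", mutual_information(joint_v1))  # -> 1.0
print("I(Pi;X|V=0) =", mutual_information(joint_v0))  # -> 0.0
```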
Strong data processing inequality
T_v = N(vδ, 1); X₁, …, Xₘ ∼ T_V; blackboard transcript Π.
Fact (chain rule): |Π| ≥ I(Π; X₁X₂…Xₘ) = Σᵢ I(Π; Xᵢ | X₁…Xᵢ₋₁).
Strong data processing inequality
• T_v = N(vδ, 1); suppose V ∼ Bernoulli(1/2).
• For each i, V → Xᵢ → Π is a Markov chain.
• Intuition: "Xᵢ contains little information about V; there is no way to learn this information except by learning a lot about Xᵢ."
• Data processing: I(V; Π) ≤ I(Xᵢ; Π).
• Strong data processing: I(V; Π) ≤ β · I(Xᵢ; Π) for some β = β(T₀, T₁) < 1.
Strong data processing inequality
• T_v = N(vδ, 1); suppose V ∼ Bernoulli(1/2).
• For each i, V → Xᵢ → Π is a Markov chain.
• Strong data processing: I(V; Π) ≤ β · I(Xᵢ; Π) for some β = β(T₀, T₁) < 1.
• In this case (T₀ = N(0, 1), T₁ = N(δ, 1)):
  β(T₀, T₁) ∼ I(V; sign(Xᵢ)) / I(Xᵢ; sign(Xᵢ)) ∼ δ².
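A numerical check (mine, not from the talk) of this last estimate through the binary channel S = sign(Xᵢ), using I(V; S) = H(S) − H(S | V) and I(Xᵢ; S) = H(S) (since S is a function of Xᵢ):

```python
import math

def Phi(x):                      # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def h(p):                        # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

for delta in (0.5, 0.1, 0.02):
    p0, p1 = 0.5, Phi(delta)     # P(S=+1 | V=0), P(S=+1 | V=1)
    ps = 0.5 * (p0 + p1)         # P(S=+1) under V ~ Bernoulli(1/2)
    i_vs = h(ps) - 0.5 * (h(p0) + h(p1))   # I(V;S)
    i_xs = h(ps)                 # I(X;S) = H(S)
    print(f"delta={delta:<5}  I(V;S)/I(X;S) = {i_vs/i_xs:.2e}   delta^2 = {delta**2:.2e}")
```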
"Proof"
• T_v = N(vδ, 1); suppose V ∼ Bernoulli(1/2).
• Strong data processing: I(V; Π) ≤ δ² · I(Xᵢ; Π).
• We know I(V; Π) = Ω(1), since the protocol solves detection.
• Chaining:
  |Π| ≥ I(Π; X₁X₂…Xₘ) ≳ Σᵢ ("info Π conveys about the i-th coordinate") ≳ (1/δ²) · I(V; Π) = Ω(1/δ²).
Q.E.D.!
Issues with the proof
• It is the right high-level idea.
• Two main issues:
  - Not clear how to deal with additivity over the coordinates.
  - Dealing with minIC instead of IC.
If the picture were this…
T_v = N(vδ, 1); X₁ ∼ T_V, but X₂, …, Xₘ ∼ T₀ (only player 1's input depends on V); blackboard transcript Π.
Then indeed I(Π; V) ≤ δ² · I(Π; X₁).
Hellinger distance
• Solution to additivity: use the (squared) Hellinger distance
  h²(P, Q) = (1/2) ∫_Ω ( √P(y) − √Q(y) )² dy,
  following [Jayram'09].
• h²(Π_{V=0}, Π_{V=1}) ∼ I(V; Π) = Ω(1).
• h²(Π_{V=0}, Π_{V=1}) decomposes into m scenarios as above (only player i's input depends on V), using the fact that Π is a protocol.
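A quick check (mine) that the squared Hellinger distance between the two single-sample distributions is itself of order δ²: for unit-variance Gaussians, h²(N(0,1), N(δ,1)) = 1 − exp(−δ²/8) ≈ δ²/8, using the equivalent form h² = 1 − ∫√(PQ).

```python
import numpy as np

y = np.linspace(-20.0, 20.0, 400_001)
dy = y[1] - y[0]
p = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)                 # density of N(0,1)
for delta in (1.0, 0.3, 0.05):
    q = np.exp(-(y - delta)**2 / 2) / np.sqrt(2 * np.pi)   # density of N(delta,1)
    h2 = 1.0 - np.sum(np.sqrt(p * q)) * dy                 # h^2 = 1 - int sqrt(PQ)
    print(f"delta={delta}: h^2 = {h2:.6f}, closed form = {1 - np.exp(-delta**2/8):.6f}")
```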
minIC
• Dealing with minIC is more technical. Recall:
  minIC(π) := min_{v∈{0,1}} I(Π; X₁X₂…Xₘ | V = v).
• It leads to our main technical statement, a "Distributed Strong Data Processing Inequality":
  Theorem: Suppose Ω(1) · T₀ ≤ T₁ ≤ O(1) · T₀, and let β(T₀, T₁) be the SDPI constant. Then
  h²(Π_{V=0}, Π_{V=1}) ≤ O(β(T₀, T₁)) · minIC(π).
Putting it together
Theorem: Suppose Ω(1) · T₀ ≤ T₁ ≤ O(1) · T₀, and let β(T₀, T₁) be the SDPI constant. Then h²(Π_{V=0}, Π_{V=1}) ≤ O(β(T₀, T₁)) · minIC(π).
• With T₀ = N(0, 1), T₁ = N(δ, 1), β ∼ δ², we get
  Ω(1) = h²(Π_{V=0}, Π_{V=1}) ≤ δ² · minIC(π).
• Therefore, minIC(π) = Ω(1/δ²).
Putting it together: the hypothesis is essential!
Theorem: Suppose Ω(1) · T₀ ≤ T₁ ≤ O(1) · T₀, and let β(T₀, T₁) be the SDPI constant. Then h²(Π_{V=0}, Π_{V=1}) ≤ O(β(T₀, T₁)) · minIC(π).
• With T₀ = N(0, 1), T₁ = N(δ, 1), the condition Ω(1) · T₀ ≤ T₁ ≤ O(1) · T₀ fails!! (The density ratio exp(δx − δ²/2) is unbounded over x ∈ ℝ.)
• Need an additional truncation step. Fortunately, the failure happens only far in the tails (illustrated below).
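A small illustration (mine) of where the hypothesis fails: the density ratio T₁(x)/T₀(x) = exp(δx − δ²/2) blows up as x → +∞ and vanishes as x → −∞, but only far in the tails, which is exactly what the truncation step exploits.

```python
import math

delta = 0.1
for x in (0.0, 3.0, 10.0, 100.0, -100.0):
    ratio = math.exp(delta * x - delta**2 / 2)  # T1(x) / T0(x)
    print(f"x = {x:7.1f}:  T1(x)/T0(x) = {ratio:.3e}")
```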
Summary
• Hellinger distance + strong data processing: "only get δ² bits toward detection per bit of minIC", hence a 1/δ² lower bound.
• Gaussian mean detection (n → 1 sample), lifted via a direct sum argument (minIC), yields the lower bound for sparse Gaussian mean estimation; a reduction [ZDJW'13] then yields the lower bound for distributed sparse linear regression.
Distributed sparse linear regression
• Each machine gets n data points of the form (Aᵢ, yᵢ), where yᵢ = ⟨Aᵢ, θ⟩ + wᵢ, wᵢ ∼ N(0, σ²).
• Promised that θ is k-sparse: ‖θ‖₀ ≤ k.
• Ambient dimension d.
• Loss R = E‖θ̂ − θ‖₂².
• How much communication is needed to achieve the statistically optimal loss?
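Not from the deck: a minimal data-generation sketch for this model, together with a hypothetical communication-heavy baseline that ships all (Aᵢ, yᵢ) pairs to one machine (about nmd numbers), solves least squares, and keeps the k largest coordinates. The talk's lower bound concerns how much of this communication is unavoidable.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, m, n, sigma = 50, 3, 10, 20, 0.5

theta = np.zeros(d)
theta[rng.choice(d, size=k, replace=False)] = 1.0   # a k-sparse parameter

A = rng.normal(size=(m * n, d))                     # all machines' rows A_i
y = A @ theta + rng.normal(0.0, sigma, size=m * n)  # y_i = <A_i, theta> + w_i

# Centralized baseline: least squares, then hard-threshold to k entries.
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
theta_hat = np.zeros(d)
top = np.argsort(-np.abs(theta_ls))[:k]
theta_hat[top] = theta_ls[top]
print("squared loss:", float(np.sum((theta_hat - theta) ** 2)))
```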