Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality
Mark Braverman, Ankit Garg, Tengyu Ma, Huy Nguyen, David Woodruff
DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms, Aug 28, 2015
Distributed mean estimation
[Figure: "Big Data!" split across several machines, each holding a small chunk of the data; distributed storage and processing via a shared blackboard.]
• Statistical estimation:
  – Unknown parameter θ.
  – Inputs to machines: i.i.d. data points ∼ D_θ.
  – Output: estimator θ̂.
• Objectives:
  – Low communication B = |Π|.
  – Small loss R = E‖θ̂ − θ‖².
Distributed sparse Gaussian mean estimation
Goal: estimate (θ_1, …, θ_d).
• Ambient dimension d.
• Sparsity parameter k: ‖θ‖_0 ≤ k.
• Number of machines m.
• Each machine holds n samples.
• Standard deviation σ.
• Thus each sample is a vector x^(i) ∼ (N(θ_1, σ²), …, N(θ_d, σ²)) ∈ ℝ^d, with independent coordinates.
Distributed sparse Gaussian mean estimation
Goal: estimate (θ_1, …, θ_d). A higher value of each parameter makes estimation:
• Ambient dimension d: harder.
• Sparsity parameter k (‖θ‖_0 ≤ k): harder.
• Number of machines m: easier*.
• Number of samples n per machine: easier.
• Standard deviation σ: harder.
• Each sample is a vector x^(i) ∼ (N(θ_1, σ²), …, N(θ_d, σ²)) ∈ ℝ^d.
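To make the sampling model concrete, here is a minimal simulation sketch; the parameter values, the random seed, and the naive per-machine averaging baseline are illustrative assumptions, not part of the talk.

```python
import numpy as np

# Illustrative parameters (not from the talk): ambient dimension d, sparsity k,
# m machines, n samples per machine, noise level sigma.
d, k, m, n, sigma = 1000, 10, 20, 50, 1.0
rng = np.random.default_rng(0)

# k-sparse unknown mean vector theta.
theta = np.zeros(d)
theta[rng.choice(d, size=k, replace=False)] = 1.0

# Each machine holds n i.i.d. samples from N(theta, sigma^2 * I_d).
data = [theta + sigma * rng.standard_normal((n, d)) for _ in range(m)]

# Naive baseline: every machine ships its d-dimensional sample mean (about d
# numbers of communication each); the center averages them.
theta_hat = np.mean([X.mean(axis=0) for X in data], axis=0)
print("squared loss:", np.sum((theta_hat - theta) ** 2))   # roughly sigma^2 * d / (m * n)
```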
Distributed sparse Gaussian mean estimation
• Main result: if the communication |Π| = B, then
  R ≥ Ω( max( σ²kd/(nB), σ²k/(mn) ) ).
  (The second term is the statistical limit.)
• Notation: d = dimension, k = sparsity, m = machines, n = samples per machine, σ = deviation, R = squared loss.
• Tight up to a log d factor [GMN14]; up to a constant factor in the dense case.
• For optimal performance, B ≳ md (not mk) is needed!
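A quick numeric reading of the bound (the parameter values below are made up, and the constants hidden in the Ω are dropped):

```python
# Plugging illustrative numbers into the two terms of the lower bound
# R >= Omega(max(sigma^2*k*d/(n*B), sigma^2*k/(m*n))).
d, k, m, n, sigma = 1000, 10, 20, 50, 1.0

def lower_bound(B):
    comm_term = sigma**2 * k * d / (n * B)   # binding when communication B is small
    stat_term = sigma**2 * k / (m * n)       # the statistical limit
    return max(comm_term, stat_term)

for B in [100, 1_000, m * d, 10 * m * d]:
    print(B, lower_bound(B))
# The communication term stops dominating once B reaches about m*d = 20000,
# which is the "B >~ md is needed" statement above.
```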
Prior work (partial list)
• [Zhang-Duchi-Jordan-Wainwright'13]: the case d = 1 with general communication, and the dense case for simultaneous-message protocols.
• [Shamir'14]: implies the result for d = 1 in a restricted communication model.
• [Duchi-Jordan-Wainwright-Zhang'14, Garg-Ma-Nguyen'14]: the dense case (up to logarithmic factors).
• A lot of recent work on communication-efficient distributed learning.
Reduction from Gaussian mean detection
• Target: R ≥ Ω( max( σ²kd/(nB), σ²k/(mn) ) ).
• Gaussian mean detection:
  – A one-dimensional problem.
  – Goal: distinguish between P_0 = N(0, σ²) and P_1 = N(δ, σ²).
  – Each player gets n samples.
• Assume (towards a contradiction) R ≪ max( σ²kd/(nB), σ²k/(mn) ).
• Distinguish between P_0 = N(0, σ²) and P_1 = N(δ, σ²).
• Theorem: if we can attain R ≤ (1/16)·kδ² in the estimation problem using B communication, then we can solve the detection problem at ∼ B/d min-information cost.
• Using δ² ≪ σ²d/(Bn), we get detection at I ≪ σ²/(nδ²) min-information cost.
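The last bullet is just a rearrangement of the assumption on δ²; spelled out (same notation as above):

```latex
% Rearranging the assumption delta^2 << sigma^2 d / (B n):
\delta^2 \ll \frac{\sigma^2 d}{B n}
\quad\Longleftrightarrow\quad
B \ll \frac{\sigma^2 d}{n \delta^2}
\quad\Longleftrightarrow\quad
\frac{B}{d} \ll \frac{\sigma^2}{n \delta^2}.
```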
The detection problem
• Distinguish between P_0 = N(0, 1) and P_1 = N(δ, 1).
• Each player gets n samples.
• Want this to be impossible at I ≪ 1/(nδ²) min-information cost.
The detection problem
• Distinguish between P_0 = N(0, 1) and P_1 = N(δ, 1); equivalently, between P_0 = N(0, 1/n) and P_1 = N(δ, 1/n).
• Each player gets n samples → one sample (the average of a player's n samples is a sufficient statistic).
• Want this to be impossible at I ≪ 1/(nδ²) min-information cost.
The detection problem
• By scaling everything by √n (and replacing δ with δ√n):
• Distinguish between P_0 = N(0, 1) and P_1 = N(δ, 1).
• Each player gets one sample.
• Want this to be impossible at I ≪ 1/δ² min-information cost.
  Tight (for m large enough; otherwise the task is impossible).
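A quick Monte Carlo sanity check of the two reductions above (averaging the n samples, then rescaling by √n); the sample counts and seed are arbitrary:

```python
import numpy as np

# Each player starts with n samples from N(delta * v, 1).  Averaging gives one
# sample from N(delta * v, 1/n); rescaling by sqrt(n) gives one sample from
# N(delta * sqrt(n) * v, 1), i.e. the one-sample problem with delta' = delta * sqrt(n).
n, delta, trials = 25, 0.1, 200_000
rng = np.random.default_rng(1)

samples = delta + rng.standard_normal((trials, n))   # the v = 1 case
reduced = np.sqrt(n) * samples.mean(axis=1)

print("empirical mean :", reduced.mean(), " (expect delta*sqrt(n) =", delta * np.sqrt(n), ")")
print("empirical stdev:", reduced.std(), " (expect 1)")
```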
Information cost
[Figure: players 1, …, m with X_1 ∼ P_v, X_2 ∼ P_v, …, X_m ∼ P_v, writing to a shared blackboard; Π denotes the transcript.]
• P_v = N(δv, 1).
• IC(π) := I(Π; X_1 X_2 … X_m).
Min-information cost
[Same picture: X_1, …, X_m ∼ P_v, blackboard transcript Π.]
• P_v = N(δv, 1).
• minIC(π) := min_{v ∈ {0,1}} I(Π; X_1 X_2 … X_m | V = v).
Min-information cost
• minIC(π) := min_{v ∈ {0,1}} I(Π; X_1 X_2 … X_m | V = v).
• We will want this quantity to be Ω(1/δ²).
• Warning: it is not the same thing as I(Π; X_1 X_2 … X_m | V) = E_{v∼V} I(Π; X_1 X_2 … X_m | V = v), because one case can be much smaller than the other.
• In our case, the need to use minIC instead of IC arises because of the sparsity.
Strong data processing inequality
[Same picture: P_v = N(δv, 1), X_1, …, X_m ∼ P_v, blackboard transcript Π.]
• Fact: |Π| ≥ I(Π; X_1 X_2 … X_m) = Σ_i I(Π; X_i | X_{<i}).
Strong data processing inequality
• P_v = N(δv, 1); suppose V ∼ Bernoulli(1/2).
• For each i, V → X_i → Π is a Markov chain.
• Intuition: "X_i contains little information about V; there is no way to learn this information except by learning a lot about X_i."
• Data processing: I(V; Π) ≤ I(X_i; Π).
• Strong data processing: I(V; Π) ≤ γ · I(X_i; Π) for some γ = γ(P_0, P_1) < 1.
Strong data processing inequality
• P_v = N(δv, 1); suppose V ∼ Bernoulli(1/2).
• For each i, V → X_i → Π is a Markov chain.
• Strong data processing: I(V; Π) ≤ γ · I(X_i; Π) for some γ = γ(P_0, P_1) < 1.
• In this case (P_0 = N(0, 1), P_1 = N(δ, 1)):
  γ(P_0, P_1) ∼ I(V; sign(X_i)) / I(X_i; sign(X_i)) ∼ δ².
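A small numeric illustration of that last estimate, using the sign test as the extractor; the exact constant below is not part of the claim:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

for delta in [0.4, 0.2, 0.1, 0.05]:
    p0, p1 = 0.5, Phi(delta)            # P(sign(X)=+1 | V=0), P(sign(X)=+1 | V=1)
    p = 0.5 * (p0 + p1)                 # P(sign(X)=+1) under V ~ Bernoulli(1/2)
    I_V_sign = h2(p) - 0.5 * (h2(p0) + h2(p1))   # I(V; sign(X))
    I_X_sign = h2(p)                    # I(X; sign(X)) = H(sign(X)), since sign(X) is a function of X
    print(delta, I_V_sign / delta**2, I_X_sign)
# The ratio I(V; sign(X)) / delta^2 stays roughly constant (about 1/(4*pi*ln 2) ~ 0.11),
# while I(X; sign(X)) stays close to 1 bit, so gamma(P0, P1) scales like delta^2.
```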
"Proof"
• P_v = N(δv, 1); suppose V ∼ Bernoulli(1/2).
• Strong data processing: I(V; Π) ≤ δ² · I(X_i; Π).
• We know I(V; Π) = Ω(1).
• |Π| ≥ I(Π; X_1 X_2 … X_m) ≳ Σ_i I(Π; X_i)
  ("the information Π conveys about V is only a δ² fraction of what it conveys about the X_i's")
  ≳ (1/δ²) · I(V; Π) = Ω(1/δ²).  Q.E.D.!
Issues with the proof
• The right high-level idea.
• Two main issues:
  – Not clear how to deal with additivity over coordinates.
  – Dealing with minIC instead of IC.
If the picture were this…
[Figure: P_v = N(δv, 1); X_1 ∼ P_v, but X_2 ∼ P_0, …, X_m ∼ P_0; blackboard transcript Π.]
Then indeed I(Π; V) ≤ δ² · I(Π; X_1).
Hellinger distance
• Solution to additivity: use the (squared) Hellinger distance
  h²(P, Q) = ½ ∫_Ω ( √P(x) − √Q(x) )² dx.
• Following [Jayram'09]:
  – h²(Π|_{V=0}, Π|_{V=1}) ∼ I(V; Π) = Ω(1).
  – h²(Π|_{V=0}, Π|_{V=1}) decomposes into m scenarios as above, using the fact that Π is a protocol.
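For the specific pair of distributions used here, h² has a simple closed form; a small check (assuming the convention above, i.e. h²(P, Q) = 1 − ∫√(P·Q)):

```python
import numpy as np

def hellinger_sq_closed_form(delta):
    """h^2(N(0,1), N(delta,1)) = 1 - exp(-delta^2 / 8)."""
    return 1 - np.exp(-delta**2 / 8)

def hellinger_sq_numeric(delta):
    # Direct numerical integration of 1 - int sqrt(P(x) * Q(x)) dx on a wide grid.
    x = np.linspace(-12, 12, 200_001)
    p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    q = np.exp(-(x - delta)**2 / 2) / np.sqrt(2 * np.pi)
    return 1 - np.trapz(np.sqrt(p * q), x)

for delta in [0.5, 0.1, 0.01]:
    print(delta, hellinger_sq_closed_form(delta), hellinger_sq_numeric(delta))
# For small delta, h^2 is approximately delta^2 / 8, matching the delta^2 scaling
# used in the argument.
```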
minIC
• Dealing with minIC is more technical. Recall:
  minIC(π) := min_{v ∈ {0,1}} I(Π; X_1 X_2 … X_m | V = v).
• Leads to our main technical statement, a "distributed strong data processing inequality":
  Theorem: Suppose Ω(1)·P_0 ≤ P_1 ≤ O(1)·P_0, and let γ(P_0, P_1) be the SDPI constant. Then
  h²(Π|_{V=0}, Π|_{V=1}) ≤ O( γ(P_0, P_1) · minIC(π) ).
Putting it together
Theorem: Suppose Ω(1)·P_0 ≤ P_1 ≤ O(1)·P_0, and let γ(P_0, P_1) be the SDPI constant. Then
  h²(Π|_{V=0}, Π|_{V=1}) ≤ O( γ(P_0, P_1) · minIC(π) ).
• With P_0 = N(0, 1), P_1 = N(δ, 1), γ ∼ δ², we get
  Ω(1) = h²(Π|_{V=0}, Π|_{V=1}) ≤ O( δ² · minIC(π) ).
• Therefore, minIC(π) = Ω(1/δ²).
Putting it together
Theorem: Suppose Ω(1)·P_0 ≤ P_1 ≤ O(1)·P_0 (this hypothesis is essential!), and let γ(P_0, P_1) be the SDPI constant. Then
  h²(Π|_{V=0}, Π|_{V=1}) ≤ O( γ(P_0, P_1) · minIC(π) ).
• With P_0 = N(0, 1), P_1 = N(δ, 1):
  – Ω(1)·P_0 ≤ P_1 ≤ O(1)·P_0 fails!!
  – Need an additional truncation step. Fortunately, the failure happens far in the tails.
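To see concretely where the hypothesis fails: the likelihood ratio P_1(x)/P_0(x) = exp(δx − δ²/2) stays within constant factors only for |x| ≲ 1/δ (a tiny illustration; δ and the sample points below are arbitrary):

```python
import math

delta = 0.1
for x in [0.0, 1.0, 1 / delta, 5 / delta, 10 / delta]:
    ratio = math.exp(delta * x - delta**2 / 2)   # density of N(delta,1) over density of N(0,1)
    print(f"x = {x:6.1f}   P1(x)/P0(x) = {ratio:10.3f}")
# The ratio is Theta(1) for |x| <~ 1/delta, but grows without bound (and vanishes
# for very negative x) deep in the tails, which is why a truncation step is needed.
```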
Summary
[Flow of the argument, originally a diagram:]
• Hellinger distance + strong data processing ⇒ a lower bound for Gaussian mean detection (n → 1 sample): "only get δ² bits toward detection per bit of minIC", i.e. a 1/δ² lower bound.
• A reduction [ZDJW'13] and a direct sum argument (minIC) lift this to sparse Gaussian mean estimation.
• Sparse Gaussian mean estimation ⇒ distributed sparse linear regression.
Distributed sparse linear regression
• Each machine gets n data points of the form (A_i, y_i), where
  y_i = ⟨A_i, θ⟩ + w_i,  w_i ∼ N(0, σ²).
• Promised that θ is k-sparse: ‖θ‖_0 ≤ k.
• Ambient dimension d.
• Loss R = E‖θ̂ − θ‖².
• How much communication is needed to achieve the statistically optimal loss?
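A minimal sketch of this data model (the Gaussian design matrix and all parameter values below are illustrative assumptions; the talk does not fix a design):

```python
import numpy as np

# Illustrative parameters: dimension d, sparsity k, m machines, n points each, noise sigma.
d, k, m, n, sigma = 500, 5, 10, 100, 0.5
rng = np.random.default_rng(2)

theta = np.zeros(d)
theta[rng.choice(d, size=k, replace=False)] = 1.0    # k-sparse parameter

machines = []
for _ in range(m):
    A = rng.standard_normal((n, d))                   # design rows A_i (assumed Gaussian here)
    y = A @ theta + sigma * rng.standard_normal(n)    # y_i = <A_i, theta> + w_i
    machines.append((A, y))

# The question on the slide: how many bits must the machines exchange to
# recover theta up to the statistically optimal squared loss?
```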