
Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality
Mark Braverman, Ankit Garg, Tengyu Ma, Huy Nguyen, David Woodruff
DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms


  1. Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality. Mark Braverman, Ankit Garg, Tengyu Ma, Huy Nguyen, David Woodruff. DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms, Aug 28, 2015.

  2. Distributed mean estimation. Big Data: distributed storage and processing; each machine holds only a small piece of the data, and the machines communicate via a blackboard. Statistical estimation:
  – Unknown parameter ΞΈ.
  – Inputs to machines: i.i.d. data points ∼ D_ΞΈ.
  – Output: an estimator ΞΈΜ‚.
  Objectives:
  – Low communication C = |Ξ |.
  – Small loss R := 𝔼‖θ̂ βˆ’ ΞΈβ€–β‚‚Β².

  3. Distributed sparse Gaussian mean estimation. Goal: estimate (ΞΈ_1, …, ΞΈ_d).
  β€’ Ambient dimension d.
  β€’ Sparsity parameter k: β€–ΞΈβ€–β‚€ ≀ k.
  β€’ Number of machines m.
  β€’ Each machine holds n samples.
  β€’ Standard deviation Οƒ.
  β€’ Thus each sample is a vector X_j ∼ 𝒩(ΞΈ_1, σ²) Γ— β‹― Γ— 𝒩(ΞΈ_d, σ²) ∈ ℝ^d.
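To make the setting concrete, here is a minimal simulation sketch (mine, not from the talk; all parameter values are illustrative): it generates k-sparse Gaussian data across m machines and runs a naive baseline protocol in which every machine writes its d-dimensional local mean to the blackboard and the coordinator averages and keeps the k largest coordinates.

```python
import numpy as np

# Illustrative parameters (not from the talk).
d, k, m, n, sigma = 1000, 5, 20, 50, 1.0
rng = np.random.default_rng(0)

# k-sparse true mean theta.
theta = np.zeros(d)
theta[rng.choice(d, size=k, replace=False)] = 1.0

# Each machine holds n i.i.d. samples X ~ N(theta, sigma^2 I_d).
data = [theta + sigma * rng.standard_normal((n, d)) for _ in range(m)]

# Naive protocol: every machine writes its local mean to the blackboard
# (communication ~ m*d reals); the coordinator averages the m means and
# keeps the k coordinates that are largest in absolute value.
local_means = np.stack([X.mean(axis=0) for X in data])   # shape (m, d)
global_mean = local_means.mean(axis=0)
theta_hat = np.zeros(d)
top_k = np.argsort(np.abs(global_mean))[-k:]
theta_hat[top_k] = global_mean[top_k]

R = np.sum((theta_hat - theta) ** 2)   # squared loss (one random draw)
print(f"squared loss R = {R:.4f}, communication ~ {m * d} reals")
```

This baseline is communication-heavy (about mΒ·d reals); the main result on slide 5 says that, up to a log factor, nothing substantially cheaper can reach the statistically optimal loss.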

  4. Goal: estimate (ΞΈ_1, …, ΞΈ_d). Does a higher value make estimation harder or easier?
  β€’ Ambient dimension d: harder.
  β€’ Sparsity parameter k (β€–ΞΈβ€–β‚€ ≀ k): harder.
  β€’ Number of machines m: easier*.
  β€’ Samples per machine n: easier.
  β€’ Standard deviation Οƒ: harder.
  β€’ Each sample is a vector X_j ∼ 𝒩(ΞΈ_1, σ²) Γ— β‹― Γ— 𝒩(ΞΈ_d, σ²) ∈ ℝ^d.

  5. Distributed sparse Gaussian mean estimation. Main result: if |Ξ | = C, then R β‰₯ Ξ©( max{ σ²dk/(nC), σ²k/(nm) } ), where the second term σ²k/(nm) is the statistical limit.
  β€’ Notation: d is the dimension, k the sparsity, m the number of machines, n the samples per machine, Οƒ the standard deviation, R the squared loss.
  β€’ Tight up to a log d factor [GMN14], and up to a constant factor in the dense case.
  β€’ For statistically optimal performance, C ≳ md (not mk) communication is needed!
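A small helper (a sketch that treats the hidden constant in the Ξ©(Β·) as 1) makes the trade-off concrete: with C on the order of mk the communication term exceeds the statistical limit by a factor of d/k, and only C on the order of md brings the two terms level.

```python
def risk_lower_bound(sigma, d, k, m, n, C):
    """max(sigma^2*d*k/(n*C), sigma^2*k/(n*m)), ignoring the Omega(.) constant."""
    comm_term = sigma**2 * d * k / (n * C)   # communication-limited term
    stat_term = sigma**2 * k / (n * m)       # statistical limit
    return max(comm_term, stat_term)

sigma, d, k, m, n = 1.0, 1000, 5, 20, 50     # illustrative values
for C in (m * k, m * d):                     # C ~ mk versus C ~ md
    print(f"C = {C:6d}  lower bound ~ {risk_lower_bound(sigma, d, k, m, n, C):.2e}")
```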

  6. Prior work (partial list)
  β€’ [Zhang-Duchi-Jordan-Wainwright'13]: the case d = 1 with general communication, and the dense case for simultaneous-message protocols.
  β€’ [Shamir'14]: implies the result for k = 1 in a restricted communication model.
  β€’ [Duchi-Jordan-Wainwright-Zhang'14, Garg-Ma-Nguyen'14]: the dense case (up to logarithmic factors).
  β€’ A lot of recent work on communication-efficient distributed learning.

  7. Reduction from Gaussian mean detection. Target bound: R β‰₯ Ξ©( max{ σ²dk/(nC), σ²k/(nm) } ).
  β€’ Gaussian mean detection:
  – A one-dimensional problem.
  – Goal: distinguish between ΞΌ_0 = 𝒩(0, σ²) and ΞΌ_1 = 𝒩(Ξ΄, σ²).
  – Each player gets n samples.

  8. Assume R β‰ͺ max{ σ²dk/(nC), σ²k/(nm) }. Distinguish between ΞΌ_0 = 𝒩(0, σ²) and ΞΌ_1 = 𝒩(Ξ΄, σ²).
  β€’ Theorem: if we can attain R ≀ (1/16)Β·kδ² in the estimation problem using C communication, then we can solve the detection problem at ∼ C/d min-information cost.
  β€’ Using δ² β‰ͺ σ²d/(Cn), we get detection at β‰ͺ σ²/(nδ²) min-information cost.
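The last bullet is just a rearrangement of the chosen parameter; writing it out (with the ∼ C/d cost from the theorem above):

```latex
\[
\delta^{2} \ll \frac{\sigma^{2} d}{C n}
\iff
C \ll \frac{\sigma^{2} d}{n \delta^{2}}
\iff
\frac{C}{d} \ll \frac{\sigma^{2}}{n \delta^{2}} ,
\]
% hence a detection protocol run at ~ C/d min-information cost would use
% far less than sigma^2/(n delta^2), matching the bullet above.
```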

  9. The detection problem.
  β€’ Distinguish between ΞΌ_0 = 𝒩(0, 1) and ΞΌ_1 = 𝒩(Ξ΄, 1).
  β€’ Each player gets n samples.
  β€’ Want this to be impossible using β‰ͺ 1/(nδ²) min-information cost.

  10. The detection problem.
  β€’ Distinguish between ΞΌ_0 = 𝒩(0, 1) and ΞΌ_1 = 𝒩(Ξ΄, 1); each player gets n samples.
  β€’ Equivalently (each player replaces its n samples by their mean, a sufficient statistic): distinguish between ΞΌ_0 = 𝒩(0, 1/n) and ΞΌ_1 = 𝒩(Ξ΄, 1/n); each player gets one sample.
  β€’ Want this to be impossible using β‰ͺ 1/(nδ²) min-information cost.

  11. The detection problem.
  β€’ By scaling everything by √n (and replacing Ξ΄ with δ√n):
  β€’ Distinguish between ΞΌ_0 = 𝒩(0, 1) and ΞΌ_1 = 𝒩(Ξ΄, 1).
  β€’ Each player gets one sample.
  β€’ Want this to be impossible using β‰ͺ 1/δ² min-information cost. This is tight (for m large enough; otherwise the task is impossible).
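Filling in the two reduction steps above as a short derivation (the sample mean is a sufficient statistic for a Gaussian location family, which justifies the n-to-1 replacement):

```latex
\[
\bar{X}_i \;=\; \frac{1}{n}\sum_{j=1}^{n} X_{i,j}
      \;\sim\; \mathcal{N}\!\Big(\delta v,\; \tfrac{1}{n}\Big),
\qquad
\sqrt{n}\,\bar{X}_i \;\sim\; \mathcal{N}\!\big(\delta\sqrt{n}\, v,\; 1\big).
\]
% With delta' := delta*sqrt(n), the target "impossible below 1/delta'^2"
% is exactly the earlier "impossible below 1/(n*delta^2)".
```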

  12. Information cost. Setup: ΞΌ_v = 𝒩(δ·v, 1); a bit V selects the distribution; machine i receives X_i ∼ ΞΌ_V for i = 1, …, m; the machines communicate on the blackboard, producing transcript Ξ .
  IC(Ο€) := I(Ξ ; X_1 X_2 … X_m)

  13. Min-information cost. Same setup: ΞΌ_v = 𝒩(δ·v, 1); X_i ∼ ΞΌ_V for i = 1, …, m; blackboard transcript Ξ .
  minIC(Ο€) := min_{v∈{0,1}} I(Ξ ; X_1 X_2 … X_m | V = v)

  14. Min-information cost. minIC(Ο€) := min_{v∈{0,1}} I(Ξ ; X_1 X_2 … X_m | V = v).
  β€’ We will want this quantity to be Ξ©(1/δ²).
  β€’ Warning: it is not the same thing as I(Ξ ; X_1 X_2 … X_m | V) = 𝔼_{v∼V} [ I(Ξ ; X_1 X_2 … X_m | V = v) ], because one case can be much smaller than the other: a protocol may reveal a lot about the inputs under one value of V and almost nothing under the other, so the minimum can be far below the average.
  β€’ In our case, the need to use minIC instead of IC arises because of the sparsity.

  15. Strong data processing inequality. Setup as before: ΞΌ_v = 𝒩(δ·v, 1); X_i ∼ ΞΌ_V; blackboard transcript Ξ .
  Fact: |Ξ | β‰₯ I(Ξ ; X_1 X_2 … X_m) = Ξ£_i I(Ξ ; X_i | X_{<i})

  16. Strong data processing inequality.
  β€’ ΞΌ_v = 𝒩(δ·v, 1); suppose V ∼ Bernoulli(1/2).
  β€’ For each i, V βˆ’ X_i βˆ’ Ξ  is a Markov chain.
  β€’ Intuition: β€œX_i contains little information about V; there is no way to learn this information except by learning a lot about X_i.”
  β€’ Data processing: I(V; Ξ ) ≀ I(X_i; Ξ ).
  β€’ Strong data processing: I(V; Ξ ) ≀ Ξ² Β· I(X_i; Ξ ) for some Ξ² = Ξ²(ΞΌ_0, ΞΌ_1) < 1.

  17. Strong data processing inequality.
  β€’ ΞΌ_v = 𝒩(δ·v, 1); suppose V ∼ Bernoulli(1/2).
  β€’ For each i, V βˆ’ X_i βˆ’ Ξ  is a Markov chain.
  β€’ Strong data processing: I(V; Ξ ) ≀ Ξ² Β· I(X_i; Ξ ) for some Ξ² = Ξ²(ΞΌ_0, ΞΌ_1) < 1.
  β€’ In this case (ΞΌ_0 = 𝒩(0, 1), ΞΌ_1 = 𝒩(Ξ΄, 1)):
  Ξ²(ΞΌ_0, ΞΌ_1) ∼ I(V; sign(X_i)) / I(X_i; sign(X_i)) ∼ δ²
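As a numeric sanity check of the last bullet (my sketch, not code from the talk): for V ∼ Bernoulli(1/2), X ∼ 𝒩(δV, 1) and the one-bit test T = 1{X > 0}, the ratio I(V;T)/I(X;T) can be computed in closed form, and it scales like δ².

```python
import numpy as np
from scipy.stats import norm

def H2(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def sdpi_ratio(delta):
    # V ~ Bernoulli(1/2), X ~ N(delta*V, 1), T = 1{X > 0}.
    p0 = 0.5                # P(T = 1 | V = 0)
    p1 = norm.cdf(delta)    # P(T = 1 | V = 1)
    p = 0.5 * (p0 + p1)     # P(T = 1)
    I_VT = H2(p) - 0.5 * (H2(p0) + H2(p1))   # I(V;T)
    I_XT = H2(p)            # I(X;T) = H(T), since T is a function of X
    return I_VT / I_XT

for delta in (0.2, 0.1, 0.05, 0.02):
    r = sdpi_ratio(delta)
    print(f"delta={delta:5.2f}  ratio={r:.3e}  ratio/delta^2={r / delta**2:.3f}")
```

The printed ratio/δ² settles near a constant as Ξ΄ shrinks, consistent with Ξ²(ΞΌ_0, ΞΌ_1) ∼ δ².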

  18. β€œProof”
  β€’ ΞΌ_v = 𝒩(δ·v, 1); suppose V ∼ Bernoulli(1/2).
  β€’ Strong data processing: I(V; Ξ ) ≀ δ² Β· I(X_i; Ξ ).
  β€’ We know I(V; Ξ ) = Ξ©(1).
  β€’ Chain (displayed below): |Ξ | β‰₯ I(Ξ ; X_1 X_2 … X_m) ≳ Ξ£_i I(Ξ ; X_i) β‰₯ (1/δ²) Ξ£_i (β€œinfo Ξ  conveys about V through player i”) ≳ (1/δ²) I(V; Ξ ) = Ξ©(1/δ²). Q.E.D!
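The chain on this slide, written out as a display (the underbraced quantity is the per-player SDPI term from the previous slides):

```latex
\[
|\Pi| \;\ge\; I(\Pi; X_1 X_2 \cdots X_m)
      \;\gtrsim\; \sum_{i} I(\Pi; X_i)
      \;\ge\; \frac{1}{\delta^{2}} \sum_{i}
        \underbrace{\big(\text{info } \Pi \text{ conveys about } V
                         \text{ through player } i\big)}_{\text{SDPI, per player}}
      \;\gtrsim\; \frac{1}{\delta^{2}}\, I(V; \Pi)
      \;=\; \Omega\!\Big(\frac{1}{\delta^{2}}\Big).
\]
```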

  19. Issues with the proof.
  β€’ The right high-level idea.
  β€’ Two main issues:
  – Not clear how to deal with additivity over coordinates.
  – Dealing with minIC instead of IC.

  20. If the picture were this… ΞΌ_v = 𝒩(δ·v, 1), but only player 1's input depends on V: X_1 ∼ ΞΌ_v, X_2 ∼ ΞΌ_0, …, X_m ∼ ΞΌ_0; blackboard transcript Ξ .
  Then indeed I(Ξ ; V) ≀ δ² Β· I(Ξ ; X_1).

  21. Hellinger distance.
  β€’ Solution to additivity: use the squared Hellinger distance hΒ²(f, g) = Β½ ∫_Ξ© (√f(x) βˆ’ √g(x))Β² dx (normalized so that hΒ² ≀ 1).
  β€’ Following [Jayram'09]: hΒ²(Ξ _{V=0}, Ξ _{V=1}) ∼ I(V; Ξ ) = Ξ©(1).
  β€’ hΒ²(Ξ _{V=0}, Ξ _{V=1}) decomposes into m scenarios as above (one per player, as in the previous picture), using the fact that Ξ  is a protocol.
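A quick numeric check (a sketch, using the normalized convention hΒ² = Β½∫(√f βˆ’ √g)Β² dx stated above) that for ΞΌ_0 = 𝒩(0, 1) and ΞΌ_1 = 𝒩(Ξ΄, 1) the squared Hellinger distance between the two single-sample distributions is of order δ²:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def hellinger_sq(delta):
    """h^2(N(0,1), N(delta,1)) = 0.5 * integral of (sqrt(f) - sqrt(g))^2 dx."""
    integrand = lambda x: (np.sqrt(norm.pdf(x)) - np.sqrt(norm.pdf(x, loc=delta))) ** 2
    val, _ = quad(integrand, -np.inf, np.inf)
    return 0.5 * val

for delta in (0.5, 0.2, 0.1):
    h2 = hellinger_sq(delta)
    closed = 1 - np.exp(-delta**2 / 8)   # known closed form under this convention
    print(f"delta={delta}: numeric={h2:.6f}  closed form={closed:.6f}  h2/delta^2={h2 / delta**2:.4f}")
```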

  22. π‘›π‘—π‘œπ½π· β€’ Dealing with π‘›π‘—π‘œπ½π· is more technical. Recall: β€’ π‘›π‘—π‘œπ½π· 𝜌 ≔ min π‘€βˆˆ{0,1} 𝐽(Ξ ; π‘Œ 1 π‘Œ 2 … π‘Œ 𝑛 |π‘Š = 𝑀) β€’ Leads to our main technical statement: β€œDistributed Strong Data Processing Inequality” Theorem: Suppose Ξ© 1 β‹… 𝜈 0 ≀ 𝜈 1 ≀ 𝑃 1 β‹… 𝜈 0 , and let 𝛾(𝜈 0 , 𝜈 1 ) be the SDPI constant. Then β„Ž 2 Ξ  π‘Š=0 , Ξ  π‘Š=1 ≀ 𝑃 𝛾 𝜈 0 , 𝜈 1 β‹… π‘›π‘—π‘œπ½π·(𝜌) 22

  23. Putting it together.
  Theorem: Suppose Ξ©(1) Β· ΞΌ_0 ≀ ΞΌ_1 ≀ O(1) Β· ΞΌ_0, and let Ξ²(ΞΌ_0, ΞΌ_1) be the SDPI constant. Then hΒ²(Ξ _{V=0}, Ξ _{V=1}) ≀ O(Ξ²(ΞΌ_0, ΞΌ_1)) Β· minIC(Ο€).
  β€’ With ΞΌ_0 = 𝒩(0, 1), ΞΌ_1 = 𝒩(Ξ΄, 1), and Ξ² ∼ δ², we get Ξ©(1) = hΒ²(Ξ _{V=0}, Ξ _{V=1}) ≀ δ² Β· minIC(Ο€).
  β€’ Therefore minIC(Ο€) = Ξ©(1/δ²).

  24. Putting it together: the condition is essential!
  Theorem: Suppose Ξ©(1) Β· ΞΌ_0 ≀ ΞΌ_1 ≀ O(1) Β· ΞΌ_0, and let Ξ²(ΞΌ_0, ΞΌ_1) be the SDPI constant. Then hΒ²(Ξ _{V=0}, Ξ _{V=1}) ≀ O(Ξ²(ΞΌ_0, ΞΌ_1)) Β· minIC(Ο€).
  β€’ With ΞΌ_0 = 𝒩(0, 1), ΞΌ_1 = 𝒩(Ξ΄, 1), the condition Ξ©(1) Β· ΞΌ_0 ≀ ΞΌ_1 ≀ O(1) Β· ΞΌ_0 fails!
  β€’ Need an additional truncation step. Fortunately, the failure happens far in the tails.

  25. Summary (proof pipeline).
  β€’ Gaussian mean detection (n β†’ 1 sample, minIC): by strong data processing, β€œonly get δ² bits of Hellinger distance toward detection per bit of minIC”, which gives a 1/δ² lower bound.
  β€’ A direct sum argument (via minIC) lifts the detection bound to sparse Gaussian mean estimation.
  β€’ A reduction in the style of [ZDJW'13] carries the bound over to distributed sparse linear regression.

  26. Distributed sparse linear regression.
  β€’ Each machine gets n data points of the form (A_j, y_j), where y_j = ⟨A_j, θ⟩ + w_j, with w_j ∼ 𝒩(0, σ²).
  β€’ Promised that ΞΈ is k-sparse: β€–ΞΈβ€–β‚€ ≀ k.
  β€’ Ambient dimension d.
  β€’ Loss R = 𝔼‖θ̂ βˆ’ ΞΈβ€–β‚‚Β².
  β€’ How much communication is needed to achieve the statistically optimal loss?
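For concreteness, a minimal data-generation sketch for this model (parameter values illustrative; the design rows A_j are drawn i.i.d. standard normal here, an assumption the slide does not specify):

```python
import numpy as np

d, k, m, n, sigma = 500, 5, 10, 100, 0.5     # illustrative values
rng = np.random.default_rng(1)

theta = np.zeros(d)                          # k-sparse parameter
theta[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)

machines = []
for _ in range(m):
    A = rng.standard_normal((n, d))          # n design rows A_j (assumed Gaussian here)
    y = A @ theta + sigma * rng.standard_normal(n)   # y_j = <A_j, theta> + w_j
    machines.append((A, y))

# The slide's question: how much communication among the m machines is needed
# so that the final estimate attains E||theta_hat - theta||^2 at the
# statistically optimal rate?
```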
