  1. Time-Space Tradeoffs for Two-Pass Learning Sumegha Garg (Princeton) Joint Work with Ran Raz (Princeton) and Avishay Tal (UC Berkeley)

  2. [Shamir 14], [Steinhardt-Valiant-Wager 15] initiated the study of memory-sample lower bounds for learning: can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints, when the samples are viewed one by one? (Also known as online learning.)

  3. What if two passes are allowed? Can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints when the learner is allowed to go over the stream of samples twice (in the same order)?

  4. Toy Example: Parity Learning

  5. Parity Learning: x ∈_R {0,1}^n is unknown. A learner tries to learn x from (a_1, b_1), (a_2, b_2), …, (a_m, b_m), where ∀t, a_t ∈_R {0,1}^n and b_t = <a_t, x> (inner product mod 2). In other words, the learner gets random linear equations in x_1, x_2, …, x_n, one by one, and needs to solve them.
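
A minimal Python sketch of the sample stream just defined (not from the talk; parity_stream and the toy parameters below are illustrative names):

```python
import random

def parity_stream(x, m):
    """Yield m samples (a_t, b_t) with a_t uniform in {0,1}^n and b_t = <a_t, x> mod 2."""
    n = len(x)
    for _ in range(m):
        a = [random.randint(0, 1) for _ in range(n)]
        b = sum(ai * xi for ai, xi in zip(a, x)) % 2   # inner product mod 2
        yield a, b

# Example: a hidden x of length 4 and a stream of 6 random equations.
x = [1, 0, 1, 1]
for a, b in parity_stream(x, 6):
    print(a, b)
```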

  6. Parity Learners
  ● Solve independent linear equations (Gaussian elimination)
    ○ O(n) samples and O(n²) memory
  ● Try all possibilities of x
    ○ O(n) memory but an exponential number of samples
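
The first learner, as a hedged sketch: online Gaussian elimination over GF(2) that stores at most n reduced equations, i.e., O(n²) bits (the function name and interface are my own):

```python
def gaussian_elimination_learner(stream, n):
    """Online parity learner: keeps a reduced basis of seen equations (<= n rows of n+1 bits)."""
    basis = {}  # pivot index -> (equation as a bitmask over x_0..x_{n-1}, right-hand side bit)
    for a, b in stream:
        row = 0
        for i, ai in enumerate(a):                 # pack the equation into a bitmask
            row |= ai << i
        for piv, (prow, pb) in basis.items():      # reduce it by the stored equations
            if (row >> piv) & 1:
                row ^= prow
                b ^= pb
        if row == 0:
            continue                               # linearly dependent: nothing new learned
        piv = row.bit_length() - 1                 # leading variable of the new equation
        for q, (qrow, qb) in list(basis.items()):  # keep the stored basis fully reduced
            if (qrow >> piv) & 1:
                basis[q] = (qrow ^ row, qb ^ b)
        basis[piv] = (row, b)
        if len(basis) == n:                        # basis is now the identity: read off x
            return [basis[i][1] for i in range(n)]
    return None                                    # not enough independent equations yet

# Tiny check over n = 3 variables: x_0 = 1, x_0 + x_1 = 1, x_1 + x_2 = 1  =>  x = [1, 0, 1].
print(gaussian_elimination_learner([([1, 0, 0], 1), ([1, 1, 0], 1), ([0, 1, 1], 1)], 3))
```

With random samples, n independent equations arrive after O(n) samples in expectation, matching the O(n) samples / O(n²) memory point above.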

  7. Parity Learning (Two-Pass): x ∈_R {0,1}^n is unknown. A learner tries to learn x from (a_1, b_1), (a_2, b_2), …, (a_m, b_m), (a_1, b_1), (a_2, b_2), …, (a_m, b_m), where ∀t, a_t ∈_R {0,1}^n and b_t = <a_t, x> (inner product mod 2).

  8. Raz's Breakthrough '16 (One-Pass): Any algorithm for parity learning of size n requires either Ω(n²) memory bits or an exponential number of samples. Conjectured by Steinhardt, Valiant and Wager [2015].

  9. Subsequent Results (One-Pass): [Kol-Raz-Tal '17]: generalization to sparse parities. [Raz '17, Moshkovitz-Moshkovitz '17, Moshkovitz-Tishby '17, Moshkovitz-Moshkovitz '18, Garg-Raz-Tal '18, Beame-Gharan-Yang '18]: generalization to a larger class of problems. [Sharan-Sidford-Valiant '19]: generalization to real-valued learning.

  10. Related Results (Multiple-Pass): [Dagan-Shamir '18, Assadi-Chen-Khanna '19, …]: use communication complexity (a quite different technique, giving at most a polynomial bound on the number of samples).

  11. Motivation: Learning Theory, Bounded-Storage Cryptography, Complexity Theory. Combined with [Barrington '89], proving super-polynomial lower bounds on the time needed to compute a function by a branching program of width 5, with polynomially many passes over the input, would imply super-polynomial lower bounds on formula size. Technically challenging: previous techniques rely heavily on the fact that in the one-pass case all the samples are independent.

  12. Our Result

  13. Our Result for Parity Learning: Any two-pass algorithm for parity learning of size n requires either Ω(n^{1.5}) memory bits or 2^{Ω(√n)} samples. (No matching upper bound.)

  14. Learning Problem as a Matrix: A, X: finite sets. M: A × X → {-1, 1}: a matrix. x ∈_R X is unknown. A learner tries to learn x from a stream (a_1, b_1), (a_2, b_2), …, (a_m, b_m), (a_1, b_1), (a_2, b_2), …, (a_m, b_m), where ∀t: a_t ∈_R A and b_t = M(a_t, x). For parity learning: X = concept class = {0,1}^n; A = possible samples = {0,1}^n.
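
A sketch of how parity learning fits this matrix view, with M(a, x) = (-1)^{<a, x> mod 2}; encoding b_t multiplicatively as ±1 is my own convention for the example:

```python
from itertools import product

def parity_matrix(n):
    """The matrix M: A x X -> {-1, 1} for parity learning, with A = X = {0,1}^n."""
    A = list(product([0, 1], repeat=n))
    X = list(product([0, 1], repeat=n))
    M = {(a, x): (-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2) for a in A for x in X}
    return A, X, M

# A sample (a_t, b_t) of the streaming problem corresponds to the matrix entry M(a_t, x),
# with b_t = 0 written as +1 and b_t = 1 written as -1.
A, X, M = parity_matrix(3)
print(M[((1, 1, 0), (1, 0, 1))])   # <a, x> = 1 mod 2, so M(a, x) = -1
```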

  15. Generalized Result: Assume that any submatrix of M with at least 2^{-k}·|A| rows and at least 2^{-ℓ}·|X| columns has a bias of at most 2^{-r}. Then: any two-pass algorithm requires either Ω(k·min{k, ℓ}) memory bits or 2^{Ω(min{k, ℓ, r})} samples. In contrast, [GRT '18] proved: any one-pass algorithm requires either Ω(k·ℓ) memory bits or 2^{Ω(r)} samples.

  16. Branching Program (length m, width d, 2-pass). [Figure: a layered branching program; Part 1 reads (a_1, b_1), …, (a_m, b_m) and Part 2 reads the same stream again.] Each layer represents a time step. Each vertex represents a memory state of the learner (d = 2^{memory}). Each non-leaf vertex has 2^{n+1} outgoing edges, one for each (a, b) ∈ {0,1}^n × {-1, 1}.

  17. Branching Program (length m, width d, 2-pass). [Figure: the same program, with the computation path following the edges labeled by the samples.] The samples (a_1, b_1), …, (a_m, b_m), (a_1, b_1), …, (a_m, b_m) define a computation path. Each vertex v in the last layer is labeled by some x̃(v) ∈ {0,1}^n. The output is the label x̃(v) of the vertex reached by the path.
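
A small operational sketch of such a learner: a two-pass, bounded-memory algorithm whose memory state plays the role of the current vertex (the step/output interface below is an assumption of the sketch, not something defined in the talk):

```python
def run_two_pass(samples, init_state, step, output):
    """Run a bounded-memory learner over the stream twice, in the same order.

    The learner corresponds to a branching program of width d = 2^(memory bits):
    every memory state is a vertex, and on sample (a, b) it follows one of the
    2^(n+1) outgoing edges of the current vertex.
    """
    state = init_state
    for _pass in range(2):            # Part 1, then Part 2, over the same samples
        for a, b in samples:
            state = step(state, a, b)
    return output(state)              # the label x~(v) of the final vertex reached

# Toy instance: a learner whose whole memory is just the last sample it has seen.
print(run_two_pass([([1, 0], 1), ([0, 1], 0)], None,
                   step=lambda s, a, b: (a, b),
                   output=lambda s: s))
```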

  18. Brief Overview of the One-Pass Lower Bound [GRT '18]: P_{x|v} = distribution of x conditioned on the event that the computation path reaches v. Pr(v) = probability that the path reaches v. Significant vertices: v s.t. ||P_{x|v}||_2 ≥ 2^ℓ · 2^{-n}. GRT proves: if v is significant, then Pr(v) ≤ 2^{-Ω(k·ℓ)}. Hence, at least 2^{Ω(k·ℓ)} significant vertices are needed to output the correct answer with high probability.
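
A toy illustration of the quantity ||P_{x|v}||_2, assuming the expectation-style L2 norm (under which the uniform distribution over {0,1}^n has norm exactly 2^{-n}, matching the 2^ℓ · 2^{-n} threshold above); the state v here is the simplistic one that remembers a few equations verbatim, and all helper names are mine:

```python
from itertools import product

def posterior_given_equations(n, equations):
    """P_{x|v} for a toy state v that remembers <a, x> = b exactly for each (a, b) given."""
    sols = [x for x in product([0, 1], repeat=n)
            if all(sum(ai * xi for ai, xi in zip(a, x)) % 2 == b for a, b in equations)]
    sol_set, p = set(sols), 1.0 / len(sols)
    return {x: (p if x in sol_set else 0.0) for x in product([0, 1], repeat=n)}

def l2_norm(P, n):
    """Expectation-style L2 norm: the uniform distribution over {0,1}^n has norm 2^-n."""
    return (sum(v * v for v in P.values()) / 2 ** n) ** 0.5

n = 4
P0 = posterior_given_equations(n, [])                                      # no information
P2 = posterior_given_equations(n, [((1, 0, 0, 0), 1), ((0, 1, 0, 0), 0)])  # two equations known
print(l2_norm(P0, n))   # 2^-4: the uniform baseline
print(l2_norm(P2, n))   # 2^-3: each remembered equation doubles ||P_{x|v}||_2 squared
```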

  19. Brief Overview of the One-Pass Lower Bound: P_{x|v} = distribution of x conditioned on the event that the computation path reaches v. Pr(v) = probability that the path reaches v under T, where T is the same as the computation path, but stops when "atypical" things happen (traversing a bad edge, and …). Bad edges: a s.t. |(M · P_{x|v})(a)| ≥ 2^{-r}. Pr[T stops] is exponentially small (uses a ∈_R {0,1}^n!).
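
Continuing the toy example above (and reusing P0 and P2 from it), a sketch of the bias behind "bad edges" for the parity matrix; reading (M · P_{x|v})(a) as the expectation E_{x ~ P_{x|v}}[M(a, x)] is my assumption about the normalization:

```python
def edge_bias(a, P):
    """(M . P_{x|v})(a) = E_{x ~ P_{x|v}}[M(a, x)] for the parity matrix M(a, x) = (-1)^<a,x>."""
    return sum(((-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2)) * p for x, p in P.items())

print(edge_bias((1, 0, 0, 0), P0))   #  0.0: under the uniform posterior, no edge is bad
print(edge_bias((1, 0, 0, 0), P2))   # -1.0: v already fixed <a, x>, so this edge is (very) bad
print(edge_bias((0, 0, 1, 0), P2))   #  0.0: a direction v knows nothing about is unbiased
```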

  20. Difficulties for Two Passes (1): P_{a|v} ≠ Uniform({0,1}^n) for v in Part 2. E.g., the BP may remember a_1. Therefore, the probability of traversing a "bad edge" may not be small. Bad edges: a s.t. |(M · P_{x|v})(a)| ≥ 2^{-r} (traversing one gives too much information about x). Save: the BP can't remember too many a's. New stopping rules!

  21. Difficulties for Two Passes (2): Proving that if v is significant then Pr(v) ≤ 2^{-Ω(k·ℓ)} uses a ∈_R {0,1}^n along with the extractor property. Save: work on the product of the two parts, which is read-once. New stopping rules!

  22. Product of the Two Parts (length m, width d², 1-pass). [Figure: vertices u_0, u_1, u_2 of Part 1 paired with vertices u'_0, u'_1, u'_2 of Part 2; at each step both coordinates read the same sample (a, b).] Let v_0 be the start vertex of the 2-pass BP, and let v_1, v_2 be the vertices reached at the end of Part 1 and Part 2, respectively. Then v_0 → v_1 → v_2 ≡ (v_0, v_1) → (v_1, v_2).
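
A sketch of the product construction in code: guess the mid-point vertex v_1 and advance Part 1 (from v_0) and Part 2 (from v_1) in lockstep on a single read of the stream (step1/step2 as explicit transition functions are an assumption of the sketch):

```python
def product_program(step1, step2, v0, v1):
    """One-pass 'product of the two parts' of a 2-pass branching program.

    Its states are pairs of vertices, so its width is d * d = d^2.
    """
    def product_step(pair, a, b):
        u, u2 = pair
        return (step1(u, a, b), step2(u2, a, b))   # both parts read the same sample
    return (v0, v1), product_step

# If Part 1, started at v0, really does end at the guessed vertex v1, then the product
# run (v0, v1) -> (v1, v2) corresponds to the 2-pass run v0 -> v1 -> v2, while reading
# the sample stream only once.
```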

  23. Proof Outline: Stopping Rules for the Product. Significant vertices: (v, v') s.t. ||P_{x | (v_0, v_1) → (v, v')}||_2 ≥ 2^ℓ · 2^{-n}. Bad edges: a s.t. |(M · P_{x | (v_0, v_1) → (v, v')})(a)| ≥ 2^{-r}. High-probability edges: a s.t. Pr[a | v_0 → v → v_1 → v'] ≥ 2^k · 2^{-n}. … Stop at bad edges, unless they are high-probability edges; stop at high-probability edges only when they are very bad.

  24. Proof Outline: Stopping Rules for the Product. Conditioned on v_0 → v → v_1 → v', Pr[stops] is small (1/100). Note: conditioning on v_0 → v → v_1 → v' is not the same as conditioning on (v_0, v_1) → (v, v'). Proved using the single-pass result as a subroutine.

  25. Open Problems
  ● Generalize to multiple passes
  ● Better lower bounds for two passes
  ● Non-trivial upper bounds for constant / linear number of passes

  26. Thank You! Anyone want a second pass?
