Time-Space Tradeoffs for Two-Pass Learning Sumegha Garg (Princeton) Joint Work with Ran Raz (Princeton) and Avishay Tal (UC Berkeley)
[Shamir '14], [Steinhardt-Valiant-Wager '15] initiated the study of memory-samples lower bounds for learning: can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints, when samples are viewed one by one (also known as online learning)?
What if two passes are allowed? Can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints, when the learner is allowed to go over the stream of samples twice (in the same order)?
Toy-Example: Parity Learning
Parity Learning
y ∈_R {0,1}^n is unknown. A learner tries to learn y from a stream (a_1, b_1), (a_2, b_2), …, (a_m, b_m), where for every t, a_t ∈_R {0,1}^n and b_t = ⟨a_t, y⟩ (inner product mod 2). In other words, the learner gets random linear equations in y_1, y_2, …, y_n, one by one, and needs to solve them.
Parity Learners
Solve n linearly independent equations (Gaussian elimination) → O(n) samples but O(n^2) memory bits.
Try all possibilities for y → O(n) memory bits but an exponential number of samples.
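The Gaussian-elimination learner can be sketched as follows (a minimal illustration, not from the talk; all function names are mine). Each incoming equation ⟨a, y⟩ = b is reduced against the stored rows, and y is read off once n independent equations have been seen:

```python
import random

def parity_learner_gaussian(samples, n):
    """Online Gaussian elimination over GF(2).

    Each equation <a, y> = b is packed into one integer:
    bit 0 holds b, bit i+1 holds the coefficient of y_i.
    Stores at most n reduced rows -> O(n^2) bits of memory;
    recovers y after roughly n independent equations."""
    pivots = {}  # pivot column -> reduced row
    for a, b in samples:
        row = (a << 1) | b
        # reduce the incoming equation against the stored pivot rows
        for col, prow in pivots.items():
            if (row >> (col + 1)) & 1:
                row ^= prow
        if row >> 1:  # coefficients nonzero: equation is independent
            col = row.bit_length() - 2  # its highest coefficient bit
            # Gauss-Jordan: clear this column from all stored rows
            for c in pivots:
                if (pivots[c] >> (col + 1)) & 1:
                    pivots[c] ^= row
            pivots[col] = row
        if len(pivots) == n:  # full rank: each row pins down one bit of y
            return sum((prow & 1) << col for col, prow in pivots.items())
    return None  # not enough independent equations seen

def sample_stream(y, n, m, rng):
    """m random samples (a, b) with b = <a, y> mod 2."""
    for _ in range(m):
        a = rng.getrandbits(n)
        yield a, bin(a & y).count("1") & 1
```

Roughly n samples suffice in expectation, but the learner must hold up to n rows of n+1 bits each, illustrating the O(n) samples / O(n^2) memory corner of the tradeoff.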
Parity Learning (Two-pass)
y ∈_R {0,1}^n is unknown. A learner tries to learn y from (a_1, b_1), (a_2, b_2), …, (a_m, b_m), (a_1, b_1), (a_2, b_2), …, (a_m, b_m), where for every t, a_t ∈_R {0,1}^n and b_t = ⟨a_t, y⟩ (inner product mod 2).
Raz's Breakthrough '16 (One-pass)
Any algorithm for parity learning of size n requires either Ω(n^2) memory bits or an exponential number of samples. Conjectured by Steinhardt, Valiant and Wager [2015].
Subsequent Results (One-pass)
[Kol-Raz-Tal '17]: generalization to sparse parities.
[Raz '17, Moshkovitz-Moshkovitz '17, Moshkovitz-Tishby '17, Moshkovitz-Moshkovitz '18, Garg-Raz-Tal '18, Beame-Gharan-Yang '18]: generalization to a larger class of problems.
[Sharan-Sidford-Valiant '19]: generalization to real-valued learning.
Related Results (Multiple-pass)
[Dagan-Shamir '18, Assadi-Chen-Khanna '19, …]: use communication complexity (a quite different technique; gives at most polynomial bounds on the number of samples).
Motivation
Learning theory, bounded-storage cryptography, complexity theory.
Combined with [Barrington '89], proving super-polynomial lower bounds on the time needed to compute a function by a branching program of width 5, with polynomially many passes over the input, would imply super-polynomial lower bounds on formula size.
Technically challenging: previous techniques rely heavily on the fact that in the one-pass case all the samples are independent.
Our Result
Our Result for Parity Learning
Any two-pass algorithm for parity learning of size n requires either Ω(n^{1.5}) memory bits or 2^{Ω(√n)} samples. (No matching upper bound.)
Learning Problem as a Matrix
A, X: finite sets. M: A × X → {−1, 1}: a matrix.
y ∈_R X is unknown. A learner tries to learn y from a stream (a_1, b_1), …, (a_m, b_m), (a_1, b_1), …, (a_m, b_m), where for every t, a_t ∈_R A and b_t = M(a_t, y).
X: concept class, = {0,1}^n for parity learning.
A: possible samples, = {0,1}^{n'}.
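As a concrete instance of this framework (a toy sketch, not from the talk; variable names are mine), parity learning corresponds to the matrix M(a, x) = (−1)^{⟨a, x⟩} with A = X = {0,1}^n:

```python
import random

n = 3
X = range(2 ** n)  # concept class {0,1}^n, encoded as integers
A = range(2 ** n)  # sample space  {0,1}^n

def M(a, x):
    """Parity as a matrix: M(a, x) = (-1)^{<a, x>}."""
    return 1 - 2 * (bin(a & x).count("1") & 1)

# One learning instance: a hidden y and a stream (a_t, b_t) with b_t = M(a_t, y)
rng = random.Random(1)
y = rng.choice(list(X))
stream = [(a, M(a, y)) for a in (rng.choice(list(A)) for _ in range(8))]
```

Note the only change from the parity slide is that the answer bit is written multiplicatively (±1 instead of 0/1), which is what makes the bias of submatrices of M meaningful.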
Generalized Result
Assume that any submatrix of M with at least 2^{−k}·|A| rows and at least 2^{−ℓ}·|X| columns has bias at most 2^{−r}. Then: any two-pass algorithm requires either Ω(k · min{ℓ, √r}) memory bits or 2^{Ω(min{k, ℓ, √r})} samples.
In contrast, [GRT '18] proved: any one-pass algorithm requires either Ω(k · ℓ) memory bits or 2^{Ω(r)} samples.
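To make the submatrix-bias assumption concrete, here is a brute-force check (illustrative only, feasible just for tiny n; not part of the talk) over the parity matrix: every submatrix with at least half the rows and half the columns has small bias. The comparison bound 2/√(2^n) comes from Lindsey's lemma for Hadamard matrices, which the parity matrix is.

```python
from itertools import combinations

n = 3
N = 2 ** n

def M(a, x):
    # parity matrix: M(a, x) = (-1)^{<a, x>}, a Sylvester-Hadamard matrix
    return 1 - 2 * (bin(a & x).count("1") & 1)

def bias(rows, cols):
    # bias of a submatrix = |average of its entries|
    s = sum(M(a, x) for a in rows for x in cols)
    return abs(s) / (len(rows) * len(cols))

# all row/column subsets of size >= N/2, i.e. the case k = l = 1 above
half_sets = [s for size in range(N // 2, N + 1)
             for s in combinations(range(N), size)]
max_bias = max(bias(r, c) for r in half_sets for c in half_sets)
```

The full matrix itself has bias exactly 2^{−n} (only the a = 0 row is unbalanced), while some half-size submatrices reach bias 1/2, showing why both the row and column thresholds are needed in the assumption.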
Branching Program (length m, width d, 2-pass)
[figure: a layered graph reading (a_1, b_1), …, (a_m, b_m) twice along consecutive layers, with edges labeled (a, b)]
Each layer represents a time step. Each vertex represents a memory state of the learner (a learner with s memory bits has width d = 2^s). Each non-leaf vertex has 2^{n'+1} outgoing edges, one for each (a, b) ∈ {0,1}^{n'} × {−1, 1}.
Branching Program (length m, width d, 2-pass)
The samples (a_1, b_1), …, (a_m, b_m), (a_1, b_1), …, (a_m, b_m) define a computation path. Each vertex w in the last layer is labeled by ỹ_w ∈ {0,1}^n. The output is the label ỹ_w of the vertex reached by the path.
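As a toy illustration of this model (all names, and the n = 1 learner below, are mine rather than the talk's), a learner with s memory bits is just a transition function on 2^s states, run over the same stream twice:

```python
def run_two_pass(transition, output, samples, s):
    """Run a 2^s-state learner over the sample stream twice, in the same order."""
    state = 0  # start vertex of the branching program
    for _ in range(2):  # two passes over the same stream
        for a, b in samples:
            state = transition(state, a, b) % (2 ** s)
    return output(state)

# n = 1 parity: y in {0,1} and b = a * y, so one memory bit suffices here
def transition(state, a, b):
    # record b the first time an informative sample a = 1 arrives
    return b if a == 1 else state

y = 1
samples = [(0, 0), (1, y), (0, 0)]
guess = run_two_pass(transition, lambda st: st, samples, s=1)
```

The point of the model is that `state` is the only information carried between samples (and between passes), so the width 2^s caps the learner's memory exactly as in the figure.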
Brief Overview of One-Pass Lower Bound [GRT '18]
P_{y|w} = distribution of y conditioned on the event that the computation path reaches w.
Pr(w) = probability that the path reaches w.
Significant vertices: w s.t. ‖P_{y|w}‖_2 ≥ 2^ℓ · 2^{−n}.
GRT proves: if w is significant, then Pr(w) ≤ 2^{−Ω(k·ℓ)}. Hence, to output the correct answer with high probability there must be at least 2^{Ω(k·ℓ)} significant vertices, so Ω(k·ℓ) memory bits are needed.
Brief Overview of One-Pass Lower Bound
P_{y|w} = distribution of y conditioned on the event that the computation path reaches w.
Pr(w) = probability that the path reaches w under T.
T = same as the computation path, but stops when "atypical" things happen (traversing a bad edge, …).
Bad edges: a s.t. |(M · P_{y|w})(a)| ≥ 2^{−r}.
Pr(T stops) is exponentially small (uses a ∈_R {0,1}^{n'}!).
Difficulties for Two-Passes (1)
P_{a|w} need not be Uniform({0,1}^{n'}) for vertices w in Part 2: e.g., the BP may remember a_1. Therefore, the probability of traversing a "bad edge" may not be small.
Bad edges: a s.t. |(M · P_{y|w})(a)| ≥ 2^{−r} (such an a gives too much information about y).
Save: the BP can't remember too many a's. New stopping rules!
Difficulties for Two-Passes (2)
Proving that if w is significant then Pr(w) ≤ 2^{−Ω(k·ℓ)} uses a ∈_R {0,1}^{n'} along with the extractor property.
Save: work on the product of the 2 parts, which is read-once. New stopping rules!
Product of 2 Parts (length m, width d^2, 1-pass)
[figure: product program C × C′, whose vertices are pairs (v, v′) of states of the two passes, with edges labeled (a, b)]
Let w_0 be the start vertex of the 2-pass BP, and let w_1, w_2 be the vertices reached at the end of Part 1 and Part 2 respectively. Then w_0 → w_1 → w_2 ≡ (w_0, w_1) → (w_1, w_2).
Proof Outline: Stopping Rules for the Product
Significant vertices: (w, w′) s.t. ‖P_{y | (w_0, w_1) → (w, w′)}‖_2 ≥ 2^ℓ · 2^{−n}.
Bad edges: a s.t. |(M · P_{y | (w_0, w_1) → (w, w′)})(a)| ≥ 2^{−r}.
High-probability edges: a s.t. Pr[a | w_0 → w → w_1 → w′] ≥ 2^k · 2^{−n'}.
…
Stop at bad edges, unless they are high-probability edges, unless those are very bad.
Proof Outline: Stopping Rules for the Product
Conditioned on w_0 → w → w_1 → w′, Pr(stop) is small (at most 1/100). Recall w_0 → w → w_1 → w′ ≡ (w_0, w_1) → (w, w′). Proved using the single-pass result as a subroutine.
Open Problems
Generalize to multiple passes.
Better lower bounds for two passes.
Non-trivial upper bounds for a constant or linear number of passes.
Thank You! Anyone want a second pass?