Learning Fast Requires Good Memory: Time-Space Tradeoff Lower Bounds for Learning Ran Raz Princeton University Based on joint works with: Sumegha Garg, Gillat Kol, Avishay Tal [R16, KRT17, R17, GRT18]
This Talk: A line of recent works studies time-space (memory-samples) lower bounds for learning [S14, SVW16, R16, VV16, KRT17, MM17, R17, MM18, BOGY18, GRT18, DS18, AS18, DKS19, SSV19, GRT19, GKR19]. Main Message: For some learning problems, access to a relatively large memory is crucial. In other words, in some cases learning is infeasible due to memory constraints.
Original Motivation: Online Learning Theory: Initiated by [Shamir 2014] and [Steinhardt-Valiant-Wager 2015]: Can one prove unconditional lower bounds on the number of samples needed for learning, under memory constraints? (Here each sample is viewed only once, also known as online learning.)
Example: Parity Learning: x = (x_1, …, x_n) ∈ {0,1}^n is unknown. A learner gets a stream of random linear equations (mod 2) in x_1, …, x_n, one by one, and tries to learn x. Formally: the learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where ∀t: a_t ∈ {0,1}^n and b_t = a_t · x (inner product mod 2). The learner needs to solve the equations and find x (no noise).
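To make the streaming setup concrete, here is a minimal Python sketch of such a sample stream. The function name parity_sample_stream and the choice to also draw the hidden x uniformly at random are illustrative assumptions, not part of the statement above.

    import random

    def parity_sample_stream(n, num_samples, seed=0):
        """Yield samples (a_t, b_t) with b_t = <a_t, x> (mod 2) for a hidden x."""
        rng = random.Random(seed)
        x = [rng.randint(0, 1) for _ in range(n)]         # hidden vector (assumed uniform here)
        for _ in range(num_samples):
            a = [rng.randint(0, 1) for _ in range(n)]     # a fresh random linear equation
            b = sum(ai * xi for ai, xi in zip(a, x)) % 2  # right-hand side: inner product mod 2, no noise
            yield a, b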
Ready to Play? x = (x_1, x_2, x_3, x_4, x_5) is unknown. The samples arrive one by one:
a_1 = (1,1,0,1,1), b_1 = 0:  x_1 + x_2 + x_4 + x_5 = 0 (mod 2)
a_2 = (0,1,1,0,0), b_2 = 0:  x_2 + x_3 = 0 (mod 2)
a_3 = (0,0,1,1,1), b_3 = 0:  x_3 + x_4 + x_5 = 0 (mod 2)
a_4 = (0,1,1,1,0), b_4 = 0:  x_2 + x_3 + x_4 = 0 (mod 2)
a_5 = (1,1,0,0,1), b_5 = 0:  x_1 + x_2 + x_5 = 0 (mod 2)
a_6 = (0,0,1,1,0), b_6 = 1:  x_3 + x_4 = 1 (mod 2)
a_7 = (0,1,0,1,1), b_7 = 0:  x_2 + x_4 + x_5 = 0 (mod 2)
a_8 = (1,0,0,0,1), b_8 = 1:  x_1 + x_5 = 1 (mod 2)
a_9 = (1,1,1,1,0), b_9 = 0:  x_1 + x_2 + x_3 + x_4 = 0 (mod 2)
a_10 = (0,1,1,1,1), b_10 = 1:  x_2 + x_3 + x_4 + x_5 = 1 (mod 2)
a_11 = (0,0,0,0,0), b_11 = 0:  0 = 0 (mod 2)
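As a sanity check on the game above, the eleven equations can be fed to a small Gaussian-elimination routine over GF(2). This is only an illustrative sketch of the "solve the linear equations" strategy (the helper solve_gf2 is not from the talk); it recovers x = (0, 1, 1, 0, 1), which indeed satisfies all of the equations.

    def solve_gf2(rows):
        """Gaussian elimination mod 2; rows are (a, b) pairs. Returns one solution x."""
        rows = [(a[:], b) for a, b in rows]
        n = len(rows[0][0])
        pivots = {}                                   # pivot column -> row index
        r = 0
        for col in range(n):
            piv = next((i for i in range(r, len(rows)) if rows[i][0][col]), None)
            if piv is None:
                continue                              # free column
            rows[r], rows[piv] = rows[piv], rows[r]
            for i in range(len(rows)):
                if i != r and rows[i][0][col]:
                    rows[i] = ([u ^ v for u, v in zip(rows[i][0], rows[r][0])],
                               rows[i][1] ^ rows[r][1])
            pivots[col] = r
            r += 1
        x = [0] * n                                   # free variables set to 0
        for col, row in pivots.items():
            x[col] = rows[row][1]
        return x

    # The eleven samples (a_t, b_t) from the game above:
    samples = [
        ([1, 1, 0, 1, 1], 0), ([0, 1, 1, 0, 0], 0), ([0, 0, 1, 1, 1], 0),
        ([0, 1, 1, 1, 0], 0), ([1, 1, 0, 0, 1], 0), ([0, 0, 1, 1, 0], 1),
        ([0, 1, 0, 1, 1], 0), ([1, 0, 0, 0, 1], 1), ([1, 1, 1, 1, 0], 0),
        ([0, 1, 1, 1, 1], 1), ([0, 0, 0, 0, 0], 0),
    ]
    print(solve_gf2(samples))   # -> [0, 1, 1, 0, 1]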
Parity Learning: x ∈ {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where ∀t: a_t ∈ {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
By solving linear equations: O(n) samples, O(n²) memory bits.
By trying all possibilities: O(n) memory bits, an exponential number of samples.
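The second extreme can also be sketched in a few lines (again an illustrative toy, not an algorithm from the talk): keep a single candidate for x plus a small counter, and move to the next candidate whenever a sample refutes the current one. The state is O(n) bits, but the number of samples consumed grows exponentially in n. The name low_memory_parity_learner and the stream interface of the earlier sketch are assumptions.

    def low_memory_parity_learner(stream, n, confirmations=100):
        """Keep one candidate for x (n bits) and a small counter; advance to the
        next candidate whenever a sample refutes the current one."""
        candidate = 0                                     # candidate for x, packed into an integer
        streak = 0                                        # consecutive samples consistent with it
        for a, b in stream:
            a_int = sum(bit << i for i, bit in enumerate(a))
            pred = bin(candidate & a_int).count("1") % 2  # <candidate, a> mod 2
            if pred == b:
                streak += 1
                if streak >= confirmations:               # survived many random equations
                    return [(candidate >> i) & 1 for i in range(n)]
            else:
                candidate = (candidate + 1) % (1 << n)    # refuted: try the next candidate
                streak = 0
        return None                                       # stream ended before a candidate was confirmed

For example, with the earlier generator one could call low_memory_parity_learner(parity_sample_stream(16, 10**7), 16), at the price of consuming on the order of 2^16 samples before the correct candidate is reached.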
[R 2016]: Any algorithm for parity learning requires either n²/10 memory bits or an exponential number of samples. (Conjectured by Steinhardt, Valiant and Wager [2015].)
Previously: no lower bound on the number of samples was known, even if the memory size is n (for any learning problem). (For memory of size < n, it is relatively easy to prove lower bounds, since inner product is a good two-source extractor.)
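For context, the quantitative fact behind that parenthetical (a standard bound, recalled here from memory rather than taken from the slides): by Lindsey's lemma, for independent sources X, Y over {0,1}^n with min-entropies k_1 and k_2,

    \[
      \Bigl|\, \Pr[\langle X, Y\rangle \equiv 0 \ (\mathrm{mod}\ 2)] - \tfrac{1}{2} \,\Bigr|
      \;\le\; 2^{\frac{n - k_1 - k_2}{2} - 1},
    \]

so the inner product is close to an unbiased bit whenever k_1 + k_2 exceeds n by a sufficient margin. Roughly speaking, an m-bit memory state typically leaves about n - m bits of entropy in x, while each fresh a_t has entropy n, which is why memory well below n is comparatively easy to handle.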
I will focus on super-linear lower bounds on the memory size.
Best upper bound on the memory size: ≈ n²/4 (when the number of samples is sub-exponential).
Motivation: Machine Learning Theory: For some online learning problems, access to a relatively large memory is crucial. In some cases, learning is infeasible due to memory constraints (if each sample is viewed only once).
It is very interesting to understand how much memory is needed for learning. Our result gives a concept class that can be efficiently learned if and only if the learner has a quadratic-size memory.
“Good” memory may be crucial in learning processes.
Example: Neural Networks: Many learning algorithms try to learn a concept by modeling it as a neural network. The algorithm keeps some neural network in memory and updates its weights when new samples arrive. The memory used is the size of the network.
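To make that accounting concrete (a generic sketch, not from the talk; the name online_sgd, the tiny sigmoid architecture, and the squared loss are arbitrary illustrative choices): a one-pass learner that takes a single gradient step per sample keeps only the weight matrices between samples, so its memory is proportional to the size of the network.

    import numpy as np

    def online_sgd(stream, layer_sizes, lr=0.1, seed=0):
        """One-pass SGD on a small fully-connected network with sigmoid units.
        The only state kept between samples is the list of weight matrices."""
        rng = np.random.default_rng(seed)
        weights = [rng.normal(scale=0.1, size=(m, k))
                   for m, k in zip(layer_sizes[:-1], layer_sizes[1:])]
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        for x, y in stream:                               # each sample is seen exactly once
            # forward pass
            activations = [np.asarray(x, dtype=float)]
            for W in weights:
                activations.append(sigmoid(activations[-1] @ W))
            # backward pass (squared loss), updating the weights in place
            delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])
            for i in reversed(range(len(weights))):
                grad = np.outer(activations[i], delta)
                delta = (delta @ weights[i].T) * activations[i] * (1 - activations[i])
                weights[i] -= lr * grad
        return weights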