SLIDE 1
Learning Fast Requires Good Memory: Time-Space Tradeoff Lower Bounds for Learning
Ran Raz, Princeton University
Based on joint works with: Sumegha Garg, Gillat Kol, Avishay Tal [R16, KRT17, R17, GRT18]
This Talk: A line of recent works
SLIDE 2
SLIDE 3
Original Motivation: Online Learning Theory: Initiated by [Shamir 2014], [Steinhardt-Valiant-Wager 2015]: Can one prove unconditional lower bounds on the number of samples needed for learning, under memory constraints?
(when each sample is viewed only once, also known as online learning)
SLIDE 4
Example: Parity Learning: x = (x_1, ..., x_n) ∈_R {0,1}^n (i.e., chosen uniformly at random) is unknown. A learner gets a stream of random linear equations (mod 2) in x_1, ..., x_n, one by one, and tries to learn x. Formally: the learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (inner product mod 2). The learner needs to solve the equations and find x (no noise).
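An illustrative Python sketch of this sample stream (the function name parity_sample_stream is not from the talk):

import random

def parity_sample_stream(x):
    # Yields samples (a_t, b_t): a_t uniform in {0,1}^n, b_t = a_t . x (mod 2).
    n = len(x)
    while True:
        a = [random.randrange(2) for _ in range(n)]
        b = sum(ai & xi for ai, xi in zip(a, x)) % 2
        yield a, b

# The learner sees each sample once, in order, and must eventually output x.
x = [random.randrange(2) for _ in range(5)]
for _, (a, b) in zip(range(3), parity_sample_stream(x)):
    print(a, b)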
SLIDES 5-16 (demo)
Ready to Play?
x = (x_1, x_2, x_3, x_4, x_5) is unknown. Each slide reveals one more random sample (a_t, b_t), displayed as a linear equation in x_1, ..., x_5 (mod 2); after ten such equations the system determines x, and the final slide reveals it.
SLIDE 17
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
SLIDE 18
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
By solving linear equations: O(n) samples, O(n^2) memory bits
SLIDE 19
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
By solving linear equations: O(n) samples, O(n^2) memory bits
By trying all possibilities: O(n) memory bits, exponential number of samples
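An illustrative sketch of the first strategy, online Gaussian elimination over GF(2): it keeps at most n reduced equations (roughly n^2 memory bits) and, with O(n) random samples, recovers x with high probability. The helper name and representation are assumptions, not from the talk:

def learn_parity_by_elimination(stream, n):
    # pivots[j] = (row, rhs): a stored equation whose leading 1 is in column j.
    # At most n equations of n+1 bits each are kept: about n^2 memory bits.
    pivots = {}
    for a, b in stream:
        row, rhs = list(a), b
        for j in range(n):
            if row[j] and j in pivots:          # reduce by the stored pivot row
                prow, prhs = pivots[j]
                row = [r ^ p for r, p in zip(row, prow)]
                rhs ^= prhs
        lead = next((j for j in range(n) if row[j]), None)
        if lead is not None:
            pivots[lead] = (row, rhs)
        if len(pivots) == n:                    # full rank: x is determined
            break
    x = [0] * n
    for j in sorted(pivots, reverse=True):      # back-substitution
        row, rhs = pivots[j]
        x[j] = rhs ^ (sum(row[k] & x[k] for k in range(j + 1, n)) % 2)
    return x

# Usage (with the stream generator sketched after Slide 4):
#   x_hat = learn_parity_by_elimination(parity_sample_stream(x), len(x))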
SLIDE 20
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
SLIDE 21
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
[R 2016]: Any algorithm for parity learning requires either n^2/25 memory bits or an exponential number of samples (conjectured by Steinhardt, Valiant and Wager [2015])
SLIDE 22
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
[R 2016]: Any algorithm for parity learning requires either n^2/25 memory bits or an exponential number of samples (conjectured by Steinhardt, Valiant and Wager [2015])
Previously: no lower bound on the number of samples, even if the memory size is n (for any learning problem)
(for memory of size < n, it is relatively easy to prove lower bounds, since inner product is a good two-source extractor)
SLIDE 23
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
[R 2016]: Any algorithm for parity learning requires either n^2/25 memory bits or an exponential number of samples (conjectured by Steinhardt, Valiant and Wager [2015])
Previously: no lower bound on the number of samples, even if the memory size is n (for any learning problem)
I will focus on super-linear lower bounds on the memory size
SLIDE 24
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
[R 2016]: Any algorithm for parity learning requires either n^2/25 memory bits or an exponential number of samples (conjectured by Steinhardt, Valiant and Wager [2015])
SLIDE 25
Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.
[R 2016]: Any algorithm for parity learning requires either n^2/25 memory bits or an exponential number of samples (conjectured by Steinhardt, Valiant and Wager [2015])
Best upper bound on the memory size: ≈ n^2/4
(when the number of samples is sub-exponential)
SLIDE 26
Motivation: Machine Learning Theory: For some online learning problems, access to a relatively large memory is crucial. In some cases, learning is infeasible due to memory constraints (if each sample is viewed only once).
SLIDE 27
Motivation: Machine Learning Theory: For some online learning problems, access to a relatively large memory is crucial. In some cases, learning is infeasible due to memory constraints (if each sample is viewed only once).
Very interesting to understand how much memory is needed for learning. Our result gives a concept class that can be efficiently learnt if and only if the learner has a quadratic-size memory.
SLIDE 28
Motivation: Machine Learning Theory: For some online learning problems, access to a relatively large memory is crucial. In some cases, learning is infeasible due to memory constraints (if each sample is viewed only once).
Very interesting to understand how much memory is needed for learning. Our result gives a concept class that can be efficiently learnt if and only if the learner has a quadratic-size memory.
"Good" memory may be crucial in learning processes.
SLIDE 29
Example: Neural Networks: Many learning algorithms try to learn a concept by modeling it as a neural network. The algorithm keeps some neural network in memory and updates its weights when new samples arrive. The memory used is the size of the network.
SLIDE 30
Example: Neural Networks: Many learning algorithms try to learn a concept by modeling it as a neural network. The algorithm keeps some neural network in memory and updates its weights when new samples arrive. The memory used is the size of the network.
Conclusion: such algorithms cannot learn certain concept classes if each sample is viewed only once and the size of the neural network is not sufficiently large.
For example, for learning parities, the memory size must be quadratic
SLIDE 31
Motivation: Bounded Storage Cryptography [Maurer 92]: [Maurer, CM, AR, ADR, Vadhan, DM, ...] In the bounded storage model, Alice and Bob want to interact securely, in the presence of an adversary with a bounded memory size.
SLIDE 32
Applications to Bounded Storage Cryptography:
[R16]: Secret-key encryption/decryption protocol
[Guan-Zhandry 2019]: Key agreement, bit commitment, oblivious transfer
(all these protocols have advantages over previous works)
SLIDE 33
Motivation: Complexity Theory: Time-space lower bounds for computing a function f(x_1, ..., x_n) have been studied for a long time, in various models [BJS 98, Ajt 99, BSSV 00, For 97, FLvMV 05, Wil 06, ...]
SLIDE 34
Motivation: Complexity Theory: Time-space lower bounds for computing a function f(x_1, ..., x_n) have been studied for a long time, in various models [BJS 98, Ajt 99, BSSV 00, For 97, FLvMV 05, Wil 06, ...]
These results proved lower bounds of at most n^{1+ε} on the time needed, under space constraints. How come we prove exponential bounds on the time, under space constraints?
SLIDE 35
Motivation: Complexity Theory: Time-space lower bounds for computing a function f(x_1, ..., x_n) have been studied for a long time, in various models [BJS 98, Ajt 99, BSSV 00, For 97, FLvMV 05, Wil 06, ...]
These results proved lower bounds of at most n^{1+ε} on the time needed, under space constraints. How come we prove exponential bounds on the time, under space constraints?
The models are different.
SLIDE 36
Motivation: Complexity Theory: Time-space lower bounds for computing a function f(x_1, ..., x_n) have been studied for a long time, in various models [BJS 98, Ajt 99, BSSV 00, For 97, FLvMV 05, Wil 06, ...]
These results proved lower bounds of at most n^{1+ε} on the time needed, under space constraints. How come we prove exponential bounds on the time, under space constraints?
The models are different.
Lower bounds for computing f(x_1, ..., x_n) assume that x_1, ..., x_n can always be accessed (the inputs are stored for free). In our case, after the learner saw (a_t, b_t), the sample (a_t, b_t) cannot be accessed again (unless stored in the memory).
SLIDE 37
Motivation: Pseudorandomness: An alternative way to construct pseudorandom generators for log-space, using polylog(n) truly-random bits (doesn't match the best known generators, which use only O(log^2 n) bits [Nisan 90, INW 94, GR 14])
SLIDE 38
[Kol-R-Tal 2017]: Learning sparse parities requires either super-linear memory size or a super-polynomial number of samples.
Sparsity-ℓ parity learning: same as parity learning, but it is known in advance that at most ℓ of the variables x_1, ..., x_n are 1.
SLIDE 39
[Kol-R-Tal 2017]: Learning sparse parities requires either super-linear memory size or a super-polynomial number of samples
SLIDE 40
[Kol-R-Tal 2017]: Learning sparse parities requires either super-linear memory size or a super-polynomial number of samples.
For sparsity ℓ ≤ n^{0.99}: Any algorithm requires
1) Ω(n · ℓ) memory bits or 2^{Ω(ℓ)} samples
2) Ω(n · ℓ^{0.99}) memory bits or ℓ^{Ω(ℓ)} samples
SLIDE 41
[Kol-R-Tal 2017]: Learning sparse parities requires either super-linear memory size or a super-polynomial number of samples.
For sparsity ℓ ≤ n^{0.99}: Any algorithm requires
1) Ω(n · ℓ) memory bits or 2^{Ω(ℓ)} samples
2) Ω(n · ℓ^{0.99}) memory bits or ℓ^{Ω(ℓ)} samples
Conclusion: Learning log(n)-sparse parities, linear-size DNFs, linear-size CNFs, linear-size decision trees, and log(n)-size juntas requires super-linear memory size or a super-polynomial number of samples.
SLIDE 42
[R 17, MM 18, BOGY 18, GRT 18]: For a large class of learning problems, any learning algorithm requires quadratic memory size or an exponential number of samples.
[BOGY 18, GRT 18] build on [R 17]. [MM 18] builds on an earlier paper [MM 17] that obtained a similar result, but with a linear lower bound of 1.25 · n on the memory size.
SLIDE 43
[R 17, MM 18, BOGY 18, GRT 18]: For a large class of learning problems, any learning algorithm requires quadratic memory size or an exponential number of samples
SLIDE 44
[R 17, MM 18, BOGY 18, GRT 18]: For a large class of learning problems, any learning algorithm requires quadratic memory size or an exponential number of samples.
I will focus on [R 17, GRT 18]. Additional follow-up works (building on [R 17]): memory-sample lower bounds for linear regression with small error [SSV 19] and for two-pass learning [GRT 19].
SLIDE 45
A Learning Problem as a Matrix: A, X: finite sets. M: A × X → {-1, 1}: a matrix. x ∈_R X is unknown. A learner tries to learn x from a stream (a_1, b_1), (a_2, b_2), ..., where for every t: a_t ∈_R A and b_t = M(a_t, x).
X: concept class (= {0,1}^n in parity learning)
A: possible samples (= {0,1}^n in parity learning)
SLIDE 46
Theorem [R 17], [Garg-R-Tal 18]: Assume that any submatrix of M of fraction 2^{-k} × 2^{-ℓ} has bias of at most 2^{-r}. Then, any learning algorithm requires either Ω(k · ℓ) memory bits or 2^{Ω(r)} samples.
In particular, for large classes of learning problems, any learning algorithm requires either memory of size Ω((log|A|) · (log|X|)) or an exponential number of samples.
(A new general proof technique. Implies all previous results) (A related result by Beame, Oveis-Gharan, Yang (building on [R 17]))
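To make the bias condition concrete, here is a small Monte Carlo illustration for the parity matrix M(a, x) = (-1)^{a·x}: it estimates the bias of one randomly chosen submatrix of the stated fractional size. The theorem's hypothesis is about all such submatrices, so this only illustrates the quantity being bounded; the function names are not from the talk.

import itertools, random

def M(a, x):
    # The parity-learning matrix: M(a, x) = (-1)^(<a, x> mod 2).
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def submatrix_bias(rows, cols):
    # Bias of a submatrix = |average of its +1/-1 entries|.
    total = sum(M(a, x) for a in rows for x in cols)
    return abs(total) / (len(rows) * len(cols))

n, k, l = 10, 3, 3
cube = list(itertools.product([0, 1], repeat=n))
rows = random.sample(cube, 2 ** (n - k))   # a 2^-k fraction of the rows
cols = random.sample(cube, 2 ** (n - l))   # a 2^-l fraction of the columns
print(submatrix_bias(rows, cols))          # typically very small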
SLIDE 47
Applications: (examples)
Learning from low-degree equations: A learner tries to learn x = (x_1, ..., x_n) ∈_R {0,1}^n from random multilinear polynomial equations of degree at most d (over F_2): requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples
SLIDE 48
Applications: (examples)
Learning from low-degree equations: A learner tries to learn x = (x_1, ..., x_n) ∈_R {0,1}^n from random multilinear polynomial equations of degree at most d (over F_2): requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples
Low-degree polynomials: A learner tries to learn an n-variate multilinear polynomial of degree d over F_2, from random evaluations: requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples
SLIDE 49
Applications: (examples)
Learning from low-degree equations: A learner tries to learn x = (x_1, ..., x_n) ∈_R {0,1}^n from random multilinear polynomial equations of degree at most d (over F_2): requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples
Low-degree polynomials: A learner tries to learn an n-variate multilinear polynomial of degree d over F_2, from random evaluations: requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples
Error correcting codes... Random matrices...
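An illustrative sketch of a single sample in the first application (representing a degree-at-most-d multilinear polynomial by a dict of monomial coefficients is an assumption, not from the talk):

import itertools, random

def random_low_degree_equation(x, d):
    # One sample: a uniformly random multilinear polynomial p of degree <= d
    # over F_2 (its monomial coefficients), together with its value p(x) mod 2.
    n = len(x)
    monomials = [s for r in range(d + 1) for s in itertools.combinations(range(n), r)]
    coeffs = {s: random.randrange(2) for s in monomials}
    value = sum(c for s, c in coeffs.items() if all(x[i] for i in s)) % 2
    return coeffs, value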
SLIDE 50
Branching Program (length m, width d): (for parity learning) Each layer represents a time step. Each vertex represents a memory state of the learner. Each non-leaf vertex has 2^{n+1} outgoing edges, one for each (a, b) ∈ {0,1}^n × {-1, 1}
SLIDE 51
Branching Program (length m, width d): (for parity learning) The samples (a_1, b_1), ..., (a_m, b_m) define a computation-path. Each vertex v in the last layer is labeled by some x̃_v ∈ {0,1}^n. The output is the label x̃_v of the vertex reached by the path.
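A sketch of the correspondence between memory-bounded learners and branching programs (the function and parameter names are illustrative, not from the talk): a learner with b memory bits is just a state-update rule, and its runs over m samples trace paths in a layered graph of length m and width at most 2^b.

def run_bounded_memory_learner(init_state, update, output, stream, m):
    # init_state: the start vertex; update(state, a, b) -> state follows the
    # outgoing edge labeled (a, b); output(state) is the label of the final vertex.
    state = init_state
    for _, (a, b) in zip(range(m), stream):
        state = update(state, a, b)
    return output(state)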
SLIDE 52
Branching Program (length m, width d): (for parity learning) Example: BP for parity learning: Any BP with width d ≤ 2^{n^2/25} and length m ≤ 2^{n/100} outputs the correct x with exponentially small probability.
SLIDE 53
Proof Outline (for parity): [R17]. The same proof technique was used in several follow-up works: [BOGY18, GRT18, SSV19, GRT19]
SLIDE 54
Interesting Idea in the Proof: (very high level)
Significant vertices: v s.t. conditioned on the event that the computation-path reaches v, x can be guessed with non-negligible probability.
Pr[v] = probability that the computation-path reaches v.
We want to prove: if v is significant, then Pr[v] ≤ 2^{-Ω(n^2)}. Hence, since a successful learner must reach a significant vertex with noticeable probability, there are at least 2^{Ω(n^2)} significant vertices.
G = the event that some "atypical" things happen.
We show that Pr[G] ≤ 2^{-Ω(n)} (but much larger than 2^{-Ω(n^2)}).
We show that if v is significant, Pr[v | ¬G] ≤ 2^{-Ω(n^2)}.
SLIDE 55
Proof Outline:
T = same as the computation-path, but stops when "atypical" things happen (stopping rules). All definitions are with respect to T.
Pr[T stops] is exponentially small (but much larger than 2^{-Ω(n^2)}).
P_{x|v} = distribution of x conditioned on the event that the path T reaches v.
Significant vertices: v s.t. ||P_{x|v}||_2 ≥ 2^{εn} · 2^{-n}
Pr[v] = probability that the path T reaches v.
We prove: if v is significant, Pr[v] ≤ 2^{-Ω(n^2)}. Hence, there are at least 2^{Ω(n^2)} significant vertices.
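A short side calculation (not on the original slide), assuming the norm above is the expectation norm ||P||_2 = (E_{x'}[P(x')^2])^{1/2} (so the uniform distribution has norm 2^{-n}), relating this condition to the informal notion of significance on the previous slide:

If ||P_{x|v}||_2 ≥ 2^{εn} · 2^{-n}, then, since Σ_{x'} P(x')^2 ≤ (max_{x'} P(x')) · Σ_{x'} P(x'),
  max_{x'} P_{x|v}(x')  ≥  Σ_{x'} P_{x|v}(x')^2  =  2^n · E_{x'}[P_{x|v}(x')^2]  ≥  2^{2εn} · 2^{-n},
so conditioned on reaching v, the learner can guess x with probability at least 2^{2εn} times the trivial 2^{-n}.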
SLIDE 56
Proof Outline:
If s is significant, Pr[s] ≤ 2^{-Ω(n^2)}
Progress Function: For layer L_i:   Z_i = Σ_{v∈L_i} Pr[v] · ⟨P_{x|v}, P_{x|s}⟩^{εn}
1) Z_0 = 2^{-2εn^2}
2) Z_i is very slowly growing: Z_m ≈ Z_0
3) If s ∈ L_i, then Z_i ≥ Pr[s] · 2^{(εn)^2} · 2^{-2εn^2}
Hence: if s is significant, Pr[s] ≤ 2^{-(εn)^2} = 2^{-Ω(n^2)}
(the hard step is step 2)
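Putting 1)-3) together (a one-line derivation, spelled out here for readability): if s is a significant vertex in layer L_i, then
  Pr[s] · 2^{(εn)^2} · 2^{-2εn^2}  ≤  Z_i  ≲  Z_0  =  2^{-2εn^2},
where ≲ hides the negligible growth allowed by step 2); hence Pr[s] ≤ 2^{-(εn)^2} = 2^{-Ω(n^2)}.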
SLIDE 57
How we prove that Z_i is very slowly growing:
Z_i = Σ_{v∈L_i} Pr[v] · ⟨P_{x|v}, P_{x|s}⟩^{εn}
Z_{i+1} = Σ_{v∈L_{i+1}} Pr[v] · ⟨P_{x|v}, P_{x|s}⟩^{εn}
Z' = Σ_{e: L_i→L_{i+1}} Pr[e] · ⟨P_{x|e}, P_{x|s}⟩^{εn}
By a simple convexity argument, Z_{i+1} ≤ Z'.
The hard part is to show that Z' is only negligibly larger than Z_i.
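The convexity step, spelled out (here e ranges over the edges of the path entering v, and Pr[e] is the probability that T traverses e): for v ∈ L_{i+1},
  P_{x|v} = Σ_{e into v} (Pr[e] / Pr[v]) · P_{x|e}      (a convex combination),
and since t ↦ t^{εn} is convex on [0, ∞) and the inner products are nonnegative,
  ⟨P_{x|v}, P_{x|s}⟩^{εn}  ≤  Σ_{e into v} (Pr[e] / Pr[v]) · ⟨P_{x|e}, P_{x|s}⟩^{εn}.
Multiplying by Pr[v] and summing over v ∈ L_{i+1} gives Z_{i+1} ≤ Z'.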
SLIDE 58
How we prove that Z' is only negligibly larger than Z_i:
Z_i = Σ_{v∈L_i} Pr[v] · ⟨P_{x|v}, P_{x|s}⟩^{εn}
Z' = Σ_{e: L_i→L_{i+1}} Pr[e] · ⟨P_{x|e}, P_{x|s}⟩^{εn}
We show that, on average over v ∈ L_i,  Σ_{e: v→L_{i+1}} Pr[e] · ⟨P_{x|e}, P_{x|s}⟩^{εn}  is only negligibly larger than  Pr[v] · ⟨P_{x|v}, P_{x|s}⟩^{εn}.
SLIDE 59
How we prove that, on average,  Σ_{e: v→L_{i+1}} Pr[e] · ⟨P_{x|e}, P_{x|s}⟩^{εn}  is only negligibly larger than  Pr[v] · ⟨P_{x|v}, P_{x|s}⟩^{εn}:
Roughly speaking, for parity: the inner products ⟨P_{x|e}, P_{x|s}⟩ are close, up to a normalization, to the Fourier coefficients of P_{x|v} · P_{x|s}. By introducing stopping rules for the path T, we are able to bound the ℓ_2-norm of P_{x|v} and the ℓ_∞-norm of P_{x|s}, and hence the ℓ_2-norm of P_{x|v} · P_{x|s}, so that the Fourier coefficients of P_{x|v} · P_{x|s} are small on average. Another stopping rule ensures that the normalization doesn't distort by much.
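One way to see why these norm bounds control the Fourier coefficients on average (a sketch, normalizing the Fourier transform over {0,1}^n so that Parseval reads Σ_a f̂(a)^2 = E_{x'}[f(x')^2], and taking f = P_{x|v} · P_{x|s}):
  Σ_a f̂(a)^2 = E_{x'}[P_{x|v}(x')^2 · P_{x|s}(x')^2] ≤ ||P_{x|s}||_∞^2 · ||P_{x|v}||_2^2,
so the average of the 2^n squared Fourier coefficients of P_{x|v} · P_{x|s} is at most 2^{-n} · ||P_{x|v}||_2^2 · ||P_{x|s}||_∞^2.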
SLIDE 60
Stopping Rules:
Stop on a vertex v if:
1) ||P_{x|v}||_2 is large
2) P_{x|v}(x) is large
3) The next edge corresponds to a large Fourier coefficient of P_{x|v}
The stopping rules do not depend on s. They do depend on x.
SLIDE 61
Summary: For a large class of learning problems, any learning algorithm requires either super-linear memory size or a super-polynomial number of samples.
Main Message: For some learning problems, access to a relatively large memory is crucial. In other words, in some cases, learning is infeasible due to memory constraints.
[S14, SVW16, R16, VV16, KRT17, MM17, R17, MM18, BOGY18, GRT18, DS18, AS18, DKS19, SSV19, GRT19, GKR19]
SLIDE 62