SLIDE 1

Learning Fast Requires Good Memory: Time-Space Tradeoff Lower Bounds for Learning

Ran Raz, Princeton University

Based on joint works with: Sumegha Garg, Gillat Kol, Avishay Tal [R16, KRT17, R17, GRT18]

SLIDE 2

This Talk: A line of recent works studies time-space (memory-samples) lower bounds for learning [S14, SVW16, R16, VV16, KRT17, MM17, R17, MM18, BOGY18, GRT18, DS18, AS18, DKS19, SSV19, GRT19, GKR19].

Main Message: For some learning problems, access to a relatively large memory is crucial. In other words, in some cases, learning is infeasible due to memory constraints.

SLIDE 3

Original Motivation: Online Learning Theory: Initiated by [Shamir 2014], [Steinhardt-Valiant-Wager 2015]: Can one prove unconditional lower bounds on the number of samples needed for learning, under memory constraints? (When each sample is viewed only once, also known as online learning.)
SLIDE 4

Example: Parity Learning: x = (x_1, …, x_n) ∈_R {0,1}^n is unknown. A learner gets a stream of random linear equations (mod 2) in x_1, …, x_n, one by one, and tries to learn x.

Formally: The learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (inner product mod 2). The learner needs to solve the equations and find x (no noise).
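
To make the setup concrete, here is a minimal sketch (mine, not from the talk) of the sample stream the learner receives; `parity_stream` is a hypothetical helper name.

```python
# A minimal sketch of the parity-learning stream: nature draws a secret
# x uniformly from {0,1}^n, then emits samples (a_t, b_t) with a_t uniform
# in {0,1}^n and b_t = a_t . x (mod 2). Names here are illustrative.
import random

def parity_stream(x):
    """Yield an endless stream of samples (a_t, b_t) for the secret x."""
    n = len(x)
    while True:
        a = [random.randrange(2) for _ in range(n)]
        b = sum(ai * xi for ai, xi in zip(a, x)) % 2  # inner product mod 2
        yield a, b

if __name__ == "__main__":
    secret = [random.randrange(2) for _ in range(5)]
    stream = parity_stream(secret)
    for _, (a, b) in zip(range(3), stream):
        print(a, b)   # three random equations a . x = b (mod 2)
```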

SLIDE 5

x = (x_1, x_2, x_3, x_4, x_5) is unknown

Ready to Play?

SLIDE 6

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_1 = (1, 1, 0, 1, 1): x_1 + x_2 + x_4 + x_5 = b_1 (mod 2)

Ready to Play?

SLIDE 7

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_2 = (0, 1, 1, 0, 0): x_2 + x_3 = b_2 (mod 2)

Ready to Play?

SLIDE 8

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_3 = (0, 0, 1, 1, 1): x_3 + x_4 + x_5 = b_3 (mod 2)

Ready to Play?

SLIDE 9

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_4 = (0, 1, 1, 1, 0): x_2 + x_3 + x_4 = b_4 (mod 2)

Ready to Play?

SLIDE 10

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_5 = (1, 1, 0, 0, 1): x_1 + x_2 + x_5 = b_5 (mod 2)

Ready to Play?

SLIDE 11

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_6 = (0, 0, 1, 1, 0): x_3 + x_4 = b_6 (mod 2)

Ready to Play?

SLIDE 12

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_7 = (0, 1, 0, 1, 1): x_2 + x_4 + x_5 = b_7 (mod 2)

Ready to Play?

SLIDE 13

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_8 = (1, 0, 0, 0, 1): x_1 + x_5 = b_8 (mod 2)

Ready to Play?

SLIDE 14

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_9 = (1, 1, 1, 1, 0): x_1 + x_2 + x_3 + x_4 = b_9 (mod 2)

Ready to Play?

SLIDE 15

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_10 = (0, 1, 1, 1, 1): x_2 + x_3 + x_4 + x_5 = b_10 (mod 2)

Ready to Play?

SLIDE 16

x = (x_1, x_2, x_3, x_4, x_5) is unknown

a_11 = (0, 0, 0, 0, 0), b_11 = 0: 0 = 0 (mod 2)

Ready to Play?

SLIDE 17

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

SLIDE 18

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

By solving linear equations: O(n) samples, O(n^2) memory bits.

SLIDE 19

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

By solving linear equations: O(n) samples, O(n^2) memory bits.
By trying all possibilities: O(n) memory bits, an exponential number of samples.
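
The two baselines above can be sketched as follows (my code, assuming a `stream` that yields samples as on SLIDE 4): the first keeps a row-reduced system, Θ(n^2) memory bits and O(n) equations; the second keeps only a candidate and a counter, O(n) memory bits and roughly 2^n samples.

```python
# Sketch (mine) of the two baseline learners from this slide.

def learn_by_elimination(stream, n):
    """~O(n) samples, O(n^2) memory bits: keep a row-reduced basis over GF(2)."""
    basis = {}                                   # pivot bit -> (row_mask, rhs)
    while len(basis) < n:
        a, b = next(stream)
        row = sum(bit << i for i, bit in enumerate(a))
        for p in sorted(basis, reverse=True):    # reduce by stored pivot rows
            if (row >> p) & 1:
                r, c = basis[p]
                row ^= r
                b ^= c
        if row:                                  # new independent equation
            basis[row.bit_length() - 1] = (row, b)
    x = [0] * n
    for p in sorted(basis):                      # forward substitution
        r, c = basis[p]
        below = sum(x[i] << i for i in range(p))
        x[p] = c ^ (bin(r & below).count("1") & 1)
    return x

def learn_by_enumeration(stream, n, checks=64):
    """O(n) memory bits, ~2^n samples: test one candidate at a time.
    A wrong candidate survives each random equation with probability 1/2."""
    for cand in range(1 << n):                   # memory: candidate + counter
        if all(bin(cand & sum(bit << i for i, bit in enumerate(a))).count("1") & 1 == b
               for a, b in (next(stream) for _ in range(checks))):
            return [(cand >> i) & 1 for i in range(n)]
```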

SLIDE 20

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

SLIDE 21

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

[R 2016]: Any algorithm for parity learning requires either Ω(n^2) memory bits or an exponential number of samples. (Conjectured by Steinhardt, Valiant and Wager [2015].)

SLIDE 22

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

[R 2016]: Any algorithm for parity learning requires either Ω(n^2) memory bits or an exponential number of samples. (Conjectured by Steinhardt, Valiant and Wager [2015].)

Previously: no lower bound on the number of samples was known, even if the memory size is n (for any learning problem).

(For memory of size < n, it is relatively easy to prove lower bounds, since inner product is a good two-source extractor.)

SLIDE 23

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

[R 2016]: Any algorithm for parity learning requires either Ω(n^2) memory bits or an exponential number of samples. (Conjectured by Steinhardt, Valiant and Wager [2015].)

Previously: no lower bound on the number of samples was known, even if the memory size is n (for any learning problem).

I will focus on super-linear lower bounds on the memory size.

SLIDE 24

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

[R 2016]: Any algorithm for parity learning requires either Ω(n^2) memory bits or an exponential number of samples. (Conjectured by Steinhardt, Valiant and Wager [2015].)

SLIDE 25

Parity Learning: x ∈_R {0,1}^n is unknown. A learner gets a stream of samples (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R {0,1}^n and b_t = a_t · x (mod 2), and needs to solve the equations and find x.

[R 2016]: Any algorithm for parity learning requires either Ω(n^2) memory bits or an exponential number of samples. (Conjectured by Steinhardt, Valiant and Wager [2015].)

Best upper bound on the memory size: ≈ n^2/4 (when the number of samples is sub-exponential).

SLIDE 26

Motivation: Machine Learning Theory: For some online learning problems, access to a relatively large memory is crucial. In some cases, learning is infeasible due to memory constraints (if each sample is viewed only once).

SLIDE 27

Motivation: Machine Learning Theory: For some online learning problems, access to a relatively large memory is crucial. In some cases, learning is infeasible due to memory constraints (if each sample is viewed only once).

It is very interesting to understand how much memory is needed for learning. Our result gives a concept class that can be efficiently learnt if and only if the learner has a quadratic-size memory.

SLIDE 28

Motivation: Machine Learning Theory: For some online learning problems, access to a relatively large memory is crucial. In some cases, learning is infeasible due to memory constraints (if each sample is viewed only once).

It is very interesting to understand how much memory is needed for learning. Our result gives a concept class that can be efficiently learnt if and only if the learner has a quadratic-size memory.

"Good" memory may be crucial in learning processes.

SLIDE 29

Example: Neural Networks: Many learning algorithms try to learn a concept by modeling it as a neural network. The algorithm keeps some neural network in memory and updates its weights when new samples arrive. The memory used is the size of the network.
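
As a concrete illustration (my sketch, not from the talk), here is an online learner whose entire memory is the parameter set of a one-hidden-layer network; each sample is used for a single gradient update and then discarded. `OnlineNet` is a hypothetical name.

```python
# Sketch: an online learner whose memory is exactly the network weights.
import numpy as np

class OnlineNet:
    def __init__(self, n, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # the learner's whole memory: (hidden*n + hidden) floats
        self.W = rng.normal(0.0, 0.1, (hidden, n))
        self.v = rng.normal(0.0, 0.1, hidden)

    def update(self, a, b, lr=0.1):
        """One SGD step on squared loss; the sample (a, b) is then forgotten."""
        a = np.asarray(a, dtype=float)
        h = np.tanh(self.W @ a)                  # forward pass
        err = self.v @ h - b
        grad_v = err * h                          # gradients computed first,
        grad_W = np.outer(err * self.v * (1.0 - h ** 2), a)
        self.v -= lr * grad_v                     # then both weights updated
        self.W -= lr * grad_W
```

For parity learning, the results below say that any such learner needs Ω(n^2) bits of memory unless it sees exponentially many samples, regardless of how the updates are chosen.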

SLIDE 30

Example: Neural Networks: Many learning algorithms try to learn a concept by modeling it as a neural network. The algorithm keeps some neural network in memory and updates its weights when new samples arrive. The memory used is the size of the network.

Conclusion: Such algorithms cannot learn certain concept classes if each sample is viewed only once and the size of the neural network is not sufficiently large.

For example, for learning parities, the memory size must be quadratic.

SLIDE 31

Motivation: Bounded Storage Cryptography [Maurer 92]: [Maurer, CM, AR, ADR, Vadhan, DM, …] In the bounded storage model, Alice and Bob want to interact securely, in the presence of an adversary with bounded memory size.

SLIDE 32

Applications to Bounded Storage Cryptography:
[R16]: Secret-key encryption/decryption protocol.
[Guan-Zhandry 2019]: Key agreement, bit commitment, oblivious transfer.
(All these protocols have advantages over previous works.)

SLIDE 33

Motivation: Complexity Theory: Time-space lower bounds for computing a function g(x_1, …, x_n) have been studied for a long time, in various models [BJS 98, Ajt 99, BSSV 00, For 97, FLvMV 05, Wil 06, …].

SLIDE 34

Motivation: Complexity Theory: Time-space lower bounds for computing a function g(x_1, …, x_n) have been studied for a long time, in various models [BJS 98, Ajt 99, BSSV 00, For 97, FLvMV 05, Wil 06, …].

These results proved lower bounds of at most n^{1+ε} on the time needed, under space constraints. How come we prove exponential bounds on the time, under space constraints?

SLIDE 35

Motivation: Complexity Theory: Time-space lower bounds for computing a function g(x_1, …, x_n) have been studied for a long time, in various models [BJS 98, Ajt 99, BSSV 00, For 97, FLvMV 05, Wil 06, …].

These results proved lower bounds of at most n^{1+ε} on the time needed, under space constraints. How come we prove exponential bounds on the time, under space constraints? The models are different.

SLIDE 36

Motivation: Complexity Theory: Time-space lower bounds for computing a function g(x_1, …, x_n) have been studied for a long time, in various models [BJS 98, Ajt 99, BSSV 00, For 97, FLvMV 05, Wil 06, …].

These results proved lower bounds of at most n^{1+ε} on the time needed, under space constraints. How come we prove exponential bounds on the time, under space constraints? The models are different:

Lower bounds for computing g(x_1, …, x_n) assume that x_1, …, x_n can always be accessed (the inputs are stored for free). In our case, after the learner sees (a_t, b_t), the sample (a_t, b_t) cannot be accessed again (unless stored in memory).

SLIDE 37

Motivation: Pseudorandomness: An alternative way to construct pseudorandom generators for log-space, using polylog(n) truly-random bits (doesn't match the best known generators, which use only O(log^2 n) bits [Nisan 90, INW 94, GR 14]).

SLIDE 38

[Kol-R-Tal 2017]: Learning sparse parities requires either super-linear memory size or a super-polynomial number of samples.

Sparsity-k parity learning: Same as parity learning, but it is known in advance that at most k of the variables x_1, …, x_n are 1.

SLIDE 39

[Kol-R-Tal 2017]: Learning sparse parities requires either super-linear memory size or a super-polynomial number of samples

SLIDE 40

[Kol-R-Tal 2017]: Learning sparse parities requires either super-linear memory size or a super-polynomial number of samples.

For sparsity k < n^{0.99}: Any algorithm requires
1) Ω(n · k) memory bits or 2^{Ω(k)} samples
2) Ω(n · k^{0.99}) memory bits or k^{Ω(k)} samples

SLIDE 41

[Kol-R-Tal 2017]: Learning sparse parities requires either super-linear memory size or a super-polynomial number of samples.

For sparsity k < n^{0.99}: Any algorithm requires
1) Ω(n · k) memory bits or 2^{Ω(k)} samples
2) Ω(n · k^{0.99}) memory bits or k^{Ω(k)} samples

Conclusion: Learning log(n)-sparse parities, linear-size DNFs, linear-size CNFs, linear-size decision trees, and log(n)-size juntas requires super-linear memory size or a super-polynomial number of samples.

SLIDE 42

[R 17, MM 18, BOGY 18, GRT 18]: For a large class of learning problems, any learning algorithm requires quadratic memory size or an exponential number of samples.

[BOGY 18, GRT 18] build on [R 17]. [MM 18] builds on an earlier paper [MM 17] that obtained a similar result, but with a linear lower bound of 1.25 · n on the memory size.

SLIDE 43

[R 17, MM 18, BOGY 18, GRT 18]: For a large class of learning problems, any learning algorithm requires quadratic memory size or an exponential number of samples

SLIDE 44

[R 17, MM 18, BOGY 18, GRT 18]: For a large class of learning problems, any learning algorithm requires quadratic memory size or an exponential number of samples.

I will focus on [R 17, GRT 18]. Additional follow-up works (building on [R 17]) give memory-sample lower bounds for: linear regression with small error [SSV 19]; two-pass learning [GRT 19].

SLIDE 45

A Learning Problem as a Matrix: A, X: finite sets. M: A × X → {-1, 1}: a matrix. x ∈_R X is unknown. A learner tries to learn x from a stream (a_1, b_1), (a_2, b_2), …, where for every t: a_t ∈_R A and b_t = M(a_t, x).

X: the concept class (= {0,1}^n in parity learning). A: the possible samples (= {0,1}^n in parity learning).
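
A sketch (mine) of this matrix view, instantiated for parity learning with the standard ±1 encoding M(a, x) = (-1)^{a·x}:

```python
# The learning problem as a matrix M: A x X -> {-1, 1}; for parity learning
# A = X = {0,1}^n and M(a, x) = (-1)^(a.x mod 2).
import itertools
import random

def M(a, x):
    return (-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2)

n = 3
X = list(itertools.product([0, 1], repeat=n))    # the concept class
A = X                                            # the possible samples
matrix = [[M(a, x) for x in X] for a in A]       # the full matrix

def sample(x):
    """One sample of the stream: (a_t, b_t) with b_t = M(a_t, x)."""
    a = random.choice(A)
    return a, M(a, x)
```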

SLIDE 46

Theorem [R 17], [Garg-R-Tal 18]: Assume that any submatrix of M of fraction 2^{-k} × 2^{-ℓ} has bias of at most 2^{-r}. Then, any learning algorithm requires either Ω(k · ℓ) memory bits or 2^{Ω(r)} samples.

In particular, for large classes of learning problems, any learning algorithm requires either memory of size Ω((log |A|) · (log |X|)) or an exponential number of samples.

(A new general proof technique; implies all previous results.) (A related result by Beame, Oveis Gharan, Yang, building on [R 17].)
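
The theorem's condition bounds the bias of every sufficiently large submatrix; the toy code below (mine) only spot-checks random submatrices of the parity matrix at the stated density, as an illustration rather than a verification.

```python
# Estimate the bias of random 2^-k x 2^-l submatrices of the parity matrix.
import itertools
import random

def M(a, x):
    return (-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2)

n = 8
vecs = list(itertools.product([0, 1], repeat=n))
mat = [[M(a, x) for x in vecs] for a in vecs]

def bias(rows, cols):
    s = sum(mat[i][j] for i in rows for j in cols)
    return abs(s) / (len(rows) * len(cols))

random.seed(1)
k = l = 2                                        # density 2^-k x 2^-l
size = len(vecs) >> k
worst = max(bias(random.sample(range(len(vecs)), size),
                 random.sample(range(len(vecs)), size))
            for _ in range(20))
print(f"largest bias over 20 random submatrices: {worst:.3f}")
```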

SLIDE 47

Applications (examples): Learning from low-degree equations: A learner tries to learn x = (x_1, …, x_n) ∈_R {0,1}^n from random multilinear polynomial equations of degree at most d (over F_2): requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples.
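
A sketch (mine) of this sample distribution: each sample is a uniformly random multilinear polynomial p of degree at most d over F_2, together with its evaluation p(x).

```python
# One sample: a random multilinear polynomial of degree <= d over F_2,
# given by its coefficients (one per monomial, i.e. per subset of size <= d),
# together with its value at the secret x.
import itertools
import random

def low_degree_sample(x, d):
    n = len(x)
    monomials = [s for k in range(d + 1)
                 for s in itertools.combinations(range(n), k)]
    coeffs = {s: random.randrange(2) for s in monomials}
    value = sum(c for s, c in coeffs.items() if all(x[i] for i in s)) % 2
    return coeffs, value

# Note: a degree-<=d multilinear polynomial has ~O(n^d) coefficients, so the
# Omega(n^{d+1}) memory bound is roughly n times the sample's description length.
```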

SLIDE 48

Applications (examples): Learning from low-degree equations: A learner tries to learn x = (x_1, …, x_n) ∈_R {0,1}^n from random multilinear polynomial equations of degree at most d (over F_2): requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples.

Low-degree polynomials: A learner tries to learn an n-variate multilinear polynomial of degree d over F_2, from random evaluations: requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples.

SLIDE 49

Applications (examples): Learning from low-degree equations: A learner tries to learn x = (x_1, …, x_n) ∈_R {0,1}^n from random multilinear polynomial equations of degree at most d (over F_2): requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples.

Low-degree polynomials: A learner tries to learn an n-variate multilinear polynomial of degree d over F_2, from random evaluations: requires Ω(n^{d+1}) memory or 2^{Ω(n)} samples.

Error correcting codes… Random matrices…

SLIDE 50

Branching Program (length m, width d): (for parity learning) Each layer represents a time step. Each vertex represents a memory state of the learner. Each non-leaf vertex has 2^{n+1} outgoing edges, one for each (a, b) ∈ {0,1}^n × {-1, 1}.

[Figure: a layered branching program of length m and width d, with edges labeled by samples (a, b).]

SLIDE 51

Branching Program (length m, width d): (for parity learning) The samples (a_1, b_1), …, (a_m, b_m) define a computation-path. Each vertex v in the last layer is labeled by some x_v ∈ {0,1}^n. The output is the label x_v of the vertex reached by the path.

SLIDE 52

Branching Program (length m, width d): (for parity learning) Example: BP for parity learning: Any BP with width d ≤ 2^{εn^2} and length m ≤ 2^{εn}, for a sufficiently small constant ε > 0, outputs the correct x with exponentially small probability.

SLIDE 53

Proof Outline (for parity): [R17]. The same proof technique was used in several follow-up works: [BOGY18, GRT18, SSV19, GRT19].

SLIDE 54

Interesting Idea in the Proof: (very high level)

Significant vertices: v s.t. conditioned on the event that the computation-path reaches v, x can be guessed with non-negligible probability. Pr(v) = the probability that the computation-path reaches v.

We want to prove: If v is significant, Pr(v) ≤ 2^{-Ω(n^2)}. Hence, since a successful learner must reach a significant vertex with high probability, there are at least 2^{Ω(n^2)} significant vertices.

C = the event that some "atypical" things happen. We show that Pr(C) ≤ 2^{-Ω(n)} (but much larger than 2^{-Ω(n^2)}). We show that if v is significant, Pr(v | ¬C) ≤ 2^{-Ω(n^2)}.

SLIDE 55

Proof Outline:

U = same as the computation-path, but stops when "atypical" things happen (stopping rules). All definitions are with respect to U. Pr(U stops) is exponentially small (but much larger than 2^{-Ω(n^2)}).

P_{x|v} = the distribution of x, conditioned on the event that the path U reaches v.

Significant vertices: v s.t. ||P_{x|v}||_2 ≥ 2^{εn} · 2^{-n}. Pr(v) = the probability that the path U reaches v.

We prove: If v is significant, Pr(v) ≤ 2^{-Ω(n^2)}. Hence, there are at least 2^{Ω(n^2)} significant vertices.
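
For tiny n one can compute P_{x|v} exactly by Bayes' rule and watch its l_2-norm grow as equations accumulate; a toy sketch (mine), using the expectation norm ||P||_2 = (E_x P(x)^2)^{1/2} as in the slides:

```python
# Track P_{x|v}: the distribution of x given the samples seen so far.
import itertools
import math
import random

n = 4
xs = list(itertools.product([0, 1], repeat=n))
P = {x: 1.0 / len(xs) for x in xs}               # start vertex: uniform

def update(P, a, b):
    """Condition on a new equation a.x = b (mod 2)."""
    keep = {x: p for x, p in P.items()
            if sum(ai * xi for ai, xi in zip(a, x)) % 2 == b}
    total = sum(keep.values())
    return {x: p / total for x, p in keep.items()}

def l2(P):
    return math.sqrt(sum(p * p for p in P.values()) / len(xs))

secret = tuple(random.randrange(2) for _ in range(n))
print(l2(P))                                     # 2^-n: far below significance
for _ in range(3):
    a = [random.randrange(2) for _ in range(n)]
    b = sum(ai * xi for ai, xi in zip(a, secret)) % 2
    P = update(P, a, b)
    print(l2(P))          # grows by ~sqrt(2) per independent equation
```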

SLIDE 56

Proof Outline:

If s is significant, Pr(s) ≤ 2^{-Ω(n^2)}.

Progress Function: For layer L_i, define Z_i = Σ_{v ∈ L_i} Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^{εn}.

1) Z_1 = 2^{-2εn^2}
2) Z_i is very slowly growing: Z_1 ≈ Z_m
3) If s ∈ L_m, then Z_m ≥ Pr(s) · 2^{(εn)^2} · 2^{-2εn^2}

Hence: If s is significant, Pr(s) ≤ 2^{-(εn)^2} = 2^{-Ω(n^2)}. (The hard step is step 2.)

SLIDE 57

How we prove that Z_i is very slowly growing:

Z_i = Σ_{v ∈ L_i} Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^{εn}

Z_{i+1} = Σ_{v ∈ L_{i+1}} Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^{εn}

Z′ = Σ_{e: L_i → L_{i+1}} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^{εn}  (summing over the edges e from layer i to layer i+1)

By a simple convexity argument, Z_{i+1} ≤ Z′. The hard part is to show that Z′ is only negligibly larger than Z_i.

SLIDE 58

How we prove that Z′ is only negligibly larger than Z_i:

Z_i = Σ_{v ∈ L_i} Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^{εn}

Z′ = Σ_{e: L_i → L_{i+1}} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^{εn}

We show that, on average over v, Σ_{e: v → L_{i+1}} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^{εn} is only negligibly larger than Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^{εn}.

SLIDE 59

How we prove that, on average, Σ_{e: v → L_{i+1}} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^{εn} is only negligibly larger than Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^{εn}:

Roughly speaking, for parity: the inner products ⟨P_{x|e}, P_{x|s}⟩ are close, up to a normalization, to the Fourier coefficients of P_{x|v} · P_{x|s}. By introducing stopping rules for the path U, we are able to bound the l_2-norm of P_{x|s} and the l_∞-norm of P_{x|v}, and hence the l_2-norm of P_{x|v} · P_{x|s}, so that the Fourier coefficients of P_{x|v} · P_{x|s} are small on average. Another stopping rule ensures that the normalization doesn't distort by much.
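
The Fourier coefficients here are over {0,1}^n: fhat(a) = E_x[f(x) · (-1)^{a·x}]. A small sketch (mine) of that transform; conditioning on a new equation a·x = b reweights P_{x|v} by a ±1 character, which is why the coefficient at a appears:

```python
# Fourier coefficient of f: {0,1}^n -> R at frequency a:
#   fhat(a) = E_x[ f(x) * (-1)^(a.x) ].
import itertools

def fourier_coeff(f, a, xs):
    return sum(f[x] * (-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2)
               for x in xs) / len(xs)

n = 3
xs = list(itertools.product([0, 1], repeat=n))
f = {x: 1.0 for x in xs}                 # the constant-1 function
print(fourier_coeff(f, (0, 0, 0), xs))   # 1.0: all weight at frequency 0
print(fourier_coeff(f, (1, 0, 1), xs))   # 0.0: no other frequencies
```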

SLIDE 60

Stopping Rules:

Stop on a vertex v if:
1) ||P_{x|v}||_2 is large
2) P_{x|v}(x) is large
3) The next edge corresponds to a large Fourier coefficient of P_{x|v}

The stopping rules do not depend on s. They do depend on x.

SLIDE 61

Summary: For a large class of learning problems, any learning algorithm requires either super-linear memory size or a super-polynomial number of samples.

Main Message: For some learning problems, access to a relatively large memory is crucial. In other words, in some cases, learning is infeasible due to memory constraints.

[S14, SVW16, R16, VV16, KRT17, MM17, R17, MM18, BOGY18, GRT18, DS18, AS18, DKS19, SSV19, GRT19, GKR19]

SLIDE 62

Thank You!