

  1. Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 7: Computational Complexity of Learning. Hardness of Improper Learning (continued); Agnostic Learning

  2. Hardness of Learning via Crypto
  • Easy to generate a random pair (K, d_K)
  • K, a ↦ f_K(a) is easy
  • K, b ↦ f_K⁻¹(b) is very hard: no poly-time algorithm succeeds for a non-negligible fraction of (K, b)
  • d_K, b ↦ f_K⁻¹(b) is easy
  ⟹ Hard to learn H = { h_K : (b, i) ↦ [f_K⁻¹(b)]_i }
  ⟹ Hard to learn any class containing all the h_K, e.g. the class of poly-time computable functions

  3. Hardness of Learning via Crypto
  • Assumption (Discrete Cube Root): no poly-time algorithm computes ∛b mod K for a non-negligible fraction of b, where K = pq for primes p, q with 3 ∤ (p-1)(q-1)
  • K, a ↦ f_K(a) = a³ mod K is easy
  • K, b ↦ f_K⁻¹(b) is very hard: no poly-time algorithm for a non-negligible fraction of (K, b)
  • d_K, b ↦ f_K⁻¹(b) = b^(d_K) mod K = ∛b mod K is easy
  ⟹ Hard to learn H = { h_K : (b, i) ↦ [f_K⁻¹(b)]_i } ⟹ hard to learn poly-time computable functions
  • If h_K ∈ H' for every K, then H' is hard to learn:
    • h_K is computable by a log-depth logic circuit ⟹ hard to learn log-depth circuits
    • h_K is computable by a log-depth neural net ⟹ hard to learn log-depth neural nets
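
To make the trapdoor structure above concrete, here is a toy numeric sketch (not from the slides; the tiny primes and the helper names f_K, f_K_inv, h_K are illustrative assumptions): cubing mod K is easy, inverting is easy given the secret exponent d_K, and the hypothesis h_K simply reads off one bit of the cube root.

```python
from math import gcd

# Toy parameters (real instances use huge primes); p, q chosen so 3 does not divide (p-1)(q-1)
p, q = 11, 17            # (p-1)(q-1) = 160, not divisible by 3
K = p * q                # public key
phi = (p - 1) * (q - 1)
assert gcd(3, phi) == 1
d_K = pow(3, -1, phi)    # secret trapdoor: inverse of 3 mod phi(K)

def f_K(a):              # easy direction: a -> a^3 mod K
    return pow(a, 3, K)

def f_K_inv(b):          # easy only with the trapdoor d_K
    return pow(b, d_K, K)

def h_K(b, i):           # hypothesis: i-th bit of the cube root of b mod K
    return (f_K_inv(b) >> i) & 1

a = 42
b = f_K(a)
assert f_K_inv(b) == a   # the trapdoor recovers a
print(b, h_K(b, 0), h_K(b, 3))
```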

  4. Hardness of Learning via Crypto [Michael Kearns]
  • Public-key crypto is possible ⟹ hard to learn poly-time computable functions
  • Hardness of Discrete Cube Root ⟹
    • hard to learn log(n)-depth logic circuits
    • hard to learn log(n)-depth poly-size neural networks
  • Hardness of breaking RSA ⟹
    • hard to learn poly-length logical formulas
    • hard to learn poly-size automata
    • hard to learn push-down automata, i.e. regexps
    • for some depth d, hard to learn poly-size depth-d threshold circuits (a unit outputs one iff the number of its input units that are one exceeds its threshold)
  • Hardness of lattice-shortest-vector based cryptography ⟹ hard to learn intersections of n^r halfspaces (for any r > 0)

  5. Intersections of Halfspaces [Adam Klivans, Sasha Sherstov]
  • H_n^(k(n)) = { x ↦ ∧_{i=1..k(n)} [⟨w_i, x⟩ > 0] | w_1, …, w_{k(n)} ∈ ℝⁿ }
  • The unique shortest lattice vector problem:
    • SVP(v_1, v_2, …, v_n ∈ ℝⁿ) = arg min_{a_1, a_2, …, a_n ∈ ℤ} ‖a_1 v_1 + a_2 v_2 + ⋯ + a_n v_n‖
    • O(n^1.5)-uSVP: only required to return the SVP if the next-shortest lattice vector is O(n^1.5) times longer
  • O(n^1.5)-uSVP ∉ RP
    ⇓ Lattice-based cryptosystem is secure
    ⇓ For any r > 0 and k(n) = n^r, hard to learn H_n^(k(n))
    ⇓ Hard to learn 2-layer NNs with n^r hidden units
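
As a concrete illustration of the class H_n^(k(n)) and the 2-layer network it maps to, a minimal NumPy sketch (the function names are assumptions, not from the slides) evaluating an intersection-of-halfspaces hypothesis both directly and as a thresholded 2-layer net with k hidden units:

```python
import numpy as np

def intersection_of_halfspaces(W, x):
    """h(x) = AND_i [<w_i, x> > 0], with the rows of W as the w_i."""
    return bool(np.all(W @ x > 0))

def two_layer_threshold_net(W, x):
    """Same predictor as a 2-layer net: k hidden threshold units,
    then an output unit that fires iff all k hidden units fire."""
    hidden = (W @ x > 0).astype(int)         # k hidden threshold units
    return bool(hidden.sum() >= W.shape[0])  # output threshold = k

rng = np.random.default_rng(0)
n, k = 5, 3
W = rng.standard_normal((k, n))
x = rng.standard_normal(n)
assert intersection_of_halfspaces(W, x) == two_layer_threshold_net(W, x)
```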

  6. Hardness of Learning via Crypto
  • Easy to generate a random pair (K, d_K)
  • K, a ↦ f_K(a) is easy
  • K, b ↦ f_K⁻¹(b) is very hard: no poly-time algorithm for a non-negligible fraction of (K, b)
  • d_K, b ↦ f_K⁻¹(b) is easy
  ⟹ Hard to learn H = { h_K : (b, i) ↦ [f_K⁻¹(b)]_i } ⟹ hard to learn poly-time computable functions

  7. Hardness of Learning via Crypto
  • Easy to generate a random pair (K, d_K)
  • K, a ↦ f_K(a) is easy
  • K, b ↦ f_K⁻¹(b) is very hard:
    • no poly-time algorithm for a non-negligible fraction of (K, b)
    • no poly-time algorithm for all K and almost all b
  • d_K, b ↦ f_K⁻¹(b) is easy
  ⟹ Hard to learn H = { h_K : (b, i) ↦ [f_K⁻¹(b)]_i } ⟹ hard to learn poly-time computable functions

  8. Hardness of Learning: Take II [Amit Daniely]
  • Recall how we proved hardness of proper learning:
    • Reduction from deciding consistency with H
    • If we had an efficient proper learner, we could train it and thereby find a consistent hypothesis in H whenever one exists
  • Problem: if learning is not proper, the learner might return a good hypothesis not in H, even though D is not consistent with H
  • Instead: reduction from deciding between two possibilities:
    • The sample is consistent with H: for every consistent sample, return 1 w.p. ≥ 3/4 (over the randomization in the algorithm)
    • The sample comes from a random "unpredictable" distribution, e.g. sampled such that the labels y are independent of x: for all but a negligible fraction of samples S ∼ D^m, return 0 w.p. ≥ 3/4

  9. Hardness Relative to RSAT [Amit Daniely]
  • RSAT assumption: for some f(K) = ω(1), there is no poly-time randomized algorithm that gets as input a K-SAT formula with n^(f(K)) constraints and:
    • If the input is satisfiable, then w.p. ≥ 3/4 (over the randomization in the algorithm) it outputs 1
    • If each constraint is generated independently and uniformly at random, then with probability approaching 1 (as n → ∞) over the formula, w.p. ≥ 3/4 (over the randomization in the algorithm) it outputs 0
  • Theorem: under the RSAT assumption,
    • Poly-length DNFs are not efficiently PAC learnable, e.g. h(x) = (x₁ ∧ x₇ ∧ x₁₅ ∧ x₁₇) ∨ (x₂ ∧ x₂₄) ∨ ⋯
    • Intersections of ω(log n) halfspaces are not efficiently PAC learnable ⟹ 2-layer neural networks with O(log^1.1 n) hidden units are not efficiently PAC learnable
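
A minimal sketch (toy sizes; the function names and the choice f(K) = 2 are assumptions for illustration) of the two input regimes the RSAT assumption distinguishes: a formula that is satisfiable by construction versus one whose n^(f(K)) constraints are drawn independently and uniformly at random.

```python
import random

def random_clause(n, K, rng):
    """One uniformly random K-SAT constraint over n variables."""
    vars_ = rng.sample(range(n), K)
    return [(v, rng.random() < 0.5) for v in vars_]   # (variable index, negated?)

def random_formula(n, K, m, rng):
    return [random_clause(n, K, rng) for _ in range(m)]

def planted_formula(n, K, m, rng):
    """Satisfiable by construction: every clause is checked against a hidden assignment."""
    hidden = [rng.random() < 0.5 for _ in range(n)]
    formula = []
    while len(formula) < m:
        c = random_clause(n, K, rng)
        if any(hidden[v] != neg for v, neg in c):      # clause satisfied by hidden assignment
            formula.append(c)
    return formula

rng = random.Random(0)
n, K, f_K = 30, 3, 2        # toy f(K); the assumption only needs f(K) = omega(1)
m = n ** f_K                # n^{f(K)} constraints
sat_like = planted_formula(n, K, m, rng)
rand_like = random_formula(n, K, m, rng)
# The RSAT assumption says no poly-time algorithm reliably outputs 1 on inputs like the
# first and 0 on inputs like the second.
```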

  10. Hardness of Learning
  • Efficiently Properly Learnable:
    • Axis-aligned rectangles in n dimensions
    • Halfspaces in n dimensions
    • Conjunctions on n variables
  • Efficiently Learnable, but not Properly:
    • 3-term DNFs
  • Not Efficiently Learnable:
    • DNF formulas of size poly(n)
    • Generic logical formulas of size poly(n)
    • Neural nets with at most poly(n) units
    • Functions computable in poly(n) time

  11. Realizable vs Agnostic
  • Definition: a family H_n of hypothesis classes is efficiently properly PAC-learnable if there exists a learning rule A such that ∀n, ∀ε, δ > 0, ∃m(n, ε, δ) such that for every D with L_D(h) = 0 for some h ∈ H_n, with probability ≥ 1 − δ over S ∼ D^m(n,ε,δ): L_D(A(S)) ≤ ε; moreover A(S)(x) can be computed in time poly(n, 1/ε, log 1/δ) and A always outputs a predictor in H_n
  • Definition: a family H_n of hypothesis classes is efficiently properly agnostically PAC-learnable if there exists a learning rule A such that ∀n, ∀ε, δ > 0, ∃m(n, ε, δ) such that for every D, with probability ≥ 1 − δ over S ∼ D^m(n,ε,δ): L_D(A(S)) ≤ inf_{h ∈ H_n} L_D(h) + ε; moreover A(S)(x) can be computed in time poly(n, 1/ε, log 1/δ) and A always outputs a predictor in H_n

  12. Conditions for Efficient Agnostic Learning
  • ERM_H(S) = arg min_{h ∈ H} L_S(h)
  • Claim: if
    • VCdim(H_n) ≤ poly(n), and
    • each h ∈ H_n is computable in time poly(n), and
    • there is a poly-time (in the size of the input) algorithm for ERM_H (i.e. one that returns some ERM),
    then H_n is efficiently agnostically properly PAC learnable.
  • AGREEMENT_H(S, k) = 1 iff ∃ h ∈ H with L_S(h) ≤ 1 − k/|S|
  • Claim: if H_n is efficiently properly agnostically PAC learnable, then AGREEMENT_H ∈ RP
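
A brute-force sketch of the two objects in the claims (illustrative only; the finite threshold class and the helper names are assumptions): ERM over a finite hypothesis class, and AGREEMENT_H decided by checking whether the empirical risk minimizer fits at least k of the examples.

```python
def empirical_error(h, S):
    """L_S(h): fraction of (x, y) in S that h gets wrong."""
    return sum(h(x) != y for x, y in S) / len(S)

def erm(H, S):
    """ERM_H(S) = argmin_{h in H} L_S(h); brute force over a finite class H."""
    return min(H, key=lambda h: empirical_error(h, S))

def agreement(H, S, k):
    """AGREEMENT_H(S, k) = 1 iff some h in H has L_S(h) <= 1 - k/|S|."""
    return 1 if empirical_error(erm(H, S), S) <= 1 - k / len(S) else 0

# Toy class: thresholds on the line (a finite grid of thresholds for the sketch)
H = [lambda x, t=t: x >= t for t in [0.1 * i for i in range(11)]]
S = [(0.1, False), (0.2, False), (0.75, True), (0.9, True), (0.3, True)]
print(empirical_error(erm(H, S), S), agreement(H, S, 4))
```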

  13. What is Properly Agnostically Learnable?
  • Poly-time functions? No! (not even in the realizable case)
  • Poly-length logical formulas? No! (not even in the realizable case)
  • Poly-size depth-2 neural networks? No! (not even in the realizable case)
  • Halfspaces (linear predictors)? No!
    • X_n = {0,1}ⁿ, H_n = { x ↦ [⟨w, x⟩ > 0] | w ∈ ℝⁿ }
    • Claim: AGREEMENT_H is NP-hard (optional HW problem)
    • Conclusion: if NP ≠ RP, halfspaces are not efficiently properly agnostically learnable
  • Conjunctions? No! Also NP-hard!
  • Unions of segments on the line? Yes!
    • X_n = [0,1], H_n = { x ↦ ∨_{i=1..n} [a_i ≤ x ≤ b_i] | a_i, b_i ∈ [0,1] }
    • Efficiently properly agnostically PAC learnable!
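
As a hint of why the last class is tractable, a sketch of ERM for the simplest case of a single segment (a simplification made for brevity, not from the slides; a similar dynamic program handles a union of segments): an optimal segment can always take its endpoints from the sample, giving an O(m²)-candidate search.

```python
def erm_single_segment(S):
    """ERM for the one-segment case of the class on the slide: predictors x -> [a <= x <= b].
    S is a list of (x, y) with x in [0, 1] and y in {True, False}."""
    xs = sorted({x for x, _ in S})
    candidates = [(a, b) for a in xs for b in xs if a <= b] + [(1.0, 0.0)]  # last = empty segment
    def err(a, b):
        return sum((a <= x <= b) != y for x, y in S) / len(S)
    a, b = min(candidates, key=lambda ab: err(*ab))
    return (a, b), err(a, b)

S = [(0.05, False), (0.2, True), (0.35, True), (0.5, False), (0.6, True), (0.9, False)]
print(erm_single_segment(S))   # a best segment and its empirical error (1/6 here)
```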

  14. Source of the Hardness
  • min_{h ∈ H} Σ_i ℓ(h_w(x_i); y_i),   with h_w(x) = ⟨w, x⟩
  • ℓ_01(h(x); y) = [y·h(x) ≤ 0]
  • [Figure: ℓ_01(h(x); y = −1) and ℓ_sqr(h(x); y = −1), plotted as functions of h(x) ∈ ℝ]
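
The plot on this slide did not survive the export; a short matplotlib sketch (illustrative) reproducing the comparison it showed, the 0-1 loss versus the squared loss as functions of the real-valued prediction h(x) for a negative example:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-2, 2, 400)        # z = h(x), the real-valued prediction
y = -1                             # the slide plots the case y = -1
loss_01 = (y * z <= 0).astype(float)
loss_sqr = (y - z) ** 2

plt.plot(z, loss_01, label=r"$\ell_{01}(h(x);\,y=-1)$")
plt.plot(z, loss_sqr, label=r"$\ell_{sqr}(h(x);\,y=-1)$")
plt.xlabel("h(x)")
plt.legend()
plt.show()
```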

  15. Convexity
  • Definition (convex set): a set C in a vector space is convex if ∀u, v ∈ C and for all α ∈ [0,1]: αu + (1 − α)v ∈ C

  16. Convexity
  • Definition (convex function): a function f: C ↦ ℝ is convex if ∀u, v ∈ C and for all α ∈ [0,1]: f(αu + (1 − α)v) ≤ αf(u) + (1 − α)f(v)
  • [Figure: f(αu + (1 − α)v) lies below the chord value αf(u) + (1 − α)f(v) between the points (u, f(u)) and (v, f(v))]
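
A small numeric spot-check of this definition (illustrative; the sampled-α test and the example functions are assumptions, not from the slides), previewing the next slide: the squared and logistic losses pass, while a 0-1-style step function does not.

```python
import math, random

def is_convex_along_segment(f, u, v, trials=1000, tol=1e-12):
    """Spot-check the definition on the slide:
    f(a*u + (1-a)*v) <= a*f(u) + (1-a)*f(v) for sampled a in [0, 1]."""
    for _ in range(trials):
        a = random.random()
        if f(a * u + (1 - a) * v) > a * f(u) + (1 - a) * f(v) + tol:
            return False
    return True

print(is_convex_along_segment(lambda z: z ** 2, -3.0, 5.0))                       # True: squared
print(is_convex_along_segment(lambda z: math.log(1 + math.exp(-z)), -3.0, 5.0))   # True: logistic
print(is_convex_along_segment(lambda z: 1.0 if z <= 0 else 0.0, -1.0, 1.0))       # False (w.h.p.): step loss
```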

  17. Using a surrogate loss
  • min_{h ∈ H} Σ_i ℓ(h_w(x_i); y_i)
  • Instead of ℓ_01(z; y), use a surrogate ℓ(z; y) such that:
    • ∀y: ℓ(z; y) is convex in z
    • ∀z, y: ℓ_01(z; y) ≤ ℓ(z; y)
  • E.g.:
    • ℓ_sqr(z; y) = (y − z)²
    • ℓ_logistic(z; y) = log(1 + exp(−yz))
    • ℓ_hinge(z; y) = [1 − yz]₊ = max{0, 1 − yz}
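
A short sketch implementing the surrogates listed above and spot-checking the upper-bound requirement on a grid (the base-2 rescaling used for the logistic loss is an added note, not from the slide; as written, log(1 + exp(−yz)) dips slightly below 1 near yz = 0).

```python
import math

def loss_01(z, y):        # 0-1 loss on the real-valued prediction z, label y in {-1, +1}
    return 1.0 if y * z <= 0 else 0.0

def loss_sqr(z, y):       # squared loss (y - z)^2
    return (y - z) ** 2

def loss_logistic(z, y):  # logistic loss log(1 + exp(-y z))
    return math.log(1 + math.exp(-y * z))

def loss_hinge(z, y):     # hinge loss [1 - y z]_+
    return max(0.0, 1 - y * z)

# Spot-check the surrogate property ell_01(z; y) <= ell(z; y) on a grid.
# For the logistic loss the bound holds after rescaling to a base-2 logarithm.
for y in (-1, 1):
    for z in [k / 10 for k in range(-30, 31)]:
        assert loss_01(z, y) <= loss_sqr(z, y) + 1e-12
        assert loss_01(z, y) <= loss_hinge(z, y) + 1e-12
        assert loss_01(z, y) <= loss_logistic(z, y) / math.log(2) + 1e-12
```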
