Computational and Statistical Learning Theory
TTIC 31120, Prof. Nati Srebro
Lecture 7: Computational Complexity of Learning - Hardness of Improper Learning (continued); Agnostic Learning
Hardness of Learning via Crypto
• Easy to generate a random key pair (K, E_K)
• (E_K, x) ↦ f_K(x) easy (encryption with the public key)
• (K, y) ↦ f_K^{-1}(y) easy (decryption with the private key)
• y ↦ f_K^{-1}(y) very hard without K: no poly-time algorithm succeeds for a non-negligible fraction of (K, y)
⇒ Hard to learn H = { h_K : y ↦ f_K^{-1}(y) }
⇒ Hard to learn poly-time computable functions (each h_K is poly-time computable given K)
Hardness of Learning via Crypto
• Assumption (Discrete Cube Root): no poly-time algorithm computes x^{1/3} mod K for a non-negligible fraction of (K, x), where K = pq with p, q primes such that 3 ∤ (p−1)(q−1)
• (K, x) ↦ x³ mod K easy
• y ↦ y^{1/3} mod K very hard without the factorization of K; easy given it
⇒ Hard to learn H = { h_K : y ↦ y^{1/3} mod K }, i.e. f_K(x) = x³ mod K and h_K = f_K^{-1}
• h_K is computable by a log-depth logic circuit ⇒ hard to learn log-depth logic circuits
• h_K is computable by a log-depth neural net ⇒ hard to learn log-depth neural nets
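To make the cube-root construction concrete, here is a minimal Python sketch (my own illustration, not part of the lecture): toy primes stand in for cryptographic-size ones, and the helper names gen_key, cube, cube_root are assumptions of the sketch. It shows that f_K(x) = x³ mod K is easy to compute, and that f_K^{-1} is easy given the trapdoor exponent d = 3^{-1} mod (p−1)(q−1), which is exactly the poly-time-computable h_K that is conjectured hard to learn from examples.

```python
from math import gcd

def gen_key(p, q):
    """Return (public modulus K, trapdoor d) for primes p, q with 3 not dividing (p-1)(q-1)."""
    assert gcd(3, (p - 1) * (q - 1)) == 1
    K = p * q
    d = pow(3, -1, (p - 1) * (q - 1))   # trapdoor: the inverse of 3 modulo (p-1)(q-1)
    return K, d

def cube(x, K):
    """The easy direction f_K: x -> x^3 mod K."""
    return pow(x, 3, K)

def cube_root(y, K, d):
    """h_K = f_K^{-1}: y -> y^(1/3) mod K, easy only when the trapdoor d is known."""
    return pow(y, d, K)

K, d = gen_key(1019, 2027)        # toy primes; real instances use primes of ~1024 bits
x = 123456
y = cube(x, K)
assert cube_root(y, K, d) == x    # labeled examples (y, h_K(y)) are cheap to generate, yet h_K is conjectured hard to learn
```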
Hardness of Learning via Crypto [Michael Kearns]
• Public-key crypto is possible ⇒ hard to learn poly-time functions
• Hardness of Discrete Cube Root
  ⇒ hard to learn log(n)-depth logic circuits
  ⇒ hard to learn log(n)-depth poly-size neural networks
• Hardness of breaking RSA
  ⇒ hard to learn poly-length logical formulas
  ⇒ hard to learn poly-size automata
  ⇒ hard to learn push-down automata
  ⇒ for some depth d, hard to learn poly-size depth-d threshold circuits (a unit outputs one iff the number of its input units that are one exceeds its threshold)
• Hardness of lattice-shortest-vector based cryptography
  ⇒ hard to learn intersections of n^ε halfspaces (for any ε > 0)
Intersections of Halfspaces [Adam Klivans & Sasha Sherstov]
• I_t = { x ↦ ⋀_{i=1..t} (⟨w_i, x⟩ > 0) | w_1, …, w_t ∈ ℝ^n }
• Õ(n^{1.5})-uSVP is hard ⇒ lattice-based cryptosystem is secure ⇒ for any ε > 0, hard to learn I_t with t(n) = n^ε
  ⇒ hard to learn 2-layer neural nets with n^ε hidden units
• The unique shortest lattice vector problem:
  • SVP(w_1, w_2, …, w_n ∈ ℝ^n) = argmin over (a_1, …, a_n) ∈ ℤ^n \ {0} of ‖a_1 w_1 + a_2 w_2 + ⋯ + a_n w_n‖
  • Õ(n^{1.5})-uSVP: only required to return the shortest vector if the next-shortest is Õ(n^{1.5}) times longer
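The step from intersections of halfspaces to 2-layer networks is just a change of representation: each halfspace becomes a threshold hidden unit and the output unit is an AND gate. A short numpy sketch (my own illustration; the function names are not from the slides):

```python
import numpy as np

def intersection_of_halfspaces(W, x):
    """W: (t, n) matrix of halfspace normals. Returns 1 iff <w_i, x> > 0 for every i."""
    return int(np.all(W @ x > 0))

def two_layer_threshold_net(W, x):
    """Hidden unit i fires iff <w_i, x> > 0; the output unit fires iff all t hidden units fire."""
    hidden = (W @ x > 0).astype(int)          # first layer: t threshold units
    return int(hidden.sum() >= W.shape[0])    # output unit: threshold at t (an AND gate)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 20))              # t = 5 halfspaces in n = 20 dimensions
for _ in range(1000):
    x = rng.standard_normal(20)
    assert intersection_of_halfspaces(W, x) == two_layer_threshold_net(W, x)
```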
Hardness of Learning via Crypto (revisited)
• Easy to generate a random key pair (K, E_K); (E_K, x) ↦ f_K(x) easy; (K, y) ↦ f_K^{-1}(y) easy
• y ↦ f_K^{-1}(y) very hard: no poly-time algorithm for a non-negligible fraction of (K, y)
  vs. no poly-time algorithm for all K and almost all y
⇒ Hard to learn H = { h_K : y ↦ f_K^{-1}(y) } ⇒ hard to learn poly-time computable functions
Hardness of Learning: Take II [Amit Daniely]
• Recall how we proved hardness of proper learning:
  • Reduction from deciding consistency with H
  • If we had an efficient proper learner, we could train it and thereby find a consistent hypothesis in H whenever one exists
• Problem: if learning is not proper, the learner might return a good hypothesis outside H even when S is not consistent with H
• Instead: reduction from deciding between two possibilities:
  • The sample is consistent with H
    • For every consistent sample, return 1 w.p. ≥ 3/4 (over the randomization in the algorithm)
  • The sample comes from a random, unpredictable distribution
    • E.g. sampled so that the labels y are independent of x
    • For all but a negligible fraction of samples S ∼ D^m, return 0 w.p. ≥ 3/4
Hardness Relative to RSAT [Amit Daniely]
• RSAT assumption: for some f(K) = ω(1), there is no poly-time randomized algorithm that gets as input a K-SAT formula over n variables with n^{f(K)} constraints, and:
  • If the input is satisfiable, then w.p. ≥ 3/4 (over the randomization in the algorithm) it outputs 1
  • If each constraint is generated independently and uniformly at random, then with probability approaching 1 (as n → ∞) over the formula, it outputs 0 w.p. ≥ 3/4 (over the randomization in the algorithm)
• Theorem: Under the RSAT assumption,
  • Poly-length DNFs are not efficiently PAC learnable
    e.g. h(x) = (x_1 ∧ x_7 ∧ x_15 ∧ x_17) ∨ (x_2 ∧ x_24) ∨ ⋯
  • Intersections of ω(log n) halfspaces are not efficiently PAC learnable
    ⇒ 2-layer neural networks with ω(log^{1.1} n) hidden units are not efficiently PAC learnable
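The "random" side of the RSAT assumption is easy to describe in code. The sketch below is my own illustration (the brute-force satisfiability check is only for tiny n and is of course not the conjectured-impossible refuter): it samples a K-SAT formula in which each constraint is drawn independently and uniformly at random; with m ≫ n clauses such a formula is unsatisfiable with high probability, which is what a refutation algorithm would need to certify.

```python
import itertools, random

def random_ksat(n, m, K, rng=random):
    """m clauses over variables 1..n; each clause: K distinct variables, each negated w.p. 1/2."""
    formula = []
    for _ in range(m):
        vars_ = rng.sample(range(1, n + 1), K)
        formula.append([v if rng.random() < 0.5 else -v for v in vars_])
    return formula

def satisfiable(formula, n):
    """Brute force (exponential in n) -- only for tiny n, to see that random dense formulas are typically UNSAT."""
    for bits in itertools.product([False, True], repeat=n):
        assign = {i + 1: bits[i] for i in range(n)}
        if all(any(assign[abs(lit)] == (lit > 0) for lit in clause) for clause in formula):
            return True
    return False

# With m = n^2 >> n clauses (think m = n^{f(K)}), a uniformly random 3-SAT formula is UNSAT w.h.p.
print(satisfiable(random_ksat(n=12, m=144, K=3), n=12))
```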
Hardness of Learning
• Efficiently Properly Learnable:
  • Axis-aligned rectangles in n dimensions
  • Halfspaces in n dimensions
  • Conjunctions on n variables
• Efficiently Learnable, but not Properly:
  • 3-term DNFs
• Not Efficiently Learnable:
  • DNF formulas of size poly(n)
  • Generic logical formulas of size poly(n)
  • Neural nets with at most poly(n) units
  • Functions computable in poly(n) time
Realizable vs Agnostic
• Definition: A family H_n of hypothesis classes is efficiently properly PAC-learnable if there exists a learning rule A such that
  • ∀ n, ε, δ > 0, ∃ m(n, ε, δ) ≤ poly(n, 1/ε, log(1/δ)) s.t. for every D with L_D(h) = 0 for some h ∈ H_n:
    ℙ_{S ∼ D^{m(n,ε,δ)}} ( L_D(A(S)) ≤ ε ) ≥ 1 − δ,
  • A(S)(x) can be computed in time poly(n, 1/ε, log(1/δ)),
  • and A always outputs a predictor in H_n
• Definition: A family H_n of hypothesis classes is efficiently properly agnostically PAC-learnable if there exists a learning rule A such that
  • ∀ n, ε, δ > 0, ∃ m(n, ε, δ) ≤ poly(n, 1/ε, log(1/δ)) s.t. for every D:
    ℙ_{S ∼ D^{m(n,ε,δ)}} ( L_D(A(S)) ≤ inf_{h ∈ H_n} L_D(h) + ε ) ≥ 1 − δ,
  • A(S)(x) can be computed in time poly(n, 1/ε, log(1/δ)),
  • and A always outputs a predictor in H_n
Conditions for Efficient Agnostic Learning
ERM_H(S) = argmin_{h ∈ H} L_S(h)
• Claim: If
  • VCdim(H_n) ≤ poly(n), and
  • each h ∈ H_n is computable in time poly(n), and
  • there is a poly-time (in the size of its input) algorithm for ERM_H (i.e. one that returns some empirical risk minimizer),
  then H_n is efficiently agnostically properly PAC learnable.
AGREE_H(S, α) = 1 iff ∃ h ∈ H s.t. L_S(h) ≤ 1 − α (i.e. some h ∈ H agrees with at least an α fraction of S)
• Claim: If H_n is efficiently properly agnostically PAC learnable, then AGREE_H ∈ RP
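A minimal sketch of the first claim (my own illustration, with an illustrative constant and made-up interface names erm_oracle and draw_sample): given a poly-time ERM oracle and VCdim(H_n) ≤ poly(n), the agnostic learner simply draws a uniform-convergence-sized sample and returns its empirical risk minimizer, which is both proper and poly-time.

```python
import math

def agnostic_learn(erm_oracle, draw_sample, vc_dim, eps, delta):
    """erm_oracle(S) returns argmin_{h in H} L_S(h); draw_sample(m) returns m i.i.d. (x, y) pairs."""
    m = math.ceil(8 * (vc_dim + math.log(1 / delta)) / eps ** 2)   # uniform-convergence sample size (constant is illustrative)
    S = draw_sample(m)
    return erm_oracle(S)    # proper: the output is the ERM hypothesis, an element of H
```

Uniform convergence then guarantees that, with probability at least 1 − δ, the returned hypothesis has population loss within ε of inf_{h ∈ H_n} L_D(h).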
What is Properly Agnostically Learnable?
• Poly-time functions? No! (not even in the realizable case)
• Poly-length logical formulas? No! (not even in the realizable case)
• Poly-size depth-2 neural networks? No! (not even in the realizable case)
• Halfspaces (linear predictors)? No!
  • X_n = {0,1}^n, H_n = { x ↦ ⟨w, x⟩ > 0 | w ∈ ℝ^n }
  • Claim: AGREE_H is NP-hard (optional HW problem)
  • Conclusion: if RP ≠ NP, halfspaces are not efficiently properly agnostically learnable
• Conjunctions? No!
  • Also NP-hard!
• Unions of segments on the line? Yes!
  • X_n = [0,1], H_n = { x ↦ ⋁_{i=1..n} (a_i ≤ x ≤ b_i) | a_i, b_i ∈ [0,1] }
  • Efficiently properly agnostically PAC learnable! (ERM is solvable in poly time; see the sketch below)
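Here is the ERM step for unions of at most k segments, as a dynamic program over the sorted sample (a sketch of my own; the slide only asserts efficient learnability). The state tracks how many interval-runs have been opened and whether the current point lies inside one, so the whole computation takes O(m·k) time after sorting.

```python
def erm_union_of_intervals(points, k):
    """points: list of (x, y) with y in {+1, -1}; predict +1 inside a union of <= k intervals.
    Returns the minimum number of misclassified sample points over all such unions."""
    ys = [y for _, y in sorted(points)]
    INF = float("inf")
    # dp[j][s] = min mistakes so far using j interval-runs, with s = 1 iff the current point is inside a run
    dp = [[INF, INF] for _ in range(k + 1)]
    dp[0][0] = 0
    for y in ys:
        new = [[INF, INF] for _ in range(k + 1)]
        for j in range(k + 1):
            for s in (0, 1):
                if dp[j][s] == INF:
                    continue
                # place this point outside every interval: mistake iff its label is +1
                new[j][0] = min(new[j][0], dp[j][s] + (1 if y == +1 else 0))
                # place this point inside an interval: mistake iff its label is -1;
                # entering from "outside" opens a new interval (uses up one of the k)
                jj = j + (1 if s == 0 else 0)
                if jj <= k:
                    new[jj][1] = min(new[jj][1], dp[j][s] + (1 if y == -1 else 0))
        dp = new
    return min(min(row) for row in dp)

# Two positive clusters separated by a negative point: one interval must err once, two intervals need not.
sample = [(0.1, 1), (0.2, 1), (0.5, -1), (0.8, 1), (0.9, 1)]
print(erm_union_of_intervals(sample, k=1), erm_union_of_intervals(sample, k=2))   # 1 0
```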
Source of the Hardness
min_{h_w ∈ H} Σ_i ℓ(h_w(x_i); y_i),  with h_w(x) = ⟨w, x⟩ and ℓ_01(h(x); y) = 1[ y·h(x) ≤ 0 ]
[Figure: ℓ_01(h(x); y = −1) and ℓ_sqr(h(x); y = −1) plotted as functions of h(x) ∈ ℝ]
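The figure's point can also be seen numerically: as a function of the weights, the empirical 0/1 risk is piecewise constant, so it is non-convex and provides no gradient information. A toy scan (my own illustration) along one weight of a linear predictor with a bias feature:

```python
import numpy as np

a = np.array([-2.0, -0.5, 1.0, 3.0])
X = np.stack([a, np.ones_like(a)], axis=1)     # features (a_i, 1), so <w, x> = w1*a_i + w2
y = np.array([1, -1, 1, -1])

def zero_one_risk(w):
    return np.mean(y * (X @ w) <= 0)           # L_S(h_w) for h_w(x) = <w, x>

# Scanning the bias weight shows plateaus separated by jumps at the data points.
for b in np.linspace(-3.5, 2.5, 13):
    print(f"w = (1, {b:+.1f})   0/1 empirical risk = {zero_one_risk(np.array([1.0, b])):.2f}")
```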
Convexity
• Definition (convex set): A set C in a vector space is convex if for all v, w ∈ C and all λ ∈ [0,1]: λv + (1 − λ)w ∈ C
Convexity
• Definition (convex function): A function f : C → ℝ is convex if for all v, w ∈ C and all λ ∈ [0,1]: f(λv + (1 − λ)w) ≤ λf(v) + (1 − λ)f(w)
[Figure: the chord value λf(v) + (1 − λ)f(w) lies above the function value f(λv + (1 − λ)w) at the point λv + (1 − λ)w]
Using a Surrogate Loss
min_{h_w ∈ H} Σ_i ℓ(h_w(x_i); y_i)
• Instead of ℓ_01(z; y), use a surrogate ℓ(z; y) s.t.:
  • for every y, ℓ(z; y) is convex in z
  • for all z, y: ℓ_01(z; y) ≤ ℓ(z; y)
• E.g.
  • ℓ_sqr(z; y) = (y − z)²
  • ℓ_logistic(z; y) = log(1 + exp(−yz))
  • ℓ_hinge(z; y) = [1 − yz]_+ = max{0, 1 − yz}
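A quick numerical sanity check (my own, not from the slides) of the two surrogate properties for the three losses above. One caveat folded into the code: for the domination ℓ_01 ≤ ℓ to hold exactly, the logistic loss is taken with a base-2 logarithm; with the natural log it only dominates up to a constant factor.

```python
import numpy as np

def l01(z, y):        return (y * z <= 0).astype(float)
def l_sqr(z, y):      return (y - z) ** 2
def l_logistic(z, y): return np.log2(1.0 + np.exp(-y * z))   # base-2 log so that l01 <= l_logistic exactly
def l_hinge(z, y):    return np.maximum(0.0, 1.0 - y * z)

z = np.linspace(-3, 3, 601)
for y in (+1, -1):
    for surrogate in (l_sqr, l_logistic, l_hinge):
        assert np.all(surrogate(z, y) >= l01(z, y))               # upper bounds the 0/1 loss
        f = surrogate(z, y)                                       # midpoint convexity on the uniform grid:
        assert np.all(f[1:-1] <= (f[:-2] + f[2:]) / 2 + 1e-12)    # f((z1+z2)/2) <= (f(z1)+f(z2))/2
```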