Online Learning
Machine Learning

Some slides based on lectures from Dan Roth, Avrim Blum and others
Big picture
• Linear models: Perceptron, Winnow, Support Vector Machines, …
• How good is a learning algorithm? Online learning, PAC, Empirical Risk Minimization, …
Mistake bound learning
• The mistake bound model
• A proof-of-concept mistake bound algorithm: the Halving algorithm
• Examples
• Representations and ease of learning
Coming up…
• Mistake-driven learning
• Learning algorithms for learning a linear function over the feature space
  – Perceptron (with many variants)
  – General Gradient Descent view
• Issues to watch out for
  – Importance of Representation
  – Complexity of Learning
  – More about features
Motivation

Consider a learning problem in a very high dimensional space, with features {x₁, x₂, ⋯, x₁₀₀₀₀₀₀}, and assume that the function space is very sparse: the function of interest depends on a small number of attributes, e.g.

    f = x₂ ∧ x₃ ∧ x₄ ∧ x₅ ∧ x₁₀₀

“Middle Eastern deserts are known for their sweetness”: the context-sensitive text correction example (deserts vs. desserts), where the feature space is huge but only a few features matter.
• Can we develop an algorithm that depends only weakly on the dimensionality and mostly on the number of relevant attributes? (A first attempt is sketched below.)
• How should we represent the hypothesis?
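As a first attempt, here is a minimal Python sketch of the classic elimination algorithm for learning a monotone conjunction like f in this setting. It is not from the slides but a standard construction with illustrative names: start with the conjunction of all n variables and, on every mistake, drop the variables that are 0 in that (necessarily positive) example.

```python
# Elimination algorithm for learning a monotone conjunction over
# x_1..x_n. The hypothesis is the set of indices still conjoined;
# it starts as the conjunction of all n variables.

def learn_conjunction(n, stream):
    # stream yields (x, y): x is a tuple of n bits, y = f(x) in {0, 1}
    hypothesis = set(range(n))  # indices of the conjoined variables
    mistakes = 0
    for x, y in stream:
        y_hat = int(all(x[i] == 1 for i in hypothesis))
        if y_hat != y:
            # The hypothesis conjoins a superset of f's variables, so it
            # never fires on a true negative; mistakes happen only on
            # positive examples. Drop every variable that is 0 in this
            # example; f's own variables are all 1 here, so they survive.
            mistakes += 1
            hypothesis -= {i for i in range(n) if x[i] == 0}
    return hypothesis, mistakes
```

Each mistake removes at least one variable, so this makes at most n mistakes. That still scales linearly with the dimensionality; the Winnow algorithm from the big picture improves this to a mistake bound that grows only logarithmically with n, which is exactly the “depends mostly on the number of relevant attributes” behavior asked for above.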
An illustration of mistake-driven learning

The learner maintains a current hypothesis hᵢ. Given one example x, it outputs the prediction hᵢ(x).

Loop forever:
1. Receive example x
2. Make a prediction using the current hypothesis: hᵢ(x)
3. Receive the true label for x
4. If hᵢ(x) is not correct, update hᵢ to hᵢ₊₁

We only need to define how prediction and update behave. Can such a simple scheme work? How do we quantify what “work” means?
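A minimal sketch of this protocol in Python (the names are illustrative; predict and update are whatever a concrete learner defines):

```python
# Generic online, mistake-driven learning loop.
# `learner` exposes predict(x) and update(x, y);
# `stream` yields (x, true_label) pairs one at a time.

def online_learn(learner, stream):
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)  # predict with the current hypothesis h_i
        if y_hat != y:              # the true label y is the feedback
            learner.update(x, y)    # on a mistake, move from h_i to h_{i+1}
            mistakes += 1
    return mistakes
```

Counting mistakes, rather than measuring accuracy on a held-out set, is exactly how this scheme is evaluated in the model defined next.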
Mistake bound algorithms

• Setting:
  – Instance space: 𝒳 (dimensionality n)
  – Target f: 𝒳 → {0,1}, f ∈ C, the concept class (parameterized by n)
• Learning protocol:
  – Learner is given an example x ∈ 𝒳 (the sequence of examples may be arbitrary)
  – Learner predicts h(x) and is then given f(x) ⟵ the feedback
• Performance: the learner makes a mistake when h(x) ≠ f(x)
  – M_A(f, S): number of mistakes algorithm A makes on a sequence S of examples for the target function f
  – M_A(C) = max_{f∈C, S} M_A(f, S): the maximum possible number of mistakes made by A for any target function in C and any sequence S of examples
• Algorithm A is a mistake bound algorithm for the concept class C if M_A(C) is polynomial in the dimensionality n
Learnability in the mistake bound model
• Algorithm A is a mistake bound algorithm for the concept class C if M_A(C) is polynomial in the dimensionality n
  – That is, the maximum number of mistakes it makes over any sequence of inputs (perhaps even an adversarially chosen one) is polynomial in the dimensionality
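As a proof of concept that such bounds are achievable, here is a minimal Python sketch of the Halving algorithm from the outline, for a finite concept class given as an explicit list of hypotheses (representing hypotheses as Python callables is an illustrative assumption). It predicts with the majority vote of all hypotheses still consistent with the feedback so far:

```python
# Halving algorithm: maintain the version space (all hypotheses in C
# that agree with the feedback seen so far) and predict by majority
# vote. Each mistake eliminates at least half of the survivors, so
# it makes at most log2(|C|) mistakes for any target f in C.

def halving(concept_class, stream):
    version_space = list(concept_class)  # each h maps x to 0 or 1
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in version_space)
        y_hat = 1 if 2 * votes > len(version_space) else 0  # majority
        if y_hat != y:
            mistakes += 1
        # keep only the hypotheses that agree with the true label
        version_space = [h for h in version_space if h(x) == y]
    return mistakes
```

Note the bound is log₂|C|, not directly a polynomial in n; for classes like conjunctions over n variables, |C| ≤ 3ⁿ, so log₂|C| = O(n) and the Halving algorithm is indeed a mistake bound algorithm. Its catch is computational: maintaining the version space explicitly is exponential, which is why the practical algorithms that follow (Perceptron, Winnow) matter.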