Logistic Regression Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824
Administrative • Please start HW 1 early! • Questions are welcome!
Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data
  θ_MLE = argmax_θ P(Data | θ)
• Maximum a posteriori (MAP) estimation: choose θ that is most probable given the prior probability and the data
  θ_MAP = argmax_θ P(θ | Data) = argmax_θ P(Data | θ) P(θ) / P(Data)
Slide credit: Tom Mitchell
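A quick numerical illustration of the two principles (not from the slides): a minimal sketch that estimates the bias of a coin, where the flip data and the Beta(2, 2) prior are made-up assumptions.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads (illustrative only).
flips = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

# MLE: the theta that maximizes P(Data | theta) under a Bernoulli model
theta_mle = flips.mean()                                       # 7/10 = 0.7

# MAP with a Beta(a, b) prior on theta: mode of the posterior Beta
a, b = 2.0, 2.0                                                # prior "imaginary" counts
theta_map = (flips.sum() + a - 1) / (len(flips) + a + b - 2)   # 8/12 ≈ 0.667

print(theta_mle, theta_map)
```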
Naïve Bayes classifier
• Want to learn P(Y | X_1, ⋯, X_n)
• But this requires 2^n parameters...
• How about applying Bayes rule?
  P(Y | X_1, ⋯, X_n) = P(X_1, ⋯, X_n | Y) P(Y) / P(X_1, ⋯, X_n) ∝ P(X_1, ⋯, X_n | Y) P(Y)
  • P(X_1, ⋯, X_n | Y): needs (2^n − 1) × 2 parameters
  • P(Y): needs 1 parameter
• Apply the conditional independence assumption:
  P(X_1, ⋯, X_n | Y) = Π_{i=1}^n P(X_i | Y): needs n × 2 parameters
Naïve Bayes classifier
• Bayes rule:
  P(Y = y_k | X_1, ⋯, X_n) = P(Y = y_k) P(X_1, ⋯, X_n | Y = y_k) / Σ_j P(Y = y_j) P(X_1, ⋯, X_n | Y = y_j)
• Assume conditional independence among the X_i's:
  P(Y = y_k | X_1, ⋯, X_n) = P(Y = y_k) Π_i P(X_i | Y = y_k) / Σ_j P(Y = y_j) Π_i P(X_i | Y = y_j)
• Pick the most probable Y:
  Y ← argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)
Slide credit: Tom Mitchell
Example
• P(Y | X_1, X_2) ∝ P(Y) P(X_1, X_2 | Y)   (Bayes rule)
                  = P(Y) P(X_1 | Y) P(X_2 | Y)   (conditional independence)
• Estimated parameters:
  P(Y = 1) = 0.4    P(Y = 0) = 0.6
  P(X_1 = 1 | Y = 1) = 0.2    P(X_1 = 0 | Y = 1) = 0.8
  P(X_1 = 1 | Y = 0) = 0.7    P(X_1 = 0 | Y = 0) = 0.3
  P(X_2 = 1 | Y = 1) = 0.3    P(X_2 = 0 | Y = 1) = 0.7
  P(X_2 = 1 | Y = 0) = 0.9    P(X_2 = 0 | Y = 0) = 0.1
• Test example: X_1 = 1, X_2 = 0
  Y = 1: P(Y = 1) P(X_1 = 1 | Y = 1) P(X_2 = 0 | Y = 1) = 0.4 × 0.2 × 0.7 = 0.056
  Y = 0: P(Y = 0) P(X_1 = 1 | Y = 0) P(X_2 = 0 | Y = 0) = 0.6 × 0.7 × 0.1 = 0.042
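The arithmetic above can be reproduced directly; this is a small sketch with the slide's parameters hard-coded (the dictionary layout is just one possible encoding).

```python
# Parameters from the slide's example
P_Y  = {1: 0.4, 0: 0.6}
P_X1 = {(1, 1): 0.2, (0, 1): 0.8, (1, 0): 0.7, (0, 0): 0.3}   # P(X1 = x | Y = y), keyed by (x, y)
P_X2 = {(1, 1): 0.3, (0, 1): 0.7, (1, 0): 0.9, (0, 0): 0.1}   # P(X2 = x | Y = y), keyed by (x, y)

x1, x2 = 1, 0                                      # test example
scores = {y: P_Y[y] * P_X1[(x1, y)] * P_X2[(x2, y)] for y in (0, 1)}
print(scores)                                      # ≈ {0: 0.042, 1: 0.056}
print(max(scores, key=scores.get))                 # 1
```

Since 0.056 > 0.042, the classifier predicts Y = 1 for this test example.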
Naïve Bayes algorithm – discrete X_i
• For each value y_k
  Estimate π_k = P(Y = y_k)
  For each value x_ij of each attribute X_i
    Estimate θ_ijk = P(X_i = x_ij | Y = y_k)
• Classify X^test
  Y ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^test | Y = y_k)
  Y ← argmax_{y_k} π_k Π_i θ_ijk
Slide credit: Tom Mitchell
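A minimal sketch of this train-then-classify loop, assuming categorical features stored in a NumPy array; the names train_nb and classify_nb are illustrative, not from the slides.

```python
from collections import Counter
import numpy as np

def train_nb(X, y):
    """Estimate pi_k = P(Y = y_k) and theta_ijk = P(X_i = x_ij | Y = y_k) by counting."""
    n_examples, n_features = X.shape
    pi = {k: c / n_examples for k, c in Counter(y).items()}
    theta = {}                                     # theta[(i, k)] maps a feature value to its probability
    for k in pi:
        rows = X[y == k]
        for i in range(n_features):
            counts = Counter(rows[:, i])
            theta[(i, k)] = {v: c / len(rows) for v, c in counts.items()}
    return pi, theta

def classify_nb(x_test, pi, theta):
    """Return argmax_k pi_k * prod_i theta_ijk for a test example."""
    scores = {}
    for k, pk in pi.items():
        score = pk
        for i, v in enumerate(x_test):
            score *= theta[(i, k)].get(v, 0.0)     # unseen value -> zero probability (see Subtlety #2)
        scores[k] = score
    return max(scores, key=scores.get)
```

train_nb produces the count-based MLE estimates shown on the next slide, and classify_nb applies the same decision rule as the worked example above.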
Estimating parameters: discrete Y, X_i
• Maximum likelihood estimates (MLE):
  π̂_k = P̂(Y = y_k) = #D{Y = y_k} / |D|
  θ̂_ijk = P̂(X_i = x_ij | Y = y_k) = #D{X_i = x_ij ∧ Y = y_k} / #D{Y = y_k}
Slide credit: Tom Mitchell
• F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the Super Bowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month
Estimate:
P(F = 1) =       P(F = 0) =
P(S = 1 | F = 1) =    P(S = 0 | F = 1) =    P(S = 1 | F = 0) =    P(S = 0 | F = 0) =
P(D = 1 | F = 1) =    P(D = 0 | F = 1) =    P(D = 1 | F = 0) =    P(D = 0 | F = 0) =
P(G = 1 | F = 1) =    P(G = 0 | F = 1) =    P(G = 1 | F = 0) =    P(G = 0 | F = 0) =
P(F | S, D, G) ∝ P(F) P(S | F) P(D | F) P(G | F)
Naïve Bayes: Subtlety #1
• Often the X_i are not really conditionally independent
• Naïve Bayes often works pretty well anyway
  • Often gives the right classification, even when the probabilities are not right [Domingos & Pazzani, 1996]
• What is the effect on the estimated P(Y | X)?
  • What if we have two copies of a feature, X_i = X_k?
  P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
Slide credit: Tom Mitchell
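To see the effect of violated independence, reuse the earlier worked example but pretend X_2 was (incorrectly) counted twice; this loop is a hypothetical illustration, not part of the slides.

```python
# Duplicate-feature effect on P(Y | X), reusing the slide's example parameters.
P_Y  = {1: 0.4, 0: 0.6}
P_X1 = {(1, 1): 0.2, (0, 1): 0.8, (1, 0): 0.7, (0, 0): 0.3}
P_X2 = {(1, 1): 0.3, (0, 1): 0.7, (1, 0): 0.9, (0, 0): 0.1}

x1, x2 = 1, 0
for copies in (1, 2):                         # 1 = correct model, 2 = X2 counted twice
    s = {y: P_Y[y] * P_X1[(x1, y)] * P_X2[(x2, y)] ** copies for y in (0, 1)}
    z = sum(s.values())
    print(copies, {y: round(v / z, 3) for y, v in s.items()})
# With one copy, P(Y = 1 | x) ≈ 0.571; with the duplicated feature it rises to ≈ 0.903.
# The classification stays the same, but the estimated probability becomes overconfident.
```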
Naïve Bayes: Subtlety #2
• The MLE estimate for P(X_i | Y = y_k) might be zero
  (for example, X_i = birthdate, X_i = Feb_4_1995)
• Why worry about just one parameter out of many?
  P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
  A single zero factor drives the whole product to zero, no matter what the other features say.
• What can we do to address this?
  • MAP estimates (adding "imaginary" examples)
Slide credit: Tom Mitchell
Estimating parameters: discrete Y, X_i
• Maximum likelihood estimates (MLE):
  π̂_k = P̂(Y = y_k) = #D{Y = y_k} / |D|
  θ̂_ijk = P̂(X_i = x_ij | Y = y_k) = #D{X_i = x_ij, Y = y_k} / #D{Y = y_k}
• MAP estimates (Dirichlet priors):
  π̂_k = P̂(Y = y_k) = (#D{Y = y_k} + (β_k − 1)) / (|D| + Σ_m (β_m − 1))
  θ̂_ijk = P̂(X_i = x_ij | Y = y_k) = (#D{X_i = x_ij, Y = y_k} + (β_k − 1)) / (#D{Y = y_k} + Σ_m (β_m − 1))
Slide credit: Tom Mitchell
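With a symmetric prior, the MAP estimate of θ_ijk amounts to adding the same pseudo-count to every feature value; here is a minimal sketch, where the function name and the default β = 2 (Laplace smoothing) are assumptions for illustration.

```python
import numpy as np

def map_estimate_theta(X_col, y, value, k, beta=2.0, n_values=2):
    """MAP estimate of P(X_i = value | Y = k) under a symmetric Dirichlet(beta) prior.
    beta = 2 adds one "imaginary" example per feature value (Laplace smoothing),
    so the estimate can never be exactly zero."""
    num = np.sum((X_col == value) & (y == k)) + (beta - 1)
    den = np.sum(y == k) + n_values * (beta - 1)
    return num / den

# Even if a value never co-occurs with class k in the data, the estimate stays positive.
X_col = np.array([0, 0, 1, 1])
y     = np.array([0, 0, 1, 1])
print(map_estimate_theta(X_col, y, value=0, k=1))   # 0.25 instead of the MLE's 0.0
```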
What if we have continuous X_i?
• Gaussian Naïve Bayes (GNB): assume
  P(X_i = x | Y = y_k) = (1 / (√(2π) σ_ik)) exp(−(x − μ_ik)² / (2σ_ik²))
• Possible additional assumptions on σ_ik:
  • independent of Y (σ_i)
  • independent of X_i (σ_k)
  • independent of both X_i and Y (σ)
Slide credit: Tom Mitchell
Naïve Bayes algorithm – continuous X_i
• For each value y_k
  Estimate π_k = P(Y = y_k)
  For each attribute X_i, estimate the class-conditional mean μ_ik and variance σ_ik²
• Classify X^test
  Y ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^test | Y = y_k)
  Y ← argmax_{y_k} π_k Π_i Normal(X_i^test; μ_ik, σ_ik)
Slide credit: Tom Mitchell
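A minimal sketch of Gaussian Naïve Bayes along these lines, assuming real-valued features in a NumPy array; computing in log space and adding a small variance floor are implementation conveniences, not part of the slides.

```python
import numpy as np

def train_gnb(X, y):
    """Estimate pi_k, mu_ik, and sigma_ik for each class k and feature i."""
    classes = np.unique(y)
    pi = {k: np.mean(y == k) for k in classes}
    mu = {k: X[y == k].mean(axis=0) for k in classes}
    sigma = {k: X[y == k].std(axis=0) + 1e-9 for k in classes}   # small floor avoids division by zero
    return pi, mu, sigma

def classify_gnb(x_test, pi, mu, sigma):
    """argmax_k log pi_k + sum_i log Normal(x_i; mu_ik, sigma_ik); logs avoid numerical underflow."""
    def log_normal(x, m, s):
        return -0.5 * np.log(2 * np.pi * s ** 2) - (x - m) ** 2 / (2 * s ** 2)
    scores = {k: np.log(pi[k]) + log_normal(x_test, mu[k], sigma[k]).sum() for k in pi}
    return max(scores, key=scores.get)
```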
Things to remember
• Probability basics
  • Conditional probability, joint probability, Bayes rule
• Estimating parameters from data
  • Maximum likelihood (MLE): maximize P(Data | θ)
  • Maximum a posteriori (MAP): maximize P(θ | Data)
• Naïve Bayes
  P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
(Plot: Malignant? 1 (Yes) / 0 (No) vs. Tumor Size)
h_θ(x) = θᵀx
• Threshold the classifier output h_θ(x) at 0.5:
  • If h_θ(x) ≥ 0.5, predict "y = 1"
  • If h_θ(x) < 0.5, predict "y = 0"
Slide credit: Andrew Ng
Classification: y = 1 or y = 0
h_θ(x) = θᵀx (from linear regression) can be > 1 or < 0
Logistic regression: 0 ≤ h_θ(x) ≤ 1
Logistic regression is actually a classification method, despite its name
Slide credit: Andrew Ng
Hypothesis representation
• Want 0 ≤ h_θ(x) ≤ 1
• h_θ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z)), so h_θ(x) = 1 / (1 + e^(−θᵀx))
• g(z) is the sigmoid function (also called the logistic function)
Slide credit: Andrew Ng
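A minimal sketch of this hypothesis in code; the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic-regression hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)

print(sigmoid(0.0))                           # 0.5 -> the 0.5 threshold corresponds to theta^T x = 0
print(sigmoid(np.array([-4.0, 0.0, 4.0])))    # ≈ [0.018, 0.5, 0.982]
```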
Interpretation of hypothesis output
• h_θ(x) = estimated probability that y = 1 on input x
• Example: if x = [x_0; x_1] = [1; tumorSize] and h_θ(x) = 0.7,
  tell the patient there is a 70% chance of the tumor being malignant
Slide credit: Andrew Ng
Logistic regression
h_θ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z)) and z = θᵀx
Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, i.e., z = θᵀx ≥ 0,
and predict "y = 0" if h_θ(x) < 0.5, i.e., z = θᵀx < 0
Slide credit: Andrew Ng
Decision boundary
(Plot: Tumor Size vs. Age)
• h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2)
  E.g., θ_0 = −3, θ_1 = 1, θ_2 = 1
• Predict "y = 1" if −3 + x_1 + x_2 ≥ 0
Slide credit: Andrew Ng
• h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2²)
  E.g., θ_0 = −1, θ_1 = 0, θ_2 = 0, θ_3 = 1, θ_4 = 1
• Predict "y = 1" if −1 + x_1² + x_2² ≥ 0
• h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_1² x_2 + θ_5 x_1² x_2² + θ_6 x_1³ x_2 + ⋯)
Slide credit: Andrew Ng
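A small sketch that checks both decision boundaries with the example parameter values above; the predict helper and the sample points are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, features):
    """Predict y = 1 iff h_theta(x) = g(theta^T x) >= 0.5, i.e., theta^T x >= 0."""
    return int(sigmoid(np.dot(theta, features)) >= 0.5)

# Linear boundary from the previous slide: theta = (-3, 1, 1) -> predict 1 iff x1 + x2 >= 3
theta_lin = np.array([-3.0, 1.0, 1.0])
print(predict(theta_lin, np.array([1.0, 1.0, 1.0])))   # 0: x1 + x2 = 2 < 3
print(predict(theta_lin, np.array([1.0, 2.0, 2.0])))   # 1: x1 + x2 = 4 >= 3

# Quadratic boundary: theta = (-1, 0, 0, 1, 1) on features (1, x1, x2, x1^2, x2^2)
# -> predict 1 iff x1^2 + x2^2 >= 1 (outside the unit circle)
theta_quad = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])
x1, x2 = 0.5, 0.5
print(predict(theta_quad, np.array([1.0, x1, x2, x1**2, x2**2])))   # 0: inside the circle
```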