Logistic Regression CSCI 447/547 MACHINE LEARNING
Outline
- Math Behind Logistic Regression
- Visualizing Logistic Regression
- Loss Function / Minimizing the Log-Likelihood Function
- Batch/Full Gradient Descent
- Gradient Descent for OLS
- Gradient Descent for Logistic Regression
- Comparing OLS and Logistic Regression
- Multi-Class Logistic Regression Using Softmax
OLS Recap
- Linear regression predicts continuous and potentially unbounded labels from given features: $\hat{Y} = XB$, where $B$ is the vector of coefficients and $X$ is the data matrix.
- The issue: unbounded output means we cannot use it directly for discrete classification.
Logistic Regression
- Classification: predicts discrete labels from given features.
- The setup: $y = 0$ or $1$, with probabilities $1 - p$ and $p$.
- Predict a probability instead of a value: estimate $P(y = 1 \mid X)$.
Definitions
- Link function: relates the mean of the response distribution to the output of the linear model, converting unbounded, continuous output into a bounded prediction with a discrete interpretation. In a generalized linear model the response distribution is typically a member of the exponential family, and the link function comes from that choice.
- Logistic function: input ranges over $(-\infty, \infty)$ and output over $(0, 1)$, i.e. $f : (-\infty, \infty) \to (0, 1)$. Typically the sigmoid function, with $f(0) = 0.5$, $f(-\infty) = 0$, $f(\infty) = 1$. The output value is interpreted as a probability.
Note on Choice of Link Function
Properties:
- Bounded range $[0, 1]$
- Domain $(-\infty, \infty)$
- Differentiable everywhere (used in optimization)
- Increasing function: we keep the interpretation of coefficient sign (positive or negative effect) in moving from linear to logistic regression
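To make these properties concrete, here is a minimal sketch of the sigmoid link in NumPy (the function name and the spot-check values are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) link: maps (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Spot-check the properties listed above.
print(sigmoid(0.0))                    # 0.5
print(sigmoid(-10.0), sigmoid(10.0))   # ~0.000045 and ~0.999955
```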
Visualizing Logistic Regression
Logistic Regression Loss
- Our Goal
- The Full Log-Likelihood
- A Note on Minimization
Our Goal
- Find $P(y = 1 \mid X)$, the probability to be estimated.
- Logistic function: $\phi(x) = \frac{1}{1 + e^{-x}}$
- The OLS model $Y = XB$ becomes $Y = \Phi(XB)$, where $\Phi = (\phi_1, \ldots, \phi_N)^T$ applies the logistic function row by row, so $\hat{y}_i = \phi(x_i \cdot B)$.
- MLE (Maximum Likelihood Estimation): $L = \prod_{i=1}^{N} P(Y = y_i \mid x_i)$
- The observations are independent, so the joint probability is the product over the individual observations.
Our Goal
- $L = \prod_{i=1}^{N} P(Y = y_i \mid x_i) = \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}$
- Log-likelihood: $\log L = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] = \sum_{i=1}^{N} \left[ y_i \log \phi(x_i \cdot B) + (1 - y_i) \log(1 - \phi(x_i \cdot B)) \right]$
- Since we want to minimize, take the negative log-likelihood $-\log L$; this is the quantity we will minimize.
The Full Log-Likelihood
- Logistic regression minimizes this negative log-likelihood.
- So far this parallels OLS, but unlike OLS there is no explicit (closed-form) solution, so we must minimize it numerically.
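As a sketch of the quantity being minimized, the negative log-likelihood can be written in NumPy as follows (the function names and the clipping constant are our choices; clipping just guards against taking log of 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(B, X, y, eps=1e-12):
    """-log L for logistic regression.

    X: (N, d) data matrix, y: (N,) labels in {0, 1}, B: (d,) coefficients.
    """
    p = np.clip(sigmoid(X @ B), eps, 1 - eps)   # P(y=1 | x_i) for each row
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```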
Logistic Regression Loss
Gradient Descent
Batch/Full Gradient Descent
The gradient algorithm (see the NumPy sketch below):
1. Choose $x$ randomly.
2. Compute the gradient of $f$ at $x$: $\nabla f(x)$.
3. Step in the direction of the negative gradient: $x \leftarrow x - \eta \nabla f(x)$, where $\eta$ is the step size (too large and we can overshoot; too small and convergence takes too long).
4. Repeat steps 2 and 3 until convergence: the change between iterations stops decreasing, or a set number of iterations is reached.
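The loop above, written generically in NumPy (the stopping rule on the step norm and the default settings are our assumptions):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iters=10_000):
    """Batch gradient descent: step against the gradient until convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        step = eta * grad_f(x)          # eta is the step size
        x = x - step                    # move in the negative gradient direction
        if np.linalg.norm(step) < tol:  # stop once the update is negligibly small
            break
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))   # ~[3.]
```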
Gradient Descent for OLS
- Numerical minimization of the OLS mean loss (the negative mean log-likelihood, up to constants): $L(B) = \frac{1}{N} \| y - XB \|_2^2$
- The algorithm for OLS follows the gradient-descent template above, using the update $B \leftarrow B + \frac{2\eta}{N} X^T (y - XB)$ (see the comparison of update steps later).
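A minimal sketch of this OLS update in NumPy, assuming the mean-squared-error loss and update step above (the initialization scale and iteration count are arbitrary choices):

```python
import numpy as np

def ols_gradient_descent(X, y, eta=0.01, n_iters=5000):
    """Fit OLS coefficients by batch gradient descent on L(B) = (1/N)*||y - XB||^2."""
    N, d = X.shape
    B = np.random.randn(d) * 0.01                  # random initialization
    for _ in range(n_iters):
        residual = y - X @ B                       # y - XB
        B = B + (2 * eta / N) * (X.T @ residual)   # step against the gradient
    return B
```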
Gradient Descent for Logistic Regression
- Numerical minimization of the logistic loss: $L(B) = -\sum_i \left[ y_i \log \phi(x_i \cdot B) + (1 - y_i) \log(1 - \phi(x_i \cdot B)) \right]$
- The gradient has two components, one from each term.
- Recall $\phi(x) = \frac{1}{1 + e^{-x}}$. Because the sigmoid is symmetric about $1/2$: $\phi(x) + \phi(-x) = 1$.
Gradient Descent for Logistic Regression
First term, differentiated with respect to coefficient $B_k$:
- $\frac{\partial}{\partial B_k} \log \phi(x_i \cdot B) = \frac{1}{\phi(x_i \cdot B)} \frac{\partial}{\partial B_k} \phi(x_i \cdot B)$, with $\phi(x_i \cdot B) = \frac{1}{1 + e^{-x_i \cdot B}}$
- $\frac{\partial}{\partial B_k} \frac{1}{1 + e^{-x_i \cdot B}} = \frac{1}{(1 + e^{-x_i \cdot B})^2}\, e^{-x_i \cdot B}\, x_{ik}$
- Combining: $\frac{\partial}{\partial B_k} \log \phi(x_i \cdot B) = \frac{e^{-x_i \cdot B}}{1 + e^{-x_i \cdot B}}\, x_{ik} = \frac{1}{1 + e^{x_i \cdot B}}\, x_{ik} = \phi(-x_i \cdot B)\, x_{ik} = (1 - \phi(x_i \cdot B))\, x_{ik}$
- In vector form: $\frac{\partial}{\partial B} \log \phi(x_i \cdot B) = x_i^T \left(1 - \phi(x_i \cdot B)\right)$
Gradient Descent for Logistic Regression
Second term, using $1 - \phi(x) = \phi(-x)$:
- $\frac{\partial}{\partial B} \log\left(1 - \phi(x_i \cdot B)\right) = \frac{\partial}{\partial B} \log \phi(-x_i \cdot B) = -x_i^T \left(1 - \phi(-x_i \cdot B)\right) = -x_i^T \phi(x_i \cdot B)$
Gradient Descent for Logistic Regression
All terms together:
$\frac{\partial L(B)}{\partial B} = -\sum_i \left[ y_i x_i^T \left(1 - \phi(x_i \cdot B)\right) + (1 - y_i)\left(-x_i^T\right) \phi(x_i \cdot B) \right]$
$= -\sum_i \left[ y_i x_i^T - y_i x_i^T \phi(x_i \cdot B) + y_i x_i^T \phi(x_i \cdot B) - x_i^T \phi(x_i \cdot B) \right]$
$= -\sum_i x_i^T \left( y_i - \phi(x_i \cdot B) \right)$
$= -X^T \left( y - \Phi(XB) \right)$, where $\Phi = (\phi_1, \ldots, \phi_N)^T$
Gradient Descent for Logistic Regression
- $\frac{\partial L}{\partial B} = -X^T \left( y - \Phi(XB) \right)$, where $\Phi$ applies the logistic function to each component.
- Normalize by the number of observations: divide by $N$ to get the mean loss.
Gradient Descent for Logistic Regression
Algorithm (see the sketch below):
1. Pick a learning rate $\eta$.
2. Initialize $B$ randomly.
3. Iterate: $B \leftarrow B - \eta \frac{\partial L}{\partial B} = B + \frac{\eta}{N} X^T \left[ y - \Phi(XB) \right]$
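A sketch of this algorithm in NumPy, following the update step above (the toy data at the end is purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, eta=0.1, n_iters=5000):
    """Fit logistic regression by batch gradient descent.

    Update: B <- B + (eta / N) * X^T (y - sigmoid(XB))
    """
    N, d = X.shape
    B = np.random.randn(d) * 0.01              # random initialization
    for _ in range(n_iters):
        residual = y - sigmoid(X @ B)          # y - Phi(XB)
        B = B + (eta / N) * (X.T @ residual)
    return B

# Toy usage: intercept column plus one feature, true coefficients [-0.5, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (rng.random(500) < sigmoid(X @ np.array([-0.5, 2.0]))).astype(float)
print(logistic_gradient_descent(X, y))        # estimates near [-0.5, 2.0], up to sampling noise
```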
Comparing OLS and Logistic Regression
The two update steps:
- OLS: $B \leftarrow B + \frac{2\eta}{N} X^T \left( y - XB \right)$  (a constant times the residual error)
- Logistic: $B \leftarrow B + \frac{\eta}{N} X^T \left( y - \Phi(XB) \right)$  (a constant times the error)
Comparing OLS and Logistic Regression
Regularization:
- OLS loss function under regularization: $L = \| y - XB \|_2^2 + \frac{\lambda}{2} \| B \|_2^2$
- Logistic: $L = \| y - \Phi(XB) \|_2^2 + \frac{\lambda}{2} \| B \|_2^2$
- L2 norm penalty (ridge regression): uniform shrinkage of the coefficients.
- L1 norm penalty (LASSO regression): drives coefficients to zero, giving sparsity and dimensionality reduction.
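For illustration, here is how an L2 (ridge) penalty changes the gradient step. Note this sketch applies the penalty to the negative log-likelihood loss rather than to the squared-error form written above; that is a deliberate substitution, and the function and parameter names are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_logistic_gradient_descent(X, y, lam=0.1, eta=0.1, n_iters=5000):
    """Logistic regression with an L2 penalty (lam/2)*||B||^2.

    The penalty contributes lam * B to the gradient, shrinking all coefficients uniformly.
    """
    N, d = X.shape
    B = np.zeros(d)
    for _ in range(n_iters):
        grad = -(X.T @ (y - sigmoid(X @ B))) / N + lam * B
        B = B - eta * grad
    return B
```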
Multi-Class Logistic Regression Using Softmax
Softmax:
- $P(y = c_k \mid x_i) = \frac{e^{B_k \cdot x_i}}{\sum_{l=1}^{m} e^{B_l \cdot x_i}}$, for $y \in \{1, \ldots, m\}$
- $\sum_{k=1}^{m} P(y = c_k \mid x_i) = 1$, so it is a probability distribution over the classes.
Multi-Class Logistic Regression Using Softmax
Comparison with logistic regression in the case of two classes:
- Two classes: $P(y = 1 \mid x_i) = \frac{1}{1 + e^{-x_i \cdot B}}$
- Softmax with two classes: $P(y = 1 \mid x_i) = \frac{e^{x_i \cdot B_1}}{e^{x_i \cdot B_1} + e^{x_i \cdot B_2}}$; multiplying numerator and denominator by $e^{-x_i \cdot B_1}$ gives $\frac{1}{1 + e^{-x_i \cdot (B_1 - B_2)}}$
- So with two classes, softmax reduces to the logistic function applied to the coefficient difference $B_1 - B_2$.
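A quick numerical check of this reduction (the sample vectors are arbitrary): with two classes, the first softmax probability matches the sigmoid of $x_i \cdot (B_1 - B_2)$.

```python
import numpy as np

def softmax(scores):
    """Softmax over class scores; subtracting the max is for numerical stability."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / np.sum(e)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])
B1, B2 = np.array([0.5, -0.3]), np.array([-0.2, 0.4])
print(softmax(np.array([x @ B1, x @ B2]))[0])   # ~0.3318
print(sigmoid(x @ (B1 - B2)))                   # same value
```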
Softmax Optimization
Classification probabilities:
- $\Pi_{i,k} = \frac{e^{x_i \cdot B_k}}{\sum_l e^{x_i \cdot B_l}}$, and $\pi_k = (\Pi_{1,k}, \ldots, \Pi_{N,k})^T$, the probability of class $k$ for observation $i$.
The gradients:
- $\frac{\partial L}{\partial B_k} = -X^T \left( y_k - \pi_k \right)$
- Compare logistic: $-X^T \left( y - \Phi(XB) \right)$; the probability vector replaces the sigmoid term.
- The labels are one-hot encoded: each $y_i$ is a vector with all entries 0 except a 1 in the position of its class, and $y_k$ collects the $k$-th entries across observations.
Softmax Gradient Descent
The algorithm is largely unchanged (see the sketch below):
1. Initialize the learning rate $\eta$.
2. Randomly choose $B_1, \ldots, B_m$.
3. Iterate, for each class $k$: $B_k \leftarrow B_k + \frac{\eta}{N} X^T \left( y_k - \pi_k \right)$
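A sketch of the multi-class version in NumPy, assuming integer class labels that get one-hot encoded internally (the names and defaults are illustrative):

```python
import numpy as np

def softmax(scores):
    z = scores - np.max(scores, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_gradient_descent(X, y, n_classes, eta=0.1, n_iters=5000):
    """Multi-class logistic regression by batch gradient descent.

    X: (N, d) data matrix, y: (N,) integer labels in {0, ..., n_classes - 1}.
    Returns B of shape (d, n_classes), one coefficient column per class.
    """
    N, d = X.shape
    B = np.random.randn(d, n_classes) * 0.01
    Y = np.eye(n_classes)[y]                  # one-hot label matrix, shape (N, n_classes)
    for _ in range(n_iters):
        Pi = softmax(X @ B)                   # class probabilities, shape (N, n_classes)
        B = B + (eta / N) * (X.T @ (Y - Pi))  # updates every B_k in one step
    return B
```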
Last Notes
- Learning rate $\eta$: small values take a long time to converge; with large values, convergence may not happen at all.
- Important to monitor the loss function on each iteration.
- Also make sure you normalize the features so values do not get too large.
- The gradient descent algorithms across all of these models are very similar.
Summary
- Math Behind Logistic Regression
- Visualizing Logistic Regression
- Loss Function / Minimizing the Log-Likelihood Function
- Batch/Full Gradient Descent
- Gradient Descent for OLS
- Gradient Descent for Logistic Regression
- Comparing OLS and Logistic Regression
- Multi-Class Logistic Regression Using Softmax