

  1. EE226 Big Data Mining Lecture 4 Supervised Learning Liyao Xiang http://xiangliyao.cn/ Shanghai Jiao Tong University

  2. Reference and Acknowledgement • Most of the course materials are credited to Andrew Ng’s CS229 lecture notes.

  3. Outline • Linear Regression • Classification and Logistic Regression • Generalized Linear Models

  4. Outline • Linear Regression • Classification and Logistic Regression • Generalized Linear Models

  5. Supervised Learning Example Revisited • (x(i), y(i)): a training example • {(x(i), y(i)); i = 1, …, m}: training set • x(i) ∈ X: input variables • y(i) ∈ Y: output variables • h: X ↦ Y: hypothesis [Figure: predicted price (millions of RMB) vs. size in m²; a testing example at size 75 m² is marked]

  6. Supervised Learning Example Revisited Let's consider a richer dataset in which we also know the number of bedrooms in each apartment • x: two-dimensional vectors in R² • x1(i): the size of the i-th apartment in the training set • x2(i): the number of bedrooms of the i-th apartment in the training set • We choose the hypothesis h to be a linear function: hθ(x) = θ0 + θ1·x1 + θ2·x2 • θi: parameters/weights of h • By letting x0 = 1, we rewrite h as h(x) = ∑_{i=0}^{n} θi·xi = θᵀx • Why a linear function?

  Size (m²) | #bedrooms | Price (million ¥)
  40        | 0         | 1.2
  65        | 1         | 1.9
  80        | 2         | 2.2
  89        | 2         | 3.3
  120       | 3         | 5.3
  …         | …         | …
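The linear hypothesis above can be sketched in a few lines of NumPy. The θ values here are made up for illustration (not learned from the table):

```python
import numpy as np

# Hypothetical parameters: [intercept, weight per m^2, weight per bedroom]
theta = np.array([0.5, 0.03, 0.2])

def h(theta, x):
    # Prepend x0 = 1 so the intercept folds into the dot product theta^T x
    return theta @ np.concatenate(([1.0], x))

price = h(theta, np.array([80.0, 2.0]))  # an 80 m^2 apartment with 2 bedrooms
```

With these made-up weights, the predicted price is 0.5 + 0.03·80 + 0.2·2 = 3.3 million ¥.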

  7. Supervised Learning Example Revisited (the same Size/#bedrooms/Price dataset as on the previous slide) • By letting x0 = 1, we rewrite h as h(x) = ∑_{i=0}^{n} θi·xi = θᵀx • How can we learn θ? By making h(x) close to y for the training examples! • Cost function: J(θ) = (1/2) ∑_{i=1}^{m} (hθ(x(i)) − y(i))² • Why a least-squares cost?
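A minimal sketch of the least-squares cost, assuming a design matrix X whose first column is the x0 = 1 intercept term:

```python
import numpy as np

def cost(theta, X, y):
    # J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2
    residuals = X @ theta - y
    return 0.5 * residuals @ residuals

# Tiny toy dataset: two examples, intercept column included
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0])

j_perfect = cost(np.array([0.0, 1.0]), X, y)  # fits both points exactly
j_zero = cost(np.array([0.0, 0.0]), X, y)     # predicts 0 everywhere
```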

  8. Least-Mean-Square Algorithm • How to choose θ to minimize J(θ)? Start with some "initial guess" for θ, and repeatedly apply the gradient descent algorithm to make J(θ) smaller: θj := θj − α·∂J(θ)/∂θj, the direction of steepest decrease of J • α: learning rate • What is the partial derivative term? For a single training example, ∂J(θ)/∂θj = (hθ(x) − y)·xj, which gives the least-mean-square (LMS) update rule: θj := θj + α·(y(i) − hθ(x(i)))·xj(i), where (y(i) − hθ(x(i))) is the error term
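The LMS update for one example can be sketched as follows (all coordinates of θ updated at once):

```python
import numpy as np

def lms_update(theta, x, y, alpha):
    # theta_j := theta_j + alpha * (y - theta^T x) * x_j, for all j at once
    return theta + alpha * (y - theta @ x) * x

# One update from theta = 0 on the example x = (1, 2), y = 1
theta_new = lms_update(np.array([0.0, 0.0]), np.array([1.0, 2.0]), 1.0, 0.1)
```

Here the error term is y − θᵀx = 1, so the step is α·1·x = (0.1, 0.2).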

  9. Least-Mean-Square Algorithm • Two ways to run the method: • batch gradient descent: scan through the entire training set before taking a single step • stochastic gradient descent: update the parameters according to the gradient of the error with respect to a single training example
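Both variants can be sketched on a toy dataset (the data here is made up and exactly linear, so both converge to the true parameters):

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # intercept column included
y = np.array([1.0, 3.0, 5.0])                        # exactly y = 1 + 2x

def batch_step(theta, alpha):
    # One step using the gradient summed over the whole training set
    return theta + alpha * X.T @ (y - X @ theta)

def sgd_pass(theta, alpha):
    # One pass over the data, updating after every single example
    for x_i, y_i in zip(X, y):
        theta = theta + alpha * (y_i - theta @ x_i) * x_i
    return theta

theta = np.zeros(2)
for _ in range(500):
    theta = batch_step(theta, 0.1)
```

After enough batch steps, θ approaches (1, 2), the parameters that generated the data.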

  10. Convergence • In general, gradient descent may converge only to a local minimum. Linear regression, however, has a single global minimum, which gradient descent always reaches (with a suitable learning rate), because the cost function J is a convex quadratic function. [Figure: contour plot of the cost over θ; the trajectory of gradient descent reaches the global minimum]

  11. Normal Equations • Gradient descent gives one way of minimizing J. Are there others? • We can minimize J in closed form: explicitly take the derivatives with respect to θ, set them to 0, and solve the resulting equations! • Notation: for f: R^{m×n} ↦ R and an m×n matrix A, ∇_A f(A) denotes the m×n matrix of partial derivatives ∂f/∂A_ij

  12. Normal Equations Some facts about the trace operator: 1. tr a = a: the trace of a real number is itself 2. tr A = tr Aᵀ: the trace of a matrix equals the trace of its transpose 3. tr AB = tr BA, tr ABC = tr CAB = tr BCA 4. ∇_A tr AB = Bᵀ 5. ∇_A tr ABAᵀC = CAB + CᵀABᵀ

  13. Normal Equations • Stack the training inputs as the rows of the design matrix X and the targets as the vector y, so that J(θ) = (1/2)(Xθ − y)ᵀ(Xθ − y) • Applying the trace properties, ∇_θ J(θ) = XᵀXθ − Xᵀy • Setting this to 0 gives the normal equations XᵀXθ = Xᵀy, whose solution is θ = (XᵀX)⁻¹Xᵀy
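The closed-form solution can be sketched on the same toy dataset; `np.linalg.lstsq` solves the identical least-squares problem in a numerically stable way, without forming the inverse explicitly:

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # intercept column included
y = np.array([1.0, 3.0, 5.0])                        # exactly y = 1 + 2x

# Normal equations: X^T X theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer via the stable least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```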

  14. Probabilistic View • The target variables and the inputs are related by y(i) = θᵀx(i) + ε(i), where ε(i) is an error term • Assume the ε(i) are distributed IID (independently and identically distributed) with ε(i) ∼ N(0, σ²) • This implies p(y(i) | x(i); θ) = (1/(√(2π)·σ))·exp(−(y(i) − θᵀx(i))²/(2σ²)) • Given X and θ, what is the distribution of the y(i)'s? The likelihood function: L(θ) = ∏_{i=1}^{m} p(y(i) | x(i); θ)

  15. Probabilistic View • Maximum likelihood: we should choose θ so as to make the data as probable as possible • Equivalently, we maximize the log-likelihood: ℓ(θ) = m·log(1/(√(2π)·σ)) − (1/σ²)·(1/2)∑_{i=1}^{m} (y(i) − θᵀx(i))² • Maximizing ℓ(θ) therefore means minimizing (1/2)∑_{i=1}^{m} (y(i) − θᵀx(i))², the original least-squares cost!
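The equivalence can be checked numerically: under the Gaussian model, a θ with smaller squared error always has larger log-likelihood. A sketch on made-up data:

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    # l(theta) = m*log(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma^2)) * sum of squared residuals
    r = y - X @ theta
    return -len(y) * np.log(np.sqrt(2 * np.pi) * sigma) - (r @ r) / (2 * sigma**2)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])            # exactly y = 1 + 2x

good = log_likelihood(np.array([1.0, 2.0]), X, y, 1.0)   # zero residuals
worse = log_likelihood(np.array([0.0, 0.0]), X, y, 1.0)  # larger squared error
```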

  16. Underfitting & Overfitting • Fitting different hypotheses to the same data: y = θ0 + θ1·x (underfitting), y = θ0 + θ1·x + θ2·x² (a good fit), y = ∑_{j=0}^{5} θj·x^j (overfitting) • The more features we add, the better we fit the training data; however, there is also a risk in adding too many features.
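This can be sketched with polynomial fits of increasing degree on made-up, nearly linear data: the training error never increases with degree (the models are nested), but high-degree fits are chasing the noise, which is exactly the overfitting risk:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.05, size=x.size)  # nearly linear data

def train_mse(degree):
    coef = np.polyfit(x, y, degree)          # least-squares polynomial fit
    return np.mean((np.polyval(coef, x) - y) ** 2)

# Training error is non-increasing in the degree of the polynomial
errors = [train_mse(d) for d in (1, 2, 5)]
```

Low training error at degree 5 says nothing about how well the fit generalizes to new x values.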

  17. Locally Weighted Linear Regression • The choice of features is important to learning performance! • Locally weighted linear regression: 1. Fit θ to minimize ∑_i w(i)·(y(i) − θᵀx(i))² 2. Output θᵀx • A larger w(i) means we try harder to make (y(i) − θᵀx(i))² small; when w(i) is small, we effectively ignore the corresponding error term • Standard choice for the weight: w(i) = exp(−(x(i) − x)²/(2τ²)), which gives a higher weight to the training examples close to the query point x • This is a non-parametric algorithm: we keep the entire training dataset when making predictions
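The two steps above can be sketched as a weighted least-squares solve per query point (the toy data is made up; on exactly linear data the prediction matches the line regardless of τ):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    # w(i) = exp(-||x(i) - x||^2 / (2 tau^2)): nearby examples weigh more
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau**2))
    # Weighted least squares: solve (X^T W X) theta = X^T W y for this query
    XtW = X.T * w
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # intercept column included
y = np.array([1.0, 3.0, 5.0])                        # exactly y = 1 + 2x
pred = lwr_predict(X, y, np.array([1.0, 1.5]), tau=0.5)
```

Note that θ is re-fit for every query, which is why the whole training set must be kept around.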

  18. Summary • Linear regression • Linear hypothesis class: h(x) = ∑_{i=0}^{n} θi·xi = θᵀx • Cost function: J(θ) = (1/2)∑_{i=1}^{m} (hθ(x(i)) − y(i))² • Least-mean-square algorithm: batch/stochastic gradient descent • Probabilistic view: errors ∼ IID Gaussian distribution; maximum likelihood • Overfitting & underfitting • Locally weighted linear regression

  19. Outline • Linear Regression • Classification and Logistic Regression • Generalized Linear Models

  20. Binary Classification • The target y can only take two values: y ∈ {−1, +1}. y = 1 if the example belongs to the positive class; otherwise it is a member of the negative class • Hypothesis: h(x) = θᵀx. Given x, we classify it as positive or negative depending on the sign of θᵀx, i.e., sign(θᵀx) = y ⟺ y·θᵀx > 0 • Margin for the example (x, y): y·θᵀx. The more negative (or positive) θᵀx is, the stronger the belief that y is negative (or positive) • Loss function: it should penalize a θ for which y(i)·θᵀx(i) < 0 frequently in the training data. The loss value should be small if y(i)·θᵀx(i) > 0 and large if y(i)·θᵀx(i) < 0 • We expect the loss function to be continuous and convex (easy to converge to the global minimum!)

  21. Binary Classification • We expect the loss to satisfy: Loss(y(i)·θᵀx(i)) → 0 as y(i)·θᵀx(i) → ∞, and Loss(y(i)·θᵀx(i)) → ∞ as y(i)·θᵀx(i) → −∞ • Logistic loss (logistic regression): Loss_logistic(z) = log(1 + e⁻ᶻ) • Hinge loss (support vector machines): Loss_hinge(z) = max{1 − z, 0} • Exponential loss (boosting): Loss_exp(z) = e⁻ᶻ
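The three margin-based losses can be sketched side by side; all vanish for large positive margins and blow up for large negative ones:

```python
import numpy as np

def logistic_loss(z):   # logistic regression
    return np.log(1 + np.exp(-z))

def hinge_loss(z):      # support vector machines
    return np.maximum(1 - z, 0)

def exp_loss(z):        # boosting
    return np.exp(-z)
```

At z = 0 (an example on the decision boundary) the logistic loss equals log 2, the hinge loss 1, and the exponential loss 1.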

  22. Logistic Regression • Choose θ to minimize J(θ) = (1/m)∑_{i=1}^{m} Loss_logistic(y(i)·θᵀx(i)) = (1/m)∑_{i=1}^{m} log(1 + exp(−y(i)·θᵀx(i))), which hopefully yields a θ such that y(i)·θᵀx(i) > 0 for most training examples • Alternative view: the logistic (sigmoid) function g(z) = 1/(1 + e⁻ᶻ) satisfies g(z) → 1 as z → ∞ and g(z) → 0 as z → −∞ • Since g(z) + g(−z) = 1, we can use it to define a probability model for binary classification.
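A sketch of the cost and the sigmoid on made-up data (labels in {−1, +1}, intercept column included):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_cost(theta, X, y):
    # Average of log(1 + exp(-y(i) theta^T x(i))) over the training set
    margins = y * (X @ theta)
    return np.mean(np.log(1 + np.exp(-margins)))

X = np.array([[1.0, -1.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0])

j_good = logistic_cost(np.array([0.0, 5.0]), X, y)  # positive margins
j_flat = logistic_cost(np.zeros(2), X, y)           # all margins 0: cost log 2
```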

  23. Probabilistic View • For y ∈ {−1, +1}, we define the logistic model as p(Y = y | x; θ) = g(y·xᵀθ) = 1/(1 + e^(−y·xᵀθ)), and refine the hypothesis class as hθ(x) = 1/(1 + e^(−xᵀθ)) • The likelihood of the training data is L(θ) = ∏_{i=1}^{m} p(Y = y(i) | x(i); θ) • The log-likelihood is ℓ(θ) = −∑_{i=1}^{m} log(1 + exp(−y(i)·x(i)ᵀθ)) • Maximizing the likelihood in the logistic model = minimizing the average logistic loss

  24. Gradient Descent • For Loss_logistic(z) = log(1 + e⁻ᶻ), the derivative is d/dz Loss_logistic(z) = −e⁻ᶻ/(1 + e⁻ᶻ) = −g(−z), where g is the sigmoid function • For a single training example (x, y): ∂/∂θk Loss_logistic(y·xᵀθ) = −g(−y·xᵀθ)·∂(y·xᵀθ)/∂θk = −g(−y·xᵀθ)·y·xk • Update rule for stochastic gradient descent: θ^(t+1) = θ^t − α_t·∇_θ Loss_logistic(y(i)·x(i)ᵀθ^t) • Note that g(−y·xᵀθ) is close to 1 when the example is incorrectly labeled (y·xᵀθ ≪ 0), so such examples drive the update
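The per-example update can be sketched and run on a tiny made-up separable dataset (labels in {−1, +1}, intercept column included):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sgd_step(theta, x, y, alpha):
    # grad_theta log(1 + exp(-y x^T theta)) = -g(-y x^T theta) * y * x
    return theta + alpha * sigmoid(-y * (x @ theta)) * y * x

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

theta = np.zeros(2)
for _ in range(200):
    for x_i, y_i in zip(X, y):
        theta = sgd_step(theta, x_i, y_i, alpha=0.1)
```

After a few hundred passes, every training example has a positive margin y(i)·x(i)ᵀθ.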

  25. Update Rule when y ∈ {0, 1} • P(y = 1 | x; θ) = hθ(x) = 1/(1 + e^(−θᵀx)), P(y = 0 | x; θ) = 1 − hθ(x), or compactly p(y | x; θ) = (hθ(x))^y·(1 − hθ(x))^(1−y) • Gradient ascent on the log-likelihood gives θj := θj + α·(y(i) − hθ(x(i)))·xj(i) • This looks similar to the least-mean-square update rule, but h is non-linear!
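The gradient-ascent rule for {0, 1} labels can be sketched the same way (the data below is made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def ascent_step(theta, X, y, alpha):
    # theta_j := theta_j + alpha * sum_i (y(i) - h_theta(x(i))) * x_j(i)
    return theta + alpha * X.T @ (y - sigmoid(X @ theta))

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])  # labels in {0, 1}

theta = np.zeros(2)
for _ in range(500):
    theta = ascent_step(theta, X, y, alpha=0.1)

preds = (sigmoid(X @ theta) > 0.5).astype(float)
```

The form of the step mirrors the LMS rule for linear regression, with the non-linear hθ in place of θᵀx.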

  26. Another Update Rule to Maximize l( θ ) • Newton’s method for finding a zero of a function: f( θ ) = 0 • Update rule: θ := θ - f( θ )/f’( θ )

  27. Another Update Rule to Maximize ℓ(θ) • Newton's method finds a zero of a function: f(θ) = 0 • What if we want to maximize a function such as ℓ(θ)? Its maxima correspond to points where the first derivative ℓ′(θ) is 0 • Update rule: θ := θ − ℓ′(θ)/ℓ″(θ) • Multidimensional setting: θ := θ − H⁻¹·∇_θℓ(θ), where H is the Hessian matrix • Advantage: Newton's method typically enjoys faster convergence than gradient descent and requires far fewer iterations to get very close to the optimum • Disadvantage: each iteration is more expensive
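Newton's method can be sketched on a made-up 1-D quadratic, maximizing l(θ) = −(θ − 3)² by finding the zero of l′(θ) = −2(θ − 3):

```python
# Newton's update theta := theta - l'(theta) / l''(theta) on a toy quadratic
theta = 0.0
for _ in range(5):
    l1 = -2.0 * (theta - 3.0)  # first derivative l'(theta)
    l2 = -2.0                  # second derivative l''(theta)
    theta = theta - l1 / l2
```

On a quadratic the method lands on the optimum (θ = 3) in a single step, which illustrates the fast convergence noted above; each step, however, needs second-derivative (Hessian) information.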

  28. Summary • Logistic regression • Hypothesis: h(x) = θᵀx • Cost function: Loss_logistic(z) = log(1 + e⁻ᶻ) • Update rule: θ^(t+1) = θ^t − α_t·∇_θ Loss_logistic(y(i)·x(i)ᵀθ^t) • Newton's method: θ := θ − ℓ′(θ)/ℓ″(θ) • Probabilistic view: maximizing the likelihood in the logistic model = minimizing the average logistic loss
