
Classification: Basic concepts, Decision tree, Naïve Bayesian classifier, Model evaluation, Support Vector Machines, Regression, Neural Networks and Deep Learning, Lazy Learners (k Nearest Neighbors), Bayesian Belief Networks


  1. Stochastic Gradient Descent (SGD): update the parameters based on the gradient of randomly chosen data samples. Often preferred over (batch) gradient descent because it gets close to the minimum much faster.

  2. Non-linear basis functions. So far we only used the observed values x_1, x_2, .... However, linear regression can be applied in the same way to functions of these values, e.g. x_1^2 and x_1 x_2, so each example becomes x_1, x_2, ..., x_1^2, x_1 x_2. As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem: y = w_0 + w_1 x_1 + ... + w_k x_k^2
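A minimal sketch of this idea (not from the slides): expand each example with squared and interaction terms and solve the resulting multivariate linear regression with ordinary least squares. The synthetic data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # observed values x1, x2
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + 3.0 * X[:, 0] * X[:, 1]

# Expand each example to (1, x1, x2, x1^2, x1*x2); the model is still
# linear in the parameters w, so ordinary least squares still applies.
Phi = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1],
                       X[:, 0] ** 2, X[:, 0] * X[:, 1]])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # recovers approximately [1, 2, 0, 0.5, 3]
```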

  3. Non-linear basis functions. What type of functions can we use? A few common examples:
  - Polynomial: φ_j(x) = x^j for j = 0 … n
  - Gaussian: φ_j(x) = exp(-(x - μ_j)^2 / (2 s_j^2))
  - Sigmoid: φ_j(x) = 1 / (1 + exp(-s_j x))
  - Logs: φ_j(x) = log(x + 1)

  4. General linear regression problem. Using our new notation for the basis functions, linear regression can be written as y = Σ_{j=0}^{n} w_j φ_j(x), where φ_j(x) can be either x_j for multivariate regression or one of the non-linear basis functions we defined, and φ_0(x) = 1 for the intercept term.
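A sketch of this general form, assuming Gaussian basis functions with illustrative centers μ_j and width s (these choices are not from the slides): build the design matrix of basis-function values and solve for the weights by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)   # toy 1-D data

mus, s = np.linspace(0, 1, 9), 0.15            # illustrative basis centers/width

def design_matrix(x):
    # phi_0(x) = 1 for the intercept, then one Gaussian bump per mu_j
    cols = [np.ones_like(x)] + [np.exp(-(x - m) ** 2 / (2 * s ** 2)) for m in mus]
    return np.column_stack(cols)

w, *_ = np.linalg.lstsq(design_matrix(x), y, rcond=None)
y_hat = design_matrix(x) @ w                   # y = sum_j w_j * phi_j(x)
```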

  5. 0th Order Polynomial, n = 10

  6. 1st Order Polynomial

  7. 3rd Order Polynomial

  8. 9th Order Polynomial

  9. Over-fitting. Root-Mean-Square (RMS) Error: E_RMS = sqrt( (1/N) Σ_i (y_i − ŷ_i)^2 )

  10. Polynomial Coefficients

  11. Regularization: penalize large coefficient values. J_{X,y}(w) = (1/2) Σ_i ( y_i − Σ_j w_j φ_j(x_i) )^2 + (λ/2) ||w||^2
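A minimal sketch of the closed-form minimizer of this penalized (ridge) objective, w = (Φ^T Φ + λI)^{-1} Φ^T y. `Phi` is a design matrix built as in the earlier sketches and `lam` is the regularization strength; in practice the intercept column is usually left unpenalized, which this sketch omits.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    # Closed-form ridge solution: (Phi^T Phi + lam * I)^(-1) Phi^T y
    n_features = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(n_features)
    return np.linalg.solve(A, Phi.T @ y)

# Larger lam shrinks the coefficients toward zero, avoiding the huge values
# seen with the unregularized 9th-order polynomial fit.
```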

  12. Regularization:

  13. Over-regularization:

  14. Polynomial Coefficients [table of coefficient values under different regularization strengths: none, exp(18), huge]

  15. LASSO • Adds an L1 regularizer (a penalty on Σ_j |w_j|) to linear regression; this tends to drive many coefficients exactly to zero.
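A sketch illustrating the effect of the L1 penalty via proximal gradient descent (ISTA, soft-thresholding); this is just one way to solve the LASSO objective, not necessarily the slides' method, and `lam`, `lr`, and the iteration count are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 norm: shrink toward zero and clip at zero
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Phi, y, lam, lr=1e-3, n_iters=5000):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - y) / len(y)        # gradient of the squared loss
        w = soft_threshold(w - lr * grad, lr * lam)  # L1 proximal step
    return w   # many entries end up exactly zero
```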

  16. Interpretability. Coefficients suggest importance/correlation with the output. A large positive coefficient implies that the output will increase when this input is increased (positively correlated). A large negative coefficient implies that the output will decrease when this input is increased (negatively correlated). A small or zero coefficient suggests that the input is uncorrelated with the output (at least to first order). Linear regression can be used to find the best "indicators".

  17. Regression for Classification. Given input/output samples (X, y), we learn a function f such that y = f(X), which can be used on new data. Classification: y is discrete (class labels). Regression: y is continuous, e.g. linear regression. [Plots: y, the dependent variable (output), versus x, the independent variable (input).]

  18. From real value to discrete value

  19. From real value to discrete value

  20. From real value to discrete value — non-differentiable

  21. Logistic Regression. Data: inputs are continuous vectors of length K; outputs are discrete valued. Prediction: output is a logistic function of a linear function of the inputs, P(y = 1 | x) = 1 / (1 + exp(−(w·x + b))).
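A minimal sketch of this prediction rule: a logistic (sigmoid) function applied to a linear function of the inputs. The parameters `w` and `b` are assumed to come from training (see the learning slides below).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    return sigmoid(X @ w + b)                   # P(y = 1 | x)

def predict(X, w, b, threshold=0.5):
    return (predict_proba(X, w, b) >= threshold).astype(int)
```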

  22. Classification: Discriminant Functions — decision trees, linear functions, nonlinear functions

  23. Logistic Regression. Data: inputs are continuous vectors of length K; outputs are discrete valued. Prediction: output is a logistic function of a linear function of the inputs. Learning: finds the parameters that minimize some objective function.

  24. Recall: Least Squares for Linear Regression. Learning: finds the parameters that minimize some objective function. We minimize the sum of the squares, J(w) = Σ_i (y_i − w·x_i)^2. Why? It reduces the distance between the true measurements and the predicted values.

  25. Maximum Likelihood for Logistic Regression. Learning: finds the parameters that maximize the log likelihood of observing the data: ℓ(w) = Σ_i [ y_i log p(x_i; w) + (1 − y_i) log(1 − p(x_i; w)) ], where p(x; w) is the predicted probability that y = 1.

  26. Review: Derivative Rules

  27. Stochastic Gradient Descent (Ascent). Gradient descent update: step along the full gradient of the objective. Partial derivative with one training example (x, y): for logistic regression this is (y − σ(w·x)) x. Stochastic gradient descent (ascent) update: w ← w + η (y − σ(w·x)) x.
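A sketch of stochastic gradient ascent on the logistic-regression log likelihood using the per-example gradient above. The step size, epoch count, and the convention of folding the bias into a constant-1 column of X are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=100, seed=0):
    # y in {0, 1}; a constant-1 column in X plays the role of the bias term
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # one random example at a time
            w += lr * (y[i] - sigmoid(X[i] @ w)) * X[i]
    return w
```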

  28. Regression Summary
  - Regression methods: linear regression, logistic regression
  - Optimization: gradient descent, stochastic gradient descent
  - Regularization

  29. Classification
  - Basic concepts
  - Decision tree
  - Naïve Bayesian classifier
  - Model evaluation
  - Support Vector Machines
  - Regression
  - Neural Networks and Deep Learning
  - Lazy Learners (k Nearest Neighbors)
  - Bayesian Belief Networks

  30. Deep Learning: MIT Technology Review – 10 Breakthrough Technologies 2013


  32. Applications: image recognition, speech recognition, natural language processing

  33. Classification: Discriminant Functions — decision trees, linear functions, nonlinear functions

  34. Neural Network and Deep Learning. A neural network is a multi-layer structure of connected input/output units (artificial neurons). Learning works by adjusting the weights so as to predict the correct class label of the input tuples. Deep learning uses more layers than shallow learning. [Figure: input vector X → input layer → hidden layer (weights w_ij) → output layer → output vector]

  35. Artificial Neuron – Perceptron

  36. Neuron: A Hidden/Output Layer Unit. [Figure: inputs x_0 … x_n with weights w_0 … w_n feed a weighted sum Σ, plus a bias μ_k, through an activation function f to produce the output y.] For example: y = sign( Σ_{i=0}^{n} w_i x_i + μ_k ). An n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping. The inputs to the unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. Then a nonlinear activation function is applied to it.
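A minimal sketch of a single unit as described above: a weighted sum of the inputs plus a bias, passed through a nonlinear activation (sign here, matching the example formula). The input and weight values are illustrative.

```python
import numpy as np

def neuron(x, w, bias, activation=np.sign):
    # Weighted sum of the inputs, plus the unit's bias, then the activation
    return activation(np.dot(w, x) + bias)

# e.g. neuron(np.array([1.0, -2.0, 0.5]), np.array([0.3, 0.1, -0.4]), bias=0.2)
```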

  37. From Neuron to Neural Network. The input layer corresponds to the attributes measured for each training tuple. They are then weighted and fed simultaneously to hidden layers. The weighted outputs of the last hidden layer are input to the output layer, which emits the network's prediction. From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.

  38. Neural Networks: a family of parametric, non-linear, and hierarchical representation-learning functions.

  39. Learning Weights (and Biases) in the Network. If a small change in a weight (or bias) caused only a small change in the output, we could modify weights and biases gradually to train the network. Does this hold for the perceptron?

  40. Artificial Neuron – from perceptron to sigmoid neuron. [Figure: perceptron vs. sigmoid neuron]

  41. Popular Activation Functions. Tanh: tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}). ReLU (Rectified Linear Unit): g(y) = max(0, y).
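A short sketch of these activations (plus the sigmoid from the previous slide) in numpy; np.tanh already implements (e^x − e^{−x}) / (e^x + e^{−x}).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # g(y) = max(0, y)

def tanh(z):
    return np.tanh(z)                # (e^z - e^-z) / (e^z + e^-z)

def sigmoid(z):                      # the sigmoid neuron's activation
    return 1.0 / (1.0 + np.exp(-z))
```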


  43. Learning by Backpropagation. Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer — hence "backpropagation". Steps:
  - Initialize weights to small random numbers, along with the associated biases
  - Propagate the inputs forward (by applying the activation function)
  - Backpropagate the error (by updating weights and biases)
  - Terminating condition (when the error is very small, etc.)

  44. Stochastic Gradient Descent
  - Gradient Descent (batch GD): the cost gradient is based on the complete training set; can be costly and slower to converge to the minimum
  - Stochastic Gradient Descent (SGD, iterative or online GD): update the weights after each training sample; the gradient based on a single training sample is a stochastic approximation of the true cost gradient; converges faster, but the path towards the minimum may zig-zag
  - Mini-Batch Gradient Descent (MB-GD): update the weights based on a small group of training samples (see the sketch below)
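A sketch contrasting the three update schemes in one loop, for a user-supplied gradient function `grad(w, X_batch, y_batch)`. The learning rate, epoch count, and batching convention are illustrative, not from the slides.

```python
import numpy as np

def train(X, y, grad, lr=0.01, epochs=10, batch_size=None, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    size = n if batch_size is None else batch_size   # None -> full-batch GD
    for _ in range(epochs):
        idx = rng.permutation(n)                     # shuffle each epoch
        for start in range(0, n, size):
            batch = idx[start:start + size]
            w -= lr * grad(w, X[batch], y[batch])    # size=1 gives SGD,
    return w                                         # 1 < size < n gives MB-GD
```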

  45. Training the neural network. Training data:
  Fields          class
  1.4 2.7 1.9     0
  3.8 3.4 3.2     0
  6.4 2.8 1.7     1
  4.1 0.1 0.2     0
  etc …

  46. Training data: initialise with random weights.

  47. Training data: feed an instance through to get the output, e.g. inputs (1.4, 2.7, 1.9) produce output 0.8.

  48. Training data: compare with the target output — target 0, output 0.8, so the error is 0.8.

  49. Training data: adjust the weights based on the error.

  50. Training data: and so on… e.g. inputs (6.4, 2.8, 1.7), target 1, output 0.9, error −0.1. Repeat this thousands, maybe millions of times, each time taking a random training instance and making slight weight adjustments (stochastic gradient descent).
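A sketch of the walkthrough above as code: a tiny 3-input, 2-hidden-unit, 1-output sigmoid network trained by backpropagation with stochastic gradient descent on the four rows of the training table. The layer sizes, learning rate, and epoch count are illustrative choices, not from the slides.

```python
import numpy as np

X = np.array([[1.4, 2.7, 1.9], [3.8, 3.4, 3.2], [6.4, 2.8, 1.7], [4.1, 0.1, 0.2]])
y = np.array([0.0, 0.0, 1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(2)   # initialise with
W2, b2 = rng.normal(scale=0.5, size=2), 0.0                # random weights
lr = 0.5

for epoch in range(2000):
    for i in rng.permutation(len(y)):            # one random instance at a time
        h = sigmoid(W1 @ X[i] + b1)              # propagate the inputs forward
        out = sigmoid(W2 @ h + b2)
        err = out - y[i]                         # compare with target output
        d_out = err * out * (1 - out)            # backpropagate: output delta,
        d_h = d_out * W2 * h * (1 - h)           # then hidden-layer deltas
        W2 -= lr * d_out * h;  b2 -= lr * d_out  # adjust weights based on error
        W1 -= lr * np.outer(d_h, X[i]);  b1 -= lr * d_h

h = sigmoid(X @ W1.T + b1)                       # final forward pass
print(sigmoid(h @ W2 + b2))                      # should approach [0, 0, 1, 0]
```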

  51. Learning for neural networks: shallow networks; deep networks with multiple layers — deep learning.

  52. Feature detectors

  53. Hidden layer units become self-organised feature detectors. [Figure: the weights from inputs 1–25 into one hidden unit — strong weights vs. low/zero weights]

  54. What does this unit detect? [Same weight diagram: strong weights vs. low/zero weights.] It will send a strong signal for a horizontal line in the top row, ignoring everything else.

  55. What features might you expect a good NN to learn, when trained with data like this?

  56. Vertical lines

  57. Horizontal lines

  58. Small circles

  59. Small circles

  60. Hierarchical Feature Learning. Lower-level units detect lines in specific positions; higher-level detectors combine them ("horizontal line", "RHS vertical line", "upper loop", etc.), and so on up the hierarchy.

  61. Hierarchical Feature Learning. Deep learning (a.k.a. representation learning) seeks to learn rich hierarchical representations (i.e. features) automatically through multiple stages of a feature-learning process: low-level features → mid-level features → high-level features → trainable classifier → output. Feature visualization of a convolutional net trained on ImageNet (Zeiler and Fergus, 2013).

  62. Deep Learning Architectures. Commonly used architectures: convolutional neural networks, recurrent neural networks.
