SLIDE 1

Decision support systems and machine learning

Lecture 11

Lecture 11 – p. 1/24

SLIDE 2

Neural networks: Biological and artificial

Consider humans:

  • Neuron switching time ≈ 0.001 second
  • Number of neurons ≈ 10^10
  • Connections per neuron ≈ 10^4–10^5
  • Scene recognition time ≈ 0.1 sec
  • 100 inference steps doesn’t seem like enough ⇒ much parallel computation

Properties of artificial neural nets (ANNs):

  • Many neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing
  • Emphasis on tuning weights automatically

Lecture 11 – p. 2/24

SLIDE 3

Model of biological neurons

[Diagram: inputs x1, …, xn with weights w1, …, wn feed a summation unit followed by the activation function g, producing the output a.]

a = g(w1 · x1 + · · · + wn · xn + w0)

Perceptron:

  • The inputs are combined linearly: w1 · x1 + · · · + wn · xn = w̄ · x̄ (vector notation).

  • The output is non-linear.

We have different activation functions g:

Step function: step(x)
Sign function: sign(x)
Sigmoid function: σ(x) = 1/(1 + exp(−β·x))
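As a quick sketch in Python, the perceptron and its activation functions look as follows (β = 1 as a default and a step threshold at 0 are illustrative assumptions):

```python
import math

def step(x):
    # Step function: 1 above zero, 0 otherwise.
    return 1 if x > 0 else 0

def sign(x):
    # Sign function: +1 for positive net input, -1 otherwise.
    return 1 if x > 0 else -1

def sigmoid(x, beta=1.0):
    # Sigmoid: smooth, differentiable squashing into (0, 1).
    return 1.0 / (1.0 + math.exp(-beta * x))

def perceptron(weights, inputs, g=sign):
    # a = g(w1*x1 + ... + wn*xn + w0); weights[0] is the bias w0.
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return g(net)

print(perceptron([-1.0, 1.0, 1.0], [1, 1]))   # an AND-like unit on inputs (1, 1) -> 1
```

Any of the three functions can be passed as g; only the sigmoid is differentiable, which matters for the gradient-based training later in the lecture.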

Lecture 11 – p. 3/24

SLIDE 4

When to consider neural networks

Application areas are usually characterized by:

  • Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
  • Output is discrete or real-valued
  • Output is a vector of values
  • Possibly noisy data (e.g., recognition of hand-written digits)
  • Form of target function is unknown
  • Human readability of result is unimportant
  • Long training times are acceptable

Examples:

  • Speech phoneme recognition
  • Image classification
  • Financial prediction

Lecture 11 – p. 4/24

SLIDE 5

NET-talk

Pronunciation of letters is very context dependent, e.g.:

  • bite and bit
  • prefer and preference

Speech is decomposed into 60 phonemes (a “sound alphabet”), which can be encoded using 26 independent units. The network was trained on 1024 words:

  • 95% correct on the training set.
  • 78% correct on a test set (a “story”).

[Diagram: input text → 80 hidden neurons → 26 output units.]

Lecture 11 – p. 5/24

SLIDE 6

ALVINN drives 70 mph on highways

Weight values for a hidden unit encouraging a turn to the left.

Lecture 11 – p. 6/24

SLIDE 7

Perceptron examples

[Diagram: two-input perceptron with bias input 1, weights w0, w1, w2, and a sign unit.]

The decision surface of a two-input perceptron, a(x1, x2) = sign(x1·w1 + x2·w2 + w0), is given by a straight line separating the positive and negative examples.

[Plot: positive (+) and negative (−) examples in the (x1, x2) plane, separated by a straight line.]

Lecture 11 – p. 7/24

SLIDE 8

Perceptron examples

[Diagram: perceptron with bias 1 and weights w0, w1, w2: output 1 iff x1w1 + x2w2 + w0 > 0.]

  X1   X2   X1 ∧ X2
  −1   −1     −1
   1   −1     −1
  −1    1     −1
   1    1      1

This perceptron specifies the decision surface:

−1 + 1 · x1 + 1 · x2 = 0
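A minimal check of this AND perceptron in Python, using the weights from the slide (w0 = −1, w1 = 1, w2 = 1):

```python
def perceptron_and(x1, x2):
    # sign(-1 + 1*x1 + 1*x2): positive only when both inputs are +1.
    return 1 if (-1 + 1 * x1 + 1 * x2) > 0 else -1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print((x1, x2), "->", perceptron_and(x1, x2))
```

The printed table reproduces the X1 ∧ X2 column above.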

Lecture 11 – p. 7/24

SLIDE 9

Perceptron examples

[Diagram: perceptron with bias 1 and weights w0, w1, w2: output 1 iff x1w1 + x2w2 + w0 > 0.]

  X1   X2   X1 ∨ X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1      1

This perceptron specifies the decision surface:

1 + 1 · x1 + 1 · x2 = 0

Lecture 11 – p. 7/24

SLIDE 10

Perceptron examples

[Diagram: perceptron with bias input −1 and weights w0, w1, w2: output 1 iff x1w1 + x2w2 − w0 > 0.]

  X1   X2   X1 xor X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1     −1

We cannot specify any values for the weights so that the perceptron can represent the “xor” function: the examples are not linearly separable.
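The claim can be illustrated by brute force: searching a grid of candidate weights finds a perceptron for AND but none for XOR. (The finite grid is an illustrative assumption; the impossibility in fact holds for all real weights.)

```python
import itertools

CASES = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

def representable(targets, grid):
    # Does any (w0, w1, w2) on the grid give sign(w0 + w1*x1 + w2*x2) = target on all cases?
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        outs = [1 if (w0 + w1 * x1 + w2 * x2) > 0 else -1 for x1, x2 in CASES]
        if outs == targets:
            return True
    return False

grid = [k / 2 for k in range(-6, 7)]            # weights -3.0, -2.5, ..., 3.0
print(representable([-1, -1, -1, 1], grid))     # AND: True
print(representable([-1, 1, 1, -1], grid))      # XOR: False
```

AND succeeds (e.g. w = (−1, 1, 1) is on the grid), while no weight triple reproduces the XOR column.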

Lecture 11 – p. 7/24

SLIDE 11

Linear separable

[Plot: the four points (±1, ±1) in the (X1, X2) plane, with the separating line x1 · w1 + x2 · w2 = 0 and the half-plane x1 · w1 + x2 · w2 > 0.]

Lecture 11 – p. 8/24

SLIDE 12

Linear separable

[Plot: the four points (±1, ±1) in the (X1, X2) plane, with the separating line x1 · w1 + x2 · w2 = 0 and the half-plane x1 · w1 + x2 · w2 > 0.]

Only linearly separable functions can be computed by a single perceptron.

Lecture 11 – p. 8/24

SLIDE 13

Two-layer (feed-forward)

[Diagram: two-layer feed-forward network with inputs X1, X2, …, Xn, hidden units H1, …, Hk, and outputs O1, …, Om.]

Connections go only to the layer beneath.

Lecture 11 – p. 9/24

SLIDE 14

XOR again

[Diagram: two-layer network with inputs X1, X2, bias inputs 1, hidden units H1, H2, and output O.]

  X1   X2   X1 xor X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1     −1

Lecture 11 – p. 10/24

SLIDE 15

XOR again

[Diagram: the same network annotated with weights; reading (bias, from X1, from X2): H1 has (−1, −1, 1), H2 has (−1, 1, −1), and O has (1, 1, 1).]

  X1   X2   X1 xor X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1     −1

Lecture 11 – p. 10/24

SLIDE 16

XOR again

[Diagram: the weighted network, with H1 computing ¬X1 ∧ X2, H2 computing X1 ∧ ¬X2, and O computing (X1 ∧ ¬X2) ∨ (¬X1 ∧ X2).]

  X1   X2   X1 xor X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1     −1
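The construction can be verified directly in code. The concrete weights below, written as (bias, weight from first input, weight from second input), are a reconstruction consistent with H1 = ¬X1 ∧ X2 and H2 = X1 ∧ ¬X2; the slide's exact numbers are partly illegible:

```python
def unit(bias, w1, w2, x1, x2):
    # Threshold unit over +/-1 inputs: sign(bias + w1*x1 + w2*x2).
    return 1 if (bias + w1 * x1 + w2 * x2) > 0 else -1

def xor_net(x1, x2):
    h1 = unit(-1, -1, 1, x1, x2)    # H1: not X1 and X2
    h2 = unit(-1, 1, -1, x1, x2)    # H2: X1 and not X2
    return unit(1, 1, 1, h1, h2)    # O:  H1 or H2

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print((x1, x2), "->", xor_net(x1, x2))
```

The printed outputs reproduce the X1 xor X2 column, showing that the hidden layer removes the linear-separability obstacle.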

Lecture 11 – p. 10/24

SLIDE 17

Classification tasks

Assume that we have:

Attributes: P1, . . . , Pn (which we can assume are binary)
Classes: C1, . . . , Cm
Examples: D = {“case = (P′1, . . . , P′n) is a Cj”}

[Diagram: network with inputs P1, P2, …, Pn, hidden units H1, …, Hk, and outputs O1, …, Ol indicating the class.]

Classification using artificial neural networks.

Lecture 11 – p. 11/24

SLIDE 18

Example: Boolean functions

  X1   X2   X3   X4    T
  −1   −1   −1   −1    1
  −1   −1   −1    1    1
  −1   −1    1   −1   −1
  −1   −1    1    1    1
  −1    1   −1   −1   −1
  −1    1   −1    1    1
  −1    1    1   −1    1
  −1    1    1    1   −1
   1   −1   −1   −1   −1
   1   −1   −1    1   −1
   1   −1    1   −1    1
   1   −1    1    1    1
   1    1   −1   −1   −1
   1    1   −1    1   −1
   1    1    1   −1    1
   1    1    1    1    1

[Diagram: network with inputs X1, X2, X3, X4.]

This can always be done.

Lecture 11 – p. 12/24

SLIDE 19

Example: Disjunctive normal form

Consider the Boolean expression: (X1 ∧ ¬X2) ∨ (X2 ∧ X3 ∧ ¬X4).

[Diagram: network with inputs X1, X2, X3, X4, bias inputs 1, hidden units H1, H2, and output O.]

Lecture 11 – p. 13/24

SLIDE 20

Example: Disjunctive normal form

Consider the Boolean expression: (X1 ∧ ¬X2) ∨ (X2 ∧ X3 ∧ ¬X4).

[Diagram: the network annotated with weights; a consistent reading is H1 with bias −1 and input weights (1, −1, 0, 0), H2 with bias −2 and input weights (0, 1, 1, −1), and O with bias 1 and weights (1, 1).]
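The construction generalizes: one hidden unit per conjunction, with bias −(k − 1) for a term of k literals, and an OR unit on top. The weights below are a reconstruction under that scheme (they match the −2 and 1 that survive on the slide, but the full assignment there is garbled):

```python
import itertools

def unit(bias, weights, xs):
    # Threshold unit over +/-1 inputs.
    return 1 if (bias + sum(w * x for w, x in zip(weights, xs))) > 0 else -1

def dnf_net(x1, x2, x3, x4):
    h1 = unit(-1, (1, -1, 0, 0), (x1, x2, x3, x4))   # X1 and not X2 (2 literals, bias -1)
    h2 = unit(-2, (0, 1, 1, -1), (x1, x2, x3, x4))   # X2 and X3 and not X4 (3 literals, bias -2)
    return unit(1, (1, 1), (h1, h2))                 # H1 or H2

def target(x1, x2, x3, x4):
    # The Boolean expression itself, for comparison.
    return 1 if ((x1 == 1 and x2 == -1) or (x2 == 1 and x3 == 1 and x4 == -1)) else -1

assert all(dnf_net(*xs) == target(*xs) for xs in itertools.product((-1, 1), repeat=4))
print("network agrees with (X1 and not X2) or (X2 and X3 and not X4) on all 16 cases")
```

Since any Boolean function has a disjunctive normal form, this is one way to see that a two-layer network can represent any Boolean function.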

Lecture 11 – p. 13/24

SLIDE 21

Learning weights and threshold I

[Diagram: perceptron with bias 1 and weights w0, w1, w2 computing X1 ∧ ¬X2.]

  X1   X2   X1 ∧ ¬X2
  −1   −1     −1
   1   −1      1
  −1    1     −1
   1    1     −1

Lecture 11 – p. 14/24

SLIDE 22

Learning weights and threshold I

[Diagram and truth table as on the previous slide.]

We have:

  • D = (d̄1, d̄2, d̄3, d̄4) input vectors (cases).
  • t̄ = (−1, 1, −1, −1) vector of target outputs.
  • w̄ = (w0, w1, w2) vector of current parameters.
  • ō = (o1, o2, o3, o4) vector of current outputs.

We request: w̄∗, parameters yielding ō = t̄.

Lecture 11 – p. 14/24

SLIDE 23

The perceptron training rule

[Diagram: perceptron with bias 1, inputs X1, X2, and weights w0, w1, w2.]

o = 1 if X̄ · w̄ > 0, and −1 otherwise.

Error: E = t − o. The weights are updated as follows:

  • E > 0 ⇒ o shall be increased ⇒ X̄ · w̄ up ⇒ w̄ := w̄ + αE·X̄
  • E < 0 ⇒ o shall be decreased ⇒ X̄ · w̄ down ⇒ w̄ := w̄ + αE·X̄

α is called the learning rate. With the training examples linearly separable and α not too large, this procedure will converge to a correct set of parameters.
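In code the rule is a few lines (a sketch: each case vector carries the constant bias component 1 in front, and α = 0.25 is an illustrative choice matching the worked example on the following slides):

```python
def train_perceptron(cases, targets, alpha=0.25, max_epochs=100):
    # cases: input vectors, each beginning with the bias component 1.
    w = [0.0] * len(cases[0])
    for _ in range(max_epochs):
        converged = True
        for x, t in zip(cases, targets):
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            e = t - o                                              # E = t - o
            if e != 0:
                converged = False
                w = [wi + alpha * e * xi for wi, xi in zip(w, x)]  # w := w + alpha*E*X
        if converged:
            return w
    return w

# Learn X1 AND X2 (linearly separable, so the rule converges).
cases = [(1, -1, -1), (1, 1, -1), (1, -1, 1), (1, 1, 1)]
targets = [-1, -1, -1, 1]
w = train_perceptron(cases, targets)
print(w)
```

After convergence the learned weights classify all four cases correctly.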

Lecture 11 – p. 15/24

SLIDE 24

Example: X1 ∧ X2

w0: 0   w1: 0   w2: 0

[Diagram: perceptron with bias 1 and inputs X1, X2.]

  X1   X2   X1 ∧ X2
  −1   −1     −1
   1   −1     −1
  −1    1     −1
   1    1      1

o = 1 if w1·X1 + w2·X2 + w0 > 0, and −1 otherwise.

α = 1/4

Cases: (1, −1, −1)  (1, 1, −1)  (1, −1, 1)  (1, 1, 1)
t̄:        −1          −1          −1          1
ō:
E:

Lecture 11 – p. 16/24

SLIDE 25

Example: X1 ∧ X2

w0: 0   w1: 0   w2: 0

[Diagram and truth table as on the previous slide.]

o = 1 if w1·X1 + w2·X2 + w0 > 0, and −1 otherwise.

α = 1/4

Cases: (1, −1, −1)  (1, 1, −1)  (1, −1, 1)  (1, 1, 1)
t̄:        −1          −1          −1          1
ō:        −1          −1          −1         −1
E:         0           0           0          2

w̄ := (0, 0, 0) + (1/4) · 2 · (1, 1, 1) = (1/2, 1/2, 1/2)

Lecture 11 – p. 16/24

SLIDE 26

Example: X1 ∧ X2

w0: 0 / 1/2   w1: 0 / 1/2   w2: 0 / 1/2

[Diagram and truth table as before.]

o = 1 if w1·X1 + w2·X2 + w0 > 0, and −1 otherwise.

α = 1/4

Cases: (1, −1, −1)  (1, 1, −1)  (1, −1, 1)  (1, 1, 1)
t̄:        −1          −1          −1          1
ō:        −1           1           1          1
E:         0          −2          −2          0

w̄ := (1/2, 1/2, 1/2) + (1/4) · (−2) · (1, 1, −1) = (0, 0, 1)

Lecture 11 – p. 16/24

SLIDE 27

Example: X1 ∧ X2

w0: 0 / 1/2 / 0   w1: 0 / 1/2 / 0   w2: 0 / 1/2 / 1

[Diagram and truth table as before.]

o = 1 if w1·X1 + w2·X2 + w0 > 0, and −1 otherwise.

α = 1/4

Cases: (1, −1, −1)  (1, 1, −1)  (1, −1, 1)  (1, 1, 1)
t̄:        −1          −1          −1          1
ō:        −1          −1           1          1
E:         0           0          −2          0

ETC. — the procedure terminates after a finite number of steps.
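The first two weight updates on these slides can be reproduced mechanically (a sketch of the incremental procedure, with the cases and α = 1/4 from the example):

```python
cases = [(1, -1, -1), (1, 1, -1), (1, -1, 1), (1, 1, 1)]
targets = [-1, -1, -1, 1]
alpha = 0.25
w = [0.0, 0.0, 0.0]
history = []

for epoch in range(5):
    for x, t in zip(cases, targets):
        o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
        e = t - o
        if e != 0:
            w = [wi + alpha * e * xi for wi, xi in zip(w, x)]   # w := w + alpha*E*X
            history.append(list(w))

print(history[0])   # first update:  [0.5, 0.5, 0.5]
print(history[1])   # second update: [0.0, 0.0, 1.0]
```

The first two entries of the history match the weight sequence 0 → 1/2 → … shown in the slide headers.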

Lecture 11 – p. 16/24

SLIDE 28

Gradient descent I

The perceptron training rule finds a good weight vector when the training examples are linearly separable. If this is not the case, we can apply gradient descent to ensure convergence. To understand this, consider the simpler linear unit, where

o = w0 + w1 · X1 + · · · + wn · Xn

Define the error function as: Ed(w̄) = ½ (td − od)². So we seek to learn the weights that minimize the squared error.

Lecture 11 – p. 17/24

SLIDE 29

Gradient descent II

We use the gradient to find the direction in which to change the weights:

∇Ed(w̄) = (∂Ed/∂w0, . . . , ∂Ed/∂wn)

This specifies the direction of steepest increase in Ed. Hence, our new training rule becomes: wi := wi + Δwi, where

Δwi = −η · ∂Ed/∂wi   and   ∂Ed/∂wi = (td − od) · (−Xid)

[Plot: the error surface E[w̄] over (w0, w1), a bowl whose minimum gradient descent approaches.]
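A sketch of gradient descent for a single linear unit with the resulting rule Δwi = η(td − od)·Xid (the data, the learning rate η = 0.05, and the epoch count are illustrative assumptions):

```python
def train_linear_unit(data, eta=0.05, epochs=200):
    # data: list of (input vector with leading bias 1, target); o = w . x.
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            # Step against the gradient: w_i := w_i + eta * (t - o) * x_i
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Recover the linear target t = 1 + 2*x from five noise-free samples.
data = [((1, x), 1 + 2 * x) for x in (-2, -1, 0, 1, 2)]
w = train_linear_unit(data)
print(w)   # close to [1.0, 2.0]
```

Because the error surface of a linear unit is a single bowl, the updates converge for any sufficiently small η, separable data or not.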

Lecture 11 – p. 18/24

SLIDE 30

Learning of two layer feed-forward network

[Diagram: the two-layer network with inputs X1, X2, bias inputs 1, hidden units H1, H2, and output O.]

What type of unit/activation function should we use?

  • It should be non-linear.
  • It should be differentiable.

One solution is the sigmoid function:

Sigmoid function: σ(x) = 1/(1 + exp(−β·x))

Lecture 11 – p. 19/24

SLIDE 31

Back propagation I

Training examples provide target values only for the network outputs, so no target values are directly available for indicating the error of the hidden units. Idea: Instead, calculate an error term δh for a hidden unit by taking the weighted sum of the error terms δk of the output units it influences.

Lecture 11 – p. 20/24

SLIDE 32

Back propagation I

Training examples provide target values only for the network outputs, so no target values are directly available for indicating the error of the hidden units. Idea: Instead, calculate an error term δh for a hidden unit by taking the weighted sum of the error terms δk of the output units it influences.

[Diagram: network X1, X2 → H1, H2 → O, with weights w11, w12, w21, w22 into the hidden layer and w1, w2 into the output unit; E = ½ (o − t)². Error terms: δO at the output, δH1(δO) and δH2(δO) at the hidden units.]

That is, we compute an error term for the output unit and propagate it backwards in the network. The error term plays the role of (t − o) in the delta rule.

Lecture 11 – p. 20/24

SLIDE 33

Back propagation II

For an output unit:

∂E/∂w1 = 2 · ½ (o − t) · ∂o/∂w1 = (o − t) · ∂o/∂w1

Since o = 1/(1 + exp(−w̄ · x̄)) is the sigmoid function, ∂o/∂w1 is easily available:

∂o/∂w1 = o(1 − o) · X1O

Hence, the rule for modification is:

w1 := w1 + η · X1O · (t − o)o(1 − o) = w1 + η · X1O · δO

where (t − o)o(1 − o) is the error term δO of the output unit. The error term δO is now used for modifying the weights of the hidden units.

Lecture 11 – p. 21/24

SLIDE 34

Back propagation III

For a hidden unit: Since we don’t have direct access to a target output, we use the weighted error term of the output unit when modifying the weights.

The error term for the hidden unit H1 is then: δH1 := oH1(1 − oH1) · w1 · δO

Hence, the rule for modification is: w11 := w11 + η · X1 · δH1
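Putting the two rules together gives the full backpropagation loop for the 2-2-1 network. This is a hedged sketch: sigmoid units with β = 1, targets rescaled to {0, 1} so they fall inside the sigmoid's range, random initial weights, and η = 0.5 are illustrative assumptions, not values from the lecture.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x1, x2):
    # Each weight vector is (bias, weight from first input, weight from second input).
    h1 = sigmoid(w["h1"][0] + w["h1"][1] * x1 + w["h1"][2] * x2)
    h2 = sigmoid(w["h2"][0] + w["h2"][1] * x1 + w["h2"][2] * x2)
    o = sigmoid(w["o"][0] + w["o"][1] * h1 + w["o"][2] * h2)
    return h1, h2, o

def train_xor(eta=0.5, epochs=2000, seed=1):
    rnd = random.Random(seed)
    w = {k: [rnd.uniform(-1, 1) for _ in range(3)] for k in ("h1", "h2", "o")}
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for _ in range(epochs):
        for (x1, x2), t in data:
            h1, h2, o = forward(w, x1, x2)
            delta_o = (t - o) * o * (1 - o)                  # output error term
            delta_h1 = h1 * (1 - h1) * w["o"][1] * delta_o   # back-propagated ...
            delta_h2 = h2 * (1 - h2) * w["o"][2] * delta_o   # ... error terms
            for i, inp in enumerate((1, h1, h2)):
                w["o"][i] += eta * inp * delta_o             # w := w + eta * input * delta
            for i, inp in enumerate((1, x1, x2)):
                w["h1"][i] += eta * inp * delta_h1
                w["h2"][i] += eta * inp * delta_h2
    return w

def total_error(w):
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    return sum((t - forward(w, x1, x2)[2]) ** 2 for (x1, x2), t in data)

print(total_error(train_xor(epochs=0)), "->", total_error(train_xor()))
```

With enough epochs the error typically approaches zero and the outputs round to the XOR targets, though gradient descent on this surface can occasionally stall in a local minimum.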

Lecture 11 – p. 22/24

SLIDE 35

Two ways of training

The situation:

  • We have a set of training cases {t̄1, . . . , t̄n}.
  • The network is trained by running the training set through the modification algorithm several times.

There are two ways of doing this:

  • Incremental training – modify the weights after each t̄i.
  • Cumulated training – accumulate the error over the full set and then modify.

In practice, method one is usually preferred.

Lecture 11 – p. 23/24

SLIDE 36

Summary: Characteristics of neural networks

An artificial neural network (ANN) can be characterized as follows:

  • It is a network of perceptrons, and each network has a predefined input layer and an output layer.
  • An ANN computes a function: I → O.
  • The weights and thresholds are the parameters of the ANN.
  • Through prescribed input-output behavior the parameters are learned.
  • The learning algorithm uses gradient descent.
  • ANNs are particularly good at classification in domains with little known structure.

Lecture 11 – p. 24/24