SLIDE 1

Decision support systems and machine learning

Lecture 11

Lecture 11 – p. 1/24

SLIDE 2

Neural networks: Biological and artificial

Consider humans:

  • Neuron switching time ≈ 0.001 second
  • Number of neurons ≈ 10^10
  • Connections per neuron ≈ 10^4–10^5
  • Scene recognition time ≈ 0.1 sec
  • 100 inference steps doesn’t seem like enough ⇒ much parallel computation

Properties of artificial neural nets (ANNs):

  • Many neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing
  • Emphasis on tuning weights automatically

Lecture 11 – p. 2/24

SLIDE 3

Model of biological neurons

[Diagram: inputs x1, …, xn with weights w1, …, wn feed a summation unit followed by the activation function g, producing the output a.]

a = g(w1 · x1 + · · · + wn · xn + w0)

Perceptron:

  • The inputs are combined linearly: w1 · x1 + · · · + wn · xn = w̄ · x̄ (vector notation).

  • The output is non-linear.

We have different activation functions g:

Step function: step(x)
Sign function: sign(x)
Sigmoid function: σ(x) = 1/(1 + exp(−β·x))
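As a quick sketch in Python, the perceptron and its activation functions look as follows (β = 1 as a default and a step threshold at 0 are illustrative assumptions):

```python
import math

def step(x):
    # Step function: 1 above zero, 0 otherwise.
    return 1 if x > 0 else 0

def sign(x):
    # Sign function: +1 for positive net input, -1 otherwise.
    return 1 if x > 0 else -1

def sigmoid(x, beta=1.0):
    # Sigmoid: smooth, differentiable squashing into (0, 1).
    return 1.0 / (1.0 + math.exp(-beta * x))

def perceptron(weights, inputs, g=sign):
    # a = g(w1*x1 + ... + wn*xn + w0); weights[0] is the bias w0.
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return g(net)

print(perceptron([-1.0, 1.0, 1.0], [1, 1]))   # an AND-like unit on inputs (1, 1) -> 1
```

Any of the three functions can be passed as g; only the sigmoid is differentiable, which matters for the gradient-based training later in the lecture.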

Lecture 11 – p. 3/24

SLIDE 4

When to consider neural networks

Application areas are usually characterized by:

  • Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
  • Output is discrete or real-valued
  • Output is a vector of values
  • Possibly noisy data (e.g., recognition of hand-written digits)
  • Form of target function is unknown
  • Human readability of result is unimportant
  • Long training times are acceptable

Examples:

  • Speech phoneme recognition
  • Image classification
  • Financial prediction

Lecture 11 – p. 4/24

SLIDE 5

NET-talk

Pronunciation of letters is very context dependent, e.g.:

  • bite and bit
  • prefer and preference

Speech is decomposed into 60 phonemes (a “sound alphabet”), which can be encoded using 26 independent units. The network was trained on 1024 words:

  • 95% correct on the training set.
  • 78% correct on a test set (a “story”).

[Diagram: input text → 80 hidden neurons → 26 output units.]

Lecture 11 – p. 5/24

SLIDE 6

ALVINN drives 70 mph on highways

Weight values for a hidden unit encouraging a turn to the left.

Lecture 11 – p. 6/24

SLIDE 7

Perceptron examples

[Diagram: two-input perceptron with bias input 1, weights w0, w1, w2, and a sign unit.]

The decision surface of a two-input perceptron, a(x1, x2) = sign(x1·w1 + x2·w2 + w0), is given by a straight line separating the positive and negative examples.

[Plot: positive (+) and negative (−) examples in the (x1, x2) plane, separated by a straight line.]

Lecture 11 – p. 7/24

SLIDE 8

Perceptron examples

[Diagram: perceptron with bias 1 and weights w0, w1, w2: output 1 iff x1w1 + x2w2 + w0 > 0.]

  X1   X2   X1 ∧ X2
  −1   −1     −1
   1   −1     −1
  −1    1     −1
   1    1      1

This perceptron specifies the decision surface:

−1 + 1 · x1 + 1 · x2 = 0
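A minimal check of this AND perceptron in Python, using the weights from the slide (w0 = −1, w1 = 1, w2 = 1):

```python
def perceptron_and(x1, x2):
    # sign(-1 + 1*x1 + 1*x2): positive only when both inputs are +1.
    return 1 if (-1 + 1 * x1 + 1 * x2) > 0 else -1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print((x1, x2), "->", perceptron_and(x1, x2))
```

The printed table reproduces the X1 ∧ X2 column above.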

Lecture 11 – p. 7/24

SLIDE 9

Perceptron examples

[Diagram: perceptron with bias 1 and weights w0, w1, w2: output 1 iff x1w1 + x2w2 + w0 > 0.]

  X1   X2   X1 ∨ X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1      1

This perceptron specifies the decision surface:

1 + 1 · x1 + 1 · x2 = 0

Lecture 11 – p. 7/24

SLIDE 10

Perceptron examples

[Diagram: perceptron with bias input −1 and weights w0, w1, w2: output 1 iff x1w1 + x2w2 − w0 > 0.]

  X1   X2   X1 xor X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1     −1

We cannot specify any values for the weights so that the perceptron can represent the “xor” function: the examples are not linearly separable.
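The claim can be illustrated by brute force: searching a grid of candidate weights finds a perceptron for AND but none for XOR. (The finite grid is an illustrative assumption; the impossibility in fact holds for all real weights.)

```python
import itertools

CASES = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

def representable(targets, grid):
    # Does any (w0, w1, w2) on the grid give sign(w0 + w1*x1 + w2*x2) = target on all cases?
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        outs = [1 if (w0 + w1 * x1 + w2 * x2) > 0 else -1 for x1, x2 in CASES]
        if outs == targets:
            return True
    return False

grid = [k / 2 for k in range(-6, 7)]            # weights -3.0, -2.5, ..., 3.0
print(representable([-1, -1, -1, 1], grid))     # AND: True
print(representable([-1, 1, 1, -1], grid))      # XOR: False
```

AND succeeds (e.g. w = (−1, 1, 1) is on the grid), while no weight triple reproduces the XOR column.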

Lecture 11 – p. 7/24

SLIDE 11

Linear separable

[Plot: the four points (±1, ±1) in the (X1, X2) plane, with the separating line x1 · w1 + x2 · w2 = 0 and the half-plane x1 · w1 + x2 · w2 > 0.]

Lecture 11 – p. 8/24

SLIDE 12

Linear separable

[Plot: the four points (±1, ±1) in the (X1, X2) plane, with the separating line x1 · w1 + x2 · w2 = 0 and the half-plane x1 · w1 + x2 · w2 > 0.]

Only linearly separable functions can be computed by a single perceptron.

Lecture 11 – p. 8/24

SLIDE 13

Two-layer (feed-forward)

[Diagram: two-layer feed-forward network with inputs X1, X2, …, Xn, hidden units H1, …, Hk, and outputs O1, …, Om.]

Connections go only to the layer beneath.

Lecture 11 – p. 9/24

SLIDE 14

XOR again

[Diagram: two-layer network with inputs X1, X2, bias inputs 1, hidden units H1, H2, and output O.]

  X1   X2   X1 xor X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1     −1

Lecture 11 – p. 10/24

SLIDE 15

XOR again

[Diagram: the same network annotated with weights; reading (bias, from X1, from X2): H1 has (−1, −1, 1), H2 has (−1, 1, −1), and O has (1, 1, 1).]

  X1   X2   X1 xor X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1     −1

Lecture 11 – p. 10/24

SLIDE 16

XOR again

[Diagram: the weighted network, with H1 computing ¬X1 ∧ X2, H2 computing X1 ∧ ¬X2, and O computing (X1 ∧ ¬X2) ∨ (¬X1 ∧ X2).]

  X1   X2   X1 xor X2
  −1   −1     −1
   1   −1      1
  −1    1      1
   1    1     −1
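The construction can be verified directly in code. The concrete weights below, written as (bias, weight from first input, weight from second input), are a reconstruction consistent with H1 = ¬X1 ∧ X2 and H2 = X1 ∧ ¬X2; the slide's exact numbers are partly illegible:

```python
def unit(bias, w1, w2, x1, x2):
    # Threshold unit over +/-1 inputs: sign(bias + w1*x1 + w2*x2).
    return 1 if (bias + w1 * x1 + w2 * x2) > 0 else -1

def xor_net(x1, x2):
    h1 = unit(-1, -1, 1, x1, x2)    # H1: not X1 and X2
    h2 = unit(-1, 1, -1, x1, x2)    # H2: X1 and not X2
    return unit(1, 1, 1, h1, h2)    # O:  H1 or H2

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print((x1, x2), "->", xor_net(x1, x2))
```

The printed outputs reproduce the X1 xor X2 column, showing that the hidden layer removes the linear-separability obstacle.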

Lecture 11 – p. 10/24

SLIDE 17

Classification tasks

Assume that we have:

Attributes: P1, . . . , Pn (which we can assume are binary)
Classes: C1, . . . , Cm
Examples: D = {“case = (P′1, . . . , P′n) is a Cj”}

[Diagram: network with inputs P1, P2, …, Pn, hidden units H1, …, Hk, and outputs O1, …, Ol indicating the class.]

Classification using artificial neural networks.

Lecture 11 – p. 11/24

SLIDE 18

Example: Boolean functions

  X1   X2   X3   X4    T
  −1   −1   −1   −1    1
  −1   −1   −1    1    1
  −1   −1    1   −1   −1
  −1   −1    1    1    1
  −1    1   −1   −1   −1
  −1    1   −1    1    1
  −1    1    1   −1    1
  −1    1    1    1   −1
   1   −1   −1   −1   −1
   1   −1   −1    1   −1
   1   −1    1   −1    1
   1   −1    1    1    1
   1    1   −1   −1   −1
   1    1   −1    1   −1
   1    1    1   −1    1
   1    1    1    1    1

[Diagram: network with inputs X1, X2, X3, X4.]

This can always be done.

Lecture 11 – p. 12/24

SLIDE 19

Example: Disjunctive normal form

Consider the Boolean expression: (X1 ∧ ¬X2) ∨ (X2 ∧ X3 ∧ ¬X4).

[Diagram: network with inputs X1, X2, X3, X4, bias inputs 1, hidden units H1, H2, and output O.]

Lecture 11 – p. 13/24

SLIDE 20

Example: Disjunctive normal form

Consider the Boolean expression: (X1 ∧ ¬X2) ∨ (X2 ∧ X3 ∧ ¬X4).

[Diagram: the network annotated with weights; a consistent reading is H1 with bias −1 and input weights (1, −1, 0, 0), H2 with bias −2 and input weights (0, 1, 1, −1), and O with bias 1 and weights (1, 1).]
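The construction generalizes: one hidden unit per conjunction, with bias −(k − 1) for a term of k literals, and an OR unit on top. The weights below are a reconstruction under that scheme (they match the −2 and 1 that survive on the slide, but the full assignment there is garbled):

```python
import itertools

def unit(bias, weights, xs):
    # Threshold unit over +/-1 inputs.
    return 1 if (bias + sum(w * x for w, x in zip(weights, xs))) > 0 else -1

def dnf_net(x1, x2, x3, x4):
    h1 = unit(-1, (1, -1, 0, 0), (x1, x2, x3, x4))   # X1 and not X2 (2 literals, bias -1)
    h2 = unit(-2, (0, 1, 1, -1), (x1, x2, x3, x4))   # X2 and X3 and not X4 (3 literals, bias -2)
    return unit(1, (1, 1), (h1, h2))                 # H1 or H2

def target(x1, x2, x3, x4):
    # The Boolean expression itself, for comparison.
    return 1 if ((x1 == 1 and x2 == -1) or (x2 == 1 and x3 == 1 and x4 == -1)) else -1

assert all(dnf_net(*xs) == target(*xs) for xs in itertools.product((-1, 1), repeat=4))
print("network agrees with (X1 and not X2) or (X2 and X3 and not X4) on all 16 cases")
```

Since any Boolean function has a disjunctive normal form, this is one way to see that a two-layer network can represent any Boolean function.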

Lecture 11 – p. 13/24

SLIDE 21

Learning weights and threshold I

[Diagram: perceptron with bias 1 and weights w0, w1, w2 computing X1 ∧ ¬X2.]

  X1   X2   X1 ∧ ¬X2
  −1   −1     −1
   1   −1      1
  −1    1     −1
   1    1     −1

Lecture 11 – p. 14/24

SLIDE 22

Learning weights and threshold I

[Diagram and truth table as on the previous slide.]

We have:

  • D = (d̄1, d̄2, d̄3, d̄4) input vectors (cases).
  • t̄ = (−1, 1, −1, −1) vector of target outputs.
  • w̄ = (w0, w1, w2) vector of current parameters.
  • ō = (o1, o2, o3, o4) vector of current outputs.

We request: w̄∗, parameters yielding ō = t̄.

Lecture 11 – p. 14/24

SLIDE 23

The perceptron training rule

[Diagram: perceptron with bias 1, inputs X1, X2, and weights w0, w1, w2.]

o = 1 if X̄ · w̄ > 0, and −1 otherwise.

Error: E = t − o. The weights are updated as follows:

  • E > 0 ⇒ o shall be increased ⇒ X̄ · w̄ up ⇒ w̄ := w̄ + αE·X̄
  • E < 0 ⇒ o shall be decreased ⇒ X̄ · w̄ down ⇒ w̄ := w̄ + αE·X̄

α is called the learning rate. With the training examples linearly separable and α not too large, this procedure will converge to a correct set of parameters.
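In code the rule is a few lines (a sketch: each case vector carries the constant bias component 1 in front, and α = 0.25 is an illustrative choice matching the worked example on the following slides):

```python
def train_perceptron(cases, targets, alpha=0.25, max_epochs=100):
    # cases: input vectors, each beginning with the bias component 1.
    w = [0.0] * len(cases[0])
    for _ in range(max_epochs):
        converged = True
        for x, t in zip(cases, targets):
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            e = t - o                                              # E = t - o
            if e != 0:
                converged = False
                w = [wi + alpha * e * xi for wi, xi in zip(w, x)]  # w := w + alpha*E*X
        if converged:
            return w
    return w

# Learn X1 AND X2 (linearly separable, so the rule converges).
cases = [(1, -1, -1), (1, 1, -1), (1, -1, 1), (1, 1, 1)]
targets = [-1, -1, -1, 1]
w = train_perceptron(cases, targets)
print(w)
```

After convergence the learned weights classify all four cases correctly.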

Lecture 11 – p. 15/24

SLIDE 24

Example: X1 ∧ X2

w0: 0   w1: 0   w2: 0

[Diagram: perceptron with bias 1 and inputs X1, X2.]

  X1   X2   X1 ∧ X2
  −1   −1     −1
   1   −1     −1
  −1    1     −1
   1    1      1

o = 1 if w1·X1 + w2·X2 + w0 > 0, and −1 otherwise.

α = 1/4

Cases: (1, −1, −1)  (1, 1, −1)  (1, −1, 1)  (1, 1, 1)
t̄:        −1          −1          −1          1
ō:
E:

Lecture 11 – p. 16/24

SLIDE 25

Example: X1 ∧ X2

w0: 0   w1: 0   w2: 0

[Diagram and truth table as on the previous slide.]

o = 1 if w1·X1 + w2·X2 + w0 > 0, and −1 otherwise.

α = 1/4

Cases: (1, −1, −1)  (1, 1, −1)  (1, −1, 1)  (1, 1, 1)
t̄:        −1          −1          −1          1
ō:        −1          −1          −1         −1
E:         0           0           0          2

w̄ := (0, 0, 0) + (1/4) · 2 · (1, 1, 1) = (1/2, 1/2, 1/2)

Lecture 11 – p. 16/24

SLIDE 26

Example: X1 ∧ X2

w0: 0 / 1/2   w1: 0 / 1/2   w2: 0 / 1/2

[Diagram and truth table as before.]

o = 1 if w1·X1 + w2·X2 + w0 > 0, and −1 otherwise.

α = 1/4

Cases: (1, −1, −1)  (1, 1, −1)  (1, −1, 1)  (1, 1, 1)
t̄:        −1          −1          −1          1
ō:        −1           1           1          1
E:         0          −2          −2          0

w̄ := (1/2, 1/2, 1/2) + (1/4) · (−2) · (1, 1, −1) = (0, 0, 1)

Lecture 11 – p. 16/24

SLIDE 27

Example: X1 ∧ X2

w0: 0 / 1/2 / 0   w1: 0 / 1/2 / 0   w2: 0 / 1/2 / 1

[Diagram and truth table as before.]

o = 1 if w1·X1 + w2·X2 + w0 > 0, and −1 otherwise.

α = 1/4

Cases: (1, −1, −1)  (1, 1, −1)  (1, −1, 1)  (1, 1, 1)
t̄:        −1          −1          −1          1
ō:        −1          −1           1          1
E:         0           0          −2          0

ETC. — the procedure terminates after a finite number of steps.
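The first two weight updates on these slides can be reproduced mechanically (a sketch of the incremental procedure, with the cases and α = 1/4 from the example):

```python
cases = [(1, -1, -1), (1, 1, -1), (1, -1, 1), (1, 1, 1)]
targets = [-1, -1, -1, 1]
alpha = 0.25
w = [0.0, 0.0, 0.0]
history = []

for epoch in range(5):
    for x, t in zip(cases, targets):
        o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
        e = t - o
        if e != 0:
            w = [wi + alpha * e * xi for wi, xi in zip(w, x)]   # w := w + alpha*E*X
            history.append(list(w))

print(history[0])   # first update:  [0.5, 0.5, 0.5]
print(history[1])   # second update: [0.0, 0.0, 1.0]
```

The first two entries of the history match the weight sequence 0 → 1/2 → … shown in the slide headers.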

Lecture 11 – p. 16/24

SLIDE 28

Gradient descent I

The perceptron training rule finds a good weight vector when the training examples are linearly separable. If this is not the case, we can apply gradient descent to ensure convergence. To understand this, consider the simpler linear unit, where

o = w0 + w1 · X1 + · · · + wn · Xn

Define the error function as: Ed(w̄) = ½ (td − od)². So we seek to learn the weights that minimize the squared error.

Lecture 11 – p. 17/24

SLIDE 29

Gradient descent II

We use the gradient to find the direction in which to change the weights:

∇Ed(w̄) = (∂Ed/∂w0, . . . , ∂Ed/∂wn)

This specifies the direction of steepest increase in Ed. Hence, our new training rule becomes: wi := wi + Δwi, where

Δwi = −η · ∂Ed/∂wi   and   ∂Ed/∂wi = (td − od) · (−Xid)

[Plot: the error surface E[w̄] over (w0, w1), a bowl whose minimum gradient descent approaches.]
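A sketch of gradient descent for a single linear unit with the resulting rule Δwi = η(td − od)·Xid (the data, the learning rate η = 0.05, and the epoch count are illustrative assumptions):

```python
def train_linear_unit(data, eta=0.05, epochs=200):
    # data: list of (input vector with leading bias 1, target); o = w . x.
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            # Step against the gradient: w_i := w_i + eta * (t - o) * x_i
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Recover the linear target t = 1 + 2*x from five noise-free samples.
data = [((1, x), 1 + 2 * x) for x in (-2, -1, 0, 1, 2)]
w = train_linear_unit(data)
print(w)   # close to [1.0, 2.0]
```

Because the error surface of a linear unit is a single bowl, the updates converge for any sufficiently small η, separable data or not.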

Lecture 11 – p. 18/24

SLIDE 30

Learning of two layer feed-forward network

[Diagram: the two-layer network with inputs X1, X2, bias inputs 1, hidden units H1, H2, and output O.]

What type of unit/activation function should we use?

  • It should be non-linear.
  • It should be differentiable.

One solution is the sigmoid function:

Sigmoid function: σ(x) = 1/(1 + exp(−β·x))

Lecture 11 – p. 19/24

SLIDE 31

Back propagation I

Training examples provide target values only for the network outputs, so no target values are directly available for indicating the error of the hidden units. Idea: Instead, calculate an error term δh for a hidden unit by taking the weighted sum of the error terms δk of the output units it influences.

Lecture 11 – p. 20/24

SLIDE 32

Back propagation I

Training examples provide target values only for the network outputs, so no target values are directly available for indicating the error of the hidden units. Idea: Instead, calculate an error term δh for a hidden unit by taking the weighted sum of the error terms δk of the output units it influences.

[Diagram: network X1, X2 → H1, H2 → O, with weights w11, w12, w21, w22 into the hidden layer and w1, w2 into the output unit; E = ½ (o − t)². Error terms: δO at the output, δH1(δO) and δH2(δO) at the hidden units.]

That is, we compute an error term for the output unit and propagate it backwards in the network. The error term plays the role of (t − o) in the delta rule.

Lecture 11 – p. 20/24

SLIDE 33

Back propagation II

For an output unit:

∂E/∂w1 = 2 · ½ (o − t) · ∂o/∂w1 = (o − t) · ∂o/∂w1

Since o = 1/(1 + exp(−w̄ · x̄)) is the sigmoid function, ∂o/∂w1 is easily available:

∂o/∂w1 = o(1 − o) · X1O

Hence, the rule for modification is:

w1 := w1 + η · X1O · (t − o)o(1 − o) = w1 + η · X1O · δO

where (t − o)o(1 − o) is the error term δO of the output unit. The error term δO is now used for modifying the weights of the hidden units.

Lecture 11 – p. 21/24

SLIDE 34

Back propagation III

For a hidden unit: Since we don’t have direct access to a target output, we use the weighted error term of the output unit when modifying the weights.

The error term for the hidden unit H1 is then: δH1 := oH1(1 − oH1) · w1 · δO

Hence, the rule for modification is: w11 := w11 + η · X1 · δH1
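Putting the two rules together gives the full backpropagation loop for the 2-2-1 network. This is a hedged sketch: sigmoid units with β = 1, targets rescaled to {0, 1} so they fall inside the sigmoid's range, random initial weights, and η = 0.5 are illustrative assumptions, not values from the lecture.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x1, x2):
    # Each weight vector is (bias, weight from first input, weight from second input).
    h1 = sigmoid(w["h1"][0] + w["h1"][1] * x1 + w["h1"][2] * x2)
    h2 = sigmoid(w["h2"][0] + w["h2"][1] * x1 + w["h2"][2] * x2)
    o = sigmoid(w["o"][0] + w["o"][1] * h1 + w["o"][2] * h2)
    return h1, h2, o

def train_xor(eta=0.5, epochs=2000, seed=1):
    rnd = random.Random(seed)
    w = {k: [rnd.uniform(-1, 1) for _ in range(3)] for k in ("h1", "h2", "o")}
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for _ in range(epochs):
        for (x1, x2), t in data:
            h1, h2, o = forward(w, x1, x2)
            delta_o = (t - o) * o * (1 - o)                  # output error term
            delta_h1 = h1 * (1 - h1) * w["o"][1] * delta_o   # back-propagated ...
            delta_h2 = h2 * (1 - h2) * w["o"][2] * delta_o   # ... error terms
            for i, inp in enumerate((1, h1, h2)):
                w["o"][i] += eta * inp * delta_o             # w := w + eta * input * delta
            for i, inp in enumerate((1, x1, x2)):
                w["h1"][i] += eta * inp * delta_h1
                w["h2"][i] += eta * inp * delta_h2
    return w

def total_error(w):
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    return sum((t - forward(w, x1, x2)[2]) ** 2 for (x1, x2), t in data)

print(total_error(train_xor(epochs=0)), "->", total_error(train_xor()))
```

With enough epochs the error typically approaches zero and the outputs round to the XOR targets, though gradient descent on this surface can occasionally stall in a local minimum.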

Lecture 11 – p. 22/24

SLIDE 35

Two ways of training

The situation:

  • We have a set of training cases {t̄1, . . . , t̄n}.
  • The network is trained by running the training set through the modification algorithm several times.

There are two ways of doing this:

  • Incremental training – modify the weights after each t̄i.
  • Cumulated training – accumulate the error over the full set and then modify.

In practice, method one is usually preferred.

Lecture 11 – p. 23/24

SLIDE 36

Summary: Characteristics of neural networks

An artificial neural network (ANN) can be characterized as follows:

  • It is a network of perceptrons, and each network has a predefined input layer and an output layer.
  • An ANN computes a function: I → O.
  • The weights and thresholds are the parameters of the ANN.
  • Through prescribed input-output behavior the parameters are learned.
  • The learning algorithm uses gradient descent.
  • ANNs are particularly good at classification in domains with little known structure.

Lecture 11 – p. 24/24