Decision support systems and machine learning
Lecture 11
Neural networks: Biological and artificial

Consider humans:
• Neuron switching time ≈ 0.001 second
• Number of neurons ≈ 10^10
• Connections per neuron ≈ 10^4 - 10^5
• Scene recognition time ≈ 0.1 second
• 100 inference steps doesn't seem like enough ⇒ much parallel computation

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
Model of biological neurons

Perceptron: inputs x1, ..., xn are multiplied by weights w1, ..., wn, summed together with a bias weight w0, and passed through an activation function g:

  a = g( sum_{i=1}^{n} wi · xi + w0 )

• The inputs are combined linearly: w1·x1 + ... + wn·xn = w · x (vector notation).
• The output is non-linear. We have different activation functions g: the step function step(x), the sign function sign(x), and the sigmoid function σ(x) = 1 / (1 + exp(−β·x)).
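A minimal sketch of such a unit in code (the function and variable names are my own, not from the slides); w[0] plays the role of the bias weight w0:

import math

def step(x):
    # Step activation: 1 if the weighted sum exceeds 0, else 0.
    return 1.0 if x > 0 else 0.0

def sign(x):
    # Sign activation: +1 if the weighted sum exceeds 0, else -1.
    return 1.0 if x > 0 else -1.0

def sigmoid(x, beta=1.0):
    # Sigmoid activation: sigma(x) = 1 / (1 + exp(-beta * x)).
    return 1.0 / (1.0 + math.exp(-beta * x))

def unit_output(x, w, g):
    # a = g(w_0 + sum_i w_i * x_i); w[0] is the bias weight w_0.
    return g(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

# A two-input unit with the sign activation and weights (w0, w1, w2) = (-1, 1, 1):
print(unit_output([1.0, -1.0], [-1.0, 1.0, 1.0], sign))   # prints -1.0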
When to consider neural networks

Application areas are usually characterized by:
• Input is high-dimensional, discrete or real-valued (e.g. raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data (e.g. recognition of hand-written digits)
• Form of target function is unknown
• Human readability of the result is unimportant
• Long training times are acceptable

Examples:
• Speech phoneme recognition
• Image classification
• Financial prediction
NET-talk

Pronunciation of letters is very context dependent, e.g.:
• bite and bit
• prefer and preference

Speech is decomposed into 60 phonemes (a "sound alphabet"), which can be encoded into 26 independent output units.

(Figure: a window of input text is fed to the network, through 80 hidden neurons, to the 26 output units.)

The network was trained on 1024 words:
• 95% correct on the training set.
• 78% correct on a test set (a "story").
ALVINN drives 70 mph on highways

(Figure: weight values for a hidden unit encouraging a turn to the left.)
Perceptron examples

(Figure: a perceptron with bias weight w0 and weights w1, w2 on inputs x1, x2, followed by a Sign unit.)

The decision surface of a two-input perceptron

  a(x1, x2) = sign(x1·w1 + x2·w2 + w0)

is given by a straight line, separating positive and negative examples.

(Figure: the (x1, x2)-plane with + examples on one side of the line and − examples on the other.)
Perceptron examples

X1 ∧ X2:

  X1   X2   X1 ∧ X2
  −1   −1     −1
  −1    1     −1
   1   −1     −1
   1    1      1

A perceptron with weights w0 = −1, w1 = 1, w2 = 1 outputs 1 when x1·w1 + x2·w2 + w0 > 0, so it computes X1 ∧ X2 and specifies the decision surface:

  −1 + 1·x1 + 1·x2 = 0
Perceptron examples

X1 ∨ X2:

  X1   X2   X1 ∨ X2
  −1   −1     −1
  −1    1      1
   1   −1      1
   1    1      1

A perceptron with weights w0 = 1, w1 = 1, w2 = 1 computes X1 ∨ X2 and specifies the decision surface:

  1 + 1·x1 + 1·x2 = 0
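A quick check of these two perceptrons with the weights above and the ±1 input encoding (the helper below is illustrative, not part of the slides):

def perceptron(x1, x2, w0, w1, w2):
    # Output 1 if x1*w1 + x2*w2 + w0 > 0, and -1 otherwise.
    return 1 if x1 * w1 + x2 * w2 + w0 > 0 else -1

inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# X1 AND X2 with (w0, w1, w2) = (-1, 1, 1):
print([perceptron(x1, x2, -1, 1, 1) for x1, x2 in inputs])   # [-1, -1, -1, 1]

# X1 OR X2 with (w0, w1, w2) = (1, 1, 1):
print([perceptron(x1, x2, 1, 1, 1) for x1, x2 in inputs])    # [-1, 1, 1, 1]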
Perceptron examples

X1 xor X2:

  X1   X2   X1 xor X2
  −1   −1      −1
  −1    1       1
   1   −1       1
   1    1      −1

We cannot specify any values for the weights so that a perceptron testing x1·w1 + x2·w2 + w0 > 0 represents the "xor" function: the examples are not linearly separable.
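To see why, suppose some weights w0, w1, w2 made sign(x1·w1 + x2·w2 + w0) compute xor. The four rows above would require:

  w0 − w1 − w2 ≤ 0   (input (−1, −1), output −1)
  w0 − w1 + w2 > 0   (input (−1, 1), output 1)
  w0 + w1 − w2 > 0   (input (1, −1), output 1)
  w0 + w1 + w2 ≤ 0   (input (1, 1), output −1)

Adding the two strict inequalities gives 2·w0 > 0, while adding the other two gives 2·w0 ≤ 0, a contradiction. Hence no choice of weights works.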
Linearly separable

(Figure: the line x1·w1 + x2·w2 = 0 divides the (X1, X2)-plane; the points with x1·w1 + x2·w2 > 0 lie on one side, the rest on the other.)

Only linearly separable functions can be computed by a single perceptron.
Two-layer (feed-forward) network

(Figure: input units X1, X2, ..., Xn, hidden units H1, ..., Hk, and output units O1, ..., Om.)

There are connections only to the layer beneath: the inputs feed the hidden units, and the hidden units feed the output units.
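As a sketch, the forward computation in such a network can be written as follows (sign units and a leading bias weight w[0] per unit are my own assumptions; the slide does not fix the activation function):

def sign(s):
    # Threshold unit output: +1 or -1.
    return 1 if s > 0 else -1

def unit(x, w, g=sign):
    # One unit: g(w_0 + sum_i w_i * x_i), with w[0] as the bias weight.
    return g(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def two_layer(x, hidden_weights, output_weights):
    # Connections go only to the layer beneath: the inputs feed every
    # hidden unit, and the hidden outputs feed every output unit.
    h = [unit(x, w) for w in hidden_weights]
    return [unit(h, w) for w in output_weights]

# A 3-2-1 network with arbitrary illustrative weights:
print(two_layer([1, -1, 1], [[0, 1, -1, 1], [1, -1, 1, -1]], [[0, 1, 1]]))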
XOR again

X1 xor X2 = (X1 ∧ ¬X2) ∨ (¬X1 ∧ X2), so a two-layer network can compute it:
• Hidden unit H1 computes X1 ∧ ¬X2.
• Hidden unit H2 computes ¬X1 ∧ X2.
• The output unit O computes the disjunction (X1 ∧ ¬X2) ∨ (¬X1 ∧ X2) = X1 xor X2.

(Figure: X1 and X2 feed H1 and H2 with weights ±1; H1 and H2 feed O.)
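A small self-contained check of this construction with sign units and ±1 inputs. The weight values below (bias first in each weight vector) are one consistent choice, not necessarily the exact numbers drawn on the slide:

def sign_unit(x, w):
    # w[0] is the bias weight; output +1 if the weighted sum exceeds 0, else -1.
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1

def xor_net(x1, x2):
    h1 = sign_unit([x1, x2], [-1, 1, -1])    # H1: X1 AND (NOT X2)
    h2 = sign_unit([x1, x2], [-1, -1, 1])    # H2: (NOT X1) AND X2
    return sign_unit([h1, h2], [1, 1, 1])    # O:  H1 OR H2

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))         # outputs -1, 1, 1, -1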
Classification tasks

Assume that we have:
• Attributes: P1, ..., Pn (which we can assume are binary)
• Classes: C1, ..., Cm
• Examples: D = { "case = (P'1, ..., P'n)" is a Cj }

Classification using artificial neural networks: the attribute values P1, ..., Pn are fed to the input units, pass through hidden units H1, ..., Hk, and the output units O1, ..., Ol determine the class.
Example: Boolean functions

(Table: all 16 combinations of X1, X2, X3, X4 ∈ {−1, 1}, each with a target value T; the network has the four inputs X1, ..., X4 and a single output for T.)

We want a network with inputs X1, ..., X4 that outputs T for every row of the table. This can always be done.
Example: Disjunctive normal form

Consider the Boolean expression: (X1 ∧ ¬X2) ∨ (X2 ∧ X3 ∧ ¬X4).

A two-layer network computes it directly:
• Hidden unit H1 computes the conjunct X1 ∧ ¬X2 (weights 1 and −1 on X1 and X2).
• Hidden unit H2 computes the conjunct X2 ∧ X3 ∧ ¬X4 (weights 1, 1 and −1 on X2, X3 and X4, bias −2).
• The output unit O computes the disjunction H1 ∨ H2 (weights 1 and 1).
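The same construction works for any expression in disjunctive normal form. The sketch below builds the two-layer network directly from the conjuncts; the ±1 encoding and the particular bias values are my own choices, consistent with the example above:

def sign_unit(x, w):
    # w[0] is the bias weight; output +1 if the weighted sum exceeds 0, else -1.
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1

def conjunct_weights(literals, n):
    # AND of k literals over n inputs in {-1, 1}: weight +1 for a positive
    # literal, -1 for a negated one, bias -(k - 1).  The weighted sum is
    # positive exactly when every literal is satisfied.
    w = [-(len(literals) - 1)] + [0] * n
    for index, polarity in literals:
        w[index + 1] = polarity
    return w

def or_weights(m):
    # OR of m hidden outputs in {-1, 1}: weights +1 and bias m - 1, so the
    # sum stays non-positive only when every hidden unit outputs -1.
    return [m - 1] + [1] * m

def dnf_net(x, conjuncts):
    # One hidden unit per conjunct, one output unit computing their disjunction.
    h = [sign_unit(x, conjunct_weights(c, len(x))) for c in conjuncts]
    return sign_unit(h, or_weights(len(conjuncts)))

# (X1 AND NOT X2) OR (X2 AND X3 AND NOT X4); literals are (input index, polarity):
conjuncts = [[(0, 1), (1, -1)], [(1, 1), (2, 1), (3, -1)]]
print(dnf_net([1, -1, -1, 1], conjuncts))    # 1  (the first conjunct is satisfied)
print(dnf_net([-1, -1, -1, 1], conjuncts))   # -1 (neither conjunct is satisfied)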
Learning weights and threshold I

Target concept: X1 ∧ ¬X2, which is true only for (X1, X2) = (1, −1).

(Figure: a perceptron with bias weight w0 and weights w1, w2 on inputs X1, X2.)

We have:
• D = (d1, d2, d3, d4): the input vectors (cases).
• t = (−1, 1, −1, −1): the vector of target outputs.
• w = (w0, w1, w2): the vector of current parameters.
• o = (o1, o2, o3, o4): the vector of current outputs.

We request: parameters w* yielding o = t.
The perceptron training rule

(Figure: a perceptron with bias weight w0 and weights w1, w2 on inputs X1, X2.)

  o = 1 if X · w > 0, and o = −1 otherwise   (X is the input vector, w the weight vector)

Error: E = t − o

The weights are updated as follows:
• E > 0 ⇒ o shall be increased ⇒ X · w up ⇒ w := w + αE·X
• E < 0 ⇒ o shall be decreased ⇒ X · w down ⇒ w := w + αE·X

α is called the learning rate. If the training examples are linearly separable and α is not too large, this procedure will converge to a correct set of parameters.
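A minimal implementation of the rule (my own function names; each case is assumed to carry a leading 1 so that w0 is just another weight):

def predict(x, w):
    # o = 1 if x . w > 0, and -1 otherwise (x[0] is the constant bias input 1).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def perceptron_training(cases, targets, alpha=0.25, max_epochs=100):
    w = [0.0] * len(cases[0])
    for _ in range(max_epochs):
        converged = True
        for x, t in zip(cases, targets):
            e = t - predict(x, w)                 # error E = t - o
            if e != 0:
                converged = False
                # w := w + alpha * E * x, pushing x . w up or down as needed
                w = [wi + alpha * e * xi for wi, xi in zip(w, x)]
        if converged:
            return w
    return w

# Learning X1 AND (NOT X2); each case starts with the bias input 1:
cases = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
targets = [-1, -1, 1, -1]
w = perceptron_training(cases, targets)
print(w, [predict(x, w) for x in cases])          # the outputs match the targets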
Example: X1 ∧ X2

  o = 1 if w1·X1 + w2·X2 + w0 > 0, and o = −1 otherwise

We start with w0 = w1 = w2 = 0 and learning rate α = 1/4. Each case is written as (1, X1, X2), with the leading 1 as the bias input.

Cases:  (1, −1, −1)   (1, 1, −1)   (1, −1, 1)   (1, 1, 1)
t:          −1            −1           −1            1

First pass, w = (0, 0, 0):
o:          −1            −1           −1           −1
E:           0             0            0            2
Update on the last case: w := (0, 0, 0) + 1/4 · 2 · (1, 1, 1) = (1/2, 1/2, 1/2)

Second pass, w = (1/2, 1/2, 1/2):
o:          −1             1            1            1
E:           0            −2           −2            0
Update on the second case: w := (1/2, 1/2, 1/2) + 1/4 · (−2) · (1, 1, −1) = (0, 0, 1)

Third pass, w = (0, 0, 1):
o:          −1            −1            1            1
E:           0             0           −2            0
Etc. ... the procedure terminates after a finite number of steps.
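The same iteration in code. I assume the schedule used above of recomputing all outputs and then updating on the first misclassified case in each pass; other schedules also converge, just through different intermediate weights:

def predict(x, w):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

cases = [(1, -1, -1), (1, 1, -1), (1, -1, 1), (1, 1, 1)]   # leading 1 = bias input
targets = [-1, -1, -1, 1]                                   # X1 AND X2
alpha, w = 0.25, [0.0, 0.0, 0.0]

while True:
    errors = [t - predict(x, w) for x, t in zip(cases, targets)]
    print(w, errors)      # weights (0,0,0), then (1/2,1/2,1/2), then (0,0,1), ...
    if not any(errors):
        break
    i = next(i for i, e in enumerate(errors) if e != 0)                # first misclassified case
    w = [wi + alpha * errors[i] * xi for wi, xi in zip(w, cases[i])]   # w := w + alpha * E * x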
Gradient descent I

The perceptron training rule finds a good weight vector when the training examples are linearly separable. If this is not the case we can apply gradient descent to ensure convergence.

To understand it, consider the simpler linear units, where

  o = w0 + w1·X1 + ... + wn·Xn

Define the error function for training example d as:

  E_d(w) = 1/2 · (t_d − o_d)^2

So we seek to learn the weights that minimize the squared error.
Gradient descent II

We use the gradient to find the direction in which to change the weights:

  ∇E_d(w) = ( ∂E_d/∂w0, ..., ∂E_d/∂wn )

This specifies the direction of steepest increase in E.

(Plot: the error surface E[w] as a function of the weights w0 and w1.)

Hence, our new training rule becomes:

  w_i := w_i + Δw_i,  where  Δw_i = −η · ∂E_d/∂w_i

and

  ∂E_d/∂w_i = (t_d − o_d) · (−X_id)
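A sketch of this rule for a single linear unit, updating after each training example (the data below is made up purely for illustration):

def linear_output(x, w):
    # o = w_0 + w_1*X_1 + ... + w_n*X_n (x[0] is the constant bias input 1).
    return sum(wi * xi for wi, xi in zip(w, x))

def gradient_step(x, t, w, eta=0.05):
    # dE_d/dw_i = (t_d - o_d) * (-X_id) and delta w_i = -eta * dE_d/dw_i,
    # so every weight moves by eta * (t_d - o_d) * X_id.
    o = linear_output(x, w)
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

# An XOR-like target that no single threshold unit fits exactly; gradient
# descent still moves the weights towards the minimum of the squared error:
data = [((1, -1, -1), -1.0), ((1, 1, -1), 0.9), ((1, -1, 1), 1.1), ((1, 1, 1), -1.0)]
w = [0.0, 0.0, 0.0]
for _ in range(500):
    for x, t in data:
        w = gradient_step(x, t, w)
print(w)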
Learning of two-layer feed-forward networks

(Figure: the two-layer XOR network with inputs X1, X2, hidden units H1, H2 and output O.)

What type of unit/activation function should we use?
• It should be non-linear.
• It should be differentiable.

One solution is the sigmoid function:

  σ(x) = 1 / (1 + exp(−β·x))
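A small sketch of the sigmoid and its derivative; the simple form of the derivative, β·σ(x)·(1 − σ(x)), is part of what makes gradient-based training of multi-layer networks convenient (β and the names below are illustrative):

import math

def sigmoid(x, beta=1.0):
    # sigma(x) = 1 / (1 + exp(-beta * x)): non-linear, smooth, differentiable.
    return 1.0 / (1.0 + math.exp(-beta * x))

def sigmoid_derivative(x, beta=1.0):
    # d sigma / dx = beta * sigma(x) * (1 - sigma(x)); it can be computed
    # from the unit's own output, which is convenient during training.
    s = sigmoid(x, beta)
    return beta * s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))    # 0.5 and 0.25 for beta = 1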