Artificial Neural Networks

[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]

• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face Recognition
• Advanced topics

Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997
Connectionist Models

Consider humans:
• Neuron switching time ~ 0.001 second
• Number of neurons ~ 10^{10}
• Connections per neuron ~ 10^{4-5}
• Scene recognition time ~ 0.1 second
• 100 inference steps doesn't seem like enough
→ much parallel computation

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
When to Consider Neural Networks

• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant

Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways

[Figure: ALVINN network architecture: a 30x32 sensor input retina feeding 4 hidden units, which feed 30 output units spanning steering directions from sharp left through straight ahead to sharp right]
Perceptron

o(x_1, \ldots, x_n) =
  \begin{cases}
     1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\
    -1 & \text{otherwise}
  \end{cases}

Sometimes we'll use simpler vector notation:

o(\vec{x}) =
  \begin{cases}
     1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\
    -1 & \text{otherwise}
  \end{cases}

[Figure: a perceptron unit with inputs x_1, ..., x_n weighted by w_1, ..., w_n, a fixed input x_0 = 1 weighted by w_0, a summation node computing \sum_{i=0}^{n} w_i x_i, and a threshold producing o = 1 if the sum is > 0, -1 otherwise]
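A minimal sketch of this threshold unit in Python; the helper name and the use of NumPy are choices of this write-up, not part of the slides.

import numpy as np

def perceptron_output(w, x):
    """Threshold unit: return 1 if w . x > 0, else -1.

    w includes the bias weight w_0; x is given without the constant
    input x_0 = 1, which is prepended here.
    """
    x_aug = np.concatenate(([1.0], x))        # add x_0 = 1 for the bias term
    return 1 if np.dot(w, x_aug) > 0 else -1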
Decision Surface of a Perceptron

Represents some useful functions
• What weights represent g(x_1, x_2) = AND(x_1, x_2)?

But some functions not representable
• e.g., not linearly separable
• Therefore, we'll want networks of these...

[Figure: (a) a set of + and - points in the (x_1, x_2) plane that a single line can separate; (b) an XOR-like arrangement of + and - points that no single line can separate]
Perceptron training rule

w_i \leftarrow w_i + \Delta w_i

where

\Delta w_i = \eta (t - o) x_i

Where:
• t = c(\vec{x}) is target value
• o is perceptron output
• \eta is small constant (e.g., 0.1) called learning rate
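As a sketch, one application of this rule in Python, reusing the perceptron_output helper above (the function name and default learning rate are assumptions of this write-up):

def perceptron_train_step(w, x, t, eta=0.1):
    """One perceptron-rule update: w_i <- w_i + eta * (t - o) * x_i."""
    o = perceptron_output(w, x)
    x_aug = np.concatenate(([1.0], x))   # include x_0 = 1 so the bias weight is updated too
    return w + eta * (t - o) * x_aug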
Perceptron training rule

Can prove it will converge
• If training data is linearly separable
• and \eta sufficiently small
Gradient Descent

To understand, consider simpler linear unit, where

o = w_0 + w_1 x_1 + \cdots + w_n x_n

Let's learn the w_i's that minimize the squared error

E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

where D is the set of training examples
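A short sketch of this linear unit and its squared error in NumPy; the array layout (one example per row of X) and helper names are assumptions of this write-up:

def linear_output(w, X):
    """Linear unit: o = w_0 + w_1 x_1 + ... + w_n x_n for each row of X."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1
    return X_aug @ w

def squared_error(w, X, t):
    """E[w] = 1/2 * sum_{d in D} (t_d - o_d)^2."""
    return 0.5 * np.sum((t - linear_output(w, X)) ** 2)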
Gradient Descent

Gradient:

\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \nabla E[\vec{w}]

i.e.,

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}

[Figure: the error surface E[w] plotted over the weight space (w_0, w_1), a bowl-shaped surface with a single global minimum; gradient descent moves downhill along it]
Gradient Descent

\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2
  = \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
  = \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
  = \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})
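The final line of the derivation translates directly into NumPy; the finite-difference check below is an addition of this write-up, useful for verifying the algebra on a small example:

def error_gradient(w, X, t):
    """dE/dw_i = sum_d (t_d - o_d) * (-x_{i,d}), including the bias component."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    return -(t - X_aug @ w) @ X_aug

def numerical_gradient(w, X, t, eps=1e-6):
    """Finite-difference approximation of the same gradient, for checking."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        g[i] = (squared_error(w + dw, X, t) - squared_error(w - dw, X, t)) / (2 * eps)
    return g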
Gradient Descent

GRADIENT-DESCENT(training_examples, \eta)

Each training example is a pair of the form \langle \vec{x}, t \rangle, where \vec{x} is the vector of input values and t is the target output value. \eta is the learning rate (e.g., 0.05).

• Initialize each w_i to some small random value
• Until the termination condition is met, Do
  - Initialize each \Delta w_i to zero.
  - For each \langle \vec{x}, t \rangle in training_examples, Do
    * Input the instance \vec{x} to the unit and compute the output o
    * For each linear unit weight w_i, Do
        \Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i
  - For each linear unit weight w_i, Do
        w_i \leftarrow w_i + \Delta w_i
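A minimal runnable version of this batch procedure, reusing the helpers above; the fixed iteration budget used as the termination condition and the initialization scale are assumptions of this write-up:

def gradient_descent(X, t, eta=0.05, n_iters=1000, seed=0):
    """Batch gradient descent for a linear unit, following the slide's pseudocode."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.05, size=X.shape[1] + 1)    # initialize each w_i to a small random value
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1
    for _ in range(n_iters):                           # termination condition: fixed iteration budget
        delta_w = np.zeros_like(w)                     # initialize each Delta w_i to zero
        for x_d, t_d in zip(X_aug, t):
            o = np.dot(w, x_d)                         # compute the output o
            delta_w += eta * (t_d - o) * x_d           # Delta w_i <- Delta w_i + eta (t - o) x_i
        w += delta_w                                   # w_i <- w_i + Delta w_i
    return w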
Summary

Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate \eta

Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate \eta
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied
1. Compute the gradient \nabla E_D[\vec{w}]
2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]

Incremental mode Gradient Descent:
Do until satisfied
• For each training example d in D
  1. Compute the gradient \nabla E_d[\vec{w}]
  2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]

E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if \eta is made small enough
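For contrast with the batch version sketched earlier, an incremental (stochastic) variant that updates the weights after every example; the epoch-count stopping rule and helper names are assumptions of this write-up:

def incremental_gradient_descent(X, t, eta=0.05, n_epochs=100, seed=0):
    """Incremental (stochastic) gradient descent: one weight update per example."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.05, size=X.shape[1] + 1)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    for _ in range(n_epochs):
        for x_d, t_d in zip(X_aug, t):
            o = np.dot(w, x_d)
            w += eta * (t_d - o) * x_d    # update immediately, using the single-example error E_d
    return w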
Multilayer Networks of Sigmoid Units

[Figure: an example multilayer sigmoid network with two inputs F1 and F2 (speech formant features) and output classes for the vowel sounds in "head", "hid", "who'd", "hood", together with the nonlinear decision regions it learns]
Sigmoid Unit

\sigma(x) is the sigmoid function:

\sigma(x) = \frac{1}{1 + e^{-x}}

Nice property: \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))

We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation

[Figure: a sigmoid unit with inputs x_1, ..., x_n weighted by w_1, ..., w_n and a fixed input x_0 = 1 weighted by w_0, computing net = \sum_{i=0}^{n} w_i x_i and output o = \sigma(net) = \frac{1}{1 + e^{-net}}]
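A short sketch of the sigmoid and the derivative property the slide highlights (function names are choices of this write-up):

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Nice property: d sigma(x) / dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)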
Error Gradient for a Sigmoid Unit

\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
  = \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
  = \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
  = \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)
  = -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)

\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}
Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, Do
• For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k
       \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)
  3. For each hidden unit h
       \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k
  4. Update each network weight w_{i,j}
       w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}
     where
       \Delta w_{i,j} = \eta \, \delta_j \, x_{i,j}
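A compact sketch of this procedure for one hidden layer of sigmoid units and sigmoid outputs, updating the weights after each example; the layer sizes, epoch-count termination test, and bias handling are assumptions of this write-up:

def backprop_train(X, T, n_hidden=3, eta=0.05, n_epochs=1000, seed=0):
    """Stochastic backpropagation for a two-layer network of sigmoid units."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # Small random initial weights; row 0 holds the bias weight (input x_0 = 1).
    W_hidden = rng.uniform(-0.05, 0.05, size=(n_in + 1, n_hidden))
    W_output = rng.uniform(-0.05, 0.05, size=(n_hidden + 1, n_out))
    for _ in range(n_epochs):                            # "until satisfied": fixed epoch budget
        for x, t in zip(X, T):
            # 1. Forward pass: compute hidden and output activations.
            x_aug = np.concatenate(([1.0], x))
            o_hidden = sigmoid(x_aug @ W_hidden)
            h_aug = np.concatenate(([1.0], o_hidden))
            o_out = sigmoid(h_aug @ W_output)
            # 2. Output-unit errors: delta_k = o_k (1 - o_k)(t_k - o_k).
            delta_out = o_out * (1 - o_out) * (t - o_out)
            # 3. Hidden-unit errors: delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k.
            delta_hidden = o_hidden * (1 - o_hidden) * (W_output[1:] @ delta_out)
            # 4. Weight updates: w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}.
            W_output += eta * np.outer(h_aug, delta_out)
            W_hidden += eta * np.outer(x_aug, delta_hidden)
    return W_hidden, W_output

For example, with X = np.array([[0,0],[0,1],[1,0],[1,1]]) and T = np.array([[0.],[1.],[1.],[0.]]) this network can learn XOR, the kind of non-linearly-separable function a single perceptron cannot represent, though it may need more epochs or a larger eta than the defaults assumed here.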