SLIDE 1: Artificial Neural Networks

[Read Ch. 4] [Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]

  • Threshold units
  • Gradient descent
  • Multilayer networks
  • Backpropagation
  • Hidden layer representations
  • Example: Face Recognition
  • Advanced topics

Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997.
slide-2
SLIDE 2 Connectionist Mo dels Consider h umans:
  • Neuron
switc hing time ~ :001 second
  • Num
b er
  • f
neurons ~ 10 10
  • Connections
p er neuron ~ 10 45
  • Scene
recognition time ~ :1 second
  • 100
inference steps do esn't seem lik e enough ! m uc h parallel computation Prop erties
  • f
articial neural nets (ANN's):
  • Man
y neuron-lik e threshold switc hing units
  • Man
y w eigh ted in terconnections among units
  • Highly
parallel, distributed pro cess
  • Emphasis
  • n
tuning w eigh ts automatically 75 lecture slides for textb
  • k
Machine L e arning, T. Mitc hell, McGra w Hill, 1997
SLIDE 3: When to Consider Neural Networks

  • Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
  • Output is discrete or real-valued
  • Output is a vector of values
  • Possibly noisy data
  • Form of target function is unknown
  • Human readability of result is unimportant

Examples:
  • Speech phoneme recognition [Waibel]
  • Image classification [Kanade, Baluja, Rowley]
  • Financial prediction
SLIDE 4: ALVINN

ALVINN drives 70 mph on highways.

[Figure: the ALVINN network. A 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units covering steering directions from Sharp Left through Straight Ahead to Sharp Right.]
SLIDE 5: Perceptron

[Figure: a perceptron. Inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a bias input x_0 = 1 with weight w_0, feed a summation unit computing net = Σ_{i=0}^{n} w_i x_i; a threshold unit then outputs 1 if net > 0 and -1 otherwise.]

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise.} \end{cases}$$

Sometimes we'll use simpler vector notation:

$$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise.} \end{cases}$$
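The threshold computation above translates directly into a few lines of Python. This is a sketch of ours (the function and variable names are not from the slides), with the usual convention that x[0] = 1 carries the bias weight w_0:

```python
def perceptron_output(w, x):
    """Threshold unit: returns 1 if w . x > 0, else -1.

    w and x are equal-length lists of numbers; by convention x[0] == 1,
    so that w[0] acts as the threshold weight w0.
    """
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else -1
```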
SLIDE 6: Decision Surface of a Perceptron

[Figure: (a) a linearly separable arrangement of + and − points in the (x1, x2) plane, split by a single line; (b) an arrangement (like XOR) that no single line can separate.]

Represents some useful functions
  • What weights represent g(x_1, x_2) = AND(x_1, x_2)? (One answer is sketched below.)

But some functions are not representable
  • e.g., not linearly separable
  • Therefore, we'll want networks of these...
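As a hedged answer to the slide's question about AND: one weight setting that works (assuming inputs in {0, 1} and the perceptron_output sketch above) is w_0 = -0.8, w_1 = w_2 = 0.5, since only x_1 = x_2 = 1 pushes the weighted sum above zero:

```python
# One weight choice of ours (not from the slides) realizing
# g(x1, x2) = AND(x1, x2) with inputs in {0, 1}, outputs in {1, -1}.
w_and = [-0.8, 0.5, 0.5]            # [w0, w1, w2]

for x1 in (0, 1):
    for x2 in (0, 1):
        x = [1, x1, x2]             # x[0] = 1 supplies the bias term
        print(x1, x2, perceptron_output(w_and, x))
# Prints 1 only for (1, 1), and -1 for the other three input pairs.
```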
SLIDE 7: Perceptron Training Rule

$$w_i \leftarrow w_i + \Delta w_i \quad \text{where} \quad \Delta w_i = \eta\,(t - o)\,x_i$$

Where:
  • $t = c(\vec{x})$ is the target value
  • $o$ is the perceptron output
  • $\eta$ is a small constant (e.g., 0.1) called the learning rate
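A minimal sketch of the rule as code, reusing the perceptron_output helper above (the pass structure and names are ours):

```python
def perceptron_train_step(w, examples, eta=0.1):
    """Apply the perceptron training rule once to each example.

    examples is a list of (x, t) pairs with x[0] == 1 and t in {1, -1};
    eta is the learning rate. Mutates and returns the weight list w.
    """
    for x, t in examples:
        o = perceptron_output(w, x)
        for i in range(len(w)):
            w[i] += eta * (t - o) * x[i]   # delta_w_i = eta * (t - o) * x_i
    return w
```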
SLIDE 8: Perceptron Training Rule (continued)

Can prove it will converge
  • If the training data is linearly separable
  • and $\eta$ is sufficiently small
SLIDE 9: Gradient Descent

To understand, consider a simpler linear unit, where
$$o = w_0 + w_1 x_1 + \cdots + w_n x_n$$

Let's learn $w_i$'s that minimize the squared error
$$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

where $D$ is the set of training examples.
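A direct transcription of the linear unit and its squared error into Python (helper names are ours, not Mitchell's):

```python
def linear_output(w, x):
    # o = w0 + w1*x1 + ... + wn*xn, with x[0] == 1 supplying the w0 term
    return sum(wi * xi for wi, xi in zip(w, x))

def squared_error(w, examples):
    # E[w] = 1/2 * sum over training examples d of (t_d - o_d)^2
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)
```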
SLIDE 10: Gradient Descent

[Figure: the error surface E[w] plotted as a bowl over the (w0, w1) plane; gradient descent steps move downhill toward the minimum.]

Gradient:
$$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$

Training rule:
$$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$$
i.e.,
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$
SLIDE 11: Gradient Descent (derivation)

$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2\,(t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} \left( t_d - \vec{w} \cdot \vec{x}_d \right) \\
\frac{\partial E}{\partial w_i} &= \sum_d (t_d - o_d)(-x_{i,d})
\end{aligned}$$
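One way to convince yourself of the final line is to compare it against a finite-difference estimate of E. This check is our addition, built on the linear_output and squared_error helpers defined after Slide 9:

```python
def analytic_gradient(w, examples):
    # dE/dw_i = sum_d (t_d - o_d) * (-x_{i,d}), from the derivation above
    grad = [0.0] * len(w)
    for x, t in examples:
        err = t - linear_output(w, x)
        for i in range(len(w)):
            grad[i] += err * (-x[i])
    return grad

def numeric_gradient(w, examples, h=1e-6):
    # Central differences: (E(w + h*e_i) - E(w - h*e_i)) / (2h)
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((squared_error(w_plus, examples) -
                     squared_error(w_minus, examples)) / (2 * h))
    return grad
```

For any small test set, the two routines should agree to several decimal places.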
SLIDE 12: Gradient Descent (algorithm)

Gradient-Descent(training_examples, η)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

  • Initialize each w_i to some small random value
  • Until the termination condition is met, Do
    - Initialize each Δw_i to zero.
    - For each ⟨x, t⟩ in training_examples, Do
      * Input the instance x to the unit and compute the output o
      * For each linear unit weight w_i, Do
          Δw_i ← Δw_i + η(t − o)x_i
    - For each linear unit weight w_i, Do
          w_i ← w_i + Δw_i
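A runnable Python sketch of the procedure above; the termination condition, here a fixed epoch count, is our simplification:

```python
import random

def gradient_descent(examples, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit.

    examples: list of (x, t) pairs with x[0] == 1.
    Returns the learned weight vector.
    """
    n = len(examples[0][0])
    # Initialize each w_i to some small random value.
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):                  # "until termination condition"
        delta_w = [0.0] * n                  # initialize each delta_w_i to zero
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                delta_w[i] += eta * (t - o) * x[i]
        for i in range(n):                   # apply the accumulated update
            w[i] += delta_w[i]
    return w
```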
SLIDE 13: Summary

Perceptron training rule guaranteed to succeed if
  • Training examples are linearly separable
  • Sufficiently small learning rate $\eta$

Linear unit training rule uses gradient descent
  • Guaranteed to converge to the hypothesis with minimum squared error
  • Given sufficiently small learning rate $\eta$
  • Even when training data contains noise
  • Even when training data not separable by $H$
SLIDE 14: Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied
  1. Compute the gradient $\nabla E_D[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$

Incremental mode Gradient Descent:
Do until satisfied
  • For each training example $d$ in $D$
    1. Compute the gradient $\nabla E_d[\vec{w}]$
    2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$

$$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad\qquad E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$$

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if $\eta$ is made small enough.
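In code, the incremental variant differs from the batch sketch after Slide 12 only in where the update is applied: inside the per-example loop instead of after it (same conventions and random import as before):

```python
def incremental_gradient_descent(examples, eta=0.05, epochs=100):
    """Stochastic gradient descent for a linear unit: update after
    every example, descending the single-example error E_d."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]   # immediate update, no batching
    return w
```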
SLIDE 15: Multilayer Networks of Sigmoid Units

[Figure: a multilayer network mapping two input features F1 and F2 to the vowel sounds in "head", "hid", "who'd", "hood", together with the highly nonlinear decision regions it induces.]
SLIDE 16: Sigmoid Unit

[Figure: a sigmoid unit. Inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a bias input x_0 = 1 with weight w_0, feed a summation unit computing net = Σ_{i=0}^{n} w_i x_i; the output is o = σ(net) = 1 / (1 + e^{−net}).]

$\sigma(x)$ is the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Nice property:
$$\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$$

We can derive gradient descent rules to train
  • One sigmoid unit
  • Multilayer networks of sigmoid units → Backpropagation
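The sigmoid and its "nice property" in Python; note that the derivative needs only the unit's output, which is exactly what backpropagation exploits:

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime_from_output(o):
    # d sigma(x)/dx = sigma(x) * (1 - sigma(x)); takes o = sigma(x) directly,
    # so no second call to exp() is needed.
    return o * (1.0 - o)
```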
SLIDE 17: Error Gradient for a Sigmoid Unit

$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2\,(t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right) \\
&= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}
\end{aligned}$$

But we know:
$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)
\qquad
\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$$

So:
$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}$$
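The closing formula, transcribed as a gradient routine for a single sigmoid unit (reusing the sigmoid helper above; names are ours):

```python
def sigmoid_unit_gradient(w, examples):
    """dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}."""
    grad = [0.0] * len(w)
    for x, t in examples:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(len(w)):
            grad[i] += -(t - o) * o * (1.0 - o) * x[i]
    return grad
```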
SLIDE 18: Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, Do
  • For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k:
         δ_k ← o_k (1 − o_k)(t_k − o_k)
    3. For each hidden unit h:
         δ_h ← o_h (1 − o_h) Σ_{k ∈ outputs} w_{h,k} δ_k
    4. Update each network weight w_{i,j}:
         w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}
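A compact Python sketch of the algorithm for one hidden layer of sigmoid units. The fixed epoch count, list-of-lists weight matrices, and all names are our simplifications, and it assumes the sigmoid helper defined after Slide 16:

```python
import random

def train_backprop(examples, n_in, n_hidden, n_out, eta=0.3, epochs=5000):
    """examples: list of (x, t) pairs of plain lists, len(x) == n_in,
    len(t) == n_out, targets in (0, 1). Returns (w_ih, w_ho): weight
    matrices whose index 0 in each row is the bias weight."""
    rnd = lambda: random.uniform(-0.05, 0.05)
    w_ih = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_ho = [[rnd() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):                       # "until satisfied"
        for x, t in examples:
            # Step 1: forward pass, with bias input x0 = 1 at each layer
            xb = [1.0] + list(x)
            h = [sigmoid(sum(w * xi for w, xi in zip(row, xb)))
                 for row in w_ih]
            hb = [1.0] + h
            o = [sigmoid(sum(w * hi for w, hi in zip(row, hb)))
                 for row in w_ho]
            # Step 2: output-unit errors  delta_k = o_k(1-o_k)(t_k - o_k)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # Step 3: hidden-unit errors  delta_h = o_h(1-o_h) sum_k w_hk delta_k
            d_hid = [h[j] * (1 - h[j]) *
                     sum(w_ho[k][j + 1] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # Step 4: weight updates  w_ij += eta * delta_j * x_ij
            for k in range(n_out):
                for i in range(n_hidden + 1):
                    w_ho[k][i] += eta * d_out[k] * hb[i]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_ih[j][i] += eta * d_hid[j] * xb[i]
    return w_ih, w_ho
```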
SLIDE 19: More on Backpropagation

  • Gradient descent over the entire network weight vector
  • Easily generalized to arbitrary directed graphs
  • Will find a local, not necessarily global, error minimum
    - In practice, often works well (can run multiple times)
  • Often include weight momentum α:
    $$\Delta w_{i,j}(n) = \eta \delta_j x_{i,j} + \alpha \Delta w_{i,j}(n - 1)$$
  • Minimizes error over training examples
    - Will it generalize well to subsequent examples?
  • Training can take thousands of iterations → slow!
  • Using the network after training is very fast
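The momentum term changes only step 4 of the algorithm: each weight update mixes in an α-weighted copy of the previous update. A sketch of the modified update for one weight row (α = 0.9 is a common choice, not one the slides prescribe):

```python
def momentum_update(w_row, prev_row, grads, eta=0.3, alpha=0.9):
    """Update one weight row in place with momentum.

    grads[i] holds delta_j * x_ij from ordinary backpropagation;
    prev_row[i] holds the previous update Delta_w(n - 1), initially 0.0.
    """
    for i, g in enumerate(grads):
        delta = eta * g + alpha * prev_row[i]   # eta*delta_j*x_ij + alpha*prev
        w_row[i] += delta
        prev_row[i] = delta                     # remembered for iteration n + 1
```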
SLIDE 20: Learning Hidden Layer Representations

[Figure: an 8 x 3 x 8 network: eight inputs, three hidden units, eight outputs.]

A target function:

  Input      →  Output
  10000000   →  10000000
  01000000   →  01000000
  00100000   →  00100000
  00010000   →  00010000
  00001000   →  00001000
  00000100   →  00000100
  00000010   →  00000010
  00000001   →  00000001

Can this be learned??
SLIDE 21: Learning Hidden Layer Representations

A network:

[Figure: the same 8 x 3 x 8 network.]

Learned hidden layer representation:

  Input      →  Hidden Values    →  Output
  10000000   →  .89  .04  .08    →  10000000
  01000000   →  .01  .11  .88    →  01000000
  00100000   →  .01  .97  .27    →  00100000
  00010000   →  .99  .97  .71    →  00010000
  00001000   →  .03  .05  .02    →  00001000
  00000100   →  .22  .99  .99    →  00000100
  00000010   →  .80  .01  .98    →  00000010
  00000001   →  .60  .94  .01    →  00000001
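The 8 x 3 x 8 experiment can be reproduced with the train_backprop sketch given after Slide 18; the learning rate and epoch count below are our guesses, not the settings behind the table:

```python
# Eight one-hot vectors; the target equals the input (identity function).
patterns = [[1.0 if i == j else 0.0 for i in range(8)] for j in range(8)]
examples = [(p, p) for p in patterns]

w_ih, w_ho = train_backprop(examples, n_in=8, n_hidden=3, n_out=8,
                            eta=0.3, epochs=5000)
# After training, the three hidden-unit activations for each input
# approximate a distinct 3-valued code, as in the table above.
```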
SLIDE 22: Training

[Figure: sum of squared errors for each output unit, plotted over the first 2500 training epochs.]
SLIDE 23: Training

[Figure: hidden unit encoding for input 01000000, plotted over the first 2500 training epochs.]
SLIDE 24: Training

[Figure: weights from the inputs to one hidden unit, plotted over the first 2500 training epochs.]
SLIDE 25: Convergence of Backpropagation

Gradient descent to some local minimum
  • Perhaps not global minimum...
  • Add momentum
  • Stochastic gradient descent
  • Train multiple nets with different initial weights

Nature of convergence
  • Initialize weights near zero
  • Therefore, initial networks near-linear
  • Increasingly non-linear functions possible as training progresses
SLIDE 26: Expressive Capabilities of ANNs

Boolean functions:
  • Every boolean function can be represented by a network with a single hidden layer
  • but might require exponential (in number of inputs) hidden units

Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
SLIDE 27: Overfitting in ANNs

[Figure: two plots of error versus number of weight updates (examples 1 and 2); in each, training set error decreases steadily while validation set error first falls and then rises as the network overfits.]
SLIDE 28: Neural Nets for Face Recognition

[Figure: a network with 30x32 image inputs and four outputs (left, strt, rght, up), shown alongside typical input images.]

90% accurate learning head pose, and recognizing 1-of-20 faces
SLIDE 29: Learned Hidden Unit Weights

[Figure: the same 30x32-input, four-output network with its learned weights visualized, alongside typical input images.]

http://www.cs.cmu.edu/~tom/faces.html
SLIDE 30: Alternative Error Functions

Penalize large weights:
$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$

Train on target slopes as well as values:
$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$$

Tie together weights:
  • e.g., in phoneme recognition network
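Since the derivative of the penalty term with respect to a weight is 2γw, gradient descent on the weight-penalty error just adds a shrinkage term to every weight update. A sketch of the modified update (the γ value is illustrative, not from the slides):

```python
def update_with_weight_decay(w_row, grads, eta=0.3, gamma=1e-4):
    """Weight update for E = squared error + gamma * sum of squared weights.

    grads[i] holds delta_j * x_ij from ordinary backpropagation; the
    extra -2*gamma*w term shrinks each weight toward zero on every step.
    """
    for i, g in enumerate(grads):
        w_row[i] += eta * (g - 2.0 * gamma * w_row[i])
```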
SLIDE 31: Recurrent Networks

[Figure: (a) a feedforward network mapping input x(t) to output y(t + 1); (b) a recurrent network in which context units c(t) feed hidden-layer activations back as additional inputs; (c) the same recurrent network unfolded in time through x(t - 1), c(t - 1) and x(t - 2), c(t - 2), producing y(t - 1), y(t), and y(t + 1).]