CptS 570 – Machine Learning
School of EECS, Washington State University
Also called multilayer perceptrons
Inspired by the human brain
◦ The brain consists of interconnected neurons
◦ The brain still outperforms machines on several tasks
◦ E.g., vision, speech recognition, learning
Nonparametric estimator
Used for classification and regression
Trained using error backpropagation
Processors
◦ Computer: typically 1–2 processors (~10^9 Hz)
◦ Brain: ~10^11 neurons (~10^3 Hz)
Parallelism
◦ Computer: typically little
◦ Brain: massive parallelism
On average, each neuron is connected via synapses to ~10^4 other neurons
“The Singularity is Near”, Ray Kurzweil
y = w^T x = Σ_{j=1}^{d} w_j x_j + w_0
w = [w_0, w_1, …, w_d]^T
x = [1, x_1, …, x_d]^T
y = w x + w_0
(Figure: a perceptron with a single input x, weight w, bias weight w_0, and bias unit x_0 = +1)
Threshold output: If (w x + w_0 > 0) Then y = 1 Else y = 0
Sigmoid output: y = sigmoid(w^T x) = 1 / (1 + exp(−w^T x))
Classification (K outputs, softmax):
o_i = w_i^T x
y_i = exp(o_i) / Σ_k exp(o_k)
Choose C_i if y_i = max_k y_k

Regression:
y_i = w_i^T x = Σ_{j=1}^{d} w_{ij} x_j + w_{i0}
y = W x
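As a concrete illustration, here is a minimal numpy sketch of both output models; the dimensions, random weights, and input below are illustrative and not taken from the lecture.

```python
import numpy as np

# Hypothetical sizes: d inputs, K outputs/classes.
d, K = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d + 1))                   # row i holds w_i = [w_i0, w_i1, ..., w_id]
x = np.concatenate(([1.0], rng.normal(size=d)))   # augmented input with x_0 = 1

# Regression: y_i = w_i^T x, i.e. y = W x
y_reg = W @ x

# Classification: softmax over the linear scores o_i = w_i^T x
o = W @ x
y_cls = np.exp(o - o.max())                       # subtracting o.max() only for numerical stability
y_cls /= y_cls.sum()                              # y_i = exp(o_i) / sum_k exp(o_k)
chosen = int(np.argmax(y_cls))                    # choose C_i if y_i = max_k y_k
```

Subtracting o.max() before exponentiating does not change the softmax values; it only avoids overflow.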
Batch learning (gradient descent)
◦ Requires the entire training set
◦ Each weight update based on a pass through the entire training set
Online learning (stochastic gradient descent)
◦ Allows incremental arrival of training examples
◦ Weights updated for each training example
◦ Adaptive to problems changing over time
◦ Tends to converge faster
Online update: Δw_{ij}^t = η (r_i^t − y_i^t) x_j^t
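The contrast can be made concrete with a short numpy sketch for a single linear output (regression, squared error); the toy data, learning rate, and epoch counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])   # augmented inputs, x_0 = 1
r = X @ np.array([0.5, 1.0, -2.0]) + 0.1 * rng.normal(size=20)
eta = 0.01

# Batch (gradient descent): one update per pass over the whole training set.
w = np.zeros(3)
for epoch in range(100):
    y = X @ w
    w += eta * (r - y) @ X               # gradient summed over all examples

# Online (stochastic gradient descent): one update per training example.
w = np.zeros(3)
for epoch in range(100):
    for t in rng.permutation(len(X)):
        y_t = w @ X[t]
        w += eta * (r[t] - y_t) * X[t]   # Delta w_j^t = eta (r^t - y^t) x_j^t
```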
Regression, single linear output:
E^t(w | x^t, r^t) = (1/2)(r^t − y^t)^2 = (1/2)(r^t − w^T x^t)^2
Δw_j^t = η (r^t − y^t) x_j^t
Classification, single sigmoid output (cross-entropy):
y^t = sigmoid(w^T x^t)
E^t(w | x^t, r^t) = −r^t log y^t − (1 − r^t) log(1 − y^t)
Δw_j^t = η (r^t − y^t) x_j^t

Classification, K > 2 softmax outputs (cross-entropy):
y_i^t = exp(w_i^T x^t) / Σ_k exp(w_k^T x^t)
E^t({w_i} | x^t, r^t) = −Σ_i r_i^t log y_i^t
Δw_{ij}^t = η (r_i^t − y_i^t) x_j^t
Stochastic online gradient descent for K > 2 classes:

For i = 1,…,K
  For j = 0,…,d
    w_ij ← rand(−0.01, 0.01)
Repeat
  For all (x^t, r^t) in X in random order
    For i = 1,…,K
      o_i ← 0
      For j = 0,…,d
        o_i ← o_i + w_ij x_j^t
    For i = 1,…,K
      y_i ← exp(o_i) / Σ_k exp(o_k)
    For i = 1,…,K
      For j = 0,…,d
        w_ij ← w_ij + η (r_i^t − y_i) x_j^t
Until convergence
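A direct numpy translation of this pseudocode might look like the following sketch; the fixed epoch count stands in for the "until convergence" test, and the function name and defaults are my own.

```python
import numpy as np

def train_softmax_sgd(X, R, K, eta=0.1, epochs=100, seed=0):
    """Stochastic online gradient descent for K > 2 classes.
    X: N x (d+1) array of augmented inputs (x_0 = 1).
    R: N x K array of one-hot targets r^t."""
    rng = np.random.default_rng(seed)
    N, d1 = X.shape
    W = rng.uniform(-0.01, 0.01, size=(K, d1))    # w_ij <- rand(-0.01, 0.01)
    for _ in range(epochs):                       # stand-in for "Repeat ... Until convergence"
        for t in rng.permutation(N):              # all (x^t, r^t) in random order
            o = W @ X[t]                          # o_i = sum_j w_ij x_j^t
            y = np.exp(o - o.max())
            y /= y.sum()                          # y_i = exp(o_i) / sum_k exp(o_k)
            W += eta * np.outer(R[t] - y, X[t])   # w_ij <- w_ij + eta (r_i^t - y_i) x_j^t
    return W
```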
XOR is not linearly separable: no w_0, w_1, w_2 satisfy
w_0 ≤ 0
w_1 + w_0 > 0
w_2 + w_0 > 0
w_1 + w_2 + w_0 ≤ 0
Minsky and Papert (1969): stalled perceptron research for 15 years.
Perceptrons can only approximate linear functions
But multiple layers of perceptrons can approximate nonlinear functions
(Figure: MLP with a hidden layer)
y_i = v_i^T z = Σ_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = sigmoid(w_h^T x) = 1 / (1 + exp(−(Σ_{j=1}^{d} w_{hj} x_j + w_{h0})))
(Rumelhart et al., 1986)
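A minimal forward-pass sketch of these two equations, assuming a single output; the shapes and random weights are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H = 3, 4                                        # hypothetical input and hidden sizes
rng = np.random.default_rng(0)
W = rng.uniform(-0.01, 0.01, size=(H, d + 1))      # hidden weights w_hj, including w_h0
v = rng.uniform(-0.01, 0.01, size=H + 1)           # output weights v_h, v[0] is the bias v_0

x = np.concatenate(([1.0], rng.normal(size=d)))    # augmented input with x_0 = 1
z = sigmoid(W @ x)                                 # z_h = sigmoid(w_h^T x)
y = v[1:] @ z + v[0]                               # y = sum_h v_h z_h + v_0
```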
x_1 XOR x_2 = (x_1 AND ~x_2) OR (~x_1 AND x_2)
MLP can represent any Boolean function
◦ Any Boolean function can be expressed as a disjunction of conjunctions
◦ Each conjunction implemented by a hidden unit
◦ Disjunction implemented by one output unit
◦ May need 2^d hidden units in the worst case
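For example, XOR itself fits this recipe with two hidden units for the two conjunctions and one output unit for the disjunction; the threshold units and the specific weights below are one hand-chosen assignment, not from the lecture.

```python
def step(a):
    # Hard-threshold unit used in place of the sigmoid for this Boolean example.
    return 1.0 if a > 0 else 0.0

def xor_mlp(x1, x2):
    h1 = step(x1 - x2 - 0.5)        # hidden unit 1: x1 AND NOT x2
    h2 = step(-x1 + x2 - 0.5)       # hidden unit 2: NOT x1 AND x2
    return step(h1 + h2 - 0.5)      # output unit:   h1 OR h2

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        print(x1, x2, "->", xor_mlp(x1, x2))   # prints 0, 1, 1, 0
```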
MLP with two hidden layers can approximate any function with continuous inputs and outputs
◦ First hidden layer computes hyperplanes for isolating regions of instance space
◦ Second hidden layer ANDs hyperplanes together to isolate regions
◦ Weight from a second-layer hidden unit to the output unit is the value of the function in that region
◦ Piecewise-constant approximator
MLP with one sufficiently large hidden layer can learn any nonlinear function
Weights v_ih feeding into output units are learned using the previous methods:
y_i = v_i^T z = Σ_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = sigmoid(w_h^T x) = 1 / (1 + exp(−(Σ_{j=1}^{d} w_{hj} x_j + w_{h0})))
Weights w_hj feeding into hidden units are learned based on error propagated from the output layer:
∂E/∂w_{hj} = (∂E/∂y_i)(∂y_i/∂z_h)(∂z_h/∂w_{hj})
Error backpropagation (Rumelhart et al., 1986)
Regression with a single output:
E(W, v | X) = (1/2) Σ_t (r^t − y^t)^2
y^t = Σ_{h=1}^{H} v_h z_h^t + v_0
z_h^t = sigmoid(w_h^T x^t)   (forward pass)
Δv_h = η Σ_t (r^t − y^t) z_h^t
Δw_{hj} = −η ∂E/∂w_{hj}   (backward pass)
        = −η Σ_t (∂E^t/∂y^t)(∂y^t/∂z_h^t)(∂z_h^t/∂w_{hj})
        = −η Σ_t −(r^t − y^t) v_h z_h^t (1 − z_h^t) x_j^t
        = η Σ_t (r^t − y^t) v_h z_h^t (1 − z_h^t) x_j^t
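Putting the forward and backward passes together, a batch training loop for this single-output case might look like the following sketch; the hidden size, learning rate, and epoch count are illustrative defaults, not values from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_regression(X, r, H=2, eta=0.1, epochs=300, seed=0):
    """Batch backpropagation for one linear output.
    X: N x (d+1) augmented inputs (x_0 = 1); r: N targets."""
    rng = np.random.default_rng(seed)
    N, d1 = X.shape
    W = rng.uniform(-0.01, 0.01, size=(H, d1))      # hidden-layer weights w_hj
    v = rng.uniform(-0.01, 0.01, size=H + 1)        # output weights, v[0] is the bias v_0
    for _ in range(epochs):
        Z = sigmoid(X @ W.T)                        # forward: z_h^t = sigmoid(w_h^T x^t)
        y = Z @ v[1:] + v[0]                        # forward: y^t = sum_h v_h z_h^t + v_0
        err = r - y                                 # (r^t - y^t)
        dv = eta * np.concatenate(([err.sum()], err @ Z))
        dW = eta * ((err[:, None] * v[1:] * Z * (1 - Z)).T @ X)
        v += dv                                     # Delta v_h = eta sum_t (r^t - y^t) z_h^t
        W += dW                                     # Delta w_hj = eta sum_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t
    return W, v
```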
Regression with multiple outputs:
E(W, V | X) = (1/2) Σ_t Σ_i (r_i^t − y_i^t)^2
y_i^t = Σ_{h=1}^{H} v_{ih} z_h^t + v_{i0}
Δv_{ih} = η Σ_t (r_i^t − y_i^t) z_h^t
Δw_{hj} = η Σ_t [Σ_i (r_i^t − y_i^t) v_{ih}] z_h^t (1 − z_h^t) x_j^t
An epoch is one pass through the training data X.
Note: all weight updates are computed before any are applied.
Example: f(x) = sin(6x)
x^t ~ U(−0.5, 0.5)
y^t = f(x^t) + N(0, 0.1)
2 hidden units
Fit shown after 100, 200, and 300 epochs
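The setup on this slide can be reproduced with a few lines of numpy; the sample size below is a guess, since the slide does not state it.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                            # sample size not given on the slide
x = rng.uniform(-0.5, 0.5, size=N)                # x^t ~ U(-0.5, 0.5)
r = np.sin(6 * x) + rng.normal(0.0, 0.1, size=N)  # targets f(x^t) + N(0, 0.1)
X = np.column_stack([np.ones(N), x])              # augmented inputs with x_0 = 1
# X and r can then be handed to a 2-hidden-unit trainer such as the
# backprop_regression sketch above, training for 100, 200, or 300 epochs.
```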
(Figure: hyperplanes w_h^T x + w_{h0} computed by the hidden units; outputs z_h computed by the hidden units; inputs v_h z_h to the output unit)
Two classes: one sigmoid output y^t for P(C_1 | x^t), with P(C_2 | x^t) ≡ 1 − y^t
y^t = sigmoid(Σ_{h=1}^{H} v_h z_h^t + v_0)
E(W, v | X) = −Σ_t [r^t log y^t + (1 − r^t) log(1 − y^t)]
Δv_h = η Σ_t (r^t − y^t) z_h^t   (same as before)
Δw_{hj} = η Σ_t (r^t − y^t) v_h z_h^t (1 − z_h^t) x_j^t
K > 2 classes: softmax outputs
o_i^t = Σ_{h=1}^{H} v_{ih} z_h^t + v_{i0}
y_i^t = exp(o_i^t) / Σ_k exp(o_k^t) ≡ P(C_i | x^t)
E(W, V | X) = −Σ_t Σ_i r_i^t log y_i^t
Δv_{ih} = η Σ_t (r_i^t − y_i^t) z_h^t
Δw_{hj} = η Σ_t [Σ_i (r_i^t − y_i^t) v_{ih}] z_h^t (1 − z_h^t) x_j^t
Theoretically, only one hidden layer is needed
Multiple hidden layers may simplify the network
z_{1h} = sigmoid(w_{1h}^T x) = sigmoid(Σ_{j=1}^{d} w_{1hj} x_j + w_{1h0}),  h = 1,…,H_1
z_{2l} = sigmoid(w_{2l}^T z_1) = sigmoid(Σ_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}),  l = 1,…,H_2
y = v^T z_2 = Σ_{l=1}^{H_2} v_l z_{2l} + v_0
Training proceeds by propagating the error back layer by layer
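A forward-pass sketch with two hidden layers, mirroring these equations; the layer sizes and random weights are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H1, H2 = 3, 4, 3                                  # hypothetical layer sizes
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.01, 0.01, size=(H1, d + 1))      # first-layer weights w_1hj
W2 = rng.uniform(-0.01, 0.01, size=(H2, H1 + 1))     # second-layer weights w_2lh
v = rng.uniform(-0.01, 0.01, size=H2 + 1)            # output weights v_l, v[0] = v_0

x = np.concatenate(([1.0], rng.normal(size=d)))      # augmented input, x_0 = 1
z1 = np.concatenate(([1.0], sigmoid(W1 @ x)))        # z_1h = sigmoid(w_1h^T x)
z2 = np.concatenate(([1.0], sigmoid(W2 @ z1)))       # z_2l = sigmoid(w_2l^T z_1)
y = v @ z2                                           # y = sum_l v_l z_2l + v_0
```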
Gradient descent can be slow to converge
Successive weight updates can lead to large oscillations
Idea: use the previous weight update to smooth the trajectory
Momentum, with α ∈ (0.5, 1.0):
Δw_i^t = −η ∂E^t/∂w_i + α Δw_i^{t−1}
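The effect of the momentum term is easiest to see on a small toy objective; the quadratic, learning rate, and α below are illustrative, not from the slide.

```python
import numpy as np

# Toy objective E(w) = 0.5 * w^T A w with an ill-conditioned A, where plain
# gradient descent tends to oscillate along the steep direction.
A = np.array([[10.0, 0.0],
              [0.0,  1.0]])
w = np.array([1.0, 1.0])
eta, alpha = 0.05, 0.9
delta = np.zeros_like(w)
for _ in range(200):
    grad = A @ w                          # dE/dw
    delta = -eta * grad + alpha * delta   # Delta w^t = -eta dE/dw + alpha Delta w^(t-1)
    w += delta
print(w)                                  # approaches the minimum at the origin
```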
Learning rate η ∈ (0.0, 0.3)
Kept low to avoid oscillations, but learning is then slow
Prefer a high η initially, then lower η as the network converges
Adaptive learning rate:
Δη = +a if E^{t+τ} < E^t, −bη otherwise
◦ Increase η if the error decreases
◦ Decrease η if the error increases
◦ Best if E^t is averaged over the past few epochs
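A small sketch of this rule driven by a simulated per-epoch error curve; the constants a and b and the error sequence are illustrative placeholders for a real training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
errors = 1.0 / np.arange(1, 51) + 0.02 * rng.normal(size=50)  # stand-in for E^t each epoch
eta, a, b = 0.1, 0.01, 0.1
prev_error = np.inf
for error in errors:
    if error < prev_error:     # E^(t+tau) < E^t: error went down
        eta += a               # Delta eta = +a
    else:                      # error went up
        eta -= b * eta         # Delta eta = -b * eta
    prev_error = error
```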
A network with d inputs, K outputs, and H hidden units has K(H+1) + H(d+1) weights
Choosing H too high can lead to overfitting
This is the same bias/variance dilemma as before
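For example, with hypothetical sizes d = 10 inputs, H = 5 hidden units, and K = 3 outputs, the network has 3(5+1) + 5(10+1) = 18 + 55 = 73 weights.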
Previous example: f(x) = sin(6x)
Similar overfitting behavior occurs if training continues for too long
More and more weights move away from zero
This is called overtraining