

1. CptS 570 – Machine Learning, School of EECS, Washington State University

2.
• Also called multilayer perceptrons
• Inspired by the human brain
  ◦ The brain consists of interconnected neurons
  ◦ The brain still outperforms machines on several tasks, e.g., vision, speech recognition, learning
• Nonparametric estimator
• Used for classification and regression
• Trained using error backpropagation


4.
• Processors
  ◦ Computer: typically 1-2 (~$10^9$ Hz)
  ◦ Brain: ~$10^{11}$ neurons (~$10^3$ Hz)
• Parallelism
  ◦ Computer: typically little
  ◦ Brain: massive parallelism; on average, each neuron is connected via synapses to ~$10^4$ other neurons

5. “The Singularity is Near”, Ray Kurzweil.

6. Perceptron:
$y = \mathbf{w}^T \mathbf{x} = \sum_{j=1}^{d} w_j x_j + w_0$
where $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ and $\mathbf{x} = [1, x_1, \ldots, x_d]^T$.

7. $y = wx + w_0$ [figure: a linear unit with input $x$, weight $w$, and bias weight $w_0$ on a constant bias input $x_0 = +1$]

8.
• Threshold output: if $wx + w_0 > 0$ then $y = 1$, else $y = 0$
• Sigmoid output: $y = \mathrm{sigmoid}(\mathbf{w}^T\mathbf{x}) = \dfrac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$

9. Classification:
$o_i = \mathbf{w}_i^T\mathbf{x}$, $\quad y_i = \dfrac{\exp(o_i)}{\sum_k \exp(o_k)}$, $\quad$ choose $C_i$ if $y_i = \max_k y_k$
Regression:
$y_i = \mathbf{w}_i^T\mathbf{x} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$, i.e., $\mathbf{y} = \mathbf{W}\mathbf{x}$
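These two output rules translate directly into a few lines of NumPy; the sketch below is illustrative only (the sizes, the helper name softmax, and the random weights are my own choices, not from the slides).

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())          # subtract the max for numerical stability
    return e / e.sum()

# Illustrative sizes: K = 3 outputs, d = 2 inputs plus the bias input x_0 = 1.
W = np.random.uniform(-0.01, 0.01, size=(3, 3))   # row i holds w_i = [w_i0, w_i1, w_i2]
x = np.array([1.0, 0.5, -1.2])                    # x = [1, x_1, x_2]

# Classification: o_i = w_i^T x, y_i = exp(o_i) / sum_k exp(o_k), choose argmax_i y_i
o = W @ x
y = softmax(o)
predicted_class = int(np.argmax(y))

# Regression with multiple linear outputs: y = W x
y_reg = W @ x
```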

10.
• Batch learning (gradient descent)
  ◦ Requires the entire training set
  ◦ Each weight update is based on a pass through the entire training set
• Online learning (stochastic gradient descent)
  ◦ Allows incremental arrival of training examples
  ◦ Weights are updated after each training example: $\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$
  ◦ Adapts to problems that change over time
  ◦ Tends to converge faster

11. Regression, single linear output:
$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \tfrac{1}{2}\,(r^t - y^t)^2 = \tfrac{1}{2}\,(r^t - \mathbf{w}^T\mathbf{x}^t)^2$
$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$

12. Classification:
• Single sigmoid output:
$y^t = \mathrm{sigmoid}(\mathbf{w}^T\mathbf{x}^t)$
$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t)\log(1 - y^t)$ (cross entropy)
$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$
• $K > 2$ softmax outputs:
$y_i^t = \dfrac{\exp(\mathbf{w}_i^T\mathbf{x}^t)}{\sum_k \exp(\mathbf{w}_k^T\mathbf{x}^t)}$, $\quad E^t(\{\mathbf{w}_i\} \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t$
$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$
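A minimal sketch of the single-sigmoid-output case, assuming one training example with label r in {0, 1} (the particular values below are made up): the output, the cross-entropy error, and the online update, which has the same form as the squared-error update above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

eta = 0.1                                   # learning rate
w = np.random.uniform(-0.01, 0.01, size=3)  # [w_0, w_1, w_2]
x = np.array([1.0, 0.4, -0.7])              # [1, x_1, x_2], bias input first
r = 1.0                                     # desired output in {0, 1}

y = sigmoid(w @ x)                                  # y = sigmoid(w^T x)
E = -(r * np.log(y) + (1 - r) * np.log(1 - y))      # cross-entropy error
w += eta * (r - y) * x                              # Δw_j = η (r - y) x_j
```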

13. Stochastic online gradient descent for $K > 2$ classes (a runnable sketch follows below):
For i = 1,…,K
    For j = 0,…,d
        w_ij ← rand(-0.01, 0.01)
Repeat
    For all (x^t, r^t) in X in random order
        For i = 1,…,K
            o_i ← 0
            For j = 0,…,d
                o_i ← o_i + w_ij x_j^t
        For i = 1,…,K
            y_i ← exp(o_i) / Σ_k exp(o_k)
        For i = 1,…,K
            For j = 0,…,d
                w_ij ← w_ij + η (r_i^t − y_i) x_j^t
Until convergence
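The pseudocode above, transcribed into a runnable NumPy sketch. The dataset shapes, the fixed epoch count standing in for "until convergence", and the tiny example call at the end are my additions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def train_softmax_sgd(X, R, eta=0.1, epochs=100):
    """Stochastic online gradient descent for K > 2 classes.
    X: (N, d+1) inputs with the bias x_0 = 1 in the first column; R: (N, K) one-hot targets."""
    N, D = X.shape
    K = R.shape[1]
    W = np.random.uniform(-0.01, 0.01, size=(K, D))   # w_ij <- rand(-0.01, 0.01)
    for _ in range(epochs):                           # stands in for "repeat until convergence"
        for t in np.random.permutation(N):            # examples in random order
            o = W @ X[t]                              # o_i = sum_j w_ij x_j^t
            y = softmax(o)                            # y_i = exp(o_i) / sum_k exp(o_k)
            W += eta * np.outer(R[t] - y, X[t])       # w_ij += η (r_i^t - y_i) x_j^t
    return W

# Tiny illustrative call: 4 examples, d = 2 inputs, K = 3 classes.
X = np.hstack([np.ones((4, 1)), np.random.randn(4, 2)])
R = np.eye(3)[[0, 1, 2, 0]]
W = train_softmax_sgd(X, R)
```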


15. XOR is not linearly separable: no $w_0, w_1, w_2$ satisfy all of
$w_0 \le 0$
$w_1 + w_0 > 0$
$w_2 + w_0 > 0$
$w_1 + w_2 + w_0 \le 0$
Minsky and Papert (1969): this result stalled perceptron research for 15 years.
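Not a proof (the inequalities above are), but a quick sanity check: a brute-force search over a coarse grid of candidate weights finds no threshold perceptron that reproduces XOR. Everything in this sketch (the grid, the helper names) is illustrative.

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
r = np.array([0, 1, 1, 0])                             # XOR labels

def perceptron(x, w0, w1, w2):
    return 1 if w1 * x[0] + w2 * x[1] + w0 > 0 else 0

grid = np.linspace(-5, 5, 41)
found = any(
    all(perceptron(x, w0, w1, w2) == t for x, t in zip(X, r))
    for w0, w1, w2 in product(grid, grid, grid)
)
print(found)   # False: no (w0, w1, w2) on this grid realizes XOR
```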

16.
• Perceptrons can only approximate linear functions
• But multiple layers of perceptrons can approximate nonlinear functions [figure: network with a hidden layer between inputs and output]

17. Multilayer perceptron (Rumelhart et al., 1986):
$y_i = \mathbf{v}_i^T\mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$
$z_h = \mathrm{sigmoid}(\mathbf{w}_h^T\mathbf{x}) = \dfrac{1}{1 + \exp\!\left(-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right)}$
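A forward-pass sketch of this architecture, assuming sigmoid hidden units and linear outputs; the sizes and random weights are placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W, V):
    """x: (d,) input; W: (H, d+1) hidden-layer weights; V: (K, H+1) output weights."""
    x1 = np.concatenate(([1.0], x))   # prepend the bias input x_0 = 1
    z = sigmoid(W @ x1)               # z_h = sigmoid(w_h^T x)
    z1 = np.concatenate(([1.0], z))   # prepend the bias unit z_0 = 1
    return V @ z1                     # y_i = v_i^T z

d, H, K = 2, 3, 1
W = np.random.uniform(-0.01, 0.01, size=(H, d + 1))
V = np.random.uniform(-0.01, 0.01, size=(K, H + 1))
y = mlp_forward(np.array([0.5, -1.0]), W, V)
```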

18. $x_1$ XOR $x_2$ = ($x_1$ AND ~$x_2$) OR (~$x_1$ AND $x_2$)

19.
• An MLP can represent any Boolean function (see the hand-wired XOR sketch below)
  ◦ Any Boolean function can be expressed as a disjunction of conjunctions
  ◦ Each conjunction is implemented by one hidden unit
  ◦ The disjunction is implemented by the output unit
  ◦ May need $2^d$ hidden units in the worst case
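As a concrete instance of this construction, here is a hand-wired two-hidden-unit network for XOR; the particular weight values are my own choice, picked so that each hidden threshold unit fires for one conjunction and the output unit ORs them.

```python
import numpy as np

def threshold(a):
    return (a > 0).astype(int)

def xor_mlp(x1, x2):
    x = np.array([1, x1, x2])                     # bias input first
    # Hidden unit 1 fires for (x1 AND ~x2); hidden unit 2 fires for (~x1 AND x2).
    Wh = np.array([[-0.5,  1.0, -1.0],
                   [-0.5, -1.0,  1.0]])
    z = threshold(Wh @ x)
    # Output unit ORs the two conjunctions: it fires if z_1 + z_2 - 0.5 > 0.
    v = np.array([-0.5, 1.0, 1.0])
    return int(threshold(v @ np.concatenate(([1], z))))

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 0]
```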

20.
• An MLP with two hidden layers can approximate any function with continuous inputs and outputs
  ◦ The first hidden layer computes hyperplanes that bound regions of the instance space
  ◦ The second hidden layer ANDs hyperplanes together to isolate regions
  ◦ The weight from a second-layer hidden unit to the output unit is the value of the function in that region
  ◦ This yields a piecewise-constant approximator
• An MLP with one sufficiently large hidden layer can learn any nonlinear function

21. Error backpropagation (Rumelhart et al., 1986):
• Weights $v_{ih}$ feeding into the output units are learned using the previous methods, with
$y_i = \mathbf{v}_i^T\mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$, $\quad z_h = \mathrm{sigmoid}(\mathbf{w}_h^T\mathbf{x}) = \dfrac{1}{1 + \exp\!\left(-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right)}$
• Weights $w_{hj}$ feeding into the hidden units are learned from the error propagated back from the output layer:
$\dfrac{\partial E}{\partial w_{hj}} = \dfrac{\partial E}{\partial y_i}\,\dfrac{\partial y_i}{\partial z_h}\,\dfrac{\partial z_h}{\partial w_{hj}}$

22. Regression with a single output:
$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \tfrac{1}{2}\sum_t (r^t - y^t)^2$
Forward: $\quad y^t = \sum_{h=1}^{H} v_h z_h^t + v_0$, $\quad z_h^t = \mathrm{sigmoid}(\mathbf{w}_h^T\mathbf{x}^t)$
Backward:
$\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t$
$\Delta w_{hj} = -\eta\,\dfrac{\partial E}{\partial w_{hj}} = -\eta \sum_t \dfrac{\partial E}{\partial y^t}\,\dfrac{\partial y^t}{\partial z_h^t}\,\dfrac{\partial z_h^t}{\partial w_{hj}} = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t$
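One batch epoch of these updates as a NumPy sketch (the function name, shapes, and data layout are my assumptions): the forward pass computes $z_h^t$ and $y^t$ for each example, and the $\Delta v$ and $\Delta w$ terms are accumulated over the whole pass before any weight is changed, matching the note on slide 24.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_epoch(X, r, W, v, eta=0.2):
    """X: (N, d+1) inputs with x_0 = 1; r: (N,) targets;
    W: (H, d+1) hidden weights; v: (H+1,) output weights, v[0] being the bias v_0."""
    dW = np.zeros_like(W)
    dv = np.zeros_like(v)
    for x_t, r_t in zip(X, r):
        z = sigmoid(W @ x_t)               # z_h^t = sigmoid(w_h^T x^t)
        y = v[0] + v[1:] @ z               # y^t = sum_h v_h z_h^t + v_0
        err = r_t - y
        dv[0]  += eta * err                # Δv_0 = η Σ_t (r^t - y^t)
        dv[1:] += eta * err * z            # Δv_h = η Σ_t (r^t - y^t) z_h^t
        dW += eta * err * np.outer(v[1:] * z * (1 - z), x_t)   # Δw_hj
    return W + dW, v + dv                  # all updates applied after the full pass
```

Calling backprop_epoch repeatedly gives the batch training loop; moving the weight update inside the example loop would give the online (stochastic) variant.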

23. With multiple outputs:
$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \tfrac{1}{2}\sum_t \sum_i (r_i^t - y_i^t)^2$, $\quad y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$
$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\,z_h^t$
$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\,v_{ih}\right] z_h^t(1 - z_h^t)\,x_j^t$

24. An epoch is one pass through the training data $\mathcal{X}$. Note: all weight updates are computed before any are applied.

25. Example (data-generation sketch below):
• $f(x) = \sin(6x)$
• $x^t \sim U(-0.5, 0.5)$
• $y^t = f(x^t) + \mathcal{N}(0, 0.1)$
• 2 hidden units
• Fit shown after 100, 200 and 300 epochs
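A small data-generation sketch for this experiment; the sample size, the random seed, and the reading of $\mathcal{N}(0, 0.1)$ as noise with standard deviation 0.1 are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                            # sample size (illustrative)
x = rng.uniform(-0.5, 0.5, size=N)                 # x^t ~ U(-0.5, 0.5)
y = np.sin(6 * x) + rng.normal(0.0, 0.1, size=N)   # y^t = f(x^t) + noise
X = np.column_stack([np.ones(N), x])               # prepend the bias input x_0 = 1
```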


27. [figure: the hidden units compute hyperplanes $\mathbf{w}_h^T\mathbf{x} + w_{h0}$ over the inputs; their outputs $z_h$ feed the output unit through the weights $v_h$ (contributions $v_h z_h$)]

28. Two classes:
• One sigmoid output $y^t$ for $P(C_1 \mid \mathbf{x}^t)$, with $P(C_2 \mid \mathbf{x}^t) \equiv 1 - y^t$:
$y^t = \mathrm{sigmoid}\!\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$
$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[r^t \log y^t + (1 - r^t)\log(1 - y^t)\right]$
$\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t$ (same as before)
$\Delta w_{hj} = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t$

29. $K > 2$ classes:
$o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$, $\quad y_i^t = \dfrac{\exp(o_i^t)}{\sum_k \exp(o_k^t)} \equiv P(C_i \mid \mathbf{x}^t)$
$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$
$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\,z_h^t$
$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\,v_{ih}\right] z_h^t(1 - z_h^t)\,x_j^t$

30.
• Theoretically, only one hidden layer is needed
• Multiple hidden layers may simplify the network:
$z_{1h} = \mathrm{sigmoid}(\mathbf{w}_{1h}^T\mathbf{x}) = \mathrm{sigmoid}\!\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1,\ldots,H_1$
$z_{2l} = \mathrm{sigmoid}(\mathbf{w}_{2l}^T\mathbf{z}_1) = \mathrm{sigmoid}\!\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1,\ldots,H_2$
$y = \mathbf{v}^T\mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$
• Training proceeds by propagating the error back layer by layer
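A hedged sketch of this two-hidden-layer forward pass (shapes and names are illustrative); training would then backpropagate the error through $\mathbf{z}_2$ and $\mathbf{z}_1$, layer by layer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_hidden_layer_forward(x, W1, W2, v):
    """W1: (H1, d+1), W2: (H2, H1+1), v: (H2+1,); bias weights come first in each row."""
    z1 = sigmoid(W1 @ np.concatenate(([1.0], x)))    # z_1h = sigmoid(w_1h^T x)
    z2 = sigmoid(W2 @ np.concatenate(([1.0], z1)))   # z_2l = sigmoid(w_2l^T z_1)
    return v @ np.concatenate(([1.0], z2))           # y = v^T z_2

d, H1, H2 = 2, 4, 3
y = two_hidden_layer_forward(np.array([0.3, -0.8]),
                             np.random.uniform(-0.01, 0.01, (H1, d + 1)),
                             np.random.uniform(-0.01, 0.01, (H2, H1 + 1)),
                             np.random.uniform(-0.01, 0.01, H2 + 1))
```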

31.
• Gradient descent can be slow to converge
• Successive weight updates can lead to large oscillations
• Idea: use the previous weight update to smooth the trajectory
• Momentum, with $\alpha \in (0.5, 1.0)$:
$\Delta w_i^t = -\eta\,\dfrac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1}$
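A minimal sketch of the momentum update; the toy quadratic error (and its gradient) exists only to make the loop runnable.

```python
import numpy as np

def grad_E(w):
    return w        # gradient of the toy error E(w) = 0.5 * ||w||^2

eta, alpha = 0.1, 0.9            # learning rate and momentum factor in (0.5, 1.0)
w = np.array([1.0, -2.0])
prev_update = np.zeros_like(w)
for _ in range(50):
    update = -eta * grad_E(w) + alpha * prev_update   # Δw^t = -η ∂E^t/∂w + α Δw^{t-1}
    w += update
    prev_update = update
```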

32.
• Learning rate $\eta \in (0.0, 0.3)$
• Kept low to avoid oscillations, but a low $\eta$ makes learning slow
• Prefer a high $\eta$ initially, then lower $\eta$ as the network converges
• Adaptive learning rate:
$\Delta\eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$
  ◦ Increase $\eta$ if the error decreases
  ◦ Decrease $\eta$ if the error increases
  ◦ Works best if $E^t$ is an average over the past few epochs
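A sketch of this adaptation rule; the constants a and b and the example error values are illustrative.

```python
def adapt_learning_rate(eta, prev_error, new_error, a=0.01, b=0.5):
    # Δη = +a if E^{t+τ} < E^t, otherwise Δη = -b·η.
    if new_error < prev_error:
        return eta + a
    return eta - b * eta

eta = 0.1
eta = adapt_learning_rate(eta, prev_error=0.42, new_error=0.40)   # error fell: η rises to 0.11
eta = adapt_learning_rate(eta, prev_error=0.40, new_error=0.45)   # error rose: η is halved
```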

33.
• A network with $d$ inputs, $K$ outputs and $H$ hidden units has $K(H+1) + H(d+1)$ weights
• Choosing $H$ too high can lead to overfitting
• This is the same bias/variance dilemma as before
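As a quick worked check using the earlier $\sin(6x)$ example ($d = 1$ input, $H = 2$ hidden units, $K = 1$ output): $K(H+1) + H(d+1) = 1\cdot 3 + 2\cdot 2 = 7$ weights.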

34. Previous example: $f(x) = \sin(6x)$

35.
• Similar overfitting behavior occurs if training continues too long
• More and more weights move away from zero
• This is called overtraining
