CptS 570 – Machine Learning
School of EECS, Washington State University
Also called multilayer perceptrons
Inspired by the human brain
◦ The brain consists of interconnected neurons
◦ The brain still outperforms machines on several tasks
◦ E.g., vision, speech recognition, learning
Nonparametric estimator
Used for classification and regression
Trained using error backpropagation
Processors
◦ Computer: typically 1–2 processors (~10^9 Hz)
◦ Brain: ~10^11 neurons (~10^3 Hz)
Parallelism
◦ Computer: typically little
◦ Brain: massive parallelism
On average, each neuron is connected via synapses to ~10^4 other neurons
“The Singularity is Near”, Ray Kurzweil
y = w^T x = Σ_{j=1}^{d} w_j x_j + w_0
w = [w_0, w_1, …, w_d]^T
x = [1, x_1, …, x_d]^T
y = w x + w_0
(Figure: a perceptron with a single input x, weight w, bias weight w_0, and bias unit x_0 = +1)
Threshold output: If (w x + w_0 > 0) Then y = 1 Else y = 0
Sigmoid output: y = sigmoid(w^T x) = 1 / (1 + exp(−w^T x))
Classification (K outputs, softmax):
o_i = w_i^T x
y_i = exp(o_i) / Σ_k exp(o_k)
Choose C_i if y_i = max_k y_k

Regression:
y_i = w_i^T x = Σ_{j=1}^{d} w_{ij} x_j + w_{i0}
y = W x
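As a concrete illustration, here is a minimal numpy sketch of both output models; the dimensions, random weights, and input below are illustrative and not taken from the lecture.

```python
import numpy as np

# Hypothetical sizes: d inputs, K outputs/classes.
d, K = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d + 1))                   # row i holds w_i = [w_i0, w_i1, ..., w_id]
x = np.concatenate(([1.0], rng.normal(size=d)))   # augmented input with x_0 = 1

# Regression: y_i = w_i^T x, i.e. y = W x
y_reg = W @ x

# Classification: softmax over the linear scores o_i = w_i^T x
o = W @ x
y_cls = np.exp(o - o.max())                       # subtracting o.max() only for numerical stability
y_cls /= y_cls.sum()                              # y_i = exp(o_i) / sum_k exp(o_k)
chosen = int(np.argmax(y_cls))                    # choose C_i if y_i = max_k y_k
```

Subtracting o.max() before exponentiating does not change the softmax values; it only avoids overflow.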
Batch learning (gradient descent)
◦ Requires the entire training set
◦ Each weight update based on a pass through the entire training set
Online learning (stochastic gradient descent)
◦ Allows incremental arrival of training examples
◦ Weights updated for each training example
◦ Adaptive to problems changing over time
◦ Tends to converge faster
Online update: Δw_{ij}^t = η (r_i^t − y_i^t) x_j^t
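The contrast can be made concrete with a short numpy sketch for a single linear output (regression, squared error); the toy data, learning rate, and epoch counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])   # augmented inputs, x_0 = 1
r = X @ np.array([0.5, 1.0, -2.0]) + 0.1 * rng.normal(size=20)
eta = 0.01

# Batch (gradient descent): one update per pass over the whole training set.
w = np.zeros(3)
for epoch in range(100):
    y = X @ w
    w += eta * (r - y) @ X               # gradient summed over all examples

# Online (stochastic gradient descent): one update per training example.
w = np.zeros(3)
for epoch in range(100):
    for t in rng.permutation(len(X)):
        y_t = w @ X[t]
        w += eta * (r[t] - y_t) * X[t]   # Delta w_j^t = eta (r^t - y^t) x_j^t
```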
Regression, single linear output:
E^t(w | x^t, r^t) = (1/2)(r^t − y^t)^2 = (1/2)(r^t − w^T x^t)^2
Δw_j^t = η (r^t − y^t) x_j^t
Classification, single sigmoid output (cross-entropy):
y^t = sigmoid(w^T x^t)
E^t(w | x^t, r^t) = −r^t log y^t − (1 − r^t) log(1 − y^t)
Δw_j^t = η (r^t − y^t) x_j^t

Classification, K > 2 softmax outputs (cross-entropy):
y_i^t = exp(w_i^T x^t) / Σ_k exp(w_k^T x^t)
E^t({w_i} | x^t, r^t) = −Σ_i r_i^t log y_i^t
Δw_{ij}^t = η (r_i^t − y_i^t) x_j^t
Stochastic online gradient descent for K > 2 classes:

For i = 1,…,K
  For j = 0,…,d
    w_ij ← rand(−0.01, 0.01)
Repeat
  For all (x^t, r^t) in X in random order
    For i = 1,…,K
      o_i ← 0
      For j = 0,…,d
        o_i ← o_i + w_ij x_j^t
    For i = 1,…,K
      y_i ← exp(o_i) / Σ_k exp(o_k)
    For i = 1,…,K
      For j = 0,…,d
        w_ij ← w_ij + η (r_i^t − y_i) x_j^t
Until convergence
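A direct numpy translation of this pseudocode might look like the following sketch; the fixed epoch count stands in for the "until convergence" test, and the function name and defaults are my own.

```python
import numpy as np

def train_softmax_sgd(X, R, K, eta=0.1, epochs=100, seed=0):
    """Stochastic online gradient descent for K > 2 classes.
    X: N x (d+1) array of augmented inputs (x_0 = 1).
    R: N x K array of one-hot targets r^t."""
    rng = np.random.default_rng(seed)
    N, d1 = X.shape
    W = rng.uniform(-0.01, 0.01, size=(K, d1))    # w_ij <- rand(-0.01, 0.01)
    for _ in range(epochs):                       # stand-in for "Repeat ... Until convergence"
        for t in rng.permutation(N):              # all (x^t, r^t) in random order
            o = W @ X[t]                          # o_i = sum_j w_ij x_j^t
            y = np.exp(o - o.max())
            y /= y.sum()                          # y_i = exp(o_i) / sum_k exp(o_k)
            W += eta * np.outer(R[t] - y, X[t])   # w_ij <- w_ij + eta (r_i^t - y_i) x_j^t
    return W
```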
XOR is not linearly separable: no w_0, w_1, w_2 satisfy
w_0 ≤ 0
w_1 + w_0 > 0
w_2 + w_0 > 0
w_1 + w_2 + w_0 ≤ 0
Minsky and Papert (1969): stalled perceptron research for 15 years.
Perceptrons can only approximate linear functions
But multiple layers of perceptrons can approximate nonlinear functions
(Figure: MLP with a hidden layer)
y_i = v_i^T z = Σ_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = sigmoid(w_h^T x) = 1 / (1 + exp(−(Σ_{j=1}^{d} w_{hj} x_j + w_{h0})))
(Rumelhart et al., 1986)
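A minimal forward-pass sketch of these two equations, assuming a single output; the shapes and random weights are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H = 3, 4                                        # hypothetical input and hidden sizes
rng = np.random.default_rng(0)
W = rng.uniform(-0.01, 0.01, size=(H, d + 1))      # hidden weights w_hj, including w_h0
v = rng.uniform(-0.01, 0.01, size=H + 1)           # output weights v_h, v[0] is the bias v_0

x = np.concatenate(([1.0], rng.normal(size=d)))    # augmented input with x_0 = 1
z = sigmoid(W @ x)                                 # z_h = sigmoid(w_h^T x)
y = v[1:] @ z + v[0]                               # y = sum_h v_h z_h + v_0
```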
x_1 XOR x_2 = (x_1 AND ~x_2) OR (~x_1 AND x_2)
MLP can represent any Boolean function
◦ Any Boolean function can be expressed as a disjunction of conjunctions
◦ Each conjunction implemented by a hidden unit
◦ Disjunction implemented by one output unit
◦ May need 2^d hidden units in the worst case
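For example, XOR itself fits this recipe with two hidden units for the two conjunctions and one output unit for the disjunction; the threshold units and the specific weights below are one hand-chosen assignment, not from the lecture.

```python
def step(a):
    # Hard-threshold unit used in place of the sigmoid for this Boolean example.
    return 1.0 if a > 0 else 0.0

def xor_mlp(x1, x2):
    h1 = step(x1 - x2 - 0.5)        # hidden unit 1: x1 AND NOT x2
    h2 = step(-x1 + x2 - 0.5)       # hidden unit 2: NOT x1 AND x2
    return step(h1 + h2 - 0.5)      # output unit:   h1 OR h2

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        print(x1, x2, "->", xor_mlp(x1, x2))   # prints 0, 1, 1, 0
```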
MLP with two hidden layers can approximate any function with continuous inputs and outputs
◦ First hidden layer computes hyperplanes for isolating regions of instance space
◦ Second hidden layer ANDs hyperplanes together to isolate regions
◦ Weight from a second-layer hidden unit to the output unit is the value of the function in that region
◦ Piecewise-constant approximator
MLP with one sufficiently large hidden layer can learn any nonlinear function
Weights v_ih feeding into output units are learned using the previous methods:
y_i = v_i^T z = Σ_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = sigmoid(w_h^T x) = 1 / (1 + exp(−(Σ_{j=1}^{d} w_{hj} x_j + w_{h0})))
Weights w_hj feeding into hidden units are learned based on error propagated from the output layer:
∂E/∂w_{hj} = (∂E/∂y_i)(∂y_i/∂z_h)(∂z_h/∂w_{hj})
Error backpropagation (Rumelhart et al., 1986)
Regression with a single output:
E(W, v | X) = (1/2) Σ_t (r^t − y^t)^2
y^t = Σ_{h=1}^{H} v_h z_h^t + v_0
z_h^t = sigmoid(w_h^T x^t)   (forward pass)
Δv_h = η Σ_t (r^t − y^t) z_h^t
Δw_{hj} = −η ∂E/∂w_{hj}   (backward pass)
        = −η Σ_t (∂E^t/∂y^t)(∂y^t/∂z_h^t)(∂z_h^t/∂w_{hj})
        = −η Σ_t −(r^t − y^t) v_h z_h^t (1 − z_h^t) x_j^t
        = η Σ_t (r^t − y^t) v_h z_h^t (1 − z_h^t) x_j^t
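Putting the forward and backward passes together, a batch training loop for this single-output case might look like the following sketch; the hidden size, learning rate, and epoch count are illustrative defaults, not values from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_regression(X, r, H=2, eta=0.1, epochs=300, seed=0):
    """Batch backpropagation for one linear output.
    X: N x (d+1) augmented inputs (x_0 = 1); r: N targets."""
    rng = np.random.default_rng(seed)
    N, d1 = X.shape
    W = rng.uniform(-0.01, 0.01, size=(H, d1))      # hidden-layer weights w_hj
    v = rng.uniform(-0.01, 0.01, size=H + 1)        # output weights, v[0] is the bias v_0
    for _ in range(epochs):
        Z = sigmoid(X @ W.T)                        # forward: z_h^t = sigmoid(w_h^T x^t)
        y = Z @ v[1:] + v[0]                        # forward: y^t = sum_h v_h z_h^t + v_0
        err = r - y                                 # (r^t - y^t)
        dv = eta * np.concatenate(([err.sum()], err @ Z))
        dW = eta * ((err[:, None] * v[1:] * Z * (1 - Z)).T @ X)
        v += dv                                     # Delta v_h = eta sum_t (r^t - y^t) z_h^t
        W += dW                                     # Delta w_hj = eta sum_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t
    return W, v
```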
Regression with multiple outputs:
E(W, V | X) = (1/2) Σ_t Σ_i (r_i^t − y_i^t)^2
y_i^t = Σ_{h=1}^{H} v_{ih} z_h^t + v_{i0}
Δv_{ih} = η Σ_t (r_i^t − y_i^t) z_h^t
Δw_{hj} = η Σ_t [Σ_i (r_i^t − y_i^t) v_{ih}] z_h^t (1 − z_h^t) x_j^t
An epoch is one pass through the training data X.
Note: all weight updates are computed before any are applied.
Example: f(x) = sin(6x)
x^t ~ U(−0.5, 0.5)
y^t = f(x^t) + N(0, 0.1)
2 hidden units
Fit shown after 100, 200, and 300 epochs
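The setup on this slide can be reproduced with a few lines of numpy; the sample size below is a guess, since the slide does not state it.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                            # sample size not given on the slide
x = rng.uniform(-0.5, 0.5, size=N)                # x^t ~ U(-0.5, 0.5)
r = np.sin(6 * x) + rng.normal(0.0, 0.1, size=N)  # targets f(x^t) + N(0, 0.1)
X = np.column_stack([np.ones(N), x])              # augmented inputs with x_0 = 1
# X and r can then be handed to a 2-hidden-unit trainer such as the
# backprop_regression sketch above, training for 100, 200, or 300 epochs.
```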
(Figure: hyperplanes w_h^T x + w_{h0} computed by the hidden units; outputs z_h computed by the hidden units; inputs v_h z_h to the output unit)
Two classes: one sigmoid output y^t for P(C_1 | x^t), with P(C_2 | x^t) ≡ 1 − y^t
y^t = sigmoid(Σ_{h=1}^{H} v_h z_h^t + v_0)
E(W, v | X) = −Σ_t [r^t log y^t + (1 − r^t) log(1 − y^t)]
Δv_h = η Σ_t (r^t − y^t) z_h^t   (same as before)
Δw_{hj} = η Σ_t (r^t − y^t) v_h z_h^t (1 − z_h^t) x_j^t
K > 2 classes: softmax outputs
o_i^t = Σ_{h=1}^{H} v_{ih} z_h^t + v_{i0}
y_i^t = exp(o_i^t) / Σ_k exp(o_k^t) ≡ P(C_i | x^t)
E(W, V | X) = −Σ_t Σ_i r_i^t log y_i^t
Δv_{ih} = η Σ_t (r_i^t − y_i^t) z_h^t
Δw_{hj} = η Σ_t [Σ_i (r_i^t − y_i^t) v_{ih}] z_h^t (1 − z_h^t) x_j^t
Theoretically, only one hidden layer is needed
Multiple hidden layers may simplify the network
z_{1h} = sigmoid(w_{1h}^T x) = sigmoid(Σ_{j=1}^{d} w_{1hj} x_j + w_{1h0}),  h = 1,…,H_1
z_{2l} = sigmoid(w_{2l}^T z_1) = sigmoid(Σ_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}),  l = 1,…,H_2
y = v^T z_2 = Σ_{l=1}^{H_2} v_l z_{2l} + v_0
Training proceeds by propagating the error back layer by layer
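A forward-pass sketch with two hidden layers, mirroring these equations; the layer sizes and random weights are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H1, H2 = 3, 4, 3                                  # hypothetical layer sizes
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.01, 0.01, size=(H1, d + 1))      # first-layer weights w_1hj
W2 = rng.uniform(-0.01, 0.01, size=(H2, H1 + 1))     # second-layer weights w_2lh
v = rng.uniform(-0.01, 0.01, size=H2 + 1)            # output weights v_l, v[0] = v_0

x = np.concatenate(([1.0], rng.normal(size=d)))      # augmented input, x_0 = 1
z1 = np.concatenate(([1.0], sigmoid(W1 @ x)))        # z_1h = sigmoid(w_1h^T x)
z2 = np.concatenate(([1.0], sigmoid(W2 @ z1)))       # z_2l = sigmoid(w_2l^T z_1)
y = v @ z2                                           # y = sum_l v_l z_2l + v_0
```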
Gradient descent can be slow to converge
Successive weight updates can lead to large oscillations
Idea: use the previous weight update to smooth the trajectory
Momentum, with α ∈ (0.5, 1.0):
Δw_i^t = −η ∂E^t/∂w_i + α Δw_i^{t−1}
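The effect of the momentum term is easiest to see on a small toy objective; the quadratic, learning rate, and α below are illustrative, not from the slide.

```python
import numpy as np

# Toy objective E(w) = 0.5 * w^T A w with an ill-conditioned A, where plain
# gradient descent tends to oscillate along the steep direction.
A = np.array([[10.0, 0.0],
              [0.0,  1.0]])
w = np.array([1.0, 1.0])
eta, alpha = 0.05, 0.9
delta = np.zeros_like(w)
for _ in range(200):
    grad = A @ w                          # dE/dw
    delta = -eta * grad + alpha * delta   # Delta w^t = -eta dE/dw + alpha Delta w^(t-1)
    w += delta
print(w)                                  # approaches the minimum at the origin
```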
Learning rate η ∈ (0.0, 0.3)
Kept low to avoid oscillations, but learning is then slow
Prefer a high η initially, then lower η as the network converges
Adaptive learning rate:
Δη = +a if E^{t+τ} < E^t, −bη otherwise
◦ Increase η if the error decreases
◦ Decrease η if the error increases
◦ Best if E^t is averaged over the past few epochs
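A small sketch of this rule driven by a simulated per-epoch error curve; the constants a and b and the error sequence are illustrative placeholders for a real training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
errors = 1.0 / np.arange(1, 51) + 0.02 * rng.normal(size=50)  # stand-in for E^t each epoch
eta, a, b = 0.1, 0.01, 0.1
prev_error = np.inf
for error in errors:
    if error < prev_error:     # E^(t+tau) < E^t: error went down
        eta += a               # Delta eta = +a
    else:                      # error went up
        eta -= b * eta         # Delta eta = -b * eta
    prev_error = error
```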
A network with d inputs, K outputs, and H hidden units has K(H+1) + H(d+1) weights
Choosing H too high can lead to overfitting
This is the same bias/variance dilemma as before
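For example, with hypothetical sizes d = 10 inputs, H = 5 hidden units, and K = 3 outputs, the network has 3(5+1) + 5(10+1) = 18 + 55 = 73 weights.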
Previous example: f(x) = sin(6x)
Similar overfitting behavior occurs if training continues for too long
More and more weights move away from zero
This is called overtraining