Chapter 6: Multilayer Neural Networks (Sections 6.1-6.3) • Introduction • Feedforward Operation and Classification • Backpropagation Algorithm
Pattern Recognition
Two main challenges: • Representation • Matching
Jain, CSE 802, Spring 2013
Representation and Matching
How Good is Face Representation?
(Figure: a 2009 driver license photo searched against a gallery of 34 million images — 30M DMV photos, 4M mugshots. Courtesy: Pete Langenfeld, MSP Driver License Information)
How Good is Face Representation?
Smile makes a difference! (Figure: top-10 retrievals. Gallery: 34 million images — 30M DMV photos, 4M mugshots. Courtesy: Pete Langenfeld, MSP)
State of the Art in FR: Verification
Benchmark datasets: FRGC v2.0 (2006), MBGC (2010), LFW (2007), IJB-A (2015)
• LFW Standard Protocol: 99.77% accuracy (3,000 genuine & 3,000 imposter pairs; 10-fold CV)
• LFW BLUFR Protocol: 88% TAR @ 0.1% FAR (156,915 genuine, ~46M imposter pairs; 10-fold CV)
D. Wang, C. Otto and A. K. Jain, "Face Search at Scale: 80 Million Gallery", arXiv, July 28, 2015
Neural Networks
• Massive parallelism is essential for complex recognition tasks (speech & image recognition)
• Humans take only ~200 ms for most cognitive tasks; this suggests parallel computation in the human brain
• Biological networks achieve excellent recognition performance via dense interconnection of simple computational elements (neurons)
• Number of neurons ≈ 10^10 – 10^12
• Number of interconnections per neuron ≈ 10^3 – 10^4
• Total number of interconnections ≈ 10^14
• Damage to a few neurons or synapses (links) does not appear to impair performance (robustness)
Neuron
• Nodes are nonlinear, typically analog
• A single neuron with inputs x_1, ..., x_d, weights w_1, ..., w_d and output Y computes
  Y = f( Σ_{i=1}^{d} w_i x_i − θ ),
  where θ is an internal threshold or offset and f(·) is a nonlinear activation function
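A minimal sketch of this unit in Python, assuming a hard-threshold (sign) nonlinearity; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def neuron_output(x, w, theta):
    """Y = f(w . x - theta), with a hard-threshold (sign) nonlinearity f."""
    net = np.dot(w, x) - theta          # weighted sum of inputs minus the offset
    return 1.0 if net >= 0 else -1.0    # hard-limiting activation

x = np.array([0.5, -1.0, 2.0])          # d = 3 inputs
w = np.array([0.3,  0.8, 0.1])          # corresponding weights
print(neuron_output(x, w, theta=0.2))   # net = -0.65, so the unit emits -1.0
```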
Neural Networks
• Feed-forward networks have one or more (hidden) layers between the input and output nodes
• (Figure: d input units, a first hidden layer with N_H1 units, a second hidden layer with N_H2 units, and c output units)
• How many nodes and hidden layers are needed?
• How is the network trained?
Form of the Discriminant Function
• Linear: hyperplane decision boundaries
• Non-linear: arbitrary decision boundaries
• Two design strategies: adopt a model of the discriminant and accept the resulting decision boundary, or specify the desired decision boundary directly
Linear Discriminant Function
• For a 2-class problem, a discriminant function that is a linear combination of the input features can be written as
  g(x) = w^t x + w_0,
  where w is the weight vector and w_0 is the bias or threshold weight
• The sign of the function value gives the class label
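A minimal sketch of evaluating this discriminant in Python; the numeric values are illustrative, not from the slides:

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

x  = np.array([1.0, 2.0])      # feature vector
w  = np.array([0.5, -0.25])    # weight vector
w0 = 0.1                       # bias / threshold weight
g  = linear_discriminant(x, w, w0)   # 0.5 - 0.5 + 0.1 = 0.1
label = 1 if g > 0 else 2            # sign of g(x) gives the class label
```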
Quadratic Discriminant Function
• Obtained by adding pair-wise products of features:
  g(x) = w_0 + Σ_{i=1}^{d} w_i x_i + Σ_{i=1}^{d} Σ_{j=i}^{d} w_ij x_i x_j
  The linear part has (d+1) parameters; the quadratic part contributes d(d+1)/2 additional parameters
• g(x) positive implies class 1; g(x) negative implies class 2
• g(x) = 0 represents a hyperquadric, as opposed to the hyperplanes of the linear discriminant case
• Adding more terms such as w_ijk x_i x_j x_k results in polynomial discriminant functions
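A minimal sketch of evaluating such a quadratic discriminant; the coefficient values are illustrative assumptions:

```python
import numpy as np

def quadratic_discriminant(x, w0, w, W):
    """g(x) = w0 + sum_i w_i x_i + sum_{i<=j} W[i, j] x_i x_j  (W upper-triangular)."""
    d = len(x)
    quad = sum(W[i, j] * x[i] * x[j] for i in range(d) for j in range(i, d))
    return w0 + np.dot(w, x) + quad

x  = np.array([1.0, -2.0])
w0 = 0.5
w  = np.array([1.0, 0.5])
W  = np.array([[0.2, -0.3],
               [0.0,  0.1]])   # d(d+1)/2 = 3 quadratic coefficients for d = 2
label = 1 if quadratic_discriminant(x, w0, w, W) > 0 else 2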
Generalized Discriminant Function
• A generalized linear discriminant function, where y = f(x), can be written as
  g(x) = a^t y = Σ_{i=1}^{d̂} a_i y_i(x)
• Setting the y_i(x) to be monomials results in polynomial discriminant functions; d̂ is the dimensionality of the augmented feature space, and the a_i are the weights in that space. Note that the function is linear in a
• Equivalently, a = [a_1, a_2, ..., a_d̂]^t and y = [y_1(x), y_2(x), ..., y_d̂(x)]^t, where y is also called the augmented feature vector
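A minimal sketch of one such mapping; the particular monomials and weight values below are our own illustrative choices:

```python
import numpy as np

def augment(x):
    """Example mapping for d = 2: y(x) = [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

x = np.array([0.5, -1.0])
a = np.array([0.1, 1.0, -0.5, 0.3, 0.2, -0.4])   # weights in the augmented space
g = np.dot(a, augment(x))                        # linear in a, nonlinear in x
```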
Perceptron
• The perceptron is a linear classifier; it makes predictions based on a linear predictor function combining a set of weights with the feature vector
• The perceptron algorithm was invented by Rosenblatt in the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced
• The algorithm allows for online learning; it processes training samples one at a time
Two-category Linearly Separable Case
• Let y_1, y_2, ..., y_n be n training samples in the augmented feature space, and assume they are linearly separable
• We need to find a weight vector a such that
  • a^t y > 0 for examples from the positive class
  • a^t y < 0 for examples from the negative class
• "Normalizing" the input examples by multiplying them by their class label (i.e., replacing all samples from class 2 by their negatives), we instead seek a weight vector a such that
  • a^t y > 0 for all the examples (here y has been multiplied by its class label); see the sketch below
• The resulting weight vector is called a separating vector or solution vector
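A minimal sketch of this augmentation-and-normalization step on a tiny made-up dataset (the data values are illustrative assumptions):

```python
import numpy as np

X = np.array([[ 2.0,  1.0],    # class 1
              [ 1.0,  3.0],    # class 1
              [-1.0, -1.0],    # class 2
              [-2.0,  0.5]])   # class 2
labels = np.array([1, 1, -1, -1])

Y = np.hstack([np.ones((len(X), 1)), X])   # augmented samples [1, x1, x2]
Y = Y * labels[:, None]                    # negate the class-2 samples
# a is a separating (solution) vector iff (Y @ a > 0).all()
```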
The Perceptron Criterion Function
• Goal: find a weight vector a such that a^t y > 0 for all the samples (assuming such a vector exists)
• Mathematically, this could be expressed as finding a weight vector a that minimizes the number of misclassified samples; but that function is piecewise constant (discontinuous, hence non-differentiable) and difficult to optimize
• Perceptron criterion function: find a that minimizes
  J_p(a) = Σ_{y ∈ Y_m} (−a^t y),
  where Y_m is the set of samples misclassified by a
• This criterion is proportional to the sum of distances from the misclassified samples to the decision boundary
• The minimization is now mathematically tractable, making J_p a better criterion function than the number of misclassifications
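A minimal sketch of J_p and one batch gradient step, assuming Y holds the label-normalized, augmented samples from the sketch above:

```python
import numpy as np

def perceptron_criterion(a, Y):
    """J_p(a) = sum over misclassified y of (-a^t y)."""
    misclassified = Y[Y @ a <= 0]            # samples with non-positive margin
    return -np.sum(misclassified @ a), misclassified

def batch_update(a, Y, eta=1.0):
    """One gradient-descent step: a <- a + eta * (sum of misclassified samples)."""
    _, misclassified = perceptron_criterion(a, Y)
    return a + eta * misclassified.sum(axis=0)
```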
Fixed-increment Single-sample Perceptron
• Also called perceptron learning in an online setting
• For large datasets, this is more efficient than batch mode
• n = number of training samples; a = weight vector; k = iteration number. At iteration k, present one (normalized) sample y^k; if it is misclassified (a^t y^k ≤ 0), update a ← a + y^k, otherwise leave a unchanged
• See Chapter 5, page 230
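A minimal sketch of this learning loop (again assuming Y holds label-normalized, augmented samples); samples are presented one at a time until an epoch passes with no updates:

```python
import numpy as np

def fixed_increment_perceptron(Y, max_epochs=1000):
    n, d = Y.shape
    a = np.zeros(d)                      # initial weight vector
    for _ in range(max_epochs):
        updates = 0
        for y in Y:                      # present samples one at a time (online)
            if np.dot(a, y) <= 0:        # sample misclassified
                a = a + y                # fixed-increment update
                updates += 1
        if updates == 0:                 # every sample satisfies a^t y > 0
            return a                     # a is a separating (solution) vector
    return a                             # may not terminate if data are not separable
```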
Perceptron Convergence Theorem
If the training samples are linearly separable, then the sequence of weight vectors given by the fixed-increment single-sample perceptron will terminate at a solution vector. What happens if the patterns are not linearly separable?
Multilayer Perceptron
Can we learn the nonlinearity at the same time as the linear discriminant? This is the goal of multilayer neural networks, or multilayer perceptrons.
Feedforward Operation and Classification
• A three-layer neural network consists of an input layer, a hidden layer and an output layer, interconnected by modifiable (learned) weights represented by links between layers
• A multilayer neural network implements linear discriminants, but in a space where the inputs have been mapped nonlinearly
• Figure 6.1 shows a simple three-layer network
No training is involved here, since we are implementing a known input/output mapping.
• A single "bias unit" is connected to each unit in addition to the input units
• Net activation:
  net_j = Σ_{i=1}^{d} x_i w_ji + w_j0 = Σ_{i=0}^{d} x_i w_ji ≡ w_j^t x,
  where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_ji denotes the input-to-hidden layer weight at hidden unit j. (In neurobiology, such weights or connections are called "synapses")
• Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j)
• Figure 6.1 shows a simple threshold activation function:
  f(net) = sgn(net) = +1 if net ≥ 0, −1 if net < 0
• The function f(·) is also called the activation function or "nonlinearity" of a unit. There are more general activation functions with desirable properties
• Each output unit similarly computes its net activation based on the hidden unit signals:
  net_k = Σ_{j=1}^{n_H} y_j w_kj + w_k0 = Σ_{j=0}^{n_H} y_j w_kj ≡ w_k^t y,
  where the subscript k indexes units in the output layer and n_H denotes the number of hidden units
• The output units are referred to as z_k. An output unit computes the nonlinear function of its net input, emitting z_k = f(net_k)
• In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x), k = 1, ..., c; the input x is classified according to the largest discriminant function g_k(x)
• The three-layer network with the weights listed in Fig. 6.1 solves the XOR problem
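A minimal sketch of this feedforward pass for a d-input, n_H-hidden-unit, c-output network. The weight-matrix layout (bias weights in column 0) and the default tanh activation are our assumptions; Fig. 6.1 uses a hard sign function instead:

```python
import numpy as np

def forward(x, W_hidden, W_out, f=np.tanh):
    """W_hidden: (n_H, d+1); W_out: (c, n_H+1); column 0 holds the bias weights."""
    y = f(W_hidden @ np.concatenate(([1.0], x)))   # hidden outputs y_j = f(net_j)
    z = f(W_out    @ np.concatenate(([1.0], y)))   # outputs z_k = f(net_k)
    return z

def classify(x, W_hidden, W_out):
    """Assign x to the class with the largest discriminant z_k = g_k(x)."""
    return int(np.argmax(forward(x, W_hidden, W_out))) + 1
```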
• The hidden unit y_1 computes the boundary x_1 + x_2 + 0.5 = 0: y_1 = +1 if x_1 + x_2 + 0.5 ≥ 0, and y_1 = −1 otherwise
• The hidden unit y_2 computes the boundary x_1 + x_2 − 1.5 = 0: y_2 = +1 if x_1 + x_2 − 1.5 ≥ 0, and y_2 = −1 otherwise
• The output unit emits z_1 = +1 if and only if y_1 = +1 and y_2 = −1
• Using the terminology of computer logic, the units behave like gates: the first hidden unit is an OR gate, the second hidden unit is an AND gate, and the output unit implements
  z_1 = y_1 AND NOT y_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2) = x_1 XOR x_2,
  which provides the nonlinear decision of Fig. 6.1
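A minimal check of this XOR network with sign activations. The hidden-layer weights encode the two boundaries above; the output-layer weights (bias −1.0, then 0.7 and −0.4) are one choice that implements y_1 AND NOT y_2 and may differ from the exact values in Fig. 6.1:

```python
import numpy as np

def sgn(v):
    return np.where(v >= 0, 1.0, -1.0)

W_hidden = np.array([[ 0.5, 1.0, 1.0],    # y1: boundary x1 + x2 + 0.5 = 0 (OR gate)
                     [-1.5, 1.0, 1.0]])   # y2: boundary x1 + x2 - 1.5 = 0 (AND gate)
W_out    = np.array([[-1.0, 0.7, -0.4]])  # z1: y1 AND NOT y2

def forward(x):
    y = sgn(W_hidden @ np.concatenate(([1.0], x)))
    z = sgn(W_out    @ np.concatenate(([1.0], y)))
    return z[0]

for x in ([-1, -1], [-1, 1], [1, -1], [1, 1]):     # inputs coded as +/-1
    print(x, forward(np.array(x, dtype=float)))    # +1 exactly when x1 != x2
```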