Perceptron Algorithm: the number of mistakes the perceptron makes is bounded by (R/γ)², where γ is the best-case margin and R is the length of the longest input vector 37
Adjusting weights: w_{t+1} = w_t − η ∇E^(t)(w_t)
• Weight update for a training pair (x^(t), y^(t)):
– Perceptron: if sign(wᵀx^(t)) ≠ y^(t) then Δw = x^(t) y^(t), else Δw = 0
– ADALINE: Δw = η (y^(t) − wᵀx^(t)) x^(t), with E^(t)(w) = (y^(t) − wᵀx^(t))²
• Widrow-Hoff, LMS, or delta rule 38
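A minimal numpy sketch of the two update rules above (illustrative names; assumes targets y ∈ {−1, +1} and a bias feature folded into x):

```python
import numpy as np

def perceptron_update(w, x, y, eta=1.0):
    """Perceptron rule: update only when the thresholded prediction is wrong."""
    if np.sign(w @ x) != y:
        return w + eta * y * x      # delta_w = eta * y * x on a mistake
    return w                        # no change on a correct prediction

def adaline_update(w, x, y, eta=0.01):
    """ADALINE / Widrow-Hoff / LMS / delta rule: update on the real-valued error."""
    error = y - w @ x               # error is computed before thresholding
    return w + eta * error * x
```

Both rules touch the weights only through the current example, which is why they can be run online, one training pair at a time.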
How to learn the weights: multi-class example 40
How to learn the weights: multi-class example • If correct: no change • If wrong: – lower the score of the wrong answer (by subtracting the input from the weight vector of the wrong answer) – raise the score of the target (by adding the input to the weight vector of the target class) 41
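A minimal sketch of this multi-class update, assuming one weight vector (row) per class and scores computed as dot products; the names are illustrative:

```python
import numpy as np

def multiclass_perceptron_update(W, x, target):
    """W: (num_classes, num_features) matrix, one weight vector per class.
    x: feature vector, target: index of the correct class."""
    predicted = int(np.argmax(W @ x))   # the class with the highest score wins
    if predicted != target:             # if wrong:
        W[predicted] -= x               #   lower the score of the wrong answer
        W[target] += x                  #   raise the score of the target class
    return W                            # if correct: no change
```

Each row of W plays the role of the per-class template discussed on the next slide.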
Single layer networks as template matching • The weights for each class act as a template (sometimes also called a prototype) for that class. – The winner is the most similar template. • The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes. • To capture all the allowable variations of a digit we need to learn the features that it is composed of. 47
The history of perceptrons • They were popularised by Frank Rosenblatt in the early 1960s. – They appeared to have a very powerful learning algorithm. – Lots of grand claims were made for what they could learn to do. • In 1969, Minsky and Papert published a book called "Perceptrons" that analyzed what they could do and showed their limitations. – Many people thought these limitations applied to all neural network models. 48
What binary threshold neurons cannot do • A binary threshold output unit cannot even tell if two single-bit features are the same! • A geometric view of what binary threshold neurons cannot do • The positive and negative cases cannot be separated by a plane 49
What binary threshold neurons cannot do • Positive cases (same): (1,1) → 1; (0,0) → 1 • Negative cases (different): (1,0) → 0; (0,1) → 0 • The four input-output pairs give four inequalities that are impossible to satisfy: – w1 + w2 ≥ θ – 0 ≥ θ – w1 < θ – w2 < θ • Summing the first two gives w1 + w2 ≥ 2θ, while summing the last two gives w1 + w2 < 2θ: a contradiction. 50
Discriminating simple patterns under translation with wrap-around • Suppose we just use pixels as the features. • A binary decision unit cannot discriminate patterns with the same number of on pixels – if the patterns can translate with wrap-around! 51
Sketch of a proof β’ For pattern A, use training cases in all possible translations. β Each pixel will be activated by 4 different translations of pattern A. β So the total input received by the decision unit over all these patterns will be four times the sum of all the weights. β’ For pattern B, use training cases in all possible translations. β Each pixel will be activated by 4 different translations of pattern B. β So the total input received by the decision unit over all these patterns will be four times the sum of all the weights. β’ But to discriminate correctly, every single case of pattern A must provide more input to the decision unit than every single case of pattern B. β’ This is impossible if the sums over cases are the same. 52
Networks with hidden units • Networks without hidden units are very limited in the input-output mappings they can learn to model. – More layers of linear units do not help. It's still linear. – Fixed output non-linearities are not enough. • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets? 53
The multi-layer perceptron • A network of perceptrons – Generally "layered" 54
Feed-forward neural networks β’ Also called Multi-Layer Perceptron (MLP) 55
MLP with single hidden layer • Two-layer MLP (the number of layers of adaptive weights is counted):
y_k = σ( Σ_{j=0..M} w^{(2)}_{kj} z_j ),  z_j = h( Σ_{i=0..D} w^{(1)}_{ji} x_i ),  with bias units x_0 = 1 and z_0 = 1
where the x_i are the inputs (i = 0, …, D), the z_j are the hidden activations (j = 1 … M), and the y_k are the outputs (k = 1, …, K). 56
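A short numpy sketch of this forward pass, with the biases written explicitly instead of via x_0 = 1 and z_0 = 1; h and σ are placeholders for the hidden and output nonlinearities (names and shapes are illustrative):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, h=np.tanh,
                sigma=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """Two-layer MLP: y = sigma(W2 @ h(W1 @ x + b1) + b2)."""
    z = h(W1 @ x + b1)      # hidden activations z_j, j = 1 .. M
    y = sigma(W2 @ z + b2)  # outputs y_k, k = 1 .. K
    return y

# Example with D = 5 inputs, M = 3 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
D, M, K = 5, 3, 2
y = mlp_forward(rng.normal(size=D),
                rng.normal(size=(M, D)), np.zeros(M),
                rng.normal(size=(K, M)), np.zeros(K))
```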
Beyond linear models: y = Wx;  y = W2 h(W1 x) 57
Beyond linear models: y = Wx;  y = W2 h(W1 x);  y = W3 h(W2 h(W1 x)) 58
Defining "depth" • What is a "deep" network? 60
Deep Structures • In any directed network of computational elements with input source nodes and output sink nodes, "depth" is the length of the longest path from a source to a sink • Left: Depth = 2. Right: Depth = 3 • "Deep": Depth > 2 61
The multi-layer perceptron • Inputs are real or Boolean stimuli • Outputs are real or Boolean values – Can have multiple outputs for a single input • What can this network compute? – What kinds of input/output relationships can it model? 63
MLPs approximate functions [figure: an example MLP over inputs X, Y, Z, A with hand-chosen weights and thresholds] • MLPs can compose Boolean functions • MLPs can compose real-valued functions • What are the limitations? 64
Multi-layer Perceptrons as universal Boolean functions 65
The perceptron as a Boolean gate [figure: single units over X and Y with weights and thresholds, e.g. AND: weights 1, 1, threshold 2; OR: weights 1, 1, threshold 1; NOT: weight −1, threshold 0] • A perceptron can model any simple binary Boolean gate 67
Perceptron as a Boolean gate [figure: unit with weights +1 on X1 … XL, −1 on XL+1 … XN, threshold L] Will fire only if X1 … XL are all 1 and XL+1 … XN are all 0 • The universal AND gate – AND over any number of inputs, any subset of which may be negated 68
Perceptron as a Boolean gate [figure: unit with weights +1 on X1 … XL, −1 on XL+1 … XN, threshold L − N + 1] Will fire if any of X1 … XL are 1 or any of XL+1 … XN are 0 • The universal OR gate – OR over any number of inputs, any subset of which may be negated 69
Perceptron as a Boolean gate [figure: unit with weights +1 on all inputs, threshold K] Will fire only if at least K inputs are 1 • Generalized majority gate – Fires if at least K inputs are of the desired polarity 70
Perceptron as a Boolean gate [figure: unit with weights +1 on X1 … XL, −1 on XL+1 … XN, threshold L − N + K] Will fire only if the total number of X1 … XL that are 1, plus the number of XL+1 … XN that are 0, is at least K • Generalized majority gate – Fires if at least K inputs are of the desired polarity 71
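A small sketch of the threshold-gate constructions on the last few slides: +1 weights for inputs that should be 1, −1 weights for inputs that should be 0, and a threshold chosen so the unit fires only when enough inputs have the desired polarity (function names are illustrative):

```python
def threshold_gate(xs, weights, theta):
    """Fires (returns 1) iff the weighted sum of the Boolean inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, xs)) >= theta)

def AND(*xs):                      # fires only if all inputs are 1
    return threshold_gate(xs, [1] * len(xs), len(xs))

def OR(*xs):                       # fires if any input is 1
    return threshold_gate(xs, [1] * len(xs), 1)

def at_least_k_of_polarity(xs, positive, k):
    """Generalized majority gate: fires if at least k inputs have the desired polarity
    (positive[i] == True means x_i should be 1, False means x_i should be 0)."""
    weights = [1 if p else -1 for p in positive]
    num_negated = sum(1 for p in positive if not p)
    # a negated input that is 0 contributes 0 to the sum, so shift the threshold accordingly
    return threshold_gate(xs, weights, k - num_negated)

assert AND(1, 1, 1) == 1 and AND(1, 0, 1) == 0
assert OR(0, 0, 1) == 1 and OR(0, 0, 0) == 0
```

AND is the special case K = N (every input must have the desired polarity) and OR is the case K = 1.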
The perceptron is not enough [figure: a single unit over X and Y with unknown weights, marked "?"] • Cannot compute an XOR 72
Multi-layer perceptron XOR [figure: XOR network over X and Y with a hidden layer of two units feeding an output unit] • An XOR takes three perceptrons 73
Multi-layer perceptron XOR [figure: hidden unit with weights 1, 1 and threshold 1.5; output unit with weights 1, 1, −2 and threshold 0.5] • With 2 neurons – 5 weights and two thresholds 74
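A quick check of this two-neuron construction, reading the weights off the figure (hidden unit: weights 1, 1, threshold 1.5, i.e. an AND; output unit: weights 1, 1 on the inputs and −2 on the hidden unit, threshold 0.5); the code is just a verification sketch:

```python
def step(a, theta):
    return int(a >= theta)

def xor_two_neurons(x, y):
    h = step(x + y, 1.5)                 # hidden unit: fires only when both inputs are 1 (AND)
    return step(x + y - 2 * h, 0.5)      # output: x OR y, suppressed when both are 1

for x in (0, 1):
    for y in (0, 1):
        assert xor_two_neurons(x, y) == (x ^ y)
```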
Multi-layer perceptron [figure: a larger hand-wired MLP over inputs X, Y, Z, A] • MLPs can compute more complex Boolean functions • MLPs can compute any Boolean function – Since they can emulate individual gates • MLPs are universal Boolean functions 75
MLP as Boolean Functions [figure: the same hand-wired MLP over X, Y, Z, A] • MLPs are universal Boolean functions – Any function over any number of inputs and any number of outputs • But how many "layers" will they need? 76
How many layers for a Boolean MLP?
• A Boolean function is just a truth table. The truth table below lists all input combinations for which the output is 1:
X1 X2 X3 X4 X5 | Y
 0  0  1  1  0 | 1
 0  1  0  1  1 | 1
 0  1  1  0  0 | 1
 1  0  0  0  1 | 1
 1  0  1  1  1 | 1
 1  1  0  0  1 | 1
77
How many layers for a Boolean MLP?
• Expressed in disjunctive normal form (one term per row of the truth table with output 1):
y = X̄1 X̄2 X3 X4 X̄5 + X̄1 X2 X̄3 X4 X5 + X̄1 X2 X3 X̄4 X̄5 + X1 X̄2 X̄3 X̄4 X5 + X1 X̄2 X3 X4 X5 + X1 X2 X̄3 X̄4 X5
78
How many layers for a Boolean MLP?
• Any truth table can be expressed in this manner!
• A one-hidden-layer MLP is a Universal Boolean Function
• But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function? 86
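A sketch of this one-hidden-layer construction: one hidden unit per truth-table row with output 1 (a minterm that fires only on that exact input pattern), and an output unit that ORs the hidden units. Assumes 0/1 inputs; the helper names are illustrative:

```python
def build_dnf_mlp(true_rows):
    """true_rows: the input tuples for which the Boolean function is 1."""
    def net(x):
        hidden = []
        for row in true_rows:
            weights = [1 if bit else -1 for bit in row]   # +1 for plain literals, -1 for negated ones
            theta = sum(row)                              # reached only when x matches the row exactly
            hidden.append(int(sum(w * xi for w, xi in zip(weights, x)) >= theta))
        return int(sum(hidden) >= 1)                      # output unit: OR over the hidden units
    return net

# The six rows with Y = 1 from the truth table on these slides
true_rows = [(0, 0, 1, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0),
             (1, 0, 0, 0, 1), (1, 0, 1, 1, 1), (1, 1, 0, 0, 1)]
f = build_dnf_mlp(true_rows)
assert all(f(r) == 1 for r in true_rows)
assert f((0, 0, 0, 0, 0)) == 0
```

The width of the hidden layer equals the number of rows with output 1, which is exactly the worst-case question posed above.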
Worst case
• Which truth tables cannot be reduced further simply?
• Largest width needed for a single-hidden-layer Boolean network on N inputs – Worst case: 2^(N−1)
• Example: the parity function O = W ⊕ X ⊕ Y ⊕ Z [Karnaugh map: a checkerboard of alternating 1s and 0s, so no adjacent cells can be grouped] 87
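A quick sanity check of the worst case: the N-variable parity function is 1 on exactly 2^(N−1) input patterns, and since no two of its minterms can be merged, the one-hidden-layer construction above needs 2^(N−1) hidden units. A sketch:

```python
from itertools import product

def parity_minterm_count(n):
    """Number of n-bit input patterns on which the parity (XOR) function is 1."""
    return sum(1 for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1)

for n in range(1, 9):
    assert parity_minterm_count(n) == 2 ** (n - 1)
```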
Boolean functions
• Input: N Boolean variables
• How many neurons in a one-hidden-layer MLP are required?
• A more compact representation of a Boolean function: the "Karnaugh Map"
  • represents the truth table as a grid
  • grouping adjacent boxes reduces the complexity of the Disjunctive Normal Form (DNF) formula
[Karnaugh map example with groups of adjacent 1 cells] 88
How many neurons in the hidden layer? [example: the DNF read directly off the Karnaugh map versus the reduced expression obtained by grouping adjacent cells; fewer terms means fewer hidden units] 89
Width of a deep MLP [figure: Karnaugh maps over the variables W, X, Y, Z and over U, V, W, X, Y, Z, illustrating how the required width grows with the number of variables] 92
Using a deep network: parity function on N inputs • Simple MLP with one hidden layer: 2^(N−1) hidden units, (N+2)·2^(N−1) + 1 weights and biases 93
Using a deep network: parity function on N inputs
• Simple MLP with one hidden layer: 2^(N−1) hidden units, (N+2)·2^(N−1) + 1 weights and biases
• Deep alternative: O = X1 ⊕ X2 ⊕ ⋯ ⊕ XN built from pairwise XORs: 3(N−1) nodes, 9(N−1) weights and biases
• The actual number of parameters in a network is the number that really matters in software or hardware implementations 94
A better architecture
• Only requires 2·log₂(N) layers
• O = ((X1 ⊕ X2) ⊕ (X3 ⊕ X4)) ⊕ ((X5 ⊕ X6) ⊕ (X7 ⊕ X8)): pair up the inputs and combine them in a balanced tree of XORs 95
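A sketch of this tree construction: each pairwise XOR is the three-perceptron sub-network from the earlier slide, and pairs are combined in a balanced tree, giving about 3(N−1) units in roughly 2·log₂(N) layers. The helper names are illustrative:

```python
def xor_gate(x, y):
    """Three-perceptron XOR: two hidden units (x AND NOT y, y AND NOT x) plus an OR output."""
    h1 = int(x - y >= 1)       # fires iff x = 1 and y = 0
    h2 = int(y - x >= 1)       # fires iff y = 1 and x = 0
    return int(h1 + h2 >= 1)   # OR of the two hidden units

def parity_tree(bits):
    """Reduce the inputs pairwise until one value remains: a tree of ~log2(N) XOR stages."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = [xor_gate(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                 # an odd leftover input passes through to the next stage
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

assert parity_tree([1, 0, 1, 1]) == 1
assert parity_tree([1, 1, 0, 0, 1, 1, 0, 1]) == 1
```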
The challenge of depth
• Using only K hidden layers will require O(2^(CN)) neurons in the K-th layer, for some constant C that depends on K
– Because the output can be shown to be the XOR of all the outputs of the (K−1)-th hidden layer
– i.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the minimum required number of neurons cannot model the function 96
Caveat 1: Not all Boolean functions…
• Not all Boolean circuits have such a clear depth-vs-size tradeoff
• Shannon's theorem: for N > 2, there is a Boolean function of N variables that requires at least 2^N / N gates
– More correctly, for large N, almost all N-input Boolean functions need more than 2^N / N gates
• Regardless of depth
• Note: if all Boolean functions over N inputs could be computed using a circuit of size that is polynomial in N, P = NP! 99
Caveat 2
• We used a simple "Boolean circuit" analogy for the explanation
• We actually have a threshold circuit (TC), not just a Boolean circuit (AC)
– Specifically, composed of threshold gates
• More versatile than Boolean gates (they can compute the majority function)
• E.g. "at least K inputs are 1" is a single TC gate, but an exponential-size AC circuit
• For fixed depth, Boolean circuits ⊂ threshold circuits (strict subset)
– A depth-2 TC parity circuit can be composed with O(N²) weights
• But a network of depth log(N) requires only O(N) weights
• Other formal analyses typically view neural networks as arithmetic circuits
– Circuits which compute polynomials over any field
• So let's consider functions over the field of reals 100
Summary: Wide vs. deep networks • An MLP with a single hidden layer is a universal Boolean function • However, a single-hidden-layer network might need an exponential number of hidden units w.r.t. the number of inputs • Deeper networks may require far fewer neurons than shallower networks to express the same function – Could be exponentially smaller • Optimal width and depth depend on the number of variables and the complexity of the Boolean function – Complexity: the minimal number of terms in a DNF formula representing it 101
MLPs as universal classifiers 102
The MLP as a classifier [figure: an MLP mapping a 784-dimensional MNIST image to a class decision, e.g. "2"] • MLP as a function over real inputs • MLP as a function that finds a complex "decision boundary" over a space of reals 103
A Perceptron on Reals [figure: a unit over x1 … xN that fires when Σᵢ wᵢ xᵢ ≥ T; in two dimensions the boundary w1 x1 + w2 x2 = T is a line in the (x1, x2) plane] • A perceptron operates on real-valued vectors – This is a linear classifier 104
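The decision rule above as a one-liner (a sketch; w, x and T are as in the slide):

```python
import numpy as np

def real_perceptron(x, w, T):
    """Fires iff the weighted sum of the real-valued inputs reaches the threshold T."""
    return int(np.dot(w, x) >= T)

# In two dimensions the decision boundary w1*x1 + w2*x2 = T is a straight line
assert real_perceptron(np.array([2.0, 0.5]), np.array([1.0, 1.0]), T=2.0) == 1
```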
Boolean functions with a real perceptron [figure: AND, OR and other gates drawn as linear decision boundaries over the square with corners (0,0), (0,1), (1,0), (1,1)] • Boolean perceptrons are also linear classifiers – Purple regions are 1 105
Composing complicated "decision" boundaries • Linear units can now be composed into "networks" to compute arbitrary classification "boundaries" [figure: an irregular coloured region in the (x1, x2) plane] • Build a network of units with a single output that fires if the input is in the coloured area 106
Booleans over the reals [figure: the coloured region in the (x1, x2) plane, bounded by straight lines] • The network must fire if the input is in the coloured area 107
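A sketch of one way such a network can be wired, assuming the coloured area is a convex region so that it can be written as an intersection of half-planes: each hidden perceptron checks one half-plane, and the output unit ANDs them. The weights here are illustrative, not the slide's:

```python
import numpy as np

def convex_region_unit(x, half_planes):
    """half_planes: list of (w, b) pairs; each hidden perceptron fires iff w . x >= b.
    The output unit ANDs them, so the network fires only inside the intersection."""
    hidden = [int(np.dot(w, x) >= b) for w, b in half_planes]
    return int(sum(hidden) >= len(half_planes))

# Example: the unit square [0, 1] x [0, 1] as the intersection of four half-planes
square = [(np.array([1, 0]), 0), (np.array([-1, 0]), -1),
          (np.array([0, 1]), 0), (np.array([0, -1]), -1)]
assert convex_region_unit(np.array([0.5, 0.5]), square) == 1
assert convex_region_unit(np.array([1.5, 0.5]), square) == 0
```

Non-convex regions can then be handled by OR-ing several such convex pieces with one more layer of units.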