How many layers for a Boolean MLP? Truth table shows all input combinations for which output is 1:
Truth Table
X1 X2 X3 X4 X5 | Y
 0  0  1  1  0 | 1
 0  1  0  1  1 | 1
 0  1  1  0  0 | 1
 1  0  0  0  1 | 1
 1  0  1  1  1 | 1
 1  1  0  0  1 | 1
[Figure: one-hidden-layer MLP over inputs X1 X2 X3 X4 X5]
• Any truth table can be expressed in this manner!
• A one-hidden-layer MLP is a Universal Boolean Function
But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function? 37
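To make the DNF construction concrete, here is a minimal sketch (not from the slides): one hidden threshold unit per truth-table row whose output is 1, plus an output unit that ORs them. The particular weights and thresholds below are one standard choice, assumed for illustration.

```python
import itertools

# Rows of the truth table (from the slide) for which Y = 1.
rows_with_output_1 = [
    (0, 0, 1, 1, 0),
    (0, 1, 0, 1, 1),
    (0, 1, 1, 0, 0),
    (1, 0, 0, 0, 1),
    (1, 0, 1, 1, 1),
    (1, 1, 0, 0, 1),
]

def hidden_unit(row, x):
    # Fires (outputs 1) only for the exact input pattern `row`:
    # weight +1 where the row has a 1, -1 where it has a 0,
    # threshold = number of 1s in the row.
    weights = [1 if r == 1 else -1 for r in row]
    threshold = sum(row)
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)

def mlp(x):
    # Output unit is an OR: fires if any hidden unit fires (threshold 1).
    return int(sum(hidden_unit(row, x) for row in rows_with_output_1) >= 1)

# The network reproduces the truth table exactly.
for x in itertools.product([0, 1], repeat=5):
    assert mlp(x) == (1 if x in rows_with_output_1 else 0)
```

The same recipe works for any truth table, which is the sense in which a one-hidden-layer MLP is a universal Boolean function.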
Reducing a Boolean Function
[Karnaugh map: rows WX ∈ {00, 01, 11, 10}, columns YZ ∈ {00, 01, 11, 10}]
This is a “Karnaugh Map”. It represents a truth table as a grid. Filled boxes represent input combinations for which the output is 1; blank boxes have output 0. Adjacent boxes can be “grouped” to reduce the complexity of the DNF formula for the table.
• DNF form:
– Find groups
– Express as reduced DNF 38
Reducing a Boolean Function
[Karnaugh map: rows WX, columns YZ]
The basic DNF formula will require 7 terms. 39
Reducing a Boolean Function
[Karnaugh map: rows WX, columns YZ, with groups marked]
• Reduced DNF form:
– Find groups
– Express as reduced DNF 40
Reducing a Boolean Function
[Karnaugh map: rows WX, columns YZ; corresponding one-hidden-layer network over inputs W, X, Y, Z]
• Reduced DNF form:
– Find groups
– Express as reduced DNF 41
Largest irreducible DNF?
[Karnaugh map: rows WX, columns YZ]
• What arrangement of ones and zeros simply cannot be reduced further? 42
Largest irreducible DNF?
[Karnaugh map: rows WX, columns YZ]
How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function?
• What arrangement of ones and zeros simply cannot be reduced further? 44
Width of a single-layer Boolean network
[Karnaugh map over 6 variables: UV × WX × YZ grid]
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function of 6 variables? 45
The actual number of parameters in a network
[Figure: one-hidden-layer MLP over inputs X1 X2 X3 X4 X5]
• The actual number of parameters in a network is the number of connections
– In this example there are 30
• This is the number that really matters in software or hardware implementations 46
Width of a single-layer Boolean network
[Karnaugh map over 6 variables: UV × WX × YZ grid]
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function of 6 variables?
– How many weights will this network require? 47
Width of a single-layer Boolean network
[Karnaugh map over 6 variables: UV × WX × YZ grid]
Can be generalized: will require 2^(N-1) perceptrons in the hidden layer (exponential in N), and O(N·2^(N-1)) weights (superexponential in N).
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function? 48
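Assuming the worst-case (irreducible) function here is the N-variable parity/XOR suggested by the checkerboard Karnaugh map, a quick sketch can confirm the 2^(N-1) count by enumerating the minterms that a DNF (one-hidden-layer) MLP would need:

```python
import itertools

def parity(bits):
    return sum(bits) % 2

# Count minterms (input patterns with output 1) of the N-variable parity
# function: each needs its own hidden unit in the DNF (one-hidden-layer) MLP,
# since the checkerboard Karnaugh map cannot be reduced.
for n in range(2, 11):
    minterms = sum(parity(x) for x in itertools.product([0, 1], repeat=n))
    assert minterms == 2 ** (n - 1)
    # Each hidden unit connects to all n inputs, so roughly n * 2^(n-1) weights.
    print(f"N={n}: {minterms} hidden units, ~{n * minterms} weights into the hidden layer")
```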
Width of a single-layer Boolean network
[Karnaugh map over 6 variables: UV × WX × YZ grid]
Can be generalized: will require 2^(N-1) perceptrons in the hidden layer, exponential in N.
How many units if we use multiple layers? How many weights?
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function? 49
Width of a deep network
[Figure: two Karnaugh maps, one over 4 variables (WX × YZ) and one over 6 variables (UV × WX × YZ)] 50
Multi-layer perceptron XOR
[Figure: XOR network over inputs X and Y: a hidden layer of two units feeding one output unit, with weights of 1, -1 and 2 as shown]
• An XOR takes three perceptrons
– 6 weights and three threshold values
• 9 total parameters 51
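As a concrete sketch (one possible set of weights and thresholds, not necessarily the one drawn on the slide), the three threshold units below realize XOR with 6 weights and 3 thresholds, i.e. 9 parameters:

```python
def unit(x, weights, threshold):
    # A single perceptron: fires if the weighted sum reaches the threshold.
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)

def xor(x, y):
    h1 = unit((x, y), (1, 1), 1)       # OR
    h2 = unit((x, y), (-1, -1), -1)    # NAND
    return unit((h1, h2), (1, 1), 2)   # AND of the two hidden units

assert [xor(0, 0), xor(0, 1), xor(1, 0), xor(1, 1)] == [0, 1, 1, 0]
# 3 perceptrons, each with 2 weights and 1 threshold: 9 parameters in total.
```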
Width of a deep network
[Karnaugh map: rows WX, columns YZ] 9 perceptrons
[Figure: deep network over inputs W, X, Y, Z]
• An XOR needs 3 perceptrons
• This network will require 3x3 = 9 perceptrons
– 27 parameters 52
Width of a deep network
[Karnaugh map over 6 variables: UV × WX × YZ grid] 15 perceptrons
[Figure: deep network over inputs U, V, W, X, Y, Z]
• An XOR needs 3 perceptrons
• This network will require 3x5 = 15 perceptrons
– 45 parameters 53
Width of a deep network
[Karnaugh map over 6 variables: UV × WX × YZ grid]
More generally, the XOR of N variables will require 3(N-1) perceptrons (and 9(N-1) parameters).
[Figure: deep network over inputs U, V, W, X, Y, Z]
• An XOR needs 3 perceptrons
• This network will require 3x5 = 15 perceptrons
– 45 parameters 54
Width of a single-layer Boolean network
[Karnaugh map over 6 variables: UV × WX × YZ grid]
Single hidden layer: will require 2^(N-1)+1 perceptrons in all (including the output unit), exponential in N.
A deep network will require only 3(N-1) perceptrons (with 9(N-1) parameters): linear in N!!!
These can be arranged in only 2·log2(N) layers.
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function? 55
A better representation
[Figure: tree of pairwise XORs over inputs X1 … XN]
• Only 2·log2(N) layers
– By pairing terms
– 2 layers per XOR … 56
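A sketch of this pairing construction (assumed details: the same 3-perceptron XOR as in the earlier sketch, applied in a balanced tree). It uses N-1 XOR gates, i.e. 3(N-1) perceptrons, arranged in roughly 2·log2(N) layers:

```python
import itertools

def unit(x, w, t):
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= t)

def xor(a, b):
    # 3 threshold units per XOR, as on the earlier slide.
    return unit((unit((a, b), (1, 1), 1), unit((a, b), (-1, -1), -1)), (1, 1), 2)

def xor_tree(bits):
    # Balanced tree of pairwise XORs: N-1 XOR gates = 3(N-1) perceptrons,
    # ~log2(N) XOR stages = ~2*log2(N) layers of perceptrons.
    values = list(bits)
    while len(values) > 1:
        paired = [xor(values[i], values[i + 1]) for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # an odd element passes through to the next stage
            paired.append(values[-1])
        values = paired
    return values[0]

for x in itertools.product([0, 1], repeat=8):
    assert xor_tree(x) == sum(x) % 2
```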
The challenge of depth
[Figure: networks computing Z = X1 ⊕ X2 ⊕ … ⊕ XN with different numbers of layers]
• Using only K hidden layers will require O(2^(CN)) neurons in the Kth layer, where the constant C depends on K
– Because the output can be shown to be the XOR of all the outputs of the K-1th hidden layer
– I.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the minimum required number of neurons cannot model the function 57
Recap: The need for depth • Deep Boolean MLPs that scale linearly with the number of inputs … • … can become exponentially large if recast using only one layer • It gets worse.. 58
The need for depth
[Figure: network with intermediate outputs a, b, c, d, e, f over inputs X1 X2 X3 X4 X5]
• The wide function can happen at any layer
• Having a few extra layers can greatly reduce network size 59
Depth vs Size in Boolean Circuits
• The XOR is really a parity problem
• Any constant-depth Boolean circuit using AND, OR and NOT gates with unbounded fan-in must have superpolynomial size to compute parity
– Parity, Circuits, and the Polynomial-Time Hierarchy, M. Furst, J. B. Saxe, and M. Sipser, Mathematical Systems Theory 1984
– Alternately stated:
• The set of constant-depth, polynomial-size circuits of unbounded fan-in elements does not include parity 60
Caveat 1: Not all Boolean functions..
• Not all Boolean circuits have such a clear depth-vs-size tradeoff
• Shannon’s theorem: For n > 2, there is a Boolean function of n variables that requires at least 2^n/n gates
– More correctly, for large n, almost all n-input Boolean functions need more than 2^n/n gates
• Note: If all Boolean functions over n inputs could be computed using a circuit of size that is polynomial in n, P = NP! 61
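The counting argument behind Shannon's bound, sketched here for completeness (standard reasoning, not spelled out on the slide; c is an unspecified constant):

```latex
% There are 2^{2^n} distinct Boolean functions of n inputs, but only about
% (c (n+s)^2)^s distinct circuits with s binary gates (each gate chooses an
% operation and two inputs from among the n inputs and the other gates).
% If every function had a circuit of size s, we would need
(c\,(n+s)^2)^{s} \;\ge\; 2^{2^{n}}
\quad\Longrightarrow\quad
s \,\log_2\!\bigl(c\,(n+s)^2\bigr) \;\ge\; 2^{n}
\quad\Longrightarrow\quad
s \;=\; \Omega\!\left(\frac{2^{n}}{n}\right).
```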
Network size: summary • An MLP is a universal Boolean function • But can represent a given function only if – It is sufficiently wide – It is sufficiently deep – Depth can be traded off for (sometimes) exponential growth of the width of the network • Optimal width and depth depend on the number of variables and the complexity of the Boolean function – Complexity: minimal number of terms in DNF formula to represent it 62
Story so far • Multi-layer perceptrons are Universal Boolean Machines • Even a network with a single hidden layer is a universal Boolean machine – But a single-layer network may require an exponentially large number of perceptrons • Deeper networks may require far fewer neurons than shallower networks to express the same function – Could be exponentially smaller 63
Caveat 2
• Used a simple “Boolean circuit” analogy for explanation
• We actually have a threshold circuit (TC), not just a Boolean circuit (AC)
– Specifically composed of threshold gates
• More versatile than Boolean gates
– E.g. “at least K inputs are 1” is a single TC gate, but an exponential-size AC
– For fixed depth, Boolean circuits ⊂ threshold circuits (strict subset)
– A depth-2 TC parity circuit can be composed with O(n^2) weights
• But a network of depth log(n) requires only O(n) weights
– But more generally, for large n, for most Boolean functions, a threshold circuit that is polynomial in n at optimal depth becomes exponentially large at lesser depths
• Other formal analyses typically view neural networks as arithmetic circuits
– Circuits which compute polynomials over any field
• So let’s consider functions over the field of reals 64
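A small sketch of the versatility claim: “at least K of the N inputs are 1” is a single threshold gate, while an AND/OR (DNF) realization needs one term per K-subset of inputs, C(N, K) of them, which is exponential for K near N/2:

```python
from itertools import combinations, product

def at_least_k_threshold(x, k):
    # A single threshold gate: all weights 1, threshold k.
    return int(sum(x) >= k)

def at_least_k_dnf(x, k):
    # AND/OR realization: one AND term per k-subset of inputs,
    # i.e. C(n, k) terms in total.
    n = len(x)
    return int(any(all(x[i] for i in subset) for subset in combinations(range(n), k)))

n, k = 10, 5
for x in product([0, 1], repeat=n):
    assert at_least_k_threshold(x, k) == at_least_k_dnf(x, k)
```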
Today • Multi-layer Perceptrons as universal Boolean functions – The need for depth • MLPs as universal classifiers – The need for depth • MLPs as universal approximators • A discussion of optimal depth and width • Brief segue: RBF networks 65
The MLP as a classifier
[Figure: MLP over a 784-dimensional input (MNIST)]
• MLP as a function over real inputs
• MLP as a function that finds a complex “decision boundary” over a space of reals 66
A Perceptron on Reals
[Figure: perceptron over real inputs x1 … xN; the boundary w1·x1 + w2·x2 = T splits the (x1, x2) plane into a region labeled 1 and a region labeled 0]
• A perceptron operates on real-valued vectors
– It fires (outputs 1) if Σi wi·xi ≥ T, and outputs 0 otherwise
– This is a linear classifier 67
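A minimal sketch of such a unit (the weights and threshold below are made-up values for illustration):

```python
import numpy as np

def perceptron(x, w, T):
    # Linear classifier: fires when the weighted sum reaches the threshold T,
    # i.e. for points on one side of the hyperplane w . x = T.
    return int(np.dot(w, x) >= T)

# Hypothetical boundary w1*x1 + w2*x2 = T, as in the slide's figure.
w, T = np.array([1.0, 2.0]), 1.5
print(perceptron(np.array([2.0, 1.0]), w, T))   # 1: on the firing side of the line
print(perceptron(np.array([-1.0, 0.5]), w, T))  # 0: on the other side
```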
Boolean functions with a real perceptron
[Figure: three unit squares over (X, Y) with corners (0,0), (0,1), (1,0), (1,1); shaded half-planes show the inputs classified as 1]
• Boolean perceptrons are also linear classifiers
– Purple regions are 1 68
Composing complicated “decision” boundaries
Can now be composed into “networks” to compute arbitrary classification “boundaries”
[Figure: coloured region in the (x1, x2) plane]
• Build a network of units with a single output that fires if the input is in the coloured area 69
Booleans over the reals
[Figure: coloured region in the (x1, x2) plane]
• The network must fire if the input is in the coloured area 70
Booleans over the reals
[Figure: pentagon bounded by five half-planes; units y1 … y5 fire on the pentagon's side of each line, their outputs are summed, and the output unit thresholds the sum (AND). Region labels show the sum: 5 inside, 3 and 4 outside]
• The network must fire if the input is in the coloured area 75
More complex decision boundaries
[Figure: two polygons in the (x1, x2) plane; an AND unit for each polygon and an OR unit combining them]
• Network to fire if the input is in the yellow area
– “OR” two polygons
– A third layer is required 76
Complex decision boundaries • Can compose arbitrarily complex decision boundaries 77
Complex decision boundaries
[Figure: network of AND units over half-plane units, combined by an OR, over inputs x1, x2]
• Can compose arbitrarily complex decision boundaries 78
Complex decision boundaries
[Figure: network of AND units over half-plane units, combined by an OR, over inputs x1, x2]
• Can compose arbitrarily complex decision boundaries
– With only one hidden layer!
– How? 79
Exercise: compose this with one hidden layer
[Figure: decision boundary in the (x1, x2) plane]
• How would you compose the decision boundary to the left with only one hidden layer? 80
Composing a Square decision boundary
[Figure: square bounded by four half-plane units y1 … y4; region sums are 4 inside the square and smaller (2 or 3) outside]
The output unit checks Σ_{i=1}^{4} y_i ≥ 4?
• The polygon net 81
Composing a pentagon
[Figure: pentagon bounded by five half-plane units y1 … y5; region sums are 5 inside and 2, 3 or 4 outside]
The output unit checks Σ_{i=1}^{5} y_i ≥ 5?
• The polygon net 82
Composing a hexagon
[Figure: hexagon bounded by six half-plane units y1 … y6; region sums are 6 inside and 3, 4 or 5 outside]
The output unit checks Σ_{i=1}^{6} y_i ≥ 6?
• The polygon net 83
How about a heptagon • What are the sums in the different regions? – A pattern emerges as we consider N > 6.. 84
16 sides • What are the sums in the different regions? – A pattern emerges as we consider N > 6.. 85
64 sides • What are the sums in the different regions? – A pattern emerges as we consider N > 6.. 86
1000 sides • What are the sums in the different regions? – A pattern emerges as we consider N > 6.. 87
Polygon net
The output unit checks Σ_{i=1}^{N} y_i ≥ N?
[Figure: N half-plane units y1 … yN over inputs x1, x2]
• Increasing the number of sides reduces the area outside the polygon that has N/2 < Sum < N 88
In the limit
The output unit checks Σ_{i=1}^{N} y_i ≥ N?
[Figure: plot of the sum as a function of position; it equals N inside the circle and falls toward N/2 outside]
• For small radius, it’s a near perfect cylinder
– N in the cylinder, N/2 outside 89
Composing a circle
The output unit checks Σ_{i=1}^{N} y_i ≥ N?
[Figure: the sum is N inside the circle and approaches N/2 outside]
• The circle net
– Very large number of neurons
– Sum is N inside the circle, N/2 outside almost everywhere
– Circle can be at any location 90
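A numerical sketch of this claim (assumed construction: N half-plane units whose boundary lines are tangent to the circle, evaluated at random points):

```python
import numpy as np

rng = np.random.default_rng(0)

def circle_net_sum(points, center, radius, N=1000):
    # N half-plane units whose boundary lines are tangent to the circle:
    # unit i fires when normal_i . (p - center) <= radius,
    # i.e. for points on the circle's side of the i-th tangent line.
    angles = 2 * np.pi * np.arange(N) / N
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return ((points - center) @ normals.T <= radius).sum(axis=1)

center, radius = np.array([0.0, 0.0]), 1.0
inside = rng.uniform(-0.5, 0.5, size=(1000, 2))    # points well inside the circle
far_out = rng.uniform(20.0, 30.0, size=(1000, 2))  # points far outside the circle
print(circle_net_sum(inside, center, radius).min())    # N: every unit fires inside
print(circle_net_sum(far_out, center, radius).mean())  # close to N/2 far outside
```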
Composing a circle
Subtracting the bias N/2, the output unit checks Σ_{i=1}^{N} y_i − N/2 > 0?
[Figure: the biased sum is N/2 inside the circle and 0 outside]
• The circle net
– Very large number of neurons
– Sum is N/2 inside the circle, 0 outside almost everywhere
– Circle can be at any location 91
Adding circles
The output unit checks Σ_{i=1}^{2N} y_i − N > 0?
• The “sum” of two circle subnets is exactly N/2 inside either circle, and 0 almost everywhere outside 92
Composing an arbitrary figure
The output unit checks whether the total over all K circle subnets, Σ_{i=1}^{KN} y_i − K·N/2, is > 0
• Just fit in an arbitrary number K of circles
– More accurate approximation with greater number of smaller circles
– Can achieve arbitrary precision 93
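A sketch of adding circle subnets (assumed construction as above; the circles and points are made-up illustrative values, and the final margin of 50 ≈ N/4 is a practical tolerance, since points outside give values only close to 0 rather than the idealized exact 0):

```python
import numpy as np

def circle_subnet(points, center, radius, N=200):
    # N half-plane units tangent to one circle, minus the N/2 bias:
    # roughly N/2 for points inside the circle, near 0 for points far outside.
    angles = 2 * np.pi * np.arange(N) / N
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    fired = ((points - np.asarray(center)) @ normals.T <= radius).sum(axis=1)
    return fired - N / 2

# Hypothetical target figure: the union of three circles.
circles = [((0.0, 0.0), 1.0), ((1.5, 0.0), 1.0), ((0.7, 1.2), 0.8)]
pts = np.array([[0.0, 0.0],     # inside the figure
                [20.0, 20.0]])  # far outside the figure
total = sum(circle_subnet(pts, c, r) for c, r in circles)
print(total)                     # large for the first point, near 0 for the second
print((total > 50).astype(int))  # [1 0]; a small margin replaces the idealized "> 0"
```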
MLP: Universal classifier
The output unit checks whether the total over all K circle subnets, Σ_{i=1}^{KN} y_i − K·N/2, is > 0
• MLPs can capture any classification boundary
• A one-hidden-layer MLP can model any classification boundary
• MLPs are universal classifiers 94
Depth and the universal classifier
[Figure: decision boundary over inputs x1, x2 and the corresponding networks]
• Deeper networks can require far fewer neurons 95
Optimal depth..
• Formal analyses typically view these as a category of arithmetic circuits
– Compute polynomials over any field
• Valiant et al.: a polynomial of degree n requires a network of depth log(n)
– Cannot be computed with shallower networks
• Bengio et al.: show a similar result for sum-product networks
– But only considers two-input units
– Generalized by Mhaskar et al. to all functions that can be expressed as a binary tree
– Depth/Size analyses of arithmetic circuits still a research problem 96
Special case: Sum-product nets
• “Shallow vs deep sum-product networks,” Olivier Delalleau and Yoshua Bengio
– For networks where layers alternately perform either sums or products, a deep network may require exponentially fewer units than a shallow one 97
Depth in sum-product networks 98
Optimal depth in generic nets • We look at a different pattern: – “worst case” decision boundaries • For threshold-activation networks – Generalizes to other nets 99
Optimal depth
The output unit checks whether the total over all K circle subnets, Σ_{i=1}^{KN} y_i − K·N/2, is > 0
• A one-hidden-layer neural network will require infinitely many hidden neurons 100