TensorFlow Workshop 2018
Understanding Neural Networks
Part I: Artificial Neurons and Network Optimization

Nick Winovich
Department of Mathematics, Purdue University
July 2018

SIAM@Purdue 2018
Outline

1  Neural Networks
   • Artificial Neurons and Hidden Layers
   • Universal Approximation Theorem
   • Regularization and Batch Norm

2  Network Optimization
   • Evaluating Network Performance
   • Stochastic Gradient Descent Algorithms
   • Backprop and Automatic Differentiation
Artificial Neural Networks

Neural networks are a class of simple, yet effective, “computing systems” with a diverse range of applications. In these systems, small computational units, or nodes, are arranged to form networks in which connectivity is leveraged to carry out complex calculations.

• Deep Learning by Goodfellow, Bengio, and Courville: http://www.deeplearningbook.org/
• Convolutional Neural Networks for Visual Recognition at Stanford: http://cs231n.stanford.edu/
Artificial Neurons

Diagram modified from a Stack Exchange post answered by Gonzalo Medina.

• Weights are first used to scale inputs; the results are summed with a bias term and passed through an activation function.
Formula and Vector Representation

The diagram from the previous slide can be interpreted as:

    y = f(x_1 · w_1 + x_2 · w_2 + x_3 · w_3 + b)

which can be conveniently represented in vector form via:

    y = f(w^T x + b)

by interpreting the neuron inputs and weights as column vectors.
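As a concrete illustration (not taken from the workshop materials), the vector form can be evaluated directly in NumPy; the input values, weights, bias, and the choice of ReLU as the activation f are arbitrary placeholders:

```python
import numpy as np

def relu(z):
    # Rectified linear unit: max(0, z)
    return np.maximum(z, 0.0)

# Arbitrary example values for a single three-input neuron
x = np.array([0.5, -1.0, 2.0])   # inputs  x_1, x_2, x_3
w = np.array([0.1,  0.4, -0.3])  # weights w_1, w_2, w_3
b = 0.2                          # bias

# y = f(w^T x + b)
y = relu(np.dot(w, x) + b)
print(y)
```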
Artificial Neurons: Multiple Outputs

(Diagram: a single set of inputs feeding two output neurons, each with its own weights and bias.)
Matrix Representation

This corresponds to a pair of equations, one for each output:

    y_1 = f(w_1^T x + b_1)
    y_2 = f(w_2^T x + b_2)

which can be represented in matrix form by the system:

    y = f(W x + b)

where we assume the activation function has been vectorized.
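As a sketch (with arbitrary dimensions and random parameter values), the matrix form can be written directly in NumPy, again using ReLU as the vectorized activation purely for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

N, M = 3, 2                      # number of inputs and outputs
rng = np.random.default_rng(0)

W = rng.normal(size=(M, N))      # weight matrix, one row w_i per output
b = rng.normal(size=M)           # bias vector
x = rng.normal(size=N)           # input vector

# y = f(W x + b), with f applied elementwise ("vectorized")
y = relu(W @ x + b)
print(y.shape)                   # (2,)
```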
Fully-Connected Neural Layers

• The resulting layers, referred to as fully-connected or dense, can be visualized as a collection of nodes connected by edges corresponding to weights (bias terms and activations are typically omitted from such diagrams).
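In TensorFlow, this affine-plus-activation pattern is packaged as a dense layer; below is a minimal Keras sketch, where the layer sizes and the ReLU activation are arbitrary choices rather than anything prescribed by the slides:

```python
import tensorflow as tf

# A fully-connected (dense) layer mapping 3 inputs to 2 outputs,
# i.e. y = f(W x + b) with f = ReLU
layer = tf.keras.layers.Dense(units=2, activation="relu")

x = tf.constant([[0.5, -1.0, 2.0]])  # a batch containing one input vector
y = layer(x)                         # W and b are created on the first call
print(y.shape)                       # (1, 2)
```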
Floating Point Operation Count

Matrix-Vector Multiplication

    [ w_11 ... w_1N ] [ x_1 ]     [ w_11·x_1 + ... + w_1N·x_N ]
    [   ⋮   ⋱    ⋮  ] [  ⋮  ]  =  [             ⋮             ]
    [ w_M1 ... w_MN ] [ x_N ]     [ w_M1·x_1 + ... + w_MN·x_N ]

    Multiplications:  MN        (one product w_ij · x_j per matrix entry)
    Additions:        M(N − 1)  (summing the N products in each of the M rows)
Floating Point Operation Count So we see that when bias terms are omitted, the FLOPs required for a neural connection between N inputs and M outputs is: 2 MN − M = MN multiplies + M ( N − 1) adds When bias terms are included, an additional M addition operations are required, resulting in a total of 2 MN FLOPs. Note: This omits the computation required for applying the activation function to M values resulting from the linear operations. Depending on the activation function selected, this may or may not have a significant impact on the overall computational complexity. SIAM@Purdue 2018 - Nick Winovich Understanding Neural Networks : Part I
Activation Functions

Activation functions are a fundamental component of neural network architectures; these functions are responsible for:

• Providing all of the network’s non-linear modeling capacity
• Controlling the gradient flows that guide the training process

While activation functions play a fundamental role in all neural networks, it is still desirable to limit their computational demands (e.g. avoid defining them in terms of a Krylov subspace method...). In practice, activations such as rectified linear units (ReLUs), whose function and derivative are nearly trivial to evaluate, often suffice.
Activation Functions

Rectified Linear Unit (ReLU):

    f(x) = x,  x ≥ 0
    f(x) = 0,  x < 0

SoftPlus Activation:

    f(x) = ln(1 + exp(x))
Activation Functions

Hyperbolic Tangent Unit:

    f(x) = tanh(x)

Sigmoidal Unit:

    f(x) = 1 / (1 + exp(−x))
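A sketch of these four fixed activations in NumPy (each is vectorized, so it applies elementwise to arrays); writing softplus in a numerically stable form is an implementation choice, not something specified in the slides:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x):
    # ln(1 + exp(x)), rewritten to avoid overflow for large |x|
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
for f in (relu, softplus, np.tanh, sigmoid):
    print(f.__name__, f(x))
```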
Activation Functions (Parameterized)

Exponential Linear Unit (ELU):

    f_α(x) = x,              x ≥ 0
    f_α(x) = α · (e^x − 1),  x < 0

Leaky Rectified Linear Unit:

    f_α(x) = x,      x ≥ 0
    f_α(x) = α · x,  x < 0
Activation Functions (Learnable Parameters)

Parameterized ReLU:

    f_β(x) = x,      x ≥ 0
    f_β(x) = β · x,  x < 0

Swish Units:

    f_β(x) = x / (1 + exp(−β · x))
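A NumPy sketch of these parameterized forms; here α and β are ordinary scalars with arbitrary default values, whereas in a network the parameterized-ReLU and Swish parameters would be trainable variables:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x >= 0, alpha * (exp(x) - 1) for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def leaky_relu(x, alpha=0.1):
    # x for x >= 0, alpha * x for x < 0
    return np.where(x >= 0, x, alpha * x)

def parameterized_relu(x, beta=0.25):
    # Same form as leaky ReLU, but beta is treated as a learnable parameter
    return np.where(x >= 0, x, beta * x)

def swish(x, beta=1.0):
    # x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-3.0, 3.0, 7)
print(elu(x), leaky_relu(x), parameterized_relu(x), swish(x), sep="\n")
```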
Hidden Layers

Intermediate, or hidden, layers can be added between the input and output nodes to allow for additional non-linear processing. For example, we can first define a layer such as:

    h = f_1(W_1 x + b_1)

and construct a subsequent layer to produce the final output:

    y = f_2(W_2 h + b_2)
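As an illustration (with arbitrary layer sizes, random parameters, and ReLU/identity standing in for f_1 and f_2), the two equations correspond to the following NumPy forward pass:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 2

# Randomly initialized layer parameters, purely for illustration
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)

x = rng.normal(size=n_in)

h = relu(W1 @ x + b1)    # h = f_1(W_1 x + b_1)
y = W2 @ h + b2          # y = f_2(W_2 h + b_2), with f_2 taken to be the identity
print(y)
```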
Hidden Layers
(diagram)

Multiple Hidden Layers
(diagram)
Multiple Hidden Layers

Multiple hidden layers can easily be defined in the same way:

    h_1 = f_1(W_1 x + b_1)
    h_2 = f_2(W_2 h_1 + b_2)
    y   = f_3(W_3 h_2 + b_3)

One of the challenges of working with additional layers is the need to determine the impact that earlier layers have on the final output. This will be necessary for tuning/optimizing network parameters (i.e. weights and biases) to produce accurate predictions.
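In TensorFlow, stacking hidden layers in this way is typically done by composing dense layers; a minimal Keras sketch follows, where the layer widths and activations are arbitrary choices rather than values from the workshop:

```python
import tensorflow as tf

# y = f_3(W_3 f_2(W_2 f_1(W_1 x + b_1) + b_2) + b_3)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),  # h_1
    tf.keras.layers.Dense(8, activation="relu"),                    # h_2
    tf.keras.layers.Dense(2),                                       # y (linear output)
])

x = tf.random.normal([1, 4])   # a batch of one 4-dimensional input
print(model(x).shape)          # (1, 2)
```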
Universal Approximation Theorem
Universal Approximators: Cybenko (1989)

Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), pp. 303-314.

Basic Idea of Result: Let I_n denote the unit hypercube in R^n; the collection of functions which can be expressed in the form

    Σ_{i=1}^N α_i · σ(w_i^T x + b_i),   x ∈ I_n

is dense in the space of continuous functions C(I_n) defined on I_n; i.e., for all f ∈ C(I_n) and ε > 0 there exist constants N, α_i, w_i, b_i such that

    | f(x) − Σ_{i=1}^N α_i · σ(w_i^T x + b_i) | < ε   for all x ∈ I_n
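A quick numerical illustration of this form (a sketch only, and not Cybenko's construction): fix random inner weights w_i and biases b_i, then solve for the outer coefficients α_i by least squares to approximate a smooth function on the unit interval:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 50                                    # number of sigmoidal terms
x = np.linspace(0.0, 1.0, 200)            # sample points in I_1 = [0, 1]
f = np.sin(2.0 * np.pi * x)               # target function f in C(I_1)

# Random inner parameters; only the outer coefficients alpha are fitted
w = rng.normal(scale=10.0, size=N)
b = rng.normal(scale=10.0, size=N)
Phi = sigmoid(np.outer(x, w) + b)         # Phi[j, i] = sigma(w_i * x_j + b_i)

alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)
approx = Phi @ alpha                      # sum_i alpha_i * sigma(w_i x + b_i)
print(np.max(np.abs(f - approx)))         # sup-norm error over the sample points
```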
Universal Approximators: Hornik et al. / Funahashi

Hornik, K., Stinchcombe, M. and White, H., 1989. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359-366.

Funahashi, K.I., 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), pp. 183-192.

Summary of Results: For any compact set K ⊂ R^n, multi-layer feedforward neural networks are dense in the space of continuous functions C(K) on K, with respect to the supremum norm, provided that the activation function used for the network layers is:

• Continuous and increasing
• Non-constant and bounded