Neural Networks and Autodifferentiation CMSC 678 UMBC
Recap from last time…
Maximum Entropy (Log-linear) Models

p(y | x) ∝ exp(θᵀ f(x, y))

"model the posterior probabilities of the K classes via linear functions in θ, while at the same time ensuring that they sum to one and remain in [0, 1]" ~ Ch 4.4

"[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information." Jaynes, 1957
Normalization for Classification

Z = Σ_label exp( weight_1 * f_1(fatally shot, x) + weight_2 * f_2(seriously wounded, x) + weight_3 * f_3(Shining Path, x) + … )
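As a concrete illustration (not from the slides), here is a minimal Python sketch of this normalization; the label set and the per-label scores θᵀ f(label, x) are made-up numbers.

```python
import numpy as np

# Hypothetical scores theta^T f(label, x) for each candidate label of one input x.
scores = {"attack": 2.1, "accident": 0.3, "other": -1.0}

# Partition function Z: sum over labels of exp(score).
Z = sum(np.exp(s) for s in scores.values())

# Normalized posterior p(y | x) = exp(score) / Z; values sum to 1 and lie in [0, 1].
posterior = {label: np.exp(s) / Z for label, s in scores.items()}
print(posterior)
```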
Connections to Other Techniques

Log-Linear Models are also known as:
(Multinomial) logistic regression
Softmax regression
Maximum Entropy models (MaxEnt)
Generalized Linear Models
Discriminative Naïve Bayes
Very shallow (sigmoidal) neural nets

As generalized linear models: a linear predictor z = Σ_j θ_j x_j + β, where the response can be a general (transformed) version of another response; e.g., in logistic regression the log-odds are linear:

log [ p(y = k) / p(y = K) ] = Σ_j θ_j f_j(x, k) + β_k
Log-Likelihood Gradient

Each component k is the difference between:
the total value of feature f_k in the training data, and
the total value the current model p_θ thinks it computes for feature f_k:

∂L/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i E_{y' ~ p_θ(·|x_i)}[ f_k(x_i, y') ]
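A minimal sketch (not from the slides) of this gradient as observed-minus-expected feature counts; the `features(x, y)` helper and its signature are assumptions made for illustration.

```python
import numpy as np

def loglik_gradient(theta, features, x_data, y_data, labels):
    """Maxent log-likelihood gradient: observed minus model-expected feature counts.

    features(x, y) -> feature vector (same shape as theta); hypothetical helper."""
    grad = np.zeros_like(theta)
    for x, y in zip(x_data, y_data):
        # observed feature counts from the training data
        grad += features(x, y)
        # expected feature counts under the current model p_theta(y' | x)
        scores = np.array([theta @ features(x, yp) for yp in labels])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        for p, yp in zip(probs, labels):
            grad -= p * features(x, yp)
    return grad
```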
Outline

Neural networks: non-linear classifiers
Learning weights: backpropagation of error
Autodifferentiation (in reverse mode)
Sigmoid

σ(x) = 1 / (1 + exp(−s x))

[Plot: sigmoid curves for steepness s = 0.5, 1, 10]
Sigmoid

σ(x) = 1 / (1 + exp(−s x))

dσ(x)/dx = s · σ(x) · (1 − σ(x))

(calc practice: verify this for yourself)

[Plot: sigmoid curves for steepness s = 0.5, 1, 10]
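One way to do the suggested calculus practice numerically; this sketch (not part of the slides) compares the stated derivative against a finite-difference estimate.

```python
import numpy as np

def sigmoid(x, s=1.0):
    # sigma(x) = 1 / (1 + exp(-s*x))
    return 1.0 / (1.0 + np.exp(-s * x))

def sigmoid_grad(x, s=1.0):
    # d sigma / dx = s * sigma(x) * (1 - sigma(x))
    sig = sigmoid(x, s)
    return s * sig * (1.0 - sig)

# Finite-difference check of the derivative at a few points, steepness s = 10.
x = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (sigmoid(x + eps, s=10) - sigmoid(x - eps, s=10)) / (2 * eps)
print(np.allclose(numeric, sigmoid_grad(x, s=10)))  # should print True
```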
Remember Multi-class Linear Regression/Perceptron?

[Diagram: input x, weights w, output y]

y = wᵀ x + b

output: if y > 0: class 1, else: class 2
Linear Regression/Perceptron: A Per-Class View

Single-output view: y = wᵀ x + b; output: if y > 0: class 1, else: class 2

Per-class view: y_1 = w_1ᵀ x + b_1, y_2 = w_2ᵀ x + b_2, i.e. y = W x + b (the w_k are rows of W); output: i = argmax{y_1, y_2}, predict class i

The binary version is a special case.
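A tiny sketch (not from the slides) of the per-class view with made-up weights: one linear score per class, then an argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_features = 3, 5

# One weight vector (row of W) and bias per class; random values for illustration.
W = rng.normal(size=(num_classes, num_features))
b = rng.normal(size=num_classes)

x = rng.normal(size=num_features)

# Per-class scores y_k = w_k^T x + b_k, then predict the argmax class.
scores = W @ x + b
print("predicted class:", int(np.argmax(scores)))
```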
Logistic Regression/Classification

y_1 ∝ exp(w_1ᵀ x + b_1)
y_2 ∝ exp(w_2ᵀ x + b_2)

Binary case: y = σ(wᵀ x + b); multiclass: y = softmax(W x + b)

output: i = argmax{y_1, y_2}, predict class i
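A minimal sketch (not from the slides) of the softmax version; the weights and input are random numbers for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; result sums to 1.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 4))   # one weight vector per class
b = rng.normal(size=2)
x = rng.normal(size=4)

y = softmax(W @ x + b)        # class probabilities
print(y, "predicted class:", int(np.argmax(y)))
```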
Logistic Regression/Classification

y_1 ∝ exp(w_1ᵀ x + b_1)
y_2 ∝ exp(w_2ᵀ x + b_2)

output: i = argmax{y_1, y_2}, predict class i

Q: Why didn't our maxent formulation from last class have multiple weight vectors?
Logistic Regression/Classification

Q: Why didn't our maxent formulation from last class have multiple weight vectors?

A: Implicitly it did. Our formulation was y ∝ exp(θᵀ f(x, y)): the per-class weights are folded into the single θ through the label-dependent features f(x, y).
Stacking Logistic Regression

[Diagram: input x feeds several logistic units (weights w_1, …, w_4) producing h; h feeds weights V producing y]

Goal: you still want to predict y
Idea: Can making an initial round of separate (independent) binary predictions h help?

h_i = σ(w_iᵀ x + b_0)
Stacking Logistic Regression

h_i = σ(w_iᵀ x + b_0)
y = softmax(V h + b_1)

Predict y from your first round of predictions h
Idea: data/signal compression
Stacking Logistic Regression

h_i = σ(w_iᵀ x + b_0)
y = softmax(V h + b_1)

Do we need (binary) probabilities for h here?
Stacking Logistic Regression

h_i = F(w_iᵀ x + b_0)
y = softmax(V h + b_1)

F: (non-linear) activation function

Do we need probabilities here?
Stacking Logistic Regression

h_i = F(w_iᵀ x + b_0)
y = softmax(V h + b_1)

F: (non-linear) activation function

Do we need probabilities here?
Classification: probably
Regression: not really
Stacking Logistic Regression

h_i = F(w_iᵀ x + b_0)
y = G(V h + b_1)

F: (non-linear) activation function
G: (non-linear) activation function; classification: softmax, regression: identity
Multilayer Perceptron, a.k.a. Feed-Forward Neural Network

h_i = F(w_iᵀ x + b_0)
y = G(V h + b_1)

F: (non-linear) activation function
G: (non-linear) activation function; classification: softmax, regression: identity
Feed-Forward Neural Network

h = F(W x + b_0)
y = G(V h + b_1)

W: # hidden × # input
V: # output × # hidden
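A minimal forward-pass sketch (not from the slides) with the dimensions above; the layer sizes and random weights are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

num_input, num_hidden, num_output = 4, 8, 3
rng = np.random.default_rng(2)

W = rng.normal(size=(num_hidden, num_input))    # first layer:  # hidden x # input
b0 = np.zeros(num_hidden)
V = rng.normal(size=(num_output, num_hidden))   # second layer: # output x # hidden
b1 = np.zeros(num_output)

x = rng.normal(size=num_input)
h = 1.0 / (1.0 + np.exp(-(W @ x + b0)))         # F: sigmoid activation
y = softmax(V @ h + b1)                         # G: softmax for classification
print(y)
```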
Why Non-Linear?

Substituting the hidden layer into the output:

y_j = G( Σ_i v_{ji} F(w_iᵀ x + b_0) + b_1 )

If F and G were both linear, the whole network would collapse to a single linear map of x, so the extra layer would add no representational power.
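A small numerical illustration (not from the slides) of that collapse: with identity "activations", two stacked layers equal one linear map; the sizes and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
W, b0 = rng.normal(size=(8, 4)), rng.normal(size=8)
V, b1 = rng.normal(size=(3, 8)), rng.normal(size=3)

# With identity activations, the two layers collapse to one linear map.
M = V @ W                 # combined weights
c = V @ b0 + b1           # combined bias

x = rng.normal(size=4)
two_layer = V @ (W @ x + b0) + b1
one_layer = M @ x + c
print(np.allclose(two_layer, one_layer))  # True: no extra power without a non-linearity
```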
Feed-Forward

[Diagram: x → h → y]

Information/computation flows in one direction: no self-loops (no recurrence/reuse of weights).
Why "Neural?"

Argue from a neuroscience perspective: neurons (in the brain) receive input and "fire" when sufficiently excited/activated.

Image courtesy Hamed Pirsiavash
Universal Function Approximator

Theorem [Kurt Hornik et al., 1989]: Let F be a continuous function on a bounded subset of D-dimensional space. Then there exists a two-layer network G with a finite number of hidden units that approximates F arbitrarily well: for all x in the domain of F, |F(x) − G(x)| < ε.

"a two-layer network can approximate any function"

Going from one to two layers dramatically improves the representation power of the network.

Slide courtesy Hamed Pirsiavash
How Deep Can They Be?

So many choices:
architecture
# of hidden layers
# of units per hidden layer

Computational issues:
vanishing gradients: gradients shrink as one moves away from the output layer
convergence is slow

Opportunities (training deep networks is an active area of research):
layer-wise initialization (perhaps using unsupervised data)
engineering: GPUs to train on massive labelled datasets

Slide courtesy Hamed Pirsiavash
Some Results: Digit Classification

[Figure: results comparing logistic regression with a simple feed-forward network]

(similar to MNIST in A2, but not exactly the same)

ESL, Ch 11
Tensorflow Playground

http://playground.tensorflow.org

Experiment with neural networks on small (toy) datasets in your browser. Feel free to use this to gain intuition.
Outline

Neural networks: non-linear classifiers
Learning weights: backpropagation of error
Autodifferentiation (in reverse mode)
Empirical Risk Minimization

Cross entropy loss:          ℓ_xent(y*, y) = − Σ_k y*[k] log p(y = k)
Mean squared error/L2 loss:  ℓ_L2(y*, y) = (y* − y)²
Squared expectation loss:    ℓ_sq-expt(y*, y) = (y* − E[y])²
Hinge loss:                  ℓ_hinge(y*, y) = max(0, 1 + max_{k ≠ y*} y[k] − y[y*])
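Minimal sketches (not from the slides) of a few of these losses in NumPy, assuming a one-hot y* for the cross entropy and raw per-class scores for the hinge loss.

```python
import numpy as np

def cross_entropy(y_true_onehot, probs):
    # -sum_k y*[k] * log p(y = k)
    return -np.sum(y_true_onehot * np.log(probs))

def l2_loss(y_true, y_pred):
    # (y* - y)^2, summed over components
    return np.sum((y_true - y_pred) ** 2)

def hinge_loss(true_class, scores):
    # max(0, 1 + max_{k != y*} scores[k] - scores[y*])
    others = np.delete(scores, true_class)
    return max(0.0, 1.0 + others.max() - scores[true_class])

probs = np.array([0.7, 0.2, 0.1])
print(cross_entropy(np.array([1.0, 0.0, 0.0]), probs))
print(hinge_loss(0, np.array([2.0, 0.5, -1.0])))
```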
Gradient Descent: Backpropagate the Error

Set t = 0
Pick a starting value θ_t
Until converged:
  for example(s) i:
    1. Compute loss l on x_i
    2. Get gradient g_t = l'(x_i)
    3. Get scaling factor ρ_t
    4. Set θ_{t+1} = θ_t − ρ_t * g_t
    5. Set t += 1

epoch: a single run over all the training data
(mini-)batch: a run over a subset of the data
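A minimal sketch (not from the slides) of this loop; the `loss_grad` callback and the particular decaying scaling factor ρ_t are assumptions made for illustration.

```python
import numpy as np

def sgd(theta, data, loss_grad, lr=0.1, epochs=5):
    """Minimal (stochastic) gradient descent sketch.

    loss_grad(theta, x, y) -> gradient of the loss on one example;
    this callback signature is hypothetical."""
    t = 0
    for _ in range(epochs):            # one epoch = a single run over all the training data
        for x, y in data:              # batch size 1 here; a mini-batch would group examples
            g = loss_grad(theta, x, y)
            rho = lr / np.sqrt(t + 1)  # one possible decaying scaling factor rho_t
            theta = theta - rho * g
            t += 1
    return theta
```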
Gradients for Feed-Forward Neural Network

ℓ = − Σ_k y*[k] log y_k,   with y_k ∝ F(w_kᵀ x + b_0), i.e. y_k = F(w_kᵀ x + b_0) / Σ_j F(w_jᵀ x + b_0)

(ℓ is a scalar; y* and y are vectors)

Chain rule for a weight w_{ki} feeding output unit k:

∂ℓ/∂w_{ki} = (∂ℓ/∂y_k) · (∂y_k/∂w_{ki})

∂ℓ/∂y_k = − y*[k] / y_k

∂y_k/∂w_{ki}: differentiate the normalized form above (quotient rule); the result combines x_i, F'(w_kᵀ x + b_0), and the normalizer Σ_j F(w_jᵀ x + b_0).
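Not from the slides themselves, but a small numerical check of this chain rule for the special case F = exp (a softmax output layer), where the full gradient simplifies to the well-known form ∂ℓ/∂W = (y − y*) xᵀ; the finite-difference comparison below is a sketch using made-up numbers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(W, b, x, y_true):
    # cross entropy: -sum_k y*[k] log y_k with y = softmax(W x + b)
    y = softmax(W @ x + b)
    return -np.sum(y_true * np.log(y))

rng = np.random.default_rng(4)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
x = rng.normal(size=5)
y_true = np.array([0.0, 1.0, 0.0])

# Analytic gradient for the softmax special case: dl/dW = (y - y*) x^T.
y = softmax(W @ x + b)
grad_analytic = np.outer(y - y_true, x)

# Finite-difference check of one entry, dl/dw_{k,i}.
k, i, eps = 1, 2, 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[k, i] += eps
W_minus[k, i] -= eps
grad_numeric = (loss(W_plus, b, x, y_true) - loss(W_minus, b, x, y_true)) / (2 * eps)
print(np.isclose(grad_numeric, grad_analytic[k, i]))  # should print True
```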