Deconstructing Data Science
David Bamman, UC Berkeley
Info 290, Lecture 16: Neural networks
Mar 16, 2017
https://www.forbes.com/sites/kevinmurnane/2016/04/01/what-is-deep-learning-and-how-is-it-useful
Neural network libraries
The perceptron, again

\hat{y}_i = \begin{cases} 1 & \text{if } \sum_{i=1}^{F} x_i \beta_i \geq 0 \\ -1 & \text{otherwise} \end{cases}

[Diagram: inputs x_1, x_2, x_3 connect directly to the output y through weights \beta_1, \beta_2, \beta_3.]

word     x     β
not      1    -0.5
bad      1    -1.7
movie    0     0.3
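A minimal NumPy sketch of this decision rule, using the toy feature values and weights from the table above (the array layout and feature ordering are my own assumptions):

```python
import numpy as np

# Toy example from the slide: features "not", "bad", "movie"
x = np.array([1.0, 1.0, 0.0])        # x_i: feature values for one document
beta = np.array([-0.5, -1.7, 0.3])   # beta_i: perceptron weights

# Decision rule: y_hat = 1 if sum_i x_i * beta_i >= 0, else -1
score = x @ beta
y_hat = 1 if score >= 0 else -1
print(score, y_hat)                  # -2.2 -> predicts -1
```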
Neural networks
• Two core ideas:
  • Non-linear activation functions
  • Multiple layers
[Diagram: a feed-forward network. Inputs x_1, x_2, x_3 connect to hidden nodes h_1, h_2 through weights W (W_{1,1}, W_{1,2}, W_{2,1}, W_{2,2}, W_{3,1}, W_{3,2}); the hidden nodes connect to the output y through weights V_1, V_2. The layers are labeled Input, "Hidden" Layer, and Output.]
[The same network, now with example values:]

word      W_{i,1}   W_{i,2}    x
not        -0.5      1.3       1
bad         0.4      0.08      1
movie       1.7      3.1       0

V_1 = 4.1, V_2 = -0.9, y = -1
h_j = f\left(\sum_{i=1}^{F} x_i W_{i,j}\right)

The hidden nodes are completely determined by the input and weights. For example, the first hidden node is

h_1 = f\left(\sum_{i=1}^{F} x_i W_{i,1}\right)
Activation functions

\sigma(z) = \frac{1}{1 + \exp(-z)}
[Plot of \sigma(z) for z \in [-10, 10].]

\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}
[Plot of \tanh(z) for z \in [-10, 10].]

\text{rectifier}(z) = \max(0, z)
[Plot of the rectifier for z \in [-10, 10].]
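A short sketch of these three activation functions in NumPy (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)); squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)); squashes into (-1, 1)
    return np.tanh(z)

def rectifier(z):
    # rectifier(z) = max(0, z); also known as the ReLU
    return np.maximum(0.0, z)

z = np.linspace(-10, 10, 5)
print(sigmoid(z), tanh(z), rectifier(z))
```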
h_1 = \sigma\left(\sum_{i=1}^{F} x_i W_{i,1}\right) \qquad h_2 = \sigma\left(\sum_{i=1}^{F} x_i W_{i,2}\right) \qquad \hat{y} = V_1 h_1 + V_2 h_2
\hat{y} = V_1 \underbrace{\sigma\left(\sum_{i=1}^{F} x_i W_{i,1}\right)}_{h_1} + V_2 \underbrace{\sigma\left(\sum_{i=1}^{F} x_i W_{i,2}\right)}_{h_2}

We can express \hat{y} as a function only of the input x and the weights W and V.
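A minimal sketch of this forward pass in NumPy, reusing the toy weights from the table a few slides back (the array layout is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy values from the earlier slide: features (not, bad, movie)
x = np.array([1.0, 1.0, 0.0])      # input features
W = np.array([[-0.5, 1.3],         # W[i, j]: weight from input i to hidden node j
              [ 0.4, 0.08],
              [ 1.7, 3.1]])
V = np.array([4.1, -0.9])          # weights from hidden nodes to the output

h = sigmoid(x @ W)                 # h_j = sigma(sum_i x_i W[i, j])
y_hat = V @ h                      # y_hat = V_1 h_1 + V_2 h_2
print(h, y_hat)
```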
\hat{y} = V_1 \underbrace{\sigma\left(\sum_{i=1}^{F} x_i W_{i,1}\right)}_{h_1} + V_2 \underbrace{\sigma\left(\sum_{i=1}^{F} x_i W_{i,2}\right)}_{h_2}

This is hairy, but differentiable.

Backpropagation: given training samples of <x, y> pairs, we can use stochastic gradient descent to find the values of W and V that minimize the loss.
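A rough sketch of a single stochastic gradient descent step for this network, assuming a squared-error loss, the sigmoid activation, and a learning rate alpha (none of these choices are specified on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x, y, W, V, alpha=0.1):
    # Forward pass
    h = sigmoid(x @ W)            # hidden activations
    y_hat = V @ h                 # prediction
    # Backward pass for the loss L = 0.5 * (y_hat - y)^2
    d_yhat = y_hat - y            # dL/dy_hat
    grad_V = d_yhat * h           # dL/dV_j = dL/dy_hat * h_j
    d_h = d_yhat * V              # dL/dh_j
    d_z = d_h * h * (1 - h)       # back through the sigmoid
    grad_W = np.outer(x, d_z)     # dL/dW[i, j] = x_i * dL/dz_j
    # Gradient descent update
    return W - alpha * grad_W, V - alpha * grad_V
```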
We can get to the maximum value of the function -x^2 by following the gradient: \frac{d}{dx}\left(-x^2\right) = -2x, using the update x \leftarrow x + \alpha(-2x) with \alpha = 0.1.

[Plot of -x^2 for x \in [-10, 10], with the iterates below marked on the curve.]

x        \alpha(-2x)
8.00     -1.60
6.40     -1.28
5.12     -1.02
4.10     -0.82
3.28     -0.66
2.62     -0.52
2.10     -0.42
1.68     -0.34
1.34     -0.27
1.07     -0.21
0.86     -0.17
0.69     -0.14
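A tiny sketch reproducing the iterates in the table above (the starting point and number of steps are taken from the table):

```python
alpha = 0.1
x = 8.0
for _ in range(12):
    step = alpha * (-2 * x)      # alpha times the gradient d/dx(-x^2) = -2x
    print(f"{x:5.2f}  {step:6.2f}")
    x = x + step                 # follow the gradient toward the maximum at x = 0
```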
Neural network structures
[Diagram: the same network with a single output node y.]
Output: one real value.
Neural network structures
[Diagram: the same network with three output nodes, here taking values 0, 1, 0.]
Multiclass: output 3 values; only one = 1 in the training data.
Neural network structures
[Diagram: the same network with three output nodes, here taking values 1, 1, 0.]
Multi-label: output 3 values; several may = 1 in the training data.
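A hedged sketch of these three output structures; using a softmax for the multiclass case and element-wise sigmoids for the multi-label case is a standard choice, assumed here rather than stated on the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

h = np.array([0.48, 0.80])             # example hidden activations

# 1) One real-valued output
V_reg = np.array([4.1, -0.9])
y_real = V_reg @ h

# 2) Multiclass: three outputs, exactly one label = 1 (softmax sums to 1)
V_multi = np.random.randn(3, 2)
y_multiclass = softmax(V_multi @ h)

# 3) Multi-label: three outputs, several labels may = 1 (independent sigmoids)
y_multilabel = sigmoid(V_multi @ h)
print(y_real, y_multiclass, y_multilabel)
```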
Regularization
• Increasing the number of parameters increases the possibility of overfitting to the training data.
Regularization
• L2 regularization: penalize W and V for being too large.
• Dropout: when training on an <x, y> pair, randomly remove some nodes and weights (see the sketch below).
• Early stopping: stop backpropagation before the training error becomes too small.
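A minimal sketch of dropout on the hidden layer during training; the dropout rate and the "inverted dropout" rescaling of the surviving nodes are my assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_dropout(x, W, V, p_drop=0.5, train=True):
    h = sigmoid(x @ W)
    if train:
        # Randomly zero out hidden nodes; rescale the survivors ("inverted dropout")
        mask = np.random.rand(h.shape[0]) >= p_drop
        h = h * mask / (1.0 - p_drop)
    return V @ h
```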
Deeper networks
[Diagram: inputs x_1, x_2, x_3 feed a first hidden layer h^1 through weights W^1, which feeds a second hidden layer h^2 through weights W^2, which feeds the output y through weights V.]
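A sketch of the forward pass with two hidden layers, stacking the same pattern (the layer sizes and random initialization are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F, H1, H2 = 3, 4, 4                # illustrative layer sizes
W1 = np.random.randn(F, H1)        # input -> first hidden layer
W2 = np.random.randn(H1, H2)       # first -> second hidden layer
V = np.random.randn(H2)            # second hidden layer -> output

x = np.array([1.0, 1.0, 0.0])
h1 = sigmoid(x @ W1)
h2 = sigmoid(h1 @ W2)
y_hat = V @ h2
print(y_hat)
```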
http://neuralnetworksanddeeplearning.com/chap1.html
Higher-order features learned for image recognition (Lee et al. 2009, ICML)
Autoencoder
• Unsupervised neural network, where y = x
• Learns a low-dimensional representation of x
[Diagram: inputs x_1, x_2, x_3 are compressed into hidden nodes h_1, h_2 and then reconstructed as x_1, x_2, x_3 at the output.]
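A minimal autoencoder sketch in the same spirit: the output is trained to reconstruct the input, and the two hidden nodes serve as the low-dimensional representation (the squared-error loss, layer sizes, and initialization are my assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F, H = 3, 2                            # 3 inputs compressed into 2 hidden nodes
W_enc = np.random.randn(F, H) * 0.1    # encoder weights
W_dec = np.random.randn(H, F) * 0.1    # decoder weights

x = np.array([1.0, 1.0, 0.0])
h = sigmoid(x @ W_enc)                 # low-dimensional representation of x
x_hat = h @ W_dec                      # reconstruction of x
loss = np.sum((x_hat - x) ** 2)        # training minimizes ||x_hat - x||^2
print(h, loss)
```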
Feedforward networks
[Diagram: the same feed-forward network, x_1, x_2, x_3 → h_1, h_2 → y.]
Recurrent networks
[Diagram: input x, hidden layer h, and label y, with the hidden layer feeding back into itself across time steps.]
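A rough sketch of a simple recurrent hidden layer, in which the hidden state at each time step depends on the current input and the previous hidden state; the specific update h_t = \sigma(x_t W + h_{t-1} U) is the standard "vanilla" RNN form, assumed here rather than taken from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F, H = 3, 2
W = np.random.randn(F, H) * 0.1        # input -> hidden weights
U = np.random.randn(H, H) * 0.1        # hidden -> hidden (recurrent) weights
V = np.random.randn(H) * 0.1           # hidden -> output weights

xs = [np.array([1.0, 0.0, 0.0]),       # a toy input sequence
      np.array([0.0, 1.0, 0.0]),
      np.array([0.0, 0.0, 1.0])]

h = np.zeros(H)                        # initial hidden state
for x_t in xs:
    h = sigmoid(x_t @ W + h @ U)       # the hidden state carries information forward
y_hat = V @ h                          # label predicted from the final hidden state
print(y_hat)
```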
Interpretability

P(y = 1 \mid x, \beta) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}

[Diagram: inputs x_1, x_2, x_3 connect directly to y through weights \beta_1, \beta_2, \beta_3.]

word     x     β
not      1    -0.5
bad      1    -1.7
movie    0     0.3

With a single-layer linear model (logistic/linear regression, perceptron), there is an immediate relationship between x and y apparent in β.
Interpretability
[Diagram: the same hidden-layer network, x_1, x_2, x_3 → h_1, h_2 → y.]
Non-linear activation functions induce dependencies between the inputs.