Monte Carlo Methods and Neural Networks
Alexander Keller, partially joint work with Noah Gamboa
Artificial Neural Networks in a Nutshell

Supervised learning of high dimensional function approximation

• input layer $a_0$, $L-1$ fully connected hidden layers, and output layer $a_L$

[Figure: fully connected network with input units $a_{0,0}, \dots, a_{0,n_0-1}$, hidden units $a_{l,0}, \dots, a_{l,n_l-1}$ in layer $l$, and output units $a_{L,0}, \dots, a_{L,n_L-1}$]

– $n_l$ rectified linear units (ReLU) $a_{l,i} = \max\{0, \sum_j w_{l,j,i} \cdot a_{l-1,j}\}$ in layer $l$
– backpropagating the error $\delta_{l-1,i} = \sum_{a_{l,j} > 0} \delta_{l,j} \cdot w_{l,j,i}$ and updating the weights $w'_{l,j,i} = w_{l,j,i} - \lambda \cdot \delta_{l,j} \cdot a_{l-1,i}$ if $a_{l,j} > 0$
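A minimal NumPy sketch of the three formulas on this slide (forward pass, backpropagated error, and weight update); layer widths, initialization, and the learning rate are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch: forward pass, backpropagation, and weight update for a
# fully connected ReLU network, following
#   a_{l,i} = max{0, sum_j w_{l,j,i} a_{l-1,j}},
#   delta_{l-1,i} = sum_{a_{l,j}>0} delta_{l,j} w_{l,j,i},
#   w'_{l,j,i} = w_{l,j,i} - lambda * delta_{l,j} * a_{l-1,i} if a_{l,j} > 0.
# Widths and learning rate are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]                      # n_0, n_1, n_2, n_L (assumed)
W = [rng.normal(0, 1 / np.sqrt(widths[l]), (widths[l], widths[l + 1]))
     for l in range(len(widths) - 1)]      # w_{l,j,i} stored as W[l][j, i]
lam = 1e-2                                 # learning rate lambda (assumed)

def forward(a0):
    a = [a0]
    for Wl in W:
        a.append(np.maximum(0.0, a[-1] @ Wl))   # ReLU unit
    return a

def backward(a, delta_L):
    delta = delta_L
    for l in reversed(range(len(W))):
        active = (a[l + 1] > 0).astype(float)   # units with a_{l,j} > 0
        masked = delta * active
        grad = np.outer(a[l], masked)           # delta_{l,j} * a_{l-1,i}
        delta = masked @ W[l].T                 # backpropagated error
        W[l] -= lam * grad                      # weight update
    return delta

a = forward(rng.normal(size=widths[0]))
target = np.array([1.0, 0.0])
backward(a, a[-1] - target)                     # squared-error gradient at the output
```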
Artificial Neural Networks in a Nutshell

Convolutional neural networks: similarity measures

• convolutional layer: feature map defined by a convolution kernel
  – identical weights across all neural units of one feature map
• max pooling layer: maximum over a tile of neurons in a feature map for subsampling

◮ Gradient-based learning applied to document recognition
◮ Quasi-Monte Carlo feature maps for shift-invariant kernels
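A small sketch of one feature map with shared weights followed by 2×2 max pooling; kernel size, input size, and values are illustrative assumptions.

```python
# Minimal sketch: one convolutional feature map (identical weights across the
# map) followed by 2x2 max pooling for subsampling.  Sizes are assumed.
import numpy as np

rng = np.random.default_rng(1)
image = rng.normal(size=(8, 8))     # single-channel input (assumed)
kernel = rng.normal(size=(3, 3))    # shared convolution kernel

def conv2d_valid(x, k):
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool2x2(x):
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

feature_map = np.maximum(0.0, conv2d_valid(image, kernel))  # ReLU feature map
pooled = max_pool2x2(feature_map)                           # subsampled output
print(feature_map.shape, pooled.shape)                      # (6, 6) (3, 3)
```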
Relations to Mathematical Objects
Relations to Mathematical Objects

Maximum pooling layers

• rectified linear unit $\mathrm{ReLU}(x) := \max\{0, x\}$ as a basic non-linearity
• for example, the leaky ReLU is $\mathrm{ReLU}(x) - \alpha \cdot \mathrm{ReLU}(-x)$, which for $\alpha = -1$ yields the absolute value $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$
• hence the maximum of two values is
  $\max\{x, y\} = \frac{x+y}{2} + \left|\frac{x-y}{2}\right| = \frac{1}{2}\big(x + y + \mathrm{ReLU}(x-y) + \mathrm{ReLU}(y-x)\big)$,
  which allows one to represent maximum pooling by ReLU functions and introduces skip links
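A quick numerical check of the two identities on this slide; the random test values are arbitrary.

```python
# Minimal sketch: verify  |x| = ReLU(x) + ReLU(-x)  and
#   max{x, y} = (x + y + ReLU(x - y) + ReLU(y - x)) / 2,
# which let max pooling be expressed with ReLU units and skip links.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(2)
x, y = rng.normal(size=1000), rng.normal(size=1000)

assert np.allclose(np.abs(x), relu(x) + relu(-x))
assert np.allclose(np.maximum(x, y), 0.5 * (x + y + relu(x - y) + relu(y - x)))
```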
Relations to Mathematical Objects

Residual layers look like projections onto half spaces

• halfspace $H^+$ with the weights $\hat{\omega}$ as normal and the bias $b$ as distance from the origin $O$
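The slide only states the analogy; the following small sketch is my own illustration of it, assuming the halfspace $H^+ = \{x : \hat\omega \cdot x \geq b\}$ with unit normal $\hat\omega$, so that the projection takes the residual form $x + \mathrm{ReLU}(b - \hat\omega \cdot x)\,\hat\omega$.

```python
# Minimal sketch (illustration, not from the slide): projection onto the
# halfspace H+ = {x : w_hat . x >= b}, written as a residual update
#   x + ReLU(b - w_hat . x) * w_hat.
# The halfspace and test points are assumptions.
import numpy as np

def project_halfspace(x, w_hat, b):
    # zero update if x already lies in H+, otherwise move x onto the boundary
    return x + np.maximum(0.0, b - w_hat @ x) * w_hat

w_hat = np.array([1.0, 0.0])          # unit normal (assumed)
b = 1.0                               # distance of the boundary from the origin
print(project_halfspace(np.array([-0.5, 2.0]), w_hat, b))  # -> [1.0, 2.0]
print(project_halfspace(np.array([3.0, 2.0]), w_hat, b))   # already inside, unchanged
```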
Relations to Mathematical Objects

Residual layers as differential equations

• relation to a differential equation by introducing a step size $h$:
  $a_l = a_{l-1} + h \cdot W_l^{(2)} \max\{0, W_l^{(1)} \cdot a_{l-1}\}$
  resembles the explicit Euler method
  $\Leftrightarrow \frac{a_l - a_{l-1}}{h} = W_l^{(2)} \max\{0, W_l^{(1)} \cdot a_{l-1}\}$,
  where the left-hand side for $h \to 0$ becomes the derivative $\dot{a}$, i.e. an ordinary differential equation
  – select your favorite ordinary differential equation to determine $W_l^{(1)}$ and $W_l^{(2)}$
  ◮ Neural networks motivated by partial differential equations
  – use your favorite ordinary differential equation solver for both inference and training
  ◮ A radical new neural network design could overcome big challenges in AI
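A minimal sketch of a residual block as one explicit Euler step of the ODE above; width, depth, weights, and step size are illustrative assumptions.

```python
# Minimal sketch: a residual block as one explicit Euler step of
#   da/dt = W2(t) max{0, W1(t) a(t)}.
# Width, depth, step size, and weights are assumed for illustration.
import numpy as np

rng = np.random.default_rng(3)
n, L, h = 16, 8, 0.25                          # width, depth, step size (assumed)
W1 = [rng.normal(0, 1 / np.sqrt(n), (n, n)) for _ in range(L)]
W2 = [rng.normal(0, 1 / np.sqrt(n), (n, n)) for _ in range(L)]

def residual_block(a, l):
    # a_l = a_{l-1} + h * W2_l max{0, W1_l a_{l-1}}
    return a + h * W2[l] @ np.maximum(0.0, W1[l] @ a)

a = rng.normal(size=n)
for l in range(L):                             # forward pass = Euler integration
    a = residual_block(a, l)
```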
Relations to Mathematical Objects

Learning integral operator kernels

• neural unit with ReLU
  $a_{l,j} := \max\Big\{0, \sum_{i=0}^{n_{l-1}-1} w_{l,j,i}\, a_{l-1,i}\Big\} \;\to\; a_{l,j} := \sum_{i=0}^{n_{l-1}-1} w_{l,j,i} \max\{0, a_{l-1,i}\}$,
  written in continuous form
  $a_l(y) := \int_0^1 w_l(x, y) \max\{0, a_{l-1}(x)\}\, dx$,
  relates to high-dimensional integro-approximation
• a recurrent neural network layer in continuous form alludes to an integral equation
  $a'_l(y) := \int_0^1 w_l(x, y) \max\{0, a_{l-1}(x)\}\, dx + \int_0^1 w^h_l(x, y) \max\{0, a_l(x)\}\, dx$
  – the weights $w^h$ establish the recurrence, e.g. for processing sequences of data
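A sketch of estimating the continuous layer $a_l(y) = \int_0^1 w_l(x,y)\max\{0,a_{l-1}(x)\}\,dx$ by Monte Carlo sampling; the kernel $w_l$ and the previous-layer function $a_{l-1}$ are illustrative assumptions.

```python
# Minimal sketch: Monte Carlo estimate of the continuous layer
#   a_l(y) = int_0^1 w_l(x, y) max{0, a_{l-1}(x)} dx.
# The kernel w_l and the function a_{l-1} are assumed for illustration.
import numpy as np

def w_l(x, y):                 # assumed smooth kernel on [0,1]^2
    return np.cos(2 * np.pi * (x - y))

def a_prev(x):                 # assumed previous-layer activation function
    return np.sin(2 * np.pi * x) - 0.25

def a_l(y, n_samples=4096, seed=4):
    rng = np.random.default_rng(seed)
    x = rng.random(n_samples)                     # uniform samples on [0,1]
    return np.mean(w_l(x, y) * np.maximum(0.0, a_prev(x)))

print(a_l(0.3))                # one point of the integro-approximation
```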
Monte Carlo Methods and Neural Networks

Explore algorithms linear in time and space

• structural equivalence of integral equations and reinforcement learning
• learning integro-approximation from noisy/sampled data
• examples of random sampling
  – pseudo-random initialization
  – training by stochastic gradient descent
  – regularization by dropout and drop-connect
  – random binarization
  – sampling by generative adversarial networks
  – fixed pseudo-random matrices for direct feedback alignment

◮ Learning light transport the reinforced way
◮ Machine learning and integral equations
◮ Noise2Noise: Learning image restoration without clean data
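Of the sampling examples listed, direct feedback alignment is simple to sketch: the backpropagated error is replaced by fixed pseudo-random matrices applied to the output error. The widths, matrices, and learning rate below are illustrative assumptions.

```python
# Minimal sketch: direct feedback alignment.  Hidden-layer errors are obtained
# by fixed pseudo-random matrices B_l applied to the output error e instead of
# backpropagation.  Sizes and learning rate are assumed.
import numpy as np

rng = np.random.default_rng(5)
widths = [4, 8, 8, 2]
W = [rng.normal(0, 1 / np.sqrt(widths[l]), (widths[l], widths[l + 1]))
     for l in range(len(widths) - 1)]
B = [rng.normal(size=(widths[-1], widths[l + 1]))       # fixed random feedback
     for l in range(len(widths) - 2)]
lam = 1e-2

def train_step(a0, target):
    a = [a0]
    for Wl in W:
        a.append(np.maximum(0.0, a[-1] @ Wl))           # ReLU forward pass
    e = a[-1] - target                                   # output error
    for l in range(len(W)):
        fb = e if l == len(W) - 1 else e @ B[l]          # output layer sees e directly
        delta = fb * (a[l + 1] > 0)
        W[l] -= lam * np.outer(a[l], delta)
    return e

train_step(rng.normal(size=widths[0]), np.array([1.0, 0.0]))
```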
Partition instead of Dropout
Partition instead of Dropout

Guaranteeing coverage of neural units

• drop a neuron if the threshold $\frac{1}{P} > \xi$
  – $\xi$ by a linear feedback shift register generator (for example)

[Figure: the fully connected network from before, with the dropped units removed from each layer]
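One way to read this slide, sketched under my own assumptions: instead of dropping each unit independently, partition the units of a layer into $P$ blocks with a pseudo-random permutation, so that over $P$ training steps every unit is guaranteed to be covered. The permutation source and the per-step block selection below are illustrative, not taken from the slides.

```python
# Minimal sketch (assumption, not the slides' exact scheme): partition the
# units of a layer into P blocks; in step t the block t mod P is dropped, so
# every unit is dropped exactly once per P steps and all units are covered,
# unlike independent dropout where a unit may be dropped or kept many times
# in a row.
import numpy as np

def partition_masks(n_units, P, seed=6):
    perm = np.random.default_rng(seed).permutation(n_units)
    blocks = np.array_split(perm, P)                # P disjoint blocks of units
    masks = np.ones((P, n_units))
    for t, block in enumerate(blocks):
        masks[t, block] = 0.0                       # block t is dropped in step t mod P
    return masks

masks = partition_masks(n_units=8, P=4)
assert np.allclose(masks.sum(axis=0), 3.0)          # every unit dropped exactly once
# usage during training (illustrative): activations *= masks[step % 4]
```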