T-61.182 Information Theory and Machine Learning
38. Introduction to Neural Networks
40. Capacity of a Single Neuron
41. Learning as Inference
Presented by Yang, Zhi-rong on 22 April 2004
Contents
Introduction to Neural Networks
– Memories
– Terminology
Capacity of a Single Neuron
– Neural network learning as communication
– The capacity of a single neuron
– Counting threshold functions
Learning as Inference
– Neural network learning as inference
– Beyond optimization: making predictions
– Implementation by Monte Carlo method
– Implementation by Gaussian approximations
Memories
Address-based memory scheme
– not associative
– not robust or fault-tolerant
– not distributed
Biological memory systems
– content addressable
– error-tolerant and robust
– parallel and distributed
Terminology
– Architecture
– Activity rule
– Learning rule
– Supervised neural networks
– Unsupervised neural networks
NN learning as communication
1. Obtain adapted weights:
   { t_n }_{n=1}^N
          ↓
   { x_n }_{n=1}^N  →  Learning algorithm  →  w
2. Communication:
   { x_n }_{n=1}^N,  w  →  { \hat{t}_n }_{n=1}^N
The capacity of a single neuron
General position
Definition 1: A set of points \{ x_n \} in K-dimensional space is in general position if any subset of size \le K is linearly independent.
The linear threshold function:
    y = f\left( \sum_{k=1}^{K} w_k x_k \right), \qquad f(a) = \begin{cases} 1 & a > 0 \\ 0 & a \le 0 \end{cases}
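As a concrete illustration (my own sketch, not part of the slides), the threshold function above can be evaluated in a few lines of Octave; the weights and the input point below are arbitrary choices.

    w = [1; -1];                 % hypothetical weights, K = 2
    x = [2; 1];                  % one input point
    a = w' * x;                  % activation a = sum_k w_k x_k
    y = double(a > 0);           % f(a) = 1 if a > 0, 0 otherwise
    disp(y)                      % prints 1 for this particular w and x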
Counting threshold functions
Denote by T(N, K) the number of distinct threshold functions on N points in general position in K dimensions. In this section the author derives a formula for T(N, K).
To start with, let us work out a few cases by hand:
– K = 1: for any N, T(N, 1) = 2
– N = 1: for any K, T(1, K) = 2
– K = 2: T(N, 2) = 2N (for example, the XOR labelling of four points is unrealizable)
Counting threshold functions
Final result:
    T(N, K) = \begin{cases} 2^N & K \ge N \\ 2 \sum_{k=0}^{K-1} \binom{N-1}{k} & K < N \end{cases}
Vapnik-Chervonenkis dimension (VC dimension)
[Figure: T(N, K) / 2^N plotted against N/K]
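A small Octave sketch (mine, not from the slides) that evaluates this final result and checks it against the hand-worked cases; the values of N and K below are arbitrary.

    N = 5;  K = 2;                               % example values, chosen arbitrarily
    if K >= N
      T = 2^N;                                   % every labelling of the N points is realizable
    else
      T = 2 * sum(arrayfun(@(k) nchoosek(N - 1, k), 0:K-1));
    endif
    printf("T(%d, %d) = %d\n", N, K, T)          % prints T(5, 2) = 10, matching T(N, 2) = 2N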
NN learning as inference
Objective function to be minimized:
    M(w) = G(w) + \alpha E_W(w)
with error function
    G(w) = -\sum_n \left[ t^{(n)} \ln y(x^{(n)}; w) + (1 - t^{(n)}) \ln(1 - y(x^{(n)}; w)) \right]
and regularizer
    E_W(w) = \frac{1}{2} \sum_i w_i^2
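For concreteness, here is a minimal Octave sketch of these quantities for a single sigmoid neuron y(x; w) = 1/(1 + e^{-w^T x}); the toy data, weights and \alpha are my own placeholders, not from the slides.

    X = [0 0; 0 1; 1 0; 1 1]';      % toy inputs, one column per example (placeholder data)
    t = [0 0 0 1];                  % toy targets t^(n)
    w = [0.5; -0.3];                % current weights
    alpha = 0.01;                   % weight-decay constant
    a = w' * X;                     % activations a^(n) = w' x^(n)
    y = 1 ./ (1 + exp(-a));         % outputs y(x^(n); w)
    G  = -sum(t .* log(y) + (1 - t) .* log(1 - y));   % cross-entropy error G(w)
    EW = 0.5 * sum(w.^2);           % regularizer E_W(w)
    M  = G + alpha * EW;            % objective M(w) to be minimized
    printf("G = %.4f, E_W = %.4f, M = %.4f\n", G, EW, M)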
NN learning as inference
Finally,
    P(w | D, \alpha) = \frac{P(D | w) P(w | \alpha)}{P(D | \alpha)}                              (1)
                     = \frac{e^{-G(w)} \, e^{-\alpha E_W(w)} / Z_W(\alpha)}{P(D | \alpha)}        (2)
                     = \frac{1}{Z_M} \exp(-M(w))                                                  (3)
NN learning as inference
Denote y(w; x) \equiv P(t = 1 | x, w). Then
    P(t | x, w) = y^t (1 - y)^{1-t} = \exp\left[ t \ln y + (1 - t) \ln(1 - y) \right]
The likelihood can be expressed in terms of the error function:
    P(D | w) = \exp[-G(w)]
Similarly for the regularizer:
    P(w | \alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W)
Making predictions
Over-confident prediction (example)
[Figure: a toy data set with query points A and B, illustrating that predictions based on a single optimized w can be over-confident away from the training data]
Bayesian prediction: marginalizing
Take the whole posterior ensemble into account:
    P(t^{(N+1)} | x^{(N+1)}, D, \alpha) = \int d^K w \; P(t^{(N+1)} | x^{(N+1)}, w, \alpha) \, P(w | D, \alpha)
Try to find a way of computing the integral:
    P(t^{(N+1)} = 1 | x^{(N+1)}, D, \alpha) = \int d^K w \; P(t^{(N+1)} = 1 | x^{(N+1)}, w, \alpha) \, \frac{1}{Z_M} \exp(-M(w))
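One way to approximate this integral is a simple Monte Carlo average over posterior samples of w, such as those produced by the Langevin method on the next slide. The sketch below is my own and uses random stand-in samples just to show the shape of the computation.

    R = 100;
    wsamples = randn(2, R);            % stand-in for R posterior samples of w (K = 2)
    xnew = [1; 0];                     % hypothetical test input x^(N+1)
    ysum = 0;
    for r = 1:R
      a = wsamples(:, r)' * xnew;      % activation under sample w^(r)
      ysum = ysum + 1 / (1 + exp(-a)); % y = P(t^(N+1) = 1 | x^(N+1), w^(r))
    endfor
    pred = ysum / R;                   % Monte Carlo estimate of P(t^(N+1) = 1 | x^(N+1), D, alpha)
    printf("predictive probability approx %.3f\n", pred)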
The Langevin Monte Carlo Method
g = gradM(w);                 % gradient of the objective at the current w
M = findM(w);                 % current value of the objective M(w)
for l = 1:L
  p = randn(size(w));         % draw a fresh Gaussian momentum
  H = p'*p/2 + M;             % Hamiltonian at the current state
  p = p - epsilon*g/2;        % half-step in p
  wnew = w + epsilon*p;       % full step in w
  gnew = gradM(wnew);
  p = p - epsilon*gnew/2;     % second half-step in p
  Mnew = findM(wnew);
  Hnew = p'*p/2 + Mnew;       % Hamiltonian at the proposed state
  dH = Hnew - H;
  if (dH < 0 || rand() < exp(-dH))   % Metropolis accept/reject
    g = gnew;  w = wnew;  M = Mnew;
  endif
endfor
The Langevin Monte Carlo Method
'Gradient descent with added noise':
    \Delta w = -\frac{1}{2} \epsilon^2 g + \epsilon p
Speedup by Hamiltonian Monte Carlo (replace the single leapfrog step by Tau steps):
wnew = w;  gnew = g;
for tau = 1:Tau
  p = p - epsilon*gnew/2;     % half-step in p
  wnew = wnew + epsilon*p;    % full step in w
  gnew = gradM(wnew);
  p = p - epsilon*gnew/2;     % second half-step in p
endfor
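The samplers above assume helper functions findM and gradM; the slides do not define them, so the following is a hedged sketch for the single sigmoid neuron with M(w) = G(w) + \alpha E_W(w), using placeholder toy data.

    X = [0 0; 0 1; 1 0; 1 1]';   % toy inputs, one column per example (placeholders)
    t = [0 0 0 1];               % toy targets
    alpha = 0.01;                % weight-decay constant
    sigm  = @(a) 1 ./ (1 + exp(-a));
    findM = @(w) -sum(t .* log(sigm(w'*X)) + (1 - t) .* log(1 - sigm(w'*X))) ...
                 + alpha * 0.5 * sum(w.^2);                 % M(w) = G(w) + alpha E_W(w)
    gradM = @(w) X * (sigm(w'*X) - t)' + alpha * w;         % dG/dw = sum_n (y^(n) - t^(n)) x^(n)
    % With these handles defined, the sampler can be run after initializing, e.g.:
    % w = zeros(2, 1);  epsilon = 0.1;  L = 1000;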
Gaussian approximations
Taylor-expand M(w):
    M(w) \simeq M(w_{MP}) + \frac{1}{2} (w - w_{MP})^T A (w - w_{MP}) + \cdots
with Hessian matrix
    A_{ij} \equiv \left. \frac{\partial^2 M(w)}{\partial w_i \partial w_j} \right|_{w = w_{MP}}
The Gaussian approximation is defined as
    Q(w; w_{MP}, A) = \left[ \det(A / 2\pi) \right]^{1/2} \exp\left( -\frac{1}{2} (w - w_{MP})^T A (w - w_{MP}) \right)
Gaussian approximations
The second derivative of M(w) with respect to w is given by
    \frac{\partial^2 M(w)}{\partial w_i \partial w_j} = \sum_{n=1}^{N} f'(a^{(n)}) \, x_i^{(n)} x_j^{(n)} + \alpha \delta_{ij}
where
    f(a) \equiv \frac{1}{1 + e^{-a}}, \qquad a^{(n)} = \sum_j w_j x_j^{(n)}
Gaussian approximations
    P(a | x, D, \alpha) = \text{Normal}(a_{MP}, s^2) = \frac{1}{\sqrt{2\pi s^2}} \exp\left( -\frac{(a - a_{MP})^2}{2 s^2} \right)
where a_{MP} = a(x; w_{MP}) and s^2 = x^T A^{-1} x.
Gaussian approximations
Therefore the marginalized output is
    P(t = 1 | x, D, \alpha) = \psi(a_{MP}, s^2) \equiv \int da \, f(a) \, \text{Normal}(a_{MP}, s^2)
A further approximation can be applied:
    \psi(a_{MP}, s^2) \simeq \phi(a_{MP}, s^2) \equiv f(\kappa(s) \, a_{MP})
where
    \kappa(s) = 1 / \sqrt{1 + \pi s^2 / 8}
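Putting the last few slides together, here is a hedged Octave sketch of the moderated prediction f(\kappa(s) a_{MP}); the toy data and the stand-in w_{MP} are my own placeholders (in practice w_{MP} would come from minimizing M(w)).

    X = [0 0; 0 1; 1 0; 1 1]';  t = [0 0 0 1];  alpha = 0.01;   % placeholder data
    wMP = [2; 2];                               % stand-in for the MAP weights w_MP
    xq  = [3; 3];                               % query input, far from the data
    a = wMP' * X;                               % activations a^(n)
    y = 1 ./ (1 + exp(-a));                     % f(a^(n)); for the sigmoid, f'(a) = y (1 - y)
    A = X * diag(y .* (1 - y)) * X' + alpha * eye(length(wMP));  % Hessian from the formula above
    aMP = wMP' * xq;                            % a_MP = a(x; w_MP)
    s2  = xq' * (A \ xq);                       % s^2 = x' A^{-1} x
    kappa = 1 / sqrt(1 + pi * s2 / 8);          % kappa(s)
    printf("f(a_MP) = %.3f, f(kappa a_MP) = %.3f\n", ...
           1 / (1 + exp(-aMP)), 1 / (1 + exp(-kappa * aMP)))   % over-confident vs moderated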
Exercises
– Practice on counting threshold functions: Ex. 40.6 (page 490)
– Prove the approximation to the Hessian matrix: Ex. 41.1 (page 501)