

  1. Neural Networks for Machine Learning, Lecture 13a: The ups and downs of backpropagation. Geoffrey Hinton, with Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed.

  2. A brief history of backpropagation
• The backpropagation algorithm for learning multiple layers of features was invented several times in the 70’s and 80’s:
  – Bryson & Ho (1969), linear
  – Werbos (1974)
  – Rumelhart et al. (1981)
  – Parker (1985)
  – LeCun (1985)
  – Rumelhart et al. (1985)
• Backpropagation clearly had great promise for learning multiple layers of non-linear feature detectors.
• But by the late 1990’s most serious researchers in machine learning had given up on it.
  – It was still widely used in psychological models and in practical applications such as credit card fraud detection.

  3. Why backpropagation failed
• The popular explanation of why backpropagation failed in the 90’s:
  – It could not make good use of multiple hidden layers (except in convolutional nets).
  – It did not work well in recurrent networks or deep auto-encoders.
  – Support Vector Machines worked better, required less expertise, produced repeatable results, and had much fancier theory.
• The real reasons it failed:
  – Computers were thousands of times too slow.
  – Labeled datasets were hundreds of times too small.
  – Deep networks were too small and not initialized sensibly.
• These issues prevented it from being successful for tasks where it would eventually be a big win.

  4. A spectrum of machine learning tasks
Typical Statistics:
• Low-dimensional data (e.g. less than 100 dimensions).
• Lots of noise in the data.
• Not much structure in the data, and the structure can be captured by a fairly simple model.
• The main problem is separating true structure from noise.
  – Not ideal for non-Bayesian neural nets. Try SVM or GP.
Artificial Intelligence:
• High-dimensional data (e.g. more than 100 dimensions).
• The noise is not the main problem.
• There is a huge amount of structure in the data, but it’s too complicated to be represented by a simple model.
• The main problem is figuring out a way to represent the complicated structure so that it can be learned.
  – Let backpropagation figure it out.

  5. Why Support Vector Machines were never a good bet for Artificial Intelligence tasks that need good representations
• View 1: SVM’s are just a clever reincarnation of Perceptrons.
  – They expand the input to a (very large) layer of non-linear, non-adaptive features.
  – They only have one layer of adaptive weights.
  – They have a very efficient way of fitting the weights that controls overfitting.
• View 2: SVM’s are just a clever reincarnation of Perceptrons.
  – They use each input vector in the training set to define a non-adaptive “pheature”: the global match between a test input and that training input.
  – They have a clever way of simultaneously doing feature selection and finding weights on the remaining features.

  6. Historical document from AT&T Adaptive Systems Research Dept., Bell Labs

  7. Neural Networks for Machine Learning, Lecture 13b: Belief Nets. Geoffrey Hinton, with Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed.

  8. What is wrong with back-propagation?
• It requires labeled training data.
  – Almost all data is unlabeled.
• The learning time does not scale well.
  – It is very slow in networks with multiple hidden layers.
  – Why?
• It can get stuck in poor local optima.
  – These are often quite good, but for deep nets they are far from optimal.
  – Should we retreat to models that allow convex optimization?

  9. Overcoming the limitations of back-propagation by using unsupervised learning
• Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input.
  – Adjust the weights to maximize the probability that a generative model would have generated the sensory input.
  – If you want to do computer vision, first learn computer graphics.
• The learning objective for a generative model:
  – Maximise p(x), not p(y | x).
• What kind of generative model should we learn?
  – An energy-based model like a Boltzmann machine?
  – A causal model made of idealized neurons?
  – A hybrid of the two?

  10. Artificial Intelligence and Probability
“Many ancient Greeks supported Socrates’ opinion that deep, inexplicable thoughts came from the gods. Today’s equivalent to the gods is the erratic, even probabilistic neuron. It is more likely that increased randomness of neural behavior is the problem of the epileptic and the drunk, not the advantage of the brilliant.”
  – P.H. Winston, “Artificial Intelligence”, 1977. (The first AI textbook)
“All of this will lead to theories of computation which are much less rigidly of an all-or-none nature than past and present formal logic ... There are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics, primarily in the form it was received from Boltzmann.”
  – John von Neumann, “The Computer and the Brain”, 1958 (unfinished manuscript)

  11. The marriage of graph theory and probability theory
• In the 1980’s there was a lot of work in AI that used bags of rules for tasks such as medical diagnosis and exploration for minerals.
  – For practical problems, they had to deal with uncertainty.
  – They made up ways of doing this that did not involve probabilities!
• Graphical models: Pearl, Heckerman, Lauritzen, and many others showed that probabilities worked better.
  – Graphs were good for representing what depended on what.
  – Probabilities then had to be computed for nodes of the graph, given the states of other nodes.
• Belief Nets: For sparsely connected, directed acyclic graphs, clever inference algorithms were discovered.

  12. Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We get to observe some of the variables, and we would like to solve two problems:
  – The inference problem: infer the states of the unobserved variables.
  – The learning problem: adjust the interactions between variables to make the network more likely to generate the training data.
(Figure: a layered net with stochastic hidden causes at the top and visible effects at the bottom.)
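As a compact way to state what the diagram means (this equation is not on the slide, and the variable names s_1, ..., s_N are assumed here), the joint distribution of a belief net factorizes over the directed acyclic graph, with each stochastic variable conditioned only on its parents:

  $p(s_1, \dots, s_N) = \prod_{i=1}^{N} p\big(s_i \mid \mathrm{parents}(s_i)\big)$

The inference problem is then computing the posterior over the unobserved variables under this product, and the learning problem is adjusting the conditional distributions (for a sigmoid belief net, the weights and biases) to raise the probability of the training data.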

  13. Graphical Models versus Neural Networks
• Early graphical models used experts to define the graph structure and the conditional probabilities.
  – The graphs were sparsely connected.
  – Researchers initially focused on doing correct inference, not on learning.
• For neural nets, learning was central. Hand-wiring the knowledge was not cool (OK, maybe a little bit).
  – Knowledge came from learning the training data.
• Neural networks did not aim for interpretability or sparse connectivity to make inference easy.
  – Nevertheless, there are neural network versions of belief nets.

  14. Two types of generative neural network composed of stochastic binary neurons
• Causal: We connect binary stochastic neurons in a directed acyclic graph to get a Sigmoid Belief Net (Neal 1992).
• Energy-based: We connect binary stochastic neurons using symmetric connections to get a Boltzmann Machine.
  – If we restrict the connectivity in a special way, it is easy to learn a Boltzmann machine.
  – But then we only have one hidden layer.
(Figure: the causal net drawn with stochastic hidden causes above visible effects.)
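The "special restriction" is the restricted Boltzmann machine: one layer of visible units and one layer of hidden units, with no within-layer connections. As a hedged aside (the slide does not write it out, and the symbols a_i, b_j, w_ij are assumed notation), its energy function is usually written as:

  $E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j, \qquad p(\mathbf{v}, \mathbf{h}) \propto e^{-E(\mathbf{v}, \mathbf{h})}$

The bipartite connectivity is what makes the hidden units conditionally independent given the visibles, which is why learning becomes easy, at the cost of having only one hidden layer.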

  15. Neural Networks for Machine Learning, Lecture 13c: Learning Sigmoid Belief Nets. Geoffrey Hinton, with Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed.

  16. Learning Sigmoid Belief Nets
• It is easy to generate an unbiased example at the leaf nodes, so we can see what kinds of data the network believes in.
• It is hard to infer the posterior distribution over all possible configurations of hidden causes.
• It is hard to even get a sample from the posterior.
• So how can we learn sigmoid belief nets that have millions of parameters?
(Figure: stochastic hidden causes at the top, visible effects at the bottom.)
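To make "easy to generate" concrete, here is a minimal sketch of ancestral (top-down) sampling in a two-layer sigmoid belief net: sample each hidden cause from its bias, then sample each visible effect from a logistic function of the sampled causes. The layer sizes, random weights, and function names are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes and randomly chosen parameters (assumptions, not Hinton's).
n_hidden, n_visible = 20, 50
b_hidden = np.zeros(n_hidden)                          # biases of the top-level hidden causes
b_visible = np.zeros(n_visible)                        # biases of the visible effects
W = 0.1 * rng.standard_normal((n_hidden, n_visible))   # hidden -> visible weights

def sample_from_sbn():
    """Ancestral (top-down) sampling: sample parents first, then children."""
    h = (rng.random(n_hidden) < sigmoid(b_hidden)).astype(float)
    p_v = sigmoid(b_visible + h @ W)                   # each visible unit sees its sampled parents
    v = (rng.random(n_visible) < p_v).astype(float)
    return h, v

h, v = sample_from_sbn()
print("hidden causes: ", h.astype(int))
print("visible effects:", v.astype(int))
```

Going the other way, from a visible vector back to a posterior over the hidden causes, has no comparably simple form, which is exactly the difficulty the slide points out.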

  17. The learning rule for sigmoid belief nets
• Learning is easy if we can get an unbiased sample from the posterior distribution over hidden states given the observed data.
• For each unit, maximize the log prob. that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.
(Figure: a parent unit j with state s_j connected by weight w_ji to a child unit i with state s_i.)

  $p_i \equiv p(s_i = 1) = \dfrac{1}{1 + \exp\big(-b_i - \sum_j s_j w_{ji}\big)}$

  $\Delta w_{ji} = \varepsilon \, s_j \, (s_i - p_i)$
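A minimal sketch of the update above, assuming we already have a fully observed configuration (the visible data plus one sample of the hidden states from the posterior). The function name, array shapes, learning rate, and the toy usage values are illustrative assumptions, not from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sbn_update(W, b, s_parents, s_children, eps=0.01):
    """One step of the sigmoid-belief-net learning rule for one layer of connections.

    W[j, i]       : weight from parent unit j to child unit i
    b[i]          : bias of child unit i
    s_parents[j]  : sampled binary state of parent j (e.g. from a posterior sample)
    s_children[i] : sampled binary state of child i
    """
    p = sigmoid(b + s_parents @ W)            # p_i = sigma(b_i + sum_j s_j w_ji)
    delta = s_children - p                    # (s_i - p_i)
    W += eps * np.outer(s_parents, delta)     # delta w_ji = eps * s_j * (s_i - p_i)
    b += eps * delta                          # same rule, treating b_i as a weight from an always-on parent
    return W, b

# Toy usage with made-up sizes: 3 parents, 2 children.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((3, 2))
b = np.zeros(2)
s_parents = np.array([1.0, 0.0, 1.0])         # pretend these came from a posterior sample
s_children = np.array([1.0, 0.0])
W, b = sbn_update(W, b, s_parents, s_children)
```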

  18. Explaining away (Judea Pearl)
• Even if two hidden causes are independent in the prior, they can become dependent when we observe an effect that they can both influence.
  – If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.
(Figure: two hidden causes, "truck hits house" and "earthquake", each with bias -10, both connected with weight +20 to the observed effect "house jumps", which has bias -20.)
Posterior over the hidden causes, given that the house jumped:
  p(1,1) = .0001   p(1,0) = .4999
  p(0,1) = .4999   p(0,0) = .0001
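To see where the posterior table comes from, here is a minimal brute-force sketch (not from the lecture; variable names are mine) that treats all three units as logistic binary units with the biases and weights from the figure and conditions on "house jumps = 1" by enumeration. The rounded output differs slightly from the slide's rounding, but shows the same pattern: almost all posterior mass falls on exactly one cause being on.

```python
from itertools import product
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

b_truck, b_quake, b_house = -10.0, -10.0, -20.0   # biases from the figure
w_truck, w_quake = 20.0, 20.0                     # weights into "house jumps"

joint = {}
for t, e in product([0, 1], repeat=2):
    # p(t, e, house=1) = p(t) * p(e) * p(house=1 | t, e)
    p_t = sigmoid(b_truck) if t else 1 - sigmoid(b_truck)
    p_e = sigmoid(b_quake) if e else 1 - sigmoid(b_quake)
    p_h = sigmoid(b_house + w_truck * t + w_quake * e)
    joint[(t, e)] = p_t * p_e * p_h

z = sum(joint.values())                           # p(house jumps = 1)
for te, p in sorted(joint.items()):
    print("p(truck=%d, quake=%d | house jumps) = %.4f" % (te[0], te[1], p / z))
```

Each single-cause explanation gets roughly half the posterior mass, while (0,0) and (1,1) get almost none: once one cause is known to be on, the other becomes very improbable, even though the two causes are independent in the prior. That anti-correlation under the posterior is explaining away.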
