
Neural Networks: Hopfield Nets and Boltzmann Machines (Spring 2018) - PowerPoint PPT Presentation

Neural Networks: Hopfield Nets and Boltzmann Machines, Spring 2018. Recap: a Hopfield network is a symmetric, loopy network of binary units; each unit takes the value y_i = +1 if Ξ£_j w_ji y_j > 0, and y_i = -1 otherwise.


  1. Only N patterns?
     β€’ Patterns that differ in N/2 bits are orthogonal (e.g. (1,1) and (1,-1))
     β€’ You can have at most N orthogonal vectors in an N-dimensional space

  2. Another random fact that should interest you
     β€’ The eigenvectors of any symmetric matrix W are orthogonal
     β€’ The eigenvalues may be positive or negative

  3. Storing more than one pattern
     β€’ Requirement: given y_1, y_2, ..., y_P, design W such that
       – sign(W y_p) = y_p for all target patterns
       – there are no other binary vectors for which this holds
     β€’ What is the largest number of patterns that can be stored?

  4. Storing K orthogonal patterns
     β€’ Simple solution: design W such that y_1, y_2, ..., y_K are eigenvectors of W
       – Let Y = [y_1 y_2 ... y_K] and W = Y Ξ› Y^T
       – Ξ»_1, ..., Ξ»_K are positive
       – For Ξ»_1 = Ξ»_2 = ... = Ξ»_K = 1 this is exactly the Hebbian rule
     β€’ The patterns are provably stationary
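The eigenvector construction above can be checked directly. Below is a minimal NumPy sketch (my own addition, not from the slides) that builds W = Y Ξ› Y^T from two orthogonal Β±1 patterns with Ξ»_1 = Ξ»_2 = 1 (the Hebbian case) and verifies that sign(W y_k) = y_k; the example patterns are arbitrary illustrative choices.

```python
import numpy as np

# Hypothetical example: two orthogonal +/-1 patterns in N = 4 dimensions.
Y = np.array([[ 1,  1,  1,  1],
              [ 1, -1,  1, -1]]).T        # N x K matrix whose columns are y_1, y_2
K = Y.shape[1]

Lam = np.eye(K)                           # lambda_1 = ... = lambda_K = 1 (Hebbian case)
W = Y @ Lam @ Y.T                         # W = Y Lambda Y^T

for k in range(K):
    y = Y[:, k]
    # Each stored pattern is stationary: sign(W y_k) = y_k
    assert np.array_equal(np.sign(W @ y), y)
```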

  5. Hebbian rule
     β€’ In reality
       – Let Y = [y_1 y_2 ... y_K r_{K+1} r_{K+2} ... r_N] and W = Y Ξ› Y^T
       – r_{K+1}, ..., r_N are orthogonal to y_1, y_2, ..., y_K
       – Ξ»_1 = Ξ»_2 = ... = Ξ»_K = 1
       – Ξ»_{K+1}, ..., Ξ»_N = 0
     β€’ All patterns orthogonal to y_1, y_2, ..., y_K are also stationary
       – Although not stable

  6. Storing N orthogonal patterns
     β€’ When we have N orthogonal (or near orthogonal) patterns y_1, y_2, ..., y_N
       – Y = [y_1 y_2 ... y_N] and W = Y Ξ› Y^T
       – Ξ»_1 = Ξ»_2 = ... = Ξ»_N = 1
     β€’ The eigenvectors of W span the space
     β€’ Also, for any y_k: W y_k = y_k

  7. Storing N orthogonal patterns
     β€’ The N orthogonal patterns y_1, y_2, ..., y_N span the space
     β€’ Any pattern y can be written as y = a_1 y_1 + a_2 y_2 + ... + a_N y_N, so
       W y = a_1 W y_1 + a_2 W y_2 + ... + a_N W y_N = a_1 y_1 + a_2 y_2 + ... + a_N y_N = y
     β€’ All patterns are stable
       – Remembers everything
       – Completely useless network

  8. Storing K orthogonal patterns
     β€’ Even if we store fewer than N patterns
       – Let Y = [y_1 y_2 ... y_K r_{K+1} r_{K+2} ... r_N] and W = Y Ξ› Y^T
       – r_{K+1}, ..., r_N are orthogonal to y_1, y_2, ..., y_K
       – Ξ»_1 = Ξ»_2 = ... = Ξ»_K = 1; Ξ»_{K+1}, ..., Ξ»_N = 0
     β€’ All patterns orthogonal to y_1, y_2, ..., y_K are stationary
     β€’ Any pattern that is entirely in the subspace spanned by y_1, y_2, ..., y_K is also stable (same logic as earlier)
     β€’ Only patterns that are partially in the subspace spanned by y_1, y_2, ..., y_K are unstable
       – They get projected onto the subspace spanned by y_1, y_2, ..., y_K

  9. Problem with the Hebbian rule
     β€’ Even if we store fewer than N patterns
       – Let Y = [y_1 y_2 ... y_K r_{K+1} r_{K+2} ... r_N] and W = Y Ξ› Y^T
       – r_{K+1}, ..., r_N are orthogonal to y_1, y_2, ..., y_K
       – Ξ»_1 = Ξ»_2 = ... = Ξ»_K = 1
     β€’ Problems arise because the eigenvalues are all 1.0
       – This ensures stationarity of vectors in the subspace
       – What if we get rid of this requirement?

  10. Hebbian rule and general (non-orthogonal) vectors
     β€’ w_ji = Ξ£_p y_i^p y_j^p
     β€’ What happens when the patterns are not orthogonal?
     β€’ What happens when the patterns are presented more than once?
       – Different patterns presented different numbers of times
       – Equivalent to having unequal eigenvalues..
     β€’ Can we predict the evolution of any vector y?
       – Hint: Lanczos iterations
       – Can write Y_P = Y_ortho B, so W = Y_ortho B Ξ› B^T Y_ortho^T
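As a concrete reference for the rule w_ji = Ξ£_p y_i^p y_j^p, here is a short sketch (my addition). Repeating a pattern simply adds its outer product again, which is the "unequal eigenvalue" effect mentioned above; zeroing the diagonal is a common Hopfield convention rather than something this slide requires.

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian rule: w_ji = sum_p y_i^p y_j^p, with `patterns` a list of +/-1 vectors."""
    N = len(patterns[0])
    W = np.zeros((N, N))
    for y in patterns:                    # a pattern presented twice contributes twice
        y = np.asarray(y, dtype=float)
        W += np.outer(y, y)               # add y y^T
    np.fill_diagonal(W, 0.0)              # common convention: no self-connections
    return W
```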

  11. The bottom line
     β€’ With a network of N units (i.e. N-bit patterns)
     β€’ The maximum number of stable patterns is actually exponential in N
       – McEliece and Posner, 1984
       – E.g. when we had the Hebbian net with N orthogonal base patterns, all patterns were stable
     β€’ For a specific set of K patterns, we can always build a network for which all K patterns are stable, provided K ≀ N
       – Abu-Mostafa and St. Jacques, 1985
     β€’ For large N, the upper bound on K is actually N / (4 log N)
       – McEliece et al., 1987
       – But this may come with many "parasitic" memories

  12. The bottom line (contd.)
     β€’ Same slide as above, with the added callout: "How do we find this network?"

  13. The bottom line (contd.)
     β€’ Same slide again, with a second callout: "Can we do something about this?"

  14. A different tack
     β€’ How do we make the network store a specific pattern or set of patterns?
       – Hebbian learning
       – Geometric approach
       – Optimization
     β€’ Secondary question
       – How many patterns can we store?

  15. Consider the energy function
     β€’ E = -1/2 y^T W y - b^T y
     β€’ This must be maximally low for target patterns
     β€’ It must be maximally high for all other patterns
       – So that they are unstable and evolve into one of the target patterns

  16. Alternate approach to estimating the network
     β€’ E(y) = -1/2 y^T W y - b^T y
     β€’ Estimate W (and b) such that
       – E is minimized for y_1, y_2, ..., y_P
       – E is maximized for all other y
     β€’ Caveat: it is unrealistic to expect to store more than N patterns, but can we make those N patterns memorable?

  17. Optimizing W (and b)
     β€’ E(y) = -1/2 y^T W y
       – The bias can be captured by another fixed-value component
     β€’ WΜ‚ = argmin_W Ξ£_{y ∈ Y_P} E(y)
     β€’ Minimize the total energy of target patterns
       – Problem with this?

  18. Optimizing W
     β€’ E(y) = -1/2 y^T W y
     β€’ WΜ‚ = argmin_W [ Ξ£_{y ∈ Y_P} E(y) - Ξ£_{y βˆ‰ Y_P} E(y) ]
     β€’ Minimize the total energy of target patterns
     β€’ Maximize the total energy of all non-target patterns

  19. Optimizing W
     β€’ WΜ‚ = argmin_W [ Ξ£_{y ∈ Y_P} E(y) - Ξ£_{y βˆ‰ Y_P} E(y) ]
     β€’ Simple gradient descent:
       W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P} y y^T )
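For a toy network the update above can be taken literally by enumerating every Β±1 vector. The sketch below (my own) does exactly that, which also shows why the second sum is infeasible for realistic N and motivates the sampled "valleys" of the following slides.

```python
import numpy as np
from itertools import product

def full_gradient_step(W, targets, eta=0.01):
    """One step of W <- W + eta (sum_{y in Y_P} y y^T - sum_{y not in Y_P} y y^T).

    `targets` is a list of +/-1 lists. Enumerates all 2^N binary vectors, so this
    is only feasible for very small N.
    """
    N = W.shape[0]
    target_set = {tuple(y) for y in targets}
    grad = np.zeros_like(W)
    for bits in product([-1, 1], repeat=N):
        y = np.array(bits, dtype=float)
        sign = 1.0 if bits in target_set else -1.0   # +1 for targets, -1 for everything else
        grad += sign * np.outer(y, y)
    return W + eta * grad
```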

  20. Optimizing W
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P} y y^T )
     β€’ Can "emphasize" the importance of a pattern by repeating it
       – More repetitions β†’ greater emphasis

  21. Optimizing W
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P} y y^T )
     β€’ Can "emphasize" the importance of a pattern by repeating it
       – More repetitions β†’ greater emphasis
     β€’ How many of these non-target patterns?
       – Do we need to include all of them?
       – Are all equally important?

  22. The training again..
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P} y y^T )
     β€’ Note the energy contour of a Hopfield network for any weight W
       – The bowls will all actually be quadratic
     (Figure: energy vs. state)

  23. The training again
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P} y y^T )
     β€’ The first term tries to minimize the energy at target patterns
       – Make them local minima
       – Emphasize more "important" memories by repeating them more frequently
     (Figure: target patterns on the energy-vs-state contour)

  24. The negative class
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P} y y^T )
     β€’ The second term tries to "raise" all non-target patterns
       – Do we need to raise everything?
     (Figure: energy vs. state)

  25. Option 1: Focus on the valleys
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P & y = valley} y y^T )
     β€’ Focus on raising the valleys
       – If you raise every valley, eventually they'll all move up above the target patterns, and many will even vanish
     (Figure: energy vs. state)

  26. Identifying the valleys..
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P & y = valley} y y^T )
     β€’ Problem: how do you identify the valleys for the current W?
     (Figure: energy vs. state)

  27. Identifying the valleys..
     β€’ Initialize the network randomly and let it evolve
       – It will settle in a valley
     (Figure: energy vs. state)

  28. Training the Hopfield network
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P & y = valley} y y^T )
     β€’ Initialize W
     β€’ Compute the total outer product of all target patterns
       – More important patterns presented more frequently
     β€’ Randomly initialize the network several times and let it evolve
       – And settle at a valley
     β€’ Compute the total outer product of the valley patterns
     β€’ Update the weights

  29. Training the Hopfield network: SGD version
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P & y = valley} y y^T )
     β€’ Initialize W
     β€’ Do until convergence, satisfaction, or death from boredom:
       – Sample a target pattern y_p
         β€’ The sampling frequency of a pattern must reflect its importance
       – Randomly initialize the network and let it evolve
         β€’ And settle at a valley y_v
       – Update the weights: W = W + Ξ· ( y_p y_p^T - y_v y_v^T )
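A hedged sketch of this SGD procedure follows; the asynchronous evolve routine, the stopping test, and the learning rate are my own simple choices rather than anything the slides prescribe.

```python
import numpy as np

def evolve(W, y, max_sweeps=100, rng=None):
    """Asynchronous Hopfield updates until no bit changes, i.e. the state settles in a valley."""
    rng = rng or np.random.default_rng()
    y = y.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(y)):
            new_bit = 1.0 if W[i] @ y > 0 else -1.0
            if new_bit != y[i]:
                y[i] = new_bit
                changed = True
        if not changed:
            break
    return y

def train_sgd(targets, n_iters=1000, eta=0.01, seed=0):
    """Raise a randomly found valley, lower a sampled target (the slide's SGD loop)."""
    rng = np.random.default_rng(seed)
    N = len(targets[0])
    W = np.zeros((N, N))
    for _ in range(n_iters):
        y_p = np.array(targets[rng.integers(len(targets))], dtype=float)   # sample a target
        y_v = evolve(W, rng.choice([-1.0, 1.0], size=N), rng=rng)          # settle at a valley
        W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))               # update
    return W
```

In this sketch, sampling a target more often than the others is how "importance" would be expressed.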

  30. Training the Hopfield network
     β€’ (Repeat of the SGD procedure on the previous slide)

  31. Which valleys?
     β€’ Should we randomly sample valleys?
       – Are all valleys equally important?
     (Figure: energy vs. state)

  32. Which valleys?
     β€’ Should we randomly sample valleys?
       – Are all valleys equally important?
     β€’ Major requirement: memories must be stable
       – They must be broad valleys
     β€’ Spurious valleys in the neighborhood of memories are more important to eliminate
     (Figure: energy vs. state)

  33. Identifying the valleys..
     β€’ Initialize the network at valid memories and let it evolve
       – It will settle in a valley. If this is not the target pattern, raise it
     (Figure: energy vs. state)

  34. Training the Hopfield network
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P & y = valley} y y^T )
     β€’ Initialize W
     β€’ Compute the total outer product of all target patterns
       – More important patterns presented more frequently
     β€’ Initialize the network with each target pattern and let it evolve
       – And settle at a valley
     β€’ Compute the total outer product of the valley patterns
     β€’ Update the weights

  35. Training the Hopfield network: SGD version
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P & y = valley} y y^T )
     β€’ Initialize W
     β€’ Do until convergence, satisfaction, or death from boredom:
       – Sample a target pattern y_p
         β€’ The sampling frequency of a pattern must reflect its importance
       – Initialize the network at y_p and let it evolve
         β€’ And settle at a valley y_v
       – Update the weights: W = W + Ξ· ( y_p y_p^T - y_v y_v^T )

  36. A possible problem
     β€’ What if there's another target pattern down-valley?
       – Raising it will destroy a better-represented or stored pattern!
     (Figure: energy vs. state)

  37. A related issue
     β€’ Really no need to raise the entire surface, or even every valley
     (Figure: energy vs. state)

  38. A related issue
     β€’ Really no need to raise the entire surface, or even every valley
     β€’ Raise the neighborhood of each target memory
       – Sufficient to make the memory a valley
       – The broader the neighborhood considered, the broader the valley
     (Figure: energy vs. state)

  39. Raising the neighborhood
     β€’ Starting from a target pattern, let the network evolve only a few steps
       – Try to raise the resultant location
     β€’ This will raise the neighborhood of targets
     β€’ And it will avoid the problem of down-valley targets
     (Figure: energy vs. state)

  40. Training the Hopfield network: SGD version
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} y y^T - Ξ£_{y βˆ‰ Y_P & y = valley} y y^T )
     β€’ Initialize W
     β€’ Do until convergence, satisfaction, or death from boredom:
       – Sample a target pattern y_p
         β€’ The sampling frequency of a pattern must reflect its importance
       – Initialize the network at y_p and let it evolve a few steps (2-4)
         β€’ And arrive at a down-valley position y_d
       – Update the weights: W = W + Ξ· ( y_p y_p^T - y_d y_d^T )
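Relative to the earlier train_sgd sketch, only the source of the negative pattern changes: start from the sampled target itself and evolve for just a few sweeps (my reading of the slide's "2-4 steps"; the slide may equally mean single-bit updates). A self-contained sketch of one such update:

```python
import numpy as np

def sgd_step_near_target(W, y_p, eta=0.01, n_sweeps=3, rng=None):
    """One update in which the 'negative' pattern is a nearby down-valley point,
    reached by evolving only a few sweeps from the target, so only its neighborhood is raised."""
    rng = rng or np.random.default_rng()
    y_d = y_p.astype(float)
    for _ in range(n_sweeps):                       # a few asynchronous update sweeps
        for i in rng.permutation(len(y_d)):
            y_d[i] = 1.0 if W[i] @ y_d > 0 else -1.0
    W += eta * (np.outer(y_p, y_p) - np.outer(y_d, y_d))
    return W
```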

  41. A probabilistic interpretation
     β€’ P(y) = C exp( 1/2 y^T W y ),  with E(y) = -1/2 y^T W y
     β€’ For continuous y, the energy of a pattern is a perfect analog to the negative log likelihood of a Gaussian density
     β€’ For binary y it is the analog of the negative log likelihood of a Boltzmann distribution
       – Minimizing energy maximizes log likelihood

  42. The Boltzmann distribution
     β€’ E(y) = -1/2 y^T W y - b^T y
     β€’ P(y) = C exp( -E(y) / kT ), with C = 1 / Ξ£_y exp( -E(y) / kT )
     β€’ k is the Boltzmann constant
     β€’ T is the temperature of the system
     β€’ The energy terms are like the log likelihood of a Boltzmann distribution at T = 1
       – The derivation of this probability is in fact quite trivial..
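For a toy N the Boltzmann distribution on this slide can be computed exactly. The sketch below (my own) enumerates all Β±1 states, computes E(y) = -1/2 y^T W y - b^T y, and normalizes exp(-E/kT) with k = 1; lowering T concentrates the probability mass on the minimum-energy states.

```python
import numpy as np
from itertools import product

def boltzmann_distribution(W, b, T=1.0):
    """Return all +/-1 states and their probabilities P(y) ~ exp(-E(y)/T), with k = 1."""
    N = W.shape[0]
    states = [np.array(s, dtype=float) for s in product([-1, 1], repeat=N)]
    energies = np.array([-0.5 * y @ W @ y - b @ y for y in states])
    p = np.exp(-energies / T)
    return states, p / p.sum()
```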

  43. Continuing the Boltzmann analogy
     β€’ E(y) = -1/2 y^T W y - b^T y
     β€’ P(y) = C exp( -E(y) / kT )
     β€’ The system probabilistically selects states with lower energy
       – With infinitesimally slow cooling, at T = 0, it arrives at the global minimal state

  44. Spin glasses and Hopfield nets
     (Figure: energy vs. state)
     β€’ Selecting a next state is akin to drawing a sample from the Boltzmann distribution at T = 1, in a universe where k = 1

  45. Optimizing W
     β€’ E(y) = -1/2 y^T W y
     β€’ WΜ‚ = argmin_W [ Ξ£_{y ∈ Y_P} E(y) - Ξ£_{y βˆ‰ Y_P} E(y) ]
     β€’ Simple gradient descent:
       W = W + Ξ· ( Ξ£_{y ∈ Y_P} Ξ±_y y y^T - Ξ£_{y βˆ‰ Y_P} Ξ²(E(y)) y y^T )
       – Ξ±_y: more importance to more frequently presented memories
       – Ξ²(E(y)): more importance to more attractive spurious memories

  46. Optimizing W
     β€’ W = W + Ξ· ( Ξ£_{y ∈ Y_P} Ξ±_y y y^T - Ξ£_{y βˆ‰ Y_P} Ξ²(E(y)) y y^T )
       – Ξ±_y: more importance to more frequently presented memories
       – Ξ²(E(y)): more importance to more attractive spurious memories
     β€’ THIS LOOKS LIKE AN EXPECTATION!

  47. Optimizing W
     β€’ WΜ‚ = argmin_W [ Ξ£_{y ∈ Y_P} E(y) - Ξ£_{y βˆ‰ Y_P} E(y) ]
     β€’ Update rule:
       W = W + Ξ· ( E_{y ~ Y_P}[ y y^T ] - E_{y ~ Y}[ y y^T ] )
     β€’ Natural distribution for the variables: the Boltzmann distribution
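Written as expectations, the update compares the data average of y y^T with the model average under the Boltzmann distribution. The sketch below (mine) computes the model expectation exactly by enumeration, which is only possible for a tiny network; in practice it would be estimated from samples.

```python
import numpy as np
from itertools import product

def expectation_update(W, b, targets, eta=0.01):
    """W <- W + eta ( E_{y~targets}[y y^T] - E_{y~Boltzmann(W,b)}[y y^T] ), exact for tiny N."""
    data = np.array(targets, dtype=float)
    pos = np.einsum('pi,pj->ij', data, data) / len(data)        # data expectation of y y^T
    states = np.array(list(product([-1.0, 1.0], repeat=W.shape[0])))
    E = -0.5 * np.einsum('si,ij,sj->s', states, W, states) - states @ b
    p = np.exp(-E); p /= p.sum()                                # Boltzmann distribution at T = 1
    neg = np.einsum('s,si,sj->ij', p, states, states)           # model expectation of y y^T
    return W + eta * (pos - neg)
```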

  48. Continuing on..
     β€’ The Hopfield net as a Boltzmann distribution
     β€’ Adding capacity to a Hopfield network
       – The Boltzmann machine

  49. Continuing on..
     β€’ (Repeat of the previous outline slide)

  50. Storing more than N patterns
     β€’ The memory capacity of an N-bit network is at most N
       – Stable patterns (not necessarily even stationary)
       – Abu-Mostafa and St. Jacques, 1985
       – Although the "information capacity" is O(N^3)
     β€’ How do we increase the capacity of the network?
       – Store more patterns

  51. Expanding the network
     (Figure: N neurons augmented with K additional neurons)
     β€’ Add a large number of neurons whose actual values you don't care about!

  52. Expanded network
     (Figure: N neurons augmented with K additional neurons)
     β€’ New capacity: ~(N + K) patterns
       – Although we only care about the pattern of the first N neurons
       – We're interested in N-bit patterns

  53. Terminology
     (Figure: visible and hidden neurons)
     β€’ Terminology:
       – The neurons that store the actual patterns of interest: visible neurons
       – The neurons that only serve to increase the capacity, but whose actual values are not important: hidden neurons
       – These can be set to anything in order to store a visible pattern

  54. Training the network
     (Figure: visible and hidden neurons)
     β€’ For a given pattern of visible neurons, there are any number of hidden patterns (2^K)
     β€’ Which of these do we choose?
       – Ideally, choose the one that results in the lowest energy
       – But that's an exponential search space!
     β€’ Solution: combinatorial optimization
       – Simulated annealing

  55. The patterns
     β€’ In fact we could have multiple hidden patterns coupled with any visible pattern
       – These would be multiple stored patterns that all give the same visible output
       – How many do we permit?
     β€’ Do we need to specify one or more particular hidden patterns?
       – How about all of them?
       – What do I mean by this bizarre statement?

  56. But first..
     β€’ The Hopfield net as a distribution..

  57. Revisiting thermodynamic phenomena
     (Figure: potential energy vs. state)
     β€’ Is the system actually in a specific state at any time?
     β€’ No – the state is actually continuously changing
       – Based on the temperature of the system
       – At higher temperatures, the state changes more rapidly
     β€’ What is actually being characterized is the probability of the state
       – And the expected value of the state

  58. The Helmholtz free energy of a system
     β€’ A thermodynamic system at temperature T can exist in one of many states
       – Potentially infinitely many states
       – At any time, the probability of finding the system in state s at temperature T is P_T(s)
     β€’ At each state s it has a potential energy E_s
     β€’ The internal energy of the system, representing its capacity to do work, is the average:
       U_T = Ξ£_s P_T(s) E_s

  59. The Helmholtz free energy of a system
     β€’ The capacity to do work is counteracted by the internal disorder of the system, i.e. its entropy:
       H_T = - Ξ£_s P_T(s) log P_T(s)
     β€’ The Helmholtz free energy of the system measures the useful work derivable from it and combines the two terms:
       F_T = U_T - kT H_T = Ξ£_s P_T(s) E_s + kT Ξ£_s P_T(s) log P_T(s)
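A small numeric illustration (my own) of the three quantities on this slide, for any distribution P over states with energies E:

```python
import numpy as np

def free_energy(P, E, T=1.0, k=1.0):
    """Helmholtz free energy F_T = U_T - k T H_T for a distribution P over states
    with energies E (arrays of the same length)."""
    P = np.asarray(P, dtype=float)
    E = np.asarray(E, dtype=float)
    U = np.sum(P * E)                    # internal energy  U_T = sum_s P_T(s) E_s
    H = -np.sum(P * np.log(P))           # entropy          H_T = -sum_s P_T(s) log P_T(s)
    return U - k * T * H
```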

  60. The Helmholtz free energy of a system
     β€’ F_T = Ξ£_s P_T(s) E_s + kT Ξ£_s P_T(s) log P_T(s)
     β€’ A system held at a specific temperature anneals by varying the rate at which it visits the various states, so as to reduce the free energy of the system, until a minimum-free-energy state is achieved
     β€’ The probability distribution of the states at steady state is known as the Boltzmann distribution

  61. The Helmholtz free energy of a system
     β€’ F_T = Ξ£_s P_T(s) E_s + kT Ξ£_s P_T(s) log P_T(s)
     β€’ Minimizing this w.r.t. P_T(s), we get
       P_T(s) = (1/Z) exp( -E_s / kT )
       – Also known as the Gibbs distribution
       – Z is a normalizing constant
       – Note the dependence on T
       – At T = 0, the system will always remain at the lowest-energy configuration with probability 1
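The slide calls the derivation trivial; for completeness, here is the minimization written out (my own filling-in), using a Lagrange multiplier for the normalization constraint:

```latex
% Minimize F_T = \sum_s P(s)E_s + kT\sum_s P(s)\log P(s) subject to \sum_s P(s) = 1.
\mathcal{L} = \sum_s P(s)E_s + kT\sum_s P(s)\log P(s) + \mu\Big(\sum_s P(s)-1\Big),
\qquad
\frac{\partial\mathcal{L}}{\partial P(s)} = E_s + kT\big(\log P(s)+1\big) + \mu = 0
\;\Rightarrow\;
P_T(s) = \frac{1}{Z}\exp\!\Big(-\frac{E_s}{kT}\Big),
\quad
Z = \sum_{s'}\exp\!\Big(-\frac{E_{s'}}{kT}\Big).
```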

  62. The energy of the network
     (Figure: network of visible neurons)
     β€’ E(S) = - Ξ£_{i<j} w_ij s_i s_j - Ξ£_i b_i s_i
     β€’ P(S) = exp( -E(S) ) / Ξ£_{S'} exp( -E(S') )
     β€’ We can define the energy of the system as before
     β€’ Since the neurons are stochastic, there is disorder or entropy (with T = 1)
     β€’ The equilibrium probability distribution over states is the Boltzmann distribution at T = 1
       – This is the probability of the different states that the network will wander over at equilibrium

  63. The Hopfield net is a distribution
     (Figure: network of visible neurons)
     β€’ E(S) = - Ξ£_{i<j} w_ij s_i s_j - Ξ£_i b_i s_i
     β€’ P(S) = exp( -E(S) ) / Ξ£_{S'} exp( -E(S') )
     β€’ The stochastic Hopfield network models a probability distribution over states
       – Where a state is a binary string
       – Specifically, it models a Boltzmann distribution
       – The parameters of the model are the weights of the network
     β€’ The probability that (at equilibrium) the network will be in any state S is P(S)
       – It is a generative model: it generates states according to P(S)

  64. The field at a single node
     β€’ Let S and S' be otherwise identical states that only differ in the i-th bit
       – S has the i-th bit = +1 and S' has the i-th bit = -1
     β€’ P(S) = P(s_i = 1 | s_{jβ‰ i}) P(s_{jβ‰ i})
     β€’ P(S') = P(s_i = -1 | s_{jβ‰ i}) P(s_{jβ‰ i})
     β€’ log P(S) - log P(S') = log P(s_i = 1 | s_{jβ‰ i}) - log P(s_i = -1 | s_{jβ‰ i})
       = log [ P(s_i = 1 | s_{jβ‰ i}) / (1 - P(s_i = 1 | s_{jβ‰ i})) ]

  65. The field at a single node
     β€’ Let S and S' be the states with the i-th bit in the +1 and -1 states respectively
     β€’ log P(S) = -E(S) + C
     β€’ Writing E_noti for the part of the energy that does not involve bit i:
       – E(S)  = -1/2 ( E_noti + Ξ£_{jβ‰ i} w_ji s_j + b_i )
       – E(S') = -1/2 ( E_noti - Ξ£_{jβ‰ i} w_ji s_j - b_i )
     β€’ log P(S) - log P(S') = E(S') - E(S) = Ξ£_{jβ‰ i} w_ji s_j + b_i

  66. The field at a single node
     β€’ log [ P(s_i = 1 | s_{jβ‰ i}) / (1 - P(s_i = 1 | s_{jβ‰ i})) ] = Ξ£_{jβ‰ i} w_ji s_j + b_i
     β€’ Giving us
       P(s_i = 1 | s_{jβ‰ i}) = 1 / ( 1 + exp( -( Ξ£_{jβ‰ i} w_ji s_j + b_i ) ) )
     β€’ The probability of any node taking value 1 given the other node values is a logistic

  67. Redefining the network
     (Figure: network of visible neurons)
     β€’ z_i = Ξ£_j w_ji s_j + b_i
     β€’ P(s_i = 1 | s_{jβ‰ i}) = 1 / ( 1 + exp( -z_i ) )
     β€’ First try: redefine a regular Hopfield net as a stochastic system
     β€’ Each neuron is now a stochastic unit with a binary state s_i, which can take value 0 or 1 with a probability that depends on the local field
       – Note the slight change from Hopfield nets
       – Not actually necessary; only a matter of convenience
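As a sanity check (my own construction, using the 0/1 convention this slide adopts and an arbitrary random symmetric W), the logistic-of-the-field formula can be compared against the conditional computed directly from the Boltzmann weights of the two competing states:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(W, b, s):
    """E(S) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i, for 0/1 states and symmetric W."""
    return -np.sum(np.triu(W, k=1) * np.outer(s, s)) - b @ s

def check_conditional(W, b, s, i):
    """The two returned numbers should agree: logistic of the field vs. brute force."""
    z_i = W[i] @ s - W[i, i] * s[i] + b[i]          # local field, excluding the self term
    s1, s0 = s.copy(), s.copy()
    s1[i], s0[i] = 1.0, 0.0
    w1, w0 = np.exp(-energy(W, b, s1)), np.exp(-energy(W, b, s0))
    return sigmoid(z_i), w1 / (w1 + w0)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)); W = (A + A.T) / 2      # random symmetric weights (toy example)
b = rng.normal(size=5)
s = rng.integers(0, 2, size=5).astype(float)
print(check_conditional(W, b, s, i=2))
```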

  68. The Hopfield net is a distribution
     (Figure: network of visible neurons)
     β€’ z_i = Ξ£_j w_ji s_j + b_i
     β€’ P(s_i = 1 | s_{jβ‰ i}) = 1 / ( 1 + exp( -z_i ) )
     β€’ The Hopfield net is a probability distribution over binary sequences
       – The Boltzmann distribution
     β€’ The conditional distribution of the individual bits in the sequence is a logistic

  69. Running the network
     (Figure: network of visible neurons)
     β€’ z_i = Ξ£_j w_ji s_j + b_i
     β€’ P(s_i = 1 | s_{jβ‰ i}) = 1 / ( 1 + exp( -z_i ) )
     β€’ Initialize the neurons
     β€’ Cycle through the neurons and randomly set each neuron to 1 or -1 according to the probability given above
       – Gibbs sampling: fix N-1 variables and sample the remaining variable
       – As opposed to the energy-based update (mean-field approximation): run the test z_i > 0 ?
     β€’ After many, many iterations (until "convergence"), sample the individual neurons
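A minimal sketch (mine) of the procedure on this slide: sweep over the neurons, resampling each one from the logistic of its local field; after a burn-in the visited states are approximately samples from the network's Boltzmann distribution. The 0/1 convention, sweep count, and burn-in length are arbitrary choices, not prescribed by the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_run(W, b, n_sweeps=1000, burn_in=100, seed=0):
    """Run the stochastic network by Gibbs sampling: fix all bits but one and resample it."""
    rng = np.random.default_rng(seed)
    N = len(b)
    s = rng.integers(0, 2, size=N).astype(float)     # random initialization
    samples = []
    for sweep in range(n_sweeps):
        for i in range(N):                           # cycle through the neurons
            z_i = W[i] @ s - W[i, i] * s[i] + b[i]   # local field, excluding the self term
            s[i] = 1.0 if rng.random() < sigmoid(z_i) else 0.0
        if sweep >= burn_in:
            samples.append(s.copy())                 # states visited at (approximate) equilibrium
    return np.array(samples)
```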
