
Neural Networks: Hopfield Nets and Boltzmann Machines (Fall 2017)

Recap: Hopfield network • Each unit computes y_i = +1 if Σ_j w_ji y_j > 0, and y_i = −1 otherwise • The network is symmetric (w_ij = w_ji) and loopy
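
To make the recap concrete, here is a minimal NumPy sketch (not from the slides) of the update just described: a unit flips to +1 when its net input is positive and to −1 otherwise, and the network is run until no unit changes.

```python
import numpy as np

def hopfield_evolve(W, y, max_sweeps=100, rng=None):
    """Asynchronously update a +/-1 state vector y under symmetric weights W
    (assumed to have zero diagonal) until no unit changes. A sketch of the
    recap above, not the course's reference code."""
    rng = np.random.default_rng() if rng is None else rng
    y = y.copy()
    N = len(y)
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(N):            # random asynchronous update order
            field = W[i] @ y                     # net input to unit i
            new_yi = 1 if field > 0 else -1      # threshold activation
            if new_yi != y[i]:
                y[i] = new_yi
                changed = True
        if not changed:                          # fixed point: a stored or spurious memory
            break
    return y
```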


  1. Four non-orthogonal 6-bit patterns • Patterns are perfectly stationary and stable even for K > 0.14N • Fewer spurious minima than for the orthogonal 2-pattern case – Most of the seemingly spurious memories are in fact ghosts (negations of stored patterns)

  2. Six non-orthogonal 6-bit patterns • Breakdown is largely due to interference from “ghosts” • But patterns are stationary, and often stable – For K >> 0.14N

  3. More visualization • Let's inspect a few 8-bit patterns – Keeping in mind that the Karnaugh map is now a 4-dimensional tesseract

  4. One 8-bit pattern • It's actually cleanly stored, but there are a few spurious minima

  5. Two orthogonal 8-bit patterns • Both have regions of attraction • Some spurious minima

  6. Two non-orthogonal 8-bit patterns • These actually have fewer spurious minima – Not obvious from the visualization

  7. Four orthogonal 8-bit patterns • Successfully stored

  8. Four non-orthogonal 8-bit patterns • Stored, but with interference from ghosts

  9. Eight orthogonal 8-bit patterns • Wipeout

  10. Eight non-orthogonal 8-bit patterns • Nothing is stored – Patterns are neither stationary nor stable

  11. Making sense of the behavior • It seems possible to store K > 0.14N patterns – i.e. obtain a weight matrix W such that K > 0.14N patterns are stationary – Possible to make more than 0.14N patterns at least 1-bit stable • So what was Hopfield talking about? • Patterns that are non-orthogonal are easier to remember – I.e. patterns that are closer are easier to remember than patterns that are farther apart! • Can we attempt to get greater control over the process than Hebbian learning gives us?

  12. Bold Claim • I can always store (up to) N orthogonal patterns such that they are stationary! – Although not necessarily stable • Why?

  13. “Training” the network • How do we make the network store a specific pattern or set of patterns? – Hebbian learning – Geometric approach – Optimization • Secondary question – How many patterns can we store?

  14. A minor adjustment • The behavior of E(y) = yᵀWy with W = YYᵀ − N_p·I is identical to the behavior with W = YYᵀ – The energy landscape only differs by an additive constant; gradients and the locations of minima remain the same – Since yᵀ(YYᵀ − N_p·I)y = yᵀYYᵀy − N·N_p • But W = YYᵀ is easier to analyze, hence in the following slides we will use W = YYᵀ

  15. A minor adjustment • The behavior of E(y) = yᵀWy with W = YYᵀ − N_p·I is identical to the behavior with W = YYᵀ – Both have the same eigenvectors – The energy landscape only differs by an additive constant; gradients and the locations of minima remain the same – Since yᵀ(YYᵀ − N_p·I)y = yᵀYYᵀy − N·N_p • But W = YYᵀ is easier to analyze, hence in the following slides we will use W = YYᵀ

  16. A minor adjustment • The behavior of E(y) = yᵀWy with W = YYᵀ − N_p·I is identical to the behavior with W = YYᵀ – Both have the same eigenvectors – The energy landscape only differs by an additive constant; gradients and the locations of minima remain the same – NOTE: YYᵀ is a positive semidefinite matrix – Since yᵀ(YYᵀ − N_p·I)y = yᵀYYᵀy − N·N_p • But W = YYᵀ is easier to analyze, hence in the following slides we will use W = YYᵀ
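
As a quick numerical check of this claim (a made-up example, not from the slides): for any ±1 state, the energies under W = YYᵀ and under W = YYᵀ − N_p·I differ by exactly the constant N·N_p, so the landscape, gradients and minima coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Np = 8, 3
Y = rng.choice([-1, 1], size=(N, Np))         # N_p patterns as columns of Y
W1 = Y @ Y.T                                   # W = Y Y^T
W2 = Y @ Y.T - Np * np.eye(N)                  # W = Y Y^T - N_p I

y = rng.choice([-1, 1], size=N)                # any +/-1 state
print(y @ W1 @ y - y @ W2 @ y, N * Np)         # the gap is always N * N_p
```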

  17. Consider the energy function E = −½ yᵀWy − bᵀy • Reinstating the bias term for completeness' sake – Remember that we don’t actually use it in a Hopfield net

  18. Consider the energy function E = −½ yᵀWy − bᵀy • This is a quadratic! For Hebbian learning W is positive semidefinite, so E is convex • Reinstating the bias term for completeness' sake – Remember that we don’t actually use it in a Hopfield net

  19. The energy function E = −½ yᵀWy − bᵀy • E is a convex quadratic

  20. The energy function E = −½ yᵀWy − bᵀy • E is a convex quadratic – Shown from above (assuming 0 bias) • But the components of y can only take values ±1 – I.e. y lies on the corners of the unit hypercube

  21. The energy function E = −½ yᵀWy − bᵀy • E is a convex quadratic – Shown from above (assuming 0 bias) • But the components of y can only take values ±1 – I.e. y lies on the corners of the unit hypercube

  22. The energy function E = −½ yᵀWy − bᵀy • The stored values of y are the ones where all adjacent corners are higher on the quadratic – Hebbian learning attempts to make the quadratic steep in the vicinity of stored patterns

  23. Patterns you can store (figure: stored patterns and their ghosts, i.e. negations, on the hypercube) • Stored patterns should ideally be maximally separated on the hypercube – The number of patterns we can store depends on the actual distance between the patterns

  24. Storing patterns • A pattern y_p is stored if: – sign(Wy_p) = y_p for all target patterns • Note: for binary vectors sign(·) is a projection – It projects y onto the nearest corner of the hypercube – It “quantizes” the space into orthants

  25. Storing patterns • A pattern y_p is stored if: – sign(Wy_p) = y_p for all target patterns • Training: design W such that this holds • Simple solution: y_p is an eigenvector of W with a positive eigenvalue: Wy_p = λ y_p – More generally, orthant(Wy_p) = orthant(y_p) • How many such y_p can we have?
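
The stationarity test above is easy to run numerically. A minimal sketch, assuming the Hebbian choice W = YYᵀ from the preceding slides and randomly drawn ±1 patterns:

```python
import numpy as np

def is_stationary(W, y):
    """A pattern y is stationary if sign(W y) lands back in the orthant of y."""
    return np.array_equal(np.sign(W @ y), y)   # assumes no component of W @ y is exactly 0

rng = np.random.default_rng(1)
N, K = 32, 4
Y = rng.choice([-1, 1], size=(N, K))           # K target patterns as columns
W = Y @ Y.T                                     # Hebbian weights, W = Y Y^T

print([is_stationary(W, Y[:, k]) for k in range(K)])
```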

  26. Only N patterns? • Patterns that differ in N/2 bits are orthogonal (e.g. (1,1) and (1,−1) in two dimensions) • You can have no more than N orthogonal vectors in an N-dimensional space

  27. Another random fact that should interest you • The eigenvectors of any symmetric matrix W are orthogonal • The eigenvalues may be positive or negative

  28. Storing more than one pattern • Requirement: given y_1, y_2, …, y_P – Design W such that • sign(Wy_p) = y_p for all target patterns • There are no other binary vectors for which this holds • What is the largest number of patterns that can be stored?

  29. Storing K orthogonal patterns • Simple solution: design W such that y_1, y_2, …, y_K are the eigenvectors of W – Let Y = [y_1 y_2 … y_K], W = YΛYᵀ – λ_1, …, λ_K are positive – For λ_1 = λ_2 = … = λ_K = 1 this is exactly the Hebbian rule • The patterns are provably stationary
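
A sketch of this construction. The K orthogonal ±1 patterns are taken here from a Hadamard matrix, which is just one convenient (assumed) way of producing an orthogonal ±1 set; with all eigenvalues set to 1 the construction reduces to the Hebbian W = YYᵀ and every stored pattern is stationary:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n +/-1 orthogonal matrix (n a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

N, K = 8, 4
Y = hadamard(N)[:, :K]                  # K orthogonal +/-1 patterns as columns
Lam = np.eye(K)                         # lambda_1 = ... = lambda_K = 1
W = Y @ Lam @ Y.T                       # with these eigenvalues, exactly the Hebbian rule

for k in range(K):
    y_k = Y[:, k]
    print(np.array_equal(np.sign(W @ y_k), y_k))   # each stored pattern is stationary
```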

  30. Hebbian rule • In reality – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N], W = YΛYᵀ – r_{K+1} … r_N are orthogonal to y_1 … y_K – λ_1 = λ_2 = … = λ_K = 1 – λ_{K+1}, …, λ_N = 0 • All patterns orthogonal to y_1 … y_K are also stationary – Although not stable

  31. Storing N orthogonal patterns • When we have N orthogonal (or near-orthogonal) patterns y_1, y_2, …, y_N – Y = [y_1 y_2 … y_N], W = YΛYᵀ – λ_1 = λ_2 = … = λ_N = 1 • The eigenvectors of W span the space • Also, for any y_k, Wy_k = y_k

  32. Storing N orthogonal patterns • The N orthogonal patterns y_1, y_2, …, y_N span the space • Any pattern y can be written as y = a_1 y_1 + a_2 y_2 + ⋯ + a_N y_N, so Wy = a_1 Wy_1 + a_2 Wy_2 + ⋯ + a_N Wy_N = a_1 y_1 + a_2 y_2 + ⋯ + a_N y_N = y • All patterns are stable – Remembers everything – Completely useless network
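
A quick illustration of this "remembers everything" behavior (again using a Hadamard matrix as an assumed orthogonal ±1 basis): with all N orthogonal patterns stored, W = YYᵀ = N·I, so every one of the 2^N states is a fixed point.

```python
import numpy as np
from itertools import product

def hadamard(n):                         # Sylvester construction, n a power of 2
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

N = 4
Y = hadamard(N)                          # N orthogonal +/-1 patterns
W = Y @ Y.T                              # equals N * I here

all_states_stable = all(
    np.array_equal(np.sign(W @ np.array(s)), np.array(s))
    for s in product([-1, 1], repeat=N)
)
print(all_states_stable)                 # True: every state is "remembered"
```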

  33. Storing K orthogonal patterns • Even if we store fewer than N patterns – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N], W = YΛYᵀ – r_{K+1} … r_N are orthogonal to y_1 … y_K – λ_1 = λ_2 = … = λ_K = 1 – λ_{K+1}, …, λ_N = 0 • All patterns orthogonal to y_1 … y_K are stationary • Any pattern that is entirely in the subspace spanned by y_1 … y_K is also stable (same logic as earlier) • Only patterns that are partially in the subspace spanned by y_1 … y_K are unstable – They get projected onto the subspace spanned by y_1 … y_K

  34. Problem with the Hebbian Rule • Even if we store fewer than N patterns – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N], W = YΛYᵀ – r_{K+1} … r_N are orthogonal to y_1 … y_K – λ_1 = λ_2 = … = λ_K = 1 • Problems arise because the eigenvalues are all 1.0 – This ensures stationarity of vectors in the subspace – What if we get rid of this requirement?

  35. Hebbian rule and general (non-orthogonal) vectors: w_ji = Σ_p y_i^p y_j^p • What happens when the patterns are not orthogonal? • What happens when the patterns are presented more than once? – Different patterns presented different numbers of times – Equivalent to having unequal eigenvalues • Can we predict the evolution of any vector y? – Hint: Lanczos iterations – We can write Y_P = Y_ortho B, so W = Y_ortho B Λ Bᵀ Y_orthoᵀ

  36. The bottom line • With a network of N units (i.e. N-bit patterns) • The maximum number of stable patterns is actually exponential in N – McEliece and Posner, 1984 – E.g. when we had the Hebbian net with N orthogonal base patterns, all patterns were stable • For a specific set of K patterns, we can always build a network for which all K patterns are stable provided K ≤ N – Abu-Mostafa and St. Jacques, 1985 • For large N, the upper bound on K is actually N/(4 log N) – McEliece et al., 1987 – But this may come with many “parasitic” memories

  37. The bottom line • With a network of N units (i.e. N-bit patterns) • The maximum number of stable patterns is actually exponential in N – McEliece and Posner, 1984 – E.g. when we had the Hebbian net with N orthogonal base patterns, all patterns were stable • For a specific set of K patterns, we can always build a network for which all K patterns are stable provided K ≤ N – Abu-Mostafa and St. Jacques, 1985 • For large N, the upper bound on K is actually N/(4 log N) – McEliece et al., 1987 – But this may come with many “parasitic” memories • How do we find this network?

  38. The bottom line • With a network of N units (i.e. N-bit patterns) • The maximum number of stable patterns is actually exponential in N – McEliece and Posner, 1984 – E.g. when we had the Hebbian net with N orthogonal base patterns, all patterns were stable • For a specific set of K patterns, we can always build a network for which all K patterns are stable provided K ≤ N – Abu-Mostafa and St. Jacques, 1985 • For large N, the upper bound on K is actually N/(4 log N) – McEliece et al., 1987 – But this may come with many “parasitic” memories • How do we find this network? • Can we do something about this?

  39. A different tack • How do we make the network store a specific pattern or set of patterns? – Hebbian learning – Geometric approach – Optimization • Secondary question – How many patterns can we store?

  40. Consider the energy function E = −½ yᵀWy − bᵀy • This must be maximally low for target patterns • It must be maximally high for all other patterns – So that they are unstable and evolve into one of the target patterns

  41. Alternate Approach to Estimating the Network: E(y) = −½ yᵀWy − bᵀy • Estimate W (and b) such that – E is minimized for y_1, y_2, …, y_P – E is maximized for all other y • Caveat: it is unrealistic to expect to store more than N patterns, but can we make those N patterns memorable?

  42. Optimizing W (and b): E(y) = −½ yᵀWy, Ŵ = argmin_W Σ_{y∈Y_P} E(y) – The bias can be captured by another fixed-value component • Minimize the total energy of the target patterns – Problem with this?

  43. Optimizing W: E(y) = −½ yᵀWy, Ŵ = argmin_W [ Σ_{y∈Y_P} E(y) − Σ_{y∉Y_P} E(y) ] • Minimize the total energy of the target patterns • Maximize the total energy of all non-target patterns

  44. Optimizing W: E(y) = −½ yᵀWy, Ŵ = argmin_W [ Σ_{y∈Y_P} E(y) − Σ_{y∉Y_P} E(y) ] • Simple gradient descent: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P} yyᵀ )
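
For very small N this update can be implemented literally, by brute-force enumeration of every ±1 state (an illustrative sketch only; the sum over non-target patterns has 2^N terms, which is exactly why the later slides look for cheaper alternatives):

```python
import numpy as np
from itertools import product

def full_gradient_step(W, targets, eta=0.01):
    """One step of W <- W + eta * (sum_{y in targets} y y^T - sum_{y not in targets} y y^T),
    enumerating all 2^N states. Only sensible for tiny N."""
    N = W.shape[0]
    target_set = {tuple(y) for y in targets}
    grad = np.zeros_like(W, dtype=float)
    for s in product([-1, 1], repeat=N):
        y = np.array(s)
        sign = +1 if tuple(y) in target_set else -1   # lower targets, raise everything else
        grad += sign * np.outer(y, y)
    return W + eta * grad

# Hypothetical example: two 6-bit target patterns
targets = [np.array([1, 1, 1, -1, -1, -1]), np.array([1, -1, 1, -1, 1, -1])]
W = np.zeros((6, 6))
for _ in range(50):
    W = full_gradient_step(W, targets)
```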

  45. Optimizing W: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P} yyᵀ ) • Can “emphasize” the importance of a pattern by repeating it – More repetitions → greater emphasis

  46. Optimizing W: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P} yyᵀ ) • Can “emphasize” the importance of a pattern by repeating it – More repetitions → greater emphasis • How many of the non-target patterns do we need? – Do we need to include all of them? – Are all equally important?

  47. The training again: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P} yyᵀ ) • Note the energy contour of a Hopfield network for any weight W (figure: energy vs. state; the bowls will all actually be quadratic)

  48. The training again: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P} yyᵀ ) • The first term tries to minimize the energy at target patterns – Make them local minima – Emphasize more “important” memories by repeating them more frequently (figure: target patterns marked on the energy-vs.-state curve)

  49. The negative class: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P} yyᵀ ) • The second term tries to “raise” all non-target patterns – Do we need to raise everything?

  50. Option 1: Focus on the valleys: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P & y=valley} yyᵀ ) • Focus on raising the valleys – If you raise every valley, eventually they’ll all move up above the target patterns, and many will even vanish

  51. Identifying the valleys: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P & y=valley} yyᵀ ) • Problem: how do you identify the valleys for the current W?

  52. Identifying the valleys • Initialize the network randomly and let it evolve – It will settle in a valley

  53. Training the Hopfield network: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P & y=valley} yyᵀ ) • Initialize W • Compute the total outer product of all target patterns – More important patterns are presented more frequently • Randomly initialize the network several times and let it evolve – And settle at a valley • Compute the total outer product of the valley patterns • Update the weights

  54. Training the Hopfield network: SGD version: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P & y=valley} yyᵀ ) • Initialize W • Do until convergence, satisfaction, or death from boredom: – Sample a target pattern y_p • The sampling frequency of a pattern must reflect its importance – Randomly initialize the network and let it evolve • And settle at a valley y_v – Update the weights: W = W + η ( y_p y_pᵀ − y_v y_vᵀ )

  55. Training the Hopfield network: SGD version: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P & y=valley} yyᵀ ) • Initialize W • Do until convergence, satisfaction, or death from boredom: – Sample a target pattern y_p • The sampling frequency of a pattern must reflect its importance – Randomly initialize the network and let it evolve • And settle at a valley y_v – Update the weights: W = W + η ( y_p y_pᵀ − y_v y_vᵀ )
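
A sketch of this SGD procedure, reusing the hopfield_evolve helper from the recap sketch near the top; the learning rate and epoch count are hypothetical choices:

```python
import numpy as np

def train_hopfield_sgd(targets, N, epochs=200, eta=0.01, seed=0):
    """SGD training as described above: lower the energy at a sampled target pattern,
    raise it at a valley reached by evolving from a random initial state."""
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for _ in range(epochs):
        y_p = targets[rng.integers(len(targets))]      # sample a target (uniformly here;
                                                       # weight by importance if desired)
        y0 = rng.choice([-1, 1], size=N)               # random initialization
        y_v = hopfield_evolve(W, y0, rng=rng)          # settle at a valley y_v
        W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))
    return W
```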

  56. Which valleys? • Should we randomly sample valleys? – Are all valleys equally important?

  57. Which valleys? • Should we randomly sample valleys? – Are all valleys equally important? • Major requirement: memories must be stable – They must be broad valleys • Spurious valleys in the neighborhood of memories are more important to eliminate

  58. Identifying the valleys • Initialize the network at valid memories and let it evolve – It will settle in a valley. If this is not the target pattern, raise it

  59. Training the Hopfield network: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P & y=valley} yyᵀ ) • Initialize W • Compute the total outer product of all target patterns – More important patterns are presented more frequently • Initialize the network with each target pattern and let it evolve – And settle at a valley • Compute the total outer product of the valley patterns • Update the weights

  60. Training the Hopfield network: SGD version: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P & y=valley} yyᵀ ) • Initialize W • Do until convergence, satisfaction, or death from boredom: – Sample a target pattern y_p • The sampling frequency of a pattern must reflect its importance – Initialize the network at y_p and let it evolve • And settle at a valley y_v – Update the weights: W = W + η ( y_p y_pᵀ − y_v y_vᵀ )

  61. A possible problem • What if there’s another target pattern down-valley? – Raising it will destroy a better-represented or stored pattern!

  62. A related issue • There is really no need to raise the entire surface, or even every valley

  63. A related issue • There is really no need to raise the entire surface, or even every valley • Raise the neighborhood of each target memory – It is sufficient to make the memory a valley – The broader the neighborhood considered, the broader the valley

  64. Raising the neighborhood • Starting from a target pattern, let the network evolve only a few steps – Try to raise the resultant location • This will raise the neighborhood of targets • And will avoid the problem of down-valley targets

  65. Training the Hopfield network: SGD version: W = W + η ( Σ_{y∈Y_P} yyᵀ − Σ_{y∉Y_P & y=valley} yyᵀ ) • Initialize W • Do until convergence, satisfaction, or death from boredom: – Sample a target pattern y_p • The sampling frequency of a pattern must reflect its importance – Initialize the network at y_p and let it evolve a few steps (2-4) • And arrive at a down-valley position y_d – Update the weights: W = W + η ( y_p y_pᵀ − y_d y_dᵀ )
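
Relative to the earlier SGD sketch, only the inner step changes: evolve from the target itself for a small number of updates and raise whatever down-valley state is reached. A sketch with an illustrative step count:

```python
import numpy as np

def evolve_few_steps(W, y, n_steps, rng):
    """Run only a few single-unit threshold updates starting from y."""
    y = y.copy()
    for _ in range(n_steps):
        i = rng.integers(len(y))
        y[i] = 1 if W[i] @ y > 0 else -1
    return y

def sgd_step_neighborhood(W, y_p, eta, rng, n_steps=3):
    """Raise the neighborhood of the target: evolve y_p a few steps to y_d, then update W."""
    y_d = evolve_few_steps(W, y_p, n_steps, rng)
    return W + eta * (np.outer(y_p, y_p) - np.outer(y_d, y_d))
```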

  66. A probabilistic interpretation • For continuous y, the energy E(y) = −½ yᵀWy of a pattern is a perfect analog to the negative log likelihood of a Gaussian density, P(y) = C exp(½ yᵀWy) • For binary y it is the analog of the negative log likelihood of a Boltzmann distribution – Minimizing energy maximizes log likelihood

  67. The Boltzmann Distribution: E(y) = −½ yᵀWy − bᵀy, P(y) = C exp(−E(y) / kT), with normalizing constant C = 1 / Σ_y exp(−E(y) / kT) • k is the Boltzmann constant • T is the temperature of the system • The energy term behaves like the negative log likelihood of a Boltzmann distribution at T = 1 – The derivation of this probability is in fact quite trivial

  68. Continuing the Boltzmann analogy: E(y) = −½ yᵀWy − bᵀy, P(y) = C exp(−E(y) / kT) • The system probabilistically selects states with lower energy – With infinitesimally slow cooling, at T = 0, it arrives at the global minimal state
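
The "probabilistically selects states with lower energy" behavior can be sketched with a stochastic version of the unit update: each unit is set to +1 with a logistic probability that depends on its net input and the temperature T, which reduces to the deterministic threshold rule as T → 0. This anticipates the Boltzmann machine; the annealing schedule below is an illustrative choice, not the course's reference code.

```python
import numpy as np

def stochastic_sweep(W, y, T, rng):
    """One Gibbs-style sweep at temperature T over a +/-1 state vector.
    Assumes symmetric W with zero diagonal; lower-energy settings are preferred
    but not forced, which is what lets the state escape shallow minima."""
    y = y.copy()
    for i in rng.permutation(len(y)):
        field = W[i] @ y                          # net input to unit i
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * field / T))
        y[i] = 1 if rng.random() < p_plus else -1
    return y

# Usage sketch: slow cooling toward T = 0 (hypothetical schedule and weights)
rng = np.random.default_rng(0)
N = 16
W = np.zeros((N, N))                              # substitute trained weights here
y = rng.choice([-1, 1], size=N)
for T in [4.0, 2.0, 1.0, 0.5, 0.1]:
    y = stochastic_sweep(W, y, T, rng)
```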
