

slide-1
SLIDE 1

Neural Networks

Hopfield Nets and Boltzmann Machines Spring 2018

1

slide-2
SLIDE 2
  • Symmetric loopy network
  • Each neuron is a perceptron with +1/-1 output

$y_i = \Theta\left(\sum_{j\ne i} w_{ji} y_j + b_i\right)$

$\Theta(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z \le 0 \end{cases}$

Recap: Hopfield network

2

slide-3
SLIDE 3

Recap: Hopfield network

  • At each time each neuron receives a β€œfield” $\sum_{j\ne i} w_{ji} y_j + b_i$
  • If the sign of the field matches its own sign, it does not respond
  • If the sign of the field opposes its own sign, it β€œflips” to match the sign of the field

$y_i = \Theta\left(\sum_{j\ne i} w_{ji} y_j + b_i\right)$

$\Theta(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z \le 0 \end{cases}$

3
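A minimal NumPy sketch of this update rule (the function and variable names are my own, not from the slides; it assumes a symmetric weight matrix with zero diagonal and ±1 states):

```python
import numpy as np

def hopfield_step(y, W, b=None):
    """One asynchronous update sweep: each neuron flips to match the sign of its field."""
    y = y.copy()
    b = np.zeros(len(y)) if b is None else b
    for i in np.random.permutation(len(y)):   # visit neurons in random order
        field = W[i] @ y + b[i]               # local field at neuron i
        y[i] = 1 if field > 0 else -1         # Theta: +1 if field > 0, else -1
    return y
```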

slide-4
SLIDE 4

Recap: Energy of a Hopfield Network

$E = -\sum_{i,\, j<i} w_{ij} y_i y_j$

  • The system will evolve until the energy hits a local minimum
  • In vector form, including a bias term (not typically used in Hopfield nets):

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

Not assuming node bias:

$y_i = \Theta\left(\sum_{j\ne i} w_{ji} y_j\right) \qquad \Theta(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z \le 0 \end{cases}$

4
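A matching sketch of the energy computation (same assumptions and naming as the sketch above; the bias defaults to zero for the no-bias case):

```python
def hopfield_energy(y, W, b=None):
    """E(y) = -1/2 y^T W y - b^T y for a +/-1 state vector y."""
    b = np.zeros(len(y)) if b is None else b
    return -0.5 * y @ W @ y - b @ y
```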

slide-5
SLIDE 5

Recap: Evolution

  • The network will evolve until it arrives at a

local minimum in the energy contour

[Figure: potential energy as a function of network state]

5

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

slide-6
SLIDE 6

Recap: Content-addressable memory

  • Each of the minima is a β€œstored” pattern

– If the network is initialized close to a stored pattern, it will inevitably evolve to the pattern

  • This is a content addressable memory

– Recall memory content from partial or corrupt values

  • Also called associative memory

[Figure: potential energy as a function of network state]

6

slide-7
SLIDE 7

Recap – Analogy: Spin Glasses

  • Magnetic dipoles
  • Each dipole tries to align itself to the local field

– In doing so it may flip

  • This will change fields at other dipoles

– Which may flip

  • Which changes the field at the current dipole…

7

slide-8
SLIDE 8

Recap – Analogy: Spin Glasses

  • The total energy of the system

$E = C - \frac{1}{2}\sum_i x_i f(p_i) = -\sum_i \sum_{j>i} J_{ij} x_i x_j - \sum_i b_i x_i$

  • The system evolves to minimize the energy
    – Dipoles stop flipping if flips result in increase of energy

Total field at current dipole:

$f(p_i) = \sum_{j \ne i} J_{ij} x_j + b_i$

Response of current dipole:

$x_i = \begin{cases} x_i & \text{if } \operatorname{sign}\left(x_i f(p_i)\right) = 1 \\ -x_i & \text{otherwise} \end{cases}$

8

slide-9
SLIDE 9

Recap : Spin Glasses

  • The system stops at one of its stable configurations

– Where energy is a local minimum

  • Any small jitter from this stable configuration returns it to the stable

configuration

– I.e. the system remembers its stable state and returns to it

[Figure: potential energy as a function of network state]

9

slide-10
SLIDE 10

Recap: Hopfield net computation

  • Very simple
  • Updates can be done sequentially, or all at once
  • Convergence

$E = -\sum_{i}\sum_{j>i} w_{ji} y_j y_i$

does not change significantly any more

  • 1. Initialize network with initial pattern

$y_i(0) = x_i, \quad 0 \le i \le N-1$

  • 2. Iterate until convergence

$y_i(t+1) = \Theta\left(\sum_{j\ne i} w_{ji} y_j\right), \quad 0 \le i \le N-1$

10
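A sketch of this recall loop, reusing `hopfield_step` from the earlier sketch (the convergence test and the `max_sweeps` cap are my own choices, not from the slides):

```python
def hopfield_recall(x, W, max_sweeps=100):
    """Initialize with pattern x and iterate until no neuron changes."""
    y = np.where(np.asarray(x) > 0, 1, -1)    # 1. initialize with the input pattern
    for _ in range(max_sweeps):
        y_new = hopfield_step(y, W)           # 2. one asynchronous sweep
        if np.array_equal(y_new, y):          # converged: a local minimum of the energy
            break
        y = y_new
    return y
```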

slide-11
SLIDE 11

Examples: Content addressable memory

  • http://staff.itee.uq.edu.au/janetw/cmc/chapters/Hopfield/

11

slide-12
SLIDE 12

β€œTraining” the network

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning – Geometric approach – Optimization

  • Secondary question

– How many patterns can we store?

12

slide-13
SLIDE 13

Recap: Hebbian Learning to Store a Specific Pattern

  • For a single stored pattern, Hebbian learning

results in a network for which the target pattern is a global minimum

HEBBIAN LEARNING: π‘₯

π‘˜π‘— = π‘§π‘˜π‘§π‘—

1

  • 1
  • 1
  • 1

1 13

𝐗 = π³π‘žπ³π‘ž

π‘ˆ βˆ’ I

slide-14
SLIDE 14

Storing multiple patterns

π‘₯

π‘˜π‘— = ෍ π‘žβˆˆ{π‘§π‘ž}

𝑧𝑗

π‘žπ‘§π‘˜ π‘ž

  • {π‘§π‘ž} is the set of patterns to store
  • Superscript π‘ž represents the specific pattern

1

  • 1
  • 1
  • 1

1 1 1

  • 1

1

  • 1

14

slide-15
SLIDE 15

Storing multiple patterns

  • Let π³π‘ž be the vector representing π‘ž-th pattern
  • Let 𝐙 = 𝐳1 𝐳2 … be a matrix with all the stored pattern
  • Then..

𝐗 = ෍

𝒒

(π³π‘žπ³π‘ž

π‘ˆ βˆ’ I) = π™π™π‘ˆ βˆ’ π‘‚π‘žπ‰

1

  • 1
  • 1
  • 1

1 1 1

  • 1

1

  • 1

15
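A sketch of this construction (my naming; patterns are ±1 vectors, one per row of `patterns`):

```python
def hebbian_weights(patterns):
    """W = sum_p (y_p y_p^T - I) = Y Y^T - N_p I for +/-1 patterns."""
    Y = np.asarray(patterns, dtype=float).T       # one pattern per column, shape (N, N_p)
    N, n_patterns = Y.shape
    return Y @ Y.T - n_patterns * np.eye(N)       # subtracting N_p I zeroes the self-connections
```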

Number of patterns

slide-16
SLIDE 16
  • $\{\mathbf{y}_p\}$ is the set of patterns to store
    – Superscript $p$ represents the specific pattern
  • $N_p$ is the number of patterns to store

16

$\mathbf{W} = \sum_p \left(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{I}\right) = \mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}$

$w_{ji} = \sum_{p \in \{\mathbf{y}_p\}} y_i^p y_j^p$

Recap: Hebbian Learning to Store Multiple Patterns

slide-17
SLIDE 17

How many patterns can we store?

  • Hopfield: For a network of $N$ neurons we can store up to $0.14N$ patterns
  • In reality, it seems possible to store $K > 0.14N$ patterns
    – i.e. obtain a weight matrix $\mathbf{W}$ such that $K > 0.14N$ patterns are stationary

17

slide-18
SLIDE 18

Bold Claim

  • I can always store (up to) $N$ orthogonal patterns such that they are stationary!
    – Although not necessarily stable

– Although not necessarily stable

  • Why?

18

slide-19
SLIDE 19

β€œTraining” the network

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning – Geometric approach – Optimization

  • Secondary question

– How many patterns can we store?

19

slide-20
SLIDE 20

A minor adjustment

  • Note: the behavior of $E(\mathbf{y}) = \mathbf{y}^T\mathbf{W}\mathbf{y}$ with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}$

  • is identical to the behavior with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

  • since

$\mathbf{y}^T\left(\mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}\right)\mathbf{y} = \mathbf{y}^T\mathbf{Y}\mathbf{Y}^T\mathbf{y} - N N_p$

  • But $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$ is easier to analyze. Hence in the following slides we will use $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

20

The energy landscape only differs by an additive constant; gradients and locations of minima remain the same.
slide-21
SLIDE 21

A minor adjustment

  • Note: the behavior of $E(\mathbf{y}) = \mathbf{y}^T\mathbf{W}\mathbf{y}$ with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}$

  • is identical to the behavior with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

  • since

$\mathbf{y}^T\left(\mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}\right)\mathbf{y} = \mathbf{y}^T\mathbf{Y}\mathbf{Y}^T\mathbf{y} - N N_p$

  • But $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$ is easier to analyze. Hence in the following slides we will use $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

21

The energy landscape only differs by an additive constant; gradients and locations of minima remain the same.

Both have the same eigenvectors.

slide-22
SLIDE 22

A minor adjustment

  • Note: the behavior of $E(\mathbf{y}) = \mathbf{y}^T\mathbf{W}\mathbf{y}$ with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}$

  • is identical to the behavior with

$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

  • since

$\mathbf{y}^T\left(\mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}\right)\mathbf{y} = \mathbf{y}^T\mathbf{Y}\mathbf{Y}^T\mathbf{y} - N N_p$

  • But $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$ is easier to analyze. Hence in the following slides we will use $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$

22

The energy landscape only differs by an additive constant; gradients and locations of minima remain the same.

NOTE: This is a positive semidefinite matrix. Both have the same eigenvectors.

slide-23
SLIDE 23

Consider the energy function

  • Reinstating the bias term for completeness' sake

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

23

slide-24
SLIDE 24

Consider the energy function

  • Reinstating the bias term for completeness' sake

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

This is a quadratic! For Hebbian learning $\mathbf{W}$ is positive semidefinite, and $E$ is convex

24

slide-25
SLIDE 25

The energy function

  • 𝐹 is a convex quadratic

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

25

slide-26
SLIDE 26

The energy function

  • 𝐹 is a convex quadratic

– Shown from above (assuming 0 bias)

  • But components of $\mathbf{y}$ can only take values $\pm 1$
    – i.e. $\mathbf{y}$ lies on the corners of the unit hypercube

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

26

slide-27
SLIDE 27

The energy function

  • 𝐹 is a convex quadratic

– Shown from above (assuming 0 bias)

  • But components of $\mathbf{y}$ can only take values $\pm 1$
    – i.e. $\mathbf{y}$ lies on the corners of the unit hypercube

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

27

slide-28
SLIDE 28

The energy function

  • The stored values of $\mathbf{y}$ are the ones where all adjacent corners are lower on the quadratic

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

Stored patterns

28

slide-29
SLIDE 29

Patterns you can store

  • All patterns are on the corners of a hypercube

    – If a pattern is stored, its β€œghost” is stored as well
    – Intuitively, patterns must ideally be maximally far apart

  • Though this doesn’t seem to hold for Hebbian learning

Stored patterns Ghosts (negations)

29

slide-30
SLIDE 30

Evolution of the network

  • Note: for binary vectors $\operatorname{sign}(\mathbf{y})$ is a projection
    – Projects $\mathbf{y}$ onto the nearest corner of the hypercube
    – It β€œquantizes” the space into orthants
  • Response to field: $\mathbf{y} \leftarrow \operatorname{sign}(\mathbf{W}\mathbf{y})$
    – Each step rotates the vector $\mathbf{y}$ and then projects it onto the nearest corner

30

[Figure: $\mathbf{y}$, $\mathbf{W}\mathbf{y}$, and the projection $\operatorname{sign}(\mathbf{W}\mathbf{y})$]

slide-31
SLIDE 31

Storing patterns

  • A pattern $\mathbf{y}_p$ is stored if:
    – $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns
  • Training: Design $\mathbf{W}$ such that this holds
  • Simple solution: $\mathbf{y}_p$ is an eigenvector of $\mathbf{W}$
    – And the corresponding eigenvalue is positive: $\mathbf{W}\mathbf{y}_p = \lambda\mathbf{y}_p$
    – More generally $\operatorname{orthant}(\mathbf{W}\mathbf{y}_p) = \operatorname{orthant}(\mathbf{y}_p)$
  • How many such $\mathbf{y}_p$ can we have?

31
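A small sketch of this storage test (my naming):

```python
def is_stationary(y_p, W):
    """True if sign(W y_p) == y_p, i.e. the pattern does not move under one synchronous update."""
    projected = np.where(W @ y_p > 0, 1, -1)
    return np.array_equal(projected, y_p)
```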

slide-32
SLIDE 32

Only N patterns?

  • Patterns that differ in $N/2$ bits are orthogonal
  • You can have max $N$ orthogonal vectors in an $N$-dimensional space

33

(1,1) (1,-1)

slide-33
SLIDE 33

Another random fact that should interest you

  • The eigenvectors of any symmetric matrix $\mathbf{W}$ are orthogonal
  • The eigenvalues may be positive or negative

34

slide-34
SLIDE 34

Storing more than one pattern

  • Requirement: Given $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_P$
    – Design $\mathbf{W}$ such that
      • $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns
      • There are no other binary vectors for which this holds
  • What is the largest number of patterns that

can be stored?

35

slide-35
SLIDE 35

Storing $K$ orthogonal patterns

  • Simple solution: Design $\mathbf{W}$ such that $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_K$ are the eigenvectors of $\mathbf{W}$
    – Let $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \dots\ \mathbf{y}_K]$, and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\lambda_1, \dots, \lambda_K$ are positive
    – For $\lambda_1 = \lambda_2 = \dots = \lambda_K = 1$ this is exactly the Hebbian rule
  • The patterns are provably stationary

36

slide-36
SLIDE 36

Hebbian rule

  • In reality
    – Let $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \dots\ \mathbf{y}_K\ \mathbf{r}_{K+1}\ \dots\ \mathbf{r}_N]$, and $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\mathbf{r}_{K+1}\ \dots\ \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1\ \dots\ \mathbf{y}_K$
    – $\lambda_1 = \lambda_2 = \dots = \lambda_K = 1$
    – $\lambda_{K+1}, \dots, \lambda_N = 0$
  • All patterns orthogonal to $\mathbf{y}_1\ \dots\ \mathbf{y}_K$ are also stationary
    – Although not stable

37

slide-37
SLIDE 37

Storing $N$ orthogonal patterns

  • When we have $N$ orthogonal (or nearly orthogonal) patterns $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N$
    – $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \dots\ \mathbf{y}_N]$, $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\lambda_1 = \lambda_2 = \dots = \lambda_N = 1$
  • The eigenvectors of $\mathbf{W}$ span the space
  • Also, for any $\mathbf{y}_p$:

$\mathbf{W}\mathbf{y}_p = \mathbf{y}_p$

38

slide-38
SLIDE 38

Storing $N$ orthogonal patterns

  • The $N$ orthogonal patterns $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N$ span the space
  • Any pattern $\mathbf{y}$ can be written as

$\mathbf{y} = a_1\mathbf{y}_1 + a_2\mathbf{y}_2 + \dots + a_N\mathbf{y}_N$

$\mathbf{W}\mathbf{y} = a_1\mathbf{W}\mathbf{y}_1 + a_2\mathbf{W}\mathbf{y}_2 + \dots + a_N\mathbf{W}\mathbf{y}_N = a_1\mathbf{y}_1 + a_2\mathbf{y}_2 + \dots + a_N\mathbf{y}_N = \mathbf{y}$

  • All patterns are stable

– Remembers everything – Completely useless network

39

slide-39
SLIDE 39

Storing K orthogonal patterns

  • Even if we store fewer than $N$ patterns
    – Let $\mathbf{Y} = [\mathbf{y}_1\ \dots\ \mathbf{y}_K\ \mathbf{r}_{K+1}\ \dots\ \mathbf{r}_N]$, $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\mathbf{r}_{K+1}\ \dots\ \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1\ \dots\ \mathbf{y}_K$
    – $\lambda_1 = \lambda_2 = \dots = \lambda_K = 1$
    – $\lambda_{K+1}, \dots, \lambda_N = 0$
  • All patterns orthogonal to $\mathbf{y}_1\ \dots\ \mathbf{y}_K$ are stationary
  • Any pattern that is entirely in the subspace spanned by $\mathbf{y}_1\ \dots\ \mathbf{y}_K$ is also stable (same logic as earlier)
  • Only patterns that are partially in the subspace spanned by $\mathbf{y}_1\ \dots\ \mathbf{y}_K$ are unstable
    – They get projected onto the subspace spanned by $\mathbf{y}_1\ \dots\ \mathbf{y}_K$

40

slide-40
SLIDE 40

Problem with Hebbian Rule

  • Even if we store fewer than $N$ patterns
    – Let $\mathbf{Y} = [\mathbf{y}_1\ \dots\ \mathbf{y}_K\ \mathbf{r}_{K+1}\ \dots\ \mathbf{r}_N]$, $\mathbf{W} = \mathbf{Y}\Lambda\mathbf{Y}^T$
    – $\mathbf{r}_{K+1}\ \dots\ \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1\ \dots\ \mathbf{y}_K$
    – $\lambda_1 = \lambda_2 = \dots = \lambda_K = 1$
  • Problems arise because the eigenvalues are all 1.0
    – This ensures stationarity of vectors in the subspace
    – What if we get rid of this requirement?

41

slide-41
SLIDE 41

Hebbian rule and general (non-orthogonal) vectors

$w_{ji} = \sum_{p \in \{\mathbf{y}_p\}} y_i^p y_j^p$

  • What happens when the patterns are not orthogonal?
  • What happens when the patterns are presented more than once?
    – Different patterns presented different numbers of times
    – Equivalent to having unequal eigenvalues..
  • Can we predict the evolution of any vector $\mathbf{y}$?
    – Hint: Lanczos iterations
  • Can write $\mathbf{Y}_P = \mathbf{Y}_{\text{ortho}}\mathbf{B}$, so $\mathbf{W} = \mathbf{Y}_{\text{ortho}}\mathbf{B}\Lambda\mathbf{B}^T\mathbf{Y}_{\text{ortho}}^T$

42

slide-42
SLIDE 42

The bottom line

  • With a network of $N$ units (i.e. $N$-bit patterns)
  • The maximum number of stable patterns is actually exponential in $N$
    – McEliece and Posner, '84
    – E.g. when we had the Hebbian net with $N$ orthogonal base patterns, all patterns are stable
  • For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable provided $K \le N$
    – Abu-Mostafa and St. Jacques, '85
  • For large $N$, the upper bound on $K$ is actually $N/(4\log N)$
    – McEliece et al., '87
    – But this may come with many β€œparasitic” memories

43

slide-43
SLIDE 43

The bottom line

  • With a network of $N$ units (i.e. $N$-bit patterns)
  • The maximum number of stable patterns is actually exponential in $N$
    – McEliece and Posner, '84
    – E.g. when we had the Hebbian net with $N$ orthogonal base patterns, all patterns are stable
  • For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable provided $K \le N$
    – Abu-Mostafa and St. Jacques, '85
  • For large $N$, the upper bound on $K$ is actually $N/(4\log N)$
    – McEliece et al., '87
    – But this may come with many β€œparasitic” memories

44

How do we find this network?

slide-44
SLIDE 44

The bottom line

  • With a network of $N$ units (i.e. $N$-bit patterns)
  • The maximum number of stable patterns is actually exponential in $N$
    – McEliece and Posner, '84
    – E.g. when we had the Hebbian net with $N$ orthogonal base patterns, all patterns are stable
  • For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable provided $K \le N$
    – Abu-Mostafa and St. Jacques, '85
  • For large $N$, the upper bound on $K$ is actually $N/(4\log N)$
    – McEliece et al., '87
    – But this may come with many β€œparasitic” memories

45

Can we do something about this? How do we find this network?

slide-45
SLIDE 45

A different tack

  • How do we make the network store a specific

pattern or set of patterns?

– Hebbian learning – Geometric approach – Optimization

  • Secondary question

– How many patterns can we store?

46

slide-46
SLIDE 46

Consider the energy function

  • This must be maximally low for target patterns
  • Must be maximally high for all other patterns

    – So that they are unstable and evolve into one of the target patterns

$E = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

47

slide-47
SLIDE 47

Alternate Approach to Estimating the Network

  • Estimate $\mathbf{W}$ (and $\mathbf{b}$) such that
    – $E$ is minimized for $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_P$
    – $E$ is maximized for all other $\mathbf{y}$
  • Caveat: It is unrealistic to expect to store more than $N$ patterns, but can we make those $N$ patterns memorable?

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$

48

slide-48
SLIDE 48

Optimizing W (and b)

  • Minimize total energy of target patterns

    – Problem with this?

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

49

$\widehat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y})$

The bias can be captured by another fixed-value component

slide-49
SLIDE 49

Optimizing W

  • Minimize total energy of target patterns
  • Maximize the total energy of all non-target

patterns

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

50

$\widehat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

slide-50
SLIDE 50

Optimizing W

  • Simple gradient descent:

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

51

$\widehat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\right)$

slide-51
SLIDE 51

Optimizing W

  • Can β€œemphasize” the importance of a pattern

by repeating

    – More repetitions → greater emphasis

52

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\right)$

slide-52
SLIDE 52

Optimizing W

  • Can β€œemphasize” the importance of a pattern

by repeating

    – More repetitions → greater emphasis
  • How many of these?
    – Do we need to include all of them?
    – Are all equally important?

53

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\right)$

slide-53
SLIDE 53

The training again..

  • Note the energy contour of a Hopfield

network for any weight matrix $\mathbf{W}$

54

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T\right)$

[Figure: energy vs. state; the bowls will all actually be quadratic]

slide-54
SLIDE 54

The training again

  • The first term tries to minimize the energy at target patterns

– Make them local minima – Emphasize more β€œimportant” memories by repeating them more frequently

55

𝐗 = 𝐗 + πœƒ ෍

π³βˆˆπ™π‘„

π³π³π‘ˆ βˆ’ ෍

π³βˆ‰π™π‘„

π³π³π‘ˆ

state Energy Target patterns

slide-55
SLIDE 55

The negative class

  • The second term tries to β€œraise” all non-target

patterns

– Do we need to raise everything?

56

𝐗 = 𝐗 + πœƒ ෍

π³βˆˆπ™π‘„

π³π³π‘ˆ βˆ’ ෍

π³βˆ‰π™π‘„

π³π³π‘ˆ

state Energy

slide-56
SLIDE 56

Option 1: Focus on the valleys

  • Focus on raising the valleys

– If you raise every valley, eventually they’ll all move up above the target patterns, and many will even vanish

57

𝐗 = 𝐗 + πœƒ ෍

π³βˆˆπ™π‘„

π³π³π‘ˆ βˆ’ ෍

π³βˆ‰π™π‘„&𝐳=π‘€π‘π‘šπ‘šπ‘“π‘§

π³π³π‘ˆ

state Energy

slide-57
SLIDE 57

Identifying the valleys..

  • Problem: How do you identify the valleys for

the current 𝐗?

58

𝐗 = 𝐗 + πœƒ ෍

π³βˆˆπ™π‘„

π³π³π‘ˆ βˆ’ ෍

π³βˆ‰π™π‘„&𝐳=π‘€π‘π‘šπ‘šπ‘“π‘§

π³π³π‘ˆ

state Energy

slide-58
SLIDE 58

Identifying the valleys..

59

[Figure: energy as a function of network state]

  • Initialize the network randomly and let it evolve

– It will settle in a valley

slide-59
SLIDE 59

Training the Hopfield network

  • Initialize 𝐗
  • Compute the total outer product of all target patterns

– More important patterns presented more frequently

  • Randomly initialize the network several times and let it

evolve

– And settle at a valley

  • Compute the total outer product of valley patterns
  • Update weights

60

𝐗 = 𝐗 + πœƒ ෍

π³βˆˆπ™π‘„

π³π³π‘ˆ βˆ’ ෍

π³βˆ‰π™π‘„&𝐳=π‘€π‘π‘šπ‘šπ‘“π‘§

π³π³π‘ˆ

slide-60
SLIDE 60

Training the Hopfield network: SGD version

  • Initialize 𝐗
  • Do until convergence, satisfaction, or death from

boredom:

    – Sample a target pattern $\mathbf{y}_p$
      • Sampling frequency of pattern must reflect importance of pattern
    – Randomly initialize the network and let it evolve
      • And settle at a valley $\mathbf{y}_v$
    – Update weights
      • $\mathbf{W} = \mathbf{W} + \eta\left(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\right)$

61

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P \,\&\, \mathbf{y}=\text{valley}} \mathbf{y}\mathbf{y}^T\right)$

slide-61
SLIDE 61

Training the Hopfield network

  • Initialize 𝐗
  • Do until convergence, satisfaction, or death from

boredom:

    – Sample a target pattern $\mathbf{y}_p$
      • Sampling frequency of pattern must reflect importance of pattern
    – Randomly initialize the network and let it evolve
      • And settle at a valley $\mathbf{y}_v$
    – Update weights
      • $\mathbf{W} = \mathbf{W} + \eta\left(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\right)$

62

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P \,\&\, \mathbf{y}=\text{valley}} \mathbf{y}\mathbf{y}^T\right)$

slide-62
SLIDE 62

Which valleys?

63

[Figure: energy as a function of network state]

  • Should we randomly sample valleys?

– Are all valleys equally important?

slide-63
SLIDE 63

Which valleys?

64

[Figure: energy as a function of network state]

  • Should we randomly sample valleys?

– Are all valleys equally important?

  • Major requirement: memories must be stable

– They must be broad valleys

  • Spurious valleys in the neighborhood of

memories are more important to eliminate

slide-64
SLIDE 64

Identifying the valleys..

65

[Figure: energy as a function of network state]

  • Initialize the network at valid memories and let it evolve

– It will settle in a valley. If this is not the target pattern, raise it

slide-65
SLIDE 65

Training the Hopfield network

  • Initialize 𝐗
  • Compute the total outer product of all target patterns

– More important patterns presented more frequently

  • Initialize the network with each target pattern and let it

evolve

– And settle at a valley

  • Compute the total outer product of valley patterns
  • Update weights

66

𝐗 = 𝐗 + πœƒ ෍

π³βˆˆπ™π‘„

π³π³π‘ˆ βˆ’ ෍

π³βˆ‰π™π‘„&𝐳=π‘€π‘π‘šπ‘šπ‘“π‘§

π³π³π‘ˆ

slide-66
SLIDE 66

Training the Hopfield network: SGD version

  • Initialize 𝐗
  • Do until convergence, satisfaction, or death from

boredom:

    – Sample a target pattern $\mathbf{y}_p$
      • Sampling frequency of pattern must reflect importance of pattern
    – Initialize the network at $\mathbf{y}_p$ and let it evolve
      • And settle at a valley $\mathbf{y}_v$
    – Update weights
      • $\mathbf{W} = \mathbf{W} + \eta\left(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_v\mathbf{y}_v^T\right)$

67

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P \,\&\, \mathbf{y}=\text{valley}} \mathbf{y}\mathbf{y}^T\right)$

slide-67
SLIDE 67

A possible problem

68

[Figure: energy as a function of network state]

  • What if there’s another target pattern

downvalley

– Raising it will destroy a better-represented or stored pattern!

slide-68
SLIDE 68

A related issue

  • Really no need to raise the entire surface, or

even every valley

69

[Figure: energy as a function of network state]

slide-69
SLIDE 69

A related issue

  • Really no need to raise the entire surface, or even

every valley

  • Raise the neighborhood of each target memory

– Sufficient to make the memory a valley – The broader the neighborhood considered, the broader the valley

70

[Figure: energy as a function of network state]

slide-70
SLIDE 70

Raising the neighborhood

71

[Figure: energy as a function of network state]

  • Starting from a target pattern, let the network

evolve only a few steps

– Try to raise the resultant location

  • Will raise the neighborhood of targets
  • Will avoid problem of down-valley targets
slide-71
SLIDE 71

Training the Hopfield network: SGD version

  • Initialize 𝐗
  • Do until convergence, satisfaction, or death from

boredom:

    – Sample a target pattern $\mathbf{y}_p$
      • Sampling frequency of pattern must reflect importance of pattern
    – Initialize the network at $\mathbf{y}_p$ and let it evolve a few steps (2–4)
      • And arrive at a down-valley position $\mathbf{y}_d$
    – Update weights
      • $\mathbf{W} = \mathbf{W} + \eta\left(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{y}_d\mathbf{y}_d^T\right)$

72

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P \,\&\, \mathbf{y}=\text{valley}} \mathbf{y}\mathbf{y}^T\right)$

slide-72
SLIDE 72

A probabilistic interpretation

  • For continuous 𝐳, the energy of a pattern is a perfect

analog to the negative log likelihood of a Gaussian density

  • For binary y it is the analog of the negative log likelihood of

a Boltzmann distribution

– Minimizing energy maximizes log likelihood

73

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} \qquad P(\mathbf{y}) = C\exp\left(\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}\right)$

slide-73
SLIDE 73

The Boltzmann Distribution

  • $k$ is the Boltzmann constant
  • $T$ is the temperature of the system
  • The energy terms are like the log likelihood of a Boltzmann distribution at $T = 1$
    – Derivation of this probability is in fact quite trivial..

74

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y} \qquad P(\mathbf{y}) = C\exp\left(\frac{-E(\mathbf{y})}{kT}\right) \qquad C = \frac{1}{\sum_{\mathbf{y}}\exp\left(\frac{-E(\mathbf{y})}{kT}\right)}$

slide-74
SLIDE 74

Continuing the Boltzmann analogy

  • The system probabilistically selects states with

lower energy

    – With infinitesimally slow cooling, at $T = 0$, it arrives at the global minimal state

75

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y} \qquad P(\mathbf{y}) = C\exp\left(\frac{-E(\mathbf{y})}{kT}\right) \qquad C = \frac{1}{\sum_{\mathbf{y}}\exp\left(\frac{-E(\mathbf{y})}{kT}\right)}$

slide-75
SLIDE 75

Spin glasses and Hopfield nets

  • Selecting a next state is akin to drawing a

sample from the Boltzmann distribution at $T = 1$, in a universe where $k = 1$

76

[Figure: energy as a function of network state]

slide-76
SLIDE 76

Optimizing W

  • Simple gradient descent:

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

77

$\widehat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \alpha_{\mathbf{y}}\,\mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \beta_{E(\mathbf{y})}\,\mathbf{y}\mathbf{y}^T\right)$

More importance to more frequently presented memories; more importance to more attractive spurious memories

slide-77
SLIDE 77

Optimizing W

  • Simple gradient descent:

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

78

$\widehat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

THIS LOOKS LIKE AN EXPECTATION!

$\mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \alpha_{\mathbf{y}}\,\mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \beta_{E(\mathbf{y})}\,\mathbf{y}\mathbf{y}^T\right)$

More importance to more frequently presented memories; more importance to more attractive spurious memories

slide-78
SLIDE 78

Optimizing W

  • Update rule

$E(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y}$

79

$\widehat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_{\mathbf{y}\in\mathbf{Y}_P} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_P} E(\mathbf{y})$

Natural distribution for variables: the Boltzmann distribution

$\mathbf{W} = \mathbf{W} + \eta\left(\mathbb{E}_{\mathbf{y}\sim\mathbf{Y}_P}\left[\mathbf{y}\mathbf{y}^T\right] - \mathbb{E}_{\mathbf{y}\sim Y}\left[\mathbf{y}\mathbf{y}^T\right]\right) \qquad \mathbf{W} = \mathbf{W} + \eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_P} \alpha_{\mathbf{y}}\,\mathbf{y}\mathbf{y}^T - \sum_{\mathbf{y}\notin\mathbf{Y}_P} \beta_{E(\mathbf{y})}\,\mathbf{y}\mathbf{y}^T\right)$

slide-79
SLIDE 79

Continuing on..

  • The Hopfield net as a Boltzmann distribution
  • Adding capacity to a Hopfield network

– The Boltzmann machine

80

slide-80
SLIDE 80

Continuing on..

  • The Hopfield net as a Boltzmann distribution
  • Adding capacity to a Hopfield network

– The Boltzmann machine

81

slide-81
SLIDE 81

Storing more than N patterns

  • The memory capacity of an $N$-bit network is at most $N$
    – Stable patterns (not necessarily even stationary)
      • Abu-Mostafa and St. Jacques, 1985
      • Although the β€œinformation capacity” is $\mathcal{O}(N^3)$
  • How do we increase the capacity of the network?
    – Store more patterns

82

slide-82
SLIDE 82

Expanding the network

  • Add a large number of neurons whose actual

values you don’t care about!

N Neurons K Neurons

83

slide-83
SLIDE 83

Expanded Network

  • New capacity: $\sim(N + K)$ patterns
    – Although we only care about the patterns of the first $N$ neurons
    – We're interested in $N$-bit patterns

N Neurons K Neurons

84

slide-84
SLIDE 84

Terminology

  • Terminology:

    – The neurons that store the actual patterns of interest: visible neurons
    – The neurons that only serve to increase the capacity but whose actual values are not important: hidden neurons
      • These can be set to anything in order to store a visible pattern

Visible Neurons Hidden Neurons

slide-85
SLIDE 85

Training the network

  • For a given pattern of visible neurons, there are any

number of hidden patterns ($2^K$)

  • Which of these do we choose?

– Ideally choose the one that results in the lowest energy – But that’s an exponential search space!

  • Solution: Combinatorial optimization

– Simulated annealing

Visible Neurons Hidden Neurons

slide-86
SLIDE 86

The patterns

  • In fact we could have multiple hidden patterns

coupled with any visible pattern

– These would be multiple stored patterns that all give the same visible output – How many do we permit

  • Do we need to specify one or more particular

hidden patterns?

– How about all of them – What do I mean by this bizarre statement?

slide-87
SLIDE 87

But first..

  • The Hopfield net as a distribution..

88

slide-88
SLIDE 88

Revisiting Thermodynamic Phenomena

  • Is the system actually in a specific state at any time?
  • No – the state is actually continuously changing

– Based on the temperature of the system

  • At higher temperatures, state changes more rapidly
  • What is actually being characterized is the probability of the state
    – And the expected value of the state

[Figure: potential energy as a function of state]

slide-89
SLIDE 89

The Helmholtz Free Energy of a System

  • A thermodynamic system at temperature $T$ can exist in one of many states
    – Potentially infinite states
    – At any time, the probability of finding the system in state $s$ at temperature $T$ is $P_T(s)$
  • At each state $s$ it has a potential energy $E_s$
  • The internal energy of the system, representing its capacity to do work, is the average:

$U_T = \sum_s P_T(s)\, E_s$

slide-90
SLIDE 90

The Helmholtz Free Energy of a System

  • The capacity to do work is counteracted by the internal disorder of the system, i.e. its entropy

$H_T = -\sum_s P_T(s)\log P_T(s)$

  • The Helmholtz free energy of the system measures the useful work derivable from it and combines the two terms

$F_T = U_T - kT H_T = \sum_s P_T(s)\,E_s + kT \sum_s P_T(s)\log P_T(s)$

slide-91
SLIDE 91

The Helmholtz Free Energy of a System

πΊπ‘ˆ = ෍

𝑑

π‘„π‘ˆ 𝑑 𝐹𝑑 βˆ’ π‘™π‘ˆ ෍

𝑑

π‘„π‘ˆ 𝑑 log π‘„π‘ˆ 𝑑

  • A system held at a specific temperature anneals by

varying the rate at which it visits the various states, to reduce the free energy in the system, until a minimum free-energy state is achieved

  • The probability distribution of the states at steady state

is known as the Boltzmann distribution

slide-92
SLIDE 92

The Helmholtz Free Energy of a System

πΊπ‘ˆ = ෍

𝑑

π‘„π‘ˆ 𝑑 𝐹𝑑 βˆ’ π‘™π‘ˆ ෍

𝑑

π‘„π‘ˆ 𝑑 log π‘„π‘ˆ 𝑑

  • Minimizing this w.r.t π‘„π‘ˆ 𝑑 , we get

π‘„π‘ˆ 𝑑 = 1 π‘Ž π‘“π‘¦π‘ž βˆ’πΉπ‘‘ π‘™π‘ˆ

– Also known as the Gibbs distribution – π‘Ž is a normalizing constant – Note the dependence on π‘ˆ – A π‘ˆ = 0, the system will always remain at the lowest- energy configuration with prob = 1.
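The minimization behind this is a short constrained-optimization exercise; a sketch in my notation, with a Lagrange multiplier $\lambda$ enforcing $\sum_s P_T(s) = 1$:

$\mathcal{L} = \sum_s P_T(s)\,E_s + kT\sum_s P_T(s)\log P_T(s) + \lambda\left(\sum_s P_T(s) - 1\right)$

$\frac{\partial\mathcal{L}}{\partial P_T(s)} = E_s + kT\left(\log P_T(s) + 1\right) + \lambda = 0 \;\Rightarrow\; P_T(s) \propto \exp\left(\frac{-E_s}{kT}\right)$

Normalizing gives $P_T(s) = \dfrac{\exp(-E_s/kT)}{\sum_{s'}\exp(-E_{s'}/kT)}$, i.e. the normalizing constant is $Z = \sum_{s'}\exp(-E_{s'}/kT)$.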

slide-93
SLIDE 93

The Energy of the Network

  • We can define the energy of the system as before
  • Since neurons are stochastic, there is disorder or entropy (with $T = 1$)
  • The equilibrium probability distribution over states is the Boltzmann distribution at $T = 1$
    – This is the probability of different states that the network will wander over at equilibrium

[Figure: network of visible neurons]

$E(S) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i \qquad P(S) = \frac{\exp(-E(S))}{\sum_{S'}\exp(-E(S'))}$

slide-94
SLIDE 94

The Hopfield net is a distribution

  • The stochastic Hopfield network models a probability distribution over

states

– Where a state is a binary string – Specifically, it models a Boltzmann distribution – The parameters of the model are the weights of the network

  • The probability that (at equilibrium) the network will be in any state is $P(S)$
    – It is a generative model: it generates states according to $P(S)$

[Figure: network of visible neurons]

$E(S) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i \qquad P(S) = \frac{\exp(-E(S))}{\sum_{S'}\exp(-E(S'))}$

slide-95
SLIDE 95

The field at a single node

  • Let $S$ and $S'$ be otherwise identical states that only differ in the $i$-th bit
    – $S$ has $i$-th bit $= +1$ and $S'$ has $i$-th bit $= -1$

$P(S) = P(s_i = 1 \mid s_{j\ne i})\, P(s_{j\ne i})$

$P(S') = P(s_i = -1 \mid s_{j\ne i})\, P(s_{j\ne i})$

$\log P(S) - \log P(S') = \log P(s_i = 1 \mid s_{j\ne i}) - \log P(s_i = -1 \mid s_{j\ne i})$

$\log P(S) - \log P(S') = \log \dfrac{P(s_i = 1 \mid s_{j\ne i})}{1 - P(s_i = 1 \mid s_{j\ne i})}$

96

slide-96
SLIDE 96

The field at a single node

  • Let $S$ and $S'$ be the states with the $i$-th bit in the $+1$ and $-1$ states

$\log P(S) = -E(S) + C$

$E(S) = -\frac{1}{2}\left(E_{\text{not }i} + \sum_{j\ne i} w_{ji}s_j + b_i\right)$

$E(S') = -\frac{1}{2}\left(E_{\text{not }i} - \sum_{j\ne i} w_{ji}s_j - b_i\right)$

  • $\log P(S) - \log P(S') = E(S') - E(S) = \sum_{j\ne i} w_{ji}s_j + b_i$

97

slide-97
SLIDE 97

The field at a single node

π‘šπ‘π‘• 𝑄 𝑑𝑗 = 1 𝑑

π‘˜β‰ π‘—

1 βˆ’ 𝑄 𝑑𝑗 = 1 𝑑

π‘˜β‰ π‘—

= ෍

π‘˜β‰ π‘—

π‘₯

π‘˜π‘‘ π‘˜ + 𝑐𝑗

  • Giving us

𝑄 𝑑𝑗 = 1 𝑑

π‘˜β‰ π‘— =

1 1 + π‘“βˆ’ Οƒπ‘˜β‰ π‘— π‘₯π‘˜π‘‘π‘˜+𝑐𝑗

  • The probability of any node taking value 1

given other node values is a logistic

98

slide-98
SLIDE 98

Redefining the network

  • First try: Redefine a regular Hopfield net as a stochastic system
  • Each neuron is now a stochastic unit with a binary state $s_i$, which can take value 0 or 1 with a probability that depends on the local field
    – Note the slight change from Hopfield nets
    – Not actually necessary; only a matter of convenience

[Figure: network of visible neurons]

$z_i = \sum_j w_{ji} s_j + b_i \qquad P(s_i = 1 \mid s_{j\ne i}) = \dfrac{1}{1 + e^{-z_i}}$

slide-99
SLIDE 99

The Hopfield net is a distribution

  • The Hopfield net is a probability distribution over

binary sequences

– The Boltzmann distribution

  • The conditional distribution of individual bits in the

sequence is a logistic

[Figure: network of visible neurons]

$z_i = \sum_j w_{ji} s_j + b_i \qquad P(s_i = 1 \mid s_{j\ne i}) = \dfrac{1}{1 + e^{-z_i}}$

slide-100
SLIDE 100

Running the network

  • Initialize the neurons
  • Cycle through the neurons and randomly set the neuron to 1 or -1 according to the

probability given above

    – Gibbs sampling: fix $N-1$ variables and sample the remaining variable
    – As opposed to the energy-based update (mean field approximation): run the test $z_i > 0$?
  • After many many iterations (until β€œconvergence”), sample the individual neurons

[Figure: network of visible neurons]

$z_i = \sum_j w_{ji} s_j + b_i \qquad P(s_i = 1 \mid s_{j\ne i}) = \dfrac{1}{1 + e^{-z_i}}$
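A sketch of this procedure (my naming; it keeps the ±1 convention used elsewhere in these slides, and the number of sweeps is an arbitrary choice):

```python
def run_stochastic_net(W, b, n_sweeps=1000, seed=0):
    """Gibbs sampling: cycle through the units, setting each to 1 with probability
    sigmoid(local field) given the current values of all other units."""
    rng = np.random.default_rng(seed)
    N = len(b)
    s = rng.choice([-1, 1], size=N)               # random initialization
    for _ in range(n_sweeps):
        for i in rng.permutation(N):
            z_i = W[i] @ s + b[i]                 # local field at unit i
            p_i = 1.0 / (1.0 + np.exp(-z_i))      # P(s_i = 1 | all other units)
            s[i] = 1 if rng.random() < p_i else -1
    return s                                      # one sample of the network's state
```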

slide-101
SLIDE 101

Training the network

  • As in Hopfield nets, in order to train the network,

we need to select weights such that those states are more probable than other states

– Maximize the likelihood of the β€œstored” states

[Figure: network of visible neurons]

$E(S) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i \qquad P(S) = \frac{\exp(-E(S))}{\sum_{S'}\exp(-E(S'))}$

$P(S) = \frac{\exp\left(\sum_{i<j} w_{ij} s_i s_j + \sum_i b_i s_i\right)}{\sum_{S'}\exp\left(\sum_{i<j} w_{ij} s_i' s_j' + \sum_i b_i s_i'\right)}$

slide-102
SLIDE 102

Maximum Likelihood Training

  • Maximize the average log likelihood of all β€œtraining”

vectors $\mathbf{S} = \{S_1, S_2, \dots, S_N\}$
    – In the first summation, $s_i$ and $s_j$ are bits of $S$
    – In the second, $s_i'$ and $s_j'$ are bits of $S'$

$\log P(S) = \sum_{i<j} w_{ij} s_i s_j + \sum_i b_i s_i - \log \sum_{S'} \exp\left(\sum_{i<j} w_{ij} s_i' s_j' + \sum_i b_i s_i'\right)$

$\langle \log P(\mathbf{S}) \rangle = \frac{1}{N}\sum_{S\in\mathbf{S}} \log P(S) = \frac{1}{N}\sum_{S}\left(\sum_{i<j} w_{ij} s_i s_j + \sum_i b_i s_i(S)\right) - \log \sum_{S'} \exp\left(\sum_{i<j} w_{ij} s_i' s_j' + \sum_i b_i s_i'\right)$

slide-103
SLIDE 103

Maximum Likelihood Training

  • We will use gradient descent, but we run into a problem..
  • The first term is just the average $s_i s_j$ over all training patterns
  • But the second term is summed over all states
    – Of which there can be an exponential number!

$\log P(\mathbf{S}) = \frac{1}{N}\sum_{S}\left(\sum_{i<j} w_{ij} s_i s_j + \sum_i b_i s_i(S)\right) - \log \sum_{S'} \exp\left(\sum_{i<j} w_{ij} s_i' s_j' + \sum_i b_i s_i'\right)$

$\frac{d \log P(\mathbf{S})}{d w_{ij}} = \frac{1}{N}\sum_{S} s_i s_j \;-\; ???$

slide-104
SLIDE 104

The second term

  • The second term is simply the expected value of $s_i s_j$ over all possible values of the state
  • We cannot compute it exhaustively, but we can compute it by sampling!

$\frac{d \log \sum_{S'} \exp\left(\sum_{i<j} w_{ij} s_i' s_j' + \sum_i b_i s_i'\right)}{d w_{ij}} = \sum_{S'} \frac{\exp\left(\sum_{i<j} w_{ij} s_i' s_j' + \sum_i b_i s_i'\right)}{\sum_{S''} \exp\left(\sum_{i<j} w_{ij} s_i'' s_j'' + \sum_i b_i s_i''\right)}\, s_i' s_j' = \sum_{S'} P(S')\, s_i' s_j'$

slide-105
SLIDE 105

The simulation solution

  • Initialize the network randomly and let it β€œevolve”

– By probabilistically selecting state values according to our model

  • After many many epochs, take a snapshot of the state
  • Repeat this many many times
  • Let the collection of states be

π“π‘‘π‘—π‘›π‘£π‘š = {π‘‡π‘‘π‘—π‘›π‘£π‘š,1, π‘‡π‘‘π‘—π‘›π‘£π‘š,1=2, … , π‘‡π‘‘π‘—π‘›π‘£π‘š,𝑁}

slide-106
SLIDE 106

The simulation solution for the second term

  • The second term in the derivative is computed

as the average of sampled states when the network is running β€œfreely”

$\sum_{S'} P(S')\, s_i' s_j' \approx \frac{1}{M} \sum_{S'\in\mathbf{S}_{\text{simul}}} s_i' s_j'$

$\frac{d \log \sum_{S'} \exp\left(\sum_{i<j} w_{ij} s_i' s_j' + \sum_i b_i s_i'\right)}{d w_{ij}} = \sum_{S'} P(S')\, s_i' s_j'$

slide-107
SLIDE 107

Maximum Likelihood Training

  • The overall gradient ascent rule

$\log P(\mathbf{S}) = \frac{1}{N}\sum_{S}\left(\sum_{i<j} w_{ij} s_i s_j + \sum_i b_i s_i(S)\right) - \log \sum_{S'} \exp\left(\sum_{i<j} w_{ij} s_i' s_j' + \sum_i b_i s_i'\right)$

$\frac{d \log P(\mathbf{S})}{d w_{ij}} = \frac{1}{N}\sum_{S} s_i s_j - \frac{1}{M}\sum_{S'\in\mathbf{S}_{\text{simul}}} s_i' s_j'$

$w_{ij} = w_{ij} + \eta\,\frac{d \log P(\mathbf{S})}{d w_{ij}}$

slide-108
SLIDE 108

Overall Training

  • Initialize weights
  • Let the network run to obtain simulated state samples
  • Compute gradient and update weights
  • Iterate

π‘₯π‘—π‘˜ = π‘₯π‘—π‘˜ + πœƒ 𝑒 log 𝑄 𝐓 𝑒π‘₯π‘—π‘˜

𝑒 log 𝑄 𝐓 𝑒π‘₯π‘—π‘˜ = 1 𝑂 ෍

𝑇

𝑑𝑗𝑑

π‘˜ βˆ’ 1

𝑁 ෍

π‘‡β€²βˆˆπ“π‘‘π‘—π‘›π‘£π‘š

𝑑𝑗

′𝑑 π‘˜ β€²

slide-109
SLIDE 109

Lookahead..

  • Boltzmann Machines
  • Training strategies
  • RBMs
  • DBMs..

110