Dense Associative Memories and Deep Learning
Dmitry Krotov, IBM Research, MIT-IBM Watson AI Lab, Institute for Advanced Study
Learning · Mechanisms · Architectures
What is associative memory? [Figure: an energy landscape whose local minima are the memories $\xi^1, \xi^2, \xi^3, \xi^4$.]
Standard Associative Memory and Dense Associative Memory

Standard (Hopfield) associative memory:
$$E = -\sum_{i,j=1}^{N} \sigma_i T_{ij}\,\sigma_j, \qquad T_{ij} = \sum_{\mu=1}^{K} \xi^\mu_i \xi^\mu_j$$

Dense Associative Memory (DAM):
$$E = -\sum_{\mu=1}^{K} F\Big(\sum_{i=1}^{N} \xi^\mu_i \sigma_i\Big), \qquad F(x) = x^n, \quad n \ge 2$$

Here $\sigma_i$ are the dynamical variables, $\xi^\mu_i$ the memorized patterns, $N$ the number of neurons, $K$ the number of memories, and $n$ the power of the interaction vertex. For $n = 2$ the energy reduces to
$$E = -\sum_{\mu=1}^{K} \Big(\sum_{i=1}^{N} \xi^\mu_i \sigma_i\Big)^2$$
with capacity $K_{\max} \approx 0.14\,N$, while for general $n$ the capacity grows as $K_{\max} \approx \alpha_n N^{\,n-1}$.
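A minimal NumPy sketch of this energy (not from the talk; the function name and array layout are my choices):

```python
import numpy as np

def dam_energy(sigma, xi, n=2):
    """DAM energy E = -sum_mu (sum_i xi^mu_i * sigma_i)**n.

    sigma : (N,) vector of +/-1 spins (dynamical variables)
    xi    : (K, N) matrix of memorized patterns
    n     : power of the interaction vertex (n = 2 recovers the Hopfield energy)
    """
    overlaps = xi @ sigma          # (K,) overlaps sum_i xi^mu_i sigma_i
    return -np.sum(overlaps ** n)
```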
Update rule:
$$\sigma^{(t+1)}_i = \mathrm{Sign}\Bigg[\sum_{\mu=1}^{K}\bigg(F\Big(\xi^\mu_i + \sum_{j\neq i}\xi^\mu_j\,\sigma^{(t)}_j\Big) - F\Big(-\xi^\mu_i + \sum_{j\neq i}\xi^\mu_j\,\sigma^{(t)}_j\Big)\bigg)\Bigg]$$
for random patterns with statistics $\langle \xi^\mu_i \rangle = 0$ and $\langle \xi^\mu_i\,\xi^\nu_j \rangle = \delta^{\mu\nu}\,\delta_{ij}$.
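A matching sketch of the retrieval dynamics, assuming $\pm 1$ spins and the polynomial $F(x) = x^n$ from the previous slide; the sweep schedule is an illustrative choice:

```python
import numpy as np

def dam_retrieve(sigma, xi, n=2, sweeps=5):
    """Sequential application of the DAM update rule (minimal sketch).

    sigma_i <- Sign[ sum_mu ( F(+xi^mu_i + rest_mu) - F(-xi^mu_i + rest_mu) ) ]
    with rest_mu = sum_{j != i} xi^mu_j sigma_j and F(x) = x**n.
    """
    sigma = sigma.astype(float).copy()
    K, N = xi.shape
    F = lambda x: x ** n
    for _ in range(sweeps):
        for i in range(N):
            rest = xi @ sigma - xi[:, i] * sigma[i]   # (K,) sums over j != i
            drive = np.sum(F(xi[:, i] + rest) - F(-xi[:, i] + rest))
            sigma[i] = 1.0 if drive >= 0 else -1.0
    return sigma
```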
Pattern recognition with DAM: [Figure: the visible layer $v_i$ consists of 784 pixel neurons $x_\alpha$ (a 28×28 image) and 10 classification neurons $c_\alpha$.]
Clamping the pixel neurons and applying the update rule once to the classification neurons gives the output $c_\alpha$:
$$c_\alpha = g\Bigg[\beta\sum_{\mu=1}^{K}\bigg(F\Big(\xi^\mu_\alpha + \sum_{\gamma\neq\alpha}\xi^\mu_\gamma x_\gamma + \sum_{i=1}^{N}\xi^\mu_i v_i\Big) - F\Big(-\xi^\mu_\alpha + \sum_{\gamma\neq\alpha}\xi^\mu_\gamma x_\gamma + \sum_{i=1}^{N}\xi^\mu_i v_i\Big)\bigg)\Bigg], \qquad g(x) = \tanh(x)$$
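A hedged sketch of this one-step classification readout; the value of $\beta$ and the choice to initialize the non-updated class units in the off state (−1) are illustrative assumptions, not values taken from the slides:

```python
import numpy as np

def classify(v, xi_pix, xi_cls, n=2, beta=0.5):
    """One-step update of the classification neurons with pixels clamped.

    v      : (N,) clamped pixel values
    xi_pix : (K, N) pixel part of the memories
    xi_cls : (K, C) label part of the memories
    """
    K, C = xi_cls.shape
    x = -np.ones(C)                       # class units start "off" -- an assumption
    F = lambda s: s ** n
    base = xi_pix @ v + xi_cls @ x        # (K,) includes the alpha term as well
    c = np.empty(C)
    for a in range(C):
        rest = base - xi_cls[:, a] * x[a]  # pixel term + sum over gamma != alpha
        c[a] = np.tanh(beta * np.sum(F(xi_cls[:, a] + rest)
                                     - F(-xi_cls[:, a] + rest)))
    return c                               # (C,) outputs in (-1, 1); argmax = class
```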
Training on the MNIST dataset: [Figure: memory vectors before and after training; the memories are initialized as random vectors with $\xi^\mu_i \in \mathcal{N}(0, 0.1)$ and become structured after training.]
Main question: What kind of representation of the data has the neural network learned?
Features vs. prototypes in psychology and neuroscience: feature-matching theory (Hubel & Wiesel, 1959) vs. prototype theory (Solso & McCarthy, 1981; Wallis et al., Journal of Vision, 2008). [Figure: a recording electrode measuring the electrical signal from the visual area of the brain in response to a stimulus; a training set of face prototypes.]
Feature to prototype transition: [Figure: learned memories for n = 2, 3, 20, 30; at small n the memories look like feature detectors, at large n like prototype detectors.]

On MNIST, the test errors of the four models (n = 2, 3, 20, 30) lie between 1.44% and 1.80%, on par with the 1.6% reported for a comparable backpropagation network (Simard, Steinkraus, Platt, 2003).
Duality with feed-forward nets: one step of the DAM update is equivalent to a feed-forward network with one hidden layer,
$$h_\mu = f\Big(\sum_{i=1}^{N}\xi^\mu_i v_i\Big), \qquad c_\alpha = g\Big(\sum_{\mu=1}^{K}\xi^\mu_\alpha h_\mu\Big),$$
derived from the energy
$$E = -\sum_{\mu=1}^{K} F\Big(\sum_{i=1}^{N}\xi^\mu_i v_i + \sum_{\alpha=1}^{10}\xi^\mu_\alpha c_\alpha\Big).$$
Duality rule: the activation function of the feed-forward net is the derivative of the energy function of the DAM, $f(x) = F'(x)$.
Commonly used activation functions: the standard DAM with power $n$ is dual to a rectified polynomial activation $f(x) = \mathrm{ReP}^{\,n-1}(x) = \max(x, 0)^{\,n-1}$; for $n = 2$ (the standard Hopfield net) this is $f(x) = \mathrm{ReLU}(x)$.
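The duality can be made concrete in a few lines; `rep`, `dual_forward`, and the array layout are my names, and $g = \tanh$ follows the earlier slide:

```python
import numpy as np

def rep(x, power):
    """Rectified polynomial: max(x, 0)**power (ReLU when power == 1)."""
    return np.maximum(x, 0.0) ** power

def dual_forward(v, xi_pix, xi_cls, n=2):
    """Feed-forward view of the one-step DAM update (a sketch).

    h_mu    = f(sum_i xi^mu_i v_i), with f = F' a rectified polynomial
              of power n - 1
    c_alpha = g(sum_mu xi^mu_alpha h_mu), with g = tanh
    """
    h = rep(xi_pix @ v, n - 1)     # (K,) one hidden unit per memory
    return np.tanh(xi_cls.T @ h)   # (C,) class outputs
```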
Question: Are there any tasks for which models with higher order interactions perform better than models with quadratic interactions?
Adversarial Inputs: gradient descent on the pixels, $v_i \to v_i - \partial C / \partial v_i$, deforms an image until the model's prediction flips. [Figure: a "2" deformed until the n = 2 model reads it as a "3".]
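A minimal sketch of this deformation loop; the step size, iteration count, and pixel range are illustrative assumptions, and `grad_C` stands in for whatever gradient of the cost the model provides:

```python
import numpy as np

def adversarial_deform(v, grad_C, eps=0.01, steps=100):
    """Deform an image by gradient descent on the pixels: v <- v - eps * dC/dv.

    grad_C : callable returning dC/dv, the gradient of the model's cost C
             with respect to the pixels.
    """
    v = v.copy()
    for _ in range(steps):
        v = np.clip(v - eps * grad_C(v), -1.0, 1.0)  # pixel range is an assumption
    return v
```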
Adversarial Deformations in DAM: [Figure: $\log(C_\alpha)$ for the two leading classes ($C_{\mathrm{1st}}$, $C_{\mathrm{2nd}}$) vs. the number of image updates; the misclassification occurs where the trajectories cross the decision boundary.] [Figure: adversarial deformations of test digits for n = 2, 3, 20, 30; at small n the deformed images still look like the original digit, while at large n they visibly morph toward the target class.]
Question: Can we use Dense Associative Memories for classification of high resolution images?
VGG16 coupled to DAM: [Figure: VGG16 convolutional features feeding into the DAM classification layer.]
Adversarial Inputs in the Image Domain
Input transfer: [Figure: grids of adversarial images made with n = 2 and with n = 8, each classified by both the n = 2 and the n = 8 model.]
Error rate of misclassification:

Generate \ Classify   n=2     n=8
n=2                   100%    32%
n=8                   57%     100%
The same transfer experiment on MNIST (generate adversarial images with one model, test on another):

Generate \ Test   n=2      n=3      n=20     n=30
n=2               98.9%    50.7%    9.07%    3.44%
n=3               33.9%    99%      8.71%    3.32%
n=20              45.3%    63.7%    98.9%    5.77%
n=30              37.6%    48.3%    56.9%    98.8%
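A sketch of how such a transfer table could be assembled, reusing `adversarial_deform` from the sketch above; the `.grad_cost` and `.predict` methods are hypothetical stand-ins for whatever interface the two models expose:

```python
def transfer_error(images, labels, gen_model, test_model, eps=0.01, steps=100):
    """Fraction of adversarial images made against gen_model that also fool
    test_model (one cell of the transfer table)."""
    fooled = 0
    for v, y in zip(images, labels):
        v_adv = adversarial_deform(v, gen_model.grad_cost, eps, steps)
        if test_model.predict(v_adv) != y:
            fooled += 1
    return fooled / len(images)
```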
Results on ImageNet Accuracy: 69%
ImageNet errors: [Figure: misclassified examples, with ground-truth labels such as "police van, police wagon, paddy wagon, patrol wagon, black Maria" and "bell cote, bell cot".]
Summary: Dense Associative Memories
$$E = -\sum_{\mu=1}^{K}\Big(\sum_{i=1}^{N}\xi^\mu_i \sigma_i\Big)^n$$
Physics: large capacity. Psychology and neuroscience: feature-to-prototype transition. Computer science: no adversarial problems.