A Vector-Symbolic Architecture for Representing Scene Structure
Eric Weiss, Redwood Center for Theoretical Neuroscience, UC Berkeley
4/15/2016
SLIDE 1
SLIDE 2
Active Vision
SLIDE 3
Active Vision
Advantages:
- Selective processing saves computation
- Should scale well with image size
SLIDE 4
Recurrent Network
Classification
SLIDE 5
Example Trial
[Figure: glimpse windows and glimpse contents (input to the neural net) at t = 0 through t = 3]
SLIDE 6
Issues
Network training objective: locate a specific digit ("find the 4").
Does not achieve good performance, and is difficult to debug...
SLIDE 7
Goal:
A memory architecture for recurrent neural networks that:
- can represent a set of objects and their
positions in the image
- is differentiable
- is easy to analyze
SLIDE 8
[Figure: glimpse sequence at t = 0, 1, 2]
SLIDE 9
Vector-Symbolic Representations
Everything (object identities, positions, etc.) is represented by an n-dimensional vector.
Kanerva 2009, Plate 2003, Danihelka et al. 2016 (Google DeepMind)
Encoding Locations (r)
- need a continuous function: r = L(x, y)
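One way to realize such a continuous location code, a sketch under assumptions rather than the author's exact scheme, is fractional power encoding with unit-modulus complex phasors: fix a random base phase vector per axis and scale the phases by the coordinates (the names `L`, `phi_x`, `phi_y` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024  # vector dimensionality

# Random base phase vectors, one per spatial axis (hypothetical choice).
phi_x = rng.uniform(-np.pi, np.pi, n)
phi_y = rng.uniform(-np.pi, np.pi, n)

def L(x, y):
    """Continuous location encoding: unit-modulus complex phasors
    whose phases vary linearly with (x, y)."""
    return np.exp(1j * (x * phi_x + y * phi_y))

r = L(0.3, -0.7)
print(np.allclose(np.abs(r), 1.0))              # True: components lie on the unit circle
print(np.allclose(L(1, 2) * L(3, 4), L(4, 6)))  # True: phases add under multiplication
```

A useful consequence of this encoding is that element-wise multiplication of two location codes corresponds to translation: L(x1, y1) * L(x2, y2) = L(x1 + x2, y1 + y2).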
SLIDE 10
Vector-Symbolic Representations
Association (Binding)
- can bind two vectors by multiplying element-wise,
forming a new vector m = v*r
- m will be nearly orthogonal to any other
meaningful vector. So, binding ≅ scrambling
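The "binding ≅ scrambling" claim can be checked numerically. A minimal sketch, assuming FHRR-style random unit-modulus complex vectors (the helper names `rand_phasor` and `sim` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2048

def rand_phasor(n, rng):
    # Random unit-modulus complex vector (FHRR-style symbol).
    return np.exp(1j * rng.uniform(-np.pi, np.pi, n))

v = rand_phasor(n, rng)   # e.g. an object identity
r = rand_phasor(n, rng)   # e.g. a location code
m = v * r                 # binding: element-wise multiplication

def sim(a, b):
    # Normalized similarity (real part of the inner product).
    return np.real(np.vdot(a, b)) / len(a)

# The bound vector looks like noise relative to each of its factors:
# both similarities concentrate near zero for large n.
print(abs(sim(m, v)))
print(abs(sim(m, r)))
```

The similarities shrink as O(1/sqrt(n)), which is why bound pairs do not interfere with the other vectors stored in a scene.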
SLIDE 11
Vector-Symbolic Representations
Storing a Set (“scene vector”)
- set of items is constructed by summing vectors
m = v0*r0 + v1*r1 + v2*r2
SLIDE 12
Vector-Symbolic Representations
Retrieval (Un-binding)
- can recover the information associated with a
vector r0 by multiplying the memory vector m by its inverse r0^-1 (for unit-modulus complex vectors, the complex conjugate)
m*r0^-1 = (v0*r0 + v1*r1 + v2*r2)*r0^-1
        = v0*r0*r0^-1 + v1*r1*r0^-1 + v2*r2*r0^-1
        = v0 + "noise"
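The full store-then-retrieve cycle can be demonstrated end to end. A minimal sketch, again assuming random unit-modulus complex vectors (helper names are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4096

def rand_phasor():
    return np.exp(1j * rng.uniform(-np.pi, np.pi, n))

def sim(a, b):
    return np.real(np.vdot(a, b)) / n

# Three object identities and three location codes.
v = [rand_phasor() for _ in range(3)]
r = [rand_phasor() for _ in range(3)]

# Scene vector: superposition of bound (identity, location) pairs.
m = v[0] * r[0] + v[1] * r[1] + v[2] * r[2]

# Unbind location r0: multiply by its inverse (the complex conjugate).
est = m * np.conj(r[0])

# The estimate is dominated by v0; the other two terms act as noise.
sims = [sim(est, vi) for vi in v]
print(sims)
```

With n = 4096 the similarity to v0 comes out close to 1, while the similarities to v1 and v2 stay near zero, matching the v0 + "noise" decomposition above.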
SLIDE 13
[Figure: glimpse sequence at t = 0, 1, 2]
Encoding a Scene as a Vector
v0*L(x0, y0) + v1*L(x1, y1) + v2*L(x2, y2) = m
SLIDE 14
Dataset & Learning
- Trained on toy "multi-MNIST" dataset to report digit identity when queried with a location
SLIDE 15
Glimpse Selection Strategy
- Use reinforcement learning
to obtain glimpse policy
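The slides do not specify the algorithm, so as one hedged illustration of how a glimpse policy could be trained with reinforcement learning, here is a toy REINFORCE sketch: a softmax policy over a coarse grid of candidate glimpse cells, rewarded when the chosen cell contains the digit (all names and the reward setup are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

n_locations = 16
theta = np.zeros(n_locations)   # policy logits
digit_cell = 5                  # (toy) grid cell that contains the digit
lr = 0.5

for _ in range(500):
    p = np.exp(theta) / np.exp(theta).sum()   # softmax policy
    a = rng.choice(n_locations, p=p)          # sample a glimpse location
    reward = 1.0 if a == digit_cell else 0.0
    # REINFORCE: grad of log pi(a) w.r.t. theta is one_hot(a) - p
    grad = -p
    grad[a] += 1.0
    theta += lr * reward * grad

p = np.exp(theta) / np.exp(theta).sum()
# The policy mass should concentrate on the rewarding cell.
print(np.argmax(p))
```

In the actual system the reward would instead come from downstream task performance (e.g. classification accuracy after the glimpse), but the gradient estimator has the same form.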
SLIDES 16-25
background (image-only slides; no recoverable text)
SLIDE 26
Intelligent vs. Random Glimpses
[Plot: classification accuracy vs. glimpse number]
SLIDE 27
Summary / Contributions
- Introduced a novel spatial memory
framework for image processing and neural networks
- Demonstrated its effectiveness on a toy
dataset
- Showed how it can be used to intelligently
guide visual attention
SLIDE 28
Future Work
- Variable glimpse scaling
- Incorporation in an end-to-end trainable
recurrent neural network
- Application to real-world datasets with much
larger numbers of object categories
SLIDE 29
Thanks for listening!
Many thanks to Stephen Tyree, Shalini Gupta, Pavlo Molchanov, Brian Cheung, Bruno Olshausen, and Jan Kautz for support and valuable discussions.
SLIDE 30
Differentiable Glimpse Mechanism
I: image
S(x, y): sampling matrix
x, y: glimpse coordinates
g: glimpse output
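One common way to make glimpse extraction differentiable in the glimpse coordinates is to build the sampling matrix from smooth interpolation kernels. A minimal 1-D sketch using Gaussian weights (the real mechanism would be 2-D, typically as a separable product over x and y; the function name and sigma value are illustrative):

```python
import numpy as np

def gaussian_sampling_matrix(coords, grid, sigma=0.05):
    """Each row holds normalized Gaussian weights centered on one sample
    coordinate, so the glimpse g = S @ I is differentiable w.r.t. coords."""
    w = np.exp(-((coords[:, None] - grid[None, :]) ** 2) / (2 * sigma**2))
    return w / w.sum(axis=1, keepdims=True)

# Image defined on a grid over [-1, 1] (matching the axes on the slide).
grid = np.linspace(-1, 1, 64)
I = np.sin(3 * grid)                        # toy 1-D "image"

x = 0.2                                     # glimpse center
samples = np.linspace(x - 0.1, x + 0.1, 8)  # 8 sample points in the window
S = gaussian_sampling_matrix(samples, grid)
g = S @ I                                   # glimpse output, shape (8,)

print(g.shape)
```

Because every entry of S is a smooth function of the glimpse coordinates, gradients from the downstream loss can flow back through g into x and y.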