A new look at state-space models for neural data
Liam Paninski
Department of Statistics and Center for Theoretical Neuroscience, Columbia University
http://www.stat.columbia.edu/~liam
liam@stat.columbia.edu
June 27, 2008
Support: NIH CRCNS, Sloan Fellowship, NSF CAREER, McKnight Scholar award.
State-space models
• Unobserved state q_t with Markov dynamics p(q_{t+1} | q_t)
• Observed y_t: p(y_t | q_t)
• Goal: infer p(q_t | Y_{0:T})
• Exact solutions: finite state-space HMM, Kalman filter (KF): forward-backward algorithm (recursive; O(T) time)
• Approximate solutions: extended KF, particle filter, etc. Basic idea: recursively update an approximation to the “forward” distribution p(q_t | Y_{0:t})
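As a concrete illustration of the recursive forward update, here is a minimal sketch of the exact Kalman filter recursion for a linear-Gaussian state-space model; the dynamics matrix A, observation matrix C, and noise covariances Qn and R are hypothetical placeholders, not quantities defined on these slides.

import numpy as np

def kalman_forward(Y, A, C, Qn, R, m0, P0):
    # exact forward recursion: p(q_t | Y_{0:t}) = N(m_t, P_t) for
    #   q_{t+1} = A q_t + w_t, w_t ~ N(0, Qn);   y_t = C q_t + v_t, v_t ~ N(0, R)
    m, P = m0, P0
    means, covs = [], []
    for y in Y:
        # dynamics step: propagate the previous forward distribution through the Markov dynamics
        m_pred = A @ m
        P_pred = A @ P @ A.T + Qn
        # observation step: fold in p(y_t | q_t) and renormalize (Gaussian algebra)
        S = C @ P_pred @ C.T + R
        K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
        m = m_pred + K @ (y - C @ m_pred)
        P = P_pred - K @ C @ P_pred
        means.append(m)
        covs.append(P)
    return means, covs

Each update costs a constant amount of work per time step, so the full forward pass is O(T); the approximate methods above (extended KF, particle filter) replace this exact Gaussian update with an approximation of the same recursive form.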
Example: image stabilization From (Pitkow et al., 2007): neighboring letters on the 20/20 line of the Snellen eye chart. Trace shows 500 ms of eye movement.
A state-space method for image stabilization
Assume the image I(x) is fixed; q_t = the (unknown) eye position. Simple random-walk dynamics for q_t: q_{t+1} = q_t + e_t, with e_t i.i.d. The image falling on the retina at point x is I_t(x) = I(x − q_t). Goal: infer p(I | Y_{0:T}). Initialize with a prior p(I). Now recurse:
• dynamics step: p(I_t | Y_{0:t}) → p(I_{t+1} | Y_{0:t}) = ∫ S_e[ p(I_t | Y_{0:t}) ] p(e) de (a mixture; S_e denotes the shift by e)
• observation step: p(I_{t+1} | Y_{0:t+1}) ∝ p(I_{t+1} | Y_{0:t}) p(y_{t+1} | I_{t+1})
• do a greedy merge to make sure the number of mixture components stays bounded
Now we just need a model for p(y_{t+1} | I_{t+1})...
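To make the two recursion steps concrete, here is a minimal sketch on a discretized grid of eye positions, where the forward distribution is just a probability vector; in the actual method the hidden object is the image itself and the forward distribution is a mixture that must be greedily merged. The grid, step kernel, and likelihood are illustrative assumptions.

import numpy as np

def forward_step(p_q, step_kernel, obs_lik):
    # p_q:         p(q_t | Y_{0:t}) on a 1-d grid of eye positions
    # step_kernel: p(e), the random-walk step distribution on grid offsets
    # obs_lik:     p(y_{t+1} | q_{t+1}) evaluated at each grid point
    p_pred = np.convolve(p_q, step_kernel, mode="same")   # dynamics step (mixture over shifts)
    p_post = p_pred * obs_lik                              # observation step (Bayes rule)
    return p_post / p_post.sum()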
Multineuronal generalized linear model
λ_i(t) = f( b_i + k_i · I_t + Σ_{j,τ} h_{i,j} n_j(t − τ) ),   θ = (b_i, k_i, h_{i,j})
— log p(Y | I, θ) is concave in both θ and I (Pillow et al., 2008).
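A minimal sketch of this conditional intensity and the corresponding discrete-time point-process log-likelihood; the exponential nonlinearity and the time step dt are illustrative assumptions (any fixed convex, log-concave f preserves the concavity noted above).

import numpy as np

def glm_rate(b_i, k_i, I_t, h_i, n_hist):
    # lambda_i(t) = f( b_i + k_i . I_t + sum_{j,tau} h_{i,j,tau} n_j(t - tau) ), with f = exp here
    # n_hist[j, tau] holds recent spike counts of neuron j at lag tau
    return np.exp(b_i + k_i @ I_t + np.sum(h_i * n_hist))

def pointprocess_loglik(rates, spikes, dt):
    # discrete-time point-process log-likelihood: sum_t [ n_t log(lambda_t dt) - lambda_t dt ]
    rates, spikes = np.asarray(rates), np.asarray(spikes)
    return np.sum(spikes * np.log(rates * dt) - rates * dt)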
Simulated example: image stabilization
(Figure: true image with translations; observed noisy retinal responses; estimated image.)
Questions: how much high-frequency information can we recover? What is the effect of the nonlinear spiking response (Rucci et al., 2007)?
Computing the MAP path
We often want to compute the MAP estimate Q̂ = arg max_Q p(Q | Y). In the standard Kalman setting, forward-backward gives the MAP path (because E(Q | Y) and Q̂ coincide in the Gaussian case). More generally, extended Kalman-based methods give approximate MAP, but are non-robust: the forward distribution p(q_t | Y_{0:t}) may be highly non-Gaussian even if the full joint distribution p(Q | Y) is nice and log-concave.
Write out the posterior:
log p(Q | Y) = log p(Q) + log p(Y | Q) = Σ_t log p(q_{t+1} | q_t) + Σ_t log p(y_t | q_t)
Two basic observations:
• If log p(q_{t+1} | q_t) and log p(y_t | q_t) are concave, then so is log p(Q | Y).
• The Hessian H of log p(Q | Y) is block-tridiagonal: log p(y_t | q_t) contributes a block-diagonal term, and log p(q_{t+1} | q_t) contributes a block-tridiagonal term.
Now recall Newton’s method: iteratively solve H Q_dir = ∇. Solving block-tridiagonal systems requires O(T) time.
— computing the MAP path by Newton’s method requires O(T) time, even in highly non-Gaussian cases. (Newton here acts as an iteratively reweighted Kalman smoother (Davis and Rodriguez-Yam, 2005; Jungbacker and Koopman, 2007); all sufficient statistics may be obtained in O(T) time. Similar results are also well-known for expectation propagation (Ypma and Heskes, 2003; Yu and Sahani, 2007).)
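A minimal sketch of one Newton step for the MAP path when the Hessian is block-tridiagonal; the negative Hessian is stored in symmetric banded form so the linear solve costs O(T). The gradient and Hessian callables are placeholders, and scipy's banded Cholesky solver stands in for a hand-rolled Kalman-smoother-style recursion.

import numpy as np
from scipy.linalg import solveh_banded

def newton_map_step(Q, grad_logpost, neg_hess_banded, stepsize=1.0):
    # grad_logpost(Q):    gradient of log p(Q | Y), a length-(T*d) vector
    # neg_hess_banded(Q): negative Hessian in lower banded storage, shape (bandwidth+1, T*d);
    #                     positive definite whenever log p(Q | Y) is strictly concave
    g = grad_logpost(Q)
    Hb = neg_hess_banded(Q)
    Q_dir = solveh_banded(Hb, g, lower=True)   # banded Cholesky solve: O(T) time and memory
    return Q + stepsize * Q_dir                # in practice, choose stepsize by backtracking line search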
Comparison on simulated soft-threshold leaky integrate-and-fire data
Model: dV_t = −(V_t/τ) dt + σ dB_t;  λ(t) = f(V_t).
— extended Kalman-based methods are best in the high-information (low-noise) limit, where the Gaussian approximation is most accurate (Koyama et al., 2008).
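A minimal simulation sketch of this soft-threshold model (Euler discretization of the voltage SDE, Poisson spiking at rate f(V_t)); the parameter values and the exponential choice of f are arbitrary illustrations.

import numpy as np

def simulate_soft_lif(T=2000, dt=1e-3, tau=0.02, sigma=0.5, f=lambda v: np.exp(v + 2.0)):
    # dV_t = -(V_t / tau) dt + sigma dB_t;  spikes ~ Poisson with rate lambda(t) = f(V_t)
    rng = np.random.default_rng(0)
    V = np.zeros(T)
    spikes = np.zeros(T, dtype=int)
    for t in range(1, T):
        V[t] = V[t - 1] - (V[t - 1] / tau) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        spikes[t] = rng.poisson(f(V[t]) * dt)
    return V, spikes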
Parameter estimation
Standard method: Expectation-Maximization (EM). Iterate between computing E(Q | Y) (or Q̂) and maximizing w.r.t. the parameters θ. Can be seen as coordinate ascent (slow) on the first two terms of the Laplace approximation:
log p(Y | θ) = log ∫ p(Q | θ) p(Y | θ, Q) dQ ≈ log p(Q̂_θ | θ) + log p(Y | Q̂_θ, θ) − (1/2) log |H_{Q̂_θ}|,
where Q̂_θ = arg max_Q { log p(Q | θ) + log p(Y | Q, θ) }.
Better approach: simultaneous joint optimization. Main case of interest:
λ_i(t) = f( b + k_i · x(t) + Σ_{i′,j} h_{i′,j} n_{i′}(t − j) + q_i(t) ) = f[ X_t θ + q_i(t) ]
q_{t+dt} = q_t + A q_t dt + σ √dt ε_t
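A minimal sketch of evaluating this Laplace approximation once Q̂_θ has been computed; because H_{Q̂_θ} is block-tridiagonal, a banded Cholesky factorization gives the log-determinant term in O(T) time. The log-prior, log-likelihood, and banded-Hessian callables are placeholders.

import numpy as np
from scipy.linalg import cholesky_banded

def laplace_log_marginal(Q_hat, theta, log_prior, log_lik, neg_hess_banded):
    # log p(Y | theta) ~= log p(Q_hat | theta) + log p(Y | Q_hat, theta) - 0.5 log|H_{Q_hat}|
    # (dropping the constant (dim/2) log(2 pi)); H is the negative Hessian of the log-posterior
    Hb = neg_hess_banded(Q_hat, theta)      # lower banded storage, main diagonal in row 0
    L = cholesky_banded(Hb, lower=True)     # O(T) banded Cholesky factorization
    logdet = 2.0 * np.sum(np.log(L[0]))     # log|H| = 2 * sum(log diag(L))
    return log_prior(Q_hat, theta) + log_lik(Q_hat, theta) - 0.5 * logdet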
More generally, assume q_t has an AR(p) prior and the observations y_t are members of a canonical exponential family with parameter X_t θ + q_t. We want to optimize
log p(Q̂_θ | θ) + log p(Y | Q̂_θ, θ) − (1/2) log |H_{Q̂_θ}|
w.r.t. θ. If we drop the last term, we have a simple jointly concave optimization:
θ̂ = arg max_θ { log p(Q̂_θ | θ) + log p(Y | Q̂_θ, θ) } = arg max_θ max_Q { log p(Q | θ) + log p(Y | Q, θ) }.
Write the joint Hessian in (Q, θ) as
( H_θθ    H_θQ^T )
( H_θQ    H_QQ  ),
with H_QQ block-tridiagonal. Now use the Schur complement to efficiently compute the Newton step. Computing ∇_θ log |H_{Q̂_θ}| also turns out to be easy (O(T) time) here.
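A minimal sketch of the joint Newton step in (θ, Q) via the Schur complement of the banded block; all blocks below are blocks of the negative Hessian (positive definite under concavity), the banded solves each cost O(T), and the remaining dense solve is only dim(θ) × dim(θ). The inputs are assumed precomputed.

import numpy as np
from scipy.linalg import solveh_banded

def joint_newton_direction(g_theta, g_Q, J_tt, J_tQ, J_QQ_banded):
    # Solve [J_tt J_tQ; J_tQ^T J_QQ] [d_theta; d_Q] = [g_theta; g_Q] with J_QQ banded.
    X = solveh_banded(J_QQ_banded, J_tQ.T, lower=True)   # J_QQ^{-1} J_tQ^T, O(T) per column
    y = solveh_banded(J_QQ_banded, g_Q, lower=True)      # J_QQ^{-1} g_Q
    S = J_tt - J_tQ @ X                                  # small Schur complement in theta
    d_theta = np.linalg.solve(S, g_theta - J_tQ @ y)
    d_Q = y - X @ d_theta
    return d_theta, d_Q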
Constrained optimization
In many cases we need to impose (e.g., nonnegativity) constraints on q_t. Easy to incorporate here, via interior-point (barrier) methods:
arg max_{Q ∈ C} log p(Q | Y) = lim_{ε ↘ 0} arg max_Q { log p(Q | Y) + ε Σ_t f(q_t) }
= lim_{ε ↘ 0} arg max_Q { Σ_t [ log p(q_{t+1} | q_t) + log p(y_t | q_t) + ε f(q_t) ] };
f(·) is concave and approaches −∞ near the boundary of the constraint set C. The Hessian remains block-tridiagonal and negative semidefinite for all ε > 0, so the optimization still requires just O(T) time.
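A minimal sketch of the barrier term for a nonnegativity constraint q_t ≥ 0, using f(q) = log q; the barrier only adds a diagonal term to the Hessian, so the block-tridiagonal structure (and the O(T) Newton solve above) is preserved. The log-posterior gradient/Hessian callables are placeholders.

import numpy as np

def barrier_objective(Q, log_post, eps):
    # log p(Q | Y) + eps * sum_t f(q_t), with f(q) = log(q): concave, -> -inf at the boundary q = 0
    return log_post(Q) + eps * np.sum(np.log(Q))

def barrier_grad_hessdiag(Q, grad_post, hessdiag_post, eps):
    # the barrier adds eps / q_t to the gradient and -eps / q_t^2 to the Hessian diagonal only
    return grad_post(Q) + eps / Q, hessdiag_post(Q) - eps / Q**2

An outer loop then decreases ε toward 0 (e.g. ε ← ε/10), warm-starting each Newton solve at the previous solution.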
Example: computing the MAP subthreshold voltage given superthreshold spikes
Leaky, noisy integrate-and-fire model:
V(t + dt) = V(t) − dt V(t)/τ + σ √dt ε_t,  ε_t ∼ N(0, 1)
Observations: y_t = 0 (no spike) if V_t < V_th; y_t = 1 if V_t = V_th (Paninski, 2006)
Example: inferring presynaptic input
I_t = Σ_j g_j(t) (V_j − V_t)
g_j(t + dt) = g_j(t) − dt g_j(t)/τ_j + N_j(t),  N_j(t) > 0
(Figure: true conductance g, injected current I(t), and estimated g, plotted over time in sec.) (Paninski, 2007)
Example: inferring spike times from slow, noisy calcium data
C(t + dt) = C(t) − dt C(t)/τ + N_t;  N_t > 0;  y_t = C_t + ε_t
— nonnegative deconvolution is a recurring problem in signal processing (e.g., spike sorting); many applications of these fast methods (Vogelstein et al., 2008).
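For intuition, here is a minimal dense sketch of this nonnegative deconvolution in the linear-Gaussian case, recast as nonnegative least squares in the inputs N_t; the decay constant and time step are made-up values, and a generic NNLS solver stands in for the fast banded interior-point approach described in (Vogelstein et al., 2008).

import numpy as np
from scipy.optimize import nnls

def deconvolve_calcium(y, dt=1e-2, tau=0.5):
    # Model: C(t + dt) = (1 - dt/tau) C(t) + N_t with N_t >= 0, and y_t = C_t + noise.
    # Unrolling the recursion gives C = A N with A lower-triangular (powers of the decay),
    # so N_hat = argmin_{N >= 0} || y - A N ||^2 is a nonnegative least-squares problem.
    T = len(y)
    decay = 1.0 - dt / tau
    lag = np.subtract.outer(np.arange(T), np.arange(T))        # lag[i, j] = i - j
    A = np.where(lag >= 0, decay ** np.clip(lag, 0, None), 0.0)
    N_hat, _ = nnls(A, np.asarray(y, dtype=float))
    return N_hat, A @ N_hat          # inferred inputs and the fitted calcium trace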
Further generalizations: GLM spike train decoding
We’ve emphasized tridiagonal structure so far, but similar results hold for any problem with a banded Hessian. For example, look at the point-process GLM again:
λ_i(t) = f( b + k_i · x(t) + Σ_{i′,j} h_{i′,j} n_{i′}(t − j) )
If the spatiotemporal filter k_i has a finite impulse response, then the Hessian (w.r.t. x(t)) is banded and optimal decoding of the stimulus x(t) requires O(T) time. Similar speedups for MCMC methods (Ahmadian et al., 2008).
How important is timing? (Ahmadian et al., 2008)
Coincident spikes are more “important”
Constructing a metric between spike trains
d(r_1, r_2) ≡ d_x(x_1, x_2)
Locally, d(r, r + δr) = δr^T G_r δr: interesting information in G_r.
Effects of jitter on spike trains
Look at degradations as we add Gaussian noise with covariance:
• α*: C ∝ G^{-1} (optimal: minimizes error under a constraint on |C|)
• α_1: C ∝ diag(G)^{-1} (perturb less important spikes more)
• α_2: C ∝ blkdiag(G)^{-1} (perturb spikes from different cells independently)
• α_3: C ∝ I (simplest)
— Non-correlated perturbations are more costly. Can also add/remove spikes: cost of spike addition ≈ cost of jittering by 10 ms.
One last extension: two-d smoothing
Estimation of two-d firing rate surfaces comes up in a number of examples:
• place fields / grid cells
• post-fitting in spike-triggered covariance analysis
• tracking of non-stationary (time-varying) tuning curves
• “inhomogeneous Markov interval” models for spike-history dependence
How to generalize fast 1-d state-space methods to the 2-d case? Idea: use Gaussian process priors which are carefully selected to give banded Hessians.
Model: the hidden variable Q is a random surface with a Gaussian prior, Q ∼ N(µ, C); spikes are generated by a point process whose rate is a function of Q: λ(x) = f[Q(x)] (easy to incorporate additional effects here, e.g. spike history).
Now the Hessian of the log-posterior of Q is C^{-1} + D, where D is diagonal (Cunningham et al., 2007). For Newton, we need to solve (C^{-1} + D) Q_dir = ∇.
Example: nearest-neighbor smoothing prior
For a prior covariance C such that C^{-1} contains only neighbor potentials, we can solve (C^{-1} + D) Q_dir = ∇ in O(dim(Q)^{1.5}) time using direct methods (the “approximate minimum degree” ordering, built into Matlab's sparse A\b code), and potentially in O(dim(Q)) time using multigrid (iterative) methods (Rahnama Rad and Paninski, 2008).
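A minimal sketch of this sparse Newton-step solve; scipy's sparse direct solver applies a fill-reducing column ordering automatically (playing the role of Matlab's sparse A\b with its approximate-minimum-degree ordering), and the nearest-neighbor precision matrix built here (a 2-d grid Laplacian plus a small ridge) is an illustrative stand-in for the actual prior.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def grid_precision(n, rho=1.0, ridge=1e-3):
    # nearest-neighbor precision matrix C^{-1} on an n x n grid:
    # a 2-d graph Laplacian (neighbor potentials only) plus a ridge so the prior is proper
    lap1d = sp.diags([-np.ones(n - 1), 2 * np.ones(n), -np.ones(n - 1)], [-1, 0, 1])
    I = sp.identity(n)
    return rho * (sp.kron(lap1d, I) + sp.kron(I, lap1d)) + ridge * sp.identity(n * n)

def newton_direction(Cinv, d_diag, grad):
    # solve (C^{-1} + D) Q_dir = grad with D diagonal; sparsity keeps the direct solve
    # roughly O(dim(Q)^{1.5}) on a 2-d grid
    A = (Cinv + sp.diags(d_diag)).tocsc()
    return spsolve(A, grad)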
Estimating a time-varying tuning curve given a limited sample path
Estimating a two-d place field