Two problems from neural data analysis: Sparse entropy estimation and efficient adaptive experimental design. Liam Paninski, Department of Statistics and Center for Theoretical Neuroscience, Columbia University. http://www.stat.columbia.edu/~liam liam@stat.columbia.edu September 7, 2006
The fundamental question in neuroscience. The neural code: what is P(response | stimulus)? Main question: how to estimate P(r | s) from (sparse) experimental data?
Curse of dimensionality Both stimulus and response can be very high-dimensional. Stimuli: • images • sounds • time-varying behavior Responses: • observations from single or multiple simultaneously-recorded point processes
Avoiding the curse of insufficient data. 1: Estimate some functional f(p) instead of the full joint p(r, s) — information-theoretic functionals. 2: Select stimuli more efficiently — optimal experimental design. 3: Improved nonparametric estimators — minimax theory for discrete distributions under KL loss. (4: Parametric approaches; connections to biophysical models.)
Part 1: Estimation of information. Many central questions in neuroscience are inherently information-theoretic: • What inputs are most reliably encoded by a given neuron? • Are sensory neurons optimized to transmit information about the world to the brain? • Do noisy synapses limit the rate of information flow from neuron to neuron? Quantification of "information" is a fundamental problem (of interest not only in neuroscience but also in physics, telecommunications, genomics, etc.).
Shannon mutual information
$$I(X;Y) = \int_{\mathcal{X}\times\mathcal{Y}} dp(x,y)\, \log \frac{dp(x,y)}{dp(x) \times dp(y)}$$
Information-theoretic justifications: • invariance • "uncertainty" axioms • data processing inequality • channel and source coding theorems. But an obvious open experimental question: • is this computable for real data?
How to estimate information: I is very hard to estimate in general... but lower bounds are easier. Data processing inequality: I(X;Y) ≥ I(S(X); T(Y)). Suggests a sieves-like approach.
Discretization approach. Discretize X, Y → X_disc, Y_disc, and estimate I_discrete(X;Y) = I(X_disc; Y_disc). • Data processing inequality ⇒ I_discrete ≤ I • I_discrete ↗ I as the partition is refined. Strategy: refine the partition as the sample size N increases; if the number of bins m doesn't grow too fast, Î → I_discrete ↗ I. Completely nonparametric, but obvious concerns: • Want N ≫ m(N) samples, to "fill in" the histogram p(x, y) • How large are the bias and variance for fixed m?
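To make the refinement strategy concrete, here is a minimal Python sketch (my own illustration, not from the talk): bin a correlated Gaussian pair into m equal-probability bins per axis and compute the plug-in discretized information; as m grows (with N large enough), the estimate climbs toward the analytic value −(1/2) log(1 − ρ²).

```python
import numpy as np

def discretized_mi(x, y, m):
    """Plug-in mutual information (nats) after binning x and y into m
    equal-probability bins each."""
    qx = np.quantile(x, np.linspace(0, 1, m + 1)[1:-1])
    qy = np.quantile(y, np.linspace(0, 1, m + 1)[1:-1])
    counts = np.zeros((m, m))
    np.add.at(counts, (np.digitize(x, qx), np.digitize(y, qy)), 1)
    p = counts / counts.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

# correlated Gaussian pair: I(X;Y) = -0.5 * log(1 - rho^2) is known exactly
rng = np.random.default_rng(1)
rho, N = 0.8, 200_000
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=N)
for m in (2, 4, 8, 16, 32):
    print(m, discretized_mi(z[:, 0], z[:, 1], m))   # increases with m ...
print("true I:", -0.5 * np.log(1 - rho ** 2))       # ... toward this value
```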
Bias is major problem
$$\hat{I}_{\mathrm{MLE}}(X;Y) = \sum_{x=1}^{m_x}\sum_{y=1}^{m_y} \hat{p}_{\mathrm{MLE}}(x,y)\, \log \frac{\hat{p}_{\mathrm{MLE}}(x,y)}{\hat{p}_{\mathrm{MLE}}(x)\,\hat{p}_{\mathrm{MLE}}(y)}, \qquad \hat{p}_{\mathrm{MLE}}(x) = p_N(x) = \frac{n(x)}{N} \ \ \text{(empirical measure)}$$
Fix p(x, y), m_x, m_y and let the sample size N → ∞. Then:
• Bias(Î_MLE) ∼ (m_x m_y − m_x − m_y + 1)/(2N)
• Variance(Î_MLE) ∼ (log m)²/N; dominated by the bias if m = m_x m_y is large
• No unbiased estimator exists. (Miller, 1955; Paninski, 2003)
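A minimal simulation sketch (mine, not code from the paper) of the upward bias: with X and Y independent the true information is zero, so the average plug-in estimate is pure bias and sits close to the (m_x m_y − m_x − m_y + 1)/(2N) formula when N/m is comfortably large.

```python
import numpy as np

def plugin_mi(counts):
    """Plug-in (MLE) mutual information, in nats, from a joint count table."""
    p = counts / counts.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
mx, my, N, reps = 10, 10, 1000, 300
est = []
for _ in range(reps):
    x = rng.integers(mx, size=N)          # X, Y independent and uniform: true I = 0
    y = rng.integers(my, size=N)
    counts = np.zeros((mx, my))
    np.add.at(counts, (x, y), 1)
    est.append(plugin_mi(counts))
print("mean plug-in estimate:", np.mean(est), "nats")
print("Miller bias approximation:", (mx * my - mx - my + 1) / (2 * N), "nats")
```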
Convergence of common information estimators. Result 1: If N/m → ∞, Î_MLE and related estimators are universally almost surely consistent. Converse: if N/m → c < ∞, Î_MLE and related estimators typically converge to the wrong answer almost surely. (The asymptotic bias can often be computed explicitly.) Implication: if N/m is small, the bias is large even though the errorbars vanish, and even if "bias-corrected" estimators are used (Paninski, 2003).
Estimating information on m bins with fewer than m samples. Result 2: A new estimator that is uniformly consistent as N → ∞ even if N/m → 0 (albeit sufficiently slowly). The error bounds hold for all underlying distributions: the estimator works well even in the worst case. Interpretation: information is strictly easier to estimate than p itself! (Paninski, 2004)
Derivation of new estimator. It suffices to develop a good estimator of discrete entropy:
$$I_{\mathrm{discrete}}(X;Y) = H(X_{\mathrm{disc}}) + H(Y_{\mathrm{disc}}) - H(X_{\mathrm{disc}}, Y_{\mathrm{disc}}), \qquad H(X) = -\sum_{x=1}^{m_x} p(x)\log p(x)$$
Derivation of new estimator. Variational idea: choose the estimator that minimizes an upper bound on the error over
$$\mathcal{H} = \Big\{ \hat H : \hat H(p_N) = \sum_i g\big(p_N(i)\big) \Big\} \qquad (p_N = \text{empirical measure})$$
Approximation-theoretic (binomial) bias bound:
$$\max_p \mathrm{Bias}_p(\hat H) \le B^*(\hat H) \equiv m \cdot \max_{0 \le p \le 1} \Big| -p\log p - \sum_{j=0}^{N} g\big(\tfrac{j}{N}\big)\, B_{N,j}(p) \Big|$$
McDiarmid-Steele bound on variance:
$$\max_p \mathrm{Var}_p(\hat H) \le V^*(\hat H) \equiv N \max_j \Big( g\big(\tfrac{j}{N}\big) - g\big(\tfrac{j-1}{N}\big) \Big)^2$$
Derivation of new estimator. Choose the estimator that minimizes the (convex) error bound over the (convex) space H:
$$\hat H_{\mathrm{BUB}} = \operatorname*{argmin}_{\hat H \in \mathcal{H}} \big[ B^*(\hat H)^2 + V^*(\hat H) \big].$$
Optimization of convex functions on convex parameter spaces is computationally tractable by simple descent methods. The consistency proof involves the Stone-Weierstrass theorem and penalized polynomial approximation theory in the Poisson limit N/m → c.
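A rough numerical sketch (mine, not the released BUB code): evaluate the two bounds from the previous slide for any estimator of the form Ĥ = Σ_i g(p_N(i)), then take normalized subgradient steps on the convex objective B*² + V*, starting from the plug-in coefficients g(j/N) = −(j/N) log(j/N). Function names, grid sizes, and step sizes are illustrative assumptions; the published BUB estimator solves a more carefully regularized version of this problem.

```python
import numpy as np
from scipy.stats import binom

def bounds(a, B, target, m, N):
    """B*(a), V*(a) for H_hat = sum_i a[n_i], where a[j] plays the role of g(j/N).
    B is the Bernstein basis B_{N,j}(p) on a p-grid; target is -p log p on that grid."""
    resid = target - a @ B
    return m * np.max(np.abs(resid)), N * np.max(np.diff(a) ** 2), resid

def bub_sketch(N, m, n_iter=3000, step=0.05, n_grid=2000):
    p = np.linspace(1e-6, 1, n_grid)
    j = np.arange(N + 1)
    B = binom.pmf(j[:, None], N, p[None, :])
    target = -p * np.log(p)
    a = np.zeros(N + 1)                          # start at the plug-in coefficients
    a[1:] = -(j[1:] / N) * np.log(j[1:] / N)
    for t in range(n_iter):
        Bstar, Vstar, resid = bounds(a, B, target, m, N)
        k = np.argmax(np.abs(resid))             # active grid point of the bias bound
        grad = 2 * Bstar * (-m * np.sign(resid[k])) * B[:, k]   # subgradient of B*^2
        d = np.diff(a)
        i = np.argmax(d ** 2)                    # active pair of the variance bound
        grad[i] -= 2 * N * d[i]
        grad[i + 1] += 2 * N * d[i]
        a -= step / np.sqrt(t + 1) * grad / (np.linalg.norm(grad) + 1e-12)
    return a, bounds(a, B, target, m, N)[:2]

a_bub, (Bstar, Vstar) = bub_sketch(N=100, m=400)
print("worst-case RMS bound (nats):", np.sqrt(Bstar ** 2 + Vstar))
```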
Error comparisons: upper and lower bounds. [Figure: upper and lower bounds on the maximum RMS error (bits) of the BUB and JK estimators, plotted against sample size N (10^1 to 10^6, log scale), at N/m = 0.25.]
Undersampling example. [Figure: true p(x,y), estimated p(x,y), and |error|, shown as images over x and y.] m_x = m_y = 1000; N/m_xy = 0.25. Î_MLE = 2.42 bits; "bias-corrected" Î_MLE = −0.47 bits; Î_BUB = 0.74 bits, with conservative (worst-case RMS upper bound) error ±0.2 bits; true I(X;Y) = 0.76 bits.
Shannon (−p log p) is special. Obvious conjecture: Σ_i p_i^α, 0 < α < 1 (Rényi entropy) should behave similarly. [Figure: −p log(p) vs. sqrt(p) near p = 0.] Result 3: Surprisingly, not true: no estimator can uniformly estimate Σ_i p_i^α, α ≤ 1/2, if N ∼ m (Paninski, 2004). In fact, one needs N > m^{(1−α)/α}: smaller α ⇒ more data needed. (Proof via Bayesian lower bounds on the minimax error.)
Directions • KL-minimax estimation of the full distribution in the sparse limit N/m → 0 (Paninski, 2005b) • Continuous (unbinned) entropy estimators: a similar result holds for kernel density estimates • Sparse testing for uniformity: much easier than estimation (N ≫ m^{1/2} suffices) • Open questions: 1/2 < α < 1? Other functionals?
Part 2: Adaptive optimal design of experiments. Assume: • a parametric model p_θ(y | x) on outputs y given inputs x • a prior distribution p(θ) on a finite-dimensional model space. Goal: estimate θ from experimental data. Usual approach: draw stimuli i.i.d. from a fixed p(x). Adaptive approach: choose p(x) on each trial to maximize I(θ; y | x) (e.g. "staircase" methods).
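A sketch of one adaptive trial for a one-dimensional threshold model of the kind simulated on the next slide (the logistic link, grid discretization, and function names are my assumptions for illustration): score each candidate stimulus by I(θ; y | x) under the current posterior, present the best one, then update by Bayes' rule.

```python
import numpy as np

def binary_entropy(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

def next_stimulus(theta_grid, posterior, x_candidates, a=0.2):
    """Pick the stimulus x maximizing I(theta; y | x) for a binary-response model
    p(y=1 | x, theta) = logistic((x - theta)/a)."""
    best_x, best_info = None, -np.inf
    for x in x_candidates:
        q = 1.0 / (1.0 + np.exp(-(x - theta_grid) / a))   # p(y=1 | x, theta)
        marg = float(posterior @ q)                       # p(y=1 | x)
        info = binary_entropy(marg) - float(posterior @ binary_entropy(q))
        if info > best_info:
            best_x, best_info = x, info
    return best_x

def bayes_update(theta_grid, posterior, x, y, a=0.2):
    q = 1.0 / (1.0 + np.exp(-(x - theta_grid) / a))
    post = posterior * (q if y == 1 else 1 - q)
    return post / post.sum()
```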
Snapshot: one-dimensional simulation. [Figure: p(y = 1 | x, θ_0) as a function of x; the information I(y; θ | x) as a function of x; and the posteriors p(θ) at trial 100 under optimized vs. i.i.d. stimulus selection.]
Asymptotic result. Under regularity conditions, a posterior CLT holds (Paninski, 2005a):
$$p_N\big(\sqrt{N}\,(\theta - \theta_0)\big) \to \mathcal{N}(\mu_N, \sigma^2); \qquad \mu_N \sim \mathcal{N}(0, \sigma^2)$$
• (σ²_iid)^{−1} = E_x(I_x(θ_0))
• (σ²_info)^{−1} = argmax_{C ∈ co(I_x(θ_0))} log|C|
⇒ σ²_iid > σ²_info unless I_x(θ_0) is constant in x.
Here co(I_x(θ_0)) is the convex closure (over x) of the Fisher information matrices I_x(θ_0). (log|C| is strictly concave, so the maximum is unique.)
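The σ²_info characterization is a D-optimal design problem: maximize log|Σ_x w_x I_x(θ_0)| over mixture weights w on the candidate stimuli. A sketch of a standard multiplicative-update scheme for this problem (function names and the rank-one example are my own, not from the talk):

```python
import numpy as np

def d_optimal_weights(fisher_mats, n_iter=500):
    """Multiplicative-update sketch for maximizing log|sum_x w_x I_x| over design
    weights w.  fisher_mats: array (K, d, d) of per-stimulus Fisher information matrices."""
    K, d, _ = fisher_mats.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        M = np.einsum('k,kij->ij', w, fisher_mats)           # information of the mixture design
        Minv = np.linalg.inv(M)
        w *= np.einsum('ij,kji->k', Minv, fisher_mats) / d   # w_x <- w_x * tr(M^-1 I_x) / d
        w /= w.sum()
    return w

# example: rank-one Fisher matrices I_x = x x^T for a handful of candidate stimuli
rng = np.random.default_rng(0)
xs = rng.normal(size=(30, 3))
w = d_optimal_weights(np.einsum('ki,kj->kij', xs, xs))
```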
Illustration of theorem. [Figure: posterior samples of θ, the posterior mean E(p), the posterior width σ(p), and the posterior probability P(θ_0), each plotted against trial number (1 to 100).]
Technical details. Stronger regularity conditions than usual are needed to prevent "obsessive" sampling and ensure consistency. Significant complication: exponential decay of the posteriors p_N off of neighborhoods of θ_0 does not necessarily hold.
Psychometric example • stimuli x one-dimensional: intensity • responses y binary: detect / no detect, p(1 | x, θ) = f((x − θ)/a) • scale parameter a (assumed known) • want to learn the threshold parameter θ as quickly as possible. [Figure: p(1 | x, θ) as a sigmoidal function of x, centered at the threshold θ.]
Psychometric example: results • variance-minimizing and info-theoretic methods are asymptotically the same • there is just one function f* for which σ_iid = σ_opt; for any other f, σ_iid > σ_opt, where
$$I_x(\theta) = \frac{(\dot f_{a,\theta})^2}{f_{a,\theta}(1 - f_{a,\theta})}$$
• f* solves
$$\frac{\dot f_{a,\theta}}{\sqrt{f_{a,\theta}\,(1 - f_{a,\theta})}} = c \quad\Longrightarrow\quad f^*(t) = \frac{\sin(ct) + 1}{2}$$
• σ²_iid / σ²_opt ∼ 1/a for small a
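A quick check (mine, not on the slide) that this f* makes the Fisher information constant in the stimulus, which by the theorem above is exactly the condition under which i.i.d. sampling loses nothing relative to optimized sampling:
$$
f^*(t) = \frac{\sin(ct) + 1}{2}, \quad
\dot f^*(t) = \frac{c \cos(ct)}{2}, \quad
f^*(1 - f^*) = \frac{(1 + \sin ct)(1 - \sin ct)}{4} = \frac{\cos^2(ct)}{4}
\;\Longrightarrow\;
\frac{(\dot f^*)^2}{f^*(1 - f^*)} = c^2 .
$$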
Computing the optimal stimulus. Simple Poisson regression model for neural data:
$$y_i \sim \mathrm{Poiss}(\lambda_i), \qquad \lambda_i \mid x_i, \theta = f(\theta \cdot x_i)$$
Goal: learn θ in as few trials as possible. Problems:
• θ is very high-dimensional; difficult to update p(θ | x_i, y_i) and to compute I(θ; y | x)
• x is very high-dimensional; difficult to optimize I(θ; y | x)
— Joint work with J. Lewi and R. Butera, Georgia Tech (Lewi et al., 2006)
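A minimal simulation sketch of this model (my own, with the canonical exponential nonlinearity f = exp as an illustrative assumption; the argument on the next slide only requires f to be convex and log-concave):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_trials = 50, 200
theta = rng.normal(scale=0.1, size=d)       # unknown weight vector to be learned
X = rng.normal(size=(n_trials, d))          # stimuli (drawn i.i.d. here, just for illustration)
rates = np.exp(X @ theta)                   # lambda_i = f(theta . x_i) with f = exp
y = rng.poisson(rates)                      # observed spike counts y_i ~ Poiss(lambda_i)
```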
Efficient updating. Idea: Laplace approximation
$$p(\theta \mid \{x_i, y_i\}_{i \le N}) \approx \mathcal{N}(\mu_N, C_N)$$
Justification: • posterior CLT • f convex and log-concave ⇒ log-likelihood concave in θ ⇒ log-posterior is also concave
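A sketch of one posterior update under the Laplace approximation for the Poisson model above, again assuming f = exp (names are mine; this full-dimensional Newton step is only for clarity, since the MAP shift lies along C x and can be found by a one-dimensional search):

```python
import numpy as np

def laplace_update(mu, C, x, y, n_newton=10):
    """One trial of the Laplace-approximation posterior update,
    p(theta | data) ~ N(mu, C), for y ~ Poisson(exp(theta . x))."""
    Cinv = np.linalg.inv(C)
    theta = mu.copy()
    for _ in range(n_newton):                       # Newton ascent on the concave log-posterior
        u = float(x @ theta)
        grad = -Cinv @ (theta - mu) + (y - np.exp(u)) * x
        H = -Cinv - np.exp(u) * np.outer(x, x)
        theta = theta - np.linalg.solve(H, grad)
    lam = np.exp(float(x @ theta))
    Cx = C @ x                                      # rank-one (Sherman-Morrison) covariance update
    C_new = C - np.outer(Cx, Cx) * (lam / (1.0 + lam * float(x @ Cx)))
    return theta, C_new                             # new Gaussian approximation (mu, C)
```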