Optimal Information Passing: How Much vs. How Fast
Abbas Kazemipour
MAST Group Meeting
University of Maryland, College Park
kaazemi@umd.edu
March 24, 2016
Overview
1. Introduction
   Discrete Hawkes Process as a Markov Chain
2. Part 2: Stationary Distributions
Discrete Hawkes Process as a Markov Chain
1. Discrete Hawkes process:
   x_k = \mathrm{Ber}\big(\phi(\theta^\top x_{k-p}^{k-1})\big),   (1)
2. The history components x_{k-p}^{k-1}, binary vectors of length p, form a Markov chain (see the simulation sketch below).
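A minimal Python sketch of model (1), simulating the spike train and the length-p history state. The logistic link phi, the value of theta, and p = 3 are illustrative assumptions, not the talk's settings.

import numpy as np

rng = np.random.default_rng(0)

p = 3                                        # history length (illustrative)
theta = np.array([0.8, -0.4, 0.2])           # hypothetical parameter vector
phi = lambda u: 1.0 / (1.0 + np.exp(-u))     # assumed logistic link

n = 20
x = np.zeros(n + p, dtype=int)               # prepend p zeros as the initial history
for k in range(p, n + p):
    history = x[k - p:k][::-1]               # (x_{k-1}, ..., x_{k-p})
    rate = phi(theta @ history)
    x[k] = rng.binomial(1, rate)             # x_k ~ Ber(phi(theta^T x_{k-p}^{k-1}))

# The Markov-chain state is the length-p history vector itself.
states = [tuple(x[k - p:k][::-1]) for k in range(p, n + p)]
print(x[p:], states[:5])

Since each state is a binary vector of length p, the chain lives on 2^p states.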
Simulation: p = 100, n = 500, s = 3, and \gamma_n = 0.1
1. Each spike train under this model corresponds to a walk across the states.
2. The corresponding likelihood is the product of the weights of the edges visited along the walk (illustrated in the sketch below).
3. Figure: State space for p = 3.
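The following Python sketch (with the same illustrative phi and theta as above) makes bullet 2 concrete for p = 3: it walks the history chain driven by a short spike train and accumulates the likelihood as the product of the edge weights it visits.

import numpy as np

p = 3
theta = np.array([0.8, -0.4, 0.2])           # illustrative, as before
phi = lambda u: 1.0 / (1.0 + np.exp(-u))

def edge_weight(state, spike):
    # Weight of the edge leaving state s = (x_{k-1}, ..., x_{k-p}) when x_k = spike.
    rate = phi(theta @ np.array(state))
    return rate if spike == 1 else 1.0 - rate

spikes = [1, 0, 1, 1, 0]                     # a short spike train
state = (0,) * p                             # start from the all-zero history
likelihood = 1.0
for x_k in spikes:
    likelihood *= edge_weight(state, x_k)
    state = (x_k,) + state[:-1]              # shift the new sample into the history

print(likelihood)                            # product of the visited edge weights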
Introduction
1. We observe n consecutive snapshots of length p (a total of n + p - 1 samples): \{x_k\}_{k=-p+1}^{n}.
2. x_1^n can be approximated by a sequence of Bernoulli random variables with rates \lambda_1^n.
3. What is a good optimization problem for estimating \theta? Answer: \ell_1-regularized ML.
4. How does a suitable n compare to p and s order-wise? Answer: n = O(p^{2/3}).
5. How does such an estimator perform compared to traditional estimation methods? Answer: much better! But why?
Preliminaries
1. Consider the discrete Hawkes process model
   \lambda_i = \mu + \theta' x_{i-p}^{i-1},   (2)
2. Negative (conditional) log-likelihood:
   L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left( x_i \log \lambda_i - \lambda_i \right).   (3)
3. Bernoulli approximation (see the numerical sketch below):
   L(\theta) \approx -\frac{1}{n} \sum_{i=1}^{n} \left( x_i \log \lambda_i + (1 - x_i) \log(1 - \lambda_i) \right) = h(x_1, x_2, \ldots, x_n).   (4)
4. The negative log-likelihood equals the joint entropy (information) of the spiking.
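A numerical sketch of Eqs. (3) and (4) under the linear model (2); the values of mu and theta and the clipping constant are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
eps = 1e-12

p, n = 3, 200
mu, theta = 0.1, np.array([0.3, 0.1, 0.05])  # illustrative values

# Simulate spikes with rates lambda_i = mu + theta' x_{i-p}^{i-1}, clipped to (0, 1).
x = np.zeros(n + p)
lam = np.zeros(n)
for i in range(n):
    lam[i] = np.clip(mu + theta @ x[i:i + p][::-1], eps, 1 - eps)
    x[i + p] = rng.binomial(1, lam[i])
spikes = x[p:]

# Eq. (3): point-process negative log-likelihood.
L_pp = -np.mean(spikes * np.log(lam) - lam)

# Eq. (4): Bernoulli approximation, i.e., the empirical entropy rate of the spiking.
L_ber = -np.mean(spikes * np.log(lam) + (1 - spikes) * np.log(1 - lam))

print(L_pp, L_ber)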
ML vs. \ell_1-Regularization
Maximum likelihood estimation:
   \hat{\theta}_{\mathrm{ML}} = \arg\min_{\theta \in \Theta} L(\theta),   (5)
1. Maximizes the joint entropy of spiking, so that the maximum amount of information is transferred.
\ell_1-regularized estimate (see the proximal-gradient sketch below):
   \hat{\theta}_{\mathrm{sp}} := \arg\min_{\theta \in \Theta} L(\theta) + \gamma_n \|\theta\|_1.   (6)
2. What does regularization do apart from promoting sparsity?
3. To show: regularization determines the speed of data transfer.
4. A battle between speed and amount of information.
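A minimal proximal-gradient (ISTA) sketch of problem (6), using the Bernoulli approximation (4) as the loss. This is one standard way to solve such an \ell_1-regularized program, not necessarily the solver used in the talk; the step size, gamma_n, and the simulated data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
eps = 1e-3

# Simulate data from model (2) with a sparse ground-truth theta (s = 3 nonzeros).
p, n = 20, 2000
mu = 0.1
theta_true = np.zeros(p)
theta_true[:3] = [0.3, 0.2, 0.1]

x = np.zeros(n + p)
H = np.zeros((n, p))                         # row i holds the history x_{i-p}^{i-1}
for i in range(n):
    H[i] = x[i:i + p][::-1]
    lam = np.clip(mu + theta_true @ H[i], eps, 1 - eps)
    x[i + p] = rng.binomial(1, lam)
y = x[p:]

def grad(theta):
    # Gradient of the Bernoulli-approximation loss (4) under the linear model (2).
    lam = np.clip(mu + H @ theta, eps, 1 - eps)
    return -(H.T @ (y / lam - (1 - y) / (1 - lam))) / n

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

gamma_n, step = 0.05, 0.01                   # illustrative regularization and step size
theta_hat = np.zeros(p)
for _ in range(3000):
    theta_hat = soft_threshold(theta_hat - step * grad(theta_hat), step * gamma_n)

print(np.round(theta_hat, 3))                # sparse estimate; typically concentrated near the true support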
Second Largest Eigenvalue Modulus and Its Significance
1. The Markov chain defined by the history components of the Hawkes process has a stationary distribution \pi.
2. The chain converges to \pi irrespective of the initial state.
3. How fast this happens determines how fast the data is transferred.
4. The transition probability matrix is a function of \theta.
5. Perron-Frobenius theorem: the transition matrix has a unique largest eigenvalue \lambda_1 = 1.
6. The second largest eigenvalue modulus (SLEM) determines the speed of convergence (see the sketch below):
   \lambda = \max\{\lambda_2, -\lambda_n\}, \qquad \max_{i \in S} \|P^t(i, \cdot) - \pi(\cdot)\| \sim C \lambda^t \quad \text{as } t \to \infty.
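A small sketch that builds the 2^p x 2^p transition matrix of the history chain for p = 3 (same illustrative phi and theta as before), computes the SLEM, and checks the geometric decay of the total-variation distance to pi. Because this chain need not be reversible, the code uses the modulus-based SLEM, which reduces to max{lambda_2, -lambda_n} in the reversible case.

import numpy as np
from itertools import product

p = 3
theta = np.array([0.8, -0.4, 0.2])           # illustrative, as before
phi = lambda u: 1.0 / (1.0 + np.exp(-u))

states = list(product([0, 1], repeat=p))     # state = (x_{k-1}, ..., x_{k-p})
index = {s: i for i, s in enumerate(states)}

P = np.zeros((2 ** p, 2 ** p))
for s in states:
    rate = phi(theta @ np.array(s))
    for spike, prob in ((1, rate), (0, 1.0 - rate)):
        P[index[s], index[(spike,) + s[:-1]]] = prob

# Perron-Frobenius: the largest eigenvalue modulus is 1; the next one is the SLEM.
mags = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
slem = mags[1]

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.abs(v[:, np.argmin(np.abs(w - 1))].real)
pi /= pi.sum()

# Worst-case total-variation distance from pi decays roughly like C * slem^t.
for t in (1, 5, 10, 20):
    Pt = np.linalg.matrix_power(P, t)
    tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
    print(t, tv, slem ** t)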