On-line estimation with the multivariate Gaussian distribution

Sanjoy Dasgupta and Daniel Hsu
UC San Diego
Outline

1. On-line density estimation and previous work
2. On-line multivariate Gaussian density estimation
3. Regret analysis of follow-the-leader
4. Open problem
On-line density estimation

Learning protocol. For trial t = 1, 2, ...:
1. Learner chooses parameter θ_t ∈ Θ
2. Nature chooses instance x_t ∈ X
3. Learner incurs loss ℓ_t(θ_t) = ℓ(θ_t, x_t)

In on-line (parametric) density estimation, ℓ(θ, x) = −log p(x | θ), where {p(·|θ) : θ ∈ Θ} is a parametric family of densities.
On-line density estimation: loss and regret

    L_T = Σ_{t=1}^T ℓ_t(θ_t)            (total loss of the learner after T trials)
    L*_T = inf_{θ ∈ Θ} Σ_{t=1}^T ℓ_t(θ)  (best-in-hindsight fixed-parameter loss after T trials)
    R_T = L_T − L*_T                     (regret of the learner after T trials)

Goal: on-line density estimation strategies with regret bounds. As usual, no stochastic assumption about how Nature generates data.
Previous work in on-line density estimation

Some on-line learning literature:
• Freund, 1996: Bernoulli (weighted coins)
• Azoury & Warmuth, 1999: some exponential families, including those in the next two rows
• Takimoto & Warmuth, 2000a: fixed-covariance Gaussian
• Takimoto & Warmuth, 2000b: some one-dimensional exponential families

• Straightforward on-line parameter estimation yields O(log T) regret; subtle variations can improve the constants.
  – In the case of the fixed-covariance Gaussian, a recursively-defined update rule yields the minimax strategy.
• Often, simple random sequences yield lower bounds.
On-line Gaussian density estimation

For simplicity, consider just the one-dimensional case; the results generalize to the multivariate case with linear algebra.

• Parameter space: Θ = ℝ × ℝ_{>0} (mean and variance)
  – Learner chooses (µ_t, σ²_t) in trial t
• Data space: X = ℝ
• Loss function:
      ℓ((µ, σ²), x) = (x − µ)² / (2σ²) + (1/2) ln σ²

Can view this as squared loss on the prediction µ with a "confidence" parameter σ².
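In code, the per-trial loss is a direct transcription of the formula above (like the slide, it omits the constant (1/2) ln 2π of the full Gaussian negative log-likelihood):

```python
import math

def gaussian_loss(mu, var, x):
    """Per-trial loss (x - mu)^2 / (2 var) + (1/2) ln var,
    i.e. the Gaussian negative log-likelihood up to a constant."""
    return (x - mu) ** 2 / (2 * var) + 0.5 * math.log(var)

# At x == mu with var == 1 both terms vanish, so the loss is 0.
print(gaussian_loss(0.0, 1.0, 0.0))  # 0.0
```

Shrinking σ² rewards an accurate prediction (smaller ln σ²) but amplifies the squared-error penalty, which is the tension the degenerate cases below exploit.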
Main results

• The standard formulation suffers from degenerate cases, similar to problems in maximum-likelihood estimation of Gaussian distributions.
• Instead, consider an alternative formulation with a hallucinated zeroth trial.
• We study the strategy that chooses the sample mean and variance of the previous instances (follow-the-leader). The trivial regret bound is O(T²).

1. For any p > 1, there are sequences (x_t) for which the regret is Ω(T^{1−1/p}). Similar for any sublinear function of T.
2. Linear bound on the regret for all sequences.
3. For any sequence, the average regret tends to 0; i.e. for any sequence,
       lim sup_{T→∞} R_T / T ≤ 0.
Problems with the standard formulation: unbounded instances

• The learner's means have |µ_t| < ∞.
• Nature can choose x_t so that |x_t − µ_t| is arbitrarily large.
• ∴ Regret is unbounded.

Fix: assume all |x_t| ≤ r for some r ≥ 0 (same as in the fixed-variance case).
Problems with the standard formulation: non-varying instances

• If x_1 = x_2 = ... = x_T, then L*_T = lim_{σ²→0} (T/2) ln σ² = −∞.
• ∴ Regret is unbounded.

Fix: force some variance by hallucinating a zeroth trial, and include it in the loss and regret quantities:
    ℓ_0(µ, σ²) = (1/2) Σ_{x ∈ {±s}} [ (x − µ)² / (2σ²) + (1/2) ln σ² ]   for some constant s > 0.

Consequence: L*_T > −∞ for all T.
(Alternative: compare to the best-in-hindsight loss plus a Bregman divergence to the initial parameter.)
Follow-the-leader

Follow-the-leader: use the parameter setting that minimizes the total loss over all previous trials. This is the "natural" strategy: choose the sample mean and variance of the previously seen instances.

    (µ_1, σ²_1) = (µ_0, σ²_0) = (0, s²)   for some s² > 0, due to trial zero
    µ_{t+1} = (1/(t+1)) Σ_{i=1}^t x_i    and    σ²_{t+1} = (1/(t+1)) ( s² + Σ_{i=1}^t x_i² ) − µ²_{t+1}

• No randomization / perturbation (cf. follow-the-perturbed-leader).
• Similar to algorithms proposed in Azoury & Warmuth (1999) for exponential families, which enjoyed O(log T) regret bounds. In our setting, it looks like O(T²) bounds (without further assumptions).
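The update above is cheap to maintain with running sums; here is a minimal sketch (my own transcription, seeding the statistics with the hallucinated zeroth trial):

```python
class FollowTheLeader:
    """Play the sample mean and variance of everything seen so far,
    with the hallucinated zeroth trial contributing mean 0 and
    second moment s^2."""

    def __init__(self, s2=1.0):
        self.n = 1            # trial count, including trial zero
        self.sum_x = 0.0      # running sum of instances
        self.sum_x2 = s2      # running sum of squares, seeded with s^2

    def choose(self):
        mu = self.sum_x / self.n
        var = self.sum_x2 / self.n - mu ** 2
        return mu, var

    def update(self, x):
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

ftl = FollowTheLeader(s2=1.0)
print(ftl.choose())  # (0.0, 1.0): trial zero alone gives (mu_1, var_1) = (0, s^2)
```

Note that with all-equal instances the variance decays like 1/t rather than collapsing to zero, which is exactly what the zeroth-trial fix buys.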
#1: Regret lower bound for follow-the-leader

Finite sequence: s = 1; the sequence is x_1 = ... = x_{T−1} = 0 and x_T = 1.
Learner's parameters: µ_t ≡ 0, σ²_t = 1/t.
Final regret: R_T = Ω(1/σ²_T) = Ω(T).

[Plots: σ²_t decaying like 1/t, and R_t jumping at the final trial t = T = 20.]
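A small self-contained simulation (my own; the closed-form expression for the hindsight optimum follows from minimizing the total loss, including trial zero, over (µ, σ²)) reproduces the effect: the lone spike at the end arrives when the variance has shrunk to 1/T, so the final trial alone costs Θ(T).

```python
import math

def loss(mu, var, x):
    # per-trial loss: (x - mu)^2 / (2 var) + (1/2) ln var
    return (x - mu) ** 2 / (2 * var) + 0.5 * math.log(var)

def ftl_regret(xs, s2=1.0):
    """Regret of follow-the-leader on xs, counting the hallucinated
    zeroth trial (points +-sqrt(s2), weight 1/2 each) in both the
    learner's loss and the best-in-hindsight loss."""
    s = math.sqrt(s2)
    n, sx, sx2 = 1, 0.0, s2                    # seeded with trial zero
    total = 0.5 * (loss(0.0, s2, s) + loss(0.0, s2, -s))  # ell_0 at (0, s^2)
    for x in xs:
        mu, var = sx / n, sx2 / n - (sx / n) ** 2
        total += loss(mu, var, x)
        n, sx, sx2 = n + 1, sx + x, sx2 + x * x
    # Hindsight optimum = weighted sample mean/variance (trial zero included);
    # at the optimum the total loss simplifies to (n/2)(1 + ln var*).
    var_star = sx2 / n - (sx / n) ** 2
    return total - 0.5 * n * (1.0 + math.log(var_star))

# The slide's sequence: T-1 zeros followed by a single 1.
for T in (20, 40, 80):
    print(T, ftl_regret([0.0] * (T - 1) + [1.0]))  # grows roughly linearly in T
```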
#1: Regret lower bound for follow-the-leader

Infinite sequence: iterate the finite sequence; the regret comes arbitrarily close to linear.

[Plot: R_t vs. t, growing in ever-longer near-linear bursts.]

Let f : ℕ → ℕ be increasing. There is a sequence such that for any T in the range of f,
    R_T ≥ C · (T + 1) / (f^{−1}(T) + 1)².
#2: Regret upper bound

Can derive an expression for the regret of follow-the-leader either directly or, say, via the Bregman divergence formulation (Azoury & Warmuth, 1999):

    R_T ≤ Σ_{t=1}^T (1/(4(t+1))) ( (x_t − µ_t)²/σ²_t + 1 )² + (1/4) ln(T+1) + Θ(1)

(the bound comes from a second-order Taylor approximation). Since σ²_t ≥ s²/t (where s² is the trial-zero variance), the squared term is O(t²), each summand is O(t), and the sum gives the trivial bound of O(T²).

Problem: small variances ... (but they can't always be small).
#2: Regret upper bound

Rewrite the variance parameter:
    σ²_t = (1/t) ( s² + Σ_{k=1}^{t−1} ∆_k )   where   ∆_k = (k/(k+1)) (x_k − µ_k)².
(∆_k is ≈ the squared distance of the new instance to the average of the old instances.)

∴ Use a potential function argument plus algebra:

    R_T ≤ (1/4) Σ_{t=1}^T (t+1) ( ∆_t / (s² + Σ_{k=1}^{t−1} ∆_k) )² + O(log T)
        ≤ (C/4) · (T+1) ( 1 − s² / (s² + Σ_{k=1}^T ∆_k) ) + O(log T),

i.e. a linear bound.
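The rewriting of the variance is easy to check numerically; the quick sanity check below (my own) runs the follow-the-leader statistics on random data and asserts the identity at every trial:

```python
import math
import random

random.seed(0)
s2 = 1.0
xs = [random.gauss(0.0, 1.0) for _ in range(50)]

n, sx, sx2 = 1, 0.0, s2   # follow-the-leader statistics, seeded with trial zero
deltas = []               # Delta_k = k/(k+1) * (x_k - mu_k)^2
for x in xs:
    mu, var = sx / n, sx2 / n - (sx / n) ** 2
    # the slide's identity at the current trial t = n:
    # sigma^2_t == (s^2 + sum_{k<t} Delta_k) / t
    assert abs(var - (s2 + sum(deltas)) / n) < 1e-9
    deltas.append(n / (n + 1) * (x - mu) ** 2)
    n, sx, sx2 = n + 1, sx + x, sx2 + x * x

print("decomposition verified for", len(xs), "trials")
```

The identity is just the usual incremental-variance update: adding x_t to t points multiplies the count by (t+1)/t and adds ∆_t to the (unnormalized) sum of squared deviations.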
Two regimes for follow-the-leader

1. The sequences achieving the lower bounds have σ²_t → 0: the regret can be arbitrarily close to linear.
2. If, instead, lim inf σ²_t > 0, then there is some T_0 such that for all T > T_0,
       R_T ≤ c · T_0 + O( (log T) / T_0 ),
   i.e. eventually the average regret R_T / T tends to zero at rate O( (log T) / T ).
#3: lim sup average regret bound

Actually, even when σ²_t → 0, the average regret tends to zero. Formally, for any sequence,
    lim sup_{T→∞} R_T / T ≤ 0.

Proof idea: show lim sup R_T / T ≤ ε for any ε > 0.
• Two types of trials, depending on ∆_t ≈ (x_t − µ_t)²:
  1. ∆_t small → contributes ≪ ε to the regret.
  2. ∆_t large → causes the variance to rise substantially; behavior is more like the second regime.
Multivariate Gaussians

• For d-dimensional Gaussians, we essentially get an extra factor of d in front of all the bounds.
• "Progress" in the covariance happens one dimension at a time, so the lower bounds can also exploit each dimension (almost) independently.
• The potential function for the upper bound is tr(Σ^{−1}).
Open problem

• This work: analysis of follow-the-leader for on-line Gaussian density estimation with arbitrary covariance.
• Still open (from Takimoto & Warmuth, 2000a): what is the minimax strategy?
Thanks!

Authors supported by:
• NSF grant IIS-0347646
• Engineering Institute (Los Alamos National Laboratory / UC San Diego) graduate fellowship
Incremental off-line algorithm

Derived by Azoury & Warmuth (1999) for general exponential families. Update rule: choose an initial parameter (µ_1, σ²_1) ∈ ℝ × ℝ_{>0}; then

    (µ_{t+1}, σ²_{t+1}) = argmin_{(µ, σ²)} [ η_1^{−1} ∆((µ, σ²), (µ_1, σ²_1)) + Σ_{i=1}^t ℓ_i(µ, σ²) ]

where ∆(·, ·) is the Bregman divergence for Gaussians,

    ∆((µ, σ²), (µ̃, σ̃²)) = (1/2) [ (µ − µ̃)² / σ² + σ̃²/σ² − ln(σ̃²/σ²) − 1 ],

and η_1^{−1} is a parameter (e.g. η_1^{−1} = 1).
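With the loss above, this argmin has a closed form. The sketch below is my own derivation from the first-order conditions (not a formula stated on the slide), so treat it as an illustration: the divergence term acts like η_1^{−1} extra observations centered at µ_1 with spread σ²_1.

```python
def aw_update(xs, mu1=0.0, var1=1.0, eta_inv=1.0):
    """Minimize eta_inv * Delta((mu, var), (mu1, var1)) + sum_i ell_i(mu, var)
    in closed form (my own derivation from setting the mu- and
    var-derivatives to zero)."""
    t = len(xs)
    # d/dmu = 0  =>  mu is a weighted average of mu1 and the data
    mu = (eta_inv * mu1 + sum(xs)) / (eta_inv + t)
    # d/dvar = 0  =>  var averages the squared deviations, with the
    # divergence contributing (mu - mu1)^2 + var1 at weight eta_inv
    var = (eta_inv * ((mu - mu1) ** 2 + var1)
           + sum((x - mu) ** 2 for x in xs)) / (eta_inv + t)
    return mu, var

# With no data, the update returns the initial parameter unchanged.
print(aw_update([]))  # (0.0, 1.0)
```

With η_1^{−1} = 1 this behaves much like follow-the-leader with the hallucinated zeroth trial, which is why the slide calls the two strategies similar.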