Fast and efficient estimation of the geometric median in high dimension: a stochastic gradient approach

Hervé Cardot, Institut de Mathématiques de Bourgogne, Université de Bourgogne, with Peggy Cénac (Univ. Bourgogne) and Mohamed Chaouch (EDF). Herve.Cardot@u-bourgogne.fr

Compstat, August 2010, Paris
The median in $\mathbb{R}$

A "central" notion in statistics since Laplace. For a random variable $X$ taking values in $\mathbb{R}$, "the" median is a (possibly non-unique) value $m$ such that $P(X \le m) = 0.5$.

Another characterization of the median $m$:
\[
E(\mathrm{sign}(X - m)) = \int \mathrm{sign}(X(\omega) - m)\, dP(\omega) = 0 .
\]
Since $\mathrm{sign}(X - m) = \dfrac{X - m}{|X - m|}$, we also have the following characterization:
\[
m = \arg\min_{z \in \mathbb{R}} E|X - z| .
\]
• The quantile of order $\alpha$, for $\alpha \in (0, 1)$, is defined by $P(X \le q_\alpha) = \alpha$. Equivalently,
\[
q_\alpha = \arg\min_{z \in \mathbb{R}} E\big[\, |X - z| + (2\alpha - 1)(X - z) \,\big] .
\]
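Why this criterion yields the $\alpha$-quantile: the integrand is just twice the usual check (pinball) loss, a standard identity worth recording (a verification added here, not on the original slide):
\[
|x - z| + (2\alpha - 1)(x - z)
= \begin{cases}
2\alpha\,(x - z), & x \ge z,\\
2(1 - \alpha)\,(z - x), & x < z,
\end{cases}
\;=\; 2\big(\alpha\,(x - z)_+ + (1 - \alpha)\,(z - x)_+\big),
\]
and the minimizer of the expected check loss is the $\alpha$-quantile.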
The geometric median in $\mathbb{R}^d$ (or in a separable Hilbert space $H$)

In the Euclidean space $\mathbb{R}^d$ equipped with its usual norm $\|\cdot\|$, a natural generalization of the median is
\[
m := \arg\min_{z \in H} E\|X - z\| ,
\]
called the geometric or spatial median (Haldane, 1948).

Property (Kemperman, 1987). If the space $H$ is strictly convex, the geometric median $m$ is unique, unless the support of $X$ is contained in a one-dimensional subspace.

• Examples of strictly convex spaces:
- Euclidean spaces $\mathbb{R}^d$, when $d > 1$,
- separable Hilbert spaces $H$,
- some Banach spaces ($L^p$, $1 < p < \infty$).

• Support condition: $\exists\, (u_1, u_2) \in H \times H$, $\langle u_1, u_2 \rangle = 0$, $\mathrm{Var}(\langle u_1, X \rangle) > 0$ and $\mathrm{Var}(\langle u_2, X \rangle) > 0$.
Characterization of the geometric median

We suppose there are no atoms ($\forall x \in H$, $P(X = x) = 0$). Then $G : H \to \mathbb{R}$ defined by $G(x) = E\|X - x\|$ is strictly convex and Fréchet differentiable, with gradient
\[
\Phi(x) := \nabla G_x = -\,E\left[\frac{X - x}{\|X - x\|}\right].
\]
The median $m$ is characterized by $\nabla G_m = 0$.

$G$ has a second-order Fréchet derivative at the point $m$, $\Gamma_m : H \to H$,
\[
\Gamma_m := E\left[\frac{1}{\|X - m\|}\left(I_H - \frac{(X - m) \otimes (X - m)}{\|X - m\|^2}\right)\right],
\]
where $I_H$ is the identity operator and $u \otimes v = \langle u, \cdot\,\rangle\, v$ for $(u, v) \in H^2$.

If $E\|X - m\|^{-1} < \infty$, then $\Gamma_m$ is a bounded and strictly positive operator: there are constants $\infty > E\|X - m\|^{-1} = \lambda_M > \lambda_m > 0$ such that
\[
\lambda_M \|u\|^2 \ge \langle \Gamma_m u, u \rangle \ge \lambda_m \|u\|^2, \quad \forall u \in H .
\]
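Where the gradient formula comes from (a standard one-line computation, added here for completeness): for fixed $X \ne x$ one differentiates the norm,
\[
\nabla_x \|X - x\| = -\,\frac{X - x}{\|X - x\|},
\]
and the no-atom assumption lets one differentiate under the expectation, giving $\Phi(x) = -E\big[(X - x)/\|X - x\|\big]$.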
Robustness: the influence function

Consider a distribution $P_0$ contaminated by a point-mass distribution at $z \in H$,
\[
P_{\epsilon, z} = (1 - \epsilon)\, P_0 + \epsilon\, \delta_z .
\]
The influence function
\[
\mathrm{IF}_m(z) = \lim_{\epsilon \to 0} \frac{m(P_{\epsilon, z}) - m(P_0)}{\epsilon}
\]
is a measure of the sensitivity of the median $m$ to a small perturbation of the target distribution.

Property.
\[
\mathrm{IF}_m(z) = \Gamma_m^{-1}\, \frac{z - m}{\|z - m\|},
\]
and the gross error sensitivity is bounded as follows:
\[
\sup\{\, \|\mathrm{IF}_m(z)\| ,\ z \in H \,\} \le \frac{1}{\lambda_m} .
\]
• The gross error sensitivity is not bounded for the mean.
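A heuristic derivation of this property (a sketch added here, not the proof given in the talk): the median $m_\epsilon$ of $P_{\epsilon, z}$ solves
\[
(1 - \epsilon)\, E_{P_0}\!\left[\frac{X - m_\epsilon}{\|X - m_\epsilon\|}\right] + \epsilon\, \frac{z - m_\epsilon}{\|z - m_\epsilon\|} = 0 ;
\]
differentiating at $\epsilon = 0$ and using $\Phi(m) = 0$ gives $-\Gamma_m\, \mathrm{IF}_m(z) + (z - m)/\|z - m\| = 0$, hence the stated form, and the bound follows from $\langle \Gamma_m u, u \rangle \ge \lambda_m \|u\|^2$.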
Estimation in $\mathbb{R}^d$

Suppose we have a sample of $n$ independent realizations, $X_1, \ldots, X_n$. The usual estimator of $m$ (Gower, 1974; Vardi & Zhang, 2000; Gervini, 2008) is characterized by
\[
\sum_{i=1}^{n} \frac{X_i - \widehat{m}}{\|X_i - \widehat{m}\|} = 0 .
\]
Iterative and rather slow procedures (Newton-Raphson or Weiszfeld) are needed to get a numerical solution:
\[
\sum_{i=1}^{n} \frac{X_i - \widehat{m}}{\|X_i - \widehat{m}\|} = 0
\;\Rightarrow\;
\widehat{m}_{e+1} = \sum_{i=1}^{n} p_i(\widehat{m}_e)\, X_i ,
\]
where, for the Weiszfeld iteration, $p_i(\widehat{m}_e) = \|X_i - \widehat{m}_e\|^{-1} / \sum_{j=1}^{n} \|X_j - \widehat{m}_e\|^{-1}$.

Property (Haberman, 1989; Niemiro, 1992). If $H = \mathbb{R}^d$, when $n \to +\infty$,
\[
\sqrt{n}\, (\widehat{m}_n - m) \rightsquigarrow N\big(0,\ \Gamma_m^{-1}\, \mathrm{Var}(S(X - m))\, \Gamma_m^{-1}\big)
\]
with $S(u) = u / \|u\|$, $u \in \mathbb{R}^d$.
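For concreteness, a minimal NumPy sketch of the Weiszfeld iteration just described (the starting point, tolerance, and zero-distance guard are illustrative choices, not part of the slide):

```python
import numpy as np

def weiszfeld_median(X, tol=1e-8, max_iter=500):
    """Geometric median of the rows of X (an n x d array) by Weiszfeld iteration."""
    m = X.mean(axis=0)                        # illustrative starting point
    for _ in range(max_iter):
        dist = np.linalg.norm(X - m, axis=1)
        dist = np.maximum(dist, 1e-12)        # guard against a data point equal to m
        w = 1.0 / dist                        # Weiszfeld weights, up to normalization
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        shift = np.linalg.norm(m_new - m)
        m = m_new
        if shift < tol:
            break
    return m
```

Each pass over the data costs $O(nd)$ operations and requires keeping the whole sample in memory, which is precisely what the recursive algorithm on the next slide avoids.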
A very simple and very fast algorithm

We consider the algorithm
\[
m_{n+1} = m_n + \gamma_n\, \frac{X_{n+1} - m_n}{\|X_{n+1} - m_n\|}
\]
and suppose the steps $\gamma_n$ are such that $\forall n$, $\gamma_n > 0$,
\[
\sum_{n \ge 1} \gamma_n = \infty \quad \text{and} \quad \sum_{n \ge 1} \gamma_n^2 < \infty .
\]
Advantages
• For a sample of $n$ realizations in $\mathbb{R}^d$: $O(nd)$ operations.
• No need to store all the data (data streams).
• Automatic update (online estimation).
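A minimal streaming sketch of this recursion in Python/NumPy (the initialization at zero and the step choice $\gamma_n = g/n^{3/4}$, used later in the talk, are illustrative):

```python
import numpy as np

def sgd_median(stream, d, g=1.0, alpha=0.75):
    """Online geometric median: m_{n+1} = m_n + gamma_n (X_{n+1} - m_n)/||X_{n+1} - m_n||."""
    m = np.zeros(d)                           # illustrative starting point
    for n, x in enumerate(stream, start=1):
        gamma = g / n**alpha                  # sum gamma_n = inf and sum gamma_n^2 < inf for 1/2 < alpha <= 1
        diff = np.asarray(x, dtype=float) - m
        norm = np.linalg.norm(diff)
        if norm > 0:                          # the no-atom assumption makes ties negligible
            m += gamma * diff / norm
    return m
```

One pass over $n$ observations costs $O(nd)$, only the current estimate is kept in memory, and the estimate can be updated on the fly as new data arrive.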
A Robbins-Monro (1951) algorithm

This algorithm (stochastic gradient) can also be written
\[
m_{n+1} = m_n - \gamma_n \Big( \underbrace{\Phi(m_n)}_{\text{gradient}} + \zeta_{n+1} \Big),
\quad \text{with} \quad
\zeta_{n+1} = -\,\frac{X_{n+1} - m_n}{\|X_{n+1} - m_n\|} - \Phi(m_n) .
\]
• If the $X_n$ are i.i.d., the sequence $\zeta_{n+1}$ is a martingale difference:
\[
E(\zeta_{n+1} \mid \mathcal{F}_n) = 0 \quad \text{with } \mathcal{F}_n = \sigma(X_0, \ldots, X_n) .
\]
Moreover, $E\big(\|\zeta_{n+1}\|^2 \mid \mathcal{F}_n\big) \le 4$.
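The second-moment bound is just the triangle inequality (a one-line check added here): both the normalized residual and $\Phi(m_n)$ have norm at most $1$, so
\[
\|\zeta_{n+1}\| \le \left\| \frac{X_{n+1} - m_n}{\|X_{n+1} - m_n\|} \right\| + \|\Phi(m_n)\| \le 2,
\qquad \text{hence} \qquad
E\big(\|\zeta_{n+1}\|^2 \mid \mathcal{F}_n\big) \le 4 .
\]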
A remark on geometric quantile estimation

This approach can be extended directly to get stochastic approximations of geometric quantiles (Chaudhuri, 1996). Consider a vector $u \in H$ such that $\|u\| < 1$. The geometric quantile of $X$, say $m^u$, corresponding to direction $u$, is defined, uniquely under the previous assumptions, by
\[
m^u = \arg\min_{Q \in H} E\big( \|X - Q\| + \langle X - Q, u \rangle \big).
\]
It is characterized by $\Phi_u(m^u) = \Phi(m^u) - u = 0$, which leads to the stochastic approximation
\[
\widehat{m}^u_{n+1} = \widehat{m}^u_n + \gamma_n \left( \frac{X_{n+1} - \widehat{m}^u_n}{\|X_{n+1} - \widehat{m}^u_n\|} + u \right).
\]
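The streaming sketch above carries over with one change, adding $u$ inside the update (same illustrative conventions as before):

```python
import numpy as np

def sgd_geometric_quantile(stream, u, g=1.0, alpha=0.75):
    """Online geometric quantile in direction u, with ||u|| < 1."""
    u = np.asarray(u, dtype=float)
    assert np.linalg.norm(u) < 1.0
    m = np.zeros_like(u)                      # illustrative starting point
    for n, x in enumerate(stream, start=1):
        gamma = g / n**alpha
        diff = np.asarray(x, dtype=float) - m
        norm = np.linalg.norm(diff)
        if norm > 0:
            m += gamma * (diff / norm + u)    # the extra "+ u" targets the quantile
    return m
```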
A convergence result in Hilbert spaces

Result (Cardot, Cénac, Zitt 2010). The sequence $m_n$ converges almost surely as $n$ tends to infinity:
\[
\|m_n - m\| \to 0, \quad \text{a.s.}
\]
• Sketch of the (classical) proof. Show that for all $\epsilon \in (0, 1)$, $P(\Omega_\epsilon) = 0$, with
\[
\Omega_\epsilon = \big\{ \omega \in \Omega : \exists\, n_\epsilon(\omega) \ge 1,\ \forall n \ge n_\epsilon(\omega),\ \epsilon < V_n(\omega) < \epsilon^{-1} \big\},
\]
considering that
\[
\lim_{n \to \infty} E V_{n+1} = E V_0 + \lim_{n \to \infty} \Big( \sum_{j=1}^{n} \gamma_j^2 + 2 \sum_{j=1}^{n} \gamma_j\, E \underbrace{\langle \Phi(m_j), m - m_j \rangle}_{<\, -\lambda_\epsilon\, \epsilon \ \text{on } \Omega_\epsilon} \Big) \le C .
\]
Property. For all $r > 0$, the function $G$ restricted to $x \in B(m, r)$ is strongly convex: $\exists\, \lambda_r > 0$ such that $\forall x \in B(m, r)$,
\[
\langle \Phi(x), x - m \rangle \ge G(x) - G(m) \ge \lambda_r \|x - m\|^2 .
\]
Does it really work?

A sample with Gaussian distribution, with mean $(0, 0)$ and variance
\[
\begin{pmatrix} 10 & 3 \\ 3 & 2 \end{pmatrix}.
\]
[Figure: scatter plot of the simulated sample, $X_1$ versus $X_2$.]
Not really, even for simple examples!!!
\[
m_{n+1} = m_n + \frac{g}{n^{3/4}}\, \frac{X_{n+1} - m_n}{\|X_{n+1} - m_n\|}
\]
[Figure: MSE over 10 000 iterations for the Robbins-Monro (RM) and averaged (AV) estimates, with $g = 1$ and $g = 10$.]
Averaging: a magic formula

Consider now the mean of all past iterations, $\bar{m}_n = \frac{1}{n} \sum_{j=1}^{n} m_j$:
\[
m_{n+1} = m_n + \gamma_n\, \frac{X_{n+1} - m_n}{\|X_{n+1} - m_n\|},
\qquad
\bar{m}_{n+1} = \bar{m}_n + \frac{m_{n+1} - \bar{m}_n}{n + 1} .
\]
Property (in $\mathbb{R}^d$). If $\gamma_n = g / n^\alpha$, $0.5 < \alpha < 1$,
\[
\sqrt{n}\, (\bar{m}_n - m) \rightsquigarrow N(0, \Delta) \quad \text{in distribution},
\]
where $\Delta$ is the same variance matrix as for $\widehat{m}_n$,
\[
\Delta = \Gamma_m^{-1}\, \mathrm{Var}(S(X - m))\, \Gamma_m^{-1} \quad \text{with } S(u) = u / \|u\|,\ u \in \mathbb{R}^d .
\]
Proof: as in Polyak & Juditsky (1992).
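In code, averaging adds a single running-mean update to the earlier streaming sketch (same illustrative conventions):

```python
import numpy as np

def averaged_sgd_median(stream, d, g=1.0, alpha=0.75):
    """Averaged stochastic gradient (Polyak-Ruppert): running mean of the iterates."""
    m = np.zeros(d)
    m_bar = np.zeros(d)
    for n, x in enumerate(stream, start=1):
        gamma = g / n**alpha
        diff = np.asarray(x, dtype=float) - m
        norm = np.linalg.norm(diff)
        if norm > 0:
            m += gamma * diff / norm
        m_bar += (m - m_bar) / n              # running mean of m_1, ..., m_n
    return m_bar
```

Averaging costs $O(d)$ extra work per observation and, by the property above, recovers the asymptotic variance of the batch estimator $\widehat{m}_n$.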
Averaging: 200 samples of size $n = 2000$

[Figure: boxplots of the estimation error over 200 samples of size $n = 2000$ for the averaged estimate with $g = 0.1, 0.5, 1, 2, 5, 10$, compared with the mean.]
Example: electricity consumption curves

• EDF (Électricité de France) plans to install 30 million communicating meters. Survey sampling techniques will be used to select 300 000 meters which will provide individual electricity consumption at fine time scales.
• A first test on a population of $N = 18900$ meters giving electricity consumption every 30 minutes.

[Figure: individual electricity consumption curves.]
Example: median trajectory

[Figure: electricity consumption over 7 days; mean curve, pointwise median, and (geometric) median curve.]
Perspectives

• Averaging in Hilbert spaces $H$: still no results on nonlinear algorithms in the literature.
• Discretized noisy trajectories:
\[
Z_n = \big( X_n(t_1^n) + \epsilon_1^n, \ldots, X_n(t_{p_n}^n) + \epsilon_{p_n}^n \big).
\]
• Covariates: conditional geometric median $m(X \mid Z = z) = \beta_0 + \beta_1 z$, where $Z$ is, for example, the mean consumption of the week before. We look for
\[
\min_{(\beta_0, \beta_1) \in H \times H} E\| X - (\beta_0 + \beta_1 Z) \| .
\]
• (Robust) clustering with medians based on $\|\cdot\|$ ($k$-median).