Semiparametric models with functional responses in a model assisted survey sampling setting
Hervé Cardot 1, Alain Dessertaine 2, and Etienne Josserand 1
1 Institut de Mathématiques de Bourgogne, UMR 5584 CNRS, herve.cardot@u-bourgogne.fr, etienne.josserand@u-bourgogne.fr
2 EDF, R&D, ICAME - SOAD, alain.dessertaine@edf.fr
Computational Statistics - Paris - August 23rd 2010
Outline
• Introduction
• Survey sampling and curve estimation
• Estimation with auxiliary information
• Application to electricity consumption curves
Survey sampling on curves
A new topic at the boundary between functional data analysis and survey sampling theory.
The EDF problem:
◮ EDF does not know what its clients consume at each point in time!
◮ EDF plans to install electricity meters able to send individual electricity consumption at very fine time scales.
◮ Collecting, storing and analysing all this information would be very expensive (≈ 30 million electricity meters).
◮ How can the mean consumption curve be estimated as precisely as possible, for France or for part of it (a particular region, a type of client, ...)?
Consumption curves
A sample of individual electricity consumption curves measured every half hour during one week.
[Figure: electricity consumption (vertical axis, 0 to 1200) against time in hours (horizontal axis, 0 to 150).]
Survey sampling in large databases of functional data
Chiky 2009 (PhD thesis, ENST): survey sampling procedures applied to the sensors, which allow a trade-off between limited storage capacity and data accuracy, can be relevant alternatives to signal compression for obtaining accurate approximations of simple estimates such as mean or total trajectories.
Sampling design and mean curve estimation
A population $U = \{1, \dots, k, \dots, N\}$ of finite size $N$.
With each individual (statistical unit) $k$ of the population $U$ we associate a deterministic curve $Y_k = (Y_k(t))_{t \in [0,T]} \in C[0,T]$.
Let $\mu \in C[0,T]$ be the mean of the $Y_k$ in the population:
$$\mu(t) = \frac{1}{N} \sum_{k \in U} Y_k(t), \quad t \in [0,T].$$
A sample $s$ is a subset $s \subset U$ of known size $n$, drawn according to a probability law $p$ on the set of subsets of $U$, with
◮ $\pi_k = \Pr(k \in s) > 0$ for all $k \in U$,
◮ $\pi_{kl} = \Pr(k \,\&\, l \in s) > 0$ for all $k, l \in U$, $k \neq l$.
The Horvitz-Thompson estimator of the mean curve is
$$\hat{\mu}(t) = \frac{1}{N} \sum_{k \in s} \frac{Y_k(t)}{\pi_k} = \frac{1}{N} \sum_{k \in U} \frac{Y_k(t)}{\pi_k} \mathbb{1}_{k \in s}, \quad t \in [0,T].$$
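As a minimal sketch of the Horvitz-Thompson estimator on a discretized time grid, the following Python snippet uses a made-up population of five curves; the function name `horvitz_thompson_mean` and all numbers are illustrative assumptions, not part of the original talk. It also averages the estimator over every possible sample of size 2 to illustrate design unbiasedness.

```python
from itertools import combinations


def horvitz_thompson_mean(curves, sample, pi, N):
    """Horvitz-Thompson estimate of the mean curve on a time grid.

    curves: dict unit -> list of values Y_k(t_1), ..., Y_k(t_p)
    sample: iterable of sampled unit labels
    pi:     dict unit -> first-order inclusion probability pi_k
    N:      population size
    """
    sample = list(sample)
    n_points = len(curves[sample[0]])
    estimate = [0.0] * n_points
    for k in sample:
        weight = 1.0 / (N * pi[k])
        for i, y in enumerate(curves[k]):
            estimate[i] += weight * y
    return estimate


# Toy population of N = 5 curves observed at 3 time points (made-up numbers).
curves = {
    1: [1.0, 2.0, 3.0],
    2: [2.0, 2.0, 2.0],
    3: [0.0, 4.0, 1.0],
    4: [3.0, 1.0, 5.0],
    5: [4.0, 6.0, 4.0],
}
N, n = 5, 2
pi = {k: n / N for k in curves}  # simple random sampling without replacement
true_mean = [sum(c[i] for c in curves.values()) / N for i in range(3)]

# Averaging the estimator over all equally likely samples of size n
# recovers the true mean curve (design unbiasedness).
all_samples = list(combinations(curves, n))
average = [0.0, 0.0, 0.0]
for s in all_samples:
    est = horvitz_thompson_mean(curves, s, pi, N)
    average = [a + e / len(all_samples) for a, e in zip(average, est)]
```

Only the sampled curves and their inclusion probabilities enter a single estimate, which is what makes the approach attractive when storing every curve is too costly.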
Two classical sampling designs
• Simple random sampling without replacement, with size $n$
◮ $\pi_k = \frac{n}{N}$ for all $k \in U$
◮ $\pi_{kl} = \frac{n(n-1)}{N(N-1)}$ for all $k, l \in U$, $k \neq l$
We recover the usual mean estimator
$$\hat{\mu}(t) = \frac{1}{n} \sum_{k \in s} Y_k(t).$$
• Stratified sampling with size $n$. The population $U$ is partitioned into $H$ strata, $\bigcup_{h=1}^{H} U_h = U$, of sizes $N_h$; a sample $s_h$ of size $n_h$ is drawn in each stratum
◮ $\pi_k = \frac{n_h}{N_h}$ for all $k \in U_h$
◮ $\pi_{kl} = \frac{n_h(n_h-1)}{N_h(N_h-1)}$ for all $k, l \in U_h$, $k \neq l$
◮ $\pi_{kl} = \frac{n_h n_\ell}{N_h N_\ell}$ for all $k \in U_h$, $l \in U_\ell$, $h \neq \ell$
So
$$\hat{\mu}(t) = \frac{1}{N} \sum_{h=1}^{H} N_h \frac{1}{n_h} \sum_{k \in s_h} Y_k(t) = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\mu}_h(t).$$
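The stratified estimator can be sketched in a few lines of Python. The function name `stratified_mean`, the stratum labels and the numbers below are hypothetical and serve only to show the weighting of each stratum mean by $N_h / N$.

```python
def stratified_mean(sampled_curves, stratum_sizes):
    """Stratified estimate of the mean curve on a time grid.

    sampled_curves: dict stratum -> list of sampled curves (lists of floats)
    stratum_sizes:  dict stratum -> population size N_h of that stratum
    """
    N = sum(stratum_sizes.values())
    n_points = len(next(iter(sampled_curves.values()))[0])
    estimate = [0.0] * n_points
    for h, curves_h in sampled_curves.items():
        n_h = len(curves_h)
        for i in range(n_points):
            mu_h = sum(c[i] for c in curves_h) / n_h  # stratum sample mean
            estimate[i] += stratum_sizes[h] * mu_h / N
    return estimate


# Two hypothetical strata, curves observed at 2 time points.
sampled_curves = {
    "A": [[1.0, 2.0], [3.0, 4.0]],  # n_A = 2 curves sampled out of N_A = 4
    "B": [[10.0, 20.0]],            # n_B = 1 curve sampled out of N_B = 6
}
stratum_sizes = {"A": 4, "B": 6}
mu_strat = stratified_mean(sampled_curves, stratum_sizes)
```

With a single stratum the function reduces to the plain sample mean, in line with the simple random sampling case.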
Use of auxiliary information
Taking into account the information given by $m$ auxiliary variables
◮ meteorological: temperature, cloud cover, ...
◮ geographical: altitude, longitude, latitude, ...
◮ behavioral: past mean consumption, ...
could improve the accuracy of the mean curve estimator.
This requires modeling the behavior of the individual electricity meters that are not in the sample:
$$Y_k(t) = \mu(t) + f(x_{k1}, \dots, x_{km}, t) + \text{error}$$
⊲ There is little hope of obtaining directly an accurate and flexible estimator of the function $f$, which depends on time $t$ and on the covariates $X_1, \dots, X_m$.
• Reducing the dimension of the data appears to be a promising approach.
Dimension reduction in a finite population
⊲ The best linear approximation, in the sense of quadratic error, of the functions $Y_k$ in a functional space of fixed dimension $q$, $q < N$, spanned by $q$ orthonormal functions $\phi_1, \dots, \phi_q$:
$$Y_k(t) = \mu(t) + \sum_{j=1}^{q} \langle Y_k - \mu, \phi_j \rangle \phi_j(t) + R_{qk}(t)$$
The mean remainder, measured in the $L^2[0,T]$ norm, satisfies
$$\frac{1}{N} \sum_{k \in U} \| R_{qk} \|^2 = \frac{1}{N} \sum_{k \in U} \| Y_k - \mu \|^2 - \sum_{j=1}^{q} \langle \Gamma \phi_j, \phi_j \rangle$$
where the covariance operator $\Gamma$ is associated with the covariance function
$$\gamma(s,t) = \frac{1}{N} \sum_{k \in U} (Y_k(t) - \mu(t)) (Y_k(s) - \mu(s)),$$
that is, for all $f \in L^2[0,T]$,
$$\Gamma f(s) = \int_0^T \gamma(s,t) f(t) \, dt, \quad s \in [0,T].$$
Minimizing the mean remainder $\frac{1}{N} \sum_{k \in U} \| R_{qk} \|^2$ with respect to $\phi_1, \dots, \phi_q$ amounts to finding the eigenfunctions of $\Gamma$.
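A numerical sketch of this decomposition, assuming curves observed on a regular grid and a Riemann-sum approximation of the $L^2[0,T]$ inner product. The synthetic curves and every variable name are illustrative assumptions. The snippet builds the discretized covariance operator, extracts its leading eigenfunctions with NumPy, and computes both sides of the mean remainder identity for $\phi_j$ taken as the leading eigenfunctions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, T = 200, 48, 1.0
t = np.linspace(0.0, T, p)
dt = t[1] - t[0]  # Riemann weight approximating the L2[0, T] inner product

# Synthetic curves (assumption): two smooth modes of variation plus noise.
Y = (5.0
     + rng.normal(size=(N, 1)) * np.sin(2 * np.pi * t)
     + 0.5 * rng.normal(size=(N, 1)) * np.cos(2 * np.pi * t)
     + 0.1 * rng.normal(size=(N, p)))

mu = Y.mean(axis=0)
Yc = Y - mu
gamma = Yc.T @ Yc / N  # covariance function gamma(s, t) on the grid

# Discretized operator: (Gamma f)(s) ~ sum_t gamma(s, t) f(t) dt
lam, U = np.linalg.eigh(gamma * dt)
lam, U = lam[::-1], U[:, ::-1]  # eigenvalues in decreasing order
V = U / np.sqrt(dt)             # rescale so that sum_t v_j(t)^2 dt = 1

q = 2
scores = Yc @ V[:, :q] * dt     # <Y_k - mu, v_j> for j = 1..q
R = Yc - scores @ V[:, :q].T    # remainders R_qk
mean_remainder = (R ** 2).sum(axis=1).mean() * dt
total_variation = (Yc ** 2).sum(axis=1).mean() * dt

# For eigenfunctions, <Gamma v_j, v_j> = lambda_j, so the identity reads
# mean_remainder = total_variation - (lambda_1 + ... + lambda_q).
explained = lam[:q].sum()
```

Comparing `mean_remainder` with `total_variation - explained` gives a quick check that the discretization and the normalization of the eigenfunctions are consistent.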
Model on principal components
Property. The remainder is minimal for $\phi_1 = v_1, \dots, \phi_q = v_q$, where
$$\Gamma v_j(t) = \lambda_j v_j(t), \quad t \in [0,T],$$
the functions $v_j$ form an orthonormal system in $L^2[0,T]$, and the eigenvalues are sorted: $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_N \geq 0$.
• Estimating the individual variations along the principal components (real-valued scores)
$$\langle Y_k - \mu, v_j \rangle \approx g_j(x_{k1}, \dots, x_{km})$$
makes it possible to apply model-assisted techniques to build an estimator of $\mu$:
$$\hat{\mu}_x(t) = \hat{\mu}(t) - \frac{1}{N} \left( \sum_{k \in s} \frac{\hat{Y}_k(t)}{\pi_k} - \sum_{k \in U} \hat{Y}_k(t) \right)$$
where
$$\hat{Y}_k(t) = \hat{\mu}(t) + \sum_{j=1}^{q} \hat{g}_j(x_{k1}, \dots, x_{km}) \hat{v}_j(t).$$
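The model-assisted correction itself is a one-liner once fitted curves $\hat{Y}_k$ are available for every unit of the population. The sketch below assumes curves stored as rows of NumPy arrays; the function name `model_assisted_mean` and the toy data are hypothetical. Two limiting cases are shown: a model that predicts every curve exactly, and a model that predicts zero everywhere, for which the estimator reduces to Horvitz-Thompson.

```python
import numpy as np


def model_assisted_mean(Y_s, pi_s, Yhat_s, Yhat_U):
    """Model-assisted estimate of the mean curve.

    Y_s:    (n, p) observed curves of the sampled units
    pi_s:   (n,)   inclusion probabilities of the sampled units
    Yhat_s: (n, p) fitted curves for the sampled units
    Yhat_U: (N, p) fitted curves for all units of the population
    """
    N = Yhat_U.shape[0]
    w = 1.0 / pi_s
    mu_ht = (w[:, None] * Y_s).sum(axis=0) / N  # Horvitz-Thompson part
    correction = ((w[:, None] * Yhat_s).sum(axis=0) - Yhat_U.sum(axis=0)) / N
    return mu_ht - correction


# Toy data (assumption): N = 50 random curves on 4 time points, SRSWOR of size 10.
rng = np.random.default_rng(0)
N, n, p = 50, 10, 4
Y_U = rng.normal(size=(N, p))
s = rng.choice(N, size=n, replace=False)
pi_s = np.full(n, n / N)

# Perfect model: fitted curves equal the true curves for every unit.
perfect = model_assisted_mean(Y_U[s], pi_s, Y_U[s], Y_U)
# Useless model: fitted curves are zero, the correction vanishes.
ht_only = model_assisted_mean(Y_U[s], pi_s, np.zeros((n, p)), np.zeros((N, p)))
```

The better the fitted curves track the true ones, the more of the sampling error is cancelled by the correction term, which is the motivation for modeling the principal component scores with auxiliary variables.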
An illustration of EDF consumption curves
[Figure with four panels: (a) mean consumption against time; (b) explained variance against the number of principal components (1 to 10); (c) first eigenfunction against time; (d) first principal components against weekly consumption.]
Estimation error of $\mu$: $\| \hat{\mu} - \mu \|$
[Figure: distribution of the estimation error (vertical axis, 0 to 20) for the three estimators SRWR, OPTIM and MA1.]
The model considered (MA1) is very simple:
$$\hat{Y}_k(t) = \hat{\mu}(t) + (\hat{\beta}_0 + \hat{\beta}_1 X_k) \hat{v}_1(t)$$
where $X_k$ is the mean consumption of unit $k$ over the previous week.
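An end-to-end sketch of a model of the MA1 type on synthetic data, under several loudly stated assumptions: the population is simulated (it is not EDF data), the sample is drawn by simple random sampling without replacement, the first eigenfunction is estimated from the sample covariance with Euclidean inner products on the grid, and the regression coefficients are fitted by ordinary least squares. The talk does not specify the fitting procedure, so this is one plausible reading rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, p = 1000, 100, 48
t = np.linspace(0.0, 1.0, p)

# Synthetic population (assumption): curve level driven by the auxiliary X_k.
X = rng.gamma(shape=2.0, scale=100.0, size=N)   # stand-in for last week's mean
profile = 1.0 + 0.3 * np.sin(2 * np.pi * t)     # hypothetical common profile
Y = X[:, None] * profile + rng.normal(scale=20.0, size=(N, p))
mu = Y.mean(axis=0)                             # true mean curve

# Simple random sampling without replacement.
s = rng.choice(N, size=n, replace=False)
pi = n / N

# Horvitz-Thompson estimate (equals the sample mean under this design).
mu_hat = (Y[s] / pi).sum(axis=0) / N

# First eigenvector of the sample covariance (largest eigenvalue comes last).
Yc = Y[s] - mu_hat
_, eigvec = np.linalg.eigh(Yc.T @ Yc / n)
v1 = eigvec[:, -1]
scores = Yc @ v1                                # first principal component scores

# Ordinary least squares fit of the scores on X_k (with intercept).
A = np.column_stack([np.ones(n), X[s]])
beta, *_ = np.linalg.lstsq(A, scores, rcond=None)
beta0, beta1 = beta

# Fitted curves for every unit of the population, then the corrected estimator.
Yhat = mu_hat + np.outer(beta0 + beta1 * X, v1)
mu_x = mu_hat - ((Yhat[s] / pi).sum(axis=0) - Yhat.sum(axis=0)) / N

err_ht = np.sqrt(((mu_hat - mu) ** 2).sum())    # error of Horvitz-Thompson
err_ma = np.sqrt(((mu_x - mu) ** 2).sum())      # error of the MA1-type estimator
```

Under this design and fit, the correction simplifies to `mu_hat + beta1 * (X.mean() - X[s].mean()) * v1`: the estimate is shifted along the first eigenvector in proportion to how far the sample mean of the auxiliary variable falls from its population mean.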
Comparison of the variances $\gamma(t,t)$ of the estimators $\hat{\mu}$
[Figure: empirical variance (vertical axis, 0 to 50) against time (horizontal axis, 0 to 300) for the estimators SRSWR, OPTIM and MA1.]
Problem: there is no explicit formula for variance estimation
• A candidate asymptotic formula (when $n, N \to \infty$)
• Is a corrected variance needed, one that depends on the variances (perturbations) of the eigenvectors?
• ...