Goodness-of-fit tests for the functional linear model with scalar response with responses missing at random
Manuel Febrero-Bande (1), Pedro Galeano (2), Eduardo García-Portugués (2) and Wenceslao González-Manteiga (1)
(1) Department of Statistics, Mathematical Analysis and Optimization, Universidade de Santiago de Compostela
(2) Department of Statistics and UC3M-BS Institute of Financial Big Data, Universidad Carlos III de Madrid
Pedro Galeano · Goodness-of-fit for the FLM with MAR responses · III IWAFDA
Motivation
Regression model with a functional covariate and a scalar response:
◮ General model: $Y = m(\mathcal{X}) + \varepsilon$, where:
⋆ Real response: $Y$, centered.
⋆ Functional covariate: $\mathcal{X} \in \mathcal{H}$, centered and with covariance operator $\Gamma$.
⋆ Hilbert space: $\mathcal{H}$ of square integrable functions, with inner product $\langle \cdot, \cdot \rangle$ and associated norm $\| \cdot \|$.
⋆ Regression operator: $m(x) = E[Y \mid \mathcal{X} = x]$.
⋆ Error random variable: $\varepsilon \sim (0, \sigma^2)$, with $\varepsilon$ uncorrelated with $\mathcal{X}$.
◮ Interest: Given a random sample $\{(\mathcal{X}_i, Y_i)\}_{i=1}^{n}$ from $(\mathcal{X}, Y)$, check whether the regression operator $m$ is linear.
◮ Goodness-of-fit tests for linearity:
⋆ García-Portugués, González-Manteiga and Febrero-Bande (2014, JCGS).
⋆ Cuesta-Albertos, García-Portugués, González-Manteiga and Febrero-Bande (2019, AoS).
Motivation
Febrero-Bande, Galeano, and González-Manteiga (2019, CSDA):
◮ Data set: Data from 73 Spanish weather stations in the period 1980-2009.
◮ Functional covariate: Mean curve of the annual average daily temperature.
◮ Real response: Average of the total number of hours of sunshine per year.
◮ Missing responses: The responses are not observed in 26 out of the 73 weather stations (35.62% of missing responses).
◮ Functional linear model with scalar response (FLMSR): $m(\mathcal{X}) = \langle \mathcal{X}, \beta \rangle$, where $\beta \in \mathcal{H}$ is a functional slope and $\langle \cdot, \cdot \rangle$ is the inner product of $\mathcal{H}$.
◮ Two methods for estimating $\beta$ with FPCs:
1. Simplified method: Delete the pairs with missing responses.
2. Imputed method: Impute the missing responses before estimation.
◮ Results suggest: The imputed method outperforms the simplified method.
Motivation
Work in progress:
◮ Goal: Analyze goodness-of-fit tests for functional regression models when some of the responses are missing at random.
◮ Two possibilities:
1. Use goodness-of-fit tests after deleting pairs with missing responses.
2. Impute missing responses, then use goodness-of-fit tests.
◮ Question: Which option is better?
◮ Today, initial results on:
⋆ Model: Functional linear model with scalar response (FLMSR).
⋆ Goodness-of-fit test: García-Portugués et al. (2014, JCGS).
The testing problem
Elements of the testing problem:
◮ Problem: Test the linear hypothesis $H_0 : m \in \{\langle \cdot, \beta \rangle : \beta \in \mathcal{H}\}$ versus the alternative hypothesis $H_1 : m \notin \{\langle \cdot, \beta \rangle : \beta \in \mathcal{H}\}$.
◮ Random sample: $\{(\mathcal{X}_i, Y_i, R_i)\}_{i=1}^{n}$ generated from $(\mathcal{X}, Y, R)$, where $R$ is Bernoulli with $R_i = 1$ if $Y_i$ is observed, and $R_i = 0$ if $Y_i$ is missing.
◮ Missing at Random (MAR) mechanism: $P(R = 1 \mid Y, \mathcal{X}) = P(R = 1 \mid \mathcal{X}) = p(\mathcal{X})$, where $p : \mathcal{H} \to [0, 1]$ is an unspecified operator of $\mathcal{X}$.
◮ Consequence: Given $\mathcal{X}$, missingness does not depend on $Y$, so missing responses can be predicted from the available information.
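To make the MAR mechanism concrete, the following minimal sketch draws the indicators $R_i$ from a Bernoulli distribution whose success probability depends on the covariate only. The logistic form of $p$, the parameters `a` and `b`, and the use of a single scalar summary of each curve (`scores`) are illustrative assumptions; the slides leave $p$ unspecified.

```python
import math
import random

def simulate_mar_indicators(scores, a=0.5, b=1.0, seed=0):
    """Draw R_i ~ Bernoulli(p(X_i)), where p depends on X_i only.

    scores[i] is a hypothetical scalar summary of the curve X_i
    (for instance, its first FPC score). The logistic link is an assumption.
    """
    rng = random.Random(seed)
    indicators = []
    for s in scores:
        p = 1.0 / (1.0 + math.exp(-(a + b * s)))  # assumed form of p(X)
        indicators.append(1 if rng.random() < p else 0)
    return indicators

# Toy usage: standard normal summaries for 2000 curves.
rng = random.Random(1)
scores = [rng.gauss(0.0, 1.0) for _ in range(2000)]
R = simulate_mar_indicators(scores)
missing_rate = 1 - sum(R) / len(R)
```

Because the response never enters the computation of `p`, the simulated mechanism satisfies $P(R = 1 \mid Y, \mathcal{X}) = P(R = 1 \mid \mathcal{X})$ by construction.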
Estimation of the FLMSR with MAR responses
Estimation of $\beta$ with Functional Principal Components (FPCs):
◮ The FLMSR: $Y = \langle \mathcal{X}, \beta \rangle + \varepsilon$.
◮ Functional slope: $\beta = \sum_{k=1}^{\infty} b_k \psi_k$, where:
⋆ $\psi_1, \psi_2, \ldots$ are the eigenfunctions of $\Gamma$ linked to the eigenvalues $\lambda_1 > \lambda_2 > \cdots > 0$.
⋆ $b_k = \mathrm{Cov}[Y, S_k] / \lambda_k$, for $k \in \mathbb{N}$.
⋆ $S_k = \langle \mathcal{X}, \psi_k \rangle$, for $k \in \mathbb{N}$, are the FPC scores of $\mathcal{X}$.
◮ Problem: Estimate $\beta$ with a random sample $\{(\mathcal{X}_i, Y_i, R_i)\}_{i=1}^{n}$.
◮ Need:
⋆ Estimates of $\psi_1, \psi_2, \ldots$ and $\lambda_1, \lambda_2, \ldots$
⋆ Sample versions of the scores $S_1, S_2, \ldots$
⋆ A cutoff to truncate the infinite sum that defines $\beta$.
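The expression for $b_k$ follows from the model in one line. The sketch below assumes that the sum and the covariance can be interchanged, and uses two facts from the slides: the FPC scores are uncorrelated with $\mathrm{Var}[S_k] = \lambda_k$, and $\varepsilon$ is uncorrelated with $\mathcal{X}$.

```latex
\begin{align*}
\mathrm{Cov}[Y, S_k]
  &= \mathrm{Cov}\big[\langle \mathcal{X}, \beta \rangle + \varepsilon,\; S_k\big] \\
  &= \sum_{j=1}^{\infty} b_j \,\mathrm{Cov}[S_j, S_k] + \mathrm{Cov}[\varepsilon, S_k]
     \qquad \text{since } \langle \mathcal{X}, \beta \rangle = \sum_{j=1}^{\infty} b_j S_j \\
  &= b_k \lambda_k ,
\end{align*}
```

so that $b_k = \mathrm{Cov}[Y, S_k] / \lambda_k$.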
Estimation of the FLMSR with MAR responses
Simplified estimation (Febrero-Bande et al., 2019, CSDA):
◮ Complete-case analysis: Delete pairs with missing responses.
◮ Covariates of complete pairs: $\mathcal{X}_S = \{\mathcal{X}_i : i \in I_S\}$, where $I_S = \{i : R_i = 1\}$.
◮ Estimates of $\psi_1, \psi_2, \ldots$ and $\lambda_1, \lambda_2, \ldots$: Eigenfunctions $\hat\psi_{1,S}, \hat\psi_{2,S}, \ldots$ and eigenvalues $\hat\lambda_{1,S} \geq \hat\lambda_{2,S} \geq \cdots$ of the sample covariance operator $\hat\Gamma_{\mathcal{X}_S}$.
◮ Sample FPC scores: $\hat S_{i,k,S} = \langle \mathcal{X}_i, \hat\psi_{k,S} \rangle$, for $i \in I_S$ and $k \in \mathbb{N}$.
◮ Estimate of $b_k$: $\hat b_{k,S} = \frac{1}{n_S \hat\lambda_{k,S}} \sum_{i \in I_S} Y_i \hat S_{i,k,S}$, where $n_S = \# I_S$, for $k \in \mathbb{N}$.
◮ Estimate of $\beta$: $\hat\beta_{k_S} = \sum_{k=1}^{k_S} \hat b_{k,S} \hat\psi_{k,S}$, where $k_S$ is a cutoff.
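The steps above can be sketched numerically for curves observed on an equispaced grid. This is an illustrative discretization, not the authors' implementation: the function name `simplified_fpc_estimator`, the midpoint quadrature with spacing `dt`, and the simulated data are all assumptions, and the curves and responses are taken to be already centered, as in the slides.

```python
import numpy as np

def simplified_fpc_estimator(X, Y, R, k_S, dt):
    """Complete-case FPC estimate of beta on the grid.

    X: (n, T) centered curves on an equispaced grid with spacing dt.
    Y: (n,) centered responses; entries with R == 0 are ignored.
    R: (n,) 0/1 indicators of observed responses.
    """
    obs = np.asarray(R) == 1
    X_S, Y_S = X[obs], Y[obs]
    n_S = X_S.shape[0]
    # Discretized sample covariance operator of the complete-case curves.
    cov = (X_S.T @ X_S) / n_S * dt
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k_S]    # k_S largest eigenvalues
    lam = eigvals[order]
    psi = eigvecs[:, order].T / np.sqrt(dt)    # rows have unit L2 norm
    scores = X_S @ psi.T * dt                  # S_hat_{i,k,S} = <X_i, psi_hat_{k,S}>
    b = (Y_S @ scores) / (n_S * lam)           # b_hat_{k,S}
    beta = b @ psi                             # beta_hat_{k_S} on the grid
    return beta, psi, lam

# Simulated example (all settings hypothetical).
rng = np.random.default_rng(0)
n, T = 500, 50
dt = 1.0 / T
t = (np.arange(T) + 0.5) * dt
phi = np.vstack([np.sqrt(2) * np.sin(np.pi * t),
                 np.sqrt(2) * np.sin(2 * np.pi * t)])
xi = rng.normal(size=(n, 2)) * np.array([1.0, 0.5])
X = xi @ phi
beta_true = 2.0 * phi[0] - 1.0 * phi[1]
Y = X @ beta_true * dt + rng.normal(scale=0.1, size=n)
p = 1.0 / (1.0 + np.exp(-(0.5 + xi[:, 0])))   # MAR: depends on X only
R = (rng.random(n) < p).astype(int)

beta_hat, psi_hat, lam_hat = simplified_fpc_estimator(X, Y, R, k_S=2, dt=dt)
l2_error = np.sqrt(np.sum((beta_hat - beta_true) ** 2) * dt)
```

Dividing the eigenvectors by `sqrt(dt)` makes each estimated eigenfunction have unit norm under the quadrature rule, so the scores and eigenvalues are on the same scale as in the formulas above.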
Estimation of the FLMSR with MAR responses
Imputed estimator (Febrero-Bande et al., 2019, CSDA):
◮ Impute missing responses: $\hat Y_{i,k_S} = \langle \mathcal{X}_i, \hat\beta_{k_S} \rangle$, for $i \notin I_S$.
◮ New set of responses: $Y_{i,k_S} = R_i Y_i + (1 - R_i) \hat Y_{i,k_S}$, for $i = 1, \ldots, n$.
◮ Covariates of all pairs: $\mathcal{X}_C = \{\mathcal{X}_i : i = 1, \ldots, n\}$.
◮ Estimates of $\psi_1, \psi_2, \ldots$ and $\lambda_1, \lambda_2, \ldots$: Eigenfunctions $\hat\psi_{1,C}, \hat\psi_{2,C}, \ldots$ and eigenvalues $\hat\lambda_{1,C} \geq \hat\lambda_{2,C} \geq \cdots$ of the sample covariance operator $\hat\Gamma_{\mathcal{X}_C}$.
◮ Sample FPC scores: $\hat S_{i,k,C} = \langle \mathcal{X}_i, \hat\psi_{k,C} \rangle$, for $i = 1, \ldots, n$ and $k \in \mathbb{N}$.
◮ Estimate of $b_k$: $\hat b_{k,k_S,C} = \frac{1}{n \hat\lambda_{k,C}} \sum_{i=1}^{n} Y_{i,k_S} \hat S_{i,k,C}$, for $k \in \mathbb{N}$.
◮ Estimate of $\beta$: $\hat\beta_{k_S,k_C} = \sum_{k=1}^{k_C} \hat b_{k,k_S,C} \hat\psi_{k,C}$, where $k_C$ is a cutoff.
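A minimal numerical sketch of the two-stage imputed estimator follows, under the same illustrative assumptions as any grid-based discretization: equispaced grid with spacing `dt`, centered data, and hypothetical helper names `fpc_fit` and `imputed_fpc_estimator`. Missing responses are stored as NaN in the toy data to show that they are never used directly.

```python
import numpy as np

def fpc_fit(X, Y, k, dt):
    """FPC regression estimate of beta from fully observed pairs (X, Y)."""
    n = X.shape[0]
    eigvals, eigvecs = np.linalg.eigh((X.T @ X) / n * dt)
    order = np.argsort(eigvals)[::-1][:k]
    psi = eigvecs[:, order].T / np.sqrt(dt)   # unit-norm eigenfunctions
    scores = X @ psi.T * dt                   # sample FPC scores
    b = (Y @ scores) / (n * eigvals[order])
    return b @ psi

def imputed_fpc_estimator(X, Y, R, k_S, k_C, dt):
    obs = np.asarray(R) == 1
    beta_S = fpc_fit(X[obs], Y[obs], k_S, dt)   # stage 1: simplified estimator
    # Y_{i,k_S} = R_i Y_i + (1 - R_i) <X_i, beta_hat_{k_S}>; np.where avoids
    # propagating NaN placeholders stored at the missing positions.
    Y_imp = np.where(obs, Y, X @ beta_S * dt)
    return fpc_fit(X, Y_imp, k_C, dt)           # stage 2: all n curves

# Simulated example (all settings hypothetical).
rng = np.random.default_rng(0)
n, T = 500, 50
dt = 1.0 / T
t = (np.arange(T) + 0.5) * dt
phi = np.vstack([np.sqrt(2) * np.sin(np.pi * t),
                 np.sqrt(2) * np.sin(2 * np.pi * t)])
xi = rng.normal(size=(n, 2)) * np.array([1.0, 0.5])
X = xi @ phi
beta_true = 2.0 * phi[0] - 1.0 * phi[1]
Y = X @ beta_true * dt + rng.normal(scale=0.1, size=n)
R = (rng.random(n) < 1.0 / (1.0 + np.exp(-(0.5 + xi[:, 0])))).astype(int)
Y[R == 0] = np.nan                              # missing responses

beta_hat = imputed_fpc_estimator(X, Y, R, k_S=2, k_C=2, dt=dt)
l2_error = np.sqrt(np.sum((beta_hat - beta_true) ** 2) * dt)
```

The second stage differs from the first in two ways: the eigenfunctions come from all $n$ curves rather than the complete cases, and the cutoff `k_C` can be chosen separately from `k_S`.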
Estimation of the FLMSR with MAR responses
Important notes:
◮ Selection of cutoffs: Use leave-one-out cross-validation or standard model selection criteria (GCV, AIC, AICc, SIC, SICc, ...).
◮ Consequence: The cutoff $k_S$ selected for $\hat\beta_{k_S}$ may differ from the cutoffs $k_S$ and/or $k_C$ selected for $\hat\beta_{k_S,k_C}$; e.g., $\hat\beta_2$ and $\hat\beta_{1,3}$ may be the chosen estimators, respectively.
◮ Two sources of potential improvement:
1. Principal component estimation: $\hat\beta_{k_S}$ depends on $\hat\psi_{k,S}$ (constructed with $\mathcal{X}_S$), while $\hat\beta_{k_S,k_C}$ depends on $\hat\psi_{k,C}$ (constructed with $\mathcal{X}_C$).
2. Cutoff selection: $\hat\beta_{k_S,k_C}$ may have smaller MSEE than $\hat\beta_{k_S}$ if the cutoffs are selected appropriately (see Febrero-Bande et al., 2019, CSDA).
Testing linearity with MAR responses
A Cramér-von Mises testing procedure (I):
◮ García-Portugués et al. (2014, JCGS): The following statements are equivalent:
1. $m(\mathcal{X}) = \langle \mathcal{X}, \beta \rangle$, $\forall \mathcal{X} \in \mathcal{H}$.
2. $E\big[(Y - \langle \mathcal{X}, \beta \rangle)\, \mathbb{1}_{\{\langle \mathcal{X}, \gamma \rangle \leq u\}}\big] = 0$, for a.e. $u \in \mathbb{R}$ and $\forall \gamma \in \mathbb{S}_{\mathcal{H}}$, where $\mathbb{S}_{\mathcal{H}} = \{\gamma \in \mathcal{H} : \|\gamma\| = 1\}$.
◮ Estimate of $\beta$: $\hat\beta$ may be $\hat\beta_{k_S}$, $\hat\beta_{k_S,k_C}$ or some other estimator.
◮ Residuals: $\hat\varepsilon_i = Y_i - \langle \mathcal{X}_i, \hat\beta \rangle$, for $i \in I_S = \{i : R_i = 1\}$.
◮ Therefore: Residuals are available only for the observed responses.
Testing linearity with MAR responses
A Cramér-von Mises testing procedure (II):
◮ Residual marked empirical process based on projections: $R(\hat\beta, u, \gamma) = n^{-1/2} \sum_{i=1}^{n} R_i \hat\varepsilon_i \mathbb{1}_{\{\langle \mathcal{X}_i, \gamma \rangle \leq u\}}$, where $u \in \mathbb{R}$ and $\gamma \in \mathbb{S}_{\mathcal{H}}$.
◮ CvM statistic: Measure the deviation of $\{(\mathcal{X}_i, Y_i, R_i)\}_{i=1}^{n}$ from $H_0$ with: $\mathrm{PCvM}(\hat\beta) = \int_{\mathbb{R} \times \mathbb{S}_{\mathcal{H}}} R(\hat\beta, u, \gamma)^2 \, F_{n,\gamma}(du)\, \omega(d\gamma)$, where $F_{n,\gamma}$ is the ECDF of $\{\langle \mathcal{X}_i, \gamma \rangle : i = 1, \ldots, n\}$, and $\omega$ is a measure on $\mathbb{S}_{\mathcal{H}}$.
◮ Unfortunately: Computation of the statistic $\mathrm{PCvM}(\hat\beta)$ is not feasible because $\mathbb{S}_{\mathcal{H}}$ is of infinite dimension.
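For a single fixed direction $\gamma$, both the process and the inner integral over $u$ are simple to compute, since integrating against the ECDF $F_{n,\gamma}$ amounts to averaging over the $n$ projections. The sketch below illustrates this on a four-point toy example; the function names and the inputs (`resid`, `proj`) are hypothetical, and `proj[i]` stands for a precomputed value of $\langle \mathcal{X}_i, \gamma \rangle$.

```python
import numpy as np

def marked_process(resid, R, proj, u):
    """n^{-1/2} * sum_i R_i * eps_hat_i * 1{<X_i, gamma> <= u}."""
    resid, R, proj = np.asarray(resid, float), np.asarray(R), np.asarray(proj, float)
    n = len(proj)
    # Residuals exist only where Y_i is observed; missing ones contribute 0.
    marks = np.where(R == 1, resid, 0.0)
    return np.sum(marks * (proj <= u)) / np.sqrt(n)

def cvm_one_direction(resid, R, proj):
    """Integral of the squared process against the ECDF of the projections."""
    return float(np.mean([marked_process(resid, R, proj, u) ** 2 for u in proj]))

# Toy example: four curves, the last response is missing.
resid = [1.0, -1.0, 2.0, np.nan]
R = [1, 1, 1, 0]
proj = [0.1, 0.2, 0.3, 0.4]
value = cvm_one_direction(resid, R, proj)   # (0.25 + 0 + 1 + 1) / 4 = 0.5625
```

All $n$ projections enter through the ECDF, including those of curves with missing responses, while only the observed residuals enter as marks.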
Testing linearity with MAR responses
A Cramér-von Mises testing procedure (III):
◮ Idea: Replace $\gamma \in \mathbb{S}_{\mathcal{H}}$ in PCvM with:
⋆ Simplified estimator: $\hat\gamma_{k_S} = \sum_{k=1}^{k_S} \langle \gamma, \hat\psi_{k,S} \rangle \hat\psi_{k,S}$, where $\gamma \in \mathbb{S}_{\mathcal{H}}$.
⋆ Imputed estimator: $\hat\gamma_{k_S,k_C} = \sum_{k=1}^{k_C} \langle \gamma, \hat\psi_{k,C} \rangle \hat\psi_{k,C}$, where $\gamma \in \mathbb{S}_{\mathcal{H}}$.
◮ Modified CvM statistic: $\mathrm{MPCvM}(\hat\beta) = \int_{\mathbb{R} \times \mathbb{S}_{\mathcal{H}}^{k}} R(\hat\beta, u, \hat\gamma_k)^2 \, F_{n,\hat\gamma_k}(du)\, \omega(d\hat\gamma_k)$, where $k$ is either $k_S$ or $k_C$, and $F_{n,\hat\gamma_k}$ is the ECDF of $\{\langle \mathcal{X}_i, \hat\gamma_k \rangle : i = 1, \ldots, n\}$.
◮ Simpler expression: After some algebra, it is possible to show that $\mathrm{MPCvM}(\hat\beta) = n^{-2}\, \hat\varepsilon_S' A \hat\varepsilon_S$, where $\hat\varepsilon_S$ is the vector of residuals and $A$ is a certain square symmetric matrix.
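Since the matrix $A$ is not spelled out here, the following sketch does not implement the closed-form expression. Instead it approximates the modified statistic by Monte Carlo: because $\langle \mathcal{X}_i, \hat\gamma_k \rangle = \sum_{j=1}^{k} g_j \hat S_{i,j}$ with $g_j = \langle \gamma, \hat\psi_j \rangle$, directions can be drawn as coefficient vectors $g$ in $\mathbb{R}^k$. Taking $\omega$ to be the uniform probability measure on the unit sphere of $\mathbb{R}^k$ is an assumption, as are the function name `mpcvm_monte_carlo` and the simulated inputs; the result matches the integral only up to the normalization of $\omega$.

```python
import numpy as np

def mpcvm_monte_carlo(resid, R, scores, n_dirs=500, seed=0):
    """Monte Carlo approximation of the projected CvM statistic.

    resid:  (n,) residuals; values at positions with R == 0 are ignored.
    R:      (n,) 0/1 indicators of observed responses.
    scores: (n, k) sample FPC scores of all n curves.
    """
    rng = np.random.default_rng(seed)
    n, k = scores.shape
    marks = np.where(np.asarray(R) == 1, resid, 0.0)
    total = 0.0
    for _ in range(n_dirs):
        g = rng.normal(size=k)
        g /= np.linalg.norm(g)                    # uniform direction on the sphere
        proj = scores @ g                         # <X_i, gamma_hat_k>
        # indic[i, j] = 1{proj_i <= proj_j}, i.e. the indicator at u = proj_j
        indic = (proj[:, None] <= proj[None, :]).astype(float)
        process = marks @ indic / np.sqrt(n)      # process evaluated at each u_j
        total += np.mean(process ** 2)            # integrate against the ECDF
    return total / n_dirs

# Toy inputs (hypothetical).
rng = np.random.default_rng(1)
scores = rng.normal(size=(40, 3))
R = (rng.random(40) < 0.7).astype(int)
resid = rng.normal(size=40)
resid[R == 0] = np.nan                            # no residual where Y is missing
stat = mpcvm_monte_carlo(resid, R, scores)
```

The statistic is quadratic in the residuals, which is consistent with the quadratic form $n^{-2}\, \hat\varepsilon_S' A \hat\varepsilon_S$: doubling the residuals multiplies the value by four. Calibrating the test (for example by a resampling scheme) is outside the scope of this sketch.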