Computational treatment of the error distribution in nonparametric - PowerPoint PPT Presentation

Estimation Asymptotic results Bandwidth selection Simulations Data Analysis Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data Géraldine LAURENT Jointly with Cédric HEUCHENNE QuantOM, HEC-ULg Management School-University of Liege Tuesday, 24 August 2010

Between 1987 and 1997, the Spanish Institute for Statistics studied unemployment among the active population, and in particular among married women. For these data, we note that
• the unemployment duration is not always completely observed,
• the age of the woman affects her future employment.

[Figure: scatter plot of unemployment duration (in months) against woman's age (in months), with censored and observed durations marked separately.]


Estimation

We consider the nonparametric regression model
$$Y = m(X) + \sigma(X)\,\varepsilon$$
where
• $Y$ is the response variable
• $X$ is the covariate
• $m(\cdot) = E[Y \mid \cdot]$ and $\sigma^2(\cdot) = \operatorname{Var}[Y \mid \cdot]$ are unknown smooth functions
• $\varepsilon$ is independent of $X$, with $E[\varepsilon] = 0$ and $\operatorname{Var}[\varepsilon] = 1$

Particular features of $(X, Y)$
• $(X, Y)$ is obtained from cross-sectional sampling
• $Y$ is subject to right censoring.
We study the variable $Y$ restricted to $T \le Y \le C$, where
• $T$ is the truncation variable
• $C$ is the censoring variable.

[Figure: timeline of the real world, with the truncation time marked.]
In the real world, we use $F$ as notation for the cdf.

[Figure: timeline of the intermediate observed world, showing the truncation time and the observations Y1, C2, Y3, Y4, C5, C6.]
In the intermediate observed world, we use $H$ as notation for the cdf and $n$ for the sample size.

[Figure: timeline of the observed world, showing the truncation time and the observations Y1, Y3, Y4.]
In the observed world, we use $H$ as notation for the cdf.

Aim: estimation of the error distribution
$$F_\varepsilon(e) = \mathbb{P}(\varepsilon \le e)$$
with $(X, Y)$ where $T \le Y \le C$, where
• the distribution $F_{T|X}$ is a parametric distribution
• the distribution $F_{C-T|X}$ is completely unknown

Assumptions:
• the variables $Y$ and $T$ are independent, conditionally on $X$
• for each value $x$, the support of $F_{Y|X}(\cdot \mid x)$ is included in the support of $F_{T|X}(\cdot \mid x)$
• the lower bound of the support of $T$ is zero
• the variables $(T, Y)$ and $C - T$ are independent, conditionally on $T \le Y$ and $X$

We have
$$H_{X,Y}(x,y) = \mathbb{P}(X \le x, Y \le y \mid T \le Y \le C) = \left(E[w(X,Y)]\right)^{-1} \int_{r \le x}\int_{s \le y} w(r,s)\, dF_{X,Y}(r,s),$$
where the weight function $w(x,y)$ is defined by
$$w(x,y) = \int_{t \le y} \{1 - G(y-t \mid x)\}\, dF_{T|X}(t \mid x)$$
and $G(z \mid x) = \mathbb{P}(C - T \le z \mid X = x, T \le Y)$.
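As a rough numerical illustration of the weight function, the sketch below approximates $w(x,y)$ with a trapezoidal rule for one fixed $x$. It assumes that $T$ given $X = x$ has a density and that its support starts at zero; the names `weight_by_quadrature`, `G_exp` and `f_unif`, as well as the exponential and uniform laws, are hypothetical choices made only for this example and do not come from the slides.

```python
import numpy as np

def weight_by_quadrature(y, G, density_T, n_grid=2001):
    """Approximate w(x, y) = int_0^y {1 - G(y - t | x)} f_T(t | x) dt for fixed x."""
    t = np.linspace(0.0, y, n_grid)
    integrand = (1.0 - G(y - t)) * density_T(t)
    # Trapezoidal rule written out by hand.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))

def G_exp(z):
    # Hypothetical law of C - T: exponential with rate 1.
    return 1.0 - np.exp(-np.maximum(z, 0.0))

def f_unif(t):
    # Hypothetical truncation density: T | X = x uniform on [0, 10].
    return np.where((t >= 0.0) & (t <= 10.0), 0.1, 0.0)

w_val = weight_by_quadrature(2.0, G_exp, f_unif)
# For these toy choices the integral has the closed form 0.1 * (1 - exp(-y)).
closed_form = 0.1 * (1.0 - np.exp(-2.0))
```

In practice $G$ would be replaced by its estimator and $F_{T|X}$ by the fitted parametric law.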

In particular, if $C = T + \tau$ where $\tau$ is a positive constant, applying the same procedure gives the weight function
$$w(x,y) = \int_{0 \vee (y-\tau)}^{y} dF_{T|X}(t \mid x).$$
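When $C - T$ equals the constant $\tau$, the weight reduces to a difference of two values of the truncation cdf. A minimal sketch, assuming $F_{T|X}(\cdot \mid x)$ is continuous; the function names and the uniform truncation law are hypothetical and serve only as an example.

```python
import numpy as np

def weight_fixed_window(y, tau, cdf_T):
    """w(x, y) = F_T(y | x) - F_T(max(0, y - tau) | x), for a continuous F_T."""
    y = np.asarray(y, dtype=float)
    lower = np.maximum(0.0, y - tau)
    return cdf_T(y) - cdf_T(lower)

def uniform_cdf(t, upper=10.0):
    # Hypothetical truncation law: T | X = x uniform on [0, upper].
    return np.clip(np.asarray(t, dtype=float) / upper, 0.0, 1.0)

w = weight_fixed_window([1.0, 4.0, 8.0], tau=3.0, cdf_T=uniform_cdf)
```

For $y \le \tau$ the lower limit is zero, so the weight is simply $F_{T|X}(y \mid x)$.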

We obtain
$$F_{X,Y}(x,y) = \int_{r \le x}\int_{s \le y} \frac{E[w(X,Y)]}{w(r,s)}\, dH_{X,Y}(r,s).$$
Therefore,
$$F_\varepsilon(e) = \mathbb{P}\left(\frac{Y - m(X)}{\sigma(X)} \le e\right) = \iint_{\left\{(x,y):\ \frac{y - m(x)}{\sigma(x)} \le e\right\}} dF_{X,Y}(x,y) = \iint_{\left\{(x,y):\ \frac{y - m(x)}{\sigma(x)} \le e\right\}} \frac{E[w(X,Y)]}{w(x,y)}\, dH_{X,Y}(x,y).$$

Thus, the estimator is
$$\hat F_\varepsilon(e) = \frac{1}{M} \sum_{i=1}^{n} \frac{\hat E[w(X,Y)]}{\hat w(X_i, Y_i)}\, I\{\hat\varepsilon_i \le e, \Delta_i = 1\}$$
with
$$\hat\varepsilon_i = \frac{Y_i - \hat m(X_i)}{\hat\sigma(X_i)}, \qquad M = \sum_{i=1}^{n} \Delta_i, \qquad \hat E[w(X,Y)] = \left(\frac{1}{M} \sum_{i=1}^{n} \frac{\Delta_i}{\hat w(X_i, Y_i)}\right)^{-1},$$
where the functions $\hat m(\cdot)$, $\hat\sigma(\cdot)$ and $\hat w(\cdot,\cdot)$ are nonparametric estimators.
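The estimator is a weighted empirical distribution of the uncensored residuals. The sketch below evaluates it on a grid, assuming the residuals and the estimated weights have already been computed; the function name `error_cdf` and the toy numbers are hypothetical.

```python
import numpy as np

def error_cdf(e_grid, resid, w_hat, delta):
    """Weighted empirical cdf of the residuals, using uncensored points only."""
    resid = np.asarray(resid, dtype=float)
    w_hat = np.asarray(w_hat, dtype=float)
    unc = np.asarray(delta) == 1
    M = unc.sum()                      # number of uncensored observations
    inv_w = 1.0 / w_hat[unc]
    E_w = 1.0 / inv_w.mean()           # (M^{-1} sum_i Delta_i / w_i)^{-1}
    jumps = E_w * inv_w / M            # jump sizes; they sum to one
    r = resid[unc]
    return np.array([jumps[r <= e].sum() for e in e_grid])

# Toy inputs (not from the slides): the third observation is censored.
F_hat = error_cdf(e_grid=[-1.0, 0.0, 1.0, 2.0],
                  resid=[-1.0, 0.0, 1.0, 2.0],
                  w_hat=[0.5, 1.0, 0.5, 1.0],
                  delta=[1, 1, 0, 1])
```

Observations with a small weight $\hat w(X_i, Y_i)$ are under-represented in the observed sample, so they receive a larger jump.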

For $G(t \mid x)$, we use the Beran (1981) estimator defined by
$$\hat G(t \mid x) = 1 - \prod_{Z_i \le t,\ \Delta_i = 0} \left(1 - \frac{W_i(x, h_n)}{\sum_{j=1}^{n} W_j(x, h_n)\, I\{Z_j \ge Z_i\}}\right)$$
where
• $Z_i = \min(C_i - T_i, Y_i - T_i)$ and $\Delta_i = I\{Y_i \le C_i\}$
• $W_i(x, h_n) = \dfrac{K\left(\frac{x - X_i}{h_n}\right)}{\sum_{j=1}^{n} K\left(\frac{x - X_j}{h_n}\right)}$ are the Nadaraya-Watson weights
• $K$ is a kernel function
• $h_n$ is a bandwidth sequence tending to 0 when $n \to \infty$
$$\Longrightarrow \quad \hat w(x,y) = \int_{t \le y} \{1 - \hat G(y-t \mid x)\}\, dF_{T|X}(t \mid x)$$
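A minimal sketch of the Beran estimator for the distribution of $C - T$, written directly from the product formula. The Epanechnikov kernel and the function names are assumptions made for illustration; the slides do not say which kernel is used.

```python
import numpy as np

def epanechnikov(u):
    # Assumed kernel choice, for illustration only.
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def nw_weights(x, X, h, kernel=epanechnikov):
    """Nadaraya-Watson weights W_i(x, h)."""
    k = kernel((x - np.asarray(X, dtype=float)) / h)
    total = k.sum()
    return k / total if total > 0 else np.zeros_like(k)

def beran_G(t, x, X, Z, delta, h, kernel=epanechnikov):
    """Estimate G(t | x); the product runs over censored points (delta == 0)."""
    Z = np.asarray(Z, dtype=float)
    delta = np.asarray(delta)
    W = nw_weights(x, X, h, kernel)
    surv = 1.0
    for i in np.flatnonzero((Z <= t) & (delta == 0)):
        at_risk = W[Z >= Z[i]].sum()
        if at_risk > 0:
            surv *= 1.0 - W[i] / at_risk
    return 1.0 - surv

# Toy check: identical covariates give equal weights, so the estimator
# reduces to a Kaplan-Meier-type estimator of the censoring distribution.
G_at_3 = beran_G(3.0, x=0.0, X=np.zeros(4), Z=[1.0, 2.0, 3.0, 4.0],
                 delta=[0, 1, 0, 1], h=1.0)
```

The roles of censored and uncensored points are swapped compared with the usual survival setting, because the target here is the law of the censoring gap $C - T$.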

The estimators of $m(\cdot)$ and $\sigma(\cdot)$ are given by
$$\hat m(x) = \frac{\sum_{i=1}^{n} \dfrac{W_i(x, h_n)\, Y_i\, \Delta_i}{\hat w(x, Y_i)}}{\sum_{i=1}^{n} \dfrac{W_i(x, h_n)\, \Delta_i}{\hat w(x, Y_i)}}, \qquad \hat\sigma^2(x) = \frac{\sum_{i=1}^{n} \dfrac{W_i(x, h_n)\, \Delta_i\, (Y_i - \hat m(x))^2}{\hat w(x, Y_i)}}{\sum_{i=1}^{n} \dfrac{W_i(x, h_n)\, \Delta_i}{\hat w(x, Y_i)}},$$
which extend the estimators in de Uña-Alvarez and Iglesias-Pérez (2008).
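Both estimators are ratios of kernel-weighted sums over the uncensored observations. The sketch below computes them at a single point $x$, assuming the values $\hat w(x, Y_i)$ are supplied as an array; the kernel choice and all names are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    # Assumed kernel choice, for illustration only.
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def weighted_location_scale(x, X, Y, delta, w_hat_at_x, h, kernel=epanechnikov):
    """Return (m_hat(x), sigma_hat(x)); w_hat_at_x[i] stands for w_hat(x, Y_i)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # The normalising constant of the Nadaraya-Watson weights cancels in the ratios.
    a = kernel((x - X) / h) * np.asarray(delta) / np.asarray(w_hat_at_x, dtype=float)
    denom = a.sum()
    if denom <= 0:
        raise ValueError("no uncensored observation in the kernel window")
    m = (a * Y).sum() / denom
    s2 = (a * (Y - m) ** 2).sum() / denom
    return m, np.sqrt(s2)

# Toy inputs: equal kernel weights, unit w_hat, third observation censored.
m_hat, s_hat = weighted_location_scale(
    x=0.0, X=np.zeros(4), Y=[1.0, 2.0, 100.0, 3.0],
    delta=[1, 1, 0, 1], w_hat_at_x=np.ones(4), h=1.0)
```

The censored response (100 in the toy data) does not enter either sum, since its indicator $\Delta_i$ is zero.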

Asymptotic results

Under some assumptions,
$$\hat F_\varepsilon(e) - F_\varepsilon(e) = \sum_{i=1}^{n} V(X_i, Y_i, Z_i, \Delta_i, e) + o_p(n^{-1/2})$$
uniformly in $e$.
$\Longrightarrow$ Weak convergence of the process $\sqrt{n}\,(\hat F_\varepsilon(e) - F_\varepsilon(e)) \to \Omega(e)$, where $\Omega$ is a zero-mean Gaussian process with a complicated covariance function.

Bandwidth selection

We want to determine the smoothing parameter $h_n$ which minimizes
$$\mathrm{MISE} = E\left[\int \left\{\hat F_{\varepsilon, h_n}(e) - F_\varepsilon(e)\right\}^2 de\right].$$
We consider a bootstrap procedure which extends that of Li and Datta (2001).

For $b = 1, \ldots, B$,
For $i = 1, \ldots, n$
Step 1 Generate $X^*_{i,b}$ from
$$\hat F_X(\cdot) = \sum_{j=1}^{n} \frac{\hat E[w(X,Y)]}{\hat E[w(X,Y) \mid X = \cdot\,]}\, I\{X_j \le \cdot\,, \Delta_j = 1\},$$
where
$$\hat E[w(X,Y) \mid X = \cdot\,] = \frac{\sum_{j=1}^{n} W_j(\cdot, g_n)\, \Delta_j}{\sum_{j=1}^{n} W_j(\cdot, g_n)\, \Delta_j / \hat w(\cdot, Y_j)}$$
and $g_n$ is a pilot bandwidth

Step 2 Generate $Y^*_{i,b}$ from
$$\hat F_{Y|X}(\cdot \mid X^*_{i,b}) = \sum_{j=1}^{n} \frac{\hat E[w(X,Y) \mid X = X^*_{i,b}]\; W_j(X^*_{i,b}, g_n)}{\hat w(X^*_{i,b}, Y_j) \left(\sum_{k=1}^{n} W_k(X^*_{i,b}, g_n)\, \Delta_k\right)}\, I\{Y_j \le \cdot\,, \Delta_j = 1\}$$
Step 3 Draw $T^*_{i,b}$ from the distribution $F_{T|X}(\cdot \mid X^*_{i,b})$.
• If $T^*_{i,b} > Y^*_{i,b}$, then reject $(X^*_{i,b}, Y^*_{i,b}, T^*_{i,b})$ and go to Step 1.
• Otherwise, go to Step 4.
Step 4 Select at random $V^*_{i,b}$ from $\hat G(\cdot \mid X^*_{i,b})$ calculated with $g_n$
Step 5 Define
• $Z^*_{i,b} = \min(Y^*_{i,b} - T^*_{i,b}, V^*_{i,b})$
• $\Delta^*_{i,b} = I\{Y^*_{i,b} - T^*_{i,b} \le V^*_{i,b}\}$.
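The control flow of Steps 1 to 5 for a single bootstrap observation can be sketched as a rejection loop. The four sampler arguments are hypothetical callables standing in for draws from the estimated covariate distribution, the estimated conditional distribution of $Y$, $F_{T|X}$ and $\hat G$; the toy samplers at the bottom only exercise the loop and are not the estimated distributions of the procedure.

```python
import numpy as np

def draw_one(rng, draw_x, draw_y, draw_t, draw_v, max_tries=10_000):
    """Return one bootstrap observation (X*, T*, Z*, Delta*)."""
    for _ in range(max_tries):
        x = draw_x(rng)            # Step 1: covariate
        y = draw_y(rng, x)         # Step 2: response given the covariate
        t = draw_t(rng, x)         # Step 3: truncation time
        if t > y:
            continue               # reject and go back to Step 1
        v = draw_v(rng, x)         # Step 4: censoring gap
        z = min(y - t, v)          # Step 5: observed duration
        d = int(y - t <= v)        #         and censoring indicator
        return x, t, z, d
    raise RuntimeError("rejection loop did not terminate")

# Toy samplers (assumptions for illustration only).
rng = np.random.default_rng(0)
draws = [
    draw_one(
        rng,
        draw_x=lambda r: r.uniform(0.0, 1.0),
        draw_y=lambda r, x: x + r.exponential(1.0),
        draw_t=lambda r, x: r.uniform(0.0, 3.0),
        draw_v=lambda r, x: r.exponential(2.0),
    )
    for _ in range(200)
]
```

Repeating this for $i = 1, \ldots, n$ gives one resample, and repeating the whole loop for $b = 1, \ldots, B$ gives the $B$ resamples.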

Compute $\hat F^*_{\varepsilon, h_n, b}$, the error distribution based on
• bandwidth $h_n$
• resample $\{(X^*_{i,b}, T^*_{i,b}, Z^*_{i,b}, \Delta^*_{i,b}) : i = 1, \ldots, n\}$.
The MISE can be approximated by its bootstrap version, and the bandwidth is selected as
$$\operatorname*{argmin}_{h_n}\; B^{-1} \sum_{b=1}^{B} \int \left\{\hat F^*_{\varepsilon, h_n, b}(e) - \hat F_{\varepsilon, g_n}(e)\right\}^2 de.$$
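Once the bootstrap curves are available on a common grid of $e$ values, the selection step is a grid search over candidate bandwidths. A minimal sketch, assuming the curves have been precomputed and stored per bandwidth; the data layout and function names are hypothetical, and the integral is approximated with a trapezoidal rule.

```python
import numpy as np

def integrated_sq_diff(e_grid, F1, F2):
    """Trapezoidal approximation of the integral of (F1 - F2)^2 over e_grid."""
    d2 = (np.asarray(F1) - np.asarray(F2)) ** 2
    return float(np.sum(0.5 * (d2[1:] + d2[:-1]) * np.diff(e_grid)))

def select_bandwidth(e_grid, boot_curves, pilot_curve):
    """boot_curves maps each candidate h to an array of shape (B, len(e_grid))."""
    crit = {
        h: float(np.mean([integrated_sq_diff(e_grid, c, pilot_curve) for c in curves]))
        for h, curves in boot_curves.items()
    }
    return min(crit, key=crit.get), crit

# Toy curves (not real estimates): the second candidate stays closer to the pilot.
e_grid = np.linspace(0.0, 1.0, 101)
pilot = e_grid.copy()
boot_curves = {0.1: np.tile(pilot + 0.10, (5, 1)),
               0.2: np.tile(pilot + 0.01, (5, 1))}
h_best, crit = select_bandwidth(e_grid, boot_curves, pilot)
```

The pilot estimate plays the role of the unknown $F_\varepsilon$ in the bootstrap world, which is why each bootstrap curve is compared with it rather than with the truth.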

Simulations

We consider four models, where $X$ and $\varepsilon$ are independent in each model:
• model $Y = X + \varepsilon$ where
  • $X \sim U([1.7321; 2])$
  • $\varepsilon \sim U([-\sqrt{3}; \sqrt{3}])$
• model $\log Y = X + \varepsilon$ where
  • $X \sim U([0; 1])$
  • $\varepsilon \sim N(0; 1)$
• model $Y = X^2 + X \varepsilon$ where
  • $X \sim U([2; 2\sqrt{3}])$
  • $\varepsilon \sim U([-\sqrt{3}; \sqrt{3}])$
• model $\log Y = X^2 + X \varepsilon$ where
  • $X \sim U([0; 1])$
  • $\varepsilon \sim N(0; 1)$
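The four designs can be generated as follows before any truncation or censoring is applied. This is only a sketch: the slides do not specify the laws of $T$ and $C$ here, so they are left out, and the interval $[2, 2\sqrt{3}]$ used for the third model is our reading of a garbled slide and should be treated as an assumption. The function name `simulate` is hypothetical.

```python
import numpy as np

def simulate(model, n, rng):
    """Draw (X, Y, eps) from one of the four simulation models (no truncation or censoring)."""
    s3 = np.sqrt(3.0)
    if model == 1:                       # Y = X + eps
        X = rng.uniform(1.7321, 2.0, n)
        eps = rng.uniform(-s3, s3, n)
        Y = X + eps
    elif model == 2:                     # log Y = X + eps
        X = rng.uniform(0.0, 1.0, n)
        eps = rng.standard_normal(n)
        Y = np.exp(X + eps)
    elif model == 3:                     # Y = X^2 + X * eps
        X = rng.uniform(2.0, 2.0 * s3, n)   # assumed interval, see the note above
        eps = rng.uniform(-s3, s3, n)
        Y = X ** 2 + X * eps
    elif model == 4:                     # log Y = X^2 + X * eps
        X = rng.uniform(0.0, 1.0, n)
        eps = rng.standard_normal(n)
        Y = np.exp(X ** 2 + X * eps)
    else:
        raise ValueError("model must be 1, 2, 3 or 4")
    return X, Y, eps

rng = np.random.default_rng(0)
X1, Y1, eps1 = simulate(1, 1000, rng)
X2, Y2, eps2 = simulate(2, 1000, rng)
```

The uniform law on $[-\sqrt{3}, \sqrt{3}]$ has mean 0 and variance 1, which matches the standardisation of $\varepsilon$ in the regression model.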
