

  1. ROBUST LOCATION AND SCALE ESTIMATION WITH CENSORED OUTCOMES
     Jerome H. Friedman, Stanford University

  2. MACHINE LEARNING
     $y = F(x, z)$
     $y$ = outcome variable
     $x = (x_1, \ldots, x_p)$ = observed predictor variables
     $z = (z_1, z_2, \ldots)$ = other variables
     Goal: estimate $E[y \mid x]$ given data $\{y_i, x_i\}_{i=1}^N$

  3. STATISTICAL MODEL
     $y = f(x) + s(x) \cdot \varepsilon$
     $f(x) = E[y \mid x]$ = location function
     $s(x) > 0$ = scale function
     $\varepsilon$ = random variable with $E[\varepsilon \mid x] = 0$
     Prediction: $\hat{y} = \hat{f}(x)$; $s(x) \cdot \varepsilon$ = "irreducible error" (unavoidable)

  4. REDUCIBLE ERROR
     $r(x) = E\,|f(x) - \hat{f}(x)|$
     $f(x)$ = optimal location (target) function
     $\hat{f}(x)$ = estimate based on training data and ML method
     ML goal: methods to reduce $r(x)$
     Statistics goal: methods to estimate $r(x)$
     Prediction error$(\hat{y})$ = Reducible + Irreducible
     Usually: Irreducible $s(x)$ >> Reducible $r(x)$

  5. USUAL ASSUMPTIONS
     $s(x) = s$ = constant (homoscedasticity)
     $\varepsilon \sim N(0, 1)$ (normality)

  6. HOMOSCEDASTICITY
     Requires $F(x, z) = f(x) + g(z)$ (additive), with $p(x, z)$ such that scale$[g(z) \mid x]$ = constant.
     Not very likely.

  7. [Figure: conditional densities $P(y \mid x)$, four panels all with location 5 and scales 0.1, 0.25, 0.5, and 1, plotted over $y \in [0, 10]$.]

  8. NORMALITY: not very likely either.
     Tukey: "small residuals ≃ normal, larger ones have heavier tails."
     ⇒ Heterodistributionality

  9. HETERODISTRIBUTIONALITY
     Robustness: choose a compromise $\bar{p}(\varepsilon)$ with good properties for other plausible $p(\varepsilon)$.
     $\bar{p}(\varepsilon)$ = normal: not good!

 10. LOGISTIC DISTRIBUTION
     $\varepsilon \mid x = (y - f(x)) / s(x)$
     $\bar{p}(\varepsilon) = \dfrac{e^{-\varepsilon}}{s\,(1 + e^{-\varepsilon})^2}$
     small $|\varepsilon|$ ∼ normal, large $|\varepsilon|$ ∼ exponential
     [Figure: logistic density over $\varepsilon \in [-5, 5]$, peaking at 0.25.]
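
For concreteness, a minimal sketch of this density in Python (the function name and the stable `cosh` form are my own choices; the slide only gives the formula):

```python
import numpy as np

def logistic_density(eps):
    """Standardized logistic density p(eps) = e^{-eps} / (1 + e^{-eps})^2,
    written in the numerically stable form 0.25 / cosh(eps / 2)^2."""
    return 0.25 / np.cosh(eps / 2.0) ** 2

eps = np.linspace(-5, 5, 201)
density = logistic_density(eps)   # peaks at p(0) = 0.25, as in the slide's figure
```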

 11. Prediction: $\hat{y} = \hat{f}(x)$
     $\hat{f}(x) = \arg\min_{f \in F} \sum_{i=1}^{N} \left[\varepsilon_i + 2\log(1 + e^{-\varepsilon_i})\right]$, $\quad \varepsilon_i = (y_i - f(x_i)) / s(x_i)$
     Each term is minimized at $f(x_i) = y_i$, independent of $s(x_i)$.
     $1/s(x_i)$ ∼ "weight" for obs $i$: controls the relative influence of obs $i$ on the fit.
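
A hedged sketch of this criterion as code (the function name is mine; `logaddexp` is used for numerical stability):

```python
import numpy as np

def location_loss(y, f, s):
    """Sum over observations of eps + 2*log(1 + e^{-eps}): the logistic
    negative log-likelihood criterion for f, with s held fixed."""
    eps = (y - f) / s
    return np.sum(eps + 2.0 * np.logaddexp(0.0, -eps))
```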

 12. Using an incorrect $s(x)$ to estimate $f(x)$ increases variance, not bias, so assuming $s(x)$ = constant is usually not too bad.

 13. ESTIMATE $\hat{s}(x)$
     (1) Improves $\hat{f}(x)$ in high-variance settings.
     (2) Important inferential statistic:
         (a) prediction interval ∼ accuracy of the $\hat{y}$-prediction; for the logistic: IQR$[y \mid f(x)] = 2\,s(x)\log 3$ (see the sketch below)
         (b) can affect decisions
     (3) Crucial with censoring.
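
A minimal sketch of such a prediction interval under the logistic model (the function name and default coverage are mine; coverage = 0.5 reproduces the IQR above):

```python
import numpy as np

def logistic_interval(f_hat, s_hat, coverage=0.5):
    """Central prediction interval for y: f_hat +/- s_hat * log(p / (1 - p)),
    the logistic quantile function; coverage = 0.5 gives half-width s*log(3)."""
    p = 0.5 + coverage / 2.0
    half = s_hat * np.log(p / (1.0 - p))
    return f_hat - half, f_hat + half
```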

 14. CENSORING ($y$-value partially known)
     Data: $\{y_i, x_i\}_1^N \to \{a_i, b_i, x_i\}_1^N$ with $a_i \le y_i \le b_i$
     $a_i = b_i = y_i$ ⇒ $y$-value known
     $a_i = -\infty$ ⇒ censored below $b_i$
     $b_i = \infty$ ⇒ censored above $a_i$
     otherwise: interval censored on $[a_i, b_i]$
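
One plausible array encoding of these four cases (values are illustrative, not from the talk):

```python
import numpy as np

a = np.array([2.5, -np.inf, 4.0, 1.0])    # lower bounds a_i
b = np.array([2.5, 3.0, np.inf, 2.0])     # upper bounds b_i
# i=0: exact value y = 2.5      (a = b = y)
# i=1: censored below 3.0       (a = -inf)
# i=2: censored above 4.0       (b = +inf)
# i=3: interval censored [1, 2]
```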

 15. Special case: $\{a_i, b_i\}$ → $K$ disjoint intervals (bins)
     $K = 2$ ⇒ usual binary logistic regression
     $K > 2$ ⇒ ordered multiclass logistic regression

 16. LIKELIHOOD
     $\Pr(a \le y \le b) = \dfrac{1}{1 + e^{-(b - f)/s}} - \dfrac{1}{1 + e^{-(a - f)/s}}$
     Depends strongly on both $f$ and $s$ ⇒ need to estimate both $f(x)$ and $s(x)$.

 17. [Figure: logistic probability densities with location $f = 5$, plotted over $y \in [-15, 15]$.]

 18. EXERCISE
     $[\hat{f}(x), \hat{s}(x)] = \arg\min_{(f, s) \in F} \sum_{i=1}^{N} L[a_i, b_i, f(x_i), s(x_i)]$
     $L(a, b, f, s) = -\log\left[\dfrac{1}{1 + e^{-(b - f)/s}} - \dfrac{1}{1 + e^{-(a - f)/s}}\right]$
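
A rough sketch of $L(a, b, f, s)$ in code (the function name is mine; `scipy.special.expit` is the logistic CDF, and infinite bounds work because `expit(-inf) = 0` and `expit(inf) = 1`):

```python
import numpy as np
from scipy.special import expit   # expit(u) = 1 / (1 + e^{-u})

def censored_loss(a, b, f, s):
    """Per-observation negative log-likelihood L(a, b, f, s) for
    interval-censored outcomes under the logistic model.
    Exactly observed values (a == b) need the density term instead;
    that case is omitted in this sketch."""
    return -np.log(expit((b - f) / s) - expit((a - f) / s))
```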

 19. PROBLEM
     $L(a, b, f, s)$ is NOT convex in $s$, but IS convex in $t = 1/s$ ⇒ solve for $t$.
     Constraint $t > 0$ ⇒ solve for $\log(t) = -\log(s)$.
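
Writing out the substitution (my own rearrangement of the slide's loss, with the unconstrained parameter $\theta = \log t = -\log s$):

```latex
% Loss after substituting t = e^{\theta} = 1/s:
L(a, b, f, \theta) = -\log\!\left[ \frac{1}{1 + e^{-e^{\theta}(b - f)}}
                                 - \frac{1}{1 + e^{-e^{\theta}(a - f)}} \right]
```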

 20. GRADIENT BOOSTED TREE ENSEMBLES (Ann. Statist. 29, 1189-1232, 2001)
     $\hat{f}(x) = \sum_{k=1}^{K_f} T_k^{(f)}(x)$
     $\log(\hat{s}(x)) = \sum_{k=1}^{K_s} T_k^{(s)}(x)$
     $T_k(x)$ = CART tree$(x)$

 21. ITERATIVE GRADIENT BOOSTING
     Start: $\hat{s}(x)$ = constant
     Loop {
         $\hat{f}(x)$ = tree-boost $f(x)$ given $\hat{s}(x)$
         $\log(\hat{s}(x))$ = tree-boost $\log(s(x))$ given $\hat{f}(x)$
     } until no change
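
A minimal runnable sketch of this alternating loop, assuming the *uncensored* logistic loss from slide 11 (the censored case would swap in the gradients of $L(a, b, f, s)$); all function names, the tree depth, learning rate, and iteration counts are my own illustrative choices, not part of the talk:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

def tree_boost_location(X, y, log_s, n_trees=100, lr=0.1, depth=3):
    """Boost f(x) with s(x) = exp(log_s) held fixed: each tree is fit to
    the negative gradient of sum_i [eps_i + 2 log(1 + e^{-eps_i})]."""
    s = np.exp(log_s)
    f = np.full(len(y), np.median(y))
    for _ in range(n_trees):
        eps = (y - f) / s
        neg_grad = (2.0 * sigmoid(eps) - 1.0) / s          # -dL/df
        f += lr * DecisionTreeRegressor(max_depth=depth).fit(X, neg_grad).predict(X)
    return f

def tree_boost_log_scale(X, y, f, n_trees=100, lr=0.1, depth=3):
    """Boost log s(x) with f(x) held fixed (log keeps s > 0, per slide 19)."""
    log_s = np.zeros(len(y))
    for _ in range(n_trees):
        eps = (y - f) / np.exp(log_s)
        neg_grad = eps * (2.0 * sigmoid(eps) - 1.0) - 1.0  # -dL/d(log s)
        log_s += lr * DecisionTreeRegressor(max_depth=depth).fit(X, neg_grad).predict(X)
    return log_s

def fit_location_scale(X, y, n_outer=5):
    """Alternate the two boosts; a fixed budget stands in for 'until no change'."""
    log_s = np.zeros(len(y))                               # start: s(x) = constant
    for _ in range(n_outer):
        f = tree_boost_location(X, y, log_s)
        log_s = tree_boost_log_scale(X, y, f)
    return f, np.exp(log_s)
```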

 22. DIAGNOSTICS
     (1) median$[y \mid f(x)] = f(x)$
     (2) median$[\,|y - f(x)| \mid s(x)\,] = s(x) \cdot \log 3$
     (3) $\#(y_i \in [u, v] \mid f_i \in [g, h]) = \sum_{f_i \in [g, h]} \left[\dfrac{1}{1 + e^{-(v - f_i)/s_i}} - \dfrac{1}{1 + e^{-(u - f_i)/s_i}}\right]$
     where $f_i = \hat{f}(x_i)$, $s_i = \hat{s}(x_i)$.
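
Diagnostic (3) as a short sketch (the function name is mine):

```python
import numpy as np
from scipy.special import expit

def expected_count(u, v, g, h, f_hat, s_hat):
    """Model-implied count of y in [u, v] among observations whose
    predicted location f_hat lies in [g, h]; compare to the observed count."""
    m = (f_hat >= g) & (f_hat <= h)
    return np.sum(expit((v - f_hat[m]) / s_hat[m])
                  - expit((u - f_hat[m]) / s_hat[m]))
```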

 23. California Housing Price Data (StatLib Repository)
     $N = 20460$ CA neighborhoods (1990 census block groups)
     $y$ = Median House Value
     $x$ = (Median Income, Housing Median Age, Ave No Rooms, Ave No Bedrooms, Population, Ave Occupancy, Latitude, Longitude)

 24. [Figure: CA housing prices, four panels: histogram of neighborhood median values; $y$-value vs. predicted location ($f$); $|y - f| / \log 3$ vs. predicted scale (log-log); predicted scale vs. predicted location.]

 25. [Figure: CA housing, location model. Relative importance (descending): income, long, lat, occup, rooms, bedrooms, age, pop. Partial dependence of location on income, lat, and long.]

 26. [Figure: CA housing, log(scale) model. Relative importance (descending): occup, lat, income, pop, long, rooms, bedrooms, age. Partial dependence of log(scale) on income, occup, and lat.]

 27. QUESTIONNAIRE DATA
     $N = 8857$, $p = 13$
     $y$ = AGE, interval censored into 7 bins:
     $[14, 17]$, $[18, 24]$, $[25, 34]$, $[35, 44]$, $[45, 54]$, $[55, 64]$, $[65, \infty)$

 28. $x$ = (Occupation, Type of Home, Sex, Marital Status, Education, Income, Lived in BA, Dual Incomes, Persons in Household, Persons in Household < 18, Householder Status, Ethnicity, Language)

 29. [Figure: AGE predictions. One bar chart of $\Pr(y \in \text{bin})$ per predicted-location bin, with counts: bin 1: 285, bin 2: 484, bin 3: 847, bin 4: 868, bin 5: 211, bin 6: 41, bin 7: 216.]

 30. [Figure: AGE data, predicted scale (log axis, 0.5 to 10) vs. predicted location (20 to 80).]

 31. [Figure: AGE location model. Relative importance (descending): occ, edu, status, Inc, Mstat, kids, Fsize, Lived. Partial dependence of location on occ, edu, and status.]

 32. [Figure: AGE scale model. Relative importance (descending): status, Mstat, occ, Fsize, Lived, kids, edu, dual. Partial dependence of log(scale) on status, Mstat, and occ.]

 33. Wine Quality Data (Irvine Repository)
     $N = 6497$ samples of Portuguese "Vinho Verde"
     $\tilde{y}$ = Quality: integer $(1, 2, \ldots, 10)$, median of at least 3 expert evaluations
     $\tilde{y} = k \Rightarrow y \in [k - 1/2,\ k + 1/2]$
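
This interval-censoring rule is easy to encode (illustrative values, not from the talk):

```python
import numpy as np

y_tilde = np.array([5, 6, 7])   # integer quality scores
a = y_tilde - 0.5               # lower bounds: y in [k - 1/2, k + 1/2]
b = y_tilde + 0.5               # upper bounds
```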

 34. $x$ = (Fixed acidity, Volatile acidity, Citric acid, Residual sugar, Chlorides, Free sulfur dioxide, Total sulfur dioxide, Density, pH, Sulfates, Alcohol)

 35. [Figure: Wine quality data. Bar charts of $\Pr(y \in \text{bin})$ for predicted-location bins 3 (685 obs), 4 (1103 obs), and 5 (267 obs).]

 36. [Figure: Wine quality data, predicted scale (0.15 to 0.45) vs. predicted location (4.5 to 7.0).]

 37. [Figure: Wine location model. Relative importance (descending): alcohol, volatile.acidity, total.sulfur.dioxide, residual.sugar, density, sulphates, chlorides, citric.acid. Partial dependence of location on alcohol, volatile.acidity, and total.sulfur.dioxide.]

 38. [Figure: Wine log(scale) model. Relative importance (descending): alcohol, density, free.sulfur.dioxide, fixed.acidity, residual.sugar, chlorides, pH, volatile.acidity. Partial dependence of log(scale) on alcohol, density, and free.sulfur.dioxide.]

 39. ORDERED MULTICLASS LOGISTIC REGRESSION
     $y_i \in \{C_1 < C_2 < \cdots < C_{K-1} < C_K\}$
     Interval censored: $\{a_i, b_i\}$ → $K$ disjoint intervals (bins) with boundaries $\{b_0, b_1, \ldots, b_K\}$, $b_0 = -\infty$, $b_K = \infty$
     bins ∼ classes with separating boundaries $b = \{b_1, b_2, \ldots, b_{K-1}\}$ unknown (overall location and scale arbitrary)
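
Combining this with the likelihood from slide 16 gives the class probabilities (my own spelling-out; the boundaries $b_k$ are estimated jointly with $f$ and $s$):

```latex
\Pr(y = C_k \mid x) = \frac{1}{1 + e^{-(b_k - f(x))/s(x)}}
                    - \frac{1}{1 + e^{-(b_{k-1} - f(x))/s(x)}},
\qquad k = 1, \ldots, K
```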
