Dependencies in Interval-valued Symbolic Data
Lynne Billard, University of Georgia, lynne@stat.uga.edu
Tribute to Professor Edwin Diday: Paris, France; 5 September 2007
Naturally occurring Symbolic Data -- Mushrooms
Patient Records – Single Hospital, Cardiology

Patient      Hospital     Age   Smoker   …
Patient 1    Fontaines    74    heavy
Patient 2    Fontaines    78    light
Patient 3    Beaune       69    no
Patient 4    Beaune       73    heavy
Patient 5    Beaune       80    light
Patient 6    Fontaines    70    heavy
Patient 7    Fontaines    82    heavy
⋮            ⋮            ⋮     ⋮
Patient Records by Hospital -- aggregate over patients

Patient      Hospital     Age   Smoker
Patient 1    Fontaines    74    heavy
Patient 2    Fontaines    78    light
Patient 3    Beaune       69    no
Patient 4    Beaune       73    heavy
Patient 5    Beaune       80    light
Patient 6    Fontaines    70    heavy
Patient 7    Fontaines    82    heavy
⋮            ⋮            ⋮     ⋮

Result: Symbolic Data

Hospital     Age        Smoker
Fontaines    [70, 82]   {light ¼, heavy ¾}
Beaune       [69, 80]   {no, light, heavy}
⋮            ⋮          ⋮
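The aggregation step on this slide can be sketched in a few lines of Python. This is only an illustration, not the speaker's software: the function name `aggregate_by_hospital` and the dictionary layout are assumptions, and the smoker categories are reported with relative frequencies for every hospital (the slide lists Beaune as a plain set).

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Classical records from the slide: (patient, hospital, age, smoker)
records = [
    ("Patient 1", "Fontaines", 74, "heavy"),
    ("Patient 2", "Fontaines", 78, "light"),
    ("Patient 3", "Beaune", 69, "no"),
    ("Patient 4", "Beaune", 73, "heavy"),
    ("Patient 5", "Beaune", 80, "light"),
    ("Patient 6", "Fontaines", 70, "heavy"),
    ("Patient 7", "Fontaines", 82, "heavy"),
]

def aggregate_by_hospital(rows):
    """Aggregate patients into one symbolic observation per hospital.

    Age becomes the interval (min, max); Smoker becomes a dictionary of
    category -> relative frequency (hypothetical layout, for illustration).
    """
    groups = defaultdict(list)
    for _, hospital, age, smoker in rows:
        groups[hospital].append((age, smoker))

    symbolic = {}
    for hospital, members in groups.items():
        ages = [age for age, _ in members]
        counts = Counter(smoker for _, smoker in members)
        n = len(members)
        symbolic[hospital] = {
            "age": (min(ages), max(ages)),
            "smoker": {cat: Fraction(c, n) for cat, c in counts.items()},
        }
    return symbolic

symbolic = aggregate_by_hospital(records)
for hospital, description in symbolic.items():
    print(hospital, description)
```

For the seven patients shown, Fontaines aggregates to the age interval (70, 82) with light ¼ and heavy ¾, matching the symbolic table on the slide.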
Histogram-valued Data -- Weight by Age Distribution:
Logical dependency rule

E.g., $Y_1$ = age, $Y_2$ = # children
Classical: $Y_a = (10, 0)$, $Y_b = (20, 2)$, $Y_c = (18, 1)$
Aggregation → Symbolic: $\xi = (10, 20) \times (0, 1, 2)$
[Figure: plot of $\xi$; axis ticks at 0, 1, 2 and at 10, 20]
I.e., $\xi$ implies that the classical value $Y_d = (10, 2)$ is possible.
Need rule $\nu$: {If $Y_1 < 15$, then $Y_2 = 0$}
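A small sketch of why the rule is needed: enumerate the classical points contained in ξ and drop those that violate ν. Restricting age to whole years is an assumption made only to keep the enumeration finite, and the name `rule_nu` is illustrative.

```python
from itertools import product

# Symbolic observation xi = (10, 20) x (0, 1, 2).
# Assumption: ages are taken on an integer grid so the points can be listed.
age_values = range(10, 21)
children_values = (0, 1, 2)

def rule_nu(age, children):
    """Rule nu from the slide: if Y1 < 15 then Y2 = 0."""
    return children == 0 if age < 15 else True

all_points = list(product(age_values, children_values))
valid_points = [point for point in all_points if rule_nu(*point)]

# (10, 2) lies inside xi but is excluded once the rule is applied.
print((10, 2) in all_points, (10, 2) in valid_points)
```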
Interval-valued data

u (Team)   Y1 = # At-Bats   Y2 = # Hits
1          (289, 538)       (75, 162)
2          (88, 422)        (49, 149)
3          (189, 223)       (201, 254)
4          (184, 476)       (46, 148)
5          (283, 447)       (86, 115)
6          (24, 26)         (133, 141)
7          (168, 445)       (37, 135)
8          (123, 148)       (137, 148)
9          (256, 510)       (78, 124)
10         (101, 126)       (101, 132)
11         (212, 492)       (57, 151)
12         (177, 245)       (189, 238)
13         (342, 614)       (121, 206)
14         (120, 439)       (35, 102)
15         (80, 468)        (55, 115)
16         (75, 110)        (75, 110)
17         (116, 557)       (95, 163)
18         (197, 507)       (52, 53)
19         (167, 203)       (48, 232)

ξ(2): $Y_2 = 149$ is not possible when $Y_1 < 149$
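Observation ξ(2)
[Figure: the rectangle for ξ(2) in the $(Y_1, Y_2)$ plane, cut by the line $Y_2 = \alpha Y_1$ into regions $R_1$, $R_2$, $R_3$, $R_4$; axis ticks at 49, 88, 149 and 88, 149, 422]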
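To see how a rule of the form $Y_2 \le \alpha Y_1$ interacts with an interval-valued observation, the sketch below classifies a few teams from the data slide. Taking α = 1 (hits cannot exceed at-bats) is an assumption, since the slide leaves α unspecified; the labels and the function name `classify` are illustrative.

```python
def classify(at_bats, hits, alpha=1.0):
    """Classify a rectangle (at_bats x hits) against the rule Y2 <= alpha * Y1.

    'valid'   : every point of the rectangle satisfies the rule
    'invalid' : no point of the rectangle satisfies the rule
    'partial' : the line Y2 = alpha * Y1 cuts through the rectangle
    """
    a1, b1 = at_bats
    a2, b2 = hits
    if b2 <= alpha * a1:   # largest Y2 is still below the line at the smallest Y1
        return "valid"
    if a2 > alpha * b1:    # smallest Y2 is above the line even at the largest Y1
        return "invalid"
    return "partial"

# Three teams copied from the interval-valued data slide.
teams = {
    1: ((289, 538), (75, 162)),
    2: ((88, 422), (49, 149)),
    6: ((24, 26), (133, 141)),
}
status = {u: classify(y1, y2) for u, (y1, y2) in teams.items()}
print(status)
```

Under this assumption team 2 comes out as partial: only part of its rectangle is consistent with the rule, which is the situation the figure for ξ(2) depicts.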
Dependencies between Variables – Interval-valued Variables

E.g., Regression Analysis

Dependent variable: $Y = (Y_1, \ldots, Y_q)$, e.g., $q = 1$
Predictor/regression variable: $X = (X_1, \ldots, X_p)$
Multiple regression model:
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + e$$
Error: $E(e) = 0$, $\mathrm{Var}(e) = \sigma^2$, $\mathrm{Cov}(e_i, e_k) = 0$, $i \ne k$.
Multiple Regression Model:
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + e$$
In vector terms,
$$Y = X\beta + e$$
Observation matrix: $Y' = (Y_1, \ldots, Y_n)$
Design matrix:
$$X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}$$
Regression coefficient matrix: $\beta' = (\beta_0, \beta_1, \ldots, \beta_p)$
Error matrix: $e' = (e_1, \ldots, e_n)$
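Model: $Y = X\beta + e$

Least squares estimator of $\beta$ is
$$\hat{\beta} = (X'X)^{-1} X'Y$$
When $p = 1$,
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
where
$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$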
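The $p = 1$ formulas translate directly into code. The sketch below is a plain-Python illustration; the function name `simple_ols` and the toy data are assumptions, not part of the talk.

```python
def simple_ols(x, y):
    """Least squares slope and intercept for one predictor (p = 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of beta_1-hat; the common 1/n factor cancels.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical toy data lying exactly on the line y = 1 + 2x.
beta0, beta1 = simple_ols([1, 2, 3, 4], [3, 5, 7, 9])
print(beta0, beta1)
```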
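Model:
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + e$$
Or, write as
$$Y - \bar{Y} = \beta_1 (X_1 - \bar{X}_1) + \cdots + \beta_p (X_p - \bar{X}_p) + e$$
where
$$\bar{X}_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij}, \qquad j = 1, \ldots, p.$$
Then,
$$\beta_0 \equiv \bar{Y} - (\beta_1 \bar{X}_1 + \cdots + \beta_p \bar{X}_p)$$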
$$Y - \bar{Y} = \beta_1 (X_1 - \bar{X}_1) + \cdots + \beta_p (X_p - \bar{X}_p) + e$$
Least squares estimator of $\beta$ is
$$\hat{\beta} = [(X - \bar{X})'(X - \bar{X})]^{-1} (X - \bar{X})'(Y - \bar{Y})$$
where
$$(X - \bar{X})'(X - \bar{X}) = \begin{pmatrix} \sum (X_1 - \bar{X}_1)^2 & \cdots & \sum (X_1 - \bar{X}_1)(X_p - \bar{X}_p) \\ \vdots & & \vdots \\ \sum (X_p - \bar{X}_p)(X_1 - \bar{X}_1) & \cdots & \sum (X_p - \bar{X}_p)^2 \end{pmatrix} = \left( \sum_i (X_{j_1} - \bar{X}_{j_1})(X_{j_2} - \bar{X}_{j_2}) \right), \quad j_1, j_2 = 1, \ldots, p$$
$$(X - \bar{X})'(Y - \bar{Y}) = \left( \sum_i (X_j - \bar{X}_j)(Y - \bar{Y}) \right), \quad j = 1, \ldots, p$$
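For $p = 2$ the centered estimator can be written out by hand, since the matrix of sums of squares and products is only 2 × 2. The following is a sketch under that restriction; the function name and the toy data are hypothetical.

```python
def centered_ols_two_predictors(x1, x2, y):
    """Centered least squares for p = 2 using the closed-form 2 x 2 inverse."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]

    # Entries of (X - Xbar)'(X - Xbar) and (X - Xbar)'(Y - Ybar).
    s11 = sum(u * u for u in d1)
    s22 = sum(u * u for u in d2)
    s12 = sum(u * v for u, v in zip(d1, d2))
    s1y = sum(u * v for u, v in zip(d1, dy))
    s2y = sum(u * v for u, v in zip(d2, dy))

    det = s11 * s22 - s12 * s12  # must be non-zero (predictors not collinear)
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - (b1 * m1 + b2 * m2)  # intercept recovered from the means
    return b0, b1, b2

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefficients = centered_ols_two_predictors(x1, x2, y)
print(coefficients)
```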
Interval-valued data: $Y_{uj} = [a_{uj}, b_{uj}]$, $j = 1, \ldots, p$, $u \in E = \{w_1, \ldots, w_u, \ldots, w_m\}$

Bertrand and Goupil (2000):

Symbolic sample mean is
$$\bar{Y}_j = \frac{1}{2m} \sum_{u \in E} (b_{uj} + a_{uj}),$$
Symbolic sample variance is
$$S_j^2 = \frac{1}{3m} \sum_{u \in E} (b_{uj}^2 + b_{uj} a_{uj} + a_{uj}^2) - \frac{1}{4m^2} \Big[ \sum_{u \in E} (b_{uj} + a_{uj}) \Big]^2$$
Notice, e.g., $m = 1$, $Y$ = Weight:
$Y_1 = [132, 138]$ → $\bar{Y}_1 = 135$, $S_1^2 = 3$
$Y_2 = [129, 141]$ → $\bar{Y}_2 = 135$, $S_2^2 = 12$
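The two weight intervals on this slide make a convenient check of the Bertrand and Goupil (2000) formulas. The sketch below implements them as written; the function names are illustrative.

```python
def symbolic_mean(intervals):
    """Symbolic sample mean: (1 / 2m) * sum of (a + b)."""
    m = len(intervals)
    return sum(a + b for a, b in intervals) / (2 * m)

def symbolic_variance(intervals):
    """Symbolic sample variance as given by Bertrand and Goupil (2000)."""
    m = len(intervals)
    first = sum(b * b + b * a + a * a for a, b in intervals) / (3 * m)
    second = sum(a + b for a, b in intervals) ** 2 / (4 * m * m)
    return first - second

# The two single-observation (m = 1) examples from the slide.
weight_1 = [(132, 138)]
weight_2 = [(129, 141)]
print(symbolic_mean(weight_1), symbolic_variance(weight_1))  # mean 135, variance 3
print(symbolic_mean(weight_2), symbolic_variance(weight_2))  # mean 135, variance 12
```

Both intervals have the same midpoint, so the means agree, while the wider interval has the larger variance: the internal spread of an interval contributes to the symbolic variance.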
Can rewrite
$$S_j^2 = \frac{1}{3m} \sum_{u \in E} \big[ (a_{uj} - \bar{Y}_j)^2 + (a_{uj} - \bar{Y}_j)(b_{uj} - \bar{Y}_j) + (b_{uj} - \bar{Y}_j)^2 \big]$$
Then, by analogy, for $j = 1, 2$, for interval-valued variables $Y_1$ and $Y_2$, the empirical covariance function $\mathrm{Cov}(Y_1, Y_2)$ is
$$\mathrm{Cov}(Y_1, Y_2) = \frac{1}{3m} \sum_{u \in E} G_1 G_2 [Q_1 Q_2]^{1/2}$$
$$Q_j = (a_{uj} - \bar{Y}_j)^2 + (a_{uj} - \bar{Y}_j)(b_{uj} - \bar{Y}_j) + (b_{uj} - \bar{Y}_j)^2$$
$$G_j = \begin{cases} -1, & \text{if } \bar{Y}_{uj} \le \bar{Y}_j, \\ \phantom{-}1, & \text{if } \bar{Y}_{uj} > \bar{Y}_j, \end{cases} \qquad \bar{Y}_{uj} = (a_{uj} + b_{uj})/2.$$
Notice, special cases:
(i) $\mathrm{Cov}(Y_1, Y_1) \equiv S_1^2$
(ii) If $a_{uj} = b_{uj} = y_j$ for all $u$, i.e., classical data,
$$\mathrm{Cov}(Y_1, Y_2) = \frac{1}{m} \sum (y_1 - \bar{Y}_1)(y_2 - \bar{Y}_2)$$
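The empirical covariance function and its two special cases can be checked numerically. This is a sketch of the formula exactly as stated on the slide; the helper names `q_term` and `g_sign` are illustrative, and the classical data in case (ii) are hypothetical.

```python
from math import sqrt

def interval_mean(intervals):
    """Symbolic sample mean of one interval-valued variable."""
    return sum(a + b for a, b in intervals) / (2 * len(intervals))

def q_term(a, b, mean):
    """Q_j for one observation; always non-negative."""
    return (a - mean) ** 2 + (a - mean) * (b - mean) + (b - mean) ** 2

def g_sign(a, b, mean):
    """G_j: -1 if the interval midpoint is at or below the mean, else +1."""
    return -1 if (a + b) / 2 <= mean else 1

def symbolic_cov(y1, y2):
    """Empirical covariance (1 / 3m) * sum of G1 * G2 * sqrt(Q1 * Q2)."""
    m = len(y1)
    mean1, mean2 = interval_mean(y1), interval_mean(y2)
    total = 0.0
    for (a1, b1), (a2, b2) in zip(y1, y2):
        sign = g_sign(a1, b1, mean1) * g_sign(a2, b2, mean2)
        total += sign * sqrt(q_term(a1, b1, mean1) * q_term(a2, b2, mean2))
    return total / (3 * m)

# Special case (i): Cov(Y1, Y1) equals the symbolic variance S1^2.
weights = [(132, 138), (129, 141)]
cov_self = symbolic_cov(weights, weights)

# Special case (ii): degenerate intervals a = b reduce to the classical covariance.
points_1 = [(1, 1), (2, 2), (3, 3)]
points_2 = [(2, 2), (4, 4), (6, 6)]
cov_classical = symbolic_cov(points_1, points_2)
print(cov_self, cov_classical)
```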
Back to Bertrand and Goupil (2000)

Sample variance is
$$S_j^2 = \frac{1}{3m} \sum_{u \in E} (b_{uj}^2 + b_{uj} a_{uj} + a_{uj}^2) - \frac{1}{4m^2} \Big[ \sum_{u \in E} (a_{uj} + b_{uj}) \Big]^2$$
This is total variance.
Take Total Sum of Squares = Total $SS_j = m S_j^2$
Then, we can show
Total $SS_j$ = Within Objects $SS_j$ + Between Objects $SS_j$
where
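$$S_j^2 = \frac{1}{3m} \sum_{u \in E} \big[ (a_{uj} - \bar{Y}_j)^2 + (a_{uj} - \bar{Y}_j)(b_{uj} - \bar{Y}_j) + (b_{uj} - \bar{Y}_j)^2 \big]$$
$$\text{Within Objects } SS_j = \frac{1}{3} \sum_{u \in E} \big[ (a_{uj} - \bar{Y}_{uj})^2 + (a_{uj} - \bar{Y}_{uj})(b_{uj} - \bar{Y}_{uj}) + (b_{uj} - \bar{Y}_{uj})^2 \big]$$
$$\text{Between Objects } SS_j = \sum_{u \in E} \big[ (a_{uj} + b_{uj})/2 - \bar{Y}_j \big]^2$$
with
$$\bar{Y}_{uj} = (a_{uj} + b_{uj})/2, \qquad \bar{Y}_j = \frac{1}{2m} \sum_{u \in E} (a_{uj} + b_{uj}).$$
Classical data: $a_{uj} = b_{uj} = Y_{uj}$ → Within Objects $SS_j = 0$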
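The decomposition Total SS = Within Objects SS + Between Objects SS can be verified numerically. The sketch below uses the first three pulse-rate intervals from the blood-pressure data in this deck; the function names are illustrative.

```python
def total_ss(intervals):
    """Total SS_j = m * S_j^2, using the Bertrand and Goupil variance."""
    m = len(intervals)
    sum_squares = sum(b * b + a * b + a * a for a, b in intervals) / 3
    sum_linear = sum(a + b for a, b in intervals)
    return sum_squares - sum_linear ** 2 / (4 * m)

def within_ss(intervals):
    """Within Objects SS_j: spread of each interval around its own midpoint."""
    total = 0.0
    for a, b in intervals:
        mid = (a + b) / 2
        total += (a - mid) ** 2 + (a - mid) * (b - mid) + (b - mid) ** 2
    return total / 3

def between_ss(intervals):
    """Between Objects SS_j: spread of the midpoints around the overall mean."""
    m = len(intervals)
    mean = sum(a + b for a, b in intervals) / (2 * m)
    return sum(((a + b) / 2 - mean) ** 2 for a, b in intervals)

pulse = [(44, 68), (60, 72), (56, 90)]
total = total_ss(pulse)
within = within_ss(pulse)
between = between_ss(pulse)
print(total, within + between)
```

For classical data, where every interval collapses to a point, `within_ss` returns 0 and the total reduces to the usual sum of squares.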
So, for $Y_j$, we have Sum of Squares SS,
Total $SS_j$ = Within Objects $SS_j$ + Between Objects $SS_j$
Likewise, for $(Y_i, Y_j)$, we have Sum of Products SP
Total $SP_{ij}$ = Within Objects $SP_{ij}$ + Between Objects $SP_{ij}$
Recall the empirical covariance function
$$\mathrm{Cov}(Y_1, Y_2) = \frac{1}{3m} \sum_{u \in E} G_1 G_2 [Q_1 Q_2]^{1/2}$$
with $Q_j = (a_{uj} - \bar{Y}_j)^2 + (a_{uj} - \bar{Y}_j)(b_{uj} - \bar{Y}_j) + (b_{uj} - \bar{Y}_j)^2$, $G_j = -1$ if $\bar{Y}_{uj} \le \bar{Y}_j$ and $G_j = 1$ if $\bar{Y}_{uj} > \bar{Y}_j$, and $\bar{Y}_{uj} = (a_{uj} + b_{uj})/2$.

The (Total) SP part can be replaced by
$$\text{Total SP} = \frac{1}{6} \sum_u \big[ 2(a - \bar{Y})(c - \bar{X}) + (a - \bar{Y})(d - \bar{X}) + (b - \bar{Y})(c - \bar{X}) + 2(b - \bar{Y})(d - \bar{X}) \big]$$
where, for each observation $u$, $[a, b]$ is the interval for $Y$ and $[c, d]$ is the interval for $X$.
How is this obtained?

Recall that for a Uniform distribution, $Y \sim U(a, b)$,
$$\mathrm{Var}(Y) = \frac{(b - a)^2}{12}$$
By analogy, we can show, for $u = 1, \ldots, m$ observations,
$$\text{Within SP} = \frac{1}{12} \sum_{u=1}^{m} (a_u - b_u)(c_u - d_u)$$
$$\text{Between SP} = \sum_{u=1}^{m} \Big( \frac{a_u + b_u}{2} - \bar{Y}_1 \Big) \Big( \frac{c_u + d_u}{2} - \bar{Y}_2 \Big)$$
where $Y_{u1} = [a_u, b_u]$, $Y_{u2} = [c_u, d_u]$,
$$\bar{Y}_1 = \frac{1}{m} \sum_{u=1}^{m} \frac{a_u + b_u}{2}, \qquad \bar{Y}_2 = \frac{1}{m} \sum_{u=1}^{m} \frac{c_u + d_u}{2}$$
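$$\text{Within SP} = \frac{1}{12} \sum_{u=1}^{m} (a_u - b_u)(c_u - d_u)$$
$$\text{Between SP} = \sum_{u=1}^{m} \Big( \frac{a_u + b_u}{2} - \bar{Y}_1 \Big) \Big( \frac{c_u + d_u}{2} - \bar{Y}_2 \Big)$$
Hence, from Total SP = Within SP + Between SP,
$$\text{Total SP} = \frac{1}{6} \sum_{u=1}^{m} \big[ 2(a_u - \bar{Y}_1)(c_u - \bar{Y}_2) + (a_u - \bar{Y}_1)(d_u - \bar{Y}_2) + (b_u - \bar{Y}_1)(c_u - \bar{Y}_2) + 2(b_u - \bar{Y}_1)(d_u - \bar{Y}_2) \big]$$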
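A quick numerical check that the closed-form Total SP equals Within SP + Between SP. The sketch uses the first three pulse-rate and systolic-pressure intervals from the blood-pressure data in this deck; the function names are illustrative.

```python
def interval_mean(intervals):
    """Mean of the interval midpoints."""
    return sum(a + b for a, b in intervals) / (2 * len(intervals))

def within_sp(y1, y2):
    """Within SP = (1/12) * sum of (a - b)(c - d)."""
    return sum((a - b) * (c - d) for (a, b), (c, d) in zip(y1, y2)) / 12

def between_sp(y1, y2):
    """Between SP: products of midpoint deviations from the overall means."""
    m1, m2 = interval_mean(y1), interval_mean(y2)
    return sum(((a + b) / 2 - m1) * ((c + d) / 2 - m2)
               for (a, b), (c, d) in zip(y1, y2))

def total_sp(y1, y2):
    """Total SP from the closed-form expression with weights 2, 1, 1, 2."""
    m1, m2 = interval_mean(y1), interval_mean(y2)
    return sum(2 * (a - m1) * (c - m2) + (a - m1) * (d - m2)
               + (b - m1) * (c - m2) + 2 * (b - m1) * (d - m2)
               for (a, b), (c, d) in zip(y1, y2)) / 6

pulse = [(44, 68), (60, 72), (56, 90)]
systolic = [(90, 110), (90, 130), (140, 180)]
total = total_sp(pulse, systolic)
parts = within_sp(pulse, systolic) + between_sp(pulse, systolic)
print(total, parts)
```

Setting both arguments to the same variable recovers the Total SS of that variable, which is a useful sanity check on the weights 2, 1, 1, 2.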
u     Y = Pulse Rate   X1 = Systolic Pressure   X2 = Diastolic Pressure
1     [44, 68]         [90, 110]                [50, 70]
2     [60, 72]         [90, 130]                [70, 90]
3     [56, 90]         [140, 180]               [90, 100]
4     [70, 112]        [110, 142]               [80, 108]
5     [54, 72]         [90, 100]                [50, 70]
6     [70, 100]        [134, 142]               [80, 110]
7     [72, 100]        [130, 160]               [76, 90]
8     [76, 98]         [110, 190]               [70, 110]
9     [86, 96]         [138, 180]               [90, 110]
10    [86, 100]        [110, 150]               [78, 100]
11    [63, 75]         [60, 100]                [140, 150]

Rule: X2 = Diastolic Pressure < Systolic Pressure = X1
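The rule can be screened against the table with a one-line test per observation: an observation is infeasible when even its smallest diastolic value is not below its largest systolic value. This is a minimal sketch; the name `can_satisfy_rule` is illustrative.

```python
# (systolic, diastolic) intervals for u = 1, ..., 11, copied from the table.
observations = {
    1: ((90, 110), (50, 70)),
    2: ((90, 130), (70, 90)),
    3: ((140, 180), (90, 100)),
    4: ((110, 142), (80, 108)),
    5: ((90, 100), (50, 70)),
    6: ((134, 142), (80, 110)),
    7: ((130, 160), (76, 90)),
    8: ((110, 190), (70, 110)),
    9: ((138, 180), (90, 110)),
    10: ((110, 150), (78, 100)),
    11: ((60, 100), (140, 150)),
}

def can_satisfy_rule(systolic, diastolic):
    """True if at least one point in the rectangle has diastolic < systolic."""
    return diastolic[0] < systolic[1]

infeasible = [u for u, (sys_iv, dia_iv) in observations.items()
              if not can_satisfy_rule(sys_iv, dia_iv)]
print(infeasible)  # → [11]
```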