RANDOM FORESTS IN THE EVALUATION OF THREAT FOR PEDESTRIAN ACCIDENTS IN TOWNS

Marzena Nowakowska
Faculty of Management and Computer Modelling, Kielce University of Technology
25-345 Kielce, Al. 1000-lecia Państwa Polskiego 7, POLAND
phone: +48 41 34 24 437, e-mail address: spimn@tu.kielce.pl

Research issue
• The random forest methodology in the investigation of road traffic safety: the decision tree bagging approach
• The examination of the influence of selected factors on threat on roads in towns: accident severity, i.e. the level of harm to a human casualty
• The subject: separate and mutually independent data sets concerning the parties found guilty:
  - pedestrians involved in road accidents: 1625 observations
  - drivers in pedestrian accidents: 2519 observations
• The data source: urban accidents in the Świętokrzyskie voivodship in the period 1999-2009
The methodology; decision tree
• The C4.5 algorithm for data split:
  - qualitative attributes
  - the information gain criterion
  - the identity test
• Probability leaves for the decision
• The variable importance ranking:
  $$\mathrm{Importance}(X_i) = \frac{AI(X_i)}{\max_j \{AI(X_j)\}}$$
  $$AI(X_i) = SSE(Y \mid t) - \sum_{b \in B(X_i \mid t)} SSE(Y \mid t_b)$$
  where $SSE(Y \mid t)$ and $SSE(Y \mid t_b)$ are the sums of square errors calculated before and after splitting the data set according to the variable $X_i$, respectively. Importance ∈ <0; 1>
• Tree structure (diagram): the tree root tests X1 (X1(z)=a), a node tests X2 (X2(z)=b), a further node tests X3 (X3(z)=c); every other branch ends in a leaf
• The rule read along the path from the root to the leaf d1:
  - X1(z)=a ∧ X2(z)=b ∧ X3(z)=c ⇒ Y = d1
  - with a probability leaf: X1(z)=a ∧ X2(z)=b ∧ X3(z)=c ⇒ {P(Y = d1), ..., P(Y = dk)}

The methodology; bootstrapped tree ensemble
Decision trees constituting the random forest (diagram: original data set, train data sets, decision trees)
1. Bootstrapping the sample
2. Random selection of observations from each decision stratum
3. The investigation of all attributes in an exhaustive search
4. Bagging: averaging posterior probabilities for the decision (accident severity)
5. The average posterior probability of fatal or serious accident status is the accident threat
6. Tree quality measures: sensitivity SNS, specificity SPC, proportion correctly classified PCC, harmonic mean of sensitivity and specificity HMSS
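The steps listed above can be illustrated with a minimal sketch. It is not the author's implementation: a one-level tree (a stump with probability leaves) stands in for a full C4.5 tree, the strata are only MA and FSA (the 15%+35% FA+SA proportions are not enforced), and all function names, parameter values and the toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction of an identity-test split: one branch per value of attr."""
    branches = defaultdict(list)
    for row, y in zip(rows, labels):
        branches[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - remainder

def fit_stump(rows, labels, positive="FSA"):
    """Exhaustive search over all attributes; leaves store P(positive | value)."""
    best = max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))
    counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        counts[row[best]][y] += 1
    leaves = {v: c[positive] / sum(c.values()) for v, c in counts.items()}
    prior = labels.count(positive) / len(labels)
    return best, leaves, prior

def predict_stump(stump, row):
    """Posterior probability of the positive class for one observation."""
    best, leaves, prior = stump
    return leaves.get(row[best], prior)  # unseen attribute value: use the prior

def stratified_bootstrap(rows, labels, per_class, rng):
    """Draw per_class observations with replacement from each decision stratum."""
    idx = []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        idx += rng.choices(members, k=per_class)
    return [rows[i] for i in idx], [labels[i] for i in idx]

def bagged_threat(rows, labels, query, n_trees=25, per_class=20, seed=0):
    """Bagging: average the posterior P(FSA) over trees fitted to bootstrap sets."""
    rng = random.Random(seed)
    probs = []
    for _ in range(n_trees):
        boot_rows, boot_labels = stratified_bootstrap(rows, labels, per_class, rng)
        probs.append(predict_stump(fit_stump(boot_rows, boot_labels), query))
    return sum(probs) / len(probs)

def make_rows(vhl, severity, count):
    """Hypothetical toy observations; Pd_BAC simply alternates N, Y, N, ..."""
    return [({"Vhl": vhl, "Pd_BAC": "Y" if i % 2 else "N"}, severity)
            for i in range(count)]

toy = (make_rows("HVh", "FSA", 8) + make_rows("HVh", "MA", 2)
       + make_rows("Car", "FSA", 3) + make_rows("Car", "MA", 7))
rows = [r for r, _ in toy]
labels = [y for _, y in toy]

threat_hvh = bagged_threat(rows, labels, {"Vhl": "HVh", "Pd_BAC": "N"})
threat_car = bagged_threat(rows, labels, {"Vhl": "Car", "Pd_BAC": "N"})
print(f"threat HVh: {threat_hvh:.2f}, threat Car: {threat_car:.2f}")
```

Drawing the same number of observations from each stratum gives every train set a 50%/50% decision split, which is one way to read the "random selection of observations from each decision stratum" step.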
Investigated variables and their domains
Accident severity Ac_Sv:
  - pedestrian data set: 61.1% MA, 31.8% SA, 7.1% FA
  - driver data set: 59.1% MA, 33.7% SA, 7.3% FA
  - the final domain of Ac_Sv: MA, FSA = FA + SA

Variables common to both data sets:
• Vehicle type Vhl: BMM (bicycle, moped, motorcycle), Car (passenger car), HVh (bus, truck), OVh (other vehicle types)
• City size City: Small, Medium, Big, Very big (scale marks: 20000, 50000, 100000)

Pedestrian data set:
• Pedestrian gender Pd_Gn: M, F
• Pedestrian intoxicated by alcohol/other substances Pd_BAC: Y, N
• Pedestrian behaviour Pd_Bhv: ImEnFrVh, ImEnBhVh, InCrRd*, PrRd*, OtPdBh*
• Pedestrian age group Pd_Ag: groups 1 to 7 (scale marks: 7, 14, 20, 35, 50, 65)

Driver data set:
• Driver gender Dr_Gn: M, F
• Driver intoxicated by alcohol/other substances Dr_BAC: Y, N
• Driver behaviour Dr_Bhv: ExSp, NtGvWay, InMnGrSp*, InMnSmSp*, InBhTwPd*, OtDrBv*
• Driving experience group Drvg: groups 1 to 7 (scale marks: 4, 8, 12, 16, 21, 26)

The statistics for the random forests

The guilty pedestrian forest
• The train bootstrapped sets:
  - number of observations: 774
  - fractions of the decision: 50% of MA, 50% of FSA (15%+35% of FA+SA)
• The number of observations in the test sets: from 1002 to 1029
• The measures of classification quality:

                 Train data sets              Test data sets
Specification    SNS   SPC   PCC   HMSS       SNS   SPC   PCC   HMSS
Min [%]          59.6  48.9  63.2  60.0       49.2  49.2  54.6  55.3
Max [%]          77.5  71.0  66.5  66.3       70.0  68.0  61.9  62.5
AMean [%]        65.9  64.5  65.2  64.7       57.9  58.6  58.2  57.8

HMean [%] (values in the order they survive in the source; their column assignment is not recoverable): 65.2, 64.7, 58.1, 57.7, 65.6, 63.8, 57.4, 58

The guilty driver forest
• The train bootstrapped sets:
  - number of observations: 1220
  - fractions of the decision: 50% of MA, 50% of FSA (15%+35% of FA+SA)
• The number of observations in the test sets: from 1529 to 1573
• The measures of classification quality:

                 Train data sets              Test data sets
Specification    SNS   SPC   PCC   HMSS       SNS   SPC   PCC   HMSS
Min [%]          53.2  56.7  60.0  59.3       48.8  49.8  53.5  54.0
Max [%]          64.6  68.1  63.2  62.8       63.5  69.9  61.6  60.0
AMean [%]        59.9  62.6  61.2  61.0       54.5  59.3  57.3  56.3
HMean [%]        59.7  62.3  61.2  61.0       54.0  58.8  57.2  56.3

AMean: the arithmetic mean, $\mathrm{AMean} = \frac{1}{n}\sum_i z_i$
HMean: the harmonic mean, $\mathrm{HMean} = \frac{n}{\sum_i (1/z_i)}$
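As an illustration of how the quality measures and their summary rows could be computed, here is a small sketch. It assumes FSA is treated as the positive class; the function name, the example labels and the per-tree values are hypothetical and are not taken from the study.

```python
from statistics import harmonic_mean, mean

def quality_measures(y_true, y_pred, positive="FSA"):
    """SNS, SPC, PCC and HMSS (as fractions) from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sns = tp / (tp + fn)              # sensitivity: share of FSA cases detected
    spc = tn / (tn + fp)              # specificity: share of MA cases detected
    pcc = (tp + tn) / len(pairs)      # proportion correctly classified
    # Harmonic mean of sensitivity and specificity (0 if both are 0).
    hmss = 2 * sns * spc / (sns + spc) if sns + spc > 0 else 0.0
    return {"SNS": sns, "SPC": spc, "PCC": pcc, "HMSS": hmss}

# Hypothetical predictions of a single tree on ten test observations.
y_true = ["FSA"] * 4 + ["MA"] * 6
y_pred = ["FSA", "FSA", "FSA", "MA", "MA", "MA", "MA", "FSA", "FSA", "FSA"]
measures = quality_measures(y_true, y_pred)

# Hypothetical SNS values [%] of two trees, summarised as in the tables.
sns_per_tree = [60.0, 40.0]
amean = mean(sns_per_tree)            # AMean = sum(z_i) / n
hmean = harmonic_mean(sns_per_tree)   # HMean = n / sum(1 / z_i)
print(measures, amean, hmean)
```

The harmonic mean never exceeds the arithmetic mean, which agrees with the driver forest table, where each HMean entry is at most the AMean entry above it.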
The diagnostic of input variables
[Figure: the guilty pedestrian forest. Importance (axis 0 to 1.2) and % of occurrence (axis 0% to 120%) for the variables Vhl, City, Pd_Gn, Pd_BAC, Pd_Ag, Pd_Bhv.]
[Figure: the guilty driver forest. Importance (axis 0 to 1.2) and % of occurrence (axis 0% to 120%) for the variables Vhl, City, Dr_Gn, Dr_BAC, Drvg, Dr_Bhv.]

Bagging for pedestrian caused accident threat
The average posterior probability of fatal or serious accidents by pedestrian age group and pedestrian behaviour
[Figure: four panels (Very big city, Big city, Medium city, Small city). x-axis: pedestrian age group <0; 7), <7; 14), <14; 20), <20; 35), <35; 50), <50; 65), >=65. y-axis: average posterior probability, 0.25 to 0.75. Series: ImEnFrVh, ImEnBhVh, InCrRd, PrRd.]
Bagging for pedestrian caused accident threat
The differences in the pedestrian caused accident threat level for passenger cars and heavy vehicles
[Figure: for each city size (Very big city, Big city, Medium city, Small city), a Car panel and a Heavy vehicle panel. x-axis: pedestrian age group <0; 7), <7; 14), <14; 20), <20; 35), <35; 50), <50; 65), >=65. y-axis: threat level, 0.2 to 0.8. Series: ImEnFrVh, ImEnBhVh, InCrRd, PrRd.]

Bagging for driver caused accident threat
The average posterior probability of fatal or serious accidents by driving experience and driver behaviour
[Figure: four panels (Very big city, Big city, Medium city, Small city). x-axis: driving experience group <0; 4), <4; 8), <8; 12), <12; 16), <16; 21), <21; 26), >=26. y-axis: average posterior probability, 0.35 to 0.75. Series: InBhTwPd, ExSp, NtGvWay, InMnGrSp, InMnSmSp.]