Data Cleansing for Predictive Models: The Next Level


Roosevelt C. Mosley, Jr., FCAS, MAAA. CAS Ratemaking & Product Management Seminar, Philadelphia, PA, March 19 – 21, 2012.

  1. Data Cleansing for Predictive Models: The Next Level
     Roosevelt C. Mosley, Jr., FCAS, MAAA
     CAS Ratemaking & Product Management Seminar
     Philadelphia, PA, March 19 – 21, 2012
     Experience the Pinnacle Difference!

  2. Data Cleaning
     Data cleansing – the next level
       • Why simple visualization may not tell the whole story
     Data homogeneity
       • There are distinct groups in your underlying data
     Multivariate data anomalies
       • Certain combinations of variables may point to data issues

  3. Data Cleansing – The Next Level

  4. Data Validation – One and Two Way Summaries

  5. Data Cleansing – The Next Level
     • One- and two-way data summarization and visualization are absolutely key in determining that individual factors are valid
     • In building predictive models, multivariate techniques consider independent variables simultaneously to account for dependencies
     • Data issues don't just exist in one and two dimensions; they can exist in n dimensions (where n is the number of individual elements)
     • Underlying causes: heterogeneity, data anomalies
     • Multivariate data exploration techniques can be used to address these issues
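To illustrate how a data issue can hide from one-way summaries, here is a minimal sketch that compares the observed count of each two-variable combination with the count expected if the variables were independent. The toy records, the variable values, the function name `rare_combinations`, and the 0.2 threshold are all assumptions made for illustration; they are not from the presentation.

```python
from collections import Counter
from itertools import product

# Hypothetical toy records: (construction, protection_class_band).
# Each individual value is common, so one-way counts look unremarkable.
records = (
    [("frame", "low")] * 48
    + [("masonry", "high")] * 48
    + [("frame", "high")] * 3
    + [("masonry", "low")] * 1
)

def rare_combinations(records, ratio_threshold=0.2):
    """Flag value pairs observed far less often than independence would predict."""
    n = len(records)
    joint = Counter(records)
    first = Counter(r[0] for r in records)
    second = Counter(r[1] for r in records)
    flagged = {}
    for a, b in product(first, second):
        expected = first[a] * second[b] / n   # count expected under independence
        observed = joint.get((a, b), 0)
        if observed < ratio_threshold * expected:
            flagged[(a, b)] = (observed, expected)
    return flagged

flagged = rare_combinations(records)
for combo, (observed, expected) in sorted(flagged.items()):
    print(combo, "observed:", observed, "expected:", round(expected, 1))
```

A flagged combination is only a candidate for review: it may be a genuine but unusual risk rather than a data error.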

  6. Data Homogeneity

  7. Clustering/Segmentation
     • Unsupervised classification technique
     • Groups data into a set of discrete clusters or contiguous groups of cases
     • Performs disjoint cluster analysis on the basis of Euclidean distances computed from one or more quantitative input variables and cluster seeds
     • Objects in each cluster tend to be similar; objects in different clusters tend to be dissimilar
     • Can be used as a dimension reduction technique
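The seed-based, Euclidean-distance procedure described above can be sketched as a plain k-means loop. This is an illustrative implementation under stated assumptions, not the software used in the presentation: the function names, the toy points, and the choice of seeds are hypothetical, and the inputs are assumed to be numeric and already on comparable scales, since Euclidean distance is sensitive to units.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centers):
    """Assign each point to the index of its nearest center (disjoint clusters)."""
    return [
        min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
        for p in points
    ]

def kmeans(points, seeds, max_iter=100):
    """Start from the given seeds and alternate mean updates and reassignment."""
    centers = [tuple(s) for s in seeds]
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the old center for an empty cluster
        new_labels = assign(points, new_centers)
        centers = new_centers
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
    return labels, centers

# Hypothetical standardized inputs (for example, scaled amount of insurance and age of home).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, seeds=[points[0], points[3]])
print(labels)
print(centers)
```

Categorical inputs such as billing option or construction would need a numeric encoding before a distance like this applies; how to encode them is a modeling choice this sketch leaves open.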

  8. Example
     • Homeowners dataset
     • Ran clustering analysis using key risk characteristics
       • Amount of insurance
       • Age of home
       • Billing option
       • Construction
       • Protection class
       • Deductible
       • Multiline
       • State/territory
     • Developed a predictive model on each cluster independently
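The slides do not say which model form was fitted on each cluster. As a deliberately simplified stand-in, the sketch below computes one-way pure premium relativities for a single rating variable, once for all records and once per cluster, so the segments can be compared side by side. The record layout, the function names, and the toy numbers are hypothetical, and a one-way relativity is not a substitute for a multivariate model.

```python
from collections import defaultdict

def relativities(rows, base_level):
    """rows: iterable of (level, exposure, loss).

    Returns each level's pure premium (loss / exposure) relative to base_level.
    Assumes base_level is present with positive exposure and nonzero loss.
    """
    exposure = defaultdict(float)
    loss = defaultdict(float)
    for level, exp, incurred in rows:
        exposure[level] += exp
        loss[level] += incurred
    base = loss[base_level] / exposure[base_level]
    return {lvl: (loss[lvl] / exposure[lvl]) / base for lvl in exposure}

def relativities_by_cluster(records, base_level):
    """records: iterable of (cluster, level, exposure, loss)."""
    records = list(records)
    by_cluster = defaultdict(list)
    for cluster, level, exp, incurred in records:
        by_cluster[cluster].append((level, exp, incurred))
    result = {"Total": relativities([r[1:] for r in records], base_level)}
    for cluster, rows in by_cluster.items():
        result[cluster] = relativities(rows, base_level)
    return result

# Hypothetical toy data: (cluster, bill plan, exposure, loss).
records = [
    (9, "Pay in Full", 100.0, 5000.0),
    (9, "Monthly", 100.0, 7000.0),
    (20, "Pay in Full", 100.0, 4000.0),
    (20, "Monthly", 100.0, 4400.0),
]
result = relativities_by_cluster(records, base_level="Pay in Full")
for segment, rels in result.items():
    print(segment, {k: round(v, 3) for k, v in rels.items()})
```

In this toy data the Monthly relativity differs between the two clusters, which is the kind of divergence from the total that a cluster-by-cluster comparison is meant to surface.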

  9. Cluster Distance Map

  10. Cluster Characteristics

      Segment      Coverage A   Age of Home   Percent without Multiline Discount
      Total           219,585            43   25%
      Cluster 20      267,415            25   15%
      Cluster 9       155,509            56   35%

  11. Billing Plan Indications
      [Chart: indicated relativity by bill plan (Monthly, Semi-Annual, Pay in Full, Mortgagee) for Total, Cluster 9, and Cluster 20; data labels range from 0.992 to 1.407.]

  12. Deductible Indications
      [Chart: indicated relativity by deductible (50, 100, 250, 500, 1000, 2500, 5000, 10000) for Total, Cluster 9, and Cluster 20.]

  13. Multi-Line Indications
      [Chart: indicated relativity for the Auto & Home multi-line discount for Total, Cluster 9, and Cluster 20; data labels 0.892, 0.907, and 0.942.]

  14. Multivariate Data Anomalies – Back to Cluster 1
      • Higher value homes
      • Segment of the business that is certainly heterogeneous – will behave differently than the overall population
      • Represents 0.2% of the overall exposures
      • Should we exclude data points such as these?

                                          Cluster 1     Total
      Average Amount of Insurance         $1,109,048    $219,585
      Average Age of Home                 19.6 years    42.7 years
      Percentage of Deductibles > $2500   19.9%         1.9%

  15. Outlier Data Points
      • Midpoint of the cluster: represents an average risk for that cluster
      • Risk that is slightly different than average, but still fits well with that cluster
      • Potential anomaly: data point fits best within this cluster but is actually an outlier for the cluster. This generally means it doesn't fit well anywhere.
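One way to operationalize the "fits best here but is still an outlier" idea is to compare each point's distance to its assigned cluster midpoint with the typical distance in that cluster. The sketch below is an assumption-laden illustration: the rule (distance greater than a multiple of the cluster's median distance), the default multiple of 3.0, and all names and toy values are hypothetical rather than taken from the presentation.

```python
import math
from statistics import median

def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flag_outliers(points, labels, centers, multiple=3.0):
    """Return indices of points unusually far from their own cluster midpoint.

    A point is flagged when its distance exceeds `multiple` times the median
    distance of the points assigned to the same cluster.
    """
    dists = [distance(p, centers[lab]) for p, lab in zip(points, labels)]
    flagged = []
    for k in range(len(centers)):
        member_dists = [d for d, lab in zip(dists, labels) if lab == k]
        if not member_dists:
            continue  # no points assigned to this cluster
        cutoff = multiple * median(member_dists)
        flagged.extend(
            i for i, (d, lab) in enumerate(zip(dists, labels))
            if lab == k and d > cutoff
        )
    return sorted(flagged)

# Hypothetical single cluster centered at the origin with one distant point.
centers = [(0.0, 0.0)]
points = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1), (2.0, 2.0)]
labels = [0, 0, 0, 0, 0]
outliers = flag_outliers(points, labels, centers)
print(outliers)
```

Flagged indices are candidates for verification, not automatic exclusions; the appropriate multiple depends on the data and would need to be tuned.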

  16. Data “Cleanup”
      • Reflect heterogeneity in final product (rating plan adjustments, underwriting, tiering)
      • Data verification
      • Modify data
      • Exclude data

