
De-anonymization of Insurance Applicants' Sensitive Information



  1. De-anonymization of Insurance Applicants' Sensitive Information Team 3: Jay Lee, Maxim Castaneda, Rosalie Dolor

  2. Goal: To enhance insurance companies' best practices by ensuring the clients' privacy rights in the information-gathering process.
     Benefits to stakeholders: NAIC: the results of the research will provide guidance for new insurance policies. US insurance companies: a higher probability of attracting potential clients. Insurance applicants: reduced privacy invasion.
     Dangers: the margin of error in risk predictions might increase instead of decrease.
     Success: gaining new knowledge from the data mining outputs about the insurance dataset; an increase in the number of applicants.

  3. Data Mining Problem: To de-anonymize the dataset by predicting sensitive variables using the insurance company dataset. This is a supervised, predictive-modelling task.
     Classification methods: decision trees, logistic regression, ensembles. Regression methods: decision trees, random forest.
     Sensitive variables: 1. Family History (5 variables): categorical and numeric, death-related. 2. Employment Information (6 variables): categorical and numeric, income-related.
     Main challenge: the specific variable labels are unknown (assumptions were made). A minimal sketch of this framing follows.
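As an illustration of this setup (not the team's actual code), the sketch below fits a scikit-learn tree with one sensitive column as the target: a classifier for the categorical Family_Hist_1 and a regressor for the numeric Employment_Info_1. The file name train.csv and the median imputation are assumptions.

```python
# A minimal sketch, not the team's code: a sensitive column becomes the
# outcome of a supervised tree model. "train.csv" and median imputation
# are assumptions about the pipeline.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

df = pd.read_csv("train.csv")
df = df.fillna(df.median(numeric_only=True))   # fill missing numeric values

def attack(data, sensitive, categorical=True):
    """Fit a tree that predicts one sensitive column from all the others."""
    X = pd.get_dummies(data.drop(columns=[sensitive]))
    y = data[sensitive]
    model = DecisionTreeClassifier() if categorical else DecisionTreeRegressor()
    return model.fit(X, y)

clf = attack(df, "Family_Hist_1")                         # categorical, levels 1-3
reg = attack(df, "Employment_Info_1", categorical=False)  # numeric, income-related
```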

  4. Data Description: Data source: Kaggle competition. Size: 59,381 rows and 128 columns (900+ columns after dummy-variable encoding). Each row is an insurance applicant.
     Pre-processing and exploration: filling in missing values, correlations, PCA.
     Partitioning: 70%-30% train/test split; 5-fold cross-validation for parameter tuning.
     [Figure: percentage distribution of applicants across risk levels 1-8.]
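The pre-processing and partitioning steps on this slide could look like the following sketch. The parameter grid is a placeholder; the slide does not say which hyperparameters the team tuned with the 5-fold cross-validation.

```python
# Sketch of the pre-processing and partitioning on slide 4, assuming the
# Kaggle column names and a local train.csv. The imputation strategy and
# the parameter grid are assumptions.
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")
df = df.fillna(df.median(numeric_only=True))       # fill missing values
X = pd.get_dummies(df.drop(columns=["Response"]))  # dummy variables: 900+ cols
y = df["Response"]                                 # risk level, 1..8

# 70%-30% train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# 5-fold cross-validation for parameter tuning
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [5, 10, 20], "min_samples_leaf": [1, 10, 50]},
    cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```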

  5. Methodology (Process Flow):
     1. Identify a confidential variable (CV) in the data and set it as the outcome variable.
     2. Use a decision tree to predict the CV from the remaining variables in the dataset. Is the CV predictable?
     3. If yes: based on the decision tree's results, identify which variables are most important in predicting the CV and drop them from the dataset, one by one, until the CV is not predictable anymore.
     4. If any other confidential variable remains in the data, repeat from step 1; otherwise the final result is the anonymized dataset.
     5. Try to predict risk level using the anonymized dataset and evaluate performance.
     Worked example: predicting/classifying Family_Hist_1 (levels 1, 2, 3) using the remaining variables showed it was predictable; based on the decision tree's important features, 4 variables (employment-related) were very predictive of Fam_Hist_1 and were dropped one by one. The loop is sketched below.
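One way to express the loop above in code, for a categorical CV. The predictability threshold (0.5 held-out accuracy here) is an assumption; slide 10 notes that the real threshold should be agreed with NAIC.

```python
# Sketch of the de-anonymization loop: drop the most predictive column,
# one at a time, until the confidential variable (CV) is no longer
# predictable. The 0.5 accuracy threshold is an illustrative assumption.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def anonymize(df, sensitive, threshold=0.5):
    df = df.copy()
    while True:
        # numeric predictors only, so feature names map 1:1 to columns
        X = df.drop(columns=[sensitive]).select_dtypes("number")
        if X.shape[1] == 0:
            return df                            # nothing left to drop
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, df[sensitive], test_size=0.3, random_state=42)
        tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
        if tree.score(X_te, y_te) < threshold:
            return df                            # CV not predictable: done
        top = X.columns[tree.feature_importances_.argmax()]
        df = df.drop(columns=[top])              # drop the top predictor
```

The loop would be run once per sensitive column, e.g. anonymize(df, "Family_Hist_1"), then again for the next confidential variable still in the data.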

  6. Predicting Family_Hist_1:
     1. Dimension reduction: PCA vs. random forest (RF) feature importances. Using the RF feature-importance ranking, a tuned decision tree with 20 variables can predict Family_Hist_1 with 77% accuracy.
     2. Performance metrics for the multi-class classification problem: averaged versions of accuracy, precision, recall and F1-score (the higher, the better).
     [Figure: feature importances.]
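A sketch of this step under the same assumptions as before: rank features with a random forest, keep the top 20, and score a decision tree with macro-averaged multi-class metrics. The tree's max_depth is a placeholder; the 77% accuracy on the slide came from the team's own tuning.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")
df = df.fillna(df.median(numeric_only=True))

X = df.drop(columns=["Family_Hist_1"]).select_dtypes("number")
y = df["Family_Hist_1"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# rank features with a random forest, keep the 20 most important
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
top20 = X.columns[rf.feature_importances_.argsort()[::-1][:20]]

# decision tree on the top-20 features (max_depth is a placeholder)
tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_tr[top20], y_tr)
pred = tree.predict(X_te[top20])

# macro-averaged precision, recall and F1 for the multi-class problem
prec, rec, f1, _ = precision_recall_fscore_support(y_te, pred, average="macro")
print(f"accuracy={accuracy_score(y_te, pred):.3f} "
      f"precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```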

  7. De-anonymizing Family_Hist_1

  8. De-anonymizing Employment_Info_1

  9. Evaluating the Risk Level Prediction
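The slide itself carried only results; as an illustration of what this evaluation could look like, the sketch below compares risk-level accuracy on the full data against the anonymized data, reusing the hypothetical anonymize() helper from slide 5's sketch. Slide 10 reports that this gap turned out to be insignificant.

```python
# Illustrative before/after comparison of risk-level prediction. Assumes
# the anonymize() helper defined in the earlier sketch is in scope.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def risk_accuracy(data):
    """Accuracy of a decision tree predicting the risk level (Response)."""
    X = data.drop(columns=["Response"]).select_dtypes("number")
    y = data["Response"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=42)
    tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    return tree.score(X_te, y_te)

df = pd.read_csv("train.csv")
df = df.fillna(df.median(numeric_only=True))
print("full data:      ", risk_accuracy(df))
print("anonymized data:", risk_accuracy(anonymize(df, "Family_Hist_1")))
```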

  10. Implementation/Production Considerations
     Notes: Assumptions about the variables should be checked with Prudential. Based on the results, dropping the identified sensitive variables (and the important variables related to them) is feasible and did not significantly affect the risk level prediction. The performance-metric threshold that declares a variable de-anonymized is critical and should be discussed with NAIC.
     Recommendations: Repeat the algorithm with the remaining identified sensitive variables. Re-evaluate the risk level modelling.
