Building a Fault-Tolerant ETL Pipeline for Claims CAFé Internship Presentation - Summer 2018 Kim Hammar Data Analytics Engineer khamq@allstate.com or kimham@kth.se August 30, 2018 Kim Hammar (DAE) Claims CAFé August 30, 2018 1 / 24
Extract Transform Load Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24
Extract Corrupt data that cannot be parsed Inconsistent Schema Wrong File paths Unexpected null values Duplicates Transform Code bugs Missing values Inconsistent data types Target database unavailable Load Permission error Scalability problems Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24
Outline Claims CAFé Background 1 Ensuring Data Quality 2 DEMO 3 Conclusion 4 Questions 5 Kim Hammar (DAE) Claims CAFé August 30, 2018 3 / 24
What is Claims CAFé? Claims CAFé: Claims (C)entral (A)nalytical (F)il(e) claimId policy participant persVeh ... B005 auto John Doe URK389 Audi ... Claims B007 home A.Svensson PTO291 Ford ... B003 life C.Strömbäck RNU999 Volvo ... B004 property G.Åsbrink WEM650 Benz ... B002 health L.Löfven KQO209 Tesla ... A one stop shop for claims analytics. Optimized for analytical use-cases by saving raw denormalized data in hadoop: Increase scalability Makes data processing easier: No more slow and complex SQL-Joins Kim Hammar (DAE) Claims CAFé August 30, 2018 4 / 24
What Actually Goes Into Building a Model Building models, reports & Visualizations Data Science & Analytics Data Engineering ETL processes, data transformations & data cleaning Kim Hammar (DAE) Claims CAFé August 30, 2018 5 / 24
Now: complex and slow complex and slow complex and slow complex and slow join to pull data join to pull data join to pull data join to pull data transform & clean data transform & clean data transform & clean data transform & clean data create derived fields create derived fields create derived fields create derived fields Classify Image Vehicle Loss Link Analysis Predict claim automation: Customer report Fraud Detection Deep CNN Model Deep FNN model & dashboards Claim2 10 b 0 b 1 repairable = 0 Customer2 5 11 x 0 , 1 x 1 , 1 total loss = 1 y ˆ . 10 3 . . x 0 , 2 x 1 , 2 no loss = 0 Customer1 Claimant 10 x 0 , 3 x 1 , 3 9 4 4 Claim1 10 Participant 1 Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24
Now: complex and slow complex and slow complex and slow complex and slow join to pull data join to pull data join to pull data join to pull data transform & clean data transform & clean data transform & clean data transform & clean data create derived fields create derived fields create derived fields create derived fields Classify Image Vehicle Loss Link Analysis Predict claim automation: Customer report Fraud Detection Deep CNN Model Deep FNN model & dashboards Claim2 10 b 0 b 1 repairable = 0 Customer2 5 11 x 0 , 1 x 1 , 1 total loss = 1 y ˆ . 10 3 . . x 0 , 2 x 1 , 2 no loss = 0 Customer1 Claimant 10 x 0 , 3 x 1 , 3 9 4 4 Claim1 10 Participant Future: complex and slow transform & clean data Claims join to pull data create derived fields CAFé Claim2 b 0 b 1 1 repairable = 0 10 x 0 , 1 x 1 , 1 total loss = 1 5 11 ˆ y Customer2 . . . x 0 , 2 x 1 , 2 no loss = 0 10 3 x 0 , 3 x 1 , 3 Customer1 Claimant 10 4 4 9 Claim1 10 Participant Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24
What Goes Into Making CAFé Making Coffee Processing: Prepare for consumption: Consume: Coffee Plants Harvest pulping the cherries, roasting, drink, filtering out bad ones, grinding, sell sorting&drying brewing Making Claims CAFé f ( x ) Sekante Q f ( x 0 + ε ) − f ( x 0 ) f () X Y P f ( x 0 + ε ) ε f ( x 0 ) Processing: Prepare for consumption: ε x x 0 Consume: x 0 + ε Raw Data Sources ETL fill in null values, flatten, build models, create derived fields, create different views, generate business reports data cleaning save to Hive Kim Hammar (DAE) Claims CAFé August 30, 2018 7 / 24
Claims CAFé Data Architecture Allstate Data Sources ADW NextGen . . . Canada Claims MDM Claims Synchronizes and reconciles data MDM from various sources into a single place Derived fields Data cleaning Data restructuring Claims CAFé Claims Serves the need of data scientists. CAFé Contains data optimized for analytics Machine Learning Analytics Reports 2 , 500 f ( x ) 2 , 000 Number of Q Sekante 1 , 500 fraudulent claims f ( x 0 + ε ) − f ( x 0 ) 1 , 000 during the P f ( x 0 + ε ) ε 500 third quarter: X f ( x 0 ) 0 ε x x 0 x 0 + ε [ − 4 , − 2) [ − 2 , 0) [0 , 2) [2 , 4) Kim Hammar (DAE) Claims CAFé August 30, 2018 8 / 24
What Is a Claim From A Data Pespective? Claim � �� � Claim Details Policy Details Participants Details Vehicle Details ... ���� ���� ���� ���� ���� . . . . . . . . . . . . . . . Figure: Some of the data that a single Claim in Claims CAFé contains. Kim Hammar (DAE) Claims CAFé August 30, 2018 9 / 24
Data Quality Assurance Data Quality is key to the success of Claims CAFé Examples of data quality issues: Null values in the wrong place Duplicates Missing values Inconsistent data types . . . Data Quality Issues Why does my Model not work? John Doe, Data Scientist Data Size Kim Hammar (DAE) Claims CAFé August 30, 2018 10 / 24
Data Quality Mechanisms Motivation Failures are the norm , they are not an exception Optimistic error estimate: P ( "CAFé error" ) = P ( "Data quality issue" ) ∪ P ( "Hadoop failure" ) ∪ P ( "Network failure" ) ∪ P ( "ClaimsMDM failure" ) ∪ P ( "CAFé code bug" ) ≈ 1 / 1000 = 0 . 001 CAFé stretch goal: near-real-time updates, say we pull data every 15 minutes = ⇒ 96 pulls per day = ⇒ 672 pulls per week failure probability per week: 0 . 001 × 672 = 0 . 672 = ⇒ a failure will happen on average every other week We want built-in mechanisms in the CAFé pipeline to detect and deal with errors before they affect end-users: Tests! Kim Hammar (DAE) Claims CAFé August 30, 2018 11 / 24
How to detect data quality issues? Know your data How many claims were recorded in Rhode Island 2016-2017? How many null claims exists in Idaho? What is the average number of claim par- Claim 1 Claim 2 Claim 3 Claim 4 Claim 5 ticipants? . . . . . . . . . . . . . . . . . . What is the average number of covered claimants in a claim? What is the maximum number of vehicles re- lated to a claim? Kim Hammar (DAE) Claims CAFé August 30, 2018 12 / 24
Anomaly Detection for Detecting Potential Data Issues (1/3) Spikes in the number of null values indicate a data issue. Indication of a data pull issue 16000 14000 12000 10000 Null Counts 8000 6000 4000 2000 0 06/18/18 06/18/18 06/19/18 06/19/18 06/20/18 06/20/18 06/21/18 06/21/18 Date Kim Hammar (DAE) Claims CAFé August 30, 2018 13 / 24
Anomaly Detection for Detecting Potential Data Issues (2/3) Number of claims in Rhode Island over time 14 , 000 12 , 000 Count 10 , 000 Why did this drop happen? 8 , 000 6 , 000 04.2017 05.2017 06.2017 07.2017 08.2017 09.2017 10.2017 11.2017 12.2017 01.2018 02.2018 03.2018 Date Kim Hammar (DAE) Claims CAFé August 30, 2018 14 / 24
Anomaly Detection for Detecting Potential Data Issues (3/3) Comparing a single claim statistics with the average 39 40 39 Parties in this claim vs average on 3 Count is this valid data or a bug? 20 4 3 2 1 1 0 CoverageClaimants PersVeh ClaimsParty Average cross entire dataset Actual value for this claim Kim Hammar (DAE) Claims CAFé August 30, 2018 15 / 24
Regression Tests Background (1/2) What is regression testing? Regression tests verify modifications of a program or data. If the modification fail the tests, the program can regress back. I just refactored the CAFé pipeline, how do I know that I didn’t break anything? I just pulled new data into CAFé, how do I know I did not introduce data quality issues? Why do we want regression tests? It increases the confidence in making code and data changes We can detect bugs before they bother end-users We can avoid unnecessary work: If the tests fail we can abort early and save time. Kim Hammar (DAE) Claims CAFé August 30, 2018 16 / 24
Regression Tests Background (2/2) Naive ETL Pipeline For Updating Claims CAFé: Load & Transform Claims Claims MDM CAFé Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24
Regression Tests Background (2/2) Naive ETL Pipeline For Updating Claims CAFé: Load & Transform Claims Claims MDM CAFé A More Robust ETL Pipeline: Tests passed Claims Regression Load & Claims MDM Tests Transform CAFé Tests failed Email maintainers Generate report Missing columns:... Null values:... Anomalies:... . Abort ETL . . Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24
Recommend
More recommend