Thresholds & Tasks Relationships • Link between entities used to visualize the interconnectedness between records or hierarchical structures within a dataset Smith Household 1122 Main Street Reno NV, 89522 Mike Smith Deb Becker-Smith Mikey Smith EID: 4569872 EID: 1456984 EID: 9734546 Confidential Page: 29
Thresholds & Tasks Validation Lists • Anonymous Values are usually fake or over used values • Removing them enhances match quality and search speed • “BABYBOY” or (999) 999-9999 are considered NULL • Equivalencies (or nicknames) create lists of interchangeable words • Greater flexibility in searching and comparison • Names (Bob = Robert) and abbreviations (HWY = Highway) Confidential Page: 30
Thresholds & Tasks Anonymous Values • Robert · Noonan · Jillian · Delacroix · Duk · Yussef Rajeshi · Katsumoto · Jorge · Twin · Hernandez Katai · John · M’Tembe · Mikel · Hiroshi · Larssen Marija · Ozols · Unknown · Chi · D’Esopo · Fergus O’Shea · Maria · Obidos · Wei · James · Vladamir Jian · Baby · Pham · Oliver · Kensington · Kimball Paolo · Morgan · Boy · Silvia · McDonald · Woo Hiram · Fiona · Marta · Alexi · Zhang · Krishna NoLastName · Isaac · Gabrielle · Test · Vargas Tomas · Hubert · Paul · Nguyen · Hussein · Liam Confidential Page: 31
Thresholds & Tasks MDM Data Flow Adds/Updates Parse Data to Derive Compare Link or Create are Consumed Segments Data Records Tasks Adds and Updates Each record is Derived data takes Pairs of records The results of the are called Puts. assigned an raw information compared using a comparisons are identifier in the and creates probabilistic measured against Puts can come via MEMHEAD table. standardized method. a set of thresholds the API, Web versions of the that determine Services, Message Attributes are sent data for easier Values are whether to link, Brokers, or as a to the appropriate comparison and weighted by ignore, or create a bulk process. table or UNL file, organizes records frequency, tasks Names go to into buckets for similarity, or their MEMNAME, etc. faster searching. relationship to other data, then the score is tallied. Confidential Page: 32
Thresholds & Tasks Probabilistic Matching Framework • Pairs of records are compared to each other • Key attributes are assessed and an overall match score is issued 13.8 Record X Record Y Ron R. Jones Name: Ronald Jones Name: (773) 826-2825 (773) 826-2852 Phone: Phone: 972-41-2318 972-41-2318 ID #: ID #: Confidential Page: 33
Algorithm Details
Algorithm Details Standardization • Derive ⇒ Organize ⇒ Search ⇒ Compare ⇒ Score Standardize Bucket Compare Standardization takes Bucketing organizes Comparison measures the raw data from a records with similar the probability that record and cleans it to data to optimize the candidate records make analysis easier. search process. (from buckets) match. The resulting string is Bucket hash indexes Scores are based on stored in the are stored in the pre-defined weights in mpi_memcmpd mpi_membktd table. lookup tables. table. Confidential Page: 35
Algorithm Details Standardization Functions • Standardization provides a wide variety of functions: • Case Conversion : Karen Jones ⟹ JONES:KAREN • Truncation of Values : (312) 832-1212 ⟹ 8321212 • Anonymous Values : Remove (000) 000-0000 • Validating Data Format : Rejects e-mail addresses without an @ • Equivalency Translation : Apartment ⟹ APT • Standardization functions are designed to work with specific data types, such as Names, Addresses, Dates, Phones, etc. Confidential Page: 36
Algorithm Details Standardized Data: MEMCMPD Table Original Data Standardized Data Maria R. Fontana FONTANA:MARIA:R::. 391-20-1923 391201923 (832) 812-1193 8121193 (832) 811-2915 8112915 mfont91@us.ibm.com MFONT91USIBM 1973-09-21 19730921 928 West Kingston Court N-928:S-W:S-KINGSTON:S-CT:S- Chicago, IL 60617 CHICAGO:S-IL:N-60617 Attribute delimiter OR condition FONTANA:MARIA:R::.^391201923^8121193~8112915^MFONT91USIBM^ 19730921^N-928:S-W:S-KINGSTON:S-CT:S-CHICAGO:S-IL:N-60617 Confidential Page: 37
Algorithm Details Bucketing • Derive ⇒ Organize ⇒ Search ⇒ Compare ⇒ Score Standardize Bucket Compare Standardization takes Bucketing organizes Comparison measures the raw data from a records with similar the probability that record and cleans it to data to optimize the candidate records make analysis easier. search process. (from buckets) match. The resulting string is Bucket hash indexes Scores are based on stored in the are stored in the pre-defined weights in mpi_memcmpd mpi_membktd table. lookup tables. table. Confidential Page: 38
Algorithm Details Bucketing: mpi_membktd • Buckets act like a table index, except that they reference data in multiple tables • Stored as 64-bit “hash” integers • Used to select candidates for comparison • Speeds searching by enabling sub-set search • Cast a wide net by taking abstract view of data Confidential Page: 39
Algorithm Details Bucket Tokens Token is single piece of data (eg, field or sub-string) • Normally an “or” condition exists, so more criteria broadens a search • To create “and” searching, modify min and max tokens • Tokens can be controlled at two levels: Bucket Function & Bucket Group • JOHN M SMITH 2 Name tokens Bucket Function = 3 tokens 0 DOB tokens JOHN JOHN + + Patient Name M SMITH Bucket Group (MAX: 2, MIN: 1) SMITH JOHN + + Name + DOB Bucket Function M DOB (MAX: 2, MIN: 2) SMITH M Birth Date + + DOB ⇾ 1 token 1 Name token (MAX: 1, MIN: 0) DOB DOB optional 1 DOB token Confidential Page: 40
Algorithm Details Examples of Bucket Generation Types Generation Type What Happens YYYMMDD Bucket records according to Full Date, so there is no transformation MMDD Use MMDD portion of date token to group dates by anniversary (19280912 ⟹ 0912) Metaphone: Convert tokens to key sound markers. Industry standard phonetic methodology Phonetic Normphone: Use Initiate’s proprietary phonetic method, works best with Western languages Arabic Name: Apply phonetics to English translation of Arabic names Equivalence Apply a String Equivalency (Nickname or Abbreviation) from a lookup table (Jim ⟹ James) Equivalence & Phonetics Blend phonetic and equivalence conversions against tokens. (ED ⟹ EDWARD ⟹ ETWRT) Sorted Sort the contents of numeric tokens (3014201324 ⟹ 0011223344, 52431 ⟹ 12345) nGRAM Sequences Bucket chunks of data with length N (N = 4 against “6345111” ⟹ 6345 3451 4511 5111) Confidential Page: 41
Algorithm Details Bucket Generation Types Generation types control how buckets • (723) 445-2983 are conceived Format data for optimized searching • PHONE Standardization Make buckets more accepting of • discrepancies Leverage standardized version of 4452983 • data Why use bucket generation types? • Attribute Bucket (Sorted) Helps work around typos • Makes searches less reliant on exact • matches 2344589 Casts a wide net, includes nicknames • & key sounds Confidential Page: 42
Algorithm Details Phonetics Normphone • Remove all vowels (unless vowel is first letter) • S, Z, SH, CE → S (drop if final sound of word) • J, G → J if first letter; J, G → K otherwise • T, D, DT, TH, DTH → T; M → N; V → F • C → S if surrounded by vowels (eg, “ICE”); C → K otherwise • Truncate at 6 chars: KRISHNAMOORTHY → KRSNNR • X → S if at the beginning:to “S”, X → K otherwise • Drop H (unless it’s the first letter) Confidential Page: 43
Algorithm Details Example FNTN FNTN MR Name FONTANA:MARIA:R::. + + + MR R R SSN 391201923 FNTN 011223399 1112389 + 198309 Phone 8121193 MR R + + DOB 19830921 198309 198309 Confidential Page: 44
Algorithm Details Comparison • Derive ⇒ Organize ⇒ Search ⇒ Compare ⇒ Score Standardize Bucket Compare Standardization takes Bucketing organizes Comparison measures the raw data from a records with similar the probability that record and cleans it to data to optimize the candidate records make analysis easier. search process. (from buckets) match. The resulting string is Bucket hash indexes Scores are based on stored in the are stored in the pre-defined weights in mpi_memcmpd mpi_membktd table. lookup tables. table. Confidential Page: 45
Algorithm Details Matching with Comparison Functions Exact Match : Do the two values Comparison • Data Type match, Yes or No? Method(s) to Use Starts With : Do the two values • Name All start with same initial (digit)? Exact Match Edit Distance : How many edits • Birth Date Edit Distance does it take to make the values match? Passport # Exact Match Phonetics : Do the values have • the same key sound markers? Gender Exact Match Equivalency : Could one of the • Exact Match values be a nickname or alias? Phone # Edit Distance Confidential Page: 46
Algorithm Details Edit Distance: Measuring Similarity • When two values are exactly the same, distance=0 • A degree of distance is added for three reasons: • Transposition: Stein versus Stien • Add/Remove: Margret versus Margaret • Replace: 281990 versus 281980 3922019 Phillip 1982-03-21 Hashida versus versus versus versus 3292089 Philippe 1982-08-24 Hashida Confidential Page: 47
Algorithm Details False Positive Filter (FPF) • Probabilistic comparisons are optimistic when scoring (reward any correlation) • May lead to False Positives: twins, Jr. & Sr., or siblings that scored highly • FPF increases accuracy of matching and linking by issuing penalties when pre-defined conditions are met • Penalty forces False Positive to fall below AL threshold and into Task • FPFs are deployed as comparison functions within Algorithm • Run in parallel to normal probabilistic matching process, but unlike other comparison functions, FPFs issue penalties instead of awarding points • Examine four key elements: Name; Gender; Birth Date; SSN Confidential Page: 48
Algorithm Details FPF Example: AL=15 Record X Record Y Score John Michael Smith Johnny M. Smith 5.2 FPF Penalty Name Name: partial match Gender Male Male 0.3 Gender: exact 828 W. High St. 828 W. High St. 5.8 Address Camden, ME 04843 Camden, ME 04843 DOB: missing (207) 236-9132 (207) 236-9132 5.2 Phone SSN: missing SSN 771-29-1821 0.0 Total: 16.5 Total (FPF Adjusted): 14.3 Confidential Page: 49
Generating Weights
Generating Weights What is Derived Data? • Original source data is abstracted during the derivation process: • Data parsed to attribute-based segments (eg, Names to mpi_memname) • Core data is standardized and stored in the mpi_memcmpd table • Bucket hashes are generated and stored in the mpi_membktd table • Matching and linking data is converted to binary form in the bxm files • Derivation employs the use of several techniques when abstracting data • Standardized data will nullify any anonymous values that are defined • Buckets will apply equivalencies, anonymous values, and phonetics • Attributes are tokenized, focusing on the individual words and numbers Confidential Page: 51
Generating Weights Comparison Strings & Bucket Hashes • Comparison strings are carat-delimited values stored in mpi_memcmpd: • Carat order is determined by Comparison Role algorithm components • Format of comparison strings is determined by the Standardization functions • Historical and alternate values are sub-delimited by tildes • Comparison strings act like de-normalized data (all critical information is kept) • Bucket hashes are 64-bit integers that are stored in mpi_membktd: • Bucket hashes are assigned according to bucketing groups in the algorithm • The bucket ‘Role’ number is prepended to the value before being hashed • Each member record could have several bucket hashes Confidential Page: 52
Generating Weights Derivation Process Original Parse Data to Create Comparison Assign Bucket Compile Data Segments Strings Hashes Binary Files Initiate receives data Attributes are sent to Standardized data is Hashes are assigned MEMCMPD and in a wide variety of the appropriate table carat-delimited and to each record and MEMBKTD are formats and this data or UNL file, Names go sent to MEMCMPD sent to MEMBKTD converted to binary & must be normalized to MEMNAME, etc. table or UNL file table or UNL file stored in bxm files Confidential Page: 53
Generating Weights Weight Generation • Weights are scores that represent the confidence that a single value can uniquely identify a record • Calculated for each client, based on their data • Stored in lookup tables: when data is compared, scores are pulled • Weight generation is resource intensive • Weight generation should be run when server traffic is low • Weight generation for large datasets (>10 million) may take a few days to process Confidential Page: 54
Generating Weights Three Kinds of Weights • Frequency-based Weights : provide a score based on how often a value appears within overall population. Common values (like John) have a low score, rare values (like Chitsumungo) have a high score. • Edit Distance Weights : measure the similarity between two values. Eg, “Gordon” vs. “Gorton” has a distance of 1 edit. Exact match has highest score, but each edit lowers score by a certain degree. • Parameterized Weights : These weights control maximum caps on scores, extra credit points, and penalties for variance. Eg, there is a maximum weight for Full Name that ensures that the name does not generate a disproportionate score. Confidential Page: 55
Generating Weights Weight Tables Table Description mpi_wgthead Holds core definitions of weights, including comparison specification string and weight type mpi_wgt1dim Holds weight values for comparison functions that have a single comparison attribute (eg, SSN, DOB) mpi_wgt2dim Holds weight values for appropriate comparison functions that use two attributes (eg, Eye + Hair Color) Holds weight values for appropriate comparison functions that use three attributes (eg, Zip Code + mpi_wgt3dim Address + phone) Holds weight values for the False Positive Filter, which uses four separate attributes to control situations mpi_wgt4dim like Twins or Jr’s & Sr’s who are mistakenly linked Holds common string weight values based purely on frequency (weights for people's names and mpi_wgtsval attributes that use a simple “match or do not match” like gender) mpi_wgtnval Holds common numeric weight values based purely on frequency (date information like birth year) Holds anonymous values established by the Anonymous Value Utility (not a weight table per se, but mpi_stranon commonly associated with weights because of role that anonymous values play in measuring frequency) Confidential Page: 56
Generating Weights One-Dimensional Weights: wgt1dim Comparison Type Index Weight Comparison Type Index Weight Comparison Type Index Weight CMPID-SSN-DIST 0 0 CMPID-DOB-DIST 0 0 CMPID-AXP-1DIM 0 0 CMPID-SSN-DIST 1 556 CMPID-DOB-DIST 1 0 CMPID-AXP-1DIM 1 419 CMPID-SSN-DIST 2 403 CMPID-DOB-DIST 2 44 CMPID-AXP-1DIM 2 504 CMPID-SSN-DIST 3 315 CMPID-DOB-DIST 3 -176 CMPID-AXP-1DIM 3 596 CMPID-SSN-DIST 4 177 CMPID-DOB-DIST 4 -285 CMPID-AXP-1DIM 4 723 CMPID-SSN-DIST 5 31 CMPID-AXP-1DIM 5 860 CMPID-SSN-DIST 6 -91 CMPID-AXP-1DIM 6 997 CMPID-SSN-DIST 7 -205 CMPID-AXP-1DIM 7 1126 CMPID-SSN-DIST 8 -291 CMPID-AXP-1DIM 8 1273 CMPID-SSN-DIST 9 -299 CMPID-SSN-DIST 9 -299 Confidential Page: 57
Generating Weights wgt1dim: SSN Comparison Type Index Weight Comparison Type Index Weight Comparison Type Index Weight CMPID-SSN-DIST 0 0 CMPID-DOB-DIST 0 0 CMPID-AXP-1DIM 0 0 CMPID-SSN-DIST 1 556 CMPID-DOB-DIST 1 0 CMPID-AXP-1DIM 1 419 CMPID-SSN-DIST 2 403 CMPID-DOB-DIST 2 44 CMPID-AXP-1DIM 2 504 Edit distance weights, as with SSN, use the index as CMPID-SSN-DIST 3 315 CMPID-DOB-DIST 3 -176 CMPID-AXP-1DIM 3 596 a placeholder for the edits. CMPID-SSN-DIST 4 177 CMPID-DOB-DIST 4 -285 CMPID-AXP-1DIM 4 723 CMPID-SSN-DIST 5 31 CMPID-AXP-1DIM 5 860 0=missing CMPID-SSN-DIST 6 -91 CMPID-AXP-1DIM 6 997 1=exact match CMPID-SSN-DIST 7 -205 CMPID-AXP-1DIM 7 1126 2=one edit CMPID-SSN-DIST 8 -291 CMPID-AXP-1DIM 8 1273 3=two edits … CMPID-SSN-DIST 9 -299 CMPID-SSN-DIST 9 -299 Confidential Page: 58
Generating Weights wgt1dim: Date of Birth Comparison Type Index Weight Comparison Type Index Weight Comparison Type Index Weight CMPID-SSN-DIST 0 0 CMPID-DOB-DIST 0 0 CMPID-AXP-1DIM 0 0 CMPID-SSN-DIST 1 556 CMPID-DOB-DIST 1 0 CMPID-AXP-1DIM 1 419 CMPID-SSN-DIST 2 403 CMPID-DOB-DIST 2 44 CMPID-AXP-1DIM 2 504 CMPID-SSN-DIST 3 315 CMPID-DOB-DIST 3 -176 CMPID-AXP-1DIM 3 596 Date comparison CMPID-SSN-DIST function stores weights 4 177 CMPID-DOB-DIST 4 -285 CMPID-AXP-1DIM 4 723 In the case of an exact across two tables. CMPID-SSN-DIST 5 31 CMPID-AXP-1DIM 5 860 match, the weight values are stored in CMPID-SSN-DIST 6 -91 CMPID-AXP-1DIM 6 997 0=missing wgtnval CMPID-SSN-DIST 7 -205 CMPID-AXP-1DIM 7 1126 1=exact match CMPID-SSN-DIST 8 -291 CMPID-AXP-1DIM 8 1273 2=one edit CMPID-SSN-DIST 9 -299 CMPID-SSN-DIST 9 -299 3=two edits … Confidential Page: 59
Generating Weights wgtnval: Exact Match Weights for DOB Comparison Type Index Weight Exact match scores for dates are based CMPID-DOB-YEAR 1900 342 on the frequency of dates from each year. CMPID-DOB-YEAR 1899 386 Here you clearly see fake data because CMPID-DOB-YEAR 1898 398 the ‘Baby Boomers’ should be the most CMPID-DOB-YEAR 1897 419 common (lowest score) years. CMPID-DOB-YEAR 1901 423 CMPID-DOB-YEAR 1896 425 CMPID-DOB-YEAR 1895 445 CMPID-DOB-YEAR 1894 463 If there is an exact match between dates, CMPID-DOB-YEAR 2001 463 but the year is not listed, the default CMPID-DOB-YEAR 1954 473 score (-1) is awarded. This is the highest CMPID-DOB-YEAR 1955 473 score because the value was very rare. CMPID-DOB-YEAR 1977 475 CMPID-DOB-YEAR 2003 475 CMPID-DOB-YEAR -1 477 Confidential Page: 60
Generating Weights One-Dimensional Weights: wgt1dim For AXP , the index is used to Comparison Type Index Weight Comparison Type Index Weight Comparison Type Index Weight indicate the # of digits in numerical CMPID-SSN-DIST 0 0 CMPID-DOB-DIST 0 0 CMPID-AXP-1DIM 0 0 address elements. CMPID-SSN-DIST 1 0=missing 556 CMPID-DOB-DIST 1 0 CMPID-AXP-1DIM 1 419 1=one digit CMPID-SSN-DIST 2 403 CMPID-DOB-DIST 2 44 CMPID-AXP-1DIM 2 504 2=two digits CMPID-SSN-DIST 3 315 CMPID-DOB-DIST 3 -176 CMPID-AXP-1DIM 3 596 3=three digits … CMPID-SSN-DIST 4 177 CMPID-DOB-DIST 4 -285 CMPID-AXP-1DIM 4 723 CMPID-SSN-DIST 5 31 CMPID-AXP-1DIM 5 860 Scores are applied when two CMPID-SSN-DIST 6 -91 CMPID-AXP-1DIM 6 997 addresses contain the exact same CMPID-SSN-DIST 7 -205 CMPID-AXP-1DIM 7 1126 number. These scores increase CMPID-SSN-DIST 8 -291 CMPID-AXP-1DIM 8 1273 because one-digit address CMPID-SSN-DIST 9 -299 CMPID-SSN-DIST 9 -299 numbers are common, but eight- digit numbers are rare. Confidential Page: 61
Generating Weights Two-Dimensional Weights: wgt2dim 2D weights for AXP Phone use the columns to Comparison Type Index Missing Exact ED 1 ED2 ED 3 ED 4 ED 5 ED 6 represent phone edit CMPID-AXP-2DIM 0 0 299 217 60 -89 -127 -174 -285 distance and the rows CMPID-AXP-2DIM 1 300 615 560 420 300 250 220 200 to represent address CMPID-AXP-2DIM 2 263 584 530 395 283 205 159 132 edit distance. Index CMPID-AXP-2DIM 3 226 553 500 370 266 160 98 64 CMPID-AXP-2DIM 4 189 522 470 345 249 115 37 -4 column for address Score when both phone CMPID-AXP-2DIM 5 152 491 440 320 232 70 -24 -72 works like wgt1dim and address are exact CMPID-AXP-2DIM 6 115 460 410 295 215 25 -85 -140 (0=missing, 1=exact, match. CMPID-AXP-2DIM 7 78 429 380 270 198 -20 -146 -208 2=one edit …). CMPID-AXP-2DIM 8 41 398 350 245 181 -65 -207 -276 CMPID-AXP-2DIM 9 4 367 320 220 164 -110 -268 -344 CMPID-AXP-2DIM 10 -33 336 290 195 147 -155 -329 -412 Score when both phone CMPID-AXP-2DIM 11 -70 305 260 170 130 -200 -390 -480 and address are totally CMPID-AXP-2DIM 12 -107 274 230 145 113 -245 -451 -548 different. CMPID-AXP-2DIM 13 -144 243 200 120 96 -290 -512 -616 CMPID-AXP-2DIM 14 -181 212 170 95 79 -335 -573 -684 CMPID-AXP-2DIM 15 -218 181 140 70 62 -380 -634 -752 Confidential Page: 62
Generating Weights String-Based Weights: wgtsval Comparison Type Index Weight Comparison Type Index Weight CMPID-NAME-XACT R 115 CMPID-AXP-XACT AZ 145 CMPID-NAME-XACT L 133 CMPID-AXP-XACT ST 181 XACT weights are frequency-based, so the more common a CMPID-NAME-XACT N 133 CMPID-AXP-XACT AVE 360 value is, the lower the score. The rarer a value is, the higher CMPID-NAME-XACT X 267 CMPID-AXP-XACT RD 465 the score. CMPID-NAME-XACT CHRIS 316 CMPID-AXP-XACT JUNCTION 471 CMPID-NAME-XACT JOHN 318 CMPID-AXP-XACT CREEK 557 CMPID-NAME-XACT JENNIFER 333 CMPID-AXP-XACT BLVD 566 CMPID-NAME-XACT BRITTANY 339 CMPID-AXP-XACT DR 566 CMPID-NAME-XACT LUPE 352 CMPID-AXP-XACT HILLS 597 CMPID-NAME-XACT DARIUS 354 CMPID-AXP-XACT WEST 598 CMPID-NAME-XACT a 396 CMPID-AXP-XACT AFB 617 CMPID-NAME-XACT d -100 CMPID-AXP-XACT PARK 633 CMPID-NAME-PARM __FULLNAME_MAXWGT 594 CMPID-AXP-XACT a 883 CMPID-NAME-PARM __NORM_MCCIDX_EQUAL 20 CMPID-AXP-PARM __ADDR_STREET_MAXWGT 3000 Confidential Page: 63
Generating Weights String-Based Weights: wgtsval Comparison Type Index Weight Comparison Type Index Weight CMPID-NAME-XACT R 115 CMPID-AXP-XACT AZ 145 CMPID-NAME-XACT L 133 CMPID-AXP-XACT ST 181 Default agree weights are used when a value is encountered CMPID-NAME-XACT N 133 CMPID-AXP-XACT AVE 360 that does not have a preset weight score. The ‘a’ weight is CMPID-NAME-XACT X 267 CMPID-AXP-XACT RD 465 the highest score, because the value is very rare. CMPID-NAME-XACT CHRIS 316 CMPID-AXP-XACT JUNCTION 471 CMPID-NAME-XACT JOHN 318 CMPID-AXP-XACT CREEK 557 CMPID-NAME-XACT JENNIFER 333 CMPID-AXP-XACT BLVD 566 CMPID-NAME-XACT BRITTANY 339 CMPID-AXP-XACT DR 566 CMPID-NAME-XACT LUPE 352 CMPID-AXP-XACT HILLS 597 CMPID-NAME-XACT DARIUS 354 CMPID-AXP-XACT WEST 598 CMPID-NAME-XACT a 396 CMPID-AXP-XACT AFB 617 CMPID-NAME-XACT d -100 CMPID-AXP-XACT PARK 633 CMPID-NAME-PARM __FULLNAME_MAXWGT 594 CMPID-AXP-XACT a 883 CMPID-NAME-PARM __NORM_MCCIDX_EQUAL 20 CMPID-AXP-PARM __ADDR_STREET_MAXWGT 3000 Confidential Page: 64
Generating Weights String-Based Weights: wgtsval Comparison Type Index Weight Comparison Type Index Weight CMPID-NAME-XACT R 115 CMPID-AXP-XACT AZ 145 CMPID-NAME-XACT L 133 CMPID-AXP-XACT ST 181 Parameter weights issue bonus points, put caps on CMPID-NAME-XACT N 133 CMPID-AXP-XACT AVE 360 maximum scores, and indicate the penalties to subtract for CMPID-NAME-XACT X 267 CMPID-AXP-XACT RD 465 edit distance, nicknames, and phonetics. CMPID-NAME-XACT CHRIS 316 CMPID-AXP-XACT JUNCTION 471 CMPID-NAME-XACT JOHN 318 CMPID-AXP-XACT CREEK 557 CMPID-NAME-XACT JENNIFER 333 CMPID-AXP-XACT BLVD 566 CMPID-NAME-XACT BRITTANY 339 CMPID-AXP-XACT DR 566 CMPID-NAME-XACT LUPE 352 CMPID-AXP-XACT HILLS 597 CMPID-NAME-XACT DARIUS 354 CMPID-AXP-XACT WEST 598 CMPID-NAME-XACT a 396 CMPID-AXP-XACT AFB 617 CMPID-NAME-XACT d -100 CMPID-AXP-XACT PARK 633 CMPID-NAME-PARM __FULLNAME_MAXWGT 594 CMPID-AXP-XACT a 883 CMPID-NAME-PARM __NORM_MCCIDX_EQUAL 20 CMPID-AXP-PARM __ADDR_STREET_MAXWGT 3000 Confidential Page: 65
Name & Gender FPF – wgt4dim Name Gender Name & DOB DOB SSN SSN SSN Index Gender Edit Dist. Year Di ff . Missing Exact Edit Result Result Dist.=1 0 0 0 -0.5 0 -0.65 0 Missing Missing 0 0 1 -0.5 0 -0.65 Birth Month/Day 1 Exact Missing 0 0 2 -0.5 0 -0.65 0 0 3 -0.5 0 -0.65 2 Partial Missing Index Meaning 0 0 4 -0.5 0 -0.65 3 Disagree Missing 0 One or both Dates are Missing mm/dd 0 0 5 -0.5 0 -0.65 0 1 0 0 0 0 4 Missing Agree 1 The mm/dd Dates are an Exact match 0 1 1 0 0 0 5 Exact Agree 2 The mm/dd have an Edit Distance of 1 0 1 2 0 0 0 0 1 3 0 0 0 6 Partial Agree 0 1 4 0 0 0 7 Disagree Agree 0 1 5 0 0 0 8 Missing Disagree 0 2 0 -0.65 -0.2 -0.25 Birth Year 0 2 1 0 0 -0.3 9 Exact Disagree 0 2 2 -0.7 -0.35 -0.4 Index Meaning 10 Partial Disagree 0 2 3 -0.75 -0.4 -0.45 0 2 4 -0.8 -0.45 -0.5 0 One or both years are Missing 11 Disagree Disagree 0 2 5 -0.85 -0.5 -0.55 1 Dates are 0-4 Years different 1 0 0 -0.2 0 0 1 0 1 -0.2 0 0 2 Dates are 5-9 Years different 1 0 2 -0.2 0 0 3 Dates are 10-14 Years different 1 0 3 -0.2 0 0 4 Dates are 15-19 Years different 1 0 4 -0.2 0 0 1 0 5 -0.2 0 0 1 1 0 0 0 0 1 1 1 0 0 0 SSN 1 1 2 0 0 0 1 1 3 0 0 0 Index Meaning 1 1 4 0 0 0 1 1 5 0 0 0 0 One or both SSNs are Missing 1 2 0 -0.35 -0.1 -0.15 1 SSNs are an Exact match 1 2 1 0 0 -0.3 1 2 2 -0.4 -0.25 -0.3 2 SSNs have an Edit Distance of 1 1 2 3 -0.45 -0.3 -0.35 … Edit distance can continue to 9 1 2 4 -0.7 -0.35 -0.4 1 2 5 -0.75 -0.4 -0.45 Confidential Page: 66
Name & Gender FPF – wgt4dim Name Gender Name & DOB DOB SSN SSN SSN Index Gender Edit Dist. Year Di ff . Missing Exact Edit Result Result Dist.=1 0 0 0 -0.5 0 -0.65 0 Missing Missing 0 0 1 -0.5 0 -0.65 Birth Month/Day 1 Exact Missing 0 0 2 -0.5 0 -0.65 0 0 3 -0.5 0 -0.65 2 Partial Missing Index Meaning 0 0 4 -0.5 0 -0.65 3 Disagree Missing 0 One or both Dates are Missing mm/dd 0 0 5 -0.5 0 -0.65 0 1 0 0 0 0 4 Missing Agree 1 The mm/dd Dates are an Exact match 0 1 1 0 0 0 5 Exact Agree 2 The mm/dd have an Edit Distance of 1 0 1 2 0 0 0 0 1 3 0 0 0 6 Partial Agree 0 1 4 0 0 0 7 Disagree Agree 0 1 5 0 0 0 8 Missing Disagree 0 2 0 -0.65 -0.2 -0.25 Birth Year 0 2 1 0 0 -0.3 9 Exact Disagree 0 2 2 -0.7 -0.35 -0.4 Index Meaning 10 Partial Disagree 0 2 3 -0.75 -0.4 -0.45 0 2 4 -0.8 -0.45 -0.5 0 One or both years are Missing 11 Disagree Disagree 0 2 5 -0.85 -0.5 -0.55 1 Dates are 0-4 Years different 1 0 0 -0.2 0 0 1 0 1 -0.2 0 0 2 Dates are 5-9 Years different This position holds a penalty 1 0 2 -0.2 0 0 3 Dates are 10-14 Years different 1 0 3 -0.2 0 0 for when name & gender are 4 Dates are 15-19 Years different 1 0 4 -0.2 0 0 1 0 5 -0.2 0 0 both missing, month\day is 1 1 0 0 0 0 1 1 1 0 0 0 off, the year is 15-19 different, SSN 1 1 2 0 0 0 but SSN is exactly the same. 1 1 3 0 0 0 Index Meaning 1 1 4 0 0 0 1 1 5 0 0 0 0 One or both SSNs are Missing 1 2 0 -0.35 -0.1 -0.15 1 SSNs are an Exact match 1 2 1 0 0 -0.3 1 2 2 -0.4 -0.25 -0.3 2 SSNs have an Edit Distance of 1 1 2 3 -0.45 -0.3 -0.35 … Edit distance can continue to 9 1 2 4 -0.7 -0.35 -0.4 1 2 5 -0.75 -0.4 -0.45 Confidential Page: 67
Name & Gender FPF – wgt4dim Name Gender Name & DOB DOB SSN SSN SSN Index Gender Edit Dist. Year Di ff . Missing Exact Edit Result Result Dist.=1 0 0 0 -0.5 0 -0.65 0 Missing Missing 0 0 1 -0.5 0 -0.65 Birth Month/Day 1 Exact Missing 0 0 2 -0.5 0 -0.65 0 0 3 -0.5 0 -0.65 2 Partial Missing Index Meaning 0 0 4 -0.5 0 -0.65 3 Disagree Missing 0 One or both Dates are Missing mm/dd 0 0 5 -0.5 0 -0.65 0 1 0 0 0 0 4 Missing Agree 1 The mm/dd Dates are an Exact match 0 1 1 0 0 0 5 Exact Agree 2 The mm/dd have an Edit Distance of 1 0 1 2 0 0 0 0 1 3 0 0 0 6 Partial Agree 0 1 4 0 0 0 7 Disagree Agree 0 1 5 0 0 0 8 Missing Disagree 0 2 0 -0.65 -0.2 -0.25 Birth Year 0 2 1 0 0 -0.3 9 Exact Disagree 0 2 2 -0.7 -0.35 -0.4 This position shows no Index Meaning 10 Partial Disagree 0 2 3 -0.75 -0.4 -0.45 0 2 4 -0.8 -0.45 -0.5 penalty for when all 4 0 One or both years are Missing 11 Disagree Disagree 0 2 5 -0.85 -0.5 -0.55 1 Dates are 0-4 Years different dimensions are in perfect 1 0 0 -0.2 0 0 1 0 1 -0.2 0 0 2 Dates are 5-9 Years different agreement. 1 0 2 -0.2 0 0 3 Dates are 10-14 Years different 1 0 3 -0.2 0 0 4 Dates are 15-19 Years different 1 0 4 -0.2 0 0 1 0 5 -0.2 0 0 1 1 0 0 0 0 1 1 1 0 0 0 SSN 1 1 2 0 0 0 1 1 3 0 0 0 Index Meaning 1 1 4 0 0 0 1 1 5 0 0 0 0 One or both SSNs are Missing 1 2 0 -0.35 -0.1 -0.15 1 SSNs are an Exact match 1 2 1 0 0 -0.3 1 2 2 -0.4 -0.25 -0.3 2 SSNs have an Edit Distance of 1 1 2 3 -0.45 -0.3 -0.35 … Edit distance can continue to 9 1 2 4 -0.7 -0.35 -0.4 1 2 5 -0.75 -0.4 -0.45 Confidential Page: 68
Generating Weights When to Recalculate Weights? • Added or changed comparison function or comparison code in Algorithm • When population has grown by >20% of original population • It has been two years since weights last calculated • When new source has been added, especially one that comes from a different geographical area (East Coast has different name, phone and address distributions than West Coast) • When upgrading to a new version of IBM MDM Confidential Page: 69
Generating Weights Weights and Threshold Connections • New algorithm comparison functions require new weights • Adjustments to weight scores require new thresholds • New thresholds may require adjustments to FPF weights • Sometimes what look like threshold issues, can be fixed by tweaking weights Confidential Page: 70
Threshold Analysis
Threshold Analysis Thresholds • Two thresholds: link, ignore, or create task Ignore TASKS Link Clerical Review Auto-Link Threshold Threshold (CR) (AL) Confidential Page: 72
Threshold Analysis Gaining Higher Accuracy • Spreading thresholds further apart (usually more on AL) ensures that matches will be highly accurate, but it increases the number of tasks Ignore TASKS Link Clerical Review Auto-Link Threshold Threshold (CR) (AL) Confidential Page: 73
Threshold Analysis Total Automation • If CR=AL, there are no manual review tasks Ignore Link CR = AL Confidential Page: 74
Threshold Analysis Why do we Analyze Thresholds? • Determine optimal score (where we believe MDM found same person) • Scan a data sample for False Positives and False Negatives • Determine score at which records should be manually reviewed • Validate that the MDM tool is producing accurate matches • Confirm that the weights generated by MDM make sense • Identify general data quality issues • Show hub how willing company trusts matching/linking process Confidential Page: 75
Threshold Analysis Threshold Analysis Process • Threshold Analysis uses real data from across sources: • Uses the “Threshold Analysis Pair Generation” job • Answer “Yes” “No” or “Maybe” to the assigned pairs • Results compiled and analyzed by Threshold Calculator • Thresholds will be adjusted to provide desired accuracy • Deploy updated Thresholds and repeat for further revisions Confidential Page: 76
Threshold Analysis Sample Pairs Exercise • Sample records are grouped in pairs • Pairs will appear in random score order • Do the records belong to the same person? • Make a gut reaction, don’t think too hard • Expect misspellings and typos, take into consideration • Some of the pairs are supposed to be bad • Watch for twins and Jr./Sr. pairs (data very similar, but DOB or Name might reveal difference) Confidential Page: 77
Threshold Analysis Sample Pair 1: Yes, No, Maybe? Record X Record Y Name Birth Date Address SSN Home Phone Cell Phone M M Gender Marital Status Confidential Page: 78
Threshold Analysis Sample Pair 2: Yes, No, Maybe? Record X Record Y Name Birth Date Address 993-20-1661 993-20-1661 SSN Home Phone Cell Phone Gender Marital Status Confidential Page: 79
Threshold Analysis Sample Pair 3: Yes, No, Maybe? Record X Record Y Sue Chaudray-Patel Susan C. Patel Name Birth Date Address SSN Home Phone Cell Phone Gender Marital Status Confidential Page: 80
Threshold Analysis Sample Pair 4: Yes, No, Maybe? Record X Record Y Rick H. Morrison, Jr. Richard Henry Morrison Name Birth Date 1938-06-05 Address SSN Home Phone Cell Phone Gender Marital Status Confidential Page: 81
Threshold Analysis Sample Pair 5: Yes, No, Maybe? Record X Record Y Rick H. Morrison, Jr. Richard Henry Morrison Name Birth Date 1952-04-17 1952-04-17 8821 W. Grosse Point Way Address Ann Arbor, MI 48104 993-20-1661 993-02-1661 SSN Home Phone (313) 623-1863 (517) 881-1437 Cell Phone (517) 881-1437 M M Gender Marital Status Married Confidential Page: 82
Threshold Analysis Sample Pair 6: Yes, No, Maybe? Record X Record Y LaDonna M. Jeffries Jeff M. L’Donne Name Birth Date 1979-12-14 1963-08-30 118 N. Gartner Road, Apt. 3B 19 West Big Timber Address Kalamazoo, MI 49003 Dryden, MI 48428 999-99-9999 171-12-1646 SSN Home Phone (269) 234-3782 Cell Phone (269) 383-1129 (810) 623-1672 F M Gender Marital Status Single Divorced Confidential Page: 83
Threshold Analysis Sample Pair 7: Yes, No, Maybe? Record X Record Y Vallie G. Musial Val Y. Musia Name Birth Date 1982-02-26 1982-02-26 3792 W. Kingston St. Address Ann Arbor, MI 48106 271-19-1209 SSN Home Phone (269) 392-1810 Cell Phone (269) 932-1180 F F Gender Marital Status Single Confidential Page: 84
Threshold Analysis Sample Pair 8: Yes, No, Maybe? Record X Record Y Ernest L. Johns Ernest L. Johns Name Birth Date 1932-03-28 1932-03-28 Address 810-78-1206 810-78-1206 SSN Home Phone (312) 445-2343 (708) 293-7093 Cell Phone M M Gender Marital Status Confidential Page: 85
Threshold Analysis Sample Pair Review • Company representatives review Sample Pairs • Use Pair Manager or Excel • ~2K matched pairs (10 pairs/0.1 change) • Must indicate Yes, No, or Maybe to validate each pair in the file Confidential Page: 86
Threshold Analysis Using Pair Manager Rearrange and hide fields Revisit answers based on status Exact matches turn green Buttons for “Yes”, “No”, “Maybe” Confidential Page: 87
Threshold Analysis Threshold Calculation Sample Pairs → Threshold • Calculator False Positive Rate • Goal: <1 in 1 million • False Negative Rate • Goal: <1 in 1 thousand • Estimated # of Tasks • Dependent on Resources Confidential Page: 88
Implementation Process
Implementation Process General Implementation Process Reiterate Review Configure Clean Bulk Analyze Configure Deploy Derive Generate Test Customer Data Data Cross And Algorithm Instance Data Weights Config Requirements Model Extract Match Review Confidential Page: 90
Implementation Process Review Customer Requirements • Review requirements defined for the Data Extract by customer • Dependencies : Clear understanding of data sources involved, data fields needed, and how to gather that data in a way that MDM software can consume it Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 91
Implementation Process Configure Data Model • Configure a Member Model to fit project needs: data loaded includes, metadata (like sources and attributes), validation lists, and lookup tables in Workbench • Initiate member model defines the way that the MDM software stores, manages, and validates data • Dependencies : Data Extract Guide outlines specific attributes and fields and Implementation Approach defines additional data dictionary requirements Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 92
Implementation Process Configure Algorithm • Design/build an algorithm to address attributes, comparisons, search requirements, and bucketing design needs • Dependencies : Proper data elements must be in place before algorithm can be fully developed Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 93
Implementation Process Clean Data Extract • Analyze data to ensure it conforms to specifications and fix any problems found • Data Extract is a sampling of data • Dependencies : Data Extract Guide outlines data requirements: Workbench and CloverETL perform data cleansing Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 94
Implementation Process Deploy Instance: Create Empty DB • Create a new database for Master Data Service to reference • Dependencies : Need to have a supported database platform Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 95
Implementation Process Deploy Instance: Create Hub Instance • Install Master Data Service engine, configure ODBC connection, and establish Windows service • Dependencies : Need proper software installation files for OS and an empty database in order to create your instance Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 96
Implementation Process Deploy Instance: Bootstrap DB • Create core database tables, define field properties, index tables, populate data dictionary tables with default settings • Dependencies : Need to have Master Data Service software installed and access to empty database and hub instance Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 97
Implementation Process Deploy Instance: Deploy Configuration • Data Dictionary tables control validation rules, application properties, attributes, sources, nicknames, and core algorithm settings • Dependencies : Need Master Data Service engine with a bootstrapped database running before data dictionary can be imported Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 98
Implementation Process Derive Data • Derived data is data processed by algorithm: four events • Raw data is parsed into segment specific unload files • Comparison strings are built from standardized data • Members are assigned bucket hashes • Binary files are created for faster computation • Dependencies : Need most components installed and configured (eg, hub engine, member model, algorithm): if changes made to algorithm, then data must be re-derived Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 99
Implementation Process Generate Weights • Measure value frequency and assign weights • Weight generation process goes through multiple steps to measure frequency of individual values in database • Dependencies : Must have engine installed and algorithm configured (if data already derived then weight generation will take less time) Review Configure Configure Clean Deploy Derive Generate Bulk Analyze Test Customer Data Algorithm Data Instance Data Weights Cross And Config Requirements Model Extract Match Review Confidential Page: 100
Recommend
More recommend