Robust temporal graph clustering and cluster evaluation measure for group record linkage
Charini Nanayakkara, Peter Christen, and Thilina Ranbaduge
peter.christen@anu.edu.au
Research School of Computer Science, College of Engineering and Computer Science, The Australian National University, Canberra, Australia
This research is conducted as part of the Digitising Scotland project (https://www.lscs.ac.uk/projects/digitising-scotland/) and partially funded by the Australian Research Council under DP160101934.
Slide 1 of 22

Outline
• Group record linkage and (temporal) constraints
• Temporal constraints based graph clustering
  • Detailed steps of our approach
  • Experimental evaluation on a Scottish data set from the Isle of Skye
• Cluster quality evaluation measure for group record linkage
  • Why traditional evaluation measures might not be adequate
  • A new cluster quality evaluation measure
  • Illustrative use on a Scottish data set
• Conclusions and future work
Slide 2 of 22

(Historical) Group Record Linkage
• Record linkage is the process of identifying sets of records that refer to the same entity (person) within one database or across different databases.
• In group record linkage, the aim is to link records for groups of entities, such as families or households.
• Historical record linkage refers to the linkage of historical birth, marriage, and death records for population reconstruction (building family trees), where each record contains information about several people.
Slide 3 of 22

Problem Statement
• Aim: To identify groups of records that refer to the same entities where there are certain temporal constraints between records.
• Challenges:
  • Existing record linkage techniques do not consider constraints that are implied by factors such as time (temporal), culture, or geographic location.
  • Data errors are often introduced when recording and transcribing the data.
  • Missing values in records.
  • Highly skewed frequency distributions of names.
Slide 4 of 22

Temporal Constraints Based Graph Clustering
• We introduce a novel graph clustering approach for group record linkage which takes temporal constraints into account.
• Temporal constraints: The constraints implied by time differences when linking records. Due to biological limitations, it is temporally not possible for the same mother to have two babies 5 months apart.
[Figure: timeline of the time between the births of Baby A and Baby B, annotated "5 months apart". Axis labels: 0 days, 3 days, 8 months, 9 months, 30 years, 31 years. Tick values in days: 0, 1, 2, 3, 273, 333, 11,000, 11,365.]
Bangladesh woman with two wombs has twins one month after first birth: https://www.bbc.com/news/world-asia-47729118
Slide 5 of 22
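A temporal constraint of this kind can be written as a small predicate. The sketch below is only an illustration: the three thresholds are hypothetical values loosely based on the tick values in the timeline figure, and the function name is an assumption rather than part of the authors' method.

```python
from datetime import date

# Hypothetical thresholds in days, loosely based on the tick values in the
# timeline figure; their exact role in the authors' method is an assumption.
MAX_MULTIPLE_BIRTH_GAP = 3      # births this close could be twins or triplets
MIN_SUCCESSIVE_BIRTH_GAP = 273  # roughly 9 months between separate births
MAX_CHILDBEARING_SPAN = 11_000  # roughly 30 years between first and last birth


def same_mother_plausible(birth_a: date, birth_b: date) -> bool:
    """Return True if one mother could plausibly have given both births."""
    gap = abs((birth_b - birth_a).days)
    if gap <= MAX_MULTIPLE_BIRTH_GAP:
        return True  # treated as a multiple birth
    return MIN_SUCCESSIVE_BIRTH_GAP <= gap <= MAX_CHILDBEARING_SPAN


if __name__ == "__main__":
    # Two births 5 months apart, then two births more than 2 years apart.
    print(same_mother_plausible(date(1861, 2, 1), date(1861, 7, 1)))
    print(same_mother_plausible(date(1861, 2, 1), date(1863, 7, 5)))
```

A hard rule like this would reject the rare case in the linked news story, so a real system might treat such gaps as very unlikely rather than impossible.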

Phase 1: Similarity Graph Generation
Transcribe records:
Record ID   Baby's name   Mother's name   Father's name   Date of birth   …
k           Mary          Kate            John            01/02/1861      …
l           Tom           Katy            Johnny          05/07/1863      …
m           Pat           Kate            John            12/12/1869      …
…           …             …               …               …               …
o           Harry         Peggy           -               03/09/1890      …
p           Kate          Peg             Ron             06/11/1896      …
q           Lizzy         Peggy           Roger           01/01/1901      …
…           …             …               …               …               …
Generate graph:
[Figure: similarity graph G, with records a to u as nodes and edges weighted by pairwise similarity (weights shown range from 0.45 to 1.0).]
Slide 6 of 22

Phase 1: Similarity Graph Generation
[Same record table and similarity graph G as on Slide 6, with the graph now annotated: "Temporally not possible links!"]
Slide 7 of 22
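The graph generation step can be sketched as follows. This is a minimal illustration under stated assumptions: the string comparator (`difflib.SequenceMatcher`), the unweighted mean over attributes, the handling of missing values, and the threshold of 0.7 are all stand-ins chosen here, not the comparison functions or settings used in the presentation. The toy records reuse the parent names from the table.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy records based on the table on the slide: (mother's name, father's name).
# None marks a missing value.
records = {
    "k": ("Kate", "John"),
    "l": ("Katy", "Johnny"),
    "m": ("Kate", "John"),
    "o": ("Peggy", None),
}


def string_similarity(a: str, b: str) -> float:
    """Stand-in string comparator; not the comparator used by the authors."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a, rec_b) -> float:
    """Unweighted mean similarity over attributes present in both records."""
    sims = [string_similarity(a, b) for a, b in zip(rec_a, rec_b)
            if a is not None and b is not None]
    return sum(sims) / len(sims) if sims else 0.0


def build_similarity_graph(records, threshold=0.7):
    """Return {(id_a, id_b): similarity} for pairs at or above the threshold."""
    graph = {}
    for id_a, id_b in combinations(sorted(records), 2):
        sim = record_similarity(records[id_a], records[id_b])
        if sim >= threshold:
            graph[(id_a, id_b)] = sim
    return graph


graph = build_similarity_graph(records)
```

Comparing all pairs is quadratic in the number of records, so this form only suits small examples.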

Phase 2 (a): Link Strength Based Edge Classification
• The concept of link strength was first used in record linkage by Saeedi et al. (2018). Only the edges with similarities greater than a user-defined threshold are used.
• Strong: Edges (r_i, r_j) with the highest similarity with respect to all other edges connected to both r_i and r_j.
• Norm: Edges (r_i, r_j) with the highest similarity with respect to all other edges connected to either r_i or r_j, but not both.
• WeakHigh: Edges which are neither strong nor normal.
Examples from the figure:
• Strong: (c, b) with similarity 0.95
• Norm: (f, h) with similarity 0.9
• WeakHigh: (a, k) with similarity 0.6
[Figure: example similarity graph over nodes a, b, c, d, e, f, g, h, k, and m, with edge similarities between 0.6 and 1.0.]
Slide 8 of 22
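The three link strengths can be assigned with one pass to find each node's best edge and a second pass to label the edges. The sketch below assumes that ties count as "highest" and that the threshold filter has already been applied. The toy edge weights are invented for illustration and are not the graph from the slide, although they are chosen so that the three named example edges receive the same labels.

```python
from collections import defaultdict


def classify_edges(edges):
    """Label each edge by link strength.

    edges: {(u, v): similarity}, already filtered by the similarity threshold.
    Returns {(u, v): "Strong" | "Norm" | "WeakHigh"}.
    """
    # Highest similarity among the edges incident to each node.
    best = defaultdict(float)
    for (u, v), sim in edges.items():
        best[u] = max(best[u], sim)
        best[v] = max(best[v], sim)

    labels = {}
    for (u, v), sim in edges.items():
        # Number of endpoints for which this edge is the best one (0, 1, or 2).
        hits = (sim == best[u]) + (sim == best[v])
        labels[(u, v)] = ("WeakHigh", "Norm", "Strong")[hits]
    return labels


# Hypothetical toy graph (weights invented for this example).
toy_edges = {
    ("b", "c"): 0.95, ("a", "b"): 0.8, ("a", "k"): 0.6,
    ("g", "k"): 0.9, ("f", "h"): 0.9, ("e", "h"): 0.95,
}
labels = classify_edges(toy_edges)
```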

Phase 2 (b): Base Cluster Generation
Create a new similarity graph G' using the selected link strength(s) → Generate connected components based on G' → Iterative cluster refinement
Iterative Cluster Refinement:
• The temporal implausibilities of connected components are eliminated in this step.
• For each connected component, nodes involved in implausible connections are ordered to determine the best sequence in which to iteratively remove temporally implausible edges.
[Figure: two example components with their ordered node lists: Ordered list = [f, e, a, g, c] and Ordered list = [e, a].]
Slide 9 of 22
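One way to picture this phase is the greedy sketch below. It is not the authors' refinement procedure: the slide does not state how nodes are ordered, so ordering by the number of implausible pairs a node takes part in, and cutting all edges of the worst node, are assumptions made here. The `plausible` callback is also hypothetical and stands for any temporal check between two records.

```python
from collections import defaultdict
from itertools import combinations


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def refine(nodes, edges, plausible):
    """Split components until no cluster contains an implausible pair."""
    clusters, pending = [], connected_components(nodes, edges)
    while pending:
        component = pending.pop()
        conflicts = defaultdict(int)
        for u, v in combinations(sorted(component), 2):
            if not plausible(u, v):
                conflicts[u] += 1
                conflicts[v] += 1
        if not conflicts:
            clusters.append(component)
            continue
        # Assumed ordering: the node in the most implausible pairs goes first.
        worst = max(sorted(conflicts), key=conflicts.get)
        kept = {e for e in edges
                if e[0] in component and e[1] in component and worst not in e}
        pending.extend(connected_components(component, kept))
    return clusters


# Toy example: a chain a-b-c-d in which a and c cannot share a mother.
toy_clusters = refine(
    {"a", "b", "c", "d"},
    {("a", "b"), ("b", "c"), ("c", "d")},
    plausible=lambda u, v: {u, v} != {"a", "c"},
)
```

Each round isolates one node, so the loop always terminates, but a smarter ordering could keep larger clusters intact.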

Phase 3: Iterative Cluster Merging
Pairwise base cluster similarity calculation using edges of the selected link strength(s) → Iteratively merge temporally plausible cluster pairs with cluster similarity greater than a user-defined threshold
• Pairwise base cluster similarity is a combination of the similarity and the coverage.
• Similarity can be calculated as:
  • Maximum: maximum similarity among edges between two clusters (single-link)
  • Minimum: minimum similarity among edges between two clusters (complete-link)
  • Average: average similarity across edges between two clusters (average-link)
• Coverage = (number of edges of the selected link strength between two clusters) / (number of all edges between two clusters, with respect to the similarity graph G)
Slide 10 of 22
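The two ingredients of the pairwise cluster similarity can be computed as in the sketch below. The slide does not say how similarity and coverage are combined, so the function returns them separately. The function name, the argument layout, and the assumption that `selected_edges` is a subset of the edges in G are choices made for this illustration.

```python
from statistics import mean


def cluster_pair_scores(cluster_a, cluster_b, all_edges, selected_edges,
                        method="average"):
    """Return (similarity, coverage) for two clusters.

    all_edges: {(u, v): similarity} for the full similarity graph G.
    selected_edges: set of edges of the selected link strength(s),
        assumed to be a subset of all_edges.
    method: "maximum", "minimum", or "average".
    """
    def between(edge):
        u, v = edge
        return ((u in cluster_a and v in cluster_b)
                or (u in cluster_b and v in cluster_a))

    all_between = [e for e in all_edges if between(e)]
    selected_between = [e for e in selected_edges if between(e)]
    if not selected_between:
        return 0.0, 0.0  # no usable edge connects the two clusters

    sims = [all_edges[e] for e in selected_between]
    aggregate = {"maximum": max, "minimum": min, "average": mean}[method]
    coverage = len(selected_between) / len(all_between)
    return aggregate(sims), coverage


# Hypothetical toy data: three edges cross the clusters, two are selected.
G = {("a", "c"): 0.9, ("b", "c"): 0.7, ("b", "d"): 0.6, ("a", "b"): 0.95}
selected = {("a", "c"), ("b", "c")}
similarity, coverage = cluster_pair_scores({"a", "b"}, {"c", "d"}, G, selected)
```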

Experimental Setup
• Data set
  • For evaluation we used a real Scottish birth data set with 17,614 birth certificates, covering the population of the Isle of Skye from 1861 to 1901.
  • Each birth certificate contains personal details about a baby and its parents such as their names, address, marriage date, occupations, and the baby's date of birth.
  • We used six different attribute combinations for similarity calculation: all (parents' names, addresses, occupations, and marriage dates), parent names with addresses, and parent names only, each with and without weighting (Fellegi and Sunter, 1969).
• Evaluation measures:
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • Area under the precision-recall curve (AUC-PR): a summary measure of the precision and recall values across different similarity thresholds
  TP: true matching record pairs; FP: wrongly matched record pairs; FN: wrongly non-matched record pairs
Slide 11 of 22
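The three measures can be computed from pair counts as in the sketch below. The slide does not say how the area is obtained, so the trapezoidal rule used here is an assumption, and the numbers in the example are invented.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of linked record pairs that are true matches."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true matching record pairs that were linked."""
    return tp / (tp + fn) if tp + fn else 0.0


def auc_pr(points) -> float:
    """Area under a precision-recall curve by the trapezoidal rule.

    points: iterable of (recall, precision) pairs, one per similarity threshold.
    """
    ordered = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(ordered, ordered[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Invented example: 8 correct links, 2 false links, 8 missed links.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
area = auc_pr([(0.0, 1.0), (0.5, 0.8), (1.0, 0.6)])
```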

Precision-Recall Curves
Legend: W = weighted, UW = unweighted
• Results are shown only for base clusters created with 'Strong' edges, since they showed the highest precision (95%). Since the variation across similarity calculation methods was minimal, we show curves only for the 'average' similarity method.
• Surprisingly, better results were obtained with fewer attributes for similarity graph generation!
Slide 12 of 22

Area Under the Precision-Recall Curve (AUC-PR)
Legend: W = weighted, UW = unweighted
• We compared this novel approach against our recently proposed temporal star clustering approach (Nanayakkara et al. 2018).
• There are no other temporal clustering approaches that we are aware of.
• Our new temporal approach achieved the highest average AUC-PR value, 0.88, when compared with the previous temporal star clustering approach.
Slide 13 of 22

Are Precision and Recall Suitable for Evaluating Group Record Linkage?
• Precision and recall (as used before) have traditionally been employed to evaluate linkage quality in situations where ground truth data is available.
  • True Positives (true matching record pairs – correct matches).
  • False Positives (wrongly matched record pairs – false matches).
  • False Negatives (wrongly non-matched record pairs – missed matches).
• These metrics measure the quality of links between records.
• For group record linkage, however, we want to measure the quality of clusters (groups) of records.
• For that purpose, precision and recall can be ambiguous and not meaningful.
Slide 14 of 22
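To make the link-level view concrete, the sketch below derives TP, FP, and FN from a predicted clustering by expanding every cluster into record pairs. The function names and toy clusters are invented for illustration; the point is only that the counts describe pairs, not whole groups.

```python
from itertools import combinations


def record_pairs(clusters):
    """Expand clusters into the set of unordered record pairs they imply."""
    return {frozenset(pair)
            for cluster in clusters
            for pair in combinations(sorted(cluster), 2)}


def pair_counts(predicted, truth):
    """Return (TP, FP, FN) counted over record pairs."""
    predicted_pairs, true_pairs = record_pairs(predicted), record_pairs(truth)
    tp = len(predicted_pairs & true_pairs)   # correct matches
    fp = len(predicted_pairs - true_pairs)   # false matches
    fn = len(true_pairs - predicted_pairs)   # missed matches
    return tp, fp, fn


# Invented example: one record (c) was placed in the wrong group.
counts = pair_counts(predicted=[{"a", "b", "c"}, {"d"}],
                     truth=[{"a", "b"}, {"c", "d"}])
```

In this toy case a single misplaced record produces two false pairs and one missed pair, which hints at why pair counts can be hard to interpret at the level of groups.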
