The Best of Both Worlds: Combining Hand-Tuned and Word-Embedding-Based Similarity Measures for Entity Resolution
Xiao Chen, Gabriel Campero Durand, Roman Zoun, David Broneske, Yang Li, Gunter Saake
xiao.chen@ovgu.de
Otto-von-Guericke-University of Magdeburg
BTW’19, Rostock, March 7th, 2019
Entity Resolution (ER)
Outline: Introduction, Motivation, Hybrid Similarity Calculation, Evaluation, Conclusion & Future Work
▪ Real world vs. digital world (figure: real-world entities and the digital-world records that describe them)
▪ Definition: Identifying records that refer to the same entity
Example: person records from two sources
Source            Given-name  Surname   City      Postcode  Age  Phone-number  Sex
Hospital          starab      Kuaririo  brisbane  1402      25   03 2867 8172  f
Citizen’s office  sarah       Guarino   brisbane  1402      26   03 2897 8172  m
Example: product records (Amazon vs. Google)
Amazon
  Name: world book encyclopedia 2006
  Description: the world book encyclopedia 2006 is a truly student-friendly cd reference resource. it's been …
  Manufacturer: topics entertainment
  Price: 19.99
Google
  Name: world book 2006
  Description: overview with over 87 years of experience and a global reputation for unsurpassed excellence world book 2006 is firmly established as the premier reference source for ...
  Manufacturer: -
  Price: 17.9
Example: bibliography records (DBLP vs. ACM)
DBLP
  ID: conf/sigmod/GrossmanHQ95
  Title: PTool: A Light Weight Persistent Object Manager
  Author: David Hanley, Robert L. Grossman, Xiao Qin
  Venue: SIGMOD Conference
  Year: 1995
ACM
  ID: 223901
  Title: PTool: a light weight persistent object manager
  Author: R. L. Grossman, D. Hanley, X. Qin
  Venue: International Conference on Management of Data
  Year: 1995
Basic Steps of Pair-Wise ER
▪ Input data: records A, B, C, D, E
▪ Pair-wise comparison: a similarity score is computed for every record pair:
  (A, B); (A, C); (A, D); (A, E); (B, C); (B, D); (B, E); (C, D); (C, E); (D, E)
▪ Classification (match/non-match?): works on the scored pairs, e.g. ((A, B), score), ((C, E), score), ((C, D), score), ((D, E), score), …
▪ Results: matches, non-matches, potential matches
▪ Clerical review: manual inspection of the potential matches
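The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `similarity` function (a `difflib` ratio), the two thresholds `lower` and `upper`, the helper names, and the toy record strings are all assumptions chosen only to show how pair-wise comparison feeds a three-way classification.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; any measure could be plugged in here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(score: float, lower: float = 0.6, upper: float = 0.85) -> str:
    """Three-way decision; the threshold values are illustrative assumptions."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "potential match"  # would be passed on to clerical review
    return "non-match"


def pairwise_er(records: dict, lower: float = 0.6, upper: float = 0.85) -> list:
    """Compare every unordered pair of records and classify each scored pair."""
    results = []
    for (id_a, rec_a), (id_b, rec_b) in combinations(records.items(), 2):
        score = similarity(rec_a, rec_b)
        results.append(((id_a, id_b), score, classify(score, lower, upper)))
    return results


# Hypothetical toy input: five records A..E, each flattened to one string.
records = {
    "A": "sarah guarino brisbane",
    "B": "starab kuaririo brisbane",
    "C": "world book 2006",
    "D": "world book encyclopedia 2006",
    "E": "ptool a light weight persistent object manager",
}

for pair, score, label in pairwise_er(records):
    print(pair, round(score, 2), label)
```

With five input records the loop produces 5·4/2 = 10 pairs, which matches the ten pairs listed on the slide. The quadratic growth of this pair count is the reason naive pair-wise comparison becomes expensive on large inputs.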
Three Groups of Attributes
▪ Numerical attributes (NA):
  o Don’t include numerical strings

Persons:
Given-name  Surname   City      Postcode  Age  Phone-number  Sex
starab      Kuaririo  brisbane  1402      25   03 2867 8172  f
sarah       Guarino   brisbane  1402      26   03 2897 8172  m

DBLP-ACM bibliography data:
Record 1
  Title: PTool: A Light Weight Persistent Object Manager
  Author: David Hanley, Robert L. Grossman, Xiao Qin
  Venue: SIGMOD Conference
  Year: 1995
Record 2
  Title: PTool: a light weight persistent object manager
  Author: R. L. Grossman, D. Hanley, X. Qin
  Venue: International Conference on Management of Data
  Year: 1995

Amazon-Google product data:
Record 1
  Name: world book encyclopedia 2006
  Description: the world book encyclopedia 2006 is a truly student-friendly cd reference resource. it's been …
  Manufacturer: topics entertainment
  Price: 19.99
Record 2
  Name: world book 2006
  Description: overview with over 87 years of experience and a global reputation for unsurpassed excellence world book 2006 is firmly established as the premier reference source for students parents teachers and librarians...
  Manufacturer: -
  Price: 17.9
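As a rough illustration of why numerical attributes form their own group, the sketch below compares two numbers by their relative difference. The slide does not give a formula, so `numeric_similarity` and its normalization are assumptions, not the measure used in the talk. The slide also does not list which attributes count as numerical strings; here it is assumed that values such as Postcode or Phone-number would be compared as strings rather than with this function.

```python
def numeric_similarity(a: float, b: float) -> float:
    """Hypothetical similarity in [0, 1] based on the relative difference of two numbers."""
    if a == b:
        return 1.0
    # a != b, so at least one value is non-zero and the denominator is positive.
    denom = max(abs(a), abs(b))
    # Clamp at 0 for values of opposite sign, where the ratio would exceed 1.
    return max(0.0, 1.0 - abs(a - b) / denom)


# Numerical attribute values taken from the example records above.
print(numeric_similarity(25, 26))        # Age
print(numeric_similarity(1995, 1995))    # Year
print(numeric_similarity(19.99, 17.9))   # Price
```

A difference-based measure like this treats 25 and 26 as close, whereas a character-based measure applied to a phone number or postcode only looks at which digits differ, which is one plausible reason to keep numerical strings out of this group.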