Schema Matching in a Large Scale Personal Schema Based Querying
Marko Smiljanić, Maurice van Keulen, Willem Jonker
Dutch-Belgian Database Day, December 3, 2004, Antwerp, Belgium
in this talk
• motivation: personal schema based querying
• understanding: formalizing the schema matching problem
• solving: clustering in schema matching
• validating: semantic validation without semantics
motivation
mediated schema
[Figure labels: query //account[number=1234]/owner, mediated schema, mediator, data sources]
personal schema
[Figure labels: query //account[number=1234]/owner, personal schema, PSQ, data sources]
PSQ: Personal Schema Based Query Answering System
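The gap that schema matching has to close can be sketched in a few lines. The example below is only an illustration: both XML documents, their element names and the rewritten query are hypothetical, and the rewriting is done by hand where a matcher would have to discover it. Python's ElementTree supports a limited XPath subset, so the slide's query is written with a leading `.` and a quoted value.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that happens to follow the user's personal schema.
PERSONAL = """
<bank>
  <account><number>1234</number><owner>Ann</owner></account>
  <account><number>5678</number><owner>Bob</owner></account>
</bank>
"""

# Hypothetical repository document: same information, different element names.
REPOSITORY = """
<institution>
  <acct><id>1234</id><holder>Ann</holder></acct>
</institution>
"""

def owners(xml_text, path):
    """Evaluate an ElementTree path and return the text of every hit."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path)]

# The query from the slide, adapted to ElementTree's XPath subset.
QUERY = ".//account[number='1234']/owner"
# Hand-written rewriting; a schema matcher would have to find this mapping.
REWRITTEN = ".//acct[id='1234']/holder"

print(owners(PERSONAL, QUERY))        # personal schema fits the data
print(owners(REPOSITORY, QUERY))      # no match: names differ
print(owners(REPOSITORY, REWRITTEN))  # match after rewriting
```

The personal-schema query only returns data from the repository document once it has been rewritten in the repository's own vocabulary.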
architecture
[Figure: architecture diagram; recoverable labels: schemas, schema loader, schema repository, select, data]
Déjà Vu
[Figure: labels not recoverable]
goals and issues
goals
• efficiency of schema matching (time-to-last, time-to-first)
• effectiveness of schema matching (precision/recall)
issues
• trees vs. graphs
• the objective function
understanding
schema matching
[Figure label: hints]
formalism
constraint optimization problem
well known framework, offering a range of approaches for efficient problem solving
formalism
• correctness
• ranking
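A minimal sketch of how schema matching can be read as a constraint optimization problem, under assumptions not stated on the slides: personal schema nodes are the variables, repository nodes are their domain, "correctness" is modeled as a structural constraint (a mapped parent must be an ancestor of its mapped children), and "ranking" is modeled by an objective function that sums hint scores. The trees, the hint values and all function names are hypothetical, and the exhaustive enumeration is for illustration only.

```python
from itertools import permutations

# Hypothetical personal schema and repository, both as child -> parent maps.
PERSONAL_PARENT = {"account": None, "number": "account", "owner": "account"}
REPO_PARENT = {"bank": None, "acct": "bank", "no": "acct",
               "holder": "acct", "branch": "bank", "manager": "branch"}

# Hypothetical hint scores (e.g. from name similarity); missing pairs score 0.
HINT = {("account", "acct"): 0.9, ("account", "branch"): 0.2,
        ("number", "no"): 0.8, ("owner", "holder"): 0.6,
        ("owner", "manager"): 0.7}

def is_ancestor(ancestor, node):
    """True if `ancestor` lies on the path from `node` up to the repository root."""
    current = REPO_PARENT[node]
    while current is not None:
        if current == ancestor:
            return True
        current = REPO_PARENT[current]
    return False

def is_correct(mapping):
    """Constraint: every personal parent/child edge maps to an ancestor/descendant pair."""
    return all(parent is None or is_ancestor(mapping[parent], mapping[child])
               for child, parent in PERSONAL_PARENT.items())

def score(mapping):
    """Objective function: sum of hint scores over the mapped pairs."""
    return sum(HINT.get(pair, 0.0) for pair in mapping.items())

def ranked_mappings():
    """Enumerate all injective mappings, keep the correct ones, rank by score."""
    personal = list(PERSONAL_PARENT)
    found = []
    for image in permutations(REPO_PARENT, len(personal)):
        mapping = dict(zip(personal, image))
        if is_correct(mapping):
            found.append((score(mapping), mapping))
    found.sort(key=lambda pair: pair[0], reverse=True)
    return found

best_score, best_mapping = ranked_mappings()[0]
print(best_mapping, round(best_score, 2))
```

In this toy instance the hint alone would pull `owner` toward `manager`, but the structural constraint rules that mapping out once `account` is mapped to `acct`. Brute-force enumeration grows factorially with schema size, which is why reducing the search space matters.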
finding a solution
the idea of clustering
distance based clustering
why clustering?
• clusters can be ranked
• search space is reduced
clustering approaches (and issues)
• clustering method has to be scalable
k-medoid
• how to initialize
• pre-computation of distance
hand made linear-time clustering
• make it intelligent, yet keep it close to linear-time
validation
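For reference, a minimal sketch of a k-medoid style clustering over a precomputed distance matrix. This is the textbook alternating variant (assign each item to its nearest medoid, then re-pick each cluster's medoid), not necessarily the variant used in this work; the function name, the toy data and the choice of initial medoids are assumptions made for the example.

```python
def k_medoids(dist, medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing.

    dist    -- precomputed square distance matrix (list of lists)
    medoids -- indices of the initial medoids (the initialization is an input)
    Returns a dict mapping each medoid index to the list of its members.
    """
    medoids = list(medoids)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: every item joins the cluster of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for item in range(len(dist)):
            nearest = min(medoids, key=lambda m: dist[item][m])
            clusters[nearest].append(item)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for old, members in clusters.items():
            if not members:  # keep a medoid that lost all its members
                new_medoids.append(old)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters

# Toy data: six points on a line, with absolute difference as the distance.
points = [1, 2, 3, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
print(k_medoids(dist, medoids=[0, 1]))
```

The sketch makes both listed issues visible: the result depends on the initial medoids passed in, and the full distance matrix has to be computed up front, which is quadratic in the number of items.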
validation paradox
[Figure: search space with regions labeled A, H and T]
P = T / A
R = T / H
• semantic validation does not like large search spaces!
vs.
• clustering is only useful in large search spaces!
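The two formulas can be written out as a small sketch. It assumes the usual reading of the figure labels, which the slide itself does not spell out: A is the set of mappings returned by the system, H is the set of mappings a human judges correct, and T is their overlap. The mapping identifiers are made up.

```python
def precision_recall(answers, human):
    """Return (P, R) with P = T / A and R = T / H, where T = A intersected with H."""
    true_positives = answers & human
    precision = len(true_positives) / len(answers) if answers else 0.0
    recall = len(true_positives) / len(human) if human else 0.0
    return precision, recall

# Hypothetical mapping identifiers.
answers = {"m1", "m2", "m3", "m4"}   # A: what the matcher returned
human = {"m2", "m4", "m9"}           # H: what a human considers correct
p, r = precision_recall(answers, human)
print(p, r)
```

Computing H requires a human to inspect the search space, which is the cost that grows with its size.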
estimating the precision and recall
• size based
• order based
size based quality estimation
[Figure: two panels, "no clustering" and "clustering", with regions labeled A, H, T and B]
no clustering: P = T / A, R = T / H
clustering: R 12 = B / A
size based quality estimation
[Figure: three panels, NO CLUSTERING, CLUST. BEST CASE and CLUST. WORST CASE, with regions labeled B and H]
B/A = 93%
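One plausible reading of the best-case and worst-case panels, sketched under explicit assumptions: clustering shrinks the answer set from size A to size B, so A − B mappings are lost. In the best case none of the lost mappings was correct; in the worst case as many of them as possible were. Only the 93% ratio comes from the slide; the function name and the values of A, T and H are hypothetical.

```python
def clustering_bounds(a, b, t, h):
    """Bound precision and recall after clustering keeps b of a mappings.

    a -- number of mappings found without clustering (A)
    b -- number of those mappings still found with clustering (B), b <= a
    t -- number of correct mappings among the a mappings (T)
    h -- number of mappings a human considers correct (H)
    """
    if not (0 < b <= a and 0 <= t <= a and t <= h):
        raise ValueError("expected 0 < b <= a, 0 <= t <= a and t <= h")
    lost = a - b
    best_t = min(t, b)          # best case: only incorrect mappings were lost
    worst_t = max(0, t - lost)  # worst case: lost mappings were correct ones
    return {
        "best": (best_t / b, best_t / h),     # (precision, recall)
        "worst": (worst_t / b, worst_t / h),  # (precision, recall)
    }

# B/A = 93% as on the slide; A, T and H are made-up numbers.
bounds = clustering_bounds(a=1000, b=930, t=100, h=120)
print(bounds)
```

Under these assumptions the bounds depend only on set sizes, so they can be computed without a human inspecting the individual mappings.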
order based quality estimation
[Figure: two panels, "no clustering" and "clustering"; remaining labels not recoverable]
order based quality estimation
[Figure: three panels, NO CLUSTERING, CLUST. ALG 1 and CLUST. ALG 2]
what comes next
• add intelligence to clustering
• impact of other hints on clustering
• using graphs
And that was it! Questions?