VLDB 2007 Efficient Keyword Search over Virtual XML Views Feng Shao, Lin Guo, Chavdar Botev, Anand Bhaskar, Muthiah Chettiar, Fan Yang Cornell University Jayavel Shanmugasundaram Yahoo! Research
Applications - Personal Portal
[Figure: a personal portal with Sports, News, Auto and Finances views; example keyword queries “Beetles Record”, “Chevy Malibu”, “NCAA Rankings”, “DOW Index”. Caption: Overlap/Duplicate Views]
Applications – Information Integration
[Figure: keyword queries “Vista” and “budget” run against personalized XML views (Projects/Feedback), defined by privilege. Sources: email with comments on projects (in XML format, with <feedback> and <comment> elements) and project.doc (in XML format, with <project> and <title> elements)]
Keyword Search over XML View
• Materialized XML Views?
  - Similar to keyword search over XML documents
    - Many well-studied algorithms
    - Materialize views when loading documents
  - Not applicable in emerging applications!
    - Overlap/Duplicate/Update overhead
    - View definitions not known a priori
• Keyword Search over Virtual XML Views
Related Work
• Scoring and Indexing in IR community
  - DBXplorer [Agrawal02], BANKS [Bhalotia02], ObjectRank [Balmin04], XRank [Guo02], Discover [Hristidis 02]
  - Work with materialized documents
• Integrating keyword search and structural queries
  - GTP [Chen 03], TermJoin [Khalifa 03]
  - Access base data to evaluate the view
• Projecting XML documents [Marian 03]
  - Access base data; not leveraging indexes
Outline
• Motivation
• Problem Definition
• High-level Overview
• PDT Generation Algorithm
• Experimental Results
• Conclusion
Problem Definition
• Ranked Keyword Search over Virtual XML Views
  - Input: a set of keywords Q = {k_1, k_2, …, k_n}, an XML view definition V over an XML database D
  - Output: the k view elements with the highest scores
  - TF-IDF scores
    - TF(k, e): number of occurrences of the keyword k in an element e
    - IDF(k): the inverse of the number of elements containing k
    - Score(e, Q) = Σ_i TF(k_i, e) * IDF(k_i)
    - Score(e, Q) is further normalized by the length of the view element
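The scoring definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: elements are represented as token lists, IDF is taken literally as 1 / (number of elements containing k), and length normalization is a plain division by the token count (the slide does not say which normalization is used). The function names `tf`, `idf`, `score` and `top_k` are hypothetical.

```python
from collections import Counter

def tf(keyword, element_tokens):
    # TF(k, e): number of occurrences of keyword k in element e
    return Counter(element_tokens)[keyword]

def idf(keyword, all_elements):
    # IDF(k): inverse of the number of elements containing k
    containing = sum(1 for tokens in all_elements if keyword in tokens)
    return 1.0 / containing if containing else 0.0

def score(element_tokens, query, all_elements):
    # Score(e, Q) = sum_i TF(k_i, e) * IDF(k_i), then divided by element
    # length (assumed normalization; the slide does not give the formula)
    raw = sum(tf(k, element_tokens) * idf(k, all_elements) for k in query)
    return raw / len(element_tokens) if element_tokens else 0.0

def top_k(all_elements, query, k):
    # Return the k elements with the highest scores
    ranked = sorted(all_elements,
                    key=lambda e: score(e, query, all_elements),
                    reverse=True)
    return ranked[:k]

elements = [
    ["xml", "search", "xml"],
    ["xml", "primer"],
    ["design", "patterns"],
]
query = ["xml", "search"]
best = top_k(elements, query, 1)
```

Note that this sketch scores fully materialized elements; the talk's point is to obtain the same TF and length statistics without materializing the view.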
Running Example
Keyword query: “XML” & “Search”
Virtual View:
    for $book in fn:doc(books.xml)/books//book
    where $book/year > 1995
    return <book>
             $book/title
             for $review in fn:doc(reviews.xml)/reviews//review
             where $review/isbn = $book/isbn
             return <review> $review/content </review>
           </book>
[Figure: base documents. books.xml: books/book elements with title, isbn, year, publisher (e.g. “Design Patterns”, 111-11-1111, 1997, Princeton). reviews.xml: reviews/review elements with rating, content, isbn (e.g. 5, “This book describes …”, 111-11-1111)]
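To make the view's semantics concrete, here is a minimal Python sketch that evaluates it eagerly over in-memory records. It is the materialization baseline, not the approach proposed in the talk. The record layout (dicts with `title`, `isbn`, `year`, `content` keys), the function name `materialize_view` and the sample rows are assumptions for illustration.

```python
def materialize_view(books, reviews):
    """Eagerly evaluate the running-example view over in-memory records."""
    view = []
    for book in books:
        if book["year"] > 1995:                         # where $book/year > 1995
            contents = [r["content"] for r in reviews
                        if r["isbn"] == book["isbn"]]   # join on isbn
            view.append({"title": book["title"], "reviews": contents})
    return view

# Illustrative sample data (hypothetical rows)
books = [
    {"title": "Design Patterns", "isbn": "111-11-1111", "year": 1997},
    {"title": "Database Concepts", "isbn": "111-11-1112", "year": 1994},
]
reviews = [
    {"isbn": "111-11-1111", "content": "This book describes ..."},
]
view = materialize_view(books, reviews)
```

Every book and review is read here, even when neither contains the query keywords, which is the cost a virtual view is meant to avoid.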
Running Example
Keyword query: “XML” & “Search”
[Figure: the materialized view above the base documents. Materialized view: book elements, each with a title (“Design Patterns”, “XML Primer”) and review/content children (e.g. “This book describes …”, “Excellent! … search and query …”, “Decent book on XML…”). Base documents: books.xml and reviews.xml as on the previous slide]
Our Approach
[Figure: two pipelines for the query “XML”, “Search”. Traditional approach: view Evaluator over the base documents (books, reviews), Materialized View, Keyword Processor, View Results; labeled > 300s. Our approach: PDT Generator over the indexes, Pruned Document Trees (PDTs), Evaluator, Pruned View, Scoring and Keyword Processor, Ranked Results; labeled 5s]
Our Approach
[Figure: the PDT pipeline with an example. Base document: books/book with title (“XML Primer”), isbn (111-11-1111), year (1997), publisher (Princeton). PDT (Pruned Document Tree): books/book with title, year (1997), isbn (111-11-1111); the title node carries Id="1.2.1", kwd1="xml", tf="1", length="10" instead of its content]
Orders of magnitude smaller!
Our Approach -- Challenges
1. Joining books & reviews requires isbn (data value)
   -- How to get data values without accessing the base data?
2. Scoring view elements requires aggregate statistical data (e.g., tf from book and review)
   -- How to collect them without materializing the view elements?
[Figure: the PDT pipeline (PDT Generator, PDTs, Evaluator, Pruned View, Scoring, Keyword Processor, Ranked Results) over the indexes on books and reviews]
[Slide text lost in extraction. Surviving labels: B+-Tree, B+ tree index, (ID, TF)]
XML View → Query Pattern Tree (QPT)
• Similar to GTP, proposed by Chen 2003 for normal query evaluation
  - Captures the structural parts required by queries
  - Mandatory/Optional edges
• New features
  - Node annotations
    - V: value required to evaluate the view
    - C: content used in the view
View definition:
    for $book in fn:doc(books.xml)/books//book
    where $book/year > 1995
    return <book>
             $book/title
             for $review in fn:doc(reviews.xml)/reviews//review
             where $review/isbn = $book/isbn
             return <review> $review/content </review>
           </book>
[Figure: QPT for books.xml: books, book, and the children isbn, year>1995, title, with mandatory and optional edges and node annotations v and c]
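One way to picture a QPT is as a small tree of annotated nodes. The sketch below is an assumed in-memory representation, not the authors' data structure: `QPTNode`, its field names and `leaf_paths` are hypothetical, and the choice of which edges are mandatory (year, because of the where clause) versus optional (isbn, title) is an assumption, since the flattened slide does not preserve that detail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class QPTNode:
    tag: str
    axis: str = "/"                  # "/" (child) or "//" (descendant)
    mandatory: bool = True           # mandatory vs optional edge from the parent
    needs_value: bool = False        # "v": value required to evaluate the view
    needs_content: bool = False      # "c": content used in the view
    predicate: Optional[Callable[[str], bool]] = None
    children: List["QPTNode"] = field(default_factory=list)

def leaf_paths(node, prefix=""):
    """Return the root-to-leaf path expression of every leaf in the QPT."""
    path = prefix + node.axis + node.tag
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths

# QPT for the books side of the running example (edge kinds are assumed)
books_qpt = QPTNode("books", children=[
    QPTNode("book", axis="//", children=[
        QPTNode("isbn", mandatory=False, needs_value=True),
        QPTNode("year", needs_value=True, predicate=lambda v: int(v) > 1995),
        QPTNode("title", mandatory=False, needs_content=True),
    ]),
])
paths = leaf_paths(books_qpt)
```

The leaf paths are the path expressions that would later be probed in the path index.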
PDT Intuition
• Restrictions enforced by QPT
  - Predicate Restriction
  - Descendant Restriction
  - Ancestor Restriction
[Figure: base document books with two book elements (author, year, title, isbn, publisher): 1994, “Database Concepts”, 111-11-1112 and 1997, “XML Primer”, 121-32-8663. In the PDT, a retained node is annotated id:1.2.1, kwd1=“xml”, tf=1, length = 10]
PDT Generation
1. Get ID lists for paths in the QPT
2. Merge IDs in the lists to create the PDT
[Figure: the PDT pipeline (PDT Generator, PDTs, Evaluator, Pruned View, Scoring, Keyword Processor, Ranked Results) over the indexes on books and reviews]
Step 1: Get List of IDs
B+-Tree:
    PathID                   Value            IDList
    …                        …                …
    /books/book/isbn         "111-11-111"     1.1.1
    /books/book/isbn         "121-23-1321"    1.2.1
    …                        …                …
    /books/book/author/fn    "Jane"           1.2.3, 1.7.3
Key idea: for each QPT node without mandatory child edges, obtain the corresponding list of IDs
    books//book/isbn:  (1.1.1:"111-11-111"), (1.2.1:"121-23-1321")
    books//book/title: 1.1.4, 1.2.3, 1.9.3
    books//book/year:  (1.2.6, 1.5.1:"1996"), (1.6.1:"1997")
[Figure: QPT with books, book, and the children isbn, title, year>1995, with node annotations v and c]
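The lookup in Step 1 can be approximated as follows. This is a sketch under assumptions: a plain Python list of (path, value, IDs) rows stands in for the B+-tree, a `//` step is resolved by regular-expression matching on stored paths, and predicates are applied to the indexed value. The helper names (`dewey`, `pattern_to_regex`, `id_list`) and the sample rows, including the 1994 year entry, are hypothetical.

```python
import re

def dewey(id_string):
    # "1.2.6" -> (1, 2, 6), so IDs compare in document order
    return tuple(int(part) for part in id_string.split("."))

def pattern_to_regex(pattern):
    # "//" may skip any number of intermediate steps
    parts = [re.escape(p) for p in pattern.split("//")]
    return re.compile("/(?:[^/]+/)*".join(parts))

def id_list(index, pattern, predicate=None):
    """Collect (Dewey ID, value) pairs for a path, sorted in Dewey order."""
    regex = pattern_to_regex(pattern)
    hits = []
    for path, value, ids in index:
        if regex.fullmatch(path) and (predicate is None or predicate(value)):
            hits.extend((dewey(i), value) for i in ids)
    return sorted(hits)

# Stand-in for the path index (illustrative rows)
INDEX = [
    ("/books/book/isbn", "111-11-111", ["1.1.1"]),
    ("/books/book/isbn", "121-23-1321", ["1.2.1"]),
    ("/books/book/year", "1994", ["1.1.3"]),
    ("/books/book/year", "1996", ["1.2.6", "1.5.1"]),
    ("/books/book/year", "1997", ["1.6.1"]),
]
isbns = id_list(INDEX, "/books//book/isbn")
years = id_list(INDEX, "/books//book/year", lambda v: int(v) > 1995)
```

Because values are stored with the IDs, both the join value (isbn) and the predicate (year > 1995) are resolved from the index alone, without touching the base documents.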
Step 2: Merging IDs -- Challenges
• Makes a single pass over relevant id lists
  - Flat indices → nested structure
  - Enforce ancestor/descendant restrictions
[Figure: QPT (books, book, title, isbn, year>1995) next to the document tree with Dewey IDs: books (1); book (1.1) with children 1.1.1 to 1.1.5; book (1.2) with children including 1.2.1, 1.2.3, 1.2.6, 1.2.7, 1.2.8; child labels author, isbn, title, year, publisher]
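The "flat indices to nested structure" step can be illustrated with a k-way merge of sorted Dewey ID lists. The sketch below only covers the merging and nesting; it does not enforce the ancestor/descendant restrictions. The function names and the dict-based tree layout are assumptions, and the standard-library `heapq.merge` is swapped in for whatever merge the authors use.

```python
import heapq

def merge_id_lists(id_lists):
    """Merge per-path sorted Dewey ID lists into one stream in Dewey order."""
    streams = [[(dewey_id, label) for dewey_id in ids]
               for label, ids in id_lists.items()]
    return list(heapq.merge(*streams))

def nest(merged):
    """Turn the flat, ordered stream into a nested tree keyed by Dewey prefix."""
    tree = {}
    for dewey_id, label in merged:
        level = tree
        node = None
        # Ancestors are derived from ID prefixes: 1.2.6 implies 1 and 1.2
        for depth in range(1, len(dewey_id) + 1):
            prefix = dewey_id[:depth]
            node = level.setdefault(prefix, {"label": None, "children": {}})
            level = node["children"]
        if node is not None:
            node["label"] = label
    return tree

id_lists = {
    "isbn": [(1, 1, 1), (1, 2, 1)],
    "title": [(1, 1, 4), (1, 2, 3)],
    "year": [(1, 2, 6)],
}
merged = merge_id_lists(id_lists)
tree = nest(merged)
```

Each list is read once, and because IDs arrive in Dewey order, a node's ancestors already exist (or are created on the way) when the node is reached.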
PDT Generator – Merging IDs
[Figure: ID lists feeding the Candidate Tree (CT), from which nodes move to the PDT; remaining figure text lost in extraction]
Idea: a loop that merges the IDs in the lists and creates the CT nodes in Dewey ID order
At each step, check the node with the minimum ID in the CT:
• if it satisfies all restrictions → PDT
• if it satisfies the descendant restriction but not the ancestor restriction → PDT Cache
• if it does not satisfy the descendant restriction and has no child node in the CT → Discard
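The three-way decision above can be written as a small function. This is a sketch, not the authors' code: the boolean inputs are assumed to be computed elsewhere, the name `classify` and the returned labels are hypothetical, and the fourth outcome (a node that fails the descendant restriction but still has children in the CT) is not stated on the slide, so keeping it in the CT is an assumption.

```python
def classify(satisfies_descendant, satisfies_ancestor, has_child_in_ct):
    """Decide what happens to the minimum-ID node of the Candidate Tree."""
    if satisfies_descendant and satisfies_ancestor:
        return "PDT"          # satisfies all restrictions: emit to the PDT
    if satisfies_descendant:
        return "PDT_CACHE"    # descendant restriction only: park in the PDT cache
    if not has_child_in_ct:
        return "DISCARD"      # fails descendant restriction, no children left in CT
    return "KEEP_IN_CT"       # assumption: undecided until its children are processed

decisions = [
    classify(True, True, False),
    classify(True, False, True),
    classify(False, False, False),
    classify(False, True, True),
]
```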