Kikori-KS: An Effective and Efficient Keyword Search System for Digital Libraries in XML

Toshiyuki Shimizu (Kyoto University)
Norimasa Terada (Nagoya University)
Masatoshi Yoshikawa (Kyoto University)

ICADL 2006, 29th November

Outline
• Background
• Summary of our Contribution
• Kikori-KS
  - User Interfaces
  - Implementation of Keyword Search on Relational Databases
  - Ranking Model
• Experiments
• Conclusions, Future Works

Background (1/2)
• Large number of documents in digital libraries are now structured in XML
• Growing demand for XML Information Retrieval (XML-IR) Systems
• We can identify meaningful document fragments by encoding documents in XML
  - ex) Sections, subsections and paragraphs in scholarly articles
  - Browsing only document fragments relevant to a certain topic
• Keyword search on XML documents
  - Simple, intuitively understandable, yet useful form of queries, especially for unskilled end-users
  - We do not need to understand XML query languages and XML schema

Background (2/2)
• For the keyword “database”

[Figure: example XML document tree, showing element nodes and their text values]
  article
    transaction: "database"
    title: "XML Index"
    body
      section
        title: "Introduction"
        p: "…XML database"
      section
        title: "XML Labeling"
        p: "Query processing…"

Summary of our Contribution
• We have developed Kikori-KS
  - A prototype system for XML-IR
  - Under Kikori Project
  - Accepts Keyword Set as a query
• User-friendly interface
  - FetchHighlight interface
• Storage schema on RDB
  - The database schema is carefully designed
  - Acceptable search time
Overview of Kikori-KS

[Figure: architecture of Kikori-KS. Components: End User, User Interface Module, SQL Translation Module, Storage Module, RDB. Labelled flows: Set of Keywords, Ranked relevant elements, Search Results, XML Documents.]

User Interfaces
• Search results of XML-IR are document fragments, which may be nested
• INEX 2005 project* defined three strategies for element retrieval

[Figure: result list layouts]
  Thorough:       E1, E2, E3, ...
  Focussed:       E1, E2, E3, ... (Ei does not overlap with Ej)
  FetchBrowse:    D1: E11, E12, E13, ...; D2: E21, E22, ...
  FetchHighlight: D1: E11 (E111, E112), E12, ...; D2: E21, ...

• Three strategies of INEX are not necessarily intended to be used in designing user interfaces

* http://inex.is.informatik.uni-duisburg.de/2005/

Retrieval Strategy of INEX (1/3)
• Thorough
  - Relevant elements are retrieved in descending order of their scores

Result: E1-3 (0.6), E1-8 (0.5), E1-10 (0.4), E2-10 (0.4), ...

[Figure: example trees D1 and D2 with element scores, listed by tree level]
  D1: E1-1 0.2 | E1-2 0, E1-3 0.6, E1-4 0.1 | E1-5 0, E1-8 0.5 | E1-6 0, E1-7 0, E1-9 0.3, E1-10 0.4
  D2: E2-1 0.1 | E2-2 0, E2-3 0, E2-4 0.2 | E2-5 0, E2-8 0.3 | E2-6 0, E2-7 0, E2-9 0, E2-10 0.4

Retrieval Strategy of INEX (2/3)
• Focussed
  - The system retrieves only focussed elements (i.e. non-overlapping elements)
  - Ranked in relevance order

Result (same example trees and scores): E1-3 (0.6), E1-8 (0.5), E2-10 (0.4), ...
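The difference between the Thorough and Focussed results above can be sketched in a few lines of Python. This is only an illustration, not the Kikori-KS implementation. The scores are copied from the example trees; the parent links are an assumption, since the flattened figure shows only the tree levels, and the greedy "keep the best-scoring element, drop anything overlapping it" rule is one simple way to obtain non-overlapping results. The names `SCORES`, `PARENT`, `thorough` and `focussed` are hypothetical.

```python
# Element scores copied from the example trees D1 and D2 on the slides.
SCORES = {
    "E1-1": 0.2, "E1-2": 0.0, "E1-3": 0.6, "E1-4": 0.1, "E1-5": 0.0,
    "E1-6": 0.0, "E1-7": 0.0, "E1-8": 0.5, "E1-9": 0.3, "E1-10": 0.4,
    "E2-1": 0.1, "E2-2": 0.0, "E2-3": 0.0, "E2-4": 0.2, "E2-5": 0.0,
    "E2-6": 0.0, "E2-7": 0.0, "E2-8": 0.3, "E2-9": 0.0, "E2-10": 0.4,
}

# ASSUMPTION: child -> parent links. The slides show only tree levels,
# so this shape is a guess that reproduces the results on the slides.
PARENT = {}
for d in ("1", "2"):
    PARENT.update({
        f"E{d}-2": f"E{d}-1", f"E{d}-3": f"E{d}-1", f"E{d}-4": f"E{d}-1",
        f"E{d}-5": f"E{d}-4", f"E{d}-8": f"E{d}-4",
        f"E{d}-6": f"E{d}-5", f"E{d}-7": f"E{d}-5",
        f"E{d}-9": f"E{d}-8", f"E{d}-10": f"E{d}-8",
    })


def ancestors(elem):
    """Yield the ancestors of elem, nearest first."""
    while elem in PARENT:
        elem = PARENT[elem]
        yield elem


def overlaps(a, b):
    """Two elements overlap if they are equal or one contains the other."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def thorough(scores):
    """All relevant elements (score > 0) in descending score order."""
    relevant = [e for e, s in scores.items() if s > 0]
    return sorted(relevant, key=lambda e: (-scores[e], e))


def focussed(scores):
    """Greedy selection of non-overlapping elements, best score first."""
    selected = []
    for elem in thorough(scores):
        if not any(overlaps(elem, s) for s in selected):
            selected.append(elem)
    return selected


print(thorough(SCORES)[:4])
print(focussed(SCORES))
```

Under these assumptions, E1-10 appears in the Thorough list but is dropped from the Focussed list because it lies inside E1-8, which has a higher score.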
Retrieval Strategy of INEX (3/3)
• FetchBrowse
  - Fetching Phase
    - The system first identifies relevant documents and ranks them in relevance order
  - Browsing Phase
    - Within a fetched document, the system identifies relevant elements and ranks them in relevance order

Result:
  D1 (0.2): E1-3 (0.6), E1-8 (0.5), E1-10 (0.4), ...
  D2 (0.1): E2-10 (0.4), ...

[Figure: the same example trees D1 and D2 with element scores as on the previous two slides; document scores are D1 0.2 and D2 0.1]

FetchHighlight
• Displaying search result elements aggregated by XML documents is effective
  - FetchBrowse is of that style
• Displaying search result elements in their document order is useful
• FetchHighlight
  - XML documents are first sorted in their relevance order
  - Relevant elements within the XML document are displayed in document order
  - Elements are indented in accordance with their depth in the XML tree

[Figure: FetchHighlight result layout]
  D1
    E11
      E111
      E112
    E12
    :
  D2
    E21
    :

FetchHighlight Interface

[Screenshot: search result page. Callouts: results are aggregated by document; elements are listed in document order; outline elements are displayed; elements with high score are displayed by using a larger font]

Browsing Document Fragment

[Screenshot: document view]
* Selected document fragment is highlighted
* Search words are highlighted

The Feature of FetchHighlight Interface
• Focussed elements are easily identified
• Users can also recognize the parts of the documents where many highly relevant elements are clustered
• Outline elements
  - Displayed even if the score is 0
  - The elements with particular structural information
    - ex) sections and subsections
  - Useful for browsing
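The two document-oriented orderings can be compared with a small Python sketch. It is an illustration under stated assumptions, not the system's code. Scores come from the example trees. The document-order positions and depths assume that elements are numbered in document order and that the tree has the shape suggested by the figure levels. Which elements count as outline elements is passed in by the caller and is hypothetical here. The names `ELEMENTS`, `fetch_browse` and `fetch_highlight` are made up for this example.

```python
# Document scores from the slides.
DOC_SCORES = {"D1": 0.2, "D2": 0.1}

# (document, position in document order, depth, score).
# ASSUMPTION: numbering follows document order; depths follow the
# tree levels visible in the figure.
ELEMENTS = [
    ("D1", 1, 0, 0.2), ("D1", 2, 1, 0.0), ("D1", 3, 1, 0.6),
    ("D1", 4, 1, 0.1), ("D1", 5, 2, 0.0), ("D1", 6, 3, 0.0),
    ("D1", 7, 3, 0.0), ("D1", 8, 2, 0.5), ("D1", 9, 3, 0.3),
    ("D1", 10, 3, 0.4),
    ("D2", 1, 0, 0.1), ("D2", 2, 1, 0.0), ("D2", 3, 1, 0.0),
    ("D2", 4, 1, 0.2), ("D2", 5, 2, 0.0), ("D2", 6, 3, 0.0),
    ("D2", 7, 3, 0.0), ("D2", 8, 2, 0.3), ("D2", 9, 3, 0.0),
    ("D2", 10, 3, 0.4),
]


def name(doc, pos):
    """Build the element name used on the slides, e.g. ('D1', 3) -> 'E1-3'."""
    return f"E{doc[1:]}-{pos}"


def ranked_docs():
    """Documents in descending order of document score."""
    return sorted(DOC_SCORES, key=lambda d: -DOC_SCORES[d])


def fetch_browse():
    """Per document: relevant elements in descending score order."""
    result = []
    for doc in ranked_docs():
        hits = [e for e in ELEMENTS if e[0] == doc and e[3] > 0]
        hits.sort(key=lambda e: -e[3])
        result.append((doc, [name(doc, e[1]) for e in hits]))
    return result


def fetch_highlight(outline=frozenset()):
    """Per document: relevant or outline elements in document order,
    indented by depth in the XML tree."""
    lines = []
    for doc in ranked_docs():
        lines.append(doc)
        for d, pos, depth, score in ELEMENTS:  # already in document order
            if d == doc and (score > 0 or name(d, pos) in outline):
                lines.append("  " * (depth + 1) + name(d, pos))
    return lines


print(fetch_browse())
# Treating E1-5 as an outline element is purely hypothetical.
print("\n".join(fetch_highlight(outline={"E1-5"})))
```

In the FetchHighlight output an outline element is listed even though its score is 0, which matches the behaviour described for outline elements.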
Storing XML documents into RDB
• A huge number of document fragments have to be handled efficiently
  - ex) There are 16,080,830 document fragments (elements) against 16,819 documents in the INEX 1.9 collection used in our experiments
• Storage schema based on XRel
  - Independent of the logical structure of XML documents
• Conceptual Database Design
  - Element (docID, elemID, pathID, start, end, label)
  - Path (pathID, pathexp)
  - Term (term, docID, elemID, tfipf)

Conceptual Database Design

Path
  pathID  pathexp
  1       /article
  2       /article/transaction
  3       /article/title
  :       :

Element
  docID  elemID  pathID  start  end  label
  1      1       1       1      236  XML Index
  1      2       2       10     44   database
  1      3       3       45     68   XML Index
  :      :       :       :      :    :

Term
  term      docID  elemID  tfipf
  database  1      1       0.3
  database  1      2       0.1
  :         :      :       :
  XML       1      1       0.3
  XML       1      3       0.4
  :         :      :       :

* label: short text representing the element
* tfipf: term weight in the element

[Figure: the example document tree annotated with elemID]
  1 article
    2 transaction: "database"
    3 title: "XML Index"
    4 body
      5 section
        6 title: "Introduction"
        7 p: "We explain XML database"
      8 section
        9 title: "XML Labeling"
        10 p: "Query processing"

Schema Refinement (1/4)
• Materialized view
  - Join Element table, Path table, and Term table
• Partitioning the Term table with each term
  - Term_xyz (docID, elemID, tfipf, start, end, label, pathexp)
• Selecting outline elements and constructing an Outline table in advance
  - The system designer has to predefine outline elements
  - Outline (docID, elemID, start, end, label, pathexp)

Schema Refinement (2/4)
• Materialized view
  - Join Element table, Path table, and Term table

Term ⋈ Element ⋈ Path
  term      docID  elemID  tfipf  start  end  label      pathexp
  database  1      1       0.3    1      236  XML Index  /article
  database  1      2       0.1    10     44   database   /article/transaction
  :         :      :       :      :      :    :          :
  XML       1      1       0.3    1      236  XML Index  /article
  XML       1      3       0.4    45     68   XML Index  /article/title
  :         :      :       :      :      :    :          :

Schema Refinement (3/4)
• Partitioning the table by terms
  - Term_xyz (docID, elemID, tfipf, start, end, label, pathexp)

Term_database
  docID  elemID  tfipf  start  end  label      pathexp
  1      1       0.3    1      236  XML Index  /article
  1      2       0.1    10     44   database   /article/transaction
  :      :       :      :      :    :          :

Term_XML
  docID  elemID  tfipf  start  end  label      pathexp
  1      1       0.3    1      236  XML Index  /article
  1      3       0.4    45     68   XML Index  /article/title
  :      :       :      :      :    :          :

Schema Refinement (4/4)
• Selecting outline elements and constructing an Outline table in advance
  - The system designer predefines outline elements
  - Outline (docID, elemID, start, end, label, pathexp)

[Figure: the example document tree annotated with elemID; the two section elements (elemID 5 and 8) are the outline elements]

Outline
  docID  elemID  start  end  label         pathexp
  1      5       75     143  Introduction  /article/body/section
  1      8       144    219  XML Labeling  /article/body/section
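The first three refinement steps can be tried out with a small sketch in Python's sqlite3 module. It is a minimal illustration under several assumptions and not the Kikori-KS storage code. SQLite has no materialized views, so the join is stored as an ordinary table. The table name `TermElementPath` and the functions `partition_by_term` and `lookup` are hypothetical. Only the rows visible on the slides are loaded. Sorting by tfipf in `lookup` is for display only and does not represent the system's ranking model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual schema from the slides ("end" is quoted because END is an SQL keyword).
conn.executescript("""
CREATE TABLE Path (pathID INTEGER PRIMARY KEY, pathexp TEXT);
CREATE TABLE Element (docID INTEGER, elemID INTEGER, pathID INTEGER,
                      start INTEGER, "end" INTEGER, label TEXT);
CREATE TABLE Term (term TEXT, docID INTEGER, elemID INTEGER, tfipf REAL);
""")

# Only the sample rows that are visible on the slides.
conn.executemany("INSERT INTO Path VALUES (?, ?)", [
    (1, "/article"), (2, "/article/transaction"), (3, "/article/title")])
conn.executemany("INSERT INTO Element VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 236, "XML Index"),
    (1, 2, 2, 10, 44, "database"),
    (1, 3, 3, 45, 68, "XML Index")])
conn.executemany("INSERT INTO Term VALUES (?, ?, ?, ?)", [
    ("database", 1, 1, 0.3), ("database", 1, 2, 0.1),
    ("XML", 1, 1, 0.3), ("XML", 1, 3, 0.4)])

# Stand-in for the materialized view: precompute the three-way join once.
conn.execute("""
CREATE TABLE TermElementPath AS
SELECT t.term AS term, t.docID AS docID, t.elemID AS elemID,
       t.tfipf AS tfipf, e.start AS start, e."end" AS "end",
       e.label AS label, p.pathexp AS pathexp
FROM Term t
JOIN Element e ON e.docID = t.docID AND e.elemID = t.elemID
JOIN Path p ON p.pathID = e.pathID
""")


def partition_by_term(conn):
    """Create one Term_<term> table per distinct term (hypothetical helper)."""
    terms = [r[0] for r in
             conn.execute("SELECT DISTINCT term FROM TermElementPath")]
    for term in terms:
        if not term.isalnum():  # the term becomes part of a table name
            raise ValueError(f"unsupported term: {term!r}")
        conn.execute(
            f'CREATE TABLE "Term_{term}" (docID INTEGER, elemID INTEGER, '
            'tfipf REAL, start INTEGER, "end" INTEGER, label TEXT, pathexp TEXT)')
        conn.execute(
            f'INSERT INTO "Term_{term}" '
            'SELECT docID, elemID, tfipf, start, "end", label, pathexp '
            'FROM TermElementPath WHERE term = ?', (term,))


def lookup(conn, keyword):
    """Read one keyword's partition; no join is needed at query time."""
    if not keyword.isalnum():
        raise ValueError(f"unsupported keyword: {keyword!r}")
    return conn.execute(
        f'SELECT docID, elemID, tfipf, label, pathexp FROM "Term_{keyword}" '
        'ORDER BY tfipf DESC').fetchall()


partition_by_term(conn)
print(lookup(conn, "XML"))
print(lookup(conn, "database"))
```

The point of the design is visible in `lookup`: a keyword is answered by scanning a single small table that already carries the position, label and path of each element, so no join against Element or Path is needed at search time.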