Template-Based Information Mining from HTML Documents
Jane Yung-jen Hsu & Wen-tau Yih
Computer Science and Information Engineering
National Taiwan University
AAAI-97

Outline
• The Web Information Mining Problem
• A Model of Electronic Documents
• Document Templates
• Template-based Information Extraction
• A Case Study: FAQ Miner
• Conclusion

Web Information Mining
• Search for relevant documents
  • Search engines
  • Web guides
  • White & yellow pages
• Extract target information from documents
  • Document analysis
  • Information extraction in resource discovery
  • Smart web shopping

The Myth about Keywords
• Relevant information can be found using keyword-based methods, e.g.
  • Search for relevant documents
  • Filter undesirable information
  • Extract useful information
• Are keywords sufficient to satisfy most of our informational needs?

Problem with Keywords: Example
[Figure]
Sample FAQ Documents
[Figure]

Semi-Structured Document Hypothesis
• A semi-structured document, e.g. an HTML document with tags, provides sufficient structural hints to enable effective extraction of semantically meaningful information.
• Machine readable ↑ machine usable

Basic Elements of A Document
• Content
  • the actual data in a document
• Format
  • the visual presentation of a document
• Structure
  • the logical elements and their relationships

  MEMORANDUM

  TO: JOHN SMITH, GRADUATE OFFICE
  FROM: MARK SAM
  SUBJ: STUDENT APPEALS MEETING
  DATE: 8 APR, 1997

  There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.

  Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Content
  MEMORANDUM TO: JOHN SMITH, GRADUATE OFFICE FROM: MARK SAM SUBJ: STUDENT APPEALS MEETING DATE: 8 APR, 1997 There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore. Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Format
[Figure: the memo's layout with its text replaced by filler ("Blah blah"), shown next to the original memo]

Structure
• Memorandum
  • Title
  • Header Block
    • Receiver field
    • Sender field
    • Subject field
    • Date field
  • Memo Body
    • Paragraph 1
    • Paragraph 2
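The hierarchy on the Structure slide can be written down as a small tree. The following is a minimal Python sketch under that reading, not the authors' representation; the `Node` class and the `outline` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One logical element of a document (hypothetical illustration type)."""
    name: str
    children: List["Node"] = field(default_factory=list)


# The memo structure from the slide, listed top to bottom.
memo = Node("Memorandum", [
    Node("Title"),
    Node("Header Block", [
        Node("Receiver field"),
        Node("Sender field"),
        Node("Subject field"),
        Node("Date field"),
    ]),
    Node("Memo Body", [
        Node("Paragraph 1"),
        Node("Paragraph 2"),
    ]),
])


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented lines in document (pre-order) order."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(memo)))
```

A pre-order walk visits the elements in the same order in which they appear in the memo, which is why the tree is enough to recover the reading order of the structural components.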
A Model of Electronic Documents
• S: a set of structural components
  • Title, Header Block, Memo Body
• C: the sequence of content symbols
  • MEMORANDUM TO: JOHN SMITH, GRADUATE OFFICE ....
• F: format properties of elements in C
  • Blah blah blah
• A partial ordering over S
• A mapping between C and S

Properties of Document Structure
• Context continuity
• Partial order between levels
• Total order within the same level
• Order-preserving
[Figure: tree with root Memo; children Title, header (receiver, Sender, Subject, Date) and body (Paragraph 1, Paragraph 2)]

Template-based Information Extraction
  Title
    MEMORANDUM
  Header
    Receiver: TO: JOHN SMITH, GRADUATE OFFICE
    Sender: FROM: MARK SAM
    Subject: SUBJ: STUDENT APPEALS MEETING
    Date: DATE: 8 APR, 1997
  Body
    Paragraph 1: There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.
    Paragraph 2: Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Memo in SGML
  <memorandum>
    <title> MEMORANDUM </title>
    <header>
      <rec> TO: JOHN SMITH, GRADUATE OFFICE </rec>
      <send> FROM: MARK SAM </send>
      <subj> SUBJ: STUDENT APPEALS MEETING </subj>
      <date> DATE: 8 APR, 1997 </date>
    </header>
    <body>
      <paragraph>
        There will be a meeting of the Committee on Student
        Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00
        p.m. in Room 504 Cullimore.
      </paragraph>
      <paragraph>
        Please make every effort to attend. If you cannot
        attend, please contact Mary Armour, ext. 1234.
      </paragraph>
    </body>
  </memorandum>

Memo in HTML (1/2)
  <BODY>
  <H1> MEMORANDUM </H1>
  <HR>
  <UL>
  <LI> TO: JOHN SMITH, GRADUATE OFFICE </LI>
  <LI> FROM: MARK SAM </LI>
  <LI> SUBJ: STUDENT APPEALS MEETING </LI>
  <LI> DATE: 8 APR, 1997 </LI>
  </UL>
  <HR>
  <P>
  There will be a meeting of the Committee on Student
  Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00
  p.m. in Room 504 Cullimore.
  </P>
  <P>
  Please make every effort to attend. If you cannot
  attend, please contact Mary Armour, ext. 1234.
  </P>
  </BODY>

Memo in HTML (2/2)
  <BODY>
  <P> <FONT SIZE=6> <B> MEMORANDUM </B></FONT> </P>
  <HR>
  <UL>
  <LI> TO: JOHN SMITH, GRADUATE OFFICE </LI>
  <LI> FROM: MARK SAM </LI>
  <LI> SUBJ: STUDENT APPEALS MEETING </LI>
  <LI> DATE: 8 APR, 1997 </LI>
  </UL>
  <HR>
  There will be a meeting of the Committee on Student
  Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00
  p.m. in Room 504 Cullimore.
  <BR> <BR>
  Please make every effort to attend. If you cannot
  attend, please contact Mary Armour, ext. 1234.
  </BODY>
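To make concrete how HTML tags can serve as structural hints, here is a minimal Python sketch that recovers the title, header fields, and paragraphs from the memo version marked up with <H1>, <UL>/<LI>, and <P>. It uses the standard-library `html.parser` module; the `MemoExtractor` class and the tag-to-component mapping are assumptions made for this illustration and are not taken from the talk.

```python
from html.parser import HTMLParser

MEMO_HTML = """
<BODY>
<H1> MEMORANDUM </H1>
<HR>
<UL>
<LI> TO: JOHN SMITH, GRADUATE OFFICE </LI>
<LI> FROM: MARK SAM </LI>
<LI> SUBJ: STUDENT APPEALS MEETING </LI>
<LI> DATE: 8 APR, 1997 </LI>
</UL>
<HR>
<P>
There will be a meeting of the Committee on Student
Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00
p.m. in Room 504 Cullimore.
</P>
<P>
Please make every effort to attend. If you cannot
attend, please contact Mary Armour, ext. 1234.
</P>
</BODY>
"""


class MemoExtractor(HTMLParser):
    """Hypothetical extractor: H1 -> title, LI -> header field, P -> paragraph."""

    def __init__(self):
        super().__init__()
        self.current = None   # tag whose text is being collected, if any
        self.buffer = []
        self.title = ""
        self.header = {}
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in ("h1", "li", "p"):
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        # Collapse line breaks and repeated spaces into single spaces.
        text = " ".join("".join(self.buffer).split())
        if tag == "h1":
            self.title = text
        elif tag == "li":
            # Split "LABEL: value" at the first colon only.
            label, _, value = text.partition(":")
            self.header[label.strip()] = value.strip()
        else:
            self.paragraphs.append(text)
        self.current = None


extractor = MemoExtractor()
extractor.feed(MEMO_HTML)
```

The same extractor would find no title or paragraphs in the version that uses <FONT>, <B>, and <BR>, because those tags describe format rather than structure; that contrast is the point of showing the memo in two HTML encodings.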
The FAQ Agent
[Figure: architecture diagram with components FAQ Documents, FAQ Worm, FAQ Miner, FAQ Information, FAQ Knowledge Base, User Input, FAQ Answer Finder, Answers]

Template: FAQ Documents
Standard_TFAQ
• Title
    <TITLE> TERM_faq_title </TITLE>
• toc
    index_indicator  TERM_TOC_indicator
    index_body  (ordered_list <OL> list_item* </OL> | unordered_list <UL> list_item* </UL>)
• q_a_pairs
    question_answer_paragraph*
list_item
    <LI> Hyperlink_Anchor TERM_question </A> </LI>

FAQ Miner Architecture
[Figure: flow diagram with labels FAQ Document, Template, Template Matching, Success, Fail, Extract Target Information, Template KB Modification, Learning Modules]

Sample FAQ documents
[Figure]

Experimental Results
  Template            # of documents   Success Ratio
  Standard_TFAQ       62               56.4%
  No_TOC_Indicator    10               9.1%
  Near Pass           13               11.8%
  Difficult           25               22.7%

Concluding Remarks
• Document structure facilitates information extraction.
  • HTML documents are tree-structured.
  • HTML tags provide hints for structural elements.
  • Effective information mining is possible.
• What's next?
  • Tree-structured document templates
  • Semantic parsing
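As an illustration of template matching, the following Python sketch checks a page against a simplified reading of the Standard_TFAQ template: a <TITLE>, followed by an <OL> or <UL> whose items each contain a hyperlink anchor around the question text. It is not the FAQ Miner implementation. The function `match_standard_tfaq`, the scanner class, and the sample page are hypothetical, and the sketch ignores the TOC indicator, the question/answer paragraphs, and nested lists. It also requires at least one question, which is stricter than `list_item*`.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class _FaqScanner(HTMLParser):
    """Collects the title and the (href, question) pairs of the first list."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.questions: List[Tuple[str, str]] = []
        self.list_seen = False   # an <OL> or <UL> has already been opened
        self.in_list = False
        self.in_title = False
        self.in_item = False
        self.in_anchor = False
        self.href: Optional[str] = None
        self.text: List[str] = []
        self.bad_item = False    # some list item lacked an anchored question

    def _close_item(self) -> None:
        if not self.in_item:
            return
        question = " ".join("".join(self.text).split())
        if self.href and question:
            self.questions.append((self.href, question))
        else:
            self.bad_item = True
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in ("ol", "ul") and not self.list_seen:
            self.list_seen = self.in_list = True
        elif tag == "li" and self.in_list:
            self._close_item()   # tolerate a missing </LI>
            self.in_item, self.href, self.text = True, None, []
        elif tag == "a" and self.in_item and self.href is None:
            self.href = dict(attrs).get("href")
            self.in_anchor = self.href is not None

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "a":
            self.in_anchor = False
        elif tag == "li":
            self._close_item()
        elif tag in ("ol", "ul") and self.in_list:
            self._close_item()
            self.in_list = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.in_anchor:
            self.text.append(data)


def match_standard_tfaq(html: str) -> Optional[dict]:
    """Return extracted fields on a match, or None when matching fails."""
    scanner = _FaqScanner()
    scanner.feed(html)
    scanner.close()
    title = " ".join(scanner.title.split())
    if not title or not scanner.questions or scanner.bad_item:
        return None
    return {"title": title, "questions": scanner.questions}


# Made-up sample page, used only to exercise the sketch.
SAMPLE = """
<HTML><HEAD><TITLE> Example FAQ </TITLE></HEAD>
<BODY>
<H2>Table of Contents</H2>
<OL>
<LI><A HREF="#q1">What is a template?</A></LI>
<LI><A HREF="#q2">How is matching done?</A></LI>
</OL>
</BODY></HTML>
"""

result = match_standard_tfaq(SAMPLE)
```

Returning `None` on failure mirrors the Fail branch of the architecture slide: a caller could try another template or flag the document for template modification.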