Search, Analysis, and Integration of Web Documents: A Case Study with FLORID

Rainer Himmeröder, Paul-Th. Kandzia, Bertram Ludäscher, Wolfgang May, Georg Lausen
Institut für Informatik, Universität Freiburg, Germany

Overview
- Introduction/Motivation
- FLORID Web model
- Integration: CIA WORLD FACTBOOK and WORLD ONLINE
- (Semistructured Data)
- Conclusions

MOTIVATION

Goal: A uniform framework for
- Querying the Web:
  - express declaratively how to query/navigate on the Web
  - extract data from Web pages for populating a database
  - Web-data warehousing
- Management of Semistructured Data:
  - structure is irregular, partial, unknown, implicit in the data
  - example: HTML pages
  - querying/navigation using general path expressions
  - discover structure
- Information Integration:
  - heterogeneous sources with different structure
  - wrappers, mediators

QUERYING THE WEB WITH F-LOGIC/FLORID

- DOOD Paradigm:
  - deduction: for data-driven exploration of the Web and high-level querying
  - object-orientation: for flexible modeling of semistructured data (optional methods instead of NULLs)
- extension of F-logic for querying and restructuring the Web: Web-FLORID
- declarative rule-based programming style: uniform language for wrappers and mediators
- meta features: schema browsing/reasoning, variables at class/method positions
  - restructuring of information
- navigation by (general) path expressions

Result: uniform access to local db and Web data; integration of heterogeneous information

F-LOGIC IN A NUTSHELL

Basic Constructs:
- ISA-relation:       Object : Class
- SUBCLASS-relation:  SubClass :: Class
- SIGNATURE:          Class[Method@(P-types) => R-type]      (single-valued)
                      Class[Method@(P-types) =>> R-types]    (multi-valued)
- DATA:               Object[Method@(Params) -> R]           (single-valued)
                      Object[Method@(Params) ->> {R1,R2}]    (multi-valued)
- PATH EXPRESSION:    Obj.M1@(P1)[Spec1]..M2@(P2)[Spec2]

Object Creation via Path Expressions in the Head:
  X.father:man   :- X:person.
  X.mother:woman :- X:person.

  ?- _:person.M:C.
     M/father, C/man
     M/mother, C/woman
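
The effect of the two head rules can be emulated outside FLORID. The following is a minimal Python sketch, not FLORID code: the person names `peter` and `mary` and the tuple encoding of facts are assumptions made for illustration. Each rule application creates a new object whose identifier is the path expression itself, and the final query ranges over method and class names, as the slide's answer set suggests.

```python
# Hypothetical base facts: two objects of class person.
persons = {"peter", "mary"}
isa = {(p, "person") for p in persons}   # (object, class) pairs
method_facts = set()                     # (object, method, result) triples

# The two rules from the slide: X.father:man and X.mother:woman for X:person.
head_rules = [("father", "man"), ("mother", "woman")]
for x in sorted(persons):
    for method, cls in head_rules:
        oid = f"{x}.{method}"            # the path expression names the new object
        method_facts.add((x, method, oid))
        isa.add((oid, cls))

# Query with variables at method and class positions:
# which method M leads from a person to an object of which class C?
answers = sorted({
    (m, c)
    for (x, m, oid) in method_facts if (x, "person") in isa
    for (obj, c) in isa if obj == oid
})
print(answers)  # [('father', 'man'), ('mother', 'woman')]
```

The rules are applied once here; the created objects are not persons themselves, so no further objects are derived.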

WEB MODEL

[Figure: two HTML documents at url1 and url2, represented as objects wd1 and wd2. An anchor <A HREF="url2">label</A> in the first document is modeled as hrefs@(label) pointing to url2.]

Link Structure:
  Signature:  webdoc[hrefs@(string) =>> url].
  Example:    wd1:webdoc[hrefs@("label") ->> "url2"].

Further Attributes:
  webdoc[self => url; address => string; modif => string; ...; error =>> string].
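
To make the link-structure part of the model concrete, here is a small Python sketch that derives the `hrefs@(label) ->> url` entries of one page. It uses only the standard-library `html.parser` module; the class name `HrefCollector`, the function `webdoc`, and the dictionary layout are illustrative assumptions, not part of FLORID.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects label -> {target URLs} for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = defaultdict(set)
        self._target = None   # href of the anchor currently open, if any
        self._label = []      # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":        # HTMLParser lowercases tag and attribute names
            self._target = dict(attrs).get("href")
            self._label = []

    def handle_data(self, data):
        if self._target is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            self.hrefs["".join(self._label).strip()].add(self._target)
            self._target = None


def webdoc(url, html):
    """Builds a dictionary standing in for a webdoc object."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return {"self": url, "hrefs": dict(parser.hrefs)}


page = '<HTML><HEAD></HEAD><A HREF="url2">label</A></HTML>'
wd1 = webdoc("url1", page)
print(wd1["hrefs"])  # {'label': {'url2'}}
```

A set is used per label because `hrefs` is multi-valued: the same label may occur on several links within one page.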

F-LOGIC VIEW ON THE WEB

[Figure: the F-LOGIC-DB with the classes url and webdoc; a URL object u is connected by get to a webdoc object, which carries address, hrefs, ...]

Signature:  url::string[get => webdoc].

Rule-Based Exploration:
  U.get :- U:url, ... .
    - generate OID U.get
    - add U.get to class webdoc
    - fill in slots: U.get[address -> ...; hrefs@(...) ->> ...; ...]

  U:explored   :- U:url.get[].
  U:unexplored :- U:url, not U:explored.

SEMANTICS

- Extension of F-logic by
  - Path Expressions [FLU, VLDB'94]: extended Herbrand universe $U$, Herbrand base $HB$, closure axioms
  - Web Interface
- set of reserved names $R = \{\mathit{get}, \mathit{url}, \mathit{hrefs}, \ldots\}$
- $\mathit{explore} : \mathit{URL} \to \mathcal{P}(HB)$ maps URLs to sets of new facts
- Web Access Axiom: $H \subseteq HB$ satisfies the axiom iff $\mathit{explore}(u) \subseteq H$ for all $u$ with $u{:}\mathit{url},\ u.\mathit{get} \in H$
  (if $u.\mathit{get}$ is defined for a URL $u$, then all explored data is in $H$)
  $\Rightarrow$ minimal Herbrand Web Model
- Integration with Bottom-up Evaluation:
  $$T_P^W(H) := T_P(H) \cup \bigcup \{\, \mathit{explore}(u) \mid u{:}\mathit{url},\ u.\mathit{get} \in T_P(H) \,\}$$
- declarative semantics; if $\mathit{explore} \equiv \emptyset$ then Web-FLORID = FLORID
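
The bottom-up evaluation with Web access can be illustrated by a toy fixpoint computation. This is a Python sketch under stated assumptions, not the FLORID implementation: the in-memory dictionary `WEB` stands in for the real Web, facts are plain tuples, and the program consists of two hard-coded rules (every link target is a URL; `get` is requested for every URL).

```python
# Hypothetical in-memory stand-in for the Web: URL -> link targets.
WEB = {
    "url1": ["url2", "url3"],
    "url2": ["url1"],
    "url3": [],
}


def explore(u):
    """Maps a URL to the set of new facts contributed by its document."""
    facts = {("isa", f"{u}.get", "webdoc")}
    for target in WEB.get(u, []):
        facts.add(("hrefs", f"{u}.get", target))
    return facts


def t_p(h):
    """One application of the program rules to the fact set h."""
    new = set(h)
    for fact in h:
        if fact[0] == "hrefs":                       # link targets are URLs
            new.add(("isa", fact[2], "url"))
        if fact[0] == "isa" and fact[2] == "url":    # U.get :- U:url.
            new.add(("get", fact[1]))
    return new


def t_p_web(h):
    """T_P extended by Web access: add explore(u) for each u:url with u.get."""
    step = t_p(h)
    for fact in list(step):
        if fact[0] == "get" and ("isa", fact[1], "url") in step:
            step |= explore(fact[1])
    return step


def fixpoint(h):
    """Iterates t_p_web until no new facts are derived."""
    while True:
        nxt = t_p_web(h)
        if nxt == h:
            return h
        h = nxt


model = fixpoint({("isa", "url1", "url")})
explored = sorted(f[1] for f in model if f[0] == "get")
print(explored)  # ['url1', 'url2', 'url3']
```

If `explore` returned the empty set for every URL, `t_p_web` would coincide with `t_p`, which mirrors the remark that Web-FLORID then reduces to FLORID.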

EXAMPLE: INTEGRATION OF CIA WORLD FACTBOOK and WORLD ONLINE

CIA WORLD FACTBOOK (CIA):
- geography, people, government, economy, ...
- no cities (apart from country capitals)
- information: link structure, formatted text
- very structured and regular
- complete

WORLD ONLINE (WOL):
- administrative divisions, main cities
- information: link structure, tables
- not very regular
- incomplete

WOL author: "All visitors must realize that this site (i.e. collecting the data and putting it up here) is a logical development of one of my hobbies; you therefore cannot expect all data to be of academic standard. What you see is what you get, although I try to be as thorough as possible."

INTEGRATION METHODOLOGY: Typical Steps and Rules

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ACCESSING RELEVANT PAGES:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C[url@(cia)->U] :- C:continent[file@(cia)->FN], strcat(cia_src,FN,U).
U:url.get :- C:continent[url@(cia)->U].

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% EXTRACTING "RAW DATA":
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
pattern(capital, "Capital:...\n...").
pattern(total_area, "total area...\n...sq km...").
C[Method -> X] :- pattern(Method, RegEx),
                  pmatch(C:country.url@(cia).get, RegEx, "...", X).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% RESTRUCTURING AND DATA CLEANING:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C:real_country :- C:country[capital->CA], not substr("none", CA).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INTEGRATION OF SOURCES, OBJECT FUSION:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C1 = C2 :- C1:country[continent->CT]..main_cities[name@(wol)->N],
           C2:country[continent->CT; capital->N; name@(cia)->_],
           not C1=C2.
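
The extraction, cleaning, and fusion steps can be mirrored in a short Python sketch. Everything concrete in it is an assumption for illustration: the country and city names, the page layout, and the two regular expressions are invented, since the original patterns are not recoverable from the slide. Only the shape of the rules follows the slide: one pattern per method name, a filter on capitals containing "none", and fusion of two country objects when they lie on the same continent and the CIA capital appears among the WOL main cities.

```python
import re

# Hypothetical inputs: names, layout, and values are invented.
CIA_PAGES = {
    "cia:freedonia": ("Northland", "Capital: Fredville\ntotal area: 100 sq km\n"),
    "cia:iceshelf": ("Southland", "Capital: none\ntotal area: 50 sq km\n"),
}
WOL_COUNTRIES = {
    "wol:freedonia": {"continent": "Northland",
                      "main_cities": ["Port Fred", "Fredville"]},
}

# Assumed regular expressions, one per method name (cf. pattern/2 above).
PATTERNS = {
    "capital": r"Capital:\s*(.+)",
    "total_area": r"total area:\s*(\d+) sq km",
}


def extract(text):
    """Returns {method: value} for every pattern that matches the page text."""
    found = {}
    for method, regex in PATTERNS.items():
        match = re.search(regex, text)
        if match:
            found[method] = match.group(1).strip()
    return found


# Extracting raw data: one object per CIA country page.
cia = {oid: {"continent": cont, **extract(text)}
       for oid, (cont, text) in CIA_PAGES.items()}

# Data cleaning: keep countries whose capital does not contain "none".
real_countries = {oid: c for oid, c in cia.items()
                  if "none" not in c.get("capital", "none")}

# Object fusion: same continent, and the CIA capital is a WOL main city.
fused = [
    (w_oid, c_oid)
    for w_oid, w in WOL_COUNTRIES.items()
    for c_oid, c in cia.items()
    if w["continent"] == c["continent"] and c.get("capital") in w["main_cities"]
]
print(fused)  # [('wol:freedonia', 'cia:freedonia')]
```

In FLORID the fusion is expressed as an equality between object identifiers; the list of identifier pairs above only records which objects would be identified.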

QUERYING THE INTEGRATED DATA

% QUERY: "Name the capitals (from CIA) with their population (from WOL)"
?- _:country[name@(cia) -> Country; capital -> City],
   _:city[name@(wol) -> City; population -> P].

P/"...", City/"Vienna", Country/"Austria"
P/"...", City/"Prague", Country/"Czech Republic"
P/"...", City/"Paris", Country/"France"
P/"...", City/"Berlin", Country/"Germany"
P/"...", City/"Budapest", Country/"Hungary"
P/"...", City/"Madrid", Country/"Spain"
P/"...", City/"Stockholm", Country/"Sweden"
P/"...", City/"Bern", Country/"Switzerland"
P/"...", City/"London", Country/"United Kingdom"

9 output(s) printed
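
The query joins the two sources through the shared variable City. A rough Python equivalent, using invented toy objects rather than the real Factbook or World Online data, could look as follows; the nested `name` dictionaries imitate the parameterized method `name@(source)`.

```python
# Hypothetical integrated objects; all names and numbers are invented.
countries = [
    {"name": {"cia": "Freedonia"}, "capital": "Fredville"},
    {"name": {"cia": "Sylvania"}, "capital": "Sylvan City"},
]
cities = [
    {"name": {"wol": "Fredville"}, "population": 1000},
    {"name": {"wol": "Port Fred"}, "population": 200},
]

# Join on City: the CIA capital must equal the WOL city name.
answers = [
    {"P": city["population"],
     "City": city["name"]["wol"],
     "Country": country["name"]["cia"]}
    for country in countries
    for city in cities
    if country["capital"] == city["name"]["wol"]
]

for answer in answers:
    print(answer)
print(f"{len(answers)} output(s) printed")
```

Capitals without a matching WOL city (here "Sylvan City") simply produce no answer, which corresponds to WOL being the incomplete source.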