Techniques and Rule Patterns for Declaratively Querying Web Data with FLORID

Bertram Ludäscher, Rainer Himmeröder, Wolfgang May
Institut für Informatik, Universität Freiburg, Germany

Overview
  • Introduction
  • FLORID Web model
  • Integration of Web Access with DOOD paradigm
  • Data Integration: A Case Study
  • Navigation
  • Conclusions

INTRODUCTION

Goal: A uniform framework/system for
  • Querying the Web:
      - express declaratively how to query/navigate on the Web
      - extract data from Web pages for populating a database
      - Web-data warehousing
  • Management of Semistructured Data:
      - structure is irregular, partial, unknown, implicit in the data
      - example: HTML pages
      - querying/navigation using general path expressions (both in the web, via links, and in the database)
      - discover structure
  • Information Integration:
      - heterogeneous sources with different structure
      - wrappers, mediators

QUERYING THE WEB WITH F-LOGIC/FLORID

  • DOOD Paradigm:
      - deduction: data-driven exploration of the Web and high-level querying
      - object-orientation: flexible modeling of semistructured data (optional methods instead of NULLs)
  • extension of F-logic for querying and restructuring the Web: Web-FLORID
  • declarative rule-based programming style: uniform language for wrappers and mediators
  • meta features: schema browsing/reasoning, variables at class/method positions
  • restructuring of information
  • navigation by (general) path expressions
  ⇒ uniform access to local db and Web data
  ⇒ integration of heterogeneous information

F-LOGIC IN A NUTSHELL

Basic Constructs:
  • ISA-relation:       Object : Class
  • SUBCLASS-relation:  SubClass :: Class
  • SIGNATURE:          Class[Method@(P-types) => R-type]             (single-valued)
                        Class[Method@(P-types) =>> R-types]           (multi-valued)
  • DATA:               Object[Method@(Params) -> R]                  (single-valued)
                        Object[Method@(Params) ->> {R1,R2}]           (multi-valued)
  • PATH EXPRESSION:    Obj.M1@(P1,...)..M2@(P2,...)
                        ("." applies a single-valued method, ".." a multi-valued one)

Object Creation via Path Expressions in the Head:
    X.father:man   :- X:person.
    X.mother:woman :- X:person.

    ?- _:person.M:C.
       M/father, C/man
       M/mother, C/woman

WEB MODEL

  • The Web: a graph, consisting of nodes (urls) containing web documents, and links (hrefs@(label))

[Figure: two HTML documents wd1 (at url1) and wd2 (at url2); wd1 contains the anchor <A HREF="url2">label</A>, which links it to wd2.]

Link Structure:
    Signature:  webdoc[hrefs@(string) =>> url]
    Example:    wd1:webdoc[hrefs@("label") ->> "url2"]

Further Attributes:
    webdoc[self => url; address => string; modif => string; ...; error =>> string]

Additional: user-programmed evaluation of the web documents.

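The link structure above can be mimicked outside FLORID. The following is a minimal Python sketch, not FLORID code: it assumes a web document is represented as a dictionary, and the names `HrefCollector` and `make_webdoc` are hypothetical helpers introduced only for illustration. It shows how the multi-valued method hrefs@(label) ->> url corresponds to a mapping from link labels to sets of target URLs.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects label -> {urls} from <A HREF="url">label</A> anchors."""

    def __init__(self):
        super().__init__()
        self.hrefs = defaultdict(set)  # multi-valued: one label may reach several urls
        self._url = None               # target of the anchor currently open, if any
        self._label = []               # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._url = href
                self._label = []

    def handle_data(self, data):
        if self._url is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._url is not None:
            self.hrefs["".join(self._label).strip()].add(self._url)
            self._url = None


def make_webdoc(url, html):
    """Builds a dictionary playing the role of a webdoc object (assumed layout)."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return {"self": url, "hrefs": dict(collector.hrefs)}


wd1 = make_webdoc("url1", '<HTML><HEAD></HEAD><A HREF="url2">label</A></HTML>')
print(wd1["hrefs"])  # {'label': {'url2'}}
```

Using a set per label keeps the sketch close to the multi-valued reading of hrefs: two anchors with the same label simply contribute two URLs.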
INTEGRATION OF THE WEB MODEL IN THE DEDUCTIVE SYSTEM

[Figure: the F-logic DB with the classes url and webdoc; u.get leads from a url u to its webdoc, whose hrefs and address lead back to urls.]

Signature:
    url :: string.
    url[get => webdoc].

Rule-Based Exploration: for U:url, deriving U.get triggers the Web access:
    - generate the OID U.get
    - add it to class webdoc:  U.get:webdoc
    - fill in slots:           U.get[address -> ...; hrefs@(...) ->> ...; ...]
    - U is now explored

Following links from explored pages:
    NewU:url :- U:url.get[hrefs@(_) ->> NewU].

SEMANTICS

  • Path Expressions [FLU, VLDB'94]: closure axioms ⇒ extended Herbrand universe $U^*$, Herbrand base $HB$
  • Web Interface:
      - set of reserved names $R = \{\mathit{get}, \mathit{url}, \mathit{hrefs}, \ldots\}$
      - $\mathit{URL} \subseteq U^*$
      - $\mathit{explore} : \mathit{URL} \to \mathcal{P}(HB)$ maps URLs to sets of new facts
  • Web Access Axiom: $H \subseteq HB$ satisfies the axiom iff for all facts $u{:}\mathit{url},\ u.\mathit{get} \in H$: $\mathit{explore}(u) \subseteq H$
      (if $u.\mathit{get}$ is defined for a URL $u$, then all explored data is in $H$)
      ⇒ minimal Herbrand Web Model
  • Integration with Bottom-up Evaluation:
      $T_P^W(H) := T_P(H) \cup \bigcup \{\, \mathit{explore}(u) \mid u{:}\mathit{url},\ u.\mathit{get} \in T_P(H) \,\}$
      ⇒ declarative semantics
      ⇒ if $\mathit{explore} = \emptyset$ then Web-FLORID = FLORID

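The interplay of bottom-up evaluation and Web access can be illustrated with a small fixpoint computation. This is a Python sketch under stated assumptions, not the FLORID implementation: facts are encoded as tuples, `FAKE_WEB` is a hypothetical in-memory stand-in for the Web, and `t_p` hard-codes two illustrative rules (every known url is accessed, and every link target becomes a url). The operator is written inflationary, that is, it keeps the facts it is given.

```python
# Hypothetical in-memory "Web": url -> list of (label, target url).
FAKE_WEB = {
    "u0": [("next", "u1"), ("other", "u2")],
    "u1": [("back", "u0")],
    "u2": [],
}


def explore(u):
    """Maps a URL to the set of facts obtained by accessing it."""
    facts = {("webdoc", u)}
    for label, target in FAKE_WEB.get(u, []):
        facts.add(("hrefs", u, label, target))
    return facts


def t_p(h):
    """One application of the (hard-coded, illustrative) program rules."""
    derived = set(h)
    for fact in h:
        if fact[0] == "url":      # rule: access every known url
            derived.add(("get", fact[1]))
        elif fact[0] == "hrefs":  # rule: every link target is a url
            derived.add(("url", fact[3]))
    return derived


def t_p_web(h):
    """T_P extended by the explored facts of every url u with u.get derived."""
    new = t_p(h)
    for fact in list(new):
        if fact[0] == "get" and ("url", fact[1]) in new:
            new |= explore(fact[1])
    return new


def fixpoint(h):
    """Iterates t_p_web until no new facts appear."""
    while True:
        nxt = t_p_web(h)
        if nxt == h:
            return h
        h = nxt


model = fixpoint({("url", "u0")})
explored = sorted(f[1] for f in model if f[0] == "webdoc")
print(explored)  # ['u0', 'u1', 'u2']
```

If `explore` returned the empty set for every URL, `t_p_web` would coincide with `t_p`, which mirrors the remark that Web-FLORID then reduces to FLORID.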
EXAMPLE: INTEGRATION OF CIA WORLD FACTBOOK AND WORLD ONLINE

CIA WORLD FACTBOOK (CIA)
  • geography, people, government, economy, ...
  • no cities (apart from country capitals)
  • information: link structure, formatted text
  • flat (text) structure, quite regular; only <BR>, <B>, <I> tags used for structuring

WORLD ONLINE (WOL)
  • administrative divisions, main cities
  • information: link structure, tables
  • structured (tables), but not regular (different table layout, columns)

INTEGRATION METHODOLOGY: Typical Steps and Rules

  • CIA Factbook: Matching via Regular Expressions

Accessing relevant pages:
    C[url@(cia)->U] :- C:continent[file@(cia)->FN], strcat(cia_src,FN,U).
    U:url.get :- C:continent[url@(cia)->U].
    cid(C):country[url@(cia)->U; name@(cia)->Label; continent->CT] :-
        CT:continent.url@(cia).get[hrefs@(Label)->>U].
    U:url.get :- _:country[url@(cia)->U].

Extracting "raw data":
    pattern(capital_name, "Capital: … \n …").
    pattern(total_area, "total area: … \n … sq km …").
    C[Method -> X] :- pattern(Method,RegEx),
        pmatch(C:country.url@(cia).get, RegEx, "\1", X).

Restructuring and data cleaning:
    C:real_country :- C:country[capital_name->CA], not substr("none",CA).

  • Patterns and rules for comma lists (ethnic groups, languages)

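The pattern-driven extraction step can be sketched in Python. This is only an analogue of the pattern/pmatch rules, not FLORID code: the page excerpt `PAGE` is made-up sample text, the two regular expressions are assumptions (the expressions used on the slide are not recoverable here), and `extract`, `is_real_country` and `split_commalist` are hypothetical helper names.

```python
import re

# Made-up excerpt in the style of a flat, <B>/<BR>-structured page (not real data).
PAGE = "<B>Capital:</B> Exampleville<BR>\n<B>total area:</B>\n 1,234 sq km<BR>\n"

# Method name -> regular expression with one capture group (assumed patterns).
PATTERNS = {
    "capital_name": r"Capital:</B>\s*([^<\n]+)",
    "total_area": r"total area:</B>\s*([\d,]+) sq km",
}


def extract(page, patterns):
    """For every pattern that matches, store the first group under its method name."""
    facts = {}
    for method, regex in patterns.items():
        match = re.search(regex, page)
        if match:
            facts[method] = match.group(1).strip()
    return facts


def is_real_country(facts):
    """Data cleaning: keep only entries whose capital does not contain 'none'."""
    capital = facts.get("capital_name")
    return capital is not None and "none" not in capital


def split_commalist(text):
    """Splits a comma list such as 'German, French' into its items."""
    return [item.strip() for item in text.split(",") if item.strip()]


facts = extract(PAGE, PATTERNS)
print(facts)  # {'capital_name': 'Exampleville', 'total_area': '1,234'}
```

Keeping the patterns in a table keyed by method name follows the idea of the generic rule: adding a new attribute means adding one pattern, not a new extraction routine.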
INTEGRATION METHODOLOGY: Typical Steps and Rules

  • WOL Pages: Parsing (nsgmls parser integrated into FLORID) and Evaluating

Accessing and parsing relevant pages:
    U:url.parse :- C:country[url@(wol)->U].
    /* Generates parsetree of the document */

    element(Tab,Row,Col)[contents->Cont; type->Type] :-
        Tab:(U.parse.table),
        Tab … table … tbody@(Row) … tr@(Col) … X … Type … Cont.

Identifying the main-cities table and column attributes:
    C[main_city_tab -> T[header_row->HZ; pop_year@(PS)->Y; city_col->>CS; pop_col->>PS]] :-
        C:country[url@(wol)->U], T:(U.parse.table),
        element(T,_,_)[contents->Cont], substr(Cont,"main cities"),
        element(T,HZ,CS)[contents->Header; type->th], substr("city",Header),
        element(T,HZ,PS)[contents->Header; type->th], substr("pop",Header),
        pmatch(Header, "…", "…", Y).

Evaluation of the main-cities table:
    C[main_cities ->> cty(C,CN):city[country->C; namestr->CN; population@(Y)->P]] :-
        C:country[main_city_tab -> T[city_col->>CS; pop_col->>PS; pop_year@(PS)->Y]],
        element(T,DZ,CS)[contents->CN; type->td],
        element(T,DZ,PS)[contents->P; type->td].
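The table evaluation can be sketched in Python as well. This is an analogue, not FLORID code, and it simplifies the rules above: it uses Python's standard HTML parser instead of nsgmls, handles a single table, and detects the year with an assumed four-digit pattern. The sample table and the names `TableCells`, `find_columns` and `main_cities` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Records cells as elements[(row, col)] = (cell type, contents)."""

    def __init__(self):
        super().__init__()
        self.elements = {}
        self.row = -1
        self.col = -1
        self._type = None  # "th" or "td" while inside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = -1
        elif tag in ("th", "td"):
            self.col += 1
            self._type = tag
            self._text = []

    def handle_data(self, data):
        if self._type is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._type == tag:
            self.elements[(self.row, self.col)] = (tag, "".join(self._text).strip())
            self._type = None


def find_columns(elements):
    """Finds the city column and the population columns (with year) from th cells."""
    city_col, pop_cols = None, {}
    for (row, col), (kind, text) in elements.items():
        if kind != "th":
            continue
        if "city" in text.lower():
            city_col = col
        elif "pop" in text.lower():
            year = re.search(r"\d{4}", text)  # assumed: year appears as four digits
            if year:
                pop_cols[col] = year.group(0)
    return city_col, pop_cols


def main_cities(elements):
    """Yields (city name, year, population) for every data row."""
    city_col, pop_cols = find_columns(elements)
    result = []
    for (row, col), (kind, text) in sorted(elements.items()):
        if kind == "td" and col == city_col:
            for pop_col, year in sorted(pop_cols.items()):
                cell = elements.get((row, pop_col))
                if cell and cell[0] == "td":
                    result.append((text, year, cell[1]))
    return result


SAMPLE = ("<table><tr><th>City</th><th>Pop. 1990</th></tr>"
          "<tr><td>Exampleville</td><td>1000</td></tr>"
          "<tr><td>Sampletown</td><td>500</td></tr></table>")
parser = TableCells()
parser.feed(SAMPLE)
parser.close()
cities = main_cities(parser.elements)
print(cities)  # [('Exampleville', '1990', '1000'), ('Sampletown', '1990', '500')]
```

Locating columns through their header text, instead of fixed positions, is what lets the same code cope with tables whose layout differs from page to page.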