Exploration and Mining of Web Repositories

Exploration and Mining of Web Repositories
WSDM 2014 Tutorial
Nan Zhang, George Washington University
Gautam Das, University of Texas at Arlington

Outline
• Introduction: Web Search and Data Mining
• Resource Discovery and Interface Understanding
• Technical Challenges for Data Mining
• Exploration Beyond Top-k
• Sampling
• Data Analytics
• Final Remarks

Surface Web & Deep Web
• Surface Web
  o Inter-linked web pages, ~167 terabytes [1]
  o Searchable through search engines
• Deep Web
  o Dynamic contents, unlinked pages, private web, contextual web, etc.
  o ~91,850 terabytes [1], much larger than the surface web [2]
  o Mostly out of reach of search engines
[1] SIMS, UC Berkeley, How much information? 2003
[2] Bright Planet, Deep Web FAQs, 2010, http://www.brightplanet.com/the-deep-web/

Surface Web Search
[Diagram: keyword query → (surface-)web search engine → top-k query answer. Components inside the search engine: retrieval/ranking system, index, indexing, document corpus.]

Deep Web Search
[Diagram: structured query → deep web database search → top-k query answer. Components inside the database search: query processing/ranking system, index, indexing, back-end database.]
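The top-k restriction in the diagram above can be made concrete with a toy simulation. This is only a sketch under stated assumptions: the `Car` record, the `TopKInterface` class, the price-based ranking and the overflow flag are all hypothetical and do not describe any real site's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    make: str
    price: int


class TopKInterface:
    """Toy stand-in for a deep web search form (hypothetical, not a real API)."""

    def __init__(self, rows, k):
        self._rows = list(rows)  # back-end database, hidden from the caller
        self.k = k

    def query(self, **conditions):
        """Return at most k matching rows plus a flag telling whether more exist."""
        matches = [
            row for row in self._rows
            if all(getattr(row, attr) == value for attr, value in conditions.items())
        ]
        matches.sort(key=lambda row: row.price)  # assumed ranking: cheapest first
        return matches[: self.k], len(matches) > self.k


db = TopKInterface(
    [Car("Ford", 9000), Car("Ford", 12000), Car("Ford", 7000), Car("Honda", 8000)],
    k=2,
)
top, overflow = db.query(make="Ford")
print([car.price for car in top], overflow)
```

A third party sees only what `query` returns, so any mining task has to be planned as a sequence of such restricted calls.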

Mining Web Repositories
[Figure: example mining tasks. Classification: real or fake? Document clustering: disease info, treatment info.]

How to Mine Web Repositories? (for surface web pages)
Surface Web Approach vs. Deep Web Approach
• Surface Web Approach
  + (Relatively) unrestricted access to each web page
  + One data source to consider
• Deep Web Approach
  - Severely restricted access interface
  - Many data sources to consider

How to Mine Web Repositories? (for deep web repositories)
[Figure labels: Hidden Repository, Web, Owner, User]

Our Focus: Deep Web Approach
How to efficiently mine data repositories and search engine corpora in the deep web?

Deep Web Repository: Example I
Enterprise search engine's corpus
• Unstructured data
• Keyword search
• Top-k
[Figure: example keyword query "Asthma"]

Exploration: Example I
Metasearch engine
• Discovers deep web repositories of a given topic
• Integrates query answers from multiple repositories
• For result re-organization, evaluates the quality of each repository through data analytics and mining
  o e.g., how large is the repository?
  o e.g., clustering of documents
[Figure labels: Disease info, Treatment info]

Example II
Yahoo! Auto, other online e-commerce websites
• Structured data
• Form-like search
• Top-1500

Exploration: Example II
Third-party analytics & mining of an individual repository
• Price distribution
• Price anomaly detection
• Classification: fake or real?
Third-party mining of multiple repositories
• Repository comparison
• Consumer behavior analysis
Main Tasks
• Resource discovery
• Data integration
• Single-/Cross-site mining

Example III
• Semi-structured data
• Graph browsing
• Local view
Picture from Jay Goldman, Facebook Cookbook, O'Reilly Media, 2008.

Exploration: Example III
For commercial advertisers:
• Market penetration of a social network
• "Buzz words" tracking
For private detectives:
• Find pages related to an individual
For individual page owners:
• Understand the (relative) popularity or followers of one's own page
• Understand how new posts affect the popularity
• Understand how to promote the page
Main Tasks: resource discovery and data integration are less of a challenge; analytics and mining of very large amounts of data become the main challenge.

Summary of Main Tasks/Obstacles
• Find where the data are
  o Resource discovery: find URLs of deep web repositories
  o Required by: metasearch engine, shopping website comparison, consumer behavior modeling, etc.
  Covered by many recent tutorials [Dong and Srivastava VLDB 13, ICDE 13, Weikum and Theobald ICDE 13, PODS 10, Chiticariu et al SIGMOD 10, Dong and Naumann VLDB 09, Franklin, Halevy and Maier VLDB 08]
• Understand the web interface
  o Required by almost all applications.
  Demoed by research prototypes and product systems: WebTables, TextRunner
• Mine the underlying data
  o Through crawling, sampling, and/or analytics
  o Required by: metasearch engine, keep it real fake, price prediction, universal mobile interface, shopping website comparison, consumer behavior modeling, market penetration analysis, social page evaluation and optimization, etc.

Outline of This Tutorial
• Brief overview of:
  o Resource discovery
  o Interface understanding
  o i.e., where to, and how to, issue a search query to a deep web repository?
• Our focus: mining through crawling, sampling, analytics
  Which individual search and/or browsing requests should a third-party explorer issue to the web interface of a given deep web repository, in order to enable efficient data mining?

Outline
• Introduction
• Resource Discovery and Interface Understanding
• Technical Challenges for Data Exploration
• Crawling
• Sampling
• Data Analytics
• Final Remarks

Resource Discovery
• Objective: discover resources of "interest"
  o Task 1: is a URL of interest?
    • Criterion A: is a deep web repository
    • Criterion B: belongs to a given topic
  o Task 2: find all interesting URLs
• Task 1, Criterion A
  o Transactional page search [LKV+06]
    • Pattern identification, e.g., "Enter keywords", form identification
    • Synonym expansion, e.g., "Search" + "Go" + "Find it"
• Task 1, Criterion B: learn by example
• Task 2
  o Topic distillation based on a search engine
    • e.g., "used car search", "car * search"
    • Alone does not suffice for resource discovery [Cha99]
  o Focused/topical "crawling"
    • Priority queue ordered by importance score
    • Leveraging locality
    • Often irrelevant pages could lead to relevant ones
    • Reinforcement learning, etc.
[Figure from [DCL+00]]
[DCL+00] M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori, "Focused crawling using context graphs", VLDB, 2000.
[LKV+06] Y. Li, R. Krishnamurthy, S. Vaithyanathan, and H. V. Jagadish, "Getting Work Done on the Web: Supporting Transactional Queries", SIGIR, 2006.
[Cha99] S. Chakrabarti, "Recent results in automatic Web resource discovery", ACM Computing Surveys, vol. 31, 1999.
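The "priority queue ordered by importance score" idea behind focused crawling can be sketched in a few lines. This is a minimal illustration, not the method of [DCL+00] or [Cha99]: the in-memory `TOY_WEB` link graph, the `toy_score` function and the `budget` parameter are assumptions made up for the example.

```python
import heapq


def focused_crawl(seeds, get_links, score, budget):
    """Visit at most `budget` pages, always expanding the best-scored frontier URL."""
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for nxt in get_links(url):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return visited


# Hypothetical link graph standing in for real HTTP fetches.
TOY_WEB = {
    "hub": ["news", "used-car-search", "blog"],
    "used-car-search": ["car-dealer-form"],
    "news": [],
    "blog": [],
    "car-dealer-form": [],
}


def toy_score(url):
    """Assumed importance score: URLs mentioning 'car' are considered on-topic."""
    return 1.0 if "car" in url else 0.1


order = focused_crawl(["hub"], lambda u: TOY_WEB.get(u, []), toy_score, budget=3)
print(order)
```

A purely greedy score like this one ignores the slide's remark that irrelevant pages can lead to relevant ones; a practical crawler would need a score that also credits such stepping-stone pages.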

Interface Understanding: Modeling Web Interface
• Generally easy for keyword search interface, but can be extremely challenging for others (e.g., form-like search, graph-browsing)
• What to understand?
  o Structure of a web interface
• Modeling language
  o Flat model, e.g., [KBG+01]
  o Hierarchical model, e.g., [ZHC04, DKY+09]
• Input information
  o HTML tags, e.g., [KBG+01]
  o Visual layout of an interface, e.g., [DKY+09]
[Figure: two models of an interface. Flat: Table 1 ... Table k, each made of chunks. Hierarchical: AA.com with groups "Where?" (departure city, arrival city), "When" (departure date, return date), and service class.]
[KBG+01] O. Kaljuvee, O. Buyukkokten, H. Garcia-Molina, and A. Paepcke, "Efficient Web Form Entry on PDAs", WWW, 2001.
[ZHC04] Z. Zhang, B. He, and K. C.-C. Chang, "Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax", SIGMOD, 2004.
[DKY+09] E. C. Dragut, T. Kabisch, C. Yu, and U. Leser, "A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration", VLDB, 2009.
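The difference between a flat and a hierarchical interface model can be illustrated with a small tree structure. The sketch below is an assumption-laden toy: the `Node` class is hypothetical, and the grouping of the AA.com fields is read off the figure labels rather than taken from [DKY+09].

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of a hierarchical interface model (hypothetical structure)."""
    label: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Flatten the hierarchy into the ordered list of input fields."""
        if not self.children:
            return [self.label]
        return [leaf for child in self.children for leaf in child.leaves()]


# Assumed grouping, based on the figure labels for the AA.com search form.
interface = Node("AA.com", [
    Node("Where?", [Node("Departure city"), Node("Arrival city")]),
    Node("When", [Node("Departure date"), Node("Return date")]),
    Node("Service class"),
])

flat_model = interface.leaves()
print(flat_model)
```

The flat list keeps every field but loses the grouping ("Where?", "When") that the hierarchical model preserves.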

Interface Understanding: Schema Matching
• What to understand?
  o Attributes corresponding to input/output controls on an interface
• Modeling language
  o Map schema of an interface to a mediated schema (with well understood attribute semantics)
• Key input information
  o Data/attribute correlation [SDH08, CHW+08]
  o Human feedback [CVD+09]
  o Auxiliary sources [CMH08]
[CHW+08] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, "WebTables: exploring the power of tables on the web", VLDB, 2008.
[SDH08] A. D. Sarma, X. Dong, and A. Halevy, "Bootstrapping Pay-As-You-Go Data Integration Systems", SIGMOD, 2008.
[CVD+09] X. Chai, B.-Q. Vuong, A. Doan, and J. F. Naughton, "Efficiently Incorporating User Feedback into Information Extraction and Integration Programs", SIGMOD, 2009.
[CMH08] M. J. Cafarella, J. Madhavan, and A. Halevy, "Web-Scale Extraction of Structured Data", SIGMOD Record, vol. 37, 2008.
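Mapping an interface schema to a mediated schema can be illustrated with a deliberately simple matcher. Note the swap: the slide lists data/attribute correlation, human feedback and auxiliary sources as inputs, whereas this sketch uses plain string similarity of attribute names. The function name `match_schema`, the threshold of 0.6 and the sample attributes are assumptions.

```python
from difflib import SequenceMatcher


def match_schema(interface_attrs, mediated_attrs, threshold=0.6):
    """Map each interface attribute to its most similar mediated attribute.

    Attributes whose best similarity falls below `threshold` map to None,
    which marks them as candidates for human feedback.
    """
    mapping = {}
    for attr in interface_attrs:
        best, best_score = None, 0.0
        for candidate in mediated_attrs:
            similarity = SequenceMatcher(None, attr.lower(), candidate.lower()).ratio()
            if similarity > best_score:
                best, best_score = candidate, similarity
        mapping[attr] = best if best_score >= threshold else None
    return mapping


mapping = match_schema(["Make", "Price ($)", "Zip"], ["make", "model", "price"])
print(mapping)
```

Name similarity alone fails on synonyms (for example "Cost" versus "price"), which is where the correlation-based and feedback-based inputs listed on the slide would come in.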

Related Tutorials
• [DS13] X. L. Dong and D. Srivastava, "Big data integration", ICDE, 2013, and VLDB, 2013.
• [SW13] F. M. Suchanek and G. Weikum, "Knowledge Harvesting from Text and Web Sources", ICDE, 2013.
• [WT10] G. Weikum and M. Theobald, "From Information to Knowledge: Harvesting Entities and Relationships from Web Sources", PODS, 2010.
• [CLR+10] L. Chiticariu, Y. Li, S. Raghavan, and F. Reiss, "Enterprise Information Extraction: Recent Developments and Open Challenges", SIGMOD, 2010.
• [DN09] X. Dong and F. Naumann, "Data fusion - Resolving Data Conflicts for Integration", VLDB, 2009.
• [FHM08] M. Franklin, A. Halevy, and D. Maier, "A First Tutorial on Dataspaces", VLDB, 2008.
• [GM08] L. Getoor and R. Miller, "Data and Metadata Alignment: Concepts and Techniques", ICDE, 2008.

Outline
• Introduction
• Resource Discovery and Interface Understanding
• Technical Challenges for Data Mining
• Crawling
• Sampling
• Data Analytics
• Final Remarks
