
Principles of Knowledge Discovery in Databases, Chapter 9: Web Mining (Fall 1999)



  1. Principles of Knowledge Discovery in Databases
Fall 1999
Chapter 9: Web Mining
Dr. Osmar R. Zaïane
University of Alberta
© Dr. Osmar R. Zaïane, 1999. Principles of Knowledge Discovery in Databases, University of Alberta

Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Similarity Search
• Other topics if time permits

Chapter 9 Objectives
Understand the different knowledge discovery issues in data mining from the World Wide Web.
Distinguish between resource discovery and knowledge discovery from the Internet.

Web Mining Outline
• What are the incentives of web mining?
• What is the taxonomy of web mining?
• What is web content mining?
• What is web structure mining?
• What is web usage mining?
• What is a Virtual Web View?
• Is there a query and discovery language for VWV?

WWW: Incentives
• Enormous wealth of information on the web
• The web is a huge collection of:
  – Documents of all sorts
  – Hyper-link information
  – Access and usage information
• Mining interesting nuggets of information leads to a wealth of information and knowledge
• Challenge: Unstructured, huge, dynamic.
The Asilomar Report urges the database research community to contribute in deploying new technologies for resource and information retrieval from the World-Wide Web.

WWW: Facts
• No standards, unstructured and heterogeneous
• Growing and changing very rapidly
  – One new WWW server every 2 hours
  – 5 million documents in 1995
  – 320 million documents in 1998
• Indices get stale very quickly
Need for better resource discovery and knowledge extraction.
[Figure: Internet growth, number of hosts (0 to 40,000,000), Sep-69 to Sep-99]

  2. WWW and Web Mining
• Web: A huge, widely-distributed, highly heterogeneous, semi-structured, interconnected, evolving, hypertext/hypermedia information repository.
• Problems:
  – the "abundance" problem:
    • 99% of info of no interest to 99% of people
  – limited coverage of the Web:
    • hidden Web sources, majority of data in DBMS.
  – limited query interface based on keyword-oriented search
  – limited customization to individual users

Web Mining Taxonomy
• Web Mining
  – Web Content Mining
    • Web Page Content Mining
    • Search Result Mining
  – Web Structure Mining
  – Web Usage Mining
    • General Access Pattern Tracking
    • Customized Usage Tracking

Web Mining Taxonomy: Web Page Content Mining
• Web Page Summarization: WebLog (Lakshmanan et al., 1996), WebOQL (Mendelzon et al., 1998) …: Web structuring query languages; can identify information within given web pages
• Ahoy! (Etzioni et al., 1997): Uses heuristics to distinguish personal home pages from other web pages
• ShopBot (Etzioni et al., 1997): Looks for product prices within web pages

Web Mining Taxonomy: Search Result Mining
• Search Engine Result Summarization
  – Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997): Categorizes documents using phrases in titles and snippets

Web Mining Taxonomy: Web Structure Mining
• Using Links
  – PageRank (Brin et al., 1998)
  – CLEVER (Chakrabarti et al., 1998)
  Use interconnections between web pages to give weight to pages.
• Using Generalization
  – MLDB (1994), VWV (1998): Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
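
The idea of using interconnections between web pages to give weight to pages can be illustrated with a small power-iteration sketch in the spirit of PageRank. This is only a minimal illustration under stated assumptions, not the published algorithm as deployed in any search engine: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the toy link graph (`toy_web` and its page names) are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively weight pages by the weight of the pages linking to them.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of weight.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped weight evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its weight evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site used only for illustration.
toy_web = {
    "home": ["courses", "people"],
    "courses": ["home"],
    "people": ["home", "courses"],
    "orphan": ["home"],
}

ranks = pagerank(toy_web)
for page, weight in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {weight:.3f}")
```

In this toy graph the page that receives the most links ends up with the largest weight, and the page that nothing links to keeps only the baseline share.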

  3. Web Mining Taxonomy: General Access Pattern Tracking
• Web Log Mining (Zaïane, Xin and Han, 1998)
  Uses KDD techniques to understand general access patterns and trends. Can shed light on better structure and grouping of resource providers.

Web Mining Taxonomy: Customized Usage Tracking
• Adaptive Sites (Perkowitz and Etzioni, 1997)
  Analyzes access patterns of each user at a time. Web site restructures itself automatically by learning from user access patterns.

Mine What Web Search Engine Finds
• Current Web search engines: convenient source for mining
  – keyword-based, return too many answers, low quality answers, still missing a lot, not customized, etc.
• Data mining will help:
  – coverage: "Enlarge and then shrink," using synonyms and conceptual hierarchies
  – better search primitives: user preferences/hints
  – linkage analysis: authoritative pages and clusters
  – Web-based languages: XML + WebSQL + WebML
  – customization: home page + Weblog + user profiles

Warehousing a Meta-Web: An MLDB Approach
• Meta-Web: A structure which summarizes the contents, structure, linkage, and access of the Web and which evolves with the Web
• Layer 0: the Web itself
• Layer 1: the lowest layer of the Meta-Web
  – an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
• Layer 2 and up: summary/classification/clustering in various ways and distributed for various applications
• Meta-Web can be warehoused and incrementally updated
• Querying and mining can be performed on or assisted by meta-Web (a multi-layer digital library catalogue, yellow page).

Construction of Multi-Layer Meta-Web
• XML: facilitates structured and meta-information extraction
• Hidden Web: DB schema "extraction" + other meta info
• Automatic classification of Web documents:
  – based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)
• Automatic ranking of important Web pages
  – authoritative site recognition and clustering Web pages
• Generalization-based multi-layer meta-Web construction
  – With the assistance of clustering and classification analysis
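
The layered Meta-Web described above can be made concrete with a small sketch: a Layer 1 entry holding the listed page-summary attributes, and one possible generalization step that summarizes Layer 1 entries by class into a Layer 2 record. The attribute names follow the slide (class, time, URL, contents, keywords, popularity, weight, links), but the field types, the `Layer1Entry` and `generalize_to_layer2` names, the choice of aggregates, and the example.org pages are all assumptions made for illustration; this is not the schema or procedure of the MLDB work itself.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Layer1Entry:
    """One Layer 1 record: a summary of a single Layer 0 web page."""
    url: str
    page_class: str      # "class" on the slide; renamed because class is a keyword
    time: str            # e.g. last-modified date (format is an assumption)
    contents: str        # short content summary
    keywords: list
    popularity: int      # access counter
    weight: float
    links: list = field(default_factory=list)


def generalize_to_layer2(entries, top_k=3):
    """Summarize Layer 1 entries per class (one possible Layer 2 view)."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.page_class].append(entry)

    layer2 = {}
    for page_class, members in groups.items():
        keyword_counts = Counter(k for e in members for k in e.keywords)
        layer2[page_class] = {
            "page_count": len(members),
            "total_popularity": sum(e.popularity for e in members),
            "top_keywords": [k for k, _ in keyword_counts.most_common(top_k)],
        }
    return layer2


# Hypothetical Layer 1 entries used only for illustration.
layer1 = [
    Layer1Entry("http://example.org/kdd", "course", "1999-09-01",
                "KDD course page", ["mining", "kdd"], 120, 0.8,
                ["http://example.org/staff"]),
    Layer1Entry("http://example.org/db", "course", "1999-09-03",
                "Database course page", ["mining", "sql"], 80, 0.6),
    Layer1Entry("http://example.org/staff", "personal", "1999-08-15",
                "Staff home page", ["research"], 40, 0.3),
]

layer2 = generalize_to_layer2(layer1)
for page_class, summary in layer2.items():
    print(page_class, summary)
```

Because Layer 2 is computed only from Layer 1 records, a changed page requires updating its entry and the summary of its class rather than recrawling the Web, which is one way to read the slide's point that the Meta-Web can be warehoused and incrementally updated.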
