Web Content Mining
Dr. Ahmed Rafea

Outline
• Introduction
• The Web: Opportunities & Challenges
• Techniques
• Applications

Introduction
• The Web is perhaps the single largest data source in the world.
• Web mining aims to extract and mine useful knowledge from the Web.
• It is a multidisciplinary field: data mining, machine learning, natural language processing, statistics, databases, information retrieval, multimedia, etc.
• Because Web data is heterogeneous and lacks structure, mining it is a challenging task.

The Web: Opportunities & Challenges
• The Web offers unprecedented opportunities and challenges for data mining:
  – The amount of information on the Web is huge.
  – The coverage of Web information is very wide and diverse.
  – Information and data of almost all types exist on the Web.
  – Much of the Web information is semi-structured.
  – Much of the Web information is linked.
  – Much of the Web information is redundant.
  – The Web is noisy.
  – The Web consists of the surface Web and the deep Web.
  – The Web is also about services.
  – The Web is dynamic.
  – Above all, the Web is a virtual society.

Techniques
• Classification of Multimedia Content and Websites
• Focused Crawling
• Clustering Web Objects
• Wrapper Induction
• Automatic Data Extraction
• NLP techniques for sentiment classification
• Sentiment classification using ML methods
• NLP for Customer Reviews Analysis

Classification of Multimedia Content and Websites
• To retrieve relevant knowledge, a system first has to analyze web content.
• Classification of web objects offers an automatic way to decide the relevance of web objects.
• Since a website usually consists of multiple pages, classifying whole websites on top of page-level classification demands new algorithms.

Focused Crawling
• A focused web crawler takes as input a set of well-selected web pages that exemplify the user's interest.
• The focused crawler starts from the given pages and recursively explores the linked web pages.
• Whereas a general-purpose crawler performs a breadth-first search of the whole Web, a focused crawler explores only a small portion of the Web, using a best-first search guided by the user's interest.
• Crawling can also target multimedia content on the Web instead of plain HTML documents.
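The best-first strategy described above can be sketched with a priority queue. This is a minimal illustration under stated assumptions, not the method from the slides: `TOY_WEB` is a hypothetical in-memory link graph standing in for real fetching, `relevance` is a stand-in keyword scorer, and each unvisited link simply inherits the score of the page that links to it.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("web mining and data mining overview", ["b", "c"]),
    "b": ("cooking recipes and kitchen tips", ["d"]),
    "c": ("mining web content with crawlers", ["e"]),
    "d": ("more recipes", []),
    "e": ("focused crawling for web mining", []),
}

def relevance(text, topic_terms):
    """Fraction of the page's words that are topic terms (a stand-in scorer)."""
    words = text.split()
    return sum(w in topic_terms for w in words) / len(words) if words else 0.0

def focused_crawl(seeds, topic_terms, max_pages):
    """Best-first crawl: always expand the most promising frontier URL."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        visited.append(url)
        score = relevance(text, topic_terms)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link is as promising as the page citing it.
                heapq.heappush(frontier, (-score, link))
    return visited

topic = {"web", "mining", "crawling", "crawlers"}
print(focused_crawl(["a"], topic, max_pages=4))
```

With a budget of four pages, the crawl never fetches page "d", because it is only reachable from the off-topic page "b" and so sits at the bottom of the frontier.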

Clustering Web Objects
• Focused crawling retrieves large amounts of relevant data.
• To offer fast and more specific access to the query results, clustering is an established method for grouping the retrieved information so that it is easier to understand.
• If the query results are websites or combined objects, such as images with their text descriptions, algorithms are needed that can handle these combined data types and find meaningful clusterings.
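As a rough illustration of grouping query results, the sketch below clusters short text snippets in a single pass using Jaccard similarity over word sets. The similarity measure, the `threshold` value, and the single-pass scheme are assumptions chosen for brevity; the slides do not name a specific clustering algorithm, and this sketch covers plain text only, not combined objects.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_results(docs, threshold=0.2):
    """Single-pass clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # each entry is (leader term set, list of document indices)
    for i, doc in enumerate(docs):
        terms = set(doc.lower().split())
        for leader, members in clusters:
            if jaccard(terms, leader) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster is close enough, so this document starts one.
            clusters.append((terms, [i]))
    return [members for _, members in clusters]

results = [
    "jaguar car dealer",
    "jaguar big cat habitat",
    "used car dealer prices",
    "big cat conservation",
]
print(cluster_results(results))
```

For an ambiguous query such as "jaguar", the car-related and animal-related snippets end up in separate groups.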

Wrapper Induction
• A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database.
• Given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.
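To make the idea of learning extraction rules from labeled pages concrete, here is a toy sketch that learns one left delimiter and one right delimiter around a labeled value. This delimiter-based rule form is an assumption for illustration; the slides do not specify the learning method, and real wrapper induction systems handle multiple fields and far noisier pages. The sample pages are hypothetical.

```python
import os
import re

def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]

def learn_rule(labeled_pages):
    """Learn (left, right) delimiters from (page, labeled value) pairs."""
    lefts, rights = [], []
    for page, value in labeled_pages:
        start = page.index(value)  # position of the manually labeled value
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), os.path.commonprefix(rights)

def extract(page, rule):
    """Apply the learned delimiters to an unseen page."""
    left, right = rule
    match = re.search(re.escape(left) + "(.*?)" + re.escape(right), page)
    return match.group(1) if match else None

# Hypothetical labeled training pages: (HTML, value a person marked).
training = [
    ("<html><b>Price:</b> <i>$10</i></html>", "$10"),
    ("<html><p>Sale</p><b>Price:</b> <i>$250</i></html>", "$250"),
]
rule = learn_rule(training)
print(extract("<html><h1>X</h1><b>Price:</b> <i>$7.50</i></html>", rule))
```

The learned rule is just the text that consistently precedes and follows the labeled value, which is then reused to query pages with the same layout.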

Automatic Data Extraction
• Given a set of positive pages, generate extraction patterns.
• Given only a single page with multiple data records, generate extraction patterns.

NLP techniques for sentiment classification
• The approach has three steps:
  – Step 1:
    • Part-of-speech tagging.
    • Extract two consecutive words (two-word phrases) from reviews if their tags conform to some given patterns.
  – Step 2:
    • Estimate the semantic orientation (SO) of the extracted phrases.
  – Step 3:
    • Compute the average SO of all phrases.
    • Classify the review as recommended if the average SO is positive, and as not recommended otherwise.
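The three steps above can be sketched as follows. The input is assumed to be already part-of-speech tagged, and both the tag patterns in `PATTERNS` and the scores in `SO_SCORES` are hypothetical placeholders: the slides do not list the patterns or say how SO is estimated, so a fixed lookup table stands in for that estimate here.

```python
# Hypothetical tag patterns (first tag, second tag) for two-word phrases.
PATTERNS = {("JJ", "NN"), ("RB", "JJ"), ("RB", "VB")}

# Hypothetical semantic-orientation scores; a real system would estimate these.
SO_SCORES = {
    ("great", "camera"): 1.5,
    ("very", "sharp"): 1.0,
    ("poor", "battery"): -2.0,
}

def extract_phrases(tagged_words):
    """Step 1: keep consecutive word pairs whose tags match a pattern."""
    pairs = zip(tagged_words, tagged_words[1:])
    return [(w1, w2) for (w1, t1), (w2, t2) in pairs if (t1, t2) in PATTERNS]

def classify_review(tagged_words):
    """Steps 2 and 3: look up each phrase's SO, average, and threshold at zero."""
    phrases = extract_phrases(tagged_words)
    if not phrases:
        # Assumption: with no phrases, the average is not positive.
        return "not recommended"
    average = sum(SO_SCORES.get(p, 0.0) for p in phrases) / len(phrases)
    return "recommended" if average > 0 else "not recommended"

review = [
    ("great", "JJ"), ("camera", "NN"), ("with", "IN"),
    ("very", "RB"), ("sharp", "JJ"), ("images", "NNS"),
    ("but", "CC"), ("poor", "JJ"), ("battery", "NN"),
]
print(extract_phrases(review))
print(classify_review(review))
```

Phrases missing from the lookup table contribute a score of zero, which is a simplification of this sketch rather than part of the described approach.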

Sentiment classification using ML methods
• Three classification techniques were tried:
  – Naïve Bayes
  – Maximum entropy
  – Support vector machine
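As one example of the listed techniques, here is a minimal multinomial Naïve Bayes text classifier with add-one smoothing. It is a sketch under stated assumptions: whitespace tokenization, a tiny hypothetical training set, and no claim to match the feature set or configuration used in the work the slides summarize.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the class with the highest log-posterior under add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training reviews.
training = [
    ("great product works well", "pos"),
    ("love this great phone", "pos"),
    ("terrible product broke fast", "neg"),
    ("bad phone terrible battery", "neg"),
]
model = train_naive_bayes(training)
print(predict("great phone", model))
print(predict("terrible battery", model))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.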

NLP for Customer Reviews Analysis
• Mining product features
  – Part-of-speech tagging
  – Features are nouns and noun phrases.
• Identifying the orientation of an opinion sentence
  – Use the dominant orientation of the opinion words (e.g., adjectives) as the sentence orientation.
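The two ideas above can be illustrated on pre-tagged sentences. This sketch makes several assumptions that are not in the slides: the `OPINION_WORDS` lexicon is hypothetical, only single nouns are handled (not noun phrases), candidate features are filtered by a `min_count` frequency threshold, and "dominant orientation" is computed as the sign of the summed adjective orientations.

```python
from collections import Counter

# Hypothetical opinion lexicon: adjective -> orientation (+1 or -1).
OPINION_WORDS = {"great": 1, "sharp": 1, "good": 1, "poor": -1, "slow": -1}

def mine_features(tagged_sentences, min_count=2):
    """Collect nouns that occur in at least min_count sentences."""
    counts = Counter()
    for sentence in tagged_sentences:
        nouns = {word for word, tag in sentence if tag.startswith("NN")}
        counts.update(nouns)
    return {word for word, n in counts.items() if n >= min_count}

def sentence_orientation(tagged_sentence):
    """Sum adjective orientations; the sign gives the dominant orientation."""
    total = sum(
        OPINION_WORDS.get(word, 0)
        for word, tag in tagged_sentence
        if tag.startswith("JJ")
    )
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

sentences = [
    [("the", "DT"), ("battery", "NN"), ("is", "VBZ"), ("poor", "JJ")],
    [("great", "JJ"), ("screen", "NN"), ("and", "CC"),
     ("sharp", "JJ"), ("pictures", "NNS")],
    [("the", "DT"), ("battery", "NN"), ("is", "VBZ"), ("slow", "JJ"),
     ("but", "CC"), ("good", "JJ"), ("screen", "NN")],
]
print(sorted(mine_features(sentences)))
print([sentence_orientation(s) for s in sentences])
```

A sentence whose positive and negative opinion words cancel out is reported as neutral here; how such ties should be resolved is left open by the slides.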

Applications
• Automatic maintenance of topic-specific directory services
• Data extraction
• Sentiment classification and analysis
• Summarization of consumer reviews
• Information integration and schema matching
• Knowledge synthesis
• Template detection and page segmentation
