Web Content Mining
Dr. Ahmed Rafea
Outline
• Introduction
• The Web: Opportunities & Challenges
• Techniques
• Applications
Introduction
• The Web is perhaps the single largest data source in the world.
• Web mining aims to extract and mine useful knowledge from the Web.
• It is a multidisciplinary field: data mining, machine learning, natural language processing, statistics, databases, information retrieval, multimedia, etc.
• Because Web data is heterogeneous and lacks structure, mining it is a challenging task.
The Web: Opportunities & Challenges (1)
• The Web offers an unprecedented opportunity and challenge to data mining:
  – The amount of information on the Web is huge.
  – The coverage of Web information is very wide and diverse.
  – Information/data of almost all types exists on the Web.
  – Much of the Web information is semi-structured.
  – Much of the Web information is linked.
  – Much of the Web information is redundant.
  – The Web is noisy.
  – The Web consists of the surface Web and the deep Web.
  – The Web is also about services.
  – The Web is dynamic.
  – Above all, the Web is a virtual society.
Techniques
• Classification of Multimedia Content and Websites
• Focused Crawling
• Clustering Web Objects
• Wrapper Induction
• Automatic Data Extraction
• NLP techniques for sentiment classification
• Sentiment classification using ML methods
• NLP for Customer Reviews Analysis
Classification of Multimedia Content and Websites
• To retrieve relevant knowledge, a system first has to analyze web content.
• Classification of web objects offers an automatic way to decide the relevance of web objects.
• Since a website is usually represented by multiple pages, classifying whole websites on top of web page classification demands new algorithms.
Focused Crawling
• A focused web crawler takes as input a set of well-selected web pages exemplifying the user interest.
• The focused crawler starts from the given pages and recursively explores the linked web pages.
• Whereas general-purpose crawlers perform a breadth-first search of the whole Web, a focused crawler explores only a small portion of the Web using a best-first search guided by the user interest.
• Crawling can also target multimedia content on the Web instead of plain HTML documents.
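The best-first strategy described on this slide can be sketched as follows. This is a minimal illustration, not a production crawler: the keyword-overlap `relevance` score, the `fetch` callable, and the in-memory `TOY_WEB` graph are all assumptions made for the example and stand in for a real classifier and real HTTP requests.

```python
import heapq

def relevance(text, interest_terms):
    """Toy score: fraction of interest terms that occur in the page text."""
    words = set(text.lower().split())
    return sum(term in words for term in interest_terms) / len(interest_terms)

def focused_crawl(seeds, fetch, interest_terms, max_pages=10, threshold=0.0):
    """Best-first crawl: always expand the most promising known URL next.

    `fetch(url)` is a hypothetical callable returning (text, outgoing_links).
    """
    # heapq is a min-heap, so priorities are negated; seeds get top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        score = relevance(text, interest_terms)
        if score <= threshold:
            continue  # irrelevant page: do not follow its links
        collected.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return collected

# Toy in-memory "web" (an assumption) standing in for real HTTP fetches.
TOY_WEB = {
    "a": ("web mining and data mining survey", ["b", "c"]),
    "b": ("cooking recipes for pasta", ["d"]),
    "c": ("text mining of web reviews", ["d", "e"]),
    "d": ("football scores", []),
    "e": ("web data extraction and mining", []),
}

def toy_fetch(url):
    return TOY_WEB[url]

pages = focused_crawl(["a"], toy_fetch, ["web", "mining"])
print([url for url, _ in pages])
```

Pages "b" and "d" are fetched but discarded, and their links are never followed, which is what keeps the crawl confined to the relevant part of the graph.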
Clustering Web Objects
• Focused crawling retrieves large amounts of relevant data.
• To offer faster and more specific access to the query results, clustering is an established method for grouping the retrieved information so that it is easier to understand.
• If the query results are websites or combined objects such as images and their text descriptions, algorithms are needed that handle these combined data types to find meaningful clusterings.
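As a rough illustration of grouping retrieved results, the sketch below clusters short text snippets with a single-pass "leader" scheme over term-frequency vectors and cosine similarity. The technique, the threshold of 0.3, and the sample snippets are assumptions chosen for the example; the slide does not name an algorithm, and combined objects such as images with descriptions would need a richer representation than plain text.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for one snippet."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def leader_cluster(snippets, threshold=0.3):
    """Assign each snippet to the first cluster whose leader is similar enough.

    The first member of a cluster acts as its leader; a snippet that matches
    no leader starts a new cluster.
    """
    clusters = []  # list of (leader_vector, [snippet indices])
    for i, text in enumerate(snippets):
        vec = tf_vector(text)
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

# Made-up query results for the ambiguous query "jaguar".
results = [
    "jaguar car dealer price",
    "jaguar cat habitat jungle",
    "used jaguar car price list",
    "jungle cat jaguar prey",
]
groups = leader_cluster(results)
print(groups)
```

The result depends on input order and on the threshold, which is the usual trade-off of single-pass schemes against iterative methods such as k-means.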
Wrapper Induction
• A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database.
• Given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.
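One simple form of extraction rule is a pair of left and right delimiters that surround every labeled value. The sketch below learns such a pair from labeled pages and applies it to an unseen page. It is a deliberately simplified illustration: the function names, the sample pages, and the assumption that each labeled value occurs only once per page are all made up for the example, and real wrapper-induction systems learn richer rules.

```python
import os
import re

def common_prefix(strings):
    """Longest string that every input starts with."""
    return os.path.commonprefix(strings)

def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_lr_rule(labeled_pages):
    """Learn (left, right) delimiters from (page_text, labeled_values) pairs.

    Assumes each labeled value occurs exactly once in its page.
    """
    lefts, rights = [], []
    for page, values in labeled_pages:
        for value in values:
            start = page.index(value)
            lefts.append(page[:start])                # text before the value
            rights.append(page[start + len(value):])  # text after the value
    return common_suffix(lefts), common_prefix(rights)

def apply_rule(rule, page):
    """Extract every shortest span enclosed by the learned delimiters."""
    left, right = rule
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page)

# Hypothetical manually labeled training pages.
training = [
    ("<li><b>Widget</b> $5</li><li><b>Gadget</b> $7</li>", ["Widget", "Gadget"]),
    ("<li><b>Gizmo</b> $12</li>", ["Gizmo"]),
]
rule = learn_lr_rule(training)
names = apply_rule(rule, "<ul><li><b>Doohickey</b> $3</li><li><b>Thing</b> $9</li></ul>")
print(rule, names)
```

If the labeled examples share no common context, one of the delimiters comes out empty and the rule becomes useless, which is why more labeled pages or a more expressive rule language are needed in practice.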
Automatic Data Extraction
• Given a set of positive pages, generate extraction patterns.
• Given only a single page with multiple data records, generate extraction patterns.
NLP techniques for sentiment classification
• The approach: three steps
  – Step 1:
    • Part-of-speech tagging
    • Extract two consecutive words (two-word phrases) from reviews if their tags conform to some given patterns.
  – Step 2:
    • Estimate the semantic orientation (SO) of the extracted phrases.
  – Step 3:
    • Compute the average SO of all phrases.
    • Classify the review as recommended if the average SO is positive, not recommended otherwise.
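The three steps can be sketched end to end as below. The slide does not specify the tag patterns or how SO is estimated, so the sketch makes loud assumptions: a tiny hand-written tag lexicon (`TOY_TAGS`) replaces a real part-of-speech tagger, the tag patterns are a simplified guess, and SO is estimated as a log-ratio of made-up co-occurrence counts with the reference words "excellent" and "poor".

```python
import math

# Assumption: toy POS lexicon; a real system would use a trained tagger.
TOY_TAGS = {
    "the": "DT", "bank": "NN", "has": "VBZ", "low": "JJ", "fees": "NNS",
    "and": "CC", "poor": "JJ", "service": "NN", "very": "RB",
}

# Assumption: simplified (first tag, second tag) patterns for kept phrases.
PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("RB", "JJ")}

# Assumption: made-up counts of each phrase near "excellent" and near "poor",
# plus overall counts of the two reference words.
NEAR = {"low fees": (80, 20), "poor service": (10, 90)}
HITS_EXCELLENT, HITS_POOR = 1000, 1000

def extract_phrases(review):
    """Step 1: tag words and keep two-word phrases whose tags match a pattern."""
    words = review.lower().split()
    tags = [TOY_TAGS.get(w, "UNK") for w in words]
    return [
        f"{words[i]} {words[i + 1]}"
        for i in range(len(words) - 1)
        if (tags[i], tags[i + 1]) in PATTERNS
    ]

def semantic_orientation(phrase):
    """Step 2: log-ratio of association with 'excellent' versus 'poor'."""
    near_exc, near_poor = NEAR.get(phrase, (0, 0))
    # 0.01 smoothing avoids division by zero for unseen phrases.
    return math.log2(
        ((near_exc + 0.01) * HITS_POOR) / ((near_poor + 0.01) * HITS_EXCELLENT)
    )

def classify(review):
    """Step 3: average the SO values and threshold at zero."""
    phrases = extract_phrases(review)
    if not phrases:
        return "not recommended", 0.0
    avg = sum(semantic_orientation(p) for p in phrases) / len(phrases)
    return ("recommended" if avg > 0 else "not recommended"), avg

review = "the bank has low fees and poor service"
phrases = extract_phrases(review)
label, avg_so = classify(review)
print(phrases, label, round(avg_so, 2))
```

In this toy example the strongly negative phrase outweighs the mildly positive one, so the average SO is negative and the review is classified as not recommended.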
Sentiment classification using ML methods
• Three classification techniques were tried:
  – Naïve Bayes
  – Maximum entropy
  – Support vector machine
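To make the first of these techniques concrete, here is a minimal multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, written from scratch. The four training reviews are invented for illustration, and the whitespace tokenizer and feature choice are assumptions; the slide does not describe the features or data that were actually used.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Collect class and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in class_counts:
        logp = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs non-zero.
            logp += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Made-up training reviews for illustration only.
train = [
    ("great phone love it", "pos"),
    ("excellent battery great screen", "pos"),
    ("terrible phone hate it", "neg"),
    ("poor battery awful screen", "neg"),
]
model = train_nb(train)
print(predict_nb(model, "great battery"), predict_nb(model, "awful phone"))
```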
NLP for Customer Reviews Analysis
• Mining product features
  – Part-of-speech tagging
  – Features are nouns and noun phrases.
• Identifying the orientation of an opinion sentence
  – Use the dominant orientation of the opinion words (e.g., adjectives) as the sentence orientation.
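Both ideas can be illustrated on sentences that have already been tagged. In the sketch below, the small `POSITIVE` and `NEGATIVE` word lists, the frequency cutoff for features, and the pre-tagged sample sentences are assumptions made for the example; it handles only single nouns, not multi-word noun phrases.

```python
from collections import Counter

# Assumption: tiny opinion lexicons; real systems use much larger ones.
POSITIVE = {"great", "sharp", "good"}
NEGATIVE = {"poor", "short", "bad"}

def candidate_features(tagged_sentences, min_count=2):
    """Frequent nouns across all sentences are treated as product features."""
    counts = Counter(
        word
        for sentence in tagged_sentences
        for word, tag in sentence
        if tag.startswith("NN")
    )
    return [word for word, count in counts.most_common() if count >= min_count]

def sentence_orientation(tagged_sentence):
    """Dominant orientation of the adjectives decides the sentence label."""
    score = 0
    for word, tag in tagged_sentence:
        if tag.startswith("JJ"):
            score += (word in POSITIVE) - (word in NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical pre-tagged review sentences (word, POS tag).
sentences = [
    [("the", "DT"), ("screen", "NN"), ("is", "VBZ"), ("sharp", "JJ"),
     ("and", "CC"), ("great", "JJ")],
    [("the", "DT"), ("battery", "NN"), ("is", "VBZ"), ("poor", "JJ")],
    [("great", "JJ"), ("screen", "NN"), ("but", "CC"), ("short", "JJ"),
     ("battery", "NN"), ("life", "NN")],
]
features = candidate_features(sentences)
orientations = [sentence_orientation(s) for s in sentences]
print(features, orientations)
```

The third sentence has one positive and one negative adjective, so no orientation dominates and the sketch labels it neutral.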
Applications
• Automatic Maintenance of Topic Specific Directory Services
• Data extraction
• Sentiment classification, analysis
• Summarization of consumer reviews
• Information integration and schema matching
• Knowledge synthesis
• Template detection and page segmentation