Ontology-based approach for unsupervised and adaptive focused crawling Thomas HASSAN, Christophe CRUZ, Aurélie Bertaux thomas.hassan@u-bourgogne.fr Le2i FRE2005, CNRS, Arts et Métiers, Univ. Bourgogne Franche-Comté Dijon, France

Outline
• Context
  § Industrial context
  § Problem statement
• Proposed solution
  § Background
  § Architecture
• Evaluation
  § Scaling
  § Performance
• Conclusion and future work

Industrial context
Competitive intelligence

Industrial context
Content feed tools, content analysis

Problem statement
Content feed tools, content analysis
Bottlenecks:
- Cross-referencing articles to assess veracity
- Manual classification of articles
- Discrepancy between data and knowledge base
High time cost for experts, possible loss of information

Problem statement
• How can feed tools be specialized with domain-specific knowledge?
• How can content gathering be optimized to find the most relevant items quickly?
• How can the range of information sources be expanded?

Background: focused crawler
[Figure: crawl graph. Legend: relevant item, irrelevant item, seed item, inlink]

Background: focused crawler + semantics
• Web crawler: efficient content gathering
• Ontology: relevant content analysis

Limitations
1) Dynamic data vs. static ontology: discrepancy between the ontology-based classifier and actual web data
2) The crawler should improve from experience: both content mining and graph mining should be used to enhance crawling performance
Objective: adapt both the crawling strategy and the content analysis over time to accelerate crawling and improve relevance

Architecture: baseline implementation
Based on Apache Nutch, a Hadoop-based distributed crawler
• Crawl web sources periodically
• High throughput, fault tolerance
• Integrate useful modules
Diagram from: https://nutch.wordpress.com/

Architecture: classification module
Classification model construction based on the probability distribution of features

         term 1  term 2  term 3  term 4  term 5  term 6  term 7
label 1       0       0       5       0       5      25      25
label 2       0      75       0       0       0      75       5
label 3       0       0      75       0      25       0       0
label 4       5      25      25       0       5      93      25
label 5      95       0       0       0      60       0       5
label 6       0      60       0      95       0       0      90
label 7       5      98       5      60      25       0      79

[Figure: multi-label hierarchical classification of an item into labels (L), HMC with tree vs. HMC with DAG]

Architecture: classification module
Objective: content-based classification of items
• Each document is represented as a vector of the terms it contains (Lucene)
• Outputs a vector of labels (relevant concepts of the ontology) for each item
[Figure: multi-label hierarchical classification of an item into labels (L), HMC with tree vs. HMC with DAG]
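
The input and output described above can be illustrated with a minimal Python sketch. The slides do not state the actual scoring rule, so the one used here (a label is kept when its strongest matching term weight reaches a threshold) is an assumption, as are the names `WEIGHTS`, `classify` and `threshold`. The weights are the first three rows and columns of the example matrix on the classification-module slide, read as percentages. The ontology hierarchy (tree or DAG) is not modeled.

```python
from typing import Dict, List, Set

# Hypothetical label -> term weights (read as percentages), copied from the
# first three rows and columns of the example matrix on the slides.
WEIGHTS: Dict[str, Dict[str, int]] = {
    "label 1": {"term 1": 0, "term 2": 0, "term 3": 5},
    "label 2": {"term 1": 0, "term 2": 75, "term 3": 0},
    "label 3": {"term 1": 0, "term 2": 0, "term 3": 75},
}


def classify(terms: Set[str], threshold: int = 50) -> List[str]:
    """Return every label whose strongest matching term weight reaches
    the threshold. This scoring rule is an assumption for illustration."""
    labels = []
    for label, row in WEIGHTS.items():
        # Strongest evidence for this label among the document's terms.
        score = max((row.get(term, 0) for term in terms), default=0)
        if score >= threshold:
            labels.append(label)
    return labels


if __name__ == "__main__":
    # A document containing "term 2" and "term 3" receives two labels.
    print(classify({"term 2", "term 3"}))
```

Because several labels can pass the threshold at once, the output is a label vector rather than a single class, which matches the multi-label setting on the slide.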

Architecture: priority module
• Uses the context-graph approach to estimate the relevance of unseen links
• Computes similarity with fetched items based on their distance to relevant items
[Figure: context graph. Legend: relevant item, irrelevant item, inlink, graph layers]
Diligenti, et al., 2000. Focused Crawling Using Context Graphs. In VLDB (pp. 527-534).
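
The notion of graph layers can be sketched in Python as a breadth-first walk over inlinks: relevant items form layer 0, and a page is in layer k if it reaches a relevant item in k links. This covers only the layer assignment. How the priority module turns layers into a relevance estimate for unseen links is not described on the slides and is not shown here. The function name `context_layers`, the `inlinks` mapping and the example URLs are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List


def context_layers(
    inlinks: Dict[str, List[str]],
    relevant: Iterable[str],
    max_layer: int,
) -> Dict[str, int]:
    """Assign each page the smallest number of links separating it from a
    relevant item, following inlinks backwards up to max_layer."""
    layer = {url: 0 for url in relevant}
    queue = deque(layer)
    while queue:
        url = queue.popleft()
        if layer[url] == max_layer:
            continue  # do not expand beyond the outermost layer
        for source in inlinks.get(url, []):
            if source not in layer:  # first visit gives the shortest distance
                layer[source] = layer[url] + 1
                queue.append(source)
    return layer


if __name__ == "__main__":
    # inlinks[x] lists the pages that link to x (hypothetical data).
    inlinks = {"r": ["a", "b"], "a": ["c"], "c": ["d"]}
    print(context_layers(inlinks, ["r"], max_layer=2))
```

Pages in lower layers sit closer to known relevant items, so a crawler could visit them first. That ordering is one plausible use of the layers, not a claim about the presented system.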

Architecture: classification module
Integration with the crawler

Architecture: maintenance module
Objective: maintain a co-occurrence matrix of features

         term 1  term 2  term 3  term 4  term 5  term 6  term 7
label 1       0       0       5       0       5      25      25
label 2       0      75       0       0       0      75       5
label 3       0       0      75       0      25       0       0
label 4       5      25      25       0       5      93      25
label 5      95       0       0       0      60       0       5
label 6       0      60       0      95       0       0      90
label 7       5      98       5      60      25       0      79
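
One way to keep such a matrix up to date as new items are crawled is to store raw counts and derive percentages on demand. The Python sketch below follows that idea. The slides do not say how the matrix values are computed, so reading a cell as "percentage of items containing the term that also carry the label" is an assumption, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable


class CooccurrenceMatrix:
    """Incrementally maintained label/term co-occurrence counts."""

    def __init__(self) -> None:
        # pair_counts[label][term]: items containing term and carrying label
        self.pair_counts: DefaultDict[str, DefaultDict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )
        # term_counts[term]: items containing term
        self.term_counts: DefaultDict[str, int] = defaultdict(int)

    def update(self, terms: Iterable[str], labels: Iterable[str]) -> None:
        """Record one crawled item with its terms and assigned labels."""
        label_set = set(labels)
        for term in set(terms):
            self.term_counts[term] += 1
            for label in label_set:
                self.pair_counts[label][term] += 1

    def percent(self, label: str, term: str) -> float:
        """Share (in percent) of items containing term that carry label."""
        seen = self.term_counts.get(term, 0)
        if seen == 0:
            return 0.0
        together = self.pair_counts.get(label, {}).get(term, 0)
        return 100.0 * together / seen


if __name__ == "__main__":
    matrix = CooccurrenceMatrix()
    matrix.update(["t1", "t2"], ["A"])
    matrix.update(["t1"], ["B"])
    print(matrix.percent("A", "t1"))
```

Storing counts rather than percentages means each new item costs only a few increments, and no cell has to be recomputed from scratch.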

Architecture: maintenance module

Scaling
Distributed architecture to deal with scaling

Quality evaluation
Comparison with a standard Best-N-First strategy using only cosine similarity
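
For orientation, a Best-N-First step driven only by cosine similarity could look like the Python sketch below: each frontier entry is scored against a topic vector and the N highest-scoring URLs are fetched next. This is a generic illustration of the strategy, not the exact baseline configuration used in the evaluation. The sparse dictionary representation and the names `cosine`, `best_n`, `frontier` and `topic` are assumptions.

```python
import heapq
import math
from typing import Dict, List

Vector = Dict[str, float]  # sparse term -> weight mapping


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def best_n(frontier: Dict[str, Vector], topic: Vector, n: int) -> List[str]:
    """Return the n frontier URLs whose context vectors are most similar
    to the topic vector, most similar first."""
    return heapq.nlargest(
        n, frontier, key=lambda url: cosine(frontier[url], topic)
    )


if __name__ == "__main__":
    topic = {"crawl": 1.0, "ontology": 1.0}
    frontier = {
        "u1": {"crawl": 1.0, "ontology": 1.0},
        "u2": {"crawl": 1.0},
        "u3": {"sports": 1.0},
    }
    print(best_n(frontier, topic, 2))
```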

Conclusion
• An approach for unsupervised ontology-based focused crawling
  § Performs cross-referencing of web items
  § Ontology-based classification model for accurate item classification
  § Adaptation and evolution of the model using web content and web graph mining
• Future work
  § Evaluation of the architecture in an industrial context
  § Address scalability issues of the maintenance process
  § Active learning integration in the maintenance process (expert feedback)

Thank you!
