An unsupervised classification process for large datasets based on web reasoning
Rafael PEIXOTO, Thomas HASSAN, Christophe CRUZ, Aurelie BERTAUX, Nuno SILVA
thomas.hassan@u-bourgogne.fr
Laboratoire LE2I – UMR CNRS 6306 – Université de Bourgogne
Outline
• Context
• Global problem
• The Semantic HMC
• Specific Problem
• Proposed Solution
• Implementation
• Setup
• Results
• Conclusion and future work
Global Problem
Value extraction from Big Data sources
Proposal: "Semantic HMC" (Hierarchical Multi-label Classification)
• Unsupervised ontology learning
• Rule-based classification (Web Reasoner)
Specific Problem
Rule-based reasoning to perform classification
• Unsupervised ontology learning
• Rule-based classification
Specific Problem
• Resolution: learn classification rules from large volumes of unstructured text
  - Distributed method that exploits the co-occurrence matrix
• Realization: classify large volumes of new data items
  - Classification using a Web Reasoner
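Proposed solution: rule learning (Resolution)
Learning Alpha and Beta sets

| $P_C(i|j)$ | term 1 | term 2 | term 3 | term 4 | term 5 | term 6 | term 7 |
|---|---|---|---|---|---|---|---|
| label 1 | 0 | 0 | 5 | 0 | 5 | 25 | 25 |
| label 2 | 0 | 75 | 0 | 0 | 0 | 75 | 5 |
| label 3 | 0 | 0 | 75 | 0 | 25 | 0 | 0 |
| label 4 | 5 | 25 | 25 | 0 | 5 | 93 | 25 |
| label 5 | 95 | 0 | 0 | 0 | 60 | 0 | 5 |
| label 6 | 0 | 60 | 0 | 95 | 0 | 0 | 90 |
| label 7 | 5 | 98 | 5 | 60 | 25 | 0 | 79 |

Co-occurrence:
$$P_C(term_i \mid term_j) = \frac{cfm(term_i, term_j)}{cfm(term_j, term_j)}$$

Alpha set:
$$\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in Term : P_C(t_i \mid t_j) > \alpha \,\}$$

Beta set:
$$\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in Term : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}$$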
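The co-occurrence formula can be sketched in a few lines of Python. This is only an illustration under assumptions: `cfm` is read here as a square count matrix whose diagonal entry `cfm[j][j]` is the number of items containing term j, the function name is hypothetical, and the counts are invented for the example rather than taken from the talk.

```python
from typing import List


def cooccurrence_proportions(cfm: List[List[int]]) -> List[List[float]]:
    """Return P_C(i|j) = cfm[i][j] / cfm[j][j], expressed in percent."""
    n = len(cfm)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            diag = cfm[j][j]  # assumed: number of items containing term j
            if diag > 0:      # leave 0.0 when term j never occurs
                result[i][j] = 100.0 * cfm[i][j] / diag
    return result


# Hypothetical counts for three terms (not data from the presentation).
cfm = [
    [10, 4, 0],
    [4, 8, 2],
    [0, 2, 5],
]
p_c = cooccurrence_proportions(cfm)
for row in p_c:
    print(row)
```

Note that the result is not symmetric: `p_c[0][1]` divides by the count of term 1, while `p_c[1][0]` divides by the count of term 0.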
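Proposed solution: rule learning (Resolution)
Example:

| % | term 1 | term 2 | term 3 | term 4 | term 5 | term 6 | term 7 |
|---|---|---|---|---|---|---|---|
| label 1 | 0 | 0 | 5 | 0 | 5 | 25 | 25 |
| label 2 | 0 | 75 | 0 | 0 | 0 | 75 | 5 |
| label 3 | 0 | 0 | 75 | 0 | 25 | 0 | 0 |
| label 4 | 5 | 25 | 25 | 0 | 5 | 93 | 25 |
| label 5 | 95 | 0 | 0 | 0 | 60 | 0 | 5 |
| label 6 | 0 | 60 | 0 | 95 | 0 | 0 | 90 |
| label 7 | 5 | 98 | 5 | 60 | 25 | 0 | 79 |

$$\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in Term : P_C(t_i \mid t_j) > \alpha \,\}, \quad \alpha = 91$$

$$\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in Term : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}, \quad \beta = 70$$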
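To make the example concrete, the sketch below applies the Alpha and Beta set definitions to the matrix of this slide, with the thresholds 91 and 70 given above. Rows are read as labels and columns as terms; the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

TERMS = [f"term {k}" for k in range(1, 8)]

# Rows are labels, columns are terms; values are P_C(label | term) in percent.
P_C: Dict[str, List[int]] = {
    "label 1": [0, 0, 5, 0, 5, 25, 25],
    "label 2": [0, 75, 0, 0, 0, 75, 5],
    "label 3": [0, 0, 75, 0, 25, 0, 0],
    "label 4": [5, 25, 25, 0, 5, 93, 25],
    "label 5": [95, 0, 0, 0, 60, 0, 5],
    "label 6": [0, 60, 0, 95, 0, 0, 90],
    "label 7": [5, 98, 5, 60, 25, 0, 79],
}


def alpha_beta_sets(
    p_c: Dict[str, List[int]], terms: List[str], alpha: float, beta: float
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split the terms of each label into an Alpha set and a Beta set."""
    alpha_sets: Dict[str, List[str]] = {}
    beta_sets: Dict[str, List[str]] = {}
    for label, row in p_c.items():
        # Alpha set: strictly above alpha.
        alpha_sets[label] = [t for t, v in zip(terms, row) if v > alpha]
        # Beta set: between beta and alpha, both bounds included.
        beta_sets[label] = [t for t, v in zip(terms, row) if beta <= v <= alpha]
    return alpha_sets, beta_sets


alpha_sets, beta_sets = alpha_beta_sets(P_C, TERMS, alpha=91, beta=70)
for label in P_C:
    print(label, "alpha:", alpha_sets[label], "beta:", beta_sets[label])
```

With these thresholds, label 4 gets the Alpha set {term 6}, label 6 gets the Alpha set {term 4} and the Beta set {term 7}, and label 1 gets two empty sets.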
Proposed solution: classification with web reasoner
Classification at query time using backward chaining
Core Ontology

| DL concept | Description |
|---|---|
| $Item \sqsubseteq \exists hasTerm.Term$ | Items to classify (e.g. documents) have terms |
| $Term \sqsubseteq \top$ | Terms (e.g. words) extracted from items |
| $Label \sqsubseteq Term$ | Labels are terms used to classify items |
| $Label \sqsubseteq \forall broader.Label$ | Broader relation between labels |
| $Label \sqsubseteq \forall narrower.Label$ | Narrower relation between labels |
| $broader \equiv narrower^{-}$ | Broader and narrower are inverse |
| $Item \sqcap Term = \emptyset$ | Items and Terms are disjoint |
| $Item \sqsubseteq \forall isClassified.Label$ | Relation that links items with labels |
Implementation: rule creation
• Distributed process using MapReduce
• OWL API used to generate SWRL rules from the output:

$$Item(?it),\ Term(term_j),\ Label(term_i),\ hasTerm(?it, term_j) \rightarrow isClassified(?it, term_i)$$
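The slide does not detail the MapReduce job, so the following is only a single-process sketch of how co-occurrence counts could be produced in a map phase and a reduce phase. The function names, the choice of emitting every ordered pair of terms per item, and the sample items are all assumptions made for illustration; a real job would run on a distributed framework.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, Iterator, Set, Tuple

Pair = Tuple[str, str]


def map_phase(item_terms: Set[str]) -> Iterator[Tuple[Pair, int]]:
    """Emit ((t_i, t_j), 1) for every ordered pair of terms in one item.

    Pairs (t, t) are included so that the diagonal counts the items
    containing t.
    """
    for pair in product(sorted(item_terms), repeat=2):
        yield pair, 1


def reduce_phase(pairs: Iterable[Tuple[Pair, int]]) -> Dict[Pair, int]:
    """Sum the emitted values per key to obtain co-occurrence counts."""
    counts: Dict[Pair, int] = defaultdict(int)
    for pair, value in pairs:
        counts[pair] += value
    return dict(counts)


# Hypothetical items, each reduced to its set of terms.
items = [
    {"storm", "rain"},
    {"storm", "wind"},
    {"storm", "rain", "flood"},
]
emitted = (kv for item in items for kv in map_phase(item))
cfm = reduce_phase(emitted)

# Proportion of items containing "storm" that also contain "rain".
p_rain_given_storm = cfm[("rain", "storm")] / cfm[("storm", "storm")]
print(p_rain_given_storm)
```

Because each item is mapped independently and the reduce step only sums per key, the same logic splits naturally across workers.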
Implementation: rule creation
Generated rules (example)

Alpha rules:
$$Item(?it),\ Term(t_1),\ Label(term_i),\ hasTerm(?it, t_1) \rightarrow isClassified(?it, term_i)$$
$$Item(?it),\ Term(t_2),\ Label(term_i),\ hasTerm(?it, t_2) \rightarrow isClassified(?it, term_i)$$

Beta rules:
$$Item(?it),\ Term(t_1),\ Term(t_2),\ Label(term_i),\ hasTerm(?it, t_1),\ hasTerm(?it, t_2) \rightarrow isClassified(?it, term_i)$$
$$Item(?it),\ Term(t_1),\ Term(t_3),\ Label(term_i),\ hasTerm(?it, t_1),\ hasTerm(?it, t_3) \rightarrow isClassified(?it, term_i)$$
$$Item(?it),\ Term(t_2),\ Term(t_3),\ Label(term_i),\ hasTerm(?it, t_2),\ hasTerm(?it, t_3) \rightarrow isClassified(?it, term_i)$$
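The pattern of the rules above can be reproduced with a short generator. This sketch only builds plain-text strings that mirror the slide's notation; it does not use the OWL API and is not valid SWRL serialization. The function names are hypothetical, and the combination size of 2 for Beta rules is an assumption read off the example (every pair among three Beta terms).

```python
from itertools import combinations
from typing import List


def alpha_rules(label: str, alpha_terms: List[str]) -> List[str]:
    """One rule per Alpha term: a single term is enough to assign the label."""
    return [
        f"Item(?it), Term({t}), Label({label}), hasTerm(?it, {t})"
        f" -> isClassified(?it, {label})"
        for t in alpha_terms
    ]


def beta_rules(label: str, beta_terms: List[str], size: int = 2) -> List[str]:
    """One rule per combination of `size` Beta terms (size=2 is assumed)."""
    rules = []
    for combo in combinations(beta_terms, size):
        term_atoms = ", ".join(f"Term({t})" for t in combo)
        has_atoms = ", ".join(f"hasTerm(?it, {t})" for t in combo)
        rules.append(
            f"Item(?it), {term_atoms}, Label({label}), {has_atoms}"
            f" -> isClassified(?it, {label})"
        )
    return rules


a_rules = alpha_rules("term_i", ["t1", "t2"])
b_rules = beta_rules("term_i", ["t1", "t2", "t3"])
for rule in a_rules + b_rules:
    print(rule)
```

For two Alpha terms and three Beta terms this yields the two Alpha rules and three Beta rules listed on the slide.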
Implementation: classification at query time
• Stardog used as a scalable triple store (compatible with backward-chaining inference as well as SWRL rule inference)
• Rule selection process developed in Java, interacting with Stardog to optimize query performance
(Figure labels: Resolution, Realization)
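As a rough illustration of the idea, the sketch below selects the rules relevant to an item and then answers the goal isClassified(item, label) by looking only at rules whose head is that label. It is a toy, in-memory stand-in written in Python: it does not reproduce Stardog's reasoning or the authors' Java rule selection, and the rule encoding, function names, terms and labels are all hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (terms required in the body, label in the head).
Rule = Tuple[FrozenSet[str], str]


def select_rules(rules: List[Rule], item_terms: Set[str]) -> List[Rule]:
    """Keep only the rules that mention at least one term of the item."""
    return [rule for rule in rules if rule[0] & item_terms]


def is_classified(item_terms: Set[str], label: str, rules: List[Rule]) -> bool:
    """Goal-directed check: start from the goal label, then test rule bodies."""
    return any(head == label and body <= item_terms for body, head in rules)


def classify(item_terms: Set[str], rules: List[Rule]) -> Set[str]:
    """Return every label that can be derived for the item."""
    candidates = select_rules(rules, item_terms)
    goals = {head for _, head in candidates}
    return {label for label in goals if is_classified(item_terms, label, candidates)}


rules: List[Rule] = [
    (frozenset({"t1"}), "weather"),        # Alpha rule: one term suffices
    (frozenset({"t2", "t3"}), "sport"),    # Beta rule: both terms required
    (frozenset({"t4", "t5"}), "finance"),  # Beta rule: both terms required
]

labels_a = classify({"t1", "t2", "t5"}, rules)
labels_b = classify({"t2", "t3"}, rules)
labels_c = classify({"t9"}, rules)
print(labels_a, labels_b, labels_c)
```

Selecting rules before evaluating them keeps the work proportional to the rules that share a term with the item, which is one plausible reading of why a selection step helps query performance.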
Implementation: test environment
(Figure: test environment)
Implementation: parameter setup

| Parameter | Step | Value |
|---|---|---|
| Alpha threshold | Resolution | 90 |
| Beta threshold | Resolution | 80 |
| Term ranking (n) | Resolution | 5 |
| p | Resolution | 0.25 |
| Term threshold ($\gamma$) | Realization | 2 |
Results
Number of classifications: $Item \sqsubseteq \forall isClassified.Label$
Results
Number of learned rules (Alpha + Beta)
(Figure annotations: $\alpha = 91$, $\beta = 80$, 90)
Conclusion
• A new unsupervised process to automatically classify items
  - A highly scalable rule learning method based on statistical and lexical approaches
  - A novel method to classify items using a web reasoner
• The process prototype was successfully implemented on a scalable and distributed platform to process Big Data
• Preliminary results show that the items are classified automatically by the reasoner
Ongoing and Future Work
• Quality evaluation of the process: comparison with the state of the art in classification
• Predictive performance evaluation based on cross-validation with a large dataset
• Optimization of the process by exhaustive analysis of the parameters' impact
• Application to the classification of news articles on the web
Thank you!