An unsupervised classification process for large datasets based on web reasoning


  1. An unsupervised classification process for large datasets based on web reasoning Rafael PEIXOTO, Thomas HASSAN, Christophe CRUZ, Aurelie BERTAUX, Nuno SILVA thomas.hassan@u-bourgogne.fr Laboratoire LE2I – UMR CNRS 6306 – Université de Bourgogne

  2. Outline
     • Context: Global problem; The Semantic HMC
     • Specific Problem: Proposed Solution
     • Implementation: Setup; Results
     • Conclusion and future work

  3. Global Problem: value extraction from Big Data sources

  4. Global Problem

  5. Proposition: « Semantic HMC »

  6. Proposition: « Semantic HMC »

  7. Proposition: « Semantic HMC ». Unsupervised ontology learning; rule-based classification (Web Reasoner)

  8. Outline
     • Context: Global problem; The Semantic HMC
     • Specific Problem: Proposed Solution
     • Implementation: Setup; Results
     • Conclusion and future work

  9. Specific Problem: rule-based reasoning to perform classification. Unsupervised ontology learning; rule-based classification

  10. Specific Problem
      • Resolution: learn classification rules from large volumes of unstructured text, using a distributed method that exploits the co-occurrence matrix
      • Realization: classify large volumes of new data items, using a Web Reasoner

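  11. Proposed solution: rule learning (Resolution). Learning Alpha and Beta sets

      | $P_C(i \mid j)$ | term 1 | term 2 | term 3 | term 4 | term 5 | term 6 | term 7 |
      |---|---|---|---|---|---|---|---|
      | label 1 | 0 | 0 | 5 | 0 | 5 | 25 | 25 |
      | label 2 | 0 | 75 | 0 | 0 | 0 | 75 | 5 |
      | label 3 | 0 | 0 | 75 | 0 | 25 | 0 | 0 |
      | label 4 | 5 | 25 | 25 | 0 | 5 | 93 | 25 |
      | label 5 | 95 | 0 | 0 | 0 | 60 | 0 | 5 |
      | label 6 | 0 | 60 | 0 | 95 | 0 | 0 | 90 |
      | label 7 | 5 | 98 | 5 | 60 | 25 | 0 | 79 |

      Co-occurrence: $P_C(term_i \mid term_j) = \dfrac{cfm(term_i, term_j)}{cfm(term_j, term_j)}$

      Alpha set: $\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in Term : P_C(t_i \mid t_j) > \alpha \,\}$

      Beta set: $\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in Term : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}$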
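
The co-occurrence proportion on this slide divides the count of items in which two terms appear together by the count of items containing the conditioning term. A minimal Python sketch of that ratio follows; the `cfm` dictionary layout, the function name `proportion`, and the toy counts are assumptions made for illustration and do not come from the slides.

```python
def proportion(cfm, term_i, term_j):
    """Return P_C(term_i | term_j) = cfm(term_i, term_j) / cfm(term_j, term_j).

    cfm maps (term_a, term_b) to the number of items containing both terms;
    cfm[(t, t)] is the number of items containing t. Missing pairs count as 0.
    """
    denominator = cfm.get((term_j, term_j), 0)
    if denominator == 0:
        return 0.0  # term_j never occurs, so the proportion is taken as 0
    return cfm.get((term_i, term_j), 0) / denominator


# Hypothetical counts: "java" occurs in 3 items, "code" in 2, both together in 2.
cfm = {
    ("java", "java"): 3,
    ("code", "code"): 2,
    ("java", "code"): 2,
    ("code", "java"): 2,
}

p_java_given_code = proportion(cfm, "java", "code")  # 2 / 2
p_code_given_java = proportion(cfm, "code", "java")  # 2 / 3
```

The measure is asymmetric: every item containing "code" also contains "java", but not the reverse, which is what lets a threshold on it separate strong from weak indicators of a label.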

  12. Proposed solution: rule learning (Resolution). Learning Alpha and Beta sets

      Alpha set: $\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in Term : P_C(t_i \mid t_j) > \alpha \,\}$

      Beta set: $\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in Term : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}$

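  13. Proposed solution: rule learning (Resolution). Example:

      | % | term 1 | term 2 | term 3 | term 4 | term 5 | term 6 | term 7 |
      |---|---|---|---|---|---|---|---|
      | label 1 | 0 | 0 | 5 | 0 | 5 | 25 | 25 |
      | label 2 | 0 | 75 | 0 | 0 | 0 | 75 | 5 |
      | label 3 | 0 | 0 | 75 | 0 | 25 | 0 | 0 |
      | label 4 | 5 | 25 | 25 | 0 | 5 | 93 | 25 |
      | label 5 | 95 | 0 | 0 | 0 | 60 | 0 | 5 |
      | label 6 | 0 | 60 | 0 | 95 | 0 | 0 | 90 |
      | label 7 | 5 | 98 | 5 | 60 | 25 | 0 | 79 |

      Alpha set, with $\alpha = 91$: $\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in Term : P_C(t_i \mid t_j) > \alpha \,\}$

      Beta set, with $\beta = 70$ and $\alpha = 91$: $\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in Term : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}$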
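
To make the thresholds concrete, the sketch below applies the Alpha and Beta set definitions to the percentages of this example, using the slide's values of 91 for the Alpha threshold and 70 for the Beta threshold. The table values are copied from the slide; the function and variable names are illustrative assumptions.

```python
TERMS = ["term 1", "term 2", "term 3", "term 4", "term 5", "term 6", "term 7"]

# Co-occurrence proportions in percent, one row per label (values from the slide).
TABLE = {
    "label 1": [0, 0, 5, 0, 5, 25, 25],
    "label 2": [0, 75, 0, 0, 0, 75, 5],
    "label 3": [0, 0, 75, 0, 25, 0, 0],
    "label 4": [5, 25, 25, 0, 5, 93, 25],
    "label 5": [95, 0, 0, 0, 60, 0, 5],
    "label 6": [0, 60, 0, 95, 0, 0, 90],
    "label 7": [5, 98, 5, 60, 25, 0, 79],
}

ALPHA = 91  # Alpha threshold used in the slide's example
BETA = 70   # Beta threshold used in the slide's example


def alpha_set(row, alpha):
    """Terms whose proportion is strictly above the Alpha threshold."""
    return [term for term, p in zip(TERMS, row) if p > alpha]


def beta_set(row, alpha, beta):
    """Terms whose proportion lies between the Beta and Alpha thresholds, inclusive."""
    return [term for term, p in zip(TERMS, row) if beta <= p <= alpha]


sets = {
    label: (alpha_set(row, ALPHA), beta_set(row, ALPHA, BETA))
    for label, row in TABLE.items()
}

for label, (alpha_terms, beta_terms) in sets.items():
    print(label, "alpha:", alpha_terms, "beta:", beta_terms)
```

With these thresholds, "label 6" gets "term 4" (95) in its Alpha set and "term 7" (90) in its Beta set, while "label 1" gets neither because none of its proportions reach 70.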

  14. Proposed solution: classification with web reasoner. Classification at query-time using backward-chaining

  15. Core Ontology

      | DL concepts | Description |
      |---|---|
      | Item ⊑ ∃hasTerm.Term | Items to classify (e.g. documents) have terms |
      | Term ⊑ ⊤ | Terms (e.g. words) extracted from items |
      | Label ⊑ Term | Labels are terms used to classify items |
      | Label ⊑ ∀broader.Label | Broader relation between labels |
      | Label ⊑ ∀narrower.Label | Narrower relation between labels |
      | broader ≡ narrower⁻ | Broader and narrower are inverse |
      | Item ⊓ Term = ∅ | Items and Terms are disjoint |
      | Item ⊑ ∀isClassified.Label | Relation that links items with labels |

  16. Outline
      • Context: Global problem; The Semantic HMC
      • Specific Problem: Proposed Solution
      • Implementation: Setup; Results
      • Conclusion and future work

  17. Implementation: rule creation. Distributed process using MapReduce; the OWL API is used to generate SWRL rules from the output:

      Item(?it), Term(term_j), Label(term_i), hasTerm(?it, term_j) → isClassified(?it, term_i)
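
The slide does not detail the MapReduce job, so the following is only a single-process Python sketch of how co-occurrence counting can be split into a map phase and a reduce phase. It does not use Hadoop or any distributed runtime, and the function names and sample items are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def map_item(terms):
    """Map phase: for one item, emit ((term_a, term_b), 1) for every ordered
    pair of its distinct terms, including (t, t) so that term frequency is
    counted on the diagonal."""
    unique = sorted(set(terms))
    for term_a, term_b in product(unique, repeat=2):
        yield (term_a, term_b), 1


def reduce_counts(pairs):
    """Reduce phase: sum the emitted values per (term_a, term_b) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def cooccurrence_matrix(items):
    """Run the map phase over every item, then reduce all emitted pairs."""
    emitted = (pair for terms in items for pair in map_item(terms))
    return reduce_counts(emitted)


# Hypothetical items, each given as the list of terms extracted from it.
items = [
    ["java", "code", "bug"],
    ["java", "code"],
    ["java", "coffee"],
    ["coffee", "cup"],
]
cfm = cooccurrence_matrix(items)
```

Because each item is mapped independently and the reduce step is a per-key sum, the same structure can be partitioned across workers, which is the property a MapReduce implementation relies on.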

  18. Implementation: rule creation. Example of generated rules

      Alpha rules:
      Item(?it), Term(t1), Label(term_i), hasTerm(?it, t1) → isClassified(?it, term_i)
      Item(?it), Term(t2), Label(term_i), hasTerm(?it, t2) → isClassified(?it, term_i)

      Beta rules:
      Item(?it), Term(t1), Term(t2), Label(term_i), hasTerm(?it, t1), hasTerm(?it, t2) → isClassified(?it, term_i)
      Item(?it), Term(t1), Term(t3), Label(term_i), hasTerm(?it, t1), hasTerm(?it, t3) → isClassified(?it, term_i)
      Item(?it), Term(t2), Term(t3), Label(term_i), hasTerm(?it, t2), hasTerm(?it, t3) → isClassified(?it, term_i)
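
The rule shapes on this slide can be rendered mechanically from a label's Alpha and Beta sets. The sketch below only builds rule text in Python; it does not use the OWL API mentioned in the slides, it writes the arrow as ASCII `->`, and it assumes Beta rules pair terms two at a time because that is what the slide's example shows. All function names are hypothetical.

```python
from itertools import combinations


def alpha_rules(label, alpha_terms):
    """One rule per Alpha term: a single matching term classifies the item."""
    return [
        f"Item(?it), Term({term}), Label({label}), hasTerm(?it, {term})"
        f" -> isClassified(?it, {label})"
        for term in alpha_terms
    ]


def beta_rules(label, beta_terms, size=2):
    """One rule per combination of `size` Beta terms: all of them must match."""
    rules = []
    for combo in combinations(beta_terms, size):
        term_atoms = ", ".join(f"Term({term})" for term in combo)
        has_atoms = ", ".join(f"hasTerm(?it, {term})" for term in combo)
        rules.append(
            f"Item(?it), {term_atoms}, Label({label}), {has_atoms}"
            f" -> isClassified(?it, {label})"
        )
    return rules


generated_alpha = alpha_rules("term_i", ["t1", "t2"])
generated_beta = beta_rules("term_i", ["t1", "t2", "t3"])

for rule in generated_alpha + generated_beta:
    print(rule)
```

Called with the terms of the slide's example, this yields two Alpha rules and three Beta rules with the same atoms as those listed on the slide.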

  19. Implementation: Classification at query-time. Stardog is used as a scalable triple store (compatible with backward-chaining inference as well as SWRL rule inference). A rule selection process developed in Java interacts with Stardog to optimize query performance. [Diagram labels: Resolution, Realization]
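
The slides delegate classification to the reasoner and do not describe the rule selection process in detail. As a rough illustration of what the learned rules conclude for a single item, the Python sketch below treats each rule as a set of required terms plus a label and checks set inclusion. It does not call Stardog or perform backward chaining; the rule encoding, function name, and sample data are assumptions.

```python
def classify(item_terms, rules):
    """Return every label whose rule has all of its required terms in the item.

    rules is a list of (required_terms, label) pairs. An Alpha rule has one
    required term; a Beta rule has several, all of which must be present.
    """
    terms = set(item_terms)
    return {label for required, label in rules if required <= terms}


# Hypothetical rules: one Alpha rule and two Beta rules.
rules = [
    (frozenset({"t1"}), "L1"),        # Alpha rule: t1 alone implies L1
    (frozenset({"t2", "t3"}), "L2"),  # Beta rule: t2 and t3 together imply L2
    (frozenset({"t4", "t5"}), "L3"),  # Beta rule: t4 and t5 together imply L3
]

labels = classify(["t1", "t2", "t3", "t4"], rules)
```

Here the item is classified under `L1` and `L2` but not `L3`, because only one of the two terms required by the last rule is present.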

  20. Implementation: test environment

  21. Implementation: parameter setup

      | Parameter | Step | Value |
      |---|---|---|
      | Alpha Threshold | Resolution | 90 |
      | Beta Threshold | Resolution | 80 |
      | Term ranking (n) | Resolution | 5 |
      | p | Resolution | 0.25 |
      | Term Threshold (γ) | Realization | 2 |

  22. Results. Number of classifications: Item ⊑ ∀isClassified.Label

  23. Results. Number of learned rules (Alpha + Beta)

  24. Results. Number of learned rules (Alpha + Beta) [plot annotations: α = 91, β = 80, 90]

  25. Outline
      • Context: Global problem; The Semantic HMC
      • Specific Problem: Proposed Solution
      • Implementation: Setup; Results
      • Conclusion and future work

  26. Conclusion
      • A new unsupervised process to automatically classify items
        - A highly scalable rule learning method based on statistical and lexical approaches
        - A novel method to classify items using a web reasoner
      • The process prototype was successfully implemented on a scalable and distributed platform to process Big Data
      • Preliminary results show that the items are classified automatically by the reasoner

  27. Ongoing and Future Work
      • Quality evaluation of the process: comparison with the state of the art in classification
      • Predictive performance evaluation based on cross-validation with a large dataset
      • Optimization of the process by exhaustive analysis of the parameters' impact
      • Application to the classification of news articles on the web

  28. An unsupervised classification process for large datasets using web reasoning. Thank you! Rafael PEIXOTO, Thomas HASSAN, Christophe CRUZ, Aurelie BERTAUX, Nuno SILVA thomas.hassan@u-bourgogne.fr Laboratoire LE2I – UMR CNRS 6306 – Université de Bourgogne
