COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC - PowerPoint PPT Presentation

COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC PLATFORMS: Using a Web Crawler in Python with Scrapy. Bachelor Thesis, Final Presentation. Louis Mbuyu. Supervisor: Prof. Dr. François Bry. Advisors: Prof. Dr. François Bry,


  1. COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC PLATFORMS: Using a Web Crawler in Python with Scrapy. Bachelor Thesis, Final Presentation. Louis Mbuyu. Supervisor: Prof. Dr. François Bry. Advisors: Prof. Dr. François Bry, Yingding Wang. 12.04.18

  2. AGENDA
     1. Introduction / Goal
     2. Defining MOOC model
     3. Web Scraper / Results
     4. Gold standard selection
     5. Text categorisation approach
     6. Gold standard evaluation
     7. Evaluation of all platforms
     8. Conclusion & Future work

  3. 1. Introduction / Goal

  4. Motivation
     • IROM: Intelligent Recommender Of MOOCs
     • MOOC: Massive Open Online Course
     The goal of IROM
     • To improve learning and studying at university.
     • To develop an intelligent MOOC search engine.
     Goal of the thesis
     • Define a unified set of categories across all MOOC platforms.

  5. Motivation

  6. Modified Goal

  7. Tasks
     1. Define a MOOC model
     2. Build a Web Scraper and extract data
     3. Select a platform as the “Gold standard”
     4. Text categorisation approach (TF-IDF and cosine similarity)
     5. Evaluate the TF-IDF and cosine similarity approach
     6. Categorise courses from other platforms
     7. Evaluate the results

  8. 2. Defining MOOC model

  11. Motivation - MOOC platforms

  12. Unified MOOC Model (Table)

      Property           Coursera  Udacity  Edx  FutureLearn  Open2Study  Udemy
      Url                ✓         ✓        ✓    ✓            ✓           ✓
      Title              ✓         ✓        ✓    ✓            ✓           ✓
      Summary            ✓         ✓        ✓    ✓            ✓           ✓
      Description        ✓         ✓        ✓    ✓            ✓           ✓
      Subcategory        ✓         ✓        ✓    ✘            ✘           ✘
      Category           ✓         ✓        ✓    ✓            ✓           ✓
      WhyTakeThisCourse  ✓         ✓        ✓    ✓            ✓           ✓
      Provider           ✓         ✓        ✓    ✓            ✓           ✓
      Level              ✓         ✓        ✓    ✓            ✘           ✘
      ImageUrl           ✓         ✓        ✓    ✓            ✓           ✓
      Price              ✓         ✓        ✓    ✓            ✓           ✓
      Duration           ✓         ✓        ✓    ✓            ✓           ✓
      RatingValue        ✓         ✓        ✓    ✓            ✓           ✘
      RatingAmount       ✓         ✓        ✓    ✓            ✓           ✘
      StartDate          ✓         ✓        ✓    ✓            ✘           ✘
      EndDate            ✓         ✓        ✘    ✘            ✘           ✘
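
One way to picture the unified model is as a single record type whose fields are the table's properties. The sketch below is only an illustration: the snake_case names, the Python types, and the extra `platform` field are assumptions, not taken from the thesis. Properties that at least one platform lacks are modelled as optional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoocCourse:
    """One scraped course in the unified model (hypothetical field names)."""

    # Properties marked as available on all six platforms
    url: str
    title: str
    summary: str
    description: str
    category: str
    why_take_this_course: str
    provider: str
    image_url: str
    price: str      # assumed type; the slides do not state it
    duration: int   # the course JSON on a later slide lists duration as Int
    platform: str   # assumed extra field: source platform of the record
    # Properties missing on at least one platform, hence optional
    subcategory: Optional[str] = None
    level: Optional[str] = None
    rating_value: Optional[float] = None
    rating_amount: Optional[int] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
```

A record from a platform without ratings or dates simply leaves the optional fields at `None`.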

  13. 3. Web Scraper / Results

  14. Scraped Data (Table)

      Platform     Total number of courses
      Coursera     3,032
      Udacity      232
      Edx          1,098
      FutureLearn  193
      Open2Study   49
      Udemy        40,003
      All          44,607

  15. 4. Gold standard selection

  16. Gold standard criteria
      1. Number of categories
      2. Number of courses
      3. Diversity
      4. Represent University Subjects

  17. Gold standard elimination process

      Criterion          Coursera  Udacity  Edx  FutureLearn  Open2Study  Udemy
      No. of categories  ✓         ✓        ✓    ✓            ✘           ✘
      No. of courses     ✓         ✓        ✓    ✘            ✘           ✘
      Diversity          ✓         ✓        ✓    ✓            ✓           ✘
      University rep.    ✓         ✓        ✓    ✓            ✘           ✘

  18. Gold standard structure

  19. Gold standard structure

  20. Gold standard structure

  21. 5. Text categorisation approach

  22. Text categorisation approach (Step 1) Query the database (MongoDB): ‘Platform’ = ‘Coursera’, GROUP BY ‘Subcategory’
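
The query on this slide could be written as a MongoDB aggregation pipeline. The sketch below is an assumption-laden illustration: the document field names `platform` and `subcategory` are guesses about how the scraped courses are stored, and `group_by_subcategory` is a hypothetical in-memory helper that mirrors the pipeline so the idea can be tried without a database.

```python
from collections import defaultdict

# Aggregation pipeline expressing "platform = Coursera, grouped by subcategory".
# The field names are assumptions about the stored documents.
pipeline = [
    {"$match": {"platform": "Coursera"}},
    {"$group": {"_id": "$subcategory", "courses": {"$push": "$$ROOT"}}},
]


def group_by_subcategory(courses, platform="Coursera"):
    """In-memory equivalent of the pipeline, for illustration only."""
    groups = defaultdict(list)
    for course in courses:
        if course.get("platform") == platform:
            groups[course.get("subcategory")].append(course)
    return dict(groups)
```

With PyMongo, the pipeline would be passed to `collection.aggregate(pipeline)`; the helper returns the same grouping as a plain dictionary.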

  23. Text categorisation approach (Step 2) Each subcategory maps to an array of courses (JSON objects):
      Finance        → course 1, course 2, …, course n
      Marketing      → course 1, course 2, …, course n
      Algorithms     → course 1, course 2, …, course n
      …
      Subcategory m  → course 1, course 2, …, course n

  24. Text categorisation approach (Step 3) Iterate through the courses, then extract and combine the ‘title’, ‘Summary’ and ‘Description’. Each subcategory now maps to an array of courses (strings):
      Finance        → course 1, course 2, …, course n
      Marketing      → course 1, course 2, …, course n
      …
      Subcategory m  → course 1, course 2, …, course n

  25. Text categorisation approach (Step 4) Join each array/list of strings into one string. Each subcategory now maps to the combined text of its courses:
      Finance        → “Intro into Finance. This course …”
      Marketing      → “Marketing 101. Learn fundamentals …”
      …
      Subcategory m  → course 1, course 2, …, course n combined
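
Steps 3 and 4 can be sketched in a few lines. This is a minimal illustration, assuming each course is a dictionary with lowercase keys `title`, `summary` and `description`; the function names are hypothetical.

```python
def course_text(course):
    """Step 3: combine title, summary and description of one course."""
    parts = (course.get(key, "") for key in ("title", "summary", "description"))
    return " ".join(part for part in parts if part)


def subcategory_documents(groups):
    """Step 4: join all course texts of a subcategory into one string."""
    return {
        subcategory: " ".join(course_text(course) for course in courses)
        for subcategory, courses in groups.items()
    }
```

The result is one long text "document" per subcategory, which is the unit later compared against a course query.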

  26. Text categorisation approach (Step 5) Preprocess the data: remove all stop words and punctuation, convert all words to lowercase, and stem all words. Each subcategory now maps to the preprocessed combined text of its courses:
      Finance        → “intro financ cours …”
      Marketing      → “market learn fundamental …”
      …
      Subcategory m  → course 1, course 2, …, course n combined and preprocessed
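
The preprocessing step might look roughly like the sketch below. The slides do not say which stop-word list or stemmer was used, so both are assumptions here: `STOP_WORDS` is a tiny illustrative list and `crude_stem` is a naive suffix stripper standing in for a real stemming algorithm.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "this", "to", "of", "in", "into", "is", "for"}
SUFFIXES = ("ing", "es", "s", "e")  # crude stand-in for a real stemmer


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem each word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(word) for word in words if word not in STOP_WORDS]
```

For example, `preprocess("Intro into Finance. This course")` yields `["intro", "financ", "cours"]`, matching the shape of the Finance example on the slide.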

  27. Text categorisation approach (Step 6) A course (query) from another platform that needs to be categorised: { “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }. Extract and combine the ‘title’, ‘Summary’ and ‘Description’. Preprocess the data: remove all stop words and punctuation, convert all words to lowercase, and stem all words.

  28. Text categorisation approach (Step 6) Calculate TF-IDF and cosine similarity for all subcategories. The course is categorised to the subcategory with the highest value. Diagram: course(s) { “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … } → TF-IDF and cosine similarity → subcategories (String): Finance, Marketing, …, Subcategory m
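
The categorisation step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the thesis implementation: it uses raw term counts with a smoothed IDF (one of several common weighting variants), and all function names are hypothetical. Inputs are already-preprocessed token lists.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """TF-IDF vector of one token list; terms unknown to idf are ignored."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def tfidf_vectors(docs):
    """Return (vectors, idf) for tokenised documents given as {name: [tokens]}."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    # Smoothed IDF (an assumed variant): terms in every document keep a small weight
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {name: weigh(tokens, idf) for name, tokens in docs.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def categorise(query_tokens, docs):
    """Assign the query to the subcategory with the highest cosine similarity."""
    vectors, idf = tfidf_vectors(docs)
    query = weigh(query_tokens, idf)
    return max(vectors, key=lambda name: cosine(query, vectors[name]))
```

Note that if the query shares no terms with any subcategory, every similarity is zero and `max` simply returns the first subcategory; a real system would probably want to flag such cases.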

  29. TF-IDF & Cosine Similarity
      • TF-IDF: term frequency, inverse document frequency.
      • Term frequency: how frequently a term appears in a given document.
      • Inverse document frequency: diminishes the weight of terms that appear very frequently in the corpus and increases the weight of terms that appear rarely.
      • Cosine similarity: a measure of similarity between two vectors, given by the cosine of the angle between them.
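
For reference, one common textbook formulation of these quantities is given below. The slides do not state which weighting variant the thesis uses, so the exact form is an assumption. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents (subcategories) and $n_t$ is the number of documents containing $t$.

```latex
\begin{align}
\mathrm{tf}(t, d) &= \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \\
\mathrm{idf}(t) &= \log \frac{N}{n_t} \\
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \\
\cos(\mathbf{u}, \mathbf{v}) &= \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
\end{align}
```

Because TF-IDF weights are non-negative, the cosine similarity of two such vectors lies between 0 (no shared terms) and 1 (same direction).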

  30. 6. Approach evaluation

  31. Approach evaluation Diagram: Coursera courses { “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … } → TF-IDF and cosine similarity approach → new category

  32. Approach evaluation Accuracy is the ratio of correctly categorised courses to the total number of courses. Gold standard accuracy = 2625 / 3032 ≈ 0.87
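
The accuracy figure can be reproduced directly from the two counts on the slide. The helper name below is hypothetical; only the numbers 2625 and 3032 come from the presentation.

```python
def accuracy(correct, total):
    """Ratio of correctly categorised courses to all courses."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


gold_standard_accuracy = accuracy(2625, 3032)
print(round(gold_standard_accuracy, 2))  # 0.87
```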

  33. 7. Evaluation of all platforms

  34. Evaluation of all platforms (Udacity) Example: “Intro. to Android” (Udacity category: Android) → TF-IDF and cosine similarity approach → category: Computer Science. Good or bad outcome?

  35. Evaluation of all platforms (Udacity) Figure: gold standard (Coursera) categories versus Udacity categories

  36. Evaluation of all platforms (Udacity) *The heat-map shows the percentage of courses categorised to each category, with darker colours indicating a greater percentage.
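
The percentages behind such a heat-map could be computed as in the sketch below. It assumes, as a guess about the figure's layout, that each row is an original Udacity category and each cell is the share of that category's courses assigned to a given gold-standard category; the function name and input format are hypothetical.

```python
from collections import Counter, defaultdict


def category_percentages(pairs):
    """Map each original category to {assigned category: percentage of its courses}."""
    counts = defaultdict(Counter)
    for original, assigned in pairs:
        counts[original][assigned] += 1
    return {
        original: {
            assigned: 100.0 * n / sum(row.values()) for assigned, n in row.items()
        }
        for original, row in counts.items()
    }
```

Each row of the result sums to 100, so rows with one dominant cell indicate a platform category that maps cleanly onto a single gold-standard category.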

  37. Grading schema

  38. Evaluation of all platforms (Udacity) Udacity evaluation table

  39. Evaluation of all platforms (Udacity) Udacity courses distribution (Pie Chart)

  40. Evaluation of all platforms (Udacity) Udacity courses distribution (Table)

  41. Evaluation of all platforms (Edx)

  42. 8. Conclusion & Future work

  43. Conclusion
      1. Approx. 45,000 courses scraped and indexed for IROM.
      2. Using Coursera’s categories as the gold standard proved a good choice.
      3. The TF-IDF and cosine similarity approach also gave positive results.

  44. Future work
      1. Measure the quality of the scraped data
      2. Better approach: machine learning (neural networks, etc.)
      3. Evaluating text categorisation

  45. Thank you.
