COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC - PowerPoint PPT Presentation

COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC PLATFORMS
Using a Web Crawler in Python with Scrapy
Bachelor Thesis - Final Presentation
Louis Mbuyu
Examiner: Prof. Dr. François Bry
Supervisors: Prof. Dr. François Bry, Yingding Wang
12.04.18

AGENDA
1. Introduction / Goal
2. Defining MOOC model
3. Web Scraper / Results
4. Gold standard selection
5. Text categorisation approach
6. Gold standard evaluation
7. Evaluation of all platforms
8. Conclusion & Future work

1. Introduction / Goal

Motivation
• IROM - Intelligent Recommender Of MOOCs
• MOOC - Massive Open Online Course
The goal of IROM
• To improve learning and studying at the university.
• To develop an intelligent MOOC search engine.
Goal of thesis
• Define a unified set of categories across all MOOC platforms.

Motivation

Modified Goal

Tasks
1. Define a MOOC model
2. Build a Web Scraper and extract data
3. Select a platform as the “Gold standard”
4. Text categorisation approach (TF-IDF & cosine similarity)
5. Evaluate the TF-IDF and cosine similarity approach
6. Categorise courses from other platforms
7. Evaluate the results

2. Defining MOOC model

Motivation - MOOC platforms

Unified MOOC Model (Table)
Attribute          Coursera  Udacity  Edx  FutureLearn  Open2Study  Udemy
Url                ✓         ✓        ✓    ✓            ✓           ✓
Title              ✓         ✓        ✓    ✓            ✓           ✓
Summary            ✓         ✓        ✓    ✓            ✓           ✓
Description        ✓         ✓        ✓    ✓            ✓           ✓
Subcategory        ✓         ✓        ✓    ✘            ✘           ✘
Category           ✓         ✓        ✓    ✓            ✓           ✓
WhyTakeThisCourse  ✓         ✓        ✓    ✓            ✓           ✓
Provider           ✓         ✓        ✓    ✓            ✓           ✓
Level              ✓         ✓        ✓    ✓            ✘           ✘
ImageUrl           ✓         ✓        ✓    ✓            ✓           ✓
Price              ✓         ✓        ✓    ✓            ✓           ✓
Duration           ✓         ✓        ✓    ✓            ✓           ✓
RatingValue        ✓         ✓        ✓    ✓            ✓           ✘
RatingAmount       ✓         ✓        ✓    ✓            ✓           ✘
StartDate          ✓         ✓        ✓    ✓            ✘           ✘
EndDate            ✓         ✓        ✘    ✘            ✘           ✘

3. Web Scraper / Results

Scraped Data (Table)
Platform      Total number of courses
Coursera      3,032
Udacity       232
Edx           1,098
FutureLearn   193
Open2Study    49
Udemy         40,003
All           44,607

4. Gold standard selection

Gold standard criteria
1. Number of categories
2. Number of courses
3. Diversity
4. Representation of university subjects

Gold standard elimination process
Criterion        Coursera  Udacity  Edx  FutureLearn  Open2Study  Udemy
No. of categories  ✓       ✓        ✓    ✓            ✘           ✘
No. of courses     ✓       ✓        ✓    ✘            ✘           ✘
Diversity          ✓       ✓        ✓    ✓            ✓           ✘
University rep.    ✓       ✓        ✓    ✓            ✘           ✘

Gold standard structure

5. Text categorisation approach

Text categorisation approach (Step 1)
Query Database (MongoDB): ‘Platform’ = ‘Coursera’ AND GROUP BY ‘Subcategory’
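The query in Step 1 could be sketched roughly as follows. This is only an illustration: the lowercase field names `platform` and `subcategory` are assumptions (the slide shows only ‘Platform’ and ‘Subcategory’), and `group_by_subcategory` is a hypothetical pure-Python helper that mimics the pipeline on an in-memory list so the idea can be tried without a database.

```python
from collections import defaultdict

# MongoDB aggregation pipeline: filter on the platform, then group by subcategory.
# Field names are assumed; the slides do not show the stored schema.
PIPELINE = [
    {"$match": {"platform": "Coursera"}},
    {"$group": {"_id": "$subcategory", "courses": {"$push": "$$ROOT"}}},
]


def group_by_subcategory(courses, platform="Coursera"):
    """Pure-Python equivalent of PIPELINE for a list of course dicts."""
    groups = defaultdict(list)
    for course in courses:
        if course.get("platform") == platform:
            groups[course.get("subcategory")].append(course)
    return dict(groups)
```

With pymongo, a pipeline like this would typically be passed to `Collection.aggregate(PIPELINE)`; the collection name is not given in the slides.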

Text categorisation approach (Step 2)
Subcategories → array of courses (JSON objects)
Finance: course 1, course 2, …, course n
Marketing: course 1, course 2, …, course n
Algorithms: course 1, course 2, …, course n
…
Subcategory m: course 1, course 2, …, course n

Text categorisation approach (Step 3)
Iterate through the courses, then extract and combine the ‘title’, ‘Summary’, ‘Description’
Subcategories → array of courses (String)
Finance: course 1, course 2, …, course n
Marketing: course 1, course 2, …, course n
…
Subcategory m: course 1, course 2, …, course n

Text categorisation approach (Step 4)
Join all arrays/lists of strings into one string
Subcategories → combined array of courses (String)
Finance: “Intro into Finance. This course …”
Marketing: “Marketing 101. Learn fundamentals …”
…
Subcategory m: course 1 course 2 … course n
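Steps 3 and 4 could look something like the sketch below. The function names and the lowercase keys `title`, `summary` and `description` are assumptions made for illustration; the input is assumed to be a dict mapping each subcategory to its list of course dicts, as in Step 2.

```python
def course_text(course, fields=("title", "summary", "description")):
    """Combine the text fields of one course, skipping missing or empty ones."""
    parts = [str(course[f]).strip() for f in fields if course.get(f)]
    return " ".join(parts)


def subcategory_documents(groups):
    """Map each subcategory to a single string built from all of its courses."""
    return {
        name: " ".join(course_text(c) for c in courses)
        for name, courses in groups.items()
    }
```

Each subcategory then behaves like one large "document", which is what the later TF-IDF comparison operates on.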

Text categorisation approach (Step 5)
Preprocess data: remove all stop words and punctuation, convert all words to lowercase, stem all words
Subcategories → preprocessed combined array of courses
Finance: “intro financ cours …”
Marketing: “market learn fundamental …”
…
Subcategory m: course 1 course 2 … course n
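A minimal sketch of the preprocessing in Step 5 is shown below. The stop-word list is a tiny illustrative sample, and `naive_stem` is a deliberately crude stand-in for a real stemmer (for example a Porter stemmer); the slides do not say which stop-word list or stemmer was actually used.

```python
import re

# Tiny illustrative stop-word list; a real run would use a full list.
STOP_WORDS = {"a", "an", "and", "the", "this", "to", "of", "in", "into",
              "is", "for", "on", "with"}


def naive_stem(word):
    """Rough suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Intro into Finance. This course"))  # ['intro', 'financ', 'cours']
```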

Text categorisation approach (Step 6)
Course (query) from another platform that needs to be categorised:
{ “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }
Extract and combine the ‘title’, ‘Summary’, ‘Description’
Preprocess data: remove all stop words and punctuation, convert all words to lowercase, stem all words

Text categorisation approach (Step 6)
Calculate TF-IDF and cosine similarity for all subcategories. The course is assigned to the subcategory with the highest similarity value.
Course(s): { “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }
→ TF-IDF and cosine similarity →
Subcategories (String): Finance, Marketing, …, Subcategory m

TF-IDF & Cosine Similarity
TF-IDF - term frequency-inverse document frequency
Term frequency - how frequently a term appears in a given document
Inverse document frequency - diminishes the weight of terms that appear very frequently in the corpus and increases the weight of terms that appear rarely
Cosine similarity - a measure of similarity between two vectors, given by the cosine of the angle between them
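The definitions above can be turned into a small pure-Python sketch. It assumes each subcategory has already been reduced to a list of preprocessed tokens, and it uses a smoothed IDF, which is only one common variant; the slides do not state the exact weighting used in the thesis. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {name: list of tokens}. Returns ({name: {term: weight}}, idf)."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {t: (c / len(tokens)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def query_vector(tokens, idf):
    """TF-IDF vector of a query course, ignoring terms unseen in the corpus."""
    tf = Counter(t for t in tokens if t in idf)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in tf.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def categorise(tokens, vectors, idf):
    """Return the subcategory whose vector is most similar to the query."""
    q = query_vector(tokens, idf)
    return max(vectors, key=lambda name: cosine(q, vectors[name]))
```

A course from another platform would be preprocessed in the same way as the subcategory texts and then passed to `categorise` as a token list.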

6. Approach evaluation

Approach evaluation
Coursera courses: { “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }
→ TF-IDF and cosine similarity approach →
New Category

Approach evaluation
Accuracy - the ratio of correctly categorised courses to the total number of courses
Gold standard accuracy = 2625 / 3032 ≈ 0.87
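As a quick check of the arithmetic, the accuracy measure can be written as a short function. The helper name `accuracy` and its list-based interface are assumptions for illustration; only the figures 2625 and 3032 come from the slide.

```python
def accuracy(predicted, actual):
    """Share of courses whose predicted category equals the original one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Figures from the slide: 2625 correctly categorised out of 3032 Coursera courses.
gold_standard_accuracy = 2625 / 3032
print(round(gold_standard_accuracy, 2))  # 0.87
```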

7. Evaluation of all platforms

Evaluation of all platforms (Udacity)
Intro. to Android (Category: Android) → TF-IDF and cosine similarity approach → Category: Computer Science
Good or bad outcome?

Evaluation of all platforms (Udacity)
[Figure: gold standard (Coursera) categories against Udacity categories]

Evaluation of all platforms (Udacity)
*The heat-map shows the percentage of courses categorised to each category, with darker colours indicating a greater percentage.
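The percentages behind such a heat-map could be computed along these lines. This sketch assumes the values are normalised per original (Udacity) category, which the slide does not state explicitly; `category_percentages` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def category_percentages(pairs):
    """pairs: iterable of (original_category, assigned_category).

    Returns {original: {assigned: percentage of that original category}}.
    """
    counts = defaultdict(Counter)
    for original, assigned in pairs:
        counts[original][assigned] += 1
    return {
        original: {a: 100.0 * n / sum(row.values()) for a, n in row.items()}
        for original, row in counts.items()
    }
```

Each inner dict corresponds to one row of the heat-map, and its values sum to 100.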

Grading schema

Evaluation of all platforms (Udacity)
Udacity evaluation table

Evaluation of all platforms (Udacity)
Udacity courses distribution (Pie Chart)

Evaluation of all platforms (Udacity)
Udacity courses distribution (Table)

Evaluation of all platforms (Edx)

8. Conclusion & Future work

Conclusion
1. Approx. 45,000 courses were scraped and indexed for IROM.
2. Using Coursera’s categories as the gold standard proved to be a good choice.
3. The TF-IDF and cosine similarity approach also gave positive results (≈ 0.87 accuracy on the gold standard).

Future work
1. Measure the quality of the scraped data
2. Better approach - machine learning (neural networks, etc.)
3. Evaluating text categorisation

Thank you.
