Exploratory Data Analysis Demo (Use Case: MOOC dropout prediction)
Feb 09, 2019
Naveen Kumar Kaveti, Data Scientist
Soumya Sulegai, Talent Acquisition Mgr
Sravya Garapati, Machine Learning Engineer
Priyanka A Giri, CW Talent Acquisition
Viswa Datha Polavarapu, Machine Learning Engineer

Agenda
Introduction to Intuit
Prerequisites
Problem Statement
Data Understanding
Feature Engineering
EDA (Exploratory Data Analysis)
Model Building
Demo Time
Challenge Time
Intuit Confidential and Proprietary

Our Mission

Our journey so far

Products that power prosperity
Our technology has helped us innovate four of our major products that are simplifying work of millions, worth millions.

Prerequisites
What is a distribution?
What are the properties of a distribution?
Mean
Variance
Skewness
Kurtosis
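The four properties listed above can be computed directly from a sample. The following is a minimal sketch, not code from the presentation: the function name `describe` and the sample values are hypothetical, and it assumes population (divide-by-n) moments and reports excess kurtosis.

```python
def describe(values):
    """Return mean, variance, skewness and excess kurtosis of a sample.

    Assumption: population (divide-by-n) moments are used throughout.
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5          # > 0 means a longer right tail
    kurtosis = m4 / m2 ** 2 - 3.0      # excess kurtosis: 0 for a normal distribution
    return {"mean": mean, "variance": m2, "skewness": skewness, "kurtosis": kurtosis}


# Hypothetical data with one large value stretching the right tail.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 10]
stats = describe(sample)
print(stats)
```

A positive skewness here reflects the single large value; a symmetric sample such as 1, 2, 3, 4, 5 gives a skewness of 0.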

Prerequisites
Correlations:
Pearson’s Correlation Coefficient - Measures the linear correlation between two variables X and Y
Spearman’s Rank Correlation Coefficient - Measures the monotonic relationship between two variables
Mutual Information - Measures the amount of information shared between two variables, that is, how much knowing one reduces uncertainty about the other
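The difference between linear and monotonic association is easiest to see on a small example. The sketch below is illustrative only: the helper names `pearson`, `ranks` and `spearman` and the data are hypothetical, Spearman is computed as Pearson on average ranks, and mutual information is not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]   # monotonic in x, but not linear
print(pearson(x, y), spearman(x, y))
```

Because `y` always increases with `x`, the Spearman coefficient is 1, while the Pearson coefficient stays below 1 because the relationship is not a straight line.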

Problem Statement
MOOC: Massive Open Online Courses
Dropped: 79%
Completed: 21%

Problem Statement
The Challenge: Competition participants need to predict whether a user will drop a course within the next 10 days based on his or her prior activities. If a user U leaves no records for course C in the log during the next 10 days, we define it as a dropout from course C.
But Why? Students' high dropout rate on MOOC platforms has been heavily criticized, and predicting their likelihood of dropout would be useful for maintaining and encouraging students' learning activities.
Reference: http://moocdata.cn/challenges/kdd-cup-2015
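The dropout definition above can be written as a small labeling rule. This is a sketch under stated assumptions, not the competition's labeling code: the function name `is_dropout`, the `cutoff` argument (the end of the observed period) and the example dates are all hypothetical.

```python
from datetime import datetime, timedelta


def is_dropout(event_times, cutoff, window_days=10):
    """True if the log holds no event in the `window_days` days after `cutoff`.

    `event_times` is assumed to hold the log timestamps of one user in one course.
    """
    window_end = cutoff + timedelta(days=window_days)
    return not any(cutoff < t <= window_end for t in event_times)


cutoff = datetime(2014, 6, 30)                                  # hypothetical date
active_log = [datetime(2014, 6, 20), datetime(2014, 7, 3)]      # event inside the window
silent_log = [datetime(2014, 6, 20), datetime(2014, 7, 15)]     # next event is too late

print(is_dropout(active_log, cutoff), is_dropout(silent_log, cutoff))
```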

Data Understanding - Course Level Information

Course Duration
❏ Course ID
❏ From
❏ To
Description: Each line contains the timespan of each course (both train and test data).

Module Information
❏ Course ID
❏ Module ID
❏ Category
❏ Children
❏ Start
Description: Each line in this file describes a module in a course with its category, children objects and release time.

Data Understanding - Enrollment Level Information

Student Database
❏ Enrollment ID
❏ User name
❏ Course ID
Description: Each line is a course enrollment record with an enrollment id, a username U and a course id C, indicating that U enrolled in course C.

Enrollment History
❏ Enrollment ID
❏ Time
❏ Source
❏ Event
❏ Object
Description: Each line is an action taken by a user within an enrollment.

Truth
❏ Enrollment ID
❏ Dropout
Description: Each line contains information about the ground truth of enrollments in the training set.

Data Understanding
Student Database (Enrollment ID, User name, Course ID) and Course Duration (Course ID, From, To): Left Join, Key: Course ID
Enrollment History (Enrollment ID, Time, Source, Event, Object) and Module Information (Course ID, Module ID, Category, Children, Start): Left Join, Left Key: Object, Right Key: Module ID
Feature Engineering: produces Student-Course Level Features (Enrollment ID, Features)
Student-Course Level Features (Enrollment ID, Features) and Truth (Enrollment ID, Dropout): Left Join, Key: Enrollment ID
Final (Enrollment ID, Dropout, Features)
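A left join keeps every row of the left table and attaches matching rows from the right table where the keys agree. The sketch below shows the idea on rows stored as dictionaries; it is not the presenters' code, and the helper `left_join`, the snake_case column names and the sample rows are hypothetical.

```python
def left_join(left_rows, right_rows, left_key, right_key):
    """Left join two lists of dicts; unmatched left rows are kept unchanged."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left_rows:
        matches = index.get(row[left_key])
        if not matches:
            joined.append(dict(row))      # no match: keep the left row only
            continue
        for match in matches:
            merged = dict(match)
            merged.update(row)            # left-hand values win on name clashes
            joined.append(merged)
    return joined


# Hypothetical rows shaped like the Student Database and Course Duration tables.
enrollments = [
    {"enrollment_id": 1, "username": "u1", "course_id": "c1"},
    {"enrollment_id": 2, "username": "u2", "course_id": "c9"},
]
course_dates = [{"course_id": "c1", "from": "2014-05-01", "to": "2014-05-30"}]

joined = left_join(enrollments, course_dates, "course_id", "course_id")
print(joined)
```

The same helper covers the join with different key names on each side, for example `left_join(log_rows, module_rows, "object", "module_id")`, where `log_rows` and `module_rows` are placeholder names for the Enrollment History and Module Information rows.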

Feature Engineering

User Level Features
❏ Number of courses enrolled
❏ Lifetime of the user

Course Level Features
❏ Number of users enrolled
❏ Dropout percentage

Enrollment Level Features
❏ Average delay between chapter complete times
❏ Average delay between chapter start times

Event features
❏ Event (Problem, Video and Discussion) counts
❏ Event (Problem, Video and Discussion) duration
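Several of these features are simple group counts. The following sketch computes three of them on toy data; it assumes each user enrolls at most once per course, and the variable names and rows are hypothetical rather than taken from the presenters' code.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (enrollment_id, username, course_id).
enrollments = [
    (1, "u1", "c1"),
    (2, "u1", "c2"),
    (3, "u2", "c1"),
    (4, "u3", "c1"),
]
# Hypothetical truth table: enrollment_id -> dropout flag (1 = dropped).
truth = {1: 1, 2: 0, 3: 1, 4: 0}

# User level: number of courses each user enrolled in.
courses_per_user = Counter(user for _, user, _ in enrollments)

# Course level: number of users enrolled in each course.
users_per_course = Counter(course for _, _, course in enrollments)

# Course level: percentage of a course's enrollments that ended in dropout.
dropouts_per_course = defaultdict(int)
for enrollment_id, _, course in enrollments:
    dropouts_per_course[course] += truth[enrollment_id]

dropout_pct = {
    course: 100.0 * dropouts_per_course[course] / count
    for course, count in users_per_course.items()
}
print(courses_per_user, users_per_course, dropout_pct)
```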

EDA (Exploratory Data Analysis)
Make a Hypothesis
Test a Hypothesis

Testing of Hypothesis (Two Sample t-test)
Step 1: Make a hypothesis about the population.
Null Hypothesis: The means of the two populations are equal (μ1 = μ2)
Alternative Hypothesis (negation of the null hypothesis): The means of the two populations are not equal (μ1 ≠ μ2)
Step 2: Test the hypothesis about the population using the available data
Step 3: Compute the p-value based on the t-statistic (two-sided: the area in both tails, beyond -t and +t)
Step 4: Compare the p-value with the assumed level of significance (say, 0.05). Reject the null hypothesis if the p-value is less than 0.05; fail to reject it if the p-value is greater than 0.05
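The four steps can be sketched in a few lines. The slides do not say which variant of the two-sample t-test was used, so this example assumes Welch's version (no equal-variance assumption), and it approximates the p-value with a normal distribution, which is reasonable only for large samples. The function name `welch_t_test` and the data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(x, y):
    """Two-sample t-test without assuming equal variances (Welch's test).

    Returns (t, degrees_of_freedom, approximate two-sided p-value).
    Assumption: the p-value uses a normal approximation to the t distribution.
    """
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny    # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    p = 2 * (1 - NormalDist().cdf(abs(t)))         # area in both tails
    return t, df, p


# Hypothetical event counts for two groups of enrollments.
group_x = [2, 3, 4, 3, 2, 4, 3, 3]
group_y = [8, 9, 10, 9, 8, 10, 9, 9]

t, df, p = welch_t_test(group_x, group_y)
alpha = 0.05
print(t, df, p, "reject H0" if p < alpha else "fail to reject H0")
```

For small samples, the exact p-value should come from the t distribution with `df` degrees of freedom rather than from the normal approximation used here.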

EDA (Exploratory Data Analysis)
Hypothesis: Does the lifetime of a user impact the user’s willingness to complete the course?

EDA (Exploratory Data Analysis)
Hypothesis: Does the number of courses a user is enrolled in impact the user’s willingness to complete the course?

EDA (Exploratory Data Analysis)
Hypothesis: Do event (problem/video/discussion) counts impact the user’s willingness to complete the course?
t = -43.033; p-value < 2.2e-16; Mean of x = 3.46; Mean of y = 18.78
t = -31.896; p-value < 2.2e-16; Mean of x = 4.93; Mean of y = 33
t = -14.87; p-value < 2.2e-16; Mean of x = 2.07; Mean of y = 18.14
Conclusion (all three tests): The difference in means is not equal to 0

Bagging Vs Boosting
Bagging (Parallel)
Boosting (Sequential)
Reference: GIS-based mineral prospectivity mapping using machine learning methods: A case study from Tongling ore district, eastern China

Gradient Boost Machine
Reference: https://dimensionless.in/gradient-boosting/

Metrics to Validate Classification Model
Confusion Matrix (Reference: Packtpub.com)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2*P*R / (P + R)
Accuracy: Proportion of correct classifications.
Precision: Proportion of positive predictions that are actually correct. It’s a good metric to validate if the cost of false positives is very high.
Recall: Proportion of actual positives that are correctly predicted, out of all positive predictions that could have been made. It’s a good metric to validate if the cost of false negatives is very high.
F1 Score: Harmonic mean of precision and recall; balances between the two.
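The formulas above translate directly into code. This is a small sketch with hypothetical counts; the function name `classification_metrics` is illustrative, and it assumes at least one predicted positive and one actual positive so that no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not taken from the slides.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
print(metrics)
```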

AUC-ROC and AUC-PR
AUC-ROC: Area under the ROC curve (TPR plotted against FPR)
AUC-PR: Area under the precision-recall curve
Recall/TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
Reference: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/
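An ROC curve is obtained by lowering the decision threshold step by step and recording FPR and TPR at each step; AUC-ROC is the area under those points. The sketch below does this on toy data with the trapezoidal rule. The function names `roc_points` and `auc` and the labels and scores are hypothetical, and the code assumes both classes are present.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold one distinct score at a time."""
    positives = sum(labels)
    negatives = len(labels) - positives
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a curve given as (x, y) points, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = positive class) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc_auc = auc(roc_points(labels, scores))
print(roc_auc)
```

The result equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, which is 8 out of 9 pairs for this toy data.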

Model Building
Trained Model: Gradient Boost Machine (GBM)

Train Metrics
Number of enrollments in train: 72,395
Confusion Matrix for F1-optimal threshold:
7,968 7,061
1,923 55,443
87.6% (overall accuracy)

Test Metrics
Number of enrollments in test: 24,013
Confusion Matrix for F1-optimal threshold:
2,411 692
2,491 18,419
86.7% (overall accuracy)

AUC-ROC: 0.85 AUC-PR: 0.94 AUC-ROC: 0.87 AUC-PR: 0.95
Max F1: 0.92 Threshold: 0.47
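An F1-optimal threshold can be found by trying each candidate threshold and keeping the one with the highest F1. The sketch below illustrates the idea on toy data; it is not the procedure used to obtain the numbers on this slide, and the function names, labels and scores are hypothetical.

```python
def f1_at(labels, scores, threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn           # F1 = 2*TP / (2*TP + FP + FN), same as 2PR / (P + R)
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(labels, scores):
    """Return (best F1, threshold), trying every distinct score as a threshold."""
    candidates = sorted(set(scores))
    return max((f1_at(labels, scores, t), t) for t in candidates)


# Hypothetical labels (1 = positive class) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

best_f1, best_threshold = best_f1_threshold(labels, scores)
print(best_f1, best_threshold)
```

When two thresholds tie on F1, this sketch keeps the larger one, which is an arbitrary choice.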

References
1. KDD Cup 2015 Challenge
2. Code
Try this out: Will Bill Solve it?
