Classifying the Terms of Service
Capstone Presentation | Sam Beardsworth
Goal
Build a model to make Terms of Service easier to read.
How?
• Identify the content
• Extract the meaning
• Highlight important terms
Approach
No shortage of data: it's literally on every website. But how to make sense of it?
Answer: use a pre-classified dataset (courtesy of ToS;DR).
ToS;DR
• Started in June 2012
• Aims to review and score the Terms of Service policies of major web services
• Users can look up terms through the website / browser extension
• Public, transparent, community-driven
• Volunteer project
Data Gathering
API: broken, but the same info was obtainable via public repos.
Additional challenges:
- ToS;DR has had 3 incarnations
- The API only has good data for incarnation #2
- Scrape all 3 and merge by ID (sketch below)
Some manual cleaning was needed.
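The merge itself is straightforward once each incarnation's scrape sits in a DataFrame. A minimal sketch, assuming a shared point 'id' column and that later incarnations win on conflicts; the toy frames below are illustrative stand-ins, not the real scrapes:

```python
import pandas as pd

# Hypothetical stand-ins for the three scraped incarnations; the real
# frames came from the ToS;DR public repos and share a point 'id' column
v1 = pd.DataFrame({"id": [1720, 1311], "quote": ["...", "..."]})
v2 = pd.DataFrame({"id": [1311, 2261], "quote": ["...", "..."]})
v3 = pd.DataFrame({"id": [2261], "quote": ["..."]})

# Stack all three scrapes in incarnation order, then keep one row per ID,
# preferring the most recent record for each point
merged = (pd.concat([v1, v2, v3], ignore_index=True)
            .drop_duplicates("id", keep="last"))
```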
Dataset
• 1688 observations (extracts)
• Mean length: 65 words; max length: 1410 words!
• 107,340 words total / 6,469 unique
• 17 columns; discarded 9 as purely administrative
Dataset

| ID   | Status   | Service  | Source        | Quote                                          | Topic          | Case                               | Point |
|------|----------|----------|---------------|------------------------------------------------|----------------|------------------------------------|-------|
| 1720 | pending  | facebook | Cookie Policy | 'We use cookies to help us show ads...'        | Tracking       | Personal data used for advertising | bad   |
| 1311 | approved | nokia    | T&C           | 'Except as set forth in the Privacy Policy...' | Content        | Service retains deleted content    | bad   |
| 2261 | approved | whatsapp | NA            | 'When you delete your WhatsApp account...'     | Right to leave | Data deleted after account closure | good  |

Unique values: Service 179, Topic 22, Case 143, Point 4
Dataset: Filling the gaps
EDA
Lemmatization
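The deck does not name the lemmatization library, so this is one plausible implementation using spaCy; the model name and the alphabetic-token filter are assumptions:

```python
# Hedged lemmatization sketch (library choice is an assumption)
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text: str) -> list[str]:
    # Keep alphabetic tokens only and reduce each to its lowercase lemma,
    # e.g. 'cookies' -> 'cookie', 'ads' -> 'ad'
    return [tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha]

tokens = lemmatize("We use cookies to help us show ads")
```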
Topic Exploration
• 22 topics, imbalanced
• Dropped topics with fewer than 25 observations (filter sketched below)
• Remember to balance during classification
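A minimal sketch of the rare-topic filter; the DataFrame is a toy stand-in for the real 1688-row extract table, and only the <25 threshold comes from the slide:

```python
import pandas as pd

# Toy stand-in for the cleaned extract table (real data: 1688 rows, 22 topics)
df = pd.DataFrame({
    "topic": ["Tracking"] * 30 + ["Content"] * 30 + ["Rare"] * 3,
    "quote": (["we use cookies to help us show ads"] * 30
              + ["the service retains deleted content"] * 30
              + ["miscellaneous"] * 3),
})

# Drop any topic with fewer than 25 extracts (threshold from the slide)
counts = df["topic"].value_counts()
df = df[df["topic"].isin(counts[counts >= 25].index)]
```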
Modelling
• 19 topics
• Baseline accuracy: 0.117
• 70-30 train-test split, stratified by topic
• Basic, untuned logistic regression
• Test accuracy: 0.615
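A sketch of the baseline, assuming a simple bag-of-words representation (the deck only specifies an untuned logistic regression and a stratified 70-30 split); `df` is the filtered extract table from above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stratified 70-30 split, as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    df["quote"], df["topic"], test_size=0.3,
    stratify=df["topic"], random_state=42)

# Untuned logistic regression over raw token counts (vectorizer is an assumption)
baseline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
baseline.score(X_test, y_test)  # the deck reports 0.615 on the real data
```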
Improving the Score
• TF-IDF to reduce the feature importance of common words
• imblearn's RandomOverSampler to reduce class imbalance in the training set
• GridSearchCV for optimal logistic regression hyperparameters
Improved test accuracy: 0.641
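The three improvements compose naturally into an imblearn pipeline (imblearn's Pipeline, unlike sklearn's, accepts a resampling step). The component choices match the slide; the parameter grid itself is an assumption:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # down-weight common words
    ("ros", RandomOverSampler(random_state=42)),  # balance the training classes
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; the actual search space is not given in the deck
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)  # the deck reports 0.641 on the real data
```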
Beyond Logistic Regression
The sklearn 'try everything' approach... each candidate optimised with GridSearchCV.
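One plausible shape for that loop; the candidate list is an assumption based on models named elsewhere in the deck, with the per-model GridSearch step omitted for brevity:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "knn": KNeighborsClassifier(),
    "linear_svc": LinearSVC(),
    "tree": DecisionTreeClassifier(),
}

# Fit each candidate in the same TF-IDF + oversampling pipeline and compare
for name, clf in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()),
                     ("ros", RandomOverSampler(random_state=42)),
                     ("clf", clf)])
    pipe.fit(X_train, y_train)
    print(f"{name}: {pipe.score(X_test, y_test):.3f}")
```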
Model Comparison
Alternative Models
word2vec
- 3.5 GB dictionary pre-trained on news articles
- Applied to the pre-lemmatized tokens (corpus)
- Performed differently, but more poorly overall; accuracy score: 0.613
Principal Component Analysis / SVD
- Explained variance relatively low: 19% across PC1-2, 37% across PC1-10
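The 3.5 GB news-trained dictionary is presumably the GoogleNews word2vec vectors; treating that as an assumption, a common way to turn them into document features is to average the in-vocabulary token vectors:

```python
import numpy as np
from gensim.models import KeyedVectors

# File name is an assumption based on the '3.5 GB, news articles' description
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def doc_vector(tokens):
    # Average the vectors of in-vocabulary lemmas; zero vector if none match
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```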
Alternative Models
Latent Dirichlet Allocation (LDA): "a technique to extract the hidden topics from large volumes of text... The challenge is how to extract good quality of topics that are clear, segregated and meaningful."
Some themes:
- Consistently identified 'virtual currency' as a topic
- Change and modification
- Damage and waiver
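A minimal LDA sketch for surfacing those themes; 19 components mirrors the retained topic count, but the library and settings actually used are not stated in the deck:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(max_df=0.9, min_df=5)   # illustrative vocabulary filter
dtm = vec.fit_transform(df["quote"])

lda = LatentDirichletAllocation(n_components=19, random_state=42).fit(dtm)

# Print the top terms per unsupervised topic for manual theme-spotting
terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = [terms[j] for j in comp.argsort()[-5:][::-1]]
    print(f"topic {i}: {top}")
```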
LDA
Heatmap comparing unsupervised sorting into 19 topics versus human-classified topics
Where from here?
Quiz
"You agree to provide Grammarly with accurate and complete registration information and to promptly notify Grammarly in the event of any changes to any such information."
Which label came from the human annotator, and which from the model?
Anonymity & Tracking: ???
Personal Data: ???
Quiz: answer
Anonymity & Tracking: Human
Personal Data: Model
Quiz
"Nothing here should be considered legal advice. We express our opinion with no guarantee and we do not endorse any service in any way. Please refer to a qualified attorney for legal advice."
Governance: ???
Guarantee: ???
Quiz: answer
Governance: Model
Guarantee: Human
Quiz
"For revisions to this Privacy Policy that may be materially less restrictive on our use or disclosure of personal information you have provided to us, we will make reasonable efforts to notify you and obtain your consent before implementing revisions with respect to such information."
Personal Data: ???
Changes to Terms: ???
Quiz: answer
Personal Data: Model
Changes to Terms: Human
Practical Application
Unfavourable terms, or: classifying into good and bad
Extract Review
Model Performance
Same approach as before.
Best performer:
- K-Nearest Neighbours: 0.71
What if we focus solely on unfavourable terms?
Predicting Unfavourable Terms
• Do people really care about good or neutral statements?
• The real value is in highlighting potentially unfavourable terms
Reclassify (sketch below):
- Good + Neutral = Neutral
- Bad = Warning
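The relabelling is a one-line map; the column and label names below are assumptions, and since the dataset slide shows four unique point values the map may need extending on the real data:

```python
# Illustrative 'point' values; the real column comes from ToS;DR
df["point"] = ["good", "bad", "neutral"] * 20

# Collapse to two classes: good/neutral -> neutral, bad -> warning
df["label"] = df["point"].map(
    {"good": "neutral", "neutral": "neutral", "bad": "warning"})
```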
Binary Classification
Improved performance. Best performers:
- K-Nearest Neighbours: 0.75
- LinearSVC: 0.76
Additional benefit: the model can be tuned to correctly predict more warning statements, at the expense of more 'false' warnings (sketch below).
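A sketch of that tuning using LinearSVC's signed decision scores; the -0.2 cut-off is purely illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(
    df["quote"], df["label"], test_size=0.3,
    stratify=df["label"], random_state=42)

svc = make_pipeline(TfidfVectorizer(), LinearSVC())
svc.fit(Xb_tr, yb_tr)

# decision_function gives the signed distance to the hyperplane; positive
# means 'warning' (classes sort alphabetically: ['neutral', 'warning']).
# Lowering the cut-off below zero flags more extracts as warnings, at the
# cost of more false alarms.
scores = svc.decision_function(Xb_te)
pred = np.where(scores > -0.2, "warning", "neutral")
```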
Evaluation / Next Steps
Three areas for next steps:
1. Build a proof of concept for an end-user classification tool
2. Improve the model
3. Bring in subject matter expertise
Questions?