Classifying the Terms of Service
Capstone Presentation | Sam Beardsworth
Goal
Build a model to make Terms of Service easier to read.
How?
• Identify the content
• Extract the meaning
• Highlight important terms
Approach
No shortage of data: it's literally on every website. But how to make sense of it?
Answer: use a pre-classified dataset (courtesy of ToS;DR).
ToS;DR
• Started in June 2012
• Aims to review and score the Terms of Service policies of major web services
• Users can look up terms through the website / browser extension
• Public, transparent, community-driven
• Volunteer project
Data Gathering
API: broken, but the same info was obtainable via public repos.
Additional challenges:
- ToS;DR has had 3 incarnations
- The API only has good data for incarnation #2
- Scrape all 3 and merge by ID (sketch below)
Some manual cleaning was needed.
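The merge itself is straightforward once each incarnation's scrape sits in a DataFrame. A minimal sketch, assuming a shared point 'id' column and that later incarnations win on conflicts; the toy frames below are illustrative stand-ins, not the real scrapes:

```python
import pandas as pd

# Hypothetical stand-ins for the three scraped incarnations; the real
# frames came from the ToS;DR public repos and share a point 'id' column
v1 = pd.DataFrame({"id": [1720, 1311], "quote": ["...", "..."]})
v2 = pd.DataFrame({"id": [1311, 2261], "quote": ["...", "..."]})
v3 = pd.DataFrame({"id": [2261], "quote": ["..."]})

# Stack all three scrapes in incarnation order, then keep one row per ID,
# preferring the most recent record for each point
merged = (pd.concat([v1, v2, v3], ignore_index=True)
            .drop_duplicates("id", keep="last"))
```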
Dataset
• 1688 observations (extracts)
• Mean length: 65 words; max length: 1410 words!
• 107,340 words total / 6,469 unique
• 17 columns; discarded 9 as purely administrative
Dataset

| ID   | Status   | Service  | Source        | Quote                                          | Topic          | Case                               | Point |
|------|----------|----------|---------------|------------------------------------------------|----------------|------------------------------------|-------|
| 1720 | pending  | facebook | Cookie Policy | 'We use cookies to help us show ads...'        | Tracking       | Personal data used for advertising | bad   |
| 1311 | approved | nokia    | T&C           | 'Except as set forth in the Privacy Policy...' | Content        | Service retains deleted content    | bad   |
| 2261 | approved | whatsapp | NA            | 'When you delete your WhatsApp account...'     | Right to leave | Data deleted after account closure | good  |

Unique values: Service 179, Topic 22, Case 143, Point 4
Dataset: Filling the gaps
EDA
Lemmatization
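The deck does not name the lemmatization library, so this is one plausible implementation using spaCy; the model name and the alphabetic-token filter are assumptions:

```python
# Hedged lemmatization sketch (library choice is an assumption)
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text: str) -> list[str]:
    # Keep alphabetic tokens only and reduce each to its lowercase lemma,
    # e.g. 'cookies' -> 'cookie', 'ads' -> 'ad'
    return [tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha]

tokens = lemmatize("We use cookies to help us show ads")
```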
Topic Exploration
• 22 topics, imbalanced
• Dropped topics with fewer than 25 observations (filter sketched below)
• Remember to balance during classification
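A minimal sketch of the rare-topic filter; the DataFrame is a toy stand-in for the real 1688-row extract table, and only the <25 threshold comes from the slide:

```python
import pandas as pd

# Toy stand-in for the cleaned extract table (real data: 1688 rows, 22 topics)
df = pd.DataFrame({
    "topic": ["Tracking"] * 30 + ["Content"] * 30 + ["Rare"] * 3,
    "quote": (["we use cookies to help us show ads"] * 30
              + ["the service retains deleted content"] * 30
              + ["miscellaneous"] * 3),
})

# Drop any topic with fewer than 25 extracts (threshold from the slide)
counts = df["topic"].value_counts()
df = df[df["topic"].isin(counts[counts >= 25].index)]
```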
Modelling
• 19 topics
• Baseline accuracy: 0.117
• 70-30 train-test split, stratified by topic
• Basic, untuned logistic regression
• Test accuracy: 0.615
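A sketch of the baseline, assuming a simple bag-of-words representation (the deck only specifies an untuned logistic regression and a stratified 70-30 split); `df` is the filtered extract table from above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stratified 70-30 split, as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    df["quote"], df["topic"], test_size=0.3,
    stratify=df["topic"], random_state=42)

# Untuned logistic regression over raw token counts (vectorizer is an assumption)
baseline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
baseline.score(X_test, y_test)  # the deck reports 0.615 on the real data
```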
Improving the Score
• TF-IDF to reduce the feature importance of common words
• imblearn's RandomOverSampler to reduce class imbalance in the training set
• GridSearchCV for optimal logistic regression hyperparameters
Improved test accuracy: 0.641
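The three improvements compose naturally into an imblearn pipeline (imblearn's Pipeline, unlike sklearn's, accepts a resampling step). The component choices match the slide; the parameter grid itself is an assumption:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # down-weight common words
    ("ros", RandomOverSampler(random_state=42)),  # balance the training classes
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; the actual search space is not given in the deck
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)  # the deck reports 0.641 on the real data
```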
Beyond Logistic Regression
The sklearn 'try everything' approach... each candidate optimised with GridSearchCV.
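One plausible shape for that loop; the candidate list is an assumption based on models named elsewhere in the deck, with the per-model GridSearch step omitted for brevity:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "knn": KNeighborsClassifier(),
    "linear_svc": LinearSVC(),
    "tree": DecisionTreeClassifier(),
}

# Fit each candidate in the same TF-IDF + oversampling pipeline and compare
for name, clf in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()),
                     ("ros", RandomOverSampler(random_state=42)),
                     ("clf", clf)])
    pipe.fit(X_train, y_train)
    print(f"{name}: {pipe.score(X_test, y_test):.3f}")
```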
Model Comparison
Alternative Models
word2vec
- 3.5 GB dictionary pre-trained on news articles
- Applied to the pre-lemmatized tokens (corpus)
- Performed differently, but more poorly overall; accuracy score: 0.613
Principal Component Analysis / SVD
- Explained variance relatively low: 19% across PC1-2, 37% across PC1-10
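The 3.5 GB news-trained dictionary is presumably the GoogleNews word2vec vectors; treating that as an assumption, a common way to turn them into document features is to average the in-vocabulary token vectors:

```python
import numpy as np
from gensim.models import KeyedVectors

# File name is an assumption based on the '3.5 GB, news articles' description
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def doc_vector(tokens):
    # Average the vectors of in-vocabulary lemmas; zero vector if none match
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```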
Alternative Models
Latent Dirichlet Allocation (LDA): "a technique to extract the hidden topics from large volumes of text... The challenge is how to extract good quality of topics that are clear, segregated and meaningful."
Some themes:
- Consistently identified 'virtual currency' as a topic
- Change and modification
- Damage and waiver
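A minimal LDA sketch for surfacing those themes; 19 components mirrors the retained topic count, but the library and settings actually used are not stated in the deck:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(max_df=0.9, min_df=5)   # illustrative vocabulary filter
dtm = vec.fit_transform(df["quote"])

lda = LatentDirichletAllocation(n_components=19, random_state=42).fit(dtm)

# Print the top terms per unsupervised topic for manual theme-spotting
terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = [terms[j] for j in comp.argsort()[-5:][::-1]]
    print(f"topic {i}: {top}")
```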
LDA
Heatmap comparing unsupervised sorting into 19 topics versus human-classified topics
Where from here?
Quiz
"You agree to provide Grammarly with accurate and complete registration information and to promptly notify Grammarly in the event of any changes to any such information."
Which label came from the human annotator, and which from the model?
Anonymity & Tracking: ???
Personal Data: ???
Quiz: answer
Anonymity & Tracking: Human
Personal Data: Model
Quiz
"Nothing here should be considered legal advice. We express our opinion with no guarantee and we do not endorse any service in any way. Please refer to a qualified attorney for legal advice."
Governance: ???
Guarantee: ???
Quiz: answer
Governance: Model
Guarantee: Human
Quiz
"For revisions to this Privacy Policy that may be materially less restrictive on our use or disclosure of personal information you have provided to us, we will make reasonable efforts to notify you and obtain your consent before implementing revisions with respect to such information."
Personal Data: ???
Changes to Terms: ???
Quiz: answer
Personal Data: Model
Changes to Terms: Human
Practical Application
Unfavourable terms, or: classifying into good and bad
Extract Review
Model Performance
Same approach as before.
Best performer:
- K-Nearest Neighbours: 0.71
What if we focus solely on unfavourable terms?
Predicting Unfavourable Terms
• Do people really care about good or neutral statements?
• The real value is in highlighting potentially unfavourable terms
Reclassify (sketch below):
- Good + Neutral = Neutral
- Bad = Warning
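The relabelling is a one-line map; the column and label names below are assumptions, and since the dataset slide shows four unique point values the map may need extending on the real data:

```python
# Illustrative 'point' values; the real column comes from ToS;DR
df["point"] = ["good", "bad", "neutral"] * 20

# Collapse to two classes: good/neutral -> neutral, bad -> warning
df["label"] = df["point"].map(
    {"good": "neutral", "neutral": "neutral", "bad": "warning"})
```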
Binary Classification
Improved performance. Best performers:
- K-Nearest Neighbours: 0.75
- LinearSVC: 0.76
Additional benefit: the model can be tuned to correctly predict more warning statements, at the expense of more 'false' warnings (sketch below).
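A sketch of that tuning using LinearSVC's signed decision scores; the -0.2 cut-off is purely illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(
    df["quote"], df["label"], test_size=0.3,
    stratify=df["label"], random_state=42)

svc = make_pipeline(TfidfVectorizer(), LinearSVC())
svc.fit(Xb_tr, yb_tr)

# decision_function gives the signed distance to the hyperplane; positive
# means 'warning' (classes sort alphabetically: ['neutral', 'warning']).
# Lowering the cut-off below zero flags more extracts as warnings, at the
# cost of more false alarms.
scores = svc.decision_function(Xb_te)
pred = np.where(scores > -0.2, "warning", "neutral")
```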
Evaluation / Next Steps
Three areas for next steps:
1. Build a proof of concept for an end-user classification tool
2. Improve the model
3. Bring in subject matter expertise
Questions?