Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services
Pelayo Vallina, Victor Le Pochat, Álvaro Feal, Marius Paraschiv, Julien Gamba, Tim Burke, Oliver Hohlfeld, Juan Tapiador, Narseo Vallina-Rodriguez
Domain classification: web directories (manually edited) and classification engines (automated) [Yan04, Qi09, Bru20]
Why does the quality of these services matter?
› End users: incorrect categories affect reliability, e.g., over- or underblocking in content filtering
› Academia: domain samples or results depend on them (24 papers at top conferences in 2019); lack of trust → resort to manual classification [Res04, Ric02, Sch18, LeP19, Ahm20, Zeb20]
Services are opaque on how they operate: Validation? Training set? Comprehensiveness?
Outline
› Methodology
› Empirical validation
› Deep dive: human labeling & case studies
› Discussion
› Conclusion
Methodology
We characterize services along five dimensions: Inputs, Outputs, Purpose, Updates, Access
› Some services are aggregators that combine other services
› Purpose: content filtering, threat assessment, marketing, discovery
› Label assignment ranges from (mostly) automated to manual
Empirical validation
Label gathering (Sept 1-30, 2019)
› Input: Top 1M Tranco domains → 4.4M domains
› Direct queries: 4.4M domains; ↓ 10k for rate-limited services (sketch below)
› Aggregate queries: 4.4M domains
› 279k categories (904k domains)
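As a minimal sketch of this kind of label gathering, assuming a hypothetical HTTP categorization endpoint: CATEGORY_API, API_KEY, the "categories" response field, and the JSONL output format are illustrative and not any specific service's API.

```python
# Sketch: query a (hypothetical) categorization API for a list of domains.
# CATEGORY_API, API_KEY and the JSON response layout are illustrative only.
import csv
import json
import time
import requests

CATEGORY_API = "https://categorization.example.com/v1/domain"  # hypothetical
API_KEY = "..."  # per-service credential
RATE_LIMIT_SLEEP = 1.0  # seconds between queries for rate-limited services

def load_tranco(path, limit=1_000_000):
    """Read domains from a Tranco CSV (rank,domain)."""
    with open(path, newline="") as fh:
        return [row[1] for _, row in zip(range(limit), csv.reader(fh))]

def query_service(domain):
    """Ask the service for this domain's categories; None if unlabeled."""
    resp = requests.get(CATEGORY_API,
                        params={"domain": domain, "apikey": API_KEY},
                        timeout=10)
    resp.raise_for_status()
    cats = resp.json().get("categories", [])
    return cats or None

def gather_labels(domains, out_path):
    """Store one JSON record per queried domain, including unlabeled ones."""
    with open(out_path, "w") as out:
        for domain in domains:
            try:
                cats = query_service(domain)
            except requests.RequestException:
                cats = None  # record failures as "no label"
            out.write(json.dumps({"domain": domain, "categories": cats}) + "\n")
            time.sleep(RATE_LIMIT_SLEEP)

if __name__ == "__main__":
    gather_labels(load_tranco("tranco.csv", limit=10_000), "labels.jsonl")
```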
Service choice affects which domains are labeled
› Coverage ranges from <1% to 94%; better for automated classification services (Updates) (sketch below)
› Popular domains have better coverage
› Subdomain coverage ranges from <1% to 99% (Inputs)
› Inconsistent when sourced directly or through VirusTotal (Access)
[Figure: per-service coverage (%) over the 4.4M and 10k domain sets]
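A sketch of how per-service coverage could be computed from the collected labels, assuming one JSONL file per service in the format produced above (file names and fields are illustrative):

```python
# Sketch: coverage = fraction of queried domains for which a service returns a label.
import json

def coverage(labels_path, queried_domains):
    labeled = set()
    with open(labels_path) as fh:
        for line in fh:
            rec = json.loads(line)
            if rec.get("categories"):
                labeled.add(rec["domain"])
    return len(labeled & set(queried_domains)) / len(queried_domains)

# Usage (illustrative service names and files):
# for service in ["serviceA", "serviceB"]:
#     print(service, coverage(f"{service}.jsonl", queried_domains))
```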
Service choice affects the taxonomy granularity (Purpose)
› Security/content filtering: fewer categories (as low as 12), easier setup
› Marketing: more categories (up to 7.5k), fine-grained targeting
Service choice affects label interpretation
› Inconsistencies between documented and observed labels (Access)
› Multiple labels are uncommon (Outputs)
› Subdomains inherit labels from their parent (Inputs)
› 3 out of 9 services updated labels, mostly for maliciousness (Updates)
Service choice affects label distribution
› Disagreement on the distribution of labels over domains, as measured through mutual information (Purpose, Updates) (sketch below)
› Uneven distribution of labels over domains, as measured through label frequency (Purpose)
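A sketch of how agreement between two services' label assignments could be quantified with normalized mutual information, assuming each service's labels are available as a domain → category dict (variable names and toy data are illustrative):

```python
# Sketch: normalized mutual information between two services' labelings,
# computed over the domains that both services label.
from sklearn.metrics import normalized_mutual_info_score

def label_agreement(labels_a, labels_b):
    """labels_a, labels_b: dict mapping domain -> category string."""
    common = sorted(set(labels_a) & set(labels_b))
    if not common:
        return None
    a = [labels_a[d] for d in common]
    b = [labels_b[d] for d in common]
    return normalized_mutual_info_score(a, b)

# Example with toy data:
# label_agreement({"a.com": "news", "b.com": "shopping"},
#                 {"a.com": "media", "b.com": "e-commerce"})
```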
Deep dive: human labeling & case studies
Dynamics of human labeling may trigger biases: participation is concentrated (sketch below)
› at the beginning of the project → outdated labels?
› among few users → lack of peer review?
› on unlabeled domains → stale labels?
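A sketch of how such concentration could be measured from an edit log, assuming records with a numeric (epoch) timestamp, an editor name, and a domain; the field names and log format are assumptions:

```python
# Sketch: how concentrated is labeling activity in time and across editors?
import json
from collections import Counter

def concentration(edit_log_path, early_fraction=0.1, top_k=10):
    with open(edit_log_path) as fh:
        edits = [json.loads(line) for line in fh]
    edits.sort(key=lambda e: e["timestamp"])
    n = len(edits)
    # Share of all edits made in the earliest `early_fraction` of the project's lifetime.
    t0, t1 = edits[0]["timestamp"], edits[-1]["timestamp"]
    cutoff = t0 + early_fraction * (t1 - t0)
    early_share = sum(e["timestamp"] <= cutoff for e in edits) / n
    # Share of all edits made by the `top_k` most active editors.
    by_user = Counter(e["editor"] for e in edits)
    top_share = sum(c for _, c in by_user.most_common(top_k)) / n
    return early_share, top_share
```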
Disagreement in human labeling may trigger biases
› Label assignment is not completely objective
› Empirically: clusters of correlated labels
› Experimentally: 35.5% disagreement among authors, 71% match the community label (sketch below)
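A sketch of how such author/community agreement numbers could be computed, assuming per-domain labels from several authors plus the community label; the data layout is an assumption:

```python
# Sketch: pairwise disagreement among authors, and how often author labels
# match the community-assigned label.
from itertools import combinations

def disagreement_stats(annotations):
    """annotations: list of {"community": str, "authors": [str, ...]} per domain."""
    pair_total = pair_diff = 0
    author_total = author_match = 0
    for rec in annotations:
        for a, b in combinations(rec["authors"], 2):
            pair_total += 1
            pair_diff += (a != b)
        for a in rec["authors"]:
            author_total += 1
            author_match += (a == rec["community"])
    return pair_diff / pair_total, author_match / author_total
```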
We analyze services on specialized use cases
› Intended usage → requirements → data source selection
› Service selection → characteristics → coverage/accuracy
› Estimate suitability for three case studies:
  1. Obtain a manually curated list as "ground truth"
  2. Analyze coverage across domains
  3. Analyze appropriateness of labels
Behavior differs widely for specialized use cases (sketch below)
› Advertising and tracking
  Curated list: EasyList/EasyPrivacy
  Finding: few services label the domains at all, let alone as trackers
› Adult content
  Curated list: [Val19] and gambling regulators
  Finding: 5 services label correctly, 3 others hardly label any
› CDNs/hosting providers
  Curated list: signatures from WebPageTest
  Finding: confusion between service function and content
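A sketch of the coverage part of such a case study, assuming the curated list has already been reduced to a set of domains and each service's labels are available as a domain → categories dict; the tracking-related category names are illustrative:

```python
# Sketch: for a curated "ground truth" list (e.g. domains extracted from
# EasyList/EasyPrivacy), check how many domains each service labels at all,
# and how many receive a tracking-related label.
TRACKING_CATEGORIES = {"trackers", "web analytics", "advertising"}  # illustrative

def case_study(curated_domains, service_labels):
    """service_labels: dict mapping domain -> set of category strings."""
    labeled = [d for d in curated_domains if service_labels.get(d)]
    as_tracker = [d for d in labeled
                  if service_labels[d] & TRACKING_CATEGORIES]
    n = len(curated_domains)
    return len(labeled) / n, len(as_tracker) / n
```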
Discussion
Recommendations
› We avoid recommending a specific service: the "best" service depends on use case and requirements, and we cannot measure semantic agreement or correctness
› Our recommendations address best practices [Seb16, Lee13, Wei19]
Recommendations
› Coverage and accuracy may be insufficient: very service- and use case-dependent; consider the impact of errors
› Purpose and updates may introduce biases: consult documentation for taxonomy and label sources ... but verify (and report) manually, as inconsistencies exist
› Taxonomies differ in size, scope and semantics: sound aggregation is not obvious (sketch below)
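One reason sound aggregation is not obvious is that every service uses its own category names and granularity, so any aggregation needs an explicit, manually curated mapping onto a shared coarse taxonomy. A minimal sketch of that idea; the mapping table, service names, and category names are purely illustrative:

```python
# Sketch: map service-specific labels onto a shared coarse taxonomy before
# comparing or combining services. The mapping itself must be curated by hand.
SHARED_TAXONOMY = {
    ("serviceA", "Web Ads/Analytics"): "advertising",
    ("serviceA", "Pornography"): "adult",
    ("serviceB", "advertisements"): "advertising",
    ("serviceB", "adult themes"): "adult",
}  # illustrative entries only

def aggregate(per_service_labels):
    """per_service_labels: dict mapping service name -> set of raw labels for one domain."""
    mapped = set()
    for service, labels in per_service_labels.items():
        for label in labels:
            coarse = SHARED_TAXONOMY.get((service, label))
            if coarse is None:
                continue  # unmapped labels are dropped, which already loses information
            mapped.add(coarse)
    return mapped
```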
Conclusion
Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services
Pelayo Vallina, Victor Le Pochat, Álvaro Feal, Marius Paraschiv, Julien Gamba, Tim Burke, Oliver Hohlfeld, Juan Tapiador, Narseo Vallina-Rodriguez