Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services
Pelayo Vallina, Victor Le Pochat, Álvaro Feal, Marius Paraschiv, Julien Gamba, Tim Burke, Oliver Hohlfeld, Juan Tapiador, Narseo Vallina-Rodriguez
Domain classification: web directories (manually edited) and classification engines (automated) [Yan04, Qi09, Bru20]
Why does the quality of these services matter?
› End users: incorrect categories affect reliability, e.g., over- or underblocking in content filtering
› Academia: domain samples or results depend on them (24 papers at top conferences in 2019); lack of trust → resort to manual classification [Res04, Ric02, Sch18, LeP19, Ahm20, Zeb20]
Services are opaque on how they operate: Validation? Training set? Comprehensiveness?
Outline
› Methodology
› Empirical validation
› Deep dive: human labeling & case studies
› Discussion
› Conclusion
Methodology
We characterize services along five dimensions: Inputs, Outputs, Purpose, Updates, Access
› Some services are aggregators that combine other services
› Purpose: content filtering, threat assessment, marketing, discovery
› Label assignment ranges from (mostly) automated to manual
Empirical validation
Label gathering (Sept 1-30, 2019)
› Input: Top 1M Tranco domains → 4.4M domains
› Direct queries: 4.4M domains; ↓ 10k for rate-limited services (sketch below)
› Aggregate queries: 4.4M domains
› 279k categories (904k domains)
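As a minimal sketch of this kind of label gathering, assuming a hypothetical HTTP categorization endpoint: CATEGORY_API, API_KEY, the "categories" response field, and the JSONL output format are illustrative and not any specific service's API.

```python
# Sketch: query a (hypothetical) categorization API for a list of domains.
# CATEGORY_API, API_KEY and the JSON response layout are illustrative only.
import csv
import json
import time
import requests

CATEGORY_API = "https://categorization.example.com/v1/domain"  # hypothetical
API_KEY = "..."  # per-service credential
RATE_LIMIT_SLEEP = 1.0  # seconds between queries for rate-limited services

def load_tranco(path, limit=1_000_000):
    """Read domains from a Tranco CSV (rank,domain)."""
    with open(path, newline="") as fh:
        return [row[1] for _, row in zip(range(limit), csv.reader(fh))]

def query_service(domain):
    """Ask the service for this domain's categories; None if unlabeled."""
    resp = requests.get(CATEGORY_API,
                        params={"domain": domain, "apikey": API_KEY},
                        timeout=10)
    resp.raise_for_status()
    cats = resp.json().get("categories", [])
    return cats or None

def gather_labels(domains, out_path):
    """Store one JSON record per queried domain, including unlabeled ones."""
    with open(out_path, "w") as out:
        for domain in domains:
            try:
                cats = query_service(domain)
            except requests.RequestException:
                cats = None  # record failures as "no label"
            out.write(json.dumps({"domain": domain, "categories": cats}) + "\n")
            time.sleep(RATE_LIMIT_SLEEP)

if __name__ == "__main__":
    gather_labels(load_tranco("tranco.csv", limit=10_000), "labels.jsonl")
```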
Service choice affects which domains are labeled
› Coverage ranges from <1% to 94%; better for automated classification services (Updates) (sketch below)
› Popular domains have better coverage
› Subdomain coverage ranges from <1% to 99% (Inputs)
› Inconsistent when sourced directly or through VirusTotal (Access)
[Figure: per-service coverage (%) over the 4.4M and 10k domain sets]
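A sketch of how per-service coverage could be computed from the collected labels, assuming one JSONL file per service in the format produced above (file names and fields are illustrative):

```python
# Sketch: coverage = fraction of queried domains for which a service returns a label.
import json

def coverage(labels_path, queried_domains):
    labeled = set()
    with open(labels_path) as fh:
        for line in fh:
            rec = json.loads(line)
            if rec.get("categories"):
                labeled.add(rec["domain"])
    return len(labeled & set(queried_domains)) / len(queried_domains)

# Usage (illustrative service names and files):
# for service in ["serviceA", "serviceB"]:
#     print(service, coverage(f"{service}.jsonl", queried_domains))
```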
Service choice affects the taxonomy granularity (Purpose)
› Security/content filtering: fewer categories (as low as 12), easier setup
› Marketing: more categories (up to 7.5k), fine-grained targeting
Service choice affects label interpretation
› Inconsistencies between documented and observed labels (Access)
› Multiple labels are uncommon (Outputs)
› Subdomains inherit labels from their parent (Inputs)
› 3 out of 9 services updated labels, mostly for maliciousness (Updates)
Service choice affects label distribution
› Disagreement on the distribution of labels over domains, as measured through mutual information (Purpose, Updates) (sketch below)
› Uneven distribution of labels over domains, as measured through label frequency (Purpose)
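A sketch of how agreement between two services' label assignments could be quantified with normalized mutual information, assuming each service's labels are available as a domain → category dict (variable names and toy data are illustrative):

```python
# Sketch: normalized mutual information between two services' labelings,
# computed over the domains that both services label.
from sklearn.metrics import normalized_mutual_info_score

def label_agreement(labels_a, labels_b):
    """labels_a, labels_b: dict mapping domain -> category string."""
    common = sorted(set(labels_a) & set(labels_b))
    if not common:
        return None
    a = [labels_a[d] for d in common]
    b = [labels_b[d] for d in common]
    return normalized_mutual_info_score(a, b)

# Example with toy data:
# label_agreement({"a.com": "news", "b.com": "shopping"},
#                 {"a.com": "media", "b.com": "e-commerce"})
```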
Deep dive: human labeling & case studies
Dynamics of human labeling may trigger biases: participation is concentrated (sketch below)
› at the beginning of the project → outdated labels?
› among few users → lack of peer review?
› on unlabeled domains → stale labels?
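A sketch of how such concentration could be measured from an edit log, assuming records with a numeric (epoch) timestamp, an editor name, and a domain; the field names and log format are assumptions:

```python
# Sketch: how concentrated is labeling activity in time and across editors?
import json
from collections import Counter

def concentration(edit_log_path, early_fraction=0.1, top_k=10):
    with open(edit_log_path) as fh:
        edits = [json.loads(line) for line in fh]
    edits.sort(key=lambda e: e["timestamp"])
    n = len(edits)
    # Share of all edits made in the earliest `early_fraction` of the project's lifetime.
    t0, t1 = edits[0]["timestamp"], edits[-1]["timestamp"]
    cutoff = t0 + early_fraction * (t1 - t0)
    early_share = sum(e["timestamp"] <= cutoff for e in edits) / n
    # Share of all edits made by the `top_k` most active editors.
    by_user = Counter(e["editor"] for e in edits)
    top_share = sum(c for _, c in by_user.most_common(top_k)) / n
    return early_share, top_share
```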
Disagreement in human labeling may trigger biases
› Label assignment is not completely objective
› Empirically: clusters of correlated labels
› Experimentally: 35.5% disagreement among authors, 71% match the community label (sketch below)
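A sketch of how such author/community agreement numbers could be computed, assuming per-domain labels from several authors plus the community label; the data layout is an assumption:

```python
# Sketch: pairwise disagreement among authors, and how often author labels
# match the community-assigned label.
from itertools import combinations

def disagreement_stats(annotations):
    """annotations: list of {"community": str, "authors": [str, ...]} per domain."""
    pair_total = pair_diff = 0
    author_total = author_match = 0
    for rec in annotations:
        for a, b in combinations(rec["authors"], 2):
            pair_total += 1
            pair_diff += (a != b)
        for a in rec["authors"]:
            author_total += 1
            author_match += (a == rec["community"])
    return pair_diff / pair_total, author_match / author_total
```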
We analyze services on specialized use cases
› Intended usage → requirements → data source selection
› Service selection → characteristics → coverage/accuracy
› Estimate suitability for three case studies:
  1. Obtain a manually curated list as "ground truth"
  2. Analyze coverage across domains
  3. Analyze appropriateness of labels
Behavior differs widely for specialized use cases (sketch below)
› Advertising and tracking
  Curated list: EasyList/EasyPrivacy
  Finding: few services label the domains at all, let alone as trackers
› Adult content
  Curated list: [Val19] and gambling regulators
  Finding: 5 services label correctly, 3 others hardly label any
› CDNs/hosting providers
  Curated list: signatures from WebPageTest
  Finding: confusion between service function and content
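A sketch of the coverage part of such a case study, assuming the curated list has already been reduced to a set of domains and each service's labels are available as a domain → categories dict; the tracking-related category names are illustrative:

```python
# Sketch: for a curated "ground truth" list (e.g. domains extracted from
# EasyList/EasyPrivacy), check how many domains each service labels at all,
# and how many receive a tracking-related label.
TRACKING_CATEGORIES = {"trackers", "web analytics", "advertising"}  # illustrative

def case_study(curated_domains, service_labels):
    """service_labels: dict mapping domain -> set of category strings."""
    labeled = [d for d in curated_domains if service_labels.get(d)]
    as_tracker = [d for d in labeled
                  if service_labels[d] & TRACKING_CATEGORIES]
    n = len(curated_domains)
    return len(labeled) / n, len(as_tracker) / n
```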
Discussion
Recommendations
› We avoid recommending a specific service: the "best" service depends on use case and requirements, and we cannot measure semantic agreement or correctness
› Our recommendations address best practices [Seb16, Lee13, Wei19]
Recommendations
› Coverage and accuracy may be insufficient: very service- and use case-dependent; consider the impact of errors
› Purpose and updates may introduce biases: consult documentation for taxonomy and label sources ... but verify (and report) manually, as inconsistencies exist
› Taxonomies differ in size, scope and semantics: sound aggregation is not obvious (sketch below)
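One reason sound aggregation is not obvious is that every service uses its own category names and granularity, so any aggregation needs an explicit, manually curated mapping onto a shared coarse taxonomy. A minimal sketch of that idea; the mapping table, service names, and category names are purely illustrative:

```python
# Sketch: map service-specific labels onto a shared coarse taxonomy before
# comparing or combining services. The mapping itself must be curated by hand.
SHARED_TAXONOMY = {
    ("serviceA", "Web Ads/Analytics"): "advertising",
    ("serviceA", "Pornography"): "adult",
    ("serviceB", "advertisements"): "advertising",
    ("serviceB", "adult themes"): "adult",
}  # illustrative entries only

def aggregate(per_service_labels):
    """per_service_labels: dict mapping service name -> set of raw labels for one domain."""
    mapped = set()
    for service, labels in per_service_labels.items():
        for label in labels:
            coarse = SHARED_TAXONOMY.get((service, label))
            if coarse is None:
                continue  # unmapped labels are dropped, which already loses information
            mapped.add(coarse)
    return mapped
```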
Conclusion
Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services
Pelayo Vallina, Victor Le Pochat, Álvaro Feal, Marius Paraschiv, Julien Gamba, Tim Burke, Oliver Hohlfeld, Juan Tapiador, Narseo Vallina-Rodriguez