  1. Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services
     Pelayo Vallina, Victor Le Pochat, Álvaro Feal, Marius Paraschiv, Julien Gamba, Tim Burke, Oliver Hohlfeld, Juan Tapiador, Narseo Vallina-Rodriguez

  2. Web directories (manually edited) vs. classification engines (automated) [Yan04, Qi09, Bru20]

  3. Why does the quality of these services matter?
     › End users: incorrect categories affect reliability, e.g. over- or under-blocking in content filtering
     › Academia: domain samples or results depend on these services; 24 papers at 2019 top conferences; lack of trust → resort to manual classification
     [Res04, Ric02, Sch18, LeP19, Ahm20, Zeb20]

  4. Services are opaque about how they operate: Validation? Training set? Comprehensiveness?

  5. Outline
     › Methodology
     › Empirical & human labeling validation
     › Deep dive: case studies
     › Discussion & conclusion

  6. We characterize each service along five dimensions:
     › Inputs
     › Outputs
     › Purpose: content filtering, threat assessment, marketing, discovery
     › Updates: (mostly) automated vs. manual
     › Access: direct, or via an aggregator

  7. Outline
     › Methodology
     › Empirical & human labeling validation
     › Deep dive: case studies
     › Discussion & conclusion

  8. Label gathering
     › Tranco Top 1M domains (Sept 1–30, 2019), expanded to 4.4M (sub)domains
     › Queried via an aggregator (4.4M domains), directly (4.4M), or directly under rate limits (10k)
     › Yield: 279k category labels covering 904k domains
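
     Aggregator-based gathering can be sketched as follows. This is a minimal illustration, assuming the VirusTotal v3 domain endpoint as the aggregator; the API key handling, delay, and error handling are placeholders, not the authors' actual pipeline.

        import time
        import requests

        VT_URL = "https://www.virustotal.com/api/v3/domains/{}"

        def fetch_categories(domain, api_key):
            """Fetch the per-service category labels VirusTotal aggregates for a domain."""
            resp = requests.get(VT_URL.format(domain), headers={"x-apikey": api_key})
            resp.raise_for_status()
            # 'categories' maps each contributing classification service to its label
            return resp.json()["data"]["attributes"]["categories"]

        def gather(domains, api_key, delay=15.0):
            """Query a list of domains, pausing between requests to respect rate limits."""
            labels = {}
            for domain in domains:
                try:
                    labels[domain] = fetch_categories(domain, api_key)
                except requests.HTTPError:
                    labels[domain] = None  # unlabeled domain or failed lookup
                time.sleep(delay)
            return labels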

  9. Service choice affects which domains are labeled
     › Coverage ranges from <1% to 94%, and is better for automated classification services (Updates)
     › Popular domains have better coverage
     › Subdomain coverage ranges from <1% to 99% (Inputs)
     › Coverage is inconsistent between labels sourced directly and labels obtained through VirusTotal (Access)
     [Bar chart: per-service coverage (0–100%) of the 4.4M and 10k domain sets]
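
     Per-service coverage is straightforward to compute once labels are gathered. A minimal sketch, assuming a mapping from service name to a {domain: label} dictionary (as produced by a gatherer like the one above); names are illustrative.

        def coverage(labels_by_service, universe):
            """Fraction of the domain universe that each service labels at all."""
            return {service: sum(1 for d in universe if labels.get(d)) / len(universe)
                    for service, labels in labels_by_service.items()}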

  10. Service choice affects the taxonomy granularity (Purpose)
      › Security/content filtering: fewer categories (as low as 12) → easier setup
      › Marketing: more categories (up to 7.5k) → fine-grained targeting

  11. Service choice affects label interpretation
      › Inconsistencies between documented and observed labels (Access)
      › Multiple labels are uncommon (Outputs)
      › Subdomains inherit labels from their parent domain (Inputs)
      › 3 out of 9 services updated labels, mostly for maliciousness (Updates)
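
      The inheritance behavior can be emulated when resolving a label for an arbitrary subdomain. A naive sketch: walk up the DNS hierarchy until a labeled ancestor is found. Function and data names are hypothetical; production code should consult the Public Suffix List (e.g. via tldextract) to stop at the registered domain rather than the bare TLD.

        def resolve_label(domain, labels):
            """Return the label of the domain itself or of its closest labeled ancestor."""
            parts = domain.split(".")
            for i in range(len(parts) - 1):  # stop before the bare TLD
                ancestor = ".".join(parts[i:])
                if ancestor in labels:
                    return labels[ancestor]
            return None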

  12. Service choice affects the label distribution
      › Services disagree on the distribution of labels over domains, as measured through mutual information (Purpose, Updates)
      › Labels are unevenly distributed over domains, as measured through label frequency (Purpose)
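
      Pairwise disagreement between services can be quantified with (normalized) mutual information over the domains both services label. A sketch using scikit-learn; the exact normalization the paper uses is not specified here, so treat this as one plausible instantiation.

        from sklearn.metrics import normalized_mutual_info_score

        def label_agreement(labels_a, labels_b):
            """Normalized mutual information between two services' labelings,
            restricted to domains covered by both (1.0 = identical partitions)."""
            common = [d for d in labels_a if d in labels_b]
            if not common:
                return 0.0
            return normalized_mutual_info_score(
                [labels_a[d] for d in common],
                [labels_b[d] for d in common])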

  13. Outline
      › Methodology
      › Empirical & human labeling validation
      › Deep dive: case studies
      › Discussion & conclusion

  14. Dynamics of human labeling may trigger biases
      Participation is concentrated:
      › at the beginning of the project → outdated labels?
      › among few users → lack of peer review?
      › on unlabeled domains → stale labels?

  15. Disagreement in human labeling may trigger biases
      › Label assignment is not completely objective
      › Empirically: clusters of correlated labels
      › Experimentally: 35.5% disagreement among the authors; 71% of author labels match the community label
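
      The inter-annotator disagreement figure can be reproduced in spirit as the fraction of differing label pairs across annotators. A minimal sketch; the paper's exact protocol may differ, and the names are illustrative.

        from itertools import combinations

        def disagreement_rate(annotations):
            """Fraction of annotator pairs, over all domains, that chose different labels.
            `annotations` maps each domain to the list of labels its annotators assigned."""
            differing = total = 0
            for labels in annotations.values():
                for a, b in combinations(labels, 2):
                    total += 1
                    differing += a != b
            return differing / total if total else 0.0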

  16. We analyze services on specialized use cases
      › Intended usage → requirements → data-source selection
      › Service selection → characteristics → coverage/accuracy
      › We estimate suitability for three case studies: obtain a manually curated list as “ground truth”, analyze coverage across its domains, and analyze the appropriateness of labels (see the sketch below)
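
      Each case study boils down to two ratios over the curated list: how many of its domains the service labels at all, and how many it labels appropriately. A sketch under the simplifying assumption that appropriateness can be approximated by keyword matching on the label text; the helper and keywords are hypothetical.

        def case_study(curated_domains, service_labels, expected_keywords):
            """Coverage and label appropriateness of one service on a curated list.
            `service_labels` maps domains to label strings."""
            labeled = [d for d in curated_domains if service_labels.get(d)]
            appropriate = [d for d in labeled
                           if any(k in service_labels[d].lower() for k in expected_keywords)]
            n = len(curated_domains)
            return len(labeled) / n, len(appropriate) / n

        # e.g. tracker domains from EasyPrivacy, matched against ad/tracker-style labels:
        # coverage, accuracy = case_study(tracker_domains, labels_of_service, ["ad", "track"])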

  17. Behavior differs widely for specialized use cases
      › Advertising and tracking (curated list: EasyList/EasyPrivacy): few services label the domains at all, let alone as trackers
      › Adult content (curated lists: [Val19] and gambling regulators): 5 services label correctly, 3 others hardly label any
      › CDNs/hosting providers (curated list: signatures from WebPageTest): confusion between service function and hosted content

  18. Outline
      › Methodology
      › Empirical & human labeling validation
      › Deep dive: case studies
      › Discussion & conclusion

  19. Recommendations
      › We avoid recommending a specific service: the “best” service depends on the use case and its requirements, and we cannot measure semantic agreement nor correctness
      › Our recommendations address best practices instead
      [Seb16, Lee13, Wei19]

  20. Recommendations
      › Coverage and accuracy may be insufficient: both are highly service- and use-case-dependent; consider the impact of errors
      › Purpose and updates may introduce biases: consult the documentation for the taxonomy and label sources... but verify (and report) manually, as inconsistencies exist
      › Taxonomies differ in size, scope, and semantics: sound aggregation is not obvious

  21. Outline
      › Methodology
      › Empirical & human labeling validation
      › Deep dive: case studies
      › Discussion & conclusion

  22. Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services
      Pelayo Vallina, Victor Le Pochat, Álvaro Feal, Marius Paraschiv, Julien Gamba, Tim Burke, Oliver Hohlfeld, Juan Tapiador, Narseo Vallina-Rodriguez
