Mining bipartite graphs to improve semantic pedophile activity detection (short paper)
R. Fournier, M. Danisch
L2TI / Institut Galilée, Université Paris-Nord, Sorbonne Paris-Cité
LIP6, CNRS and Université Pierre et Marie Curie
May 28th, 2014
Context
  Paedophile activity in P2P systems
    Children victimization
    Danger for innocent users
    Policy making issues
  Recent research
    Identification of large file providers
    Collection of large sets of queries
    Design and validation of a detection tool [IPM 2012]
  Extend this effort
Datasets
  Queries submitted to an eDonkey search engine (2007)
  10 weeks, 100 million queries, 24 million IP addresses
  Set of queries: \( q_i = (t, u, k_1, k_2, \ldots, k_n) \)
    t: timestamp
    u: user information (IP address, connection port)
    \( (k_1, k_2, \ldots, k_n) \): sequence of keywords
  Duly anonymised
Pedophile detection tool
  4 semantic categories of paedophile queries; a query is tagged as paedophile if it:
    matches explicit, or
    matches child and sex, or
    matches familyparents and familychild and sex, or
    matches agesuffix with age<17 and (sex or child)
  Focus on reduced false positives rate (< 1.4%)
  False negatives rate: 24.5%
  False negatives ("pjk 12yo")
  False positives ("sexy daddy destinys child")
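The four tests can be sketched as a single boolean rule. This is only an illustration of the logic on the slide, not the tool from [IPM 2012]: the keyword lists below are hypothetical and deliberately tiny (they contain only words taken from the two example queries on the slide), the `explicit` and `familychild` lists are left empty, and the `yo` age-suffix pattern is an assumption.

```python
import re

# Hypothetical, deliberately tiny keyword lists (assumptions for illustration;
# the real tool's lists are not given on the slide).
EXPLICIT = set()          # placeholder, left empty
CHILD = {"child"}
SEX = {"sexy"}
FAMILY_PARENTS = {"daddy"}
FAMILY_CHILD = set()      # placeholder, left empty
AGE_SUFFIX = re.compile(r"^(\d{1,2})yo$")  # assumed form, e.g. "12yo"


def is_tagged(query: str) -> bool:
    """Return True if the query matches at least one of the four tests."""
    words = query.lower().split()
    word_set = set(words)

    def has(keywords: set) -> bool:
        return bool(word_set & keywords)

    # Ages extracted from words such as "12yo".
    ages = [int(m.group(1)) for w in words if (m := AGE_SUFFIX.match(w))]
    young = any(age < 17 for age in ages)

    return (
        has(EXPLICIT)
        or (has(CHILD) and has(SEX))
        or (has(FAMILY_PARENTS) and has(FAMILY_CHILD) and has(SEX))
        or (young and (has(SEX) or has(CHILD)))
    )


print(is_tagged("sexy daddy destinys child"))  # True: child and sex both match
print(is_tagged("pjk 12yo"))                   # False: age < 17 but no sex/child word
```

With these toy lists the rule reproduces both error types quoted on the slide: the first query is tagged although it is a false positive, and the second is missed because "pjk" is in none of the lists.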
Our approach
  Goals
    Reduce the number of queries to process manually
    Validate existing classification
  Bipartite graphs and communities
    \[ s_C(r) = \frac{\sum_{u \in V(r)} |C \cap R(u) \setminus \{r\}|}{\sum_{u \in V(r)} |R(u) \setminus \{r\}|} \]
    \[ s'_C(r) = \frac{1}{|V(r)|} \sum_{u \in V(r)} \frac{|C \cap R(u) \setminus \{r\}|}{|R(u) \setminus \{r\}|} \]
  [Figure: bipartite graph linking queries q1 to q5 (labels 1, 0, 1, 0, 0) to users u1 to u4; example value s_1(q2) = 0.5]
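A minimal sketch of the two scores, under the following reading of the notation (an interpretation, since the slide does not define the symbols): V(r) is the set of users who submitted query r, R(u) is the set of queries submitted by user u, and C is the set of queries already tagged by the tool. The edge list is a hypothetical toy graph, not the one drawn on the slide, and giving a ratio of 0 to users with no other query is also an assumption.

```python
from collections import defaultdict


def build_maps(edges):
    """Build R (user -> queries) and V (query -> users) from (query, user) pairs."""
    R, V = defaultdict(set), defaultdict(set)
    for query, user in edges:
        R[user].add(query)
        V[query].add(user)
    return R, V


def score(r, C, R, V):
    """s_C(r): pooled fraction of tagged queries among the other queries of r's users."""
    num = sum(len((R[u] - {r}) & C) for u in V[r])
    den = sum(len(R[u] - {r}) for u in V[r])
    return num / den if den else 0.0


def score_avg(r, C, R, V):
    """s'_C(r): per-user fraction of tagged queries, averaged over r's users."""
    users = V[r]
    if not users:
        return 0.0
    total = 0.0
    for u in users:
        others = R[u] - {r}
        # Assumption: a user with no other query contributes 0.
        total += len(others & C) / len(others) if others else 0.0
    return total / len(users)


# Hypothetical toy graph: u1 sent q1, q2; u2 sent q2, q3, q4, q5.
edges = [("q1", "u1"), ("q2", "u1"),
         ("q2", "u2"), ("q3", "u2"), ("q4", "u2"), ("q5", "u2")]
C = {"q1", "q3"}  # queries tagged by the tool (label 1)
R, V = build_maps(edges)

print(score("q2", C, R, V))      # 0.5    = (1 + 1) / (1 + 3)
print(score_avg("q2", C, R, V))  # 0.666… = (1/1 + 1/3) / 2
```

The two values differ because s pools all neighbouring queries, so very active users dominate, whereas s' gives each user the same weight.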
Results
  4,518 queries (out of 12,858) with score 1 not detected
    further analysis required to avoid increased FP rate
    new keywords and combinations obtained
  [Figure: two panels, SCORE and P(TRUE | RANKED X), plotted against the rank of the request according to its score (log scale, 10^0 to 10^8)]
Results
  Categories 2, 3 and 4 are weakly connected to category 1
  [Figure: three panels, SCORE, P(1 | RANKED X) and P(234 | RANKED X), plotted against the rank of the request according to its score (log scale, 10^0 to 10^8)]
Conclusion
  Measure of topological similarity between queries
  Limitation of the number of errors to process manually
  Semantic and topological categories seem linked
Future work
  Explore other scoring functions
  Explore local community completion methods
  Update the original filter:
    refine its lists of keywords
    introduce new categories
    subdivide existing categories
Thank you for your attention. Questions? raphael.fournier@lip6.fr