Your words betray you! The role of language in cyber crime investigations Awais Rashid
Digital World Online World Physical World
dual use
P2P Study • 1.6% of searches and 2.4% of responses on the Gnutella network alone (study by Hughes et al. 2006) • Hundreds or thousands of searches per second – Approx. 600,000 searches per day on Gnutella alone • Specialist vocabulary: 53% of searches and 88% of responses used such keywords. – Vocabulary changes over time.
Top 100 Frequent Searches [Chart: search frequency by popularity and topic]
Core of Distributors
Chat and Social Networking
Digital Personas
Do You Know Who You Are Talking To? 18.3%
Experience from • Isis: Protecting Children in Online Social Networks (EPSRC/ESRC) • iCOP: Identifying and Catching Originators in P2P Networks (EC Safer Internet Programme)
Detecting Deceptive Digital Personas
Stylistic Language “Fingerprint” [Diagram labels: Individual_1, Individual_2, Individual_3, Individual_4, each paired with a new text]
Age and Gender Analysis [Diagram labels: Stylistic Classifier • Features: word level, syntactic level, semantic level • Reference Data Sets: female, male • Distance Measure]
No Deception – Age (Precision) [Plot: precision (%) against threshold (%), Levels 1 to 5; annotated values 77.35% and 72.24%]
No Deception – Age (Recall) [Plot: recall (%) against threshold (%), Levels 1 to 5]
No Deception – Gender [Plot: recall and precision (%) against threshold (%); annotated values 66.86% and 71.07%]
Deception Detection 18.3%
Deception Detection 84.29%
Deception Detection 93.18%
Is it being used? • Being used by law enforcement following trials and commercialisation via a spin-out company (Relative Insight) • UK case study for Internet Governance Forum in 2009, 2010 • Featured in international TV and print news media • Part of evidence to UK Select Committee on Child Protection and EU Policy frameworks • Chosen as one of the 100 Big Ideas for the future by Universities UK and Research Councils UK (2011) • Mobile App built on the digital persona analysis demonstrated to the Prime Minister at WeProtect, Dec. 2014 • An Impact Case Study for REF2014
Detecting Specialist Tactics, e.g., Vocabulary
Detecting new/unknown CSA media in P2P Networks • Using query analysis to automatically triage and identify potential candidates for new CSA media • New text analysis techniques to automatically flag potential CSA media based on their filename • (Semi-)automatic video and image analysis techniques to assess CSA content
Filename Classification Key challenges • Compiling a CSA dataset • Filenames = short text samples • Presence of non-standard forms & “specialised” vocabulary
Filename Classification (2) Dataset • Manual collection through law enforcement (LE) → 268 CSA filenames • Legal pornography sites → 10K non-CSA filenames • Simulates the real-life data distribution in P2P
Filename Classification (3) Feature Selection • Semantic features – Known CSA keywords – Explicit language use – References to children, young age – Family relations • Example: original filename ptl0lita12yo.jpeg → semantic feats. [paedo_keyword] [child_ref]
Filename Classification (4) Feature Selection • Character n-grams – slices of 2, 3 and 4 consecutive characters
Original filename: ptl0lita12yo.jpeg
Char. 2-grams: pt tl l0 0l li it ta a1 12 2y yo
Char. 3-grams: ptl tl0 l0l 0li lit ita ta1 a12 12y 2yo
Char. 4-grams: ptl0 tl0l l0li 0lit lita ita1 ta12 a12y 12yo
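The n-gram slicing on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the iCOP implementation: the function name `char_ngrams` is hypothetical, and dropping the file extension before slicing is an assumption inferred from the slide's example, where every list ends at "yo".

```python
from pathlib import PurePath


def char_ngrams(filename, sizes=(2, 3, 4)):
    """Return {n: list of character n-grams} for each n in sizes.

    Assumption: the extension is removed first, since the slide's
    example n-grams never include ".jpeg".
    """
    stem = PurePath(filename).stem
    return {
        n: [stem[i:i + n] for i in range(len(stem) - n + 1)]
        for n in sizes
    }


grams = char_ngrams("ptl0lita12yo.jpeg")
for n, items in grams.items():
    print(f"Char. {n}-grams:", " ".join(items))
```

Because the slices overlap, obfuscated spellings still share most of their n-grams with the forms they imitate, which is one plausible reason these features cope with non-standard vocabulary better than keyword matching.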
Filename Classification (5) Experimental Setup • Support Vector Machines (LibShortText) • 5-fold cross-validation • Evaluation: – Overall system accuracy – Precision, Recall and F-score per class label
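To make the 5-fold setup concrete, the sketch below splits a label list of the size given on the dataset slide into five folds. It is an illustration only: the slides do not say whether the folds were stratified, so the per-class assignment here is an assumption, "10K" is taken as exactly 10,000, and `stratified_folds` is a hypothetical helper rather than anything from LibShortText.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, spreading each class evenly.

    Assumption: stratification is used so that the rare CSA class
    appears in every fold; the slides only state "5-fold".
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Class sizes from the dataset slide (10K treated as exactly 10,000).
labels = ["CSA"] * 268 + ["non-CSA"] * 10_000
folds = stratified_folds(labels)

for held_out, test_idx in enumerate(folds):
    # Train on the other four folds, evaluate on the held-out one.
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    n_csa = sum(labels[i] == "CSA" for i in test_idx)
    print(f"fold {held_out}: train={len(train_idx)} test={len(test_idx)} CSA in test={n_csa}")
```

With only 268 positive examples against roughly 10,000 negatives, each held-out fold contains just over 50 CSA filenames, which is why per-class precision, recall and F-score are more informative here than overall accuracy.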
Filename Classification (6) Results: SVM classifier scores (%)
Features         Class     Precision  Recall  F-score
Semantic feats.  CSA             5.7    21.3      9.0
Semantic feats.  Non-CSA        97.7    90.6     94.0
Char. n-grams    CSA            89.8    62.3     73.6
Char. n-grams    Non-CSA        99.0    99.8     99.4
Combined         CSA            89.9    66.1     76.1
Combined         Non-CSA        99.1    99.8     99.5
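The F-scores in the results can be checked against the reported precision and recall. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the slides do not state explicitly; the `reported` dictionary simply restates the slide's numbers.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (both in %)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) per feature set and class,
# copied from the results slide.
reported = {
    ("Semantic feats.", "CSA"): (5.7, 21.3, 9.0),
    ("Semantic feats.", "Non-CSA"): (97.7, 90.6, 94.0),
    ("Char. n-grams", "CSA"): (89.8, 62.3, 73.6),
    ("Char. n-grams", "Non-CSA"): (99.0, 99.8, 99.4),
    ("Combined", "CSA"): (89.9, 66.1, 76.1),
    ("Combined", "Non-CSA"): (99.1, 99.8, 99.5),
}

for (features, label), (p, r, f) in reported.items():
    print(f"{features:16s} {label:8s} reported={f:5.1f} recomputed={f_score(p, r):6.2f}")
```

Under that assumption the recomputed values agree with the slide to within about 0.1, the residual being attributable to the precision and recall figures themselves being rounded to one decimal place.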
The iCOP Toolkit
Is it being used? • Training days for European Law Enforcement personnel – Participants from 8 European countries and Interpol – Hands-on sessions on live P2P data • Live demonstration at Interpol at end of project • Being utilised by several law enforcement agencies in Europe
Further Information: Isis • A. Rashid, A. Baron, P. Rayson, C. May-Chahal, P. Greenwood, J. Walkerdine (2013). “Who Am I? Analysing Digital Personas in Cyber Crime Investigations”, IEEE Computer, 46(4). • C. May-Chahal, C. Mason, A. Rashid, P. Greenwood, J. Walkerdine, P. Rayson (2014). “Safeguarding Cyborg Childhoods: Incorporating the On/Offline Behaviour of Children into Everyday Social Work Practices”, British Journal of Social Work.
Further Information: iCOP • C. Peersman, C. Schulze, A. Rashid, M. Brennan, C. Fischer (2014). “iCOP: Automatically Identifying New Child Abuse Media in P2P Networks”, IEEE Symposium on Security and Privacy Workshops 2014, pp. 124-131. • C. Peersman, C. Schulze, A. Rashid, M. Brennan, C. Fischer (2016). “iCOP: live forensics to reveal previously unknown criminal media on P2P networks”, Digital Investigation, 18, pp. 50-64.
Further Information: General • M. Edwards, A. Rashid, P. Rayson (2015). “A Systematic Survey of Online Data Mining Technology Intended for Law Enforcement”, ACM Computing Surveys, 48(1). • A. Rashid, J. Weckert, R. Lucas (2009). “Software Engineering Ethics in a Digital World”, IEEE Computer, 42(6), pp. 34-41. • A. Rashid, K. Moore, C. May-Chahal, R. Chitchyan (2015). “Managing emergent ethical concerns for software engineering in society”, Proc. ICSE 2015, Software Engineering in Society, pp. 523-526. IEEE.