Forum post classification to support forensic investigations of illegal trade on the Dark Web
System & Network Engineering (MSc), Research Project 2
Diana Rusu (diana.rusu@os3.nl)
Supervisors: Martijn Spitters, Stefan Verbruggen

Motivation
● Illegal business is thriving on the Deep Web
● Processing large amounts of data requires (semi-)automation
● Keyword matching is not sufficient for classification tasks
● Current techniques require large training sets

Research Question
In the context of grouping Dark Web marketplace forum posts into categories useful for forensic investigators: can we boost the classification process using semantic word representations, in order to reduce the required number of training samples?
Subquestions:
1. Which methods can exploit word representations to classify sparse, short posts on discussion forums using few examples?
2. What is the accuracy of the proposed methods and how can it be improved?

Word2Vec
● Represents each word as a vector (a word representation)
● Given a word, it predicts the n words that are likely to appear around it
● Given some words, it gives the word that fits that context (animated image [1])
● Creates a “semantic space” from a large amount of data
● Based on
  ○ skip-gram (predicts the surrounding words from a word)
  ○ CBOW, continuous bag-of-words (predicts a word from its surrounding words)
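As a minimal sketch of the two training schemes named above, the following shows how skip-gram and CBOW turn a tokenized sentence into training examples. The function names, the window size and the sample sentence are illustrative assumptions, not taken from the project. Real Word2Vec training additionally learns the vectors from these examples.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word predicts every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words jointly predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


if __name__ == "__main__":
    # Hypothetical tokenized post, for illustration only.
    sentence = "vendor ships pills fast".split()
    print(skipgram_pairs(sentence, window=1))
    print(cbow_examples(sentence, window=1))
```

Skip-gram yields one (word, neighbour) pair per context position, while CBOW yields one (context list, word) example per position, which is the only difference between the two schemes at this level.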

Experimental Data
Dataset provided by TNO, aggregated from different forums that accompany Deep Web marketplaces such as Agora or Evolution:

Data     Raw Posts    Tokenized Posts (after preprocessing)
Posts    1954508      1447029
Words    138310824    42835813

Taxonomy Class (figure not preserved)

Approach (diagram not preserved: Start Point to Intermediate Point)

Experiments - Setup (diagram not preserved: Intermediate Point to End Point)

Experiments
All of the following results were obtained with a single class label assigned to each post.

Experiments - Example 1
Human label: "hard_drugs"
Post 97 1072694
fakename wrote : i dont like street deals so i buy only here and another markets but need a fair deal.I gave you a vendor , whose prices are decent for an online market . And there are a shittonne of vendors online selling the Nijntje pills ... themostseekrit contact details upon request But I see nothing , no eyes ... no eyes on me .
-------------------------------------
Highest Rank (bottom-up)
TOP 36: greetings - 0.22749844193458557
...
TOP 8: trading_scamming - 0.8590390682220459
TOP 7: vendors - 0.8627676367759705
TOP 6: trading_shipping - 0.8668627142906189
TOP 5: financial_carding - 0.8688409924507141
TOP 4: hard_drugs - 0.8711443543434143
TOP 3: other - 0.8717963695526123
TOP 2: trading_feedback - 0.8815533518791199
TOP 1: trading_recommendation - 0.8951979279518127
The example above uses Cosine Similarity, tested with the 100-sample test set.
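A ranking like the one in the example above could be produced roughly as follows. This is a minimal sketch under stated assumptions: the slides only say that the cosine similarity is taken between an average class vector and a test vector, so building a post vector by averaging its word vectors, the helper names and the 2-dimensional toy vectors are all hypothetical.

```python
import math


def average_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def post_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have a known representation."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return average_vector(known) if known else None


def rank_classes(post_vec, class_vectors):
    """Return (label, similarity) pairs, most similar class first."""
    scored = [(label, cosine_similarity(post_vec, vec))
              for label, vec in class_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical 2-dimensional word vectors; real ones come from Word2Vec.
    word_vectors = {
        "pills": [0.9, 0.1],
        "vendor": [0.5, 0.5],
        "hello": [0.0, 1.0],
    }
    # Assumed class vector: average of the vectors of its labeled posts.
    class_vectors = {
        "hard_drugs": average_vector([
            post_vector(["pills"], word_vectors),
            post_vector(["pills", "vendor"], word_vectors),
        ]),
        "greetings": average_vector([post_vector(["hello"], word_vectors)]),
    }
    test_post = post_vector(["vendor", "pills", "unknownword"], word_vectors)
    for label, score in rank_classes(test_post, class_vectors):
        print(label, round(score, 4))
```

Tokens without a vector are skipped here, which is one possible way to handle out-of-vocabulary words in short, noisy posts.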

Results
Accuracy: the percentage of test instances for which the correct label was ranked #1 by the cosine similarity method or the SVM learning method.
(Chart not preserved; y-axis: accuracy in %)

Results
Excluding the “other” class label:
(Chart not preserved; y-axis: accuracy in %)

Results
When expanding the training set (applied in the case of Cosine Similarity):

Accuracy (in %)
Method               Test Set 1 (60 random posts)    Test Set 2 (100 random posts)
Cosine Similarity    14.0350877193                   20.8791208791

Results
Plot 1 (y-axis: accuracy in %; x-axis: TOP classes): The accuracy of the Cosine Similarity between the average class vector (AverageVector Class) and the test vector (Vector Test) increases significantly when the “human” labeled class only has to appear among the TOP 4 ranked classes.
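The TOP n accuracy plotted here can be computed as in the sketch below. The function name and the example rankings are hypothetical; with n=1 it reduces to the accuracy definition used in the earlier results.

```python
def top_n_accuracy(rankings, true_labels, n=1):
    """Percentage of posts whose true label is among the n best-ranked classes."""
    if not rankings:
        raise ValueError("no test posts")
    hits = sum(1 for ranked, truth in zip(rankings, true_labels)
               if truth in ranked[:n])
    return 100.0 * hits / len(rankings)


if __name__ == "__main__":
    # Hypothetical rankings (best class first) for three test posts.
    rankings = [
        ["trading_recommendation", "trading_feedback", "other", "hard_drugs"],
        ["hard_drugs", "vendors", "other", "greetings"],
        ["other", "vendors", "greetings", "trading_shipping"],
    ]
    true_labels = ["hard_drugs", "hard_drugs", "financial_carding"]
    print(top_n_accuracy(rankings, true_labels, n=1))
    print(top_n_accuracy(rankings, true_labels, n=4))
```

Because a larger n can only add hits, TOP n accuracy never decreases as n grows, which is consistent with the rising curve described for the plot.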

Evaluation
Plot 2 (y-axis: accuracy in %; x-axis: TOP classes): The accuracy of the Cosine Similarity on the same samples. TOP 4 accuracy reaches ~50% with the initial training set, compared with ~40% when the initial training set is extended.

Conclusions
❏ Cosine Similarity using word representations provides ~20% accuracy at TOP 1 in the experiments conducted (single class label per post), while SVM performs better with ~39% accuracy
❏ Cosine Similarity improves significantly when the “human” labeled class is searched for among the TOP 4 classes assigned by the classifier; in that case it achieves ~50% accuracy. SVM still needs to be tested for the TOP n classes (see report)
❏ In practice, based on the results, if a small training set is improved with correct multi-class labeling of each post, it is feasible to use word representations as features for a classifier, in order to get quick thematic insight into the discussion forums that reside on the Dark Web

Future Work
● The training set has to be reviewed by at least 2 persons
● Expand the Taxonomy class
● Integrate this classifier into the DarkWebMonitor portal (darkwebmonitor.eu)

Questions?

References
1. http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
