Opinion Spam Analysis and Detection: Leaked Confidential Information as Ground Truth - PowerPoint PPT Presentation


  1. Opinion Spam Analysis and Detection: Leaked Confidential Information as Ground Truth
  Yu-Ren Chen, Hsin-Hsi Chen
  National Taiwan University

  2. What Is Opinion Spam?
  ● spreading commercially advantageous opinions as regular users on the Internet
    ○ positive opinions about one's own products/services
    ○ negative opinions about competitors
  ● also known as 'review spam' or 'shilling'
  ● undoubtedly unethical
  ● in most cases illegal
  ● believed to be widely used in practice
    ○ potentially lucrative profits → strong incentives

  3. 'Spam' in General
  ● definition of 'spam'
    ○ (Wikipedia) the use of electronic messaging systems to send unsolicited bulk messages (spam), especially advertising, indiscriminately
    ○ (Oxford) irrelevant or unsolicited messages sent over the Internet, typically to large numbers of users, for the purposes of advertising, phishing, spreading malware, etc.

  4. Various Kinds of Spam
  ● email spam
    ○ the most well-known kind of spam
    ○ 'spam' is defined as 'email spam' in Merriam-Webster
  ● search engine spam
    ○ manipulation of search engine indices and rankings
    ○ content spam: keyword stuffing, meta-tag stuffing, scraper sites, article spinning, machine translation
    ○ link spam: link farms, hidden links, expired domains, comment spam, referrer log spam
  ● social network spam
    ○ bulk messages, malicious links, fake friends, etc.
  ● many more...

  5. Opinion Spam vs. Other Spam
  ● very carefully written to avoid getting caught
    ○ deemed as fraud → unacceptable
    ○ backfiring may cause serious damage to the reputation of a brand (or a store, restaurant, etc.)
  ● initiated by a big brand in our case study (Samsung)
    ○ high stakes
    ○ opinion spammers have to be really careful

  6. Difficulty in Obtaining Dataset
  ● manual annotation is pretty much useless
    ○ low inter-annotator agreement score (Ott et al. 2011)
  ● approximated ground truth
    ○ duplicate / near-duplicate reviews (Jindal et al. 2008)
    ○ crowdsourced fake reviews (Ott et al. 2011)
  ● utilizing confidential internal records in our study
    ○ real record of opinion spam in the real world
    ○ 'true' ground truth

  7. Our Case Study - 三星寫手門事件 (the Samsung 'writer-gate' fake-review scandal)
  ● conducted by '鵬泰顧問有限公司' (Pengtai Consulting Co., Ltd.), a subsidiary company of Samsung
  ● hired writers and designated employees were instructed to make disingenuous reviews on web forums
  ● revealed by a hacker known as '0xb'
    ○ who made confidential documents from 鵬泰 publicly available on TaiwanSamsungLeaks.org
  ● Samsung was fined NT$10 million by the Fair Trade Commission in Taiwan

  8. About 'Mobile01'
  ● the main battlefield of this marketing campaign
  ● where the leaked documents were first made public
  ● one of the most popular websites in Taiwan
    ○ #10 in Alexa traffic rank in Taiwan
  ● mainly featuring discussion about consumer electronics
    ○ mobile phones, tablets, etc.
  ● primarily a web forum site
    ○ rather than a product review site such as Amazon
  ● written in Traditional Chinese for the most part
    ○ rather than in English

  9. Structure of Mobile01 Forum

  10. More about Web Forums
  ● thread (topic, 討論串)
    ○ collection of posts (from oldest to latest)
    ○ started by specifying the title and the first post
  ● post (文章)
    ○ first post (original post, thread starter, 一樓)
    ○ reply (回覆)
      ■ all posts except the first post
      ■ can be used for bumping (手動置頂, 頂)
  ● hierarchical structure
    ○ forum ⇢ board ⇢ thread ⇢ post

  11. Dataset Collection
  ● leaked spreadsheets HHP-2011.xlsx and HHP-2012.xlsx
    ○ containing URLs to the spam posts
    ○ source of the ground truth in our study
  ● Mobile01
    ○ contents of the posts
      ■ including spam and non-spam
    ○ other various kinds of information on Mobile01
    ○ all posts from 2011 to 2012 on the SAMSUNG (Android) board
      ■ where the 'spam density' is the highest
    ○ all profiles of the posters of such posts
    ○ three SQLite tables: POSTS, PROFILES, THREADS
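The scraped data lands in three SQLite tables. As a minimal sketch of how such a layout might look: only the table names (POSTS, PROFILES, THREADS) come from the slide; every column name below is a hypothetical illustration.

```python
import sqlite3

# Hypothetical schema for the scraped Mobile01 data; the ground-truth
# is_spam flag would be joined in from URLs in the leaked spreadsheets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE THREADS  (thread_id INTEGER PRIMARY KEY, board TEXT, title TEXT);
CREATE TABLE PROFILES (user_id INTEGER PRIMARY KEY, username TEXT, registered_at TEXT);
CREATE TABLE POSTS (
    post_id   INTEGER PRIMARY KEY,
    thread_id INTEGER REFERENCES THREADS(thread_id),
    user_id   INTEGER REFERENCES PROFILES(user_id),
    posted_at TEXT,
    body      TEXT,
    is_spam   INTEGER   -- 1 if the post's URL appears in the leaked spreadsheets
);
""")

# One spam first post opening a thread on the SAMSUNG (Android) board
conn.execute("INSERT INTO THREADS VALUES (1, 'SAMSUNG (Android)', 'Impressions of my new phone')")
conn.execute("INSERT INTO PROFILES VALUES (10, 'example_user', '2011-03-01')")
conn.execute("INSERT INTO POSTS VALUES (100, 1, 10, '2011-04-02 14:00', '...', 1)")

spam_count = conn.execute("SELECT COUNT(*) FROM POSTS WHERE is_spam = 1").fetchone()[0]
```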

  12. a snippet of the leaked spreadsheet ‘HHP-2012.xlsx’

  13. an example post on Mobile01

  14. a snippet of the table ‘POSTS’ scraped from Mobile01

  15. an example profile on Mobile01

  16. a snippet of the table ‘PROFILES’ scraped from Mobile01

  17. a snippet of the table ‘THREADS’ scraped from Mobile01

  18. amount of data collected

  19. Looking into the Dataset
  ● main observations
    ○ subtlety in spam posts
    ○ low spam post ratio of some spammers
    ○ types of spammer accounts
      ■ reputable accounts
      ■ throwaway accounts
    ○ first posts vs. replies in threads
    ○ pattern in submission time of spam posts
    ○ activeness of threads
    ○ collusive activities of spammers

  20. examples of subtly written spam posts

  21. examples of subtly written spam posts

  22. examples of subtly written spam posts

  23. Low Spam Post Ratio of Some Spammers
  ● our definition of spammer
    ○ a poster who has submitted any spam post
  ● only 33% of the posts from spammers are spam in this dataset

  24. Different Types of Spammer Accounts
  ● reputable accounts
    ○ hired reputable writers
    ○ low spam ratio
  ● throwaway accounts
    ○ registered in batches
    ○ high spam ratio
    ○ low number of threads initiated
  ● others
    ○ hired non-reputable posters?
    ○ borrowed accounts?

  25. pattern in registration time

  26. First Posts vs. Replies
  ● first post
    ○ initiates the thread
    ○ richer in content
    ○ higher spam ratio
  ● reply (2nd, 3rd, ... posts in a thread)
    ○ usually quite concise

  27. Submission Time of Spam Posts
  ● hypothesis: spam posts are more often made during work hours than normal posts
    ○ because spamming is a job rather than a leisure activity

  28. Activeness of Threads
  ● threads started by spam first posts are expected to be more active
    ○ written to draw attention and exposure
  ● measures of the 'activeness' of a thread
    ○ number of posts in the thread
    ○ number of clicks on the thread

  29. Collusion between Spammers
  ● different spam accounts submit spam posts to the same thread
    ○ fabricating a majority opinion
    ○ bumping the same spam thread
  ● 67% of the spam posts are in threads containing multiple spam posts
  ● could be different actual human posters
  ● or just one human logging in with different accounts
    ○ still can be seen as collusion between the accounts

  30. Detection
  ● evaluation metric
  ● data splitting
  ● machine learning
  ● spam detection for first posts
  ● spam detection for replies
  ● spammer detection

  31. Evaluation Metric
  ● spam posts / spammers are in the minority (<5%)
    ○ accuracy ✘
  ● high precision / recall on the spam / spammer class is preferable
    ○ F-measure ✔
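To see why accuracy is rejected here, consider a toy confusion matrix (the numbers are illustrative, not from the talk) for a <5% spam class:

```python
# 1000 posts, 40 of them spam; a classifier that catches half the spam.
tp, fp, fn, tn = 20, 10, 20, 950

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 0.97 -- looks great
precision = tp / (tp + fp)                    # 20/30 ~= 0.667
recall    = tp / (tp + fn)                    # 20/40  = 0.5
f_measure = 2 * precision * recall / (precision + recall)  # ~= 0.571

# A degenerate classifier that predicts "not spam" for everything would
# still score 96% accuracy, but its recall -- and hence F-measure -- is 0.
```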

  32. Data Splitting
  ● posts (spam detection)
    ○ made in 2011 → training set
    ○ made between Jan 2012 and May 2012 → test set
  ● user accounts (spammer detection)
    ○ who had submitted a post in 2011 → training set
    ○ who had submitted a post between Jan 2012 and May 2012 → test set
    ○ who had submitted a post in both → training set
      ■ probabilistic predictions on posts will be used in spammer detection

  33. Test Set* for Posts
  ● concern: capturing the writing habits of spammers?
    ○ favorite words, preferred writing style, etc.
    ○ might be purely personal preference, not essential to opinion spamming
  ● solution: removing posts by 'cross-posters' from the test set for posts
    ○ cross-posters: users who have made a post in both the training set and the test set
    ○ resulting set: test set*
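The test set* construction is a simple set filter. A minimal sketch, with hypothetical field names and toy posts:

```python
# Drop test-period posts whose author also posted in the training period,
# so a classifier cannot score by memorizing a cross-poster's writing habits.
train_posts = [
    {"post_id": 1, "user": "alice", "is_spam": False},
    {"post_id": 2, "user": "bob",   "is_spam": True},
]
test_posts = [
    {"post_id": 3, "user": "bob",   "is_spam": True},   # bob is a cross-poster
    {"post_id": 4, "user": "carol", "is_spam": False},
]

train_users = {p["user"] for p in train_posts}
test_star = [p for p in test_posts if p["user"] not in train_users]
# Only carol's post remains in test set*.
```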

  34. number of instances in each split

  35. Machine Learning
  ● Scikit-Learn (Pedregosa et al. 2011)
    ○ machine learning in Python
  ● SVM with RBF kernel
    ○ outperforms SVM with linear kernel, logistic regression, AdaBoost, random forests, etc.
    ○ Python wrapper for LibSVM
    ○ scaling features to zero mean and unit variance (Hsu et al. 2003)
    ○ two primary hyperparameters (C, γ) to tune
      ■ 5-fold cross-validation on the training set
      ■ grid search on (C, γ) with F-measure as the metric to optimize
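The training setup above can be sketched with Scikit-Learn directly: feature scaling, an RBF-kernel SVM, and a 5-fold grid search over (C, γ) optimizing F-measure. The grid values and the synthetic data below are illustrative, not the ones used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in imbalanced data (~10% positive class).
X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),     # zero mean, unit variance (Hsu et al. 2003)
    ("svm", SVC(kernel="rbf")),      # Scikit-Learn's SVC wraps LibSVM
])
grid = GridSearchCV(
    pipe,
    param_grid={"svm__C": [1, 10, 100], "svm__gamma": [0.01, 0.1, 1.0]},
    scoring="f1",                    # F-measure on the positive (spam) class
    cv=5,                            # 5-fold cross-validation on the training set
)
grid.fit(X, y)
best = grid.best_params_
```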

  36. Spam Detection for First Posts
  ● specifically for first posts in threads
  ● expected to have better results than for replies
    ○ higher spam ratio
    ○ richer content
  ● only the performance on test set* is shown in the following slides for conciseness

  37. Random Baseline
  ● predict whether a first post is spam according to the result of a fair coin flip
  ● precision ≈ ratio of spam
  ● recall ≈ 50%

  features \ metrics   precision   recall    F-measure
  random               2.52%       55.71%    4.82%
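A quick simulation confirms why the coin-flip baseline behaves this way: precision converges to the spam ratio and recall to 50%. The spam ratio below is synthetic, chosen only to be near the slide's figures.

```python
import random

random.seed(0)
n, spam_ratio = 100_000, 0.025
labels = [random.random() < spam_ratio for _ in range(n)]  # true spam flags
preds = [random.random() < 0.5 for _ in range(n)]          # fair coin flips

tp = sum(l and p for l, p in zip(labels, preds))
precision = tp / sum(preds)   # ~= spam_ratio: flips are independent of labels
recall = tp / sum(labels)     # ~= 0.5: each spam post is caught half the time
```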

  38. (Dimension-Reduced) Bag-of-Words
  ● Chinese word segmentation with Jieba
  ● words with < 5 occurrences are removed
  ● words appearing in over 30% of the posts are removed
    ○ stop words
  ● Randomized PCA (Halko et al. 2011) to reduce the dimension of bag-of-words
    ○ efficient on large matrices
    ○ mitigates overfitting
    ○ speeds up the training process
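A minimal Scikit-Learn sketch of this feature pipeline, assuming the two thresholds map to `min_df`/`max_df`: in the talk the posts are in Chinese and segmented with Jieba, but here a toy English corpus and the default tokenizer stand in, and `TruncatedSVD` with its randomized solver plays the role of randomized PCA on the sparse count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy stand-in corpus: two small topical groups plus filler that appears in
# 70% of the posts (so it is removed by the max_df threshold, like stop words).
posts = (["samsung phone screen"] * 6
         + ["battery life good"] * 6
         + ["generic chatter here"] * 28)

# min_df=5 drops rare words; max_df=0.3 drops words in >30% of the posts.
vec = CountVectorizer(min_df=5, max_df=0.3)
X = vec.fit_transform(posts)            # sparse bag-of-words matrix

# Randomized SVD reduces the dimensionality (150 in the talk; 2 for this toy).
svd = TruncatedSVD(n_components=2, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)
```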

  39. Number of Dimensions to Reduce to
  ● determined by the result of the 5-fold cross-validation
  ● F-measure is the highest when bag-of-words is reduced to 150 dimensions

  40. Bag-of-Words Performance
  ● tremendous performance boost
    ○ F-measure improved by about 46 percentage points
  ● how?

  features \ metrics   precision   recall    F-measure
  random               2.52%       55.71%    4.82%
  bag-of-words         50.00%      51.43%    50.70%
