Opinion Spam Analysis and Detection
Leaked Confidential Information as Ground Truth
Yu-Ren Chen, Hsin-Hsi Chen
National Taiwan University
What Is Opinion Spam?
● spreading commercially advantageous opinions as regular users on the Internet
  ○ positive opinions about own products/services
  ○ negative opinions about competitors
● also known as ‘review spam’ or ‘shilling’
● undoubtedly unethical
● in most cases illegal
● believed to be widely used in practice
  ○ potentially lucrative profits → strong incentives
‘Spam’ in General
● definition of ‘spam’
  ○ (Wikipedia) the use of electronic messaging systems to send unsolicited bulk messages (spam), especially advertising, indiscriminately
  ○ (Oxford) irrelevant or unsolicited messages sent over the Internet, typically to large numbers of users, for the purposes of advertising, phishing, spreading malware, etc.
Various Kinds of Spam
● email spam
  ○ the most well-known kind of spam
  ○ ‘spam’ is defined as ‘email spam’ in Merriam-Webster
● search engine spam
  ○ manipulation of search engine indices and rankings
  ○ content spam: keyword stuffing, meta-tag stuffing, scraper sites, article spinning, machine translation
  ○ link spam: link farms, hidden links, expired domains, comment spam, referrer log spam
● social network spam
  ○ bulk messages, malicious links, fake friends, etc.
● many more...
Opinion Spam vs. Other Spam
● very carefully written to avoid getting caught
  ○ deemed as fraud → unacceptable
  ○ backfiring may cause serious damage to the reputation of a brand (or a store, restaurant, etc.)
● initiated by a big brand in our case study (Samsung)
  ○ high stakes
  ○ opinion spammers have to be really careful
Difficulty in Obtaining a Dataset
● manual annotation is pretty much useless
  ○ low inter-annotator agreement score (Ott et al. 2011)
● approximated ground truth
  ○ duplicate / near-duplicate reviews (Jindal et al. 2008)
  ○ crowdsourced fake reviews (Ott et al. 2011)
● utilizing confidential internal records in our study
  ○ a real record of opinion spam in the real world
  ○ ‘true’ ground truth
Our Case Study - the Samsung ‘hired writer’ scandal (三星寫手門事件)
● conducted by ‘鵬泰顧問有限公司’, a consulting firm and subsidiary of Samsung
● hired writers and designated employees were instructed to post disingenuous reviews on web forums
● revealed by a hacker known as ‘0xb’
  ○ who made confidential documents from 鵬泰 publicly available on TaiwanSamsungLeaks.org
● Samsung was fined NT$10 million by the Fair Trade Commission in Taiwan
About ‘Mobile01’
● the main battlefield of this marketing campaign
● where the leaked documents were first made public
● one of the most popular websites in Taiwan
  ○ #10 in Alexa traffic rank in Taiwan
● mainly featuring discussion about consumer electronics
  ○ mobile phones, tablets, etc.
● primarily a web forum
  ○ rather than a product review site such as Amazon
● written mostly in Traditional Chinese
  ○ rather than in English
Structure of Mobile01 Forum
More about Web Forums
● thread (topic, 討論串)
  ○ a collection of posts (from oldest to latest)
  ○ started by specifying the title and the first post
● post (文章)
  ○ first post (original post, thread starter, 一樓)
  ○ reply (回覆)
    ■ all posts except the first post
    ■ can be used for bumping (手動置頂)
● hierarchical structure
  ○ forum ⇢ board ⇢ thread ⇢ post
Dataset Collection
● leaked spreadsheets
  ○ HHP-2011.xlsx and HHP-2012.xlsx
  ○ containing URLs to the spam posts
  ○ the source of the ground truth in our study
● Mobile01
  ○ contents of the posts, including spam and non-spam
  ○ various other kinds of information on Mobile01
  ○ all posts from 2011 to 2012 on the SAMSUNG (Android) board, where the ‘spam density’ is the highest
  ○ all profiles of the posters of such posts
  ○ three SQLite tables: POSTS, PROFILES, THREADS
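The slides only name the three SQLite tables; as a rough illustration of how the scraped data could be organized, here is a minimal sketch. The column names and layout are hypothetical, not the actual schema used in the study:

```python
import sqlite3

# Hypothetical schema sketch for the three scraped tables; the real
# column layout from the study is not shown on the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE THREADS (thread_id INTEGER PRIMARY KEY, title TEXT,
                      board TEXT, num_posts INTEGER, num_clicks INTEGER);
CREATE TABLE POSTS   (post_id INTEGER PRIMARY KEY, thread_id INTEGER,
                      poster TEXT, submitted_at TEXT, content TEXT,
                      is_spam INTEGER);  -- label derived from the leaked URLs
CREATE TABLE PROFILES(poster TEXT PRIMARY KEY, registered_at TEXT);
""")

# e.g. a post would be labeled spam if its URL appears in the leaked
# HHP-2011.xlsx / HHP-2012.xlsx spreadsheets
conn.execute("INSERT INTO POSTS VALUES (1, 1, 'user_a', '2011-05-01', '...', 1)")
print(conn.execute("SELECT COUNT(*) FROM POSTS WHERE is_spam = 1").fetchone()[0])
```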
a snippet of the leaked spreadsheet ‘HHP-2012.xlsx’
an example post on Mobile01
a snippet of the table ‘POSTS’ scraped from Mobile01
an example profile on Mobile01
a snippet of the table ‘PROFILES’ scraped from Mobile01
a snippet of the table ‘THREADS’ scraped from Mobile01
amount of data collected
Looking into the Dataset
● main observations
  ○ subtlety in spam posts
  ○ low spam post ratio of some spammers
  ○ types of spammer accounts
    ■ reputable accounts
    ■ throwaway accounts
  ○ first posts vs. replies in threads
  ○ pattern in submission time of spam posts
  ○ activeness of threads
  ○ collusive activities of spammers
examples of subtly written spam posts
Low Spam Post Ratio of Some Spammers
● our definition of spammer
  ○ a poster who has submitted any spam post
● only 33% of the posts from spammers are spam in this dataset
Different Types of Spammer Accounts
● reputable accounts
  ○ hired reputable writers
  ○ low spam ratio
● throwaway accounts
  ○ registered in batches
  ○ high spam ratio
  ○ low number of threads initiated
● others
  ○ hired non-reputable posters?
  ○ borrowed accounts?
pattern in registration time
First Posts vs. Replies
● first post
  ○ initiates the thread
  ○ richer in content
  ○ higher spam ratio
● reply (2nd, 3rd, … posts in a thread)
  ○ usually quite concise
Submission Time of Spam Posts
● hypothesis: spam posts are more often made during working hours than normal posts
  ○ because spamming is a job rather than a leisure activity
Activeness of Threads
● threads started by spam first posts are expected to be more active
  ○ written to draw attention and exposure
● measures of the ‘activeness’ of a thread
  ○ number of posts in the thread
  ○ number of clicks on the thread
Collusion between Spammers
● different spammer accounts submit spam posts to the same thread
  ○ fabricating a majority opinion
  ○ bumping the same spam thread
● 67% of the spam posts are in threads containing multiple spam posts
● the accounts could belong to different actual human posters
● or just one human logging in with different accounts
  ○ can still be seen as collusion between the accounts
Detection
● evaluation metric
● data splitting
● machine learning
● spam detection for first posts
● spam detection for replies
● spammer detection
Evaluation Metric
● the spam / spammer class is a small minority (<5%)
  ○ accuracy ✘
● high precision / recall on the spam / spammer class is preferable
  ○ F-measure ✔
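A minimal sketch of why accuracy is ruled out at this class ratio: a classifier that always predicts ‘non-spam’ scores high accuracy yet finds no spam at all, which the F-measure exposes. The numbers below are toy values, not from the dataset:

```python
# Precision, recall, and F-measure on the positive (spam) class.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Toy data: 100 posts, 4 spam (minority class), 96 non-spam.
y_true = [1] * 4 + [0] * 96
all_negative = [0] * 100            # always predict 'non-spam'

acc = sum(t == p for t, p in zip(y_true, all_negative)) / 100
print(acc)                                        # 0.96, looks great
print(precision_recall_f1(y_true, all_negative))  # (0.0, 0.0, 0.0)
```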
Data Splitting
● posts (spam detection)
  ○ made in 2011 → training set
  ○ made between Jan 2012 and May 2012 → test set
● user accounts (spammer detection)
  ○ submitted a post in 2011 → training set
  ○ submitted a post between Jan 2012 and May 2012 → test set
  ○ submitted posts in both periods → training set
● probabilistic predictions on posts will be used in spammer detection
Test Set* for Posts
● concern: capturing the writing habits of spammers?
  ○ favorite words, preferred writing style, etc.
  ○ might be purely personal preference
  ○ not essential to opinion spamming
● solution: removing posts by ‘cross-posters’ from the test set for posts
  ○ cross-posters: users who made a post in both the training set and the test set
  ○ resulting set: test set*
number of instances in each split
Machine Learning
● Scikit-Learn (Pedregosa et al. 2011)
  ○ machine learning in Python
● SVM with RBF kernel
  ○ outperforms SVM with linear kernel, logistic regression, AdaBoost, random forests, etc.
  ○ Python wrapper for LibSVM
  ○ scaling features to zero mean and unit variance (Hsu et al. 2003)
  ○ two primary hyperparameters (C, γ) to tune
    ■ 5-fold cross-validation on the training set
    ■ grid search on (C, γ) with F-measure as the metric to optimize
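The training setup on this slide can be sketched in Scikit-Learn roughly as follows. The feature matrix, labels, and (C, γ) grid values here are placeholders, not the ones used in the study:

```python
# Standardize features, SVM with RBF kernel, 5-fold CV grid search
# over (C, gamma) optimizing the F-measure, as described on the slide.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # stand-in for post features
y = (X[:, 0] + 0.5 * X[:, 1] > 1).astype(int)    # stand-in labels

pipe = Pipeline([("scale", StandardScaler()),    # zero mean, unit variance
                 ("svm", SVC(kernel="rbf"))])
grid = GridSearchCV(pipe,
                    {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]},
                    scoring="f1", cv=5)          # F-measure, 5-fold CV
grid.fit(X, y)
print(grid.best_params_)
```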
Spam Detection for First Posts
● specifically for first posts in threads
● expected to have better results than for replies
  ○ higher spam ratio
  ○ richer content
● only the performance on test set* is shown in the following slides for conciseness
Random Baseline
● predict whether a first post is spam according to the result of a fair coin flip
● precision ≈ ratio of spam
● recall ≈ 50%

features\metrics   precision   recall    F-measure
random             2.52%       55.71%    4.82%
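The baseline numbers are consistent with a back-of-the-envelope check: a fair coin flags about half of all posts, so precision approaches the spam ratio and recall approaches 50%. A sketch, using the 2.52% spam ratio from the table:

```python
# Fair-coin baseline: about half of all posts get flagged, so among the
# flagged posts the spam fraction (precision) ≈ the overall spam ratio r,
# and about half of the spam posts get flagged, so recall ≈ 0.5.
r = 0.0252                      # spam ratio from the table
precision, recall = r, 0.5
f = 2 * precision * recall / (precision + recall)
print(round(f, 4))              # → 0.048, close to the 4.82% in the table
```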
(Dimension-reduced) Bag-of-Words
● Chinese word segmentation with Jieba
● words with < 5 occurrences are removed
● words appearing in over 30% of the posts are removed
  ○ stop words and the like
● Randomized PCA (Halko et al. 2011) to reduce the dimension of the bag-of-words
  ○ efficient on large matrices
  ○ mitigating overfitting
  ○ speeding up the training process
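As an illustration of the filtering and reduction steps (not the study's actual code): in Scikit-Learn, `min_df`/`max_df` implement the two removal rules, and `svd_solver="randomized"` gives randomized PCA. Whitespace-split English tokens stand in for Jieba-segmented Chinese so the sketch runs anywhere, and the target dimension is a toy value rather than one chosen by cross-validation:

```python
# Bag-of-words with frequency filtering, then randomized PCA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# Toy corpus: 3 near-ubiquitous words plus 40 rarer ones (5 posts each).
posts = [f"phone battery screen sample_{i % 40} filler" for i in range(200)]

vec = CountVectorizer(tokenizer=str.split, token_pattern=None,
                      min_df=5,    # drop words in fewer than 5 posts
                      max_df=0.3)  # drop words in over 30% of posts
X = vec.fit_transform(posts).toarray()

pca = PCA(n_components=2, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)   # (200, 2)
```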
Number of Dimensions to Reduce to
● determined by the result of 5-fold cross-validation
● F-measure is highest when the bag-of-words is reduced to 150 dimensions
Bag-of-Words Performance
● tremendous performance boost
  ○ F-measure improved by ~46 percentage points
● how?

features\metrics   precision   recall    F-measure
random             2.52%       55.71%    4.82%
bag-of-words       50.00%      51.43%    50.70%