An Empirical Study on Selective Sampling in Active Learning for Splog Detection
Taichi Katayama (1), Takehito Utsuro (1), Yuuki Sato (2), Takayuki Yoshinaka (3), Yasuhide Kawada (4), Tomohiro Fukuhara (5)
(1) University of Tsukuba, (2) Konami Corporation, (3) Tokyo Denki University, (4) Navix Co., Ltd., (5) University of Tokyo
AIRWeb2009, April 21st, 2009, Madrid, Spain (WWW2009)
Background
• Opinion Mining from Blogs
• Splogs are Serious Noise in Opinion Mining
  – e.g., larger-scale statistics (Mar. 2008)
    • 40% of Japanese blog articles in BuzzPulse, nifty were splogs, Oct. 2007 to Feb. 2008
• Automatic detection is strongly needed.
[Figure: example of a keyword-stuffed blog]
[Figure: blog snippet retrieved with “FC Tokyo” (a football team in Japan); annotation: rumor of “FC Tokyo”]
[Figure: blog snippet retrieved with “LOUIS VUITTON Key case”; annotation: pop-up advertisement automatically inserted by the blog host system]
$50 Software Package for Massive Splog Creation
Featuring
• SEO
• Affiliate Program
[Diagram: satellite sites with in-links to a main site]
Previous studies on splog detection
• [P.Kolari 2007]
  – Words
  – URLs
  – Anchor texts
  – Links
  – HTML meta tags
• [Y.-R.Lin 2007]
  – Temporal self-similarities of
    • Posting time
    • Posting contents
    • Affiliated links
• [G.Mishne 2005]
  – Language models among the blog post, the comment, and pages linked by the comments
Evaluation with two data sets
“Do splogs change over time?”
1. Years 2007-2008 (720 sites)
2. Years 2008-2009 (720 sites)
Recall/Precision curves with confidence measure
[Figure: two panels, “Splog detection” and “Authentic blog detection”; x-axis: Recall (%), y-axis: Precision (%); curves: Train 07-08 (720 sites) vs. Train 07-08 (360 sites) + 08-09 (360 sites); Test 08-09 (40 sites)]
Purpose of This Research (1)
• Need to continuously update splog/authentic blog data sets year by year
• How can human supervision be reduced?
• Can an active learning framework work?
Purpose of This Research (2)
• Optimal Strategies for Selective Sampling in Active Learning
• Guided by a Certain Confidence Measure
  – random samples
  – samples with the least confidence
  – samples balanced with a confidence measure
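The three sampling strategies named on this slide can be sketched roughly as follows. This is only an illustration: the slides do not define how "balanced" samples are drawn, so `select_balanced` below is a hypothetical reading (instances spread evenly across the confidence ranking). The function names and the `pool_conf` dictionary (instance id mapped to confidence) are assumptions as well.

```python
import random

def select_random(pool_conf, k, rng):
    """Pick k instance ids uniformly at random from the pool."""
    return rng.sample(sorted(pool_conf), k)

def select_least_confident(pool_conf, k):
    """Pick the k instance ids with the smallest confidence values."""
    return sorted(pool_conf, key=pool_conf.get)[:k]

def select_balanced(pool_conf, k):
    """Hypothetical 'balanced' strategy (an assumption, not from the slides):
    pick k ids evenly spaced across the confidence ranking.
    Assumes 1 <= k <= len(pool_conf)."""
    ranked = sorted(pool_conf, key=pool_conf.get)
    if k == 1:
        return ranked[:1]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]
```

With a pool of ten instances whose confidences are 0.0, 0.1, ..., 0.9, the least-confidence strategy returns the four lowest-ranked ids, while the hypothetical balanced strategy returns ids from the bottom, middle, and top of the ranking.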
Outline
1. Definition of splog sites
2. Splog detection by machine learning
  – SVM
  – Confidence Measure
  – Features
3. Active learning
4. Evaluation
5. Future work
Definition of splog sites
• If one of the following holds for a given blog site, it is regarded as a splog
  – originally written text is not included
  – originally written text is included, but many
    • “links to affiliated sites” or
    • “advertisement articles” or
    • “articles with adult content”
    are included (judged individually by considering the contents of each blog)
• Otherwise, the given blog site is an authentic blog
Splog Detection by SVMs
• Tool
  – TinySVM
• Kernel function
  – 2nd order / linear
• Confidence measure
  – the distance from the separating hyperplane to each test instance
A Confidence Measure
[Figure: splog and authentic blog instances on either side of the separating hyperplane, with a lower bound on the distance for the splog side (Lower Bound (splog)) and for the authentic blog side (Lower Bound (authentic blog))]
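As a rough illustration of the confidence measure, the sketch below computes the distance from a test instance to the separating hyperplane for a linear kernel and applies a per-class lower bound. It does not use TinySVM; the weight vector, the bias, the convention that the positive side is "splog", and all function names are assumptions made for the example.

```python
import math

def decision_value(weights, bias, x):
    """Signed linear SVM output f(x) = w . x + b."""
    return sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias

def confidence(weights, bias, x):
    """Distance from x to the separating hyperplane: |f(x)| / ||w||."""
    norm = math.sqrt(sum(w_i * w_i for w_i in weights))
    return abs(decision_value(weights, bias, x)) / norm

def classify_with_lower_bound(weights, bias, x, lb_splog, lb_authentic):
    """Return 'splog' or 'authentic' only when the distance reaches the
    lower bound of the predicted class; otherwise return None.
    Assumption: f(x) >= 0 means splog."""
    f = decision_value(weights, bias, x)
    conf = confidence(weights, bias, x)
    if f >= 0:
        return "splog" if conf >= lb_splog else None
    return "authentic" if conf >= lb_authentic else None
```

Raising a lower bound trades recall for precision on that class, which is the kind of trade-off a recall/precision curve over a confidence measure traces out.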
Features for splog detection
1. Total frequency of URLs not linked from splogs
2. Co-occurrence between Noun Phrases and Splogs
  • Sum of $\chi^2(\text{splog}, \text{noun phrase } w)$
3. Noun Phrases in Anchor Texts and linked URLs
  • Total frequency of anchor text noun phrases
    • in splogs
    • out-linked to splog URLs and Blacklist URLs
  • Total frequency of anchor text noun phrases
    • in splogs
    • out-linked to authentic blog URLs and Whitelist URLs
Feature 1: URLs not linked from splogs
[Diagram: URLs out-linked from the authentic blogs and splogs of the training set]
• Whitelist: URLs included only in authentic blogs, with more than one inward link from authentic blogs
• Blacklist: URLs included only in splogs, with more than one inward link from splogs
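The Whitelist/Blacklist construction can be sketched as below. The sketch assumes that "more than one inward link" means links from more than one distinct training blog; the slides do not say whether repeated links from the same blog count. The function name and the example URLs are hypothetical.

```python
from collections import Counter

def build_url_lists(splog_links, authentic_links):
    """Each argument is a list with one iterable of out-linked URLs per
    training blog. Returns (whitelist, blacklist) as sets of URLs."""
    # Count, for each URL, how many distinct blogs of each class link to it.
    splog_in = Counter(u for links in splog_links for u in set(links))
    auth_in = Counter(u for links in authentic_links for u in set(links))
    # Whitelist: linked only from authentic blogs, by more than one of them.
    whitelist = {u for u, n in auth_in.items() if n > 1 and u not in splog_in}
    # Blacklist: linked only from splogs, by more than one of them.
    blacklist = {u for u, n in splog_in.items() if n > 1 and u not in auth_in}
    return whitelist, blacklist
```

A URL linked from both classes, or from a single blog only, ends up in neither list.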
Value of the Whitelist URLs feature
$$\sum_{u} \log\left( \left(\text{total frequency of } u \text{ in the whole training instances of authentic blog homepages}\right) \times \left(\text{total frequency of } u \text{ in the test instance}\right) \right)$$
u: Whitelist URLs
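A minimal sketch of this feature value, read as the sum over Whitelist URLs u of the log of (training frequency of u in authentic blog homepages) times (frequency of u in the test instance). The slide does not say what happens when a URL is absent from the test instance, so skipping zero products here is an assumption, as are the function and parameter names.

```python
import math
from collections import Counter

def whitelist_feature(whitelist, train_authentic_urls, test_urls):
    """train_authentic_urls: all URL occurrences in the training authentic
    blog homepages; test_urls: URL occurrences in the test instance."""
    train_freq = Counter(train_authentic_urls)
    test_freq = Counter(test_urls)
    total = 0.0
    for u in whitelist:
        product = train_freq[u] * test_freq[u]
        if product > 0:  # assumption: URLs missing on either side contribute nothing
            total += math.log(product)
    return total
```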
Feature 2: Noun Phrases
[Diagram: occurrences of a noun phrase w in the splogs and authentic blogs of the training set]
• freq(splog, w) = a
• freq(splog, ¬w) = b
• freq(authentic blog, w) = c
• freq(authentic blog, ¬w) = d
w: a noun phrase
Value of the splog noun phrase feature
$$\chi^2(\text{splog}, w) = \frac{(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$
$$\sum_{w} \log\left( \chi^2(\text{splog}, w) \times \left(\text{total frequency of } w \text{ in the test instance}\right) \right)$$
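The computation can be sketched as follows, taking the statistic as (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) with a, b, c, d the four training-set frequencies of a noun phrase w. The textbook chi-square statistic carries an extra factor N = a+b+c+d in the numerator, which is constant for a fixed training set. Skipping noun phrases whose product is zero is an assumption, and the function names are hypothetical.

```python
import math

def chi_square(a, b, c, d):
    """a=freq(splog, w), b=freq(splog, not w),
    c=freq(authentic, w), d=freq(authentic, not w)."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: treat as no association
    return (a * d - b * c) ** 2 / denom

def noun_phrase_feature(chi_by_phrase, test_counts):
    """Sum over noun phrases w of log(chi2(splog, w) * freq of w in test)."""
    total = 0.0
    for w, count in test_counts.items():
        product = chi_by_phrase.get(w, 0.0) * count
        if product > 0:  # assumption: zero products contribute nothing
            total += math.log(product)
    return total
```

For example, a noun phrase seen in every splog and in no authentic blog (a = d = 10, b = c = 0) scores 1.0, while one spread evenly over both classes scores 0.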
Feature 3: Noun Phrases in Anchor Texts and linked URLs
[Diagram: a splog site s containing anchors of the form <a href="...">... w ...</a>, where w is a noun phrase in the anchor text; the linked URLs are grouped into authentic blog URLs / Whitelist URLs, splog URLs / Blacklist URLs, and other URLs]
• AncfW(w,s) = frequency of w in anchor texts of s out-linked to authentic blog URLs or Whitelist URLs
• AncfB(w,s) = frequency of w in anchor texts of s out-linked to splog URLs or Blacklist URLs
Noun Phrases in Anchor Texts and linked URLs: two features
• The value of the feature “anchor text noun phrase out-linked to Blacklist URLs” for a test instance blog homepage:
$$\sum_{w} \sum_{s} \log\left( \mathrm{AncfB}(w, s) \times \mathrm{AncfB}(w, t) \right)$$
• The value of the feature “anchor text noun phrase out-linked to Whitelist URLs” for a test instance blog homepage:
$$\sum_{w} \sum_{s} \log\left( \mathrm{AncfW}(w, s) \times \mathrm{AncfW}(w, t) \right)$$
w: a noun phrase
s: a training splog homepage (the inner sum runs over the training splog homepages)
t: a test instance blog homepage
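Both features share the same shape, a double sum over noun phrases w and training splog homepages s of log(Ancf(w, s) × Ancf(w, t)), so a single helper can serve for the Blacklist and the Whitelist variant. The sketch below assumes the anchor-text counts have already been extracted into dictionaries, and it skips zero products; both choices, and the names used, are assumptions for illustration.

```python
import math

def anchor_feature(train_counts, test_counts):
    """train_counts: {splog_site: {noun_phrase: Ancf(w, s)}} for training
    splog homepages; test_counts: {noun_phrase: Ancf(w, t)} for the test
    instance. Pass AncfB counts for the Blacklist feature and AncfW counts
    for the Whitelist feature."""
    total = 0.0
    for w, t_count in test_counts.items():
        for s_counts in train_counts.values():
            product = s_counts.get(w, 0) * t_count
            if product > 0:  # assumption: zero products contribute nothing
                total += math.log(product)
    return total
```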
Framework of Active Learning
• Training set (initial size of 10: 4 splogs and 6 authentic blogs)
• Pool of unlabeled instances (initial size of 3504: 1296 splogs and 2208 authentic blogs)
• Each cycle:
  – Train an SVM classifier on the training set
  – Selective sampling of 4 unlabeled sites from the pool
  – Human supervision labels the 4 sites
  – The 4 labeled sites are added to the training set
• 250 cycles, up to 1010 training instances
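The cycle on this slide can be sketched as a generic pool-based loop. The `train`, `select`, and `label` callbacks are placeholders (assumptions) standing in for SVM training, selective sampling, and human supervision; only the bookkeeping of the training set and the pool is shown.

```python
def active_learning(train_set, pool, train, select, label, cycles=250, batch=4):
    """train_set: list of (instance, label) pairs; pool: list of unlabeled
    instances. Returns (final_model, train_set, remaining_pool)."""
    train_set, pool = list(train_set), list(pool)
    for _ in range(cycles):
        if len(pool) < batch:
            break  # not enough unlabeled instances left for a full batch
        model = train(train_set)             # e.g. fit an SVM classifier
        chosen = select(model, pool, batch)  # selective sampling
        for x in chosen:
            pool.remove(x)
            train_set.append((x, label(x)))  # human supervision
    return train(train_set), train_set, pool
```

Starting from 10 labeled instances and a pool of 3504, 250 cycles of 4 sites each give 10 + 250 × 4 = 1010 training instances and leave 2504 instances in the pool.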
Statistics of Splog/Authentic Blogs Data Set

Data Sets          # of splogs   # of authentic blogs   total
Years 2008-2009    1445          2459                   3904