Using Blog Properties to Improve Retrieval

Gilad Mishne
ISLA, University of Amsterdam
gilad@science.uva.nl

ICWSM’2007 Boulder, Colorado, USA

Abstract

This paper describes three simple heuristics that improve opinion retrieval effectiveness by using blog-specific properties. Blog timestamps are used to increase the retrieval scores of blog posts published near the time of a significant event related to a query; an inexpensive approach to comment amount estimation is used to identify the level of opinion expressed in a post; and query-specific weights are used to change the importance of spam filtering for different types of queries. Overall, these methods, combined with non-blog-specific retrieval approaches, result in substantial improvements over the state of the art.

Keywords

Blog retrieval, opinion retrieval, TREC

1. Introduction

The annual Text Retrieval Conference (TREC) is organized around a set of separate tracks, each investigating a particular retrieval domain, and each including one or more tasks in this domain. In 2006, TREC featured, for the first time, a track dedicated to blog retrieval: the TREC Blog Track. In particular, the track included an opinion retrieval task, where participants were requested to locate blog posts expressing an opinion about a topic in a large collection of posts. The polarity of the sentiment in a post was not required to be identified: rather, any post answering the question “What do people think about [the entity in the query]” was considered relevant. Queries included mostly person names, products, and brand names, taken from a query log of a blog search engine. More details about the opinion retrieval task, the data used for it, the queries, and the assessments carried out are found in [10].

Our approach to the opinion retrieval task identified three aspects involved in locating opinionated blog posts: topical relevance, opinion expression, and post quality. The first, topical relevance, is the degree to which a post deals with the given topic; this is similar to relevance as defined for ad-hoc retrieval tasks, such as many of the traditional TREC tasks. The second aspect, opinion expression, involves identifying whether a post contains an opinion: the degree to which it contains subjective information about a topic. Finally, the post quality is an estimate of the (query-independent) quality of a blog post, under the assumption that higher-quality posts are more likely to contain meaningful opinions and are preferred by users. In this last category of quality we also include detection of spam in blogs, defining a spam blog post as a low-quality one.

We addressed each of these three aspects independently of the rest, using a wide range of techniques: some of those were blog-specific, and some were general methods used in various retrieval settings. Each technique resulted in a separate relevance score for each blog post: standard information retrieval approaches resulted in a ranking of posts by their topical relevance to a query; sentiment analysis was used to rank all posts by the amount of sentiment contained in them; spam filtering was used to rank all posts by their estimated spam level; and so on. The final ranking of a blog post was obtained by combining the partial scores assigned to it by the different approaches using a linear combination. Overall, this method proved to be one of the top performers at TREC; more information about it can be found in [7].

Of the different methods we used, in this paper we describe three, one from each of the high-level aspects we investigated; all three use properties which are specific to the blogspace, and all three are based on a straightforward, inexpensive approach. We show that each of these techniques improves over a baseline, and that, combined with other techniques we use, they also improve over the state of the art.

2. Improving Retrieval using Blog Properties

We now describe the three approaches in more detail; evaluation of each follows in the next section. The first approach we discuss uses the timelined nature of blogs to identify periods of increased possible relevance. The second relates the number of comments on a blog post to the likelihood of an opinion being present in the post. The last of the methods we describe uses query-dependent spam filtering to reduce noise in the collection.

2.1 Temporal Relevance Feedback

The blogspace is a dynamic medium, quickly responding to ongoing events; as a result, a substantial number of blog search queries are related to specific events, in many cases news-oriented ones [8]. The distribution of dates in relevant documents for these queries is not uniform, but concentrated around a short period during which the event took place. For example, Figure 1 shows the distribution of dates in relevant documents for the query “state of the union,” which seeks opinions about the presidential state of the union address, delivered on the evening of January 31st, 2006: clearly, relevant documents are found mostly in the few days following