Using Blog Properties to Improve Retrieval

Gilad Mishne
ISLA, University of Amsterdam
gilad@science.uva.nl

Abstract

This paper describes three simple heuristics which improve opinion retrieval effectiveness by using blog-specific properties. Blog timestamps are used to increase the retrieval scores of blog posts published near the time of a significant event related to a query; an inexpensive approach to comment amount estimation is used to identify the level of opinion expressed in a post; and query-specific weights are used to change the importance of spam filtering for different types of queries. Overall, these methods, combined with non-blog-specific retrieval approaches, result in substantial improvements over state-of-the-art.

Keywords

Blog retrieval, opinion retrieval, TREC

1. Introduction

The annual Text Retrieval Conference (TREC) is organized around a set of separate tracks, each investigating a particular retrieval domain, and each including one or more tasks in this domain. In 2006, TREC featured, for the first time, a track dedicated to blog retrieval: the TREC Blog Track. In particular, the track included an opinion retrieval task, where participants were requested to locate blog posts expressing an opinion about a topic in a large collection of posts. The polarity of the sentiment in a post was not required to be identified: rather, any post answering the question “What do people think about [the entity in the query]” was considered relevant. Queries included mostly person names, products, and brand names, taken from a query log of a blog search engine. More details about the opinion retrieval task, the data used for it, the queries, and the assessments carried out are found in [10].

Our approach to the opinion retrieval task identified three aspects involved in locating opinionated blog posts: topical relevance, opinion expression, and post quality. The first, topical relevance, is the degree to which a post deals with the given topic; this is similar to relevance as defined for ad-hoc retrieval tasks, such as many of the traditional TREC tasks. The second aspect, opinion expression, involves identifying whether a post contains an opinion: the degree to which it contains subjective information about a topic. Finally, the post quality is an estimation of the (query-independent) quality of a blog post, under the assumption that higher-quality posts are more likely to contain meaningful opinions and are preferred by users. In this last category of quality we also include detection of spam in blogs, defining a spam blog post as a low-quality one.

We addressed each of these three aspects independently of the rest, using a wide range of techniques: some of those were blog-specific, and some were general methods used in various retrieval settings. Each technique resulted in a separate relevance score for each blog post: standard information retrieval approaches resulted in a ranking of posts by their topical relevance to a query; sentiment analysis was used to rank all posts by the amount of sentiment contained in them; spam filtering was used to rank all posts by their estimated spam level; and so on. The final ranking of a blog post was obtained by combining the partial scores assigned to it by the different approaches using a linear combination. Overall, this method proved to be one of the top performers at TREC; more information about it is found in [7].

Of the different methods we used, in this paper we describe three, one from each of the high-level aspects we investigated; all three use properties which are specific to the blogspace, and all three are based on a straightforward, inexpensive approach. We show that each of these techniques improves over a baseline, and that, combined with other techniques we use, they also improve over state-of-the-art.

2. Improving Retrieval using Blog Properties

We now describe in more detail the three approaches; evaluation of each follows in the next section. The first approach we discuss uses the timelined nature of blogs to identify periods of increased possible relevance. The second relates the number of comments on a blog post to the likelihood of an opinion being present in the post. The last of the methods we describe uses query-dependent spam filtering to reduce noise in the collection.

2.1 Temporal Relevance Feedback

The blogspace is a dynamic medium, quickly responding to ongoing events; as a result, a substantial number of blog search queries are related to specific events, in many cases news-oriented ones [8]. The distribution of dates in relevant documents for these queries is not uniform, but concentrated around a short period during which the event took place. For example, Figure 1 shows the distribution of dates in relevant documents for the query “state of the union,” which seeks opinions about the presidential state of the union address, delivered on the evening of January 31st, 2006: clearly, relevant documents are found mostly in the few days following

ICWSM’2007 Boulder, Colorado, USA
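
The linear combination of partial scores mentioned in the Introduction can be illustrated with a minimal sketch. It is not the system described in the paper: the component names, the weights, the post identifiers, and the score values are all hypothetical, and the sketch assumes that each component already produces scores on a comparable scale (here, between 0 and 1).

```python
from typing import Dict, List, Tuple


def combine_scores(
    partial_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank posts by a weighted sum of their per-component scores.

    partial_scores maps a component name to {post_id: score}.
    A post missing from a component is treated as scoring 0.0 there
    (an assumption made for this sketch).
    """
    post_ids = set()
    for scores in partial_scores.values():
        post_ids.update(scores)

    combined = {}
    for post_id in post_ids:
        combined[post_id] = sum(
            weights[name] * scores.get(post_id, 0.0)
            for name, scores in partial_scores.items()
        )

    # Highest combined score first; ties are broken by post id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores for three posts, one dictionary per aspect.
partial = {
    "topical": {"post-a": 0.9, "post-b": 0.6, "post-c": 0.4},
    "opinion": {"post-a": 0.2, "post-b": 0.8, "post-c": 0.5},
    "quality": {"post-a": 0.5, "post-b": 0.7, "post-c": 0.1},
}
# Hypothetical weights; these are not values reported in the paper.
weights = {"topical": 0.6, "opinion": 0.3, "quality": 0.1}

ranking = combine_scores(partial, weights)
for post_id, score in ranking:
    print(f"{post_id}\t{score:.2f}")
```

With these made-up numbers, the most topically relevant post (post-a) is overtaken by post-b, whose higher opinion and quality scores outweigh its lower topical score. This is the kind of reordering that combining the three aspects is meant to allow.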


