CS 1655 / Spring 2013
Secure Data Management and Web Applications
04 – Information Retrieval
Alexandros Labrinidis
University of Pittsburgh

What is Information Retrieval?
• Information organized into documents
  – Large number of documents
  – Data in documents is unstructured
• Quest:
  – Locate documents that match the user's needs
  – How:
    • Keywords
    • Sample documents
• Like finding a needle in a haystack
  – Or worse: a hay-colored needle!

IR vs Databases
• Database systems:
  – Structured data
  – Complex data models
  – Updates/transactions/concurrency control
• Information retrieval:
  – Unstructured data
  – Collection of documents
  – Approximate searching/ranking of results

How to retrieve information?
• One way:
  – Get keywords from user
  – Scan entire collection of documents
  – Return documents that match
  – Problems?
    • Will not scale to large document collections
      – E.g., the Web
    • Will not rank results
      – E.g., too many matches for “Labrinidis”
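
The "one way" on the slide above can be sketched in a few lines of Python. This is only an illustration: the function name `naive_search`, the sample documents, and the choice that a document "matches" when it contains every keyword are assumptions, not part of the lecture.

```python
def naive_search(documents, keywords):
    """Return ids of documents that contain every keyword.

    Scans the full text of every document on every query, which is why
    this approach does not scale to large collections such as the Web.
    """
    wanted = {k.lower() for k in keywords}
    matches = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if wanted <= words:  # assumption: all keywords must be present
            matches.append(doc_id)
    return matches  # unranked: results come back in scan order


# Hypothetical three-document collection.
docs = {
    "d1": "databases store structured data",
    "d2": "information retrieval ranks unstructured documents",
    "d3": "retrieval of structured data from databases",
}
print(naive_search(docs, ["structured", "data"]))
```

Both problems from the slide are visible here: the loop touches every document for every query, and the returned list carries no notion of which match is better.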

Relevance Ranking using terms
• Given a term t, how relevant is a document d to the term?
• Approach #1:
  – Use the number of occurrences of t in d
  – n(d, t)
• Approach #2:
  – Normalize the number of occurrences of t in d by the total number of terms in d, n(d)

    $r(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$

Handling Multiple Query Terms
• Simple way:
  – Compute independent relevance measures
  – Add them up
• Better way:
  – Use inverse document frequency for each term
    • n(t): number of documents that contain term t; the inverse document frequency is 1/n(t)
  – Relevance of document d to set of terms Q:

    $r(d, Q) = \sum_{t \in Q} \frac{r(d, t)}{n(t)}$
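
The two formulas on the slides above translate directly into code. The sketch below is a minimal illustration, assuming the natural logarithm (the slides do not state a base), whitespace tokenization, and a made-up three-document collection; the function names are hypothetical.

```python
import math
from collections import Counter


def term_relevance(doc_terms, term):
    """r(d, t) = log(1 + n(d, t) / n(d)), using the natural log (assumption)."""
    if not doc_terms:
        return 0.0
    n_dt = Counter(doc_terms)[term]  # occurrences of t in d
    return math.log(1 + n_dt / len(doc_terms))


def document_frequency(collection):
    """n(t): number of documents in the collection that contain term t."""
    df = Counter()
    for terms in collection.values():
        df.update(set(terms))  # count each document at most once per term
    return df


def query_relevance(doc_terms, query_terms, df):
    """r(d, Q) = sum over t in Q of r(d, t) / n(t)."""
    score = 0.0
    for t in query_terms:
        n_t = df.get(t, 0)
        if n_t > 0:  # a term found in no document contributes nothing
            score += term_relevance(doc_terms, t) / n_t
    return score


# Hypothetical collection, already split into terms.
collection = {
    "d1": "oranges from florida".split(),
    "d2": "apples from washington".split(),
    "d3": "florida oranges and florida lemons".split(),
}
df = document_frequency(collection)
query = ["florida", "oranges"]
scores = {d: query_relevance(terms, query, df) for d, terms in collection.items()}
for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, round(scores[doc_id], 4))
```

Dividing by n(t) means a term that appears in many documents adds little to the score, while a rare term adds more.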

Not all words created equal
• Google query:
  – the oranges from florida
    • http://www.google.com
• “the” and “from” are very common and are omitted from the search
  – These are called stop words

Other factors affecting relevance
• Word proximity
  – A document in which two query terms appear close together should rank higher than one in which they are far apart
  – Example?
• Using hyperlinks
  – E.g., site popularity, hubs, authorities (more on this next time)
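
Stop-word removal and word proximity from the two slides above can both be illustrated briefly. In this sketch the stop-word list is a tiny made-up set (not any search engine's actual list), and `min_distance` is a hypothetical helper that measures proximity as the smallest gap, in word positions, between two terms.

```python
# Illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"the", "from", "a", "an", "of", "and", "in"}


def remove_stop_words(query):
    """Drop very common words from a query before searching."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]


def min_distance(doc_terms, term_a, term_b):
    """Smallest distance in word positions between term_a and term_b.

    Returns None if either term is missing from the document.
    """
    last_a = last_b = None
    best = None
    for pos, word in enumerate(doc_terms):
        if word == term_a:
            last_a = pos
        if word == term_b:
            last_b = pos
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best


terms = remove_stop_words("the oranges from florida")
near = "fresh florida oranges for sale".split()
far = "oranges are grown in many places including florida".split()
print(terms)
print(min_distance(near, "oranges", "florida"))
print(min_distance(far, "oranges", "florida"))
```

A ranking function could then favor the document with the smaller distance; how much weight proximity should get is a design choice the slides leave open.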

Scaling to large collections
• Effective index structure is crucial
• Documents containing a specific term are located using an inverted index
  – Each keyword maps to a list of documents that contain it
• How to support OR/AND semantics?
  – OR: compute union of sets
  – AND: compute intersection of sets

How to measure effectiveness
• Approximate, incomplete results are common
  – Especially if using an index
• How to measure quality of these results?
• False negative:
  – A relevant document was not returned
• False positive:
  – An irrelevant document was returned
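
The inverted index from the "Scaling to large collections" slide, together with its OR and AND semantics, can be sketched as follows. The function names and sample documents are illustrative assumptions; Python sets stand in for the per-keyword document lists.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_or(index, keywords):
    """OR semantics: union of the document sets of all keywords."""
    result = set()
    for k in keywords:
        result |= index.get(k.lower(), set())
    return result


def search_and(index, keywords):
    """AND semantics: intersection of the document sets of all keywords."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical three-document collection.
documents = {
    "d1": "database management systems",
    "d2": "information retrieval systems",
    "d3": "database transactions and concurrency control",
}
index = build_inverted_index(documents)
print(sorted(search_or(index, ["database", "retrieval"])))
print(sorted(search_and(index, ["database", "systems"])))
```

The index is built once; each query then touches only the document sets of its own keywords instead of scanning the whole collection.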

Effectiveness metrics
• Precision
  – What percentage of the retrieved documents are relevant to the query
• Recall
  – What percentage of the documents relevant to the query were retrieved

How to improve effectiveness
• Better ranking
• Better indexing
• Classification of documents
  – Instead of a “global” pool, focus on a smaller set of documents that are related
• Feedback from users
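
Precision and recall from the "Effectiveness metrics" slide reduce to two set ratios. The sketch below uses made-up document ids, and returning 0.0 when a set is empty is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Assumption: report 0.0 instead of dividing by zero on an empty set.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: d2 and d4 are false positives, d5 is a false negative.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

False positives lower precision and false negatives lower recall, so the two metrics correspond to the two error types on the "How to measure effectiveness" slide.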

Focused Crawling
• google.com
  – Search for “database management systems”
• google.com
  – Search for “database management systems”
  – +site: pitt.edu
• scholar.google.com
  – Search for “database management systems”