text retrieval algorithms

Text Retrieval Algorithms Jimmy Lin University of - PowerPoint PPT Presentation

Data-Intensive Information Processing Applications, Session #4: Text Retrieval Algorithms. Jimmy Lin, University of Maryland. Tuesday, February 23, 2010. This work is licensed under a Creative Commons Attribution-Noncommercial-Share


  1. Data-Intensive Information Processing Applications, Session #4: Text Retrieval Algorithms. Jimmy Lin, University of Maryland. Tuesday, February 23, 2010. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

  2. Source: Wikipedia (Japanese rock garden)

  3. Today’s Agenda
      - Introduction to information retrieval
      - Basics of indexing and retrieval
      - Inverted indexing in MapReduce
      - Retrieval at scale

  4. First, nomenclature…
      - Information retrieval (IR)
        - Focus on textual information (= text/document retrieval)
        - Other possibilities include image, video, music, …
      - What do we search?
        - Generically, “collections”
        - Less-frequently used, “corpora”
      - What do we find?
        - Generically, “documents”
        - Even though we may be referring to web pages, PDFs, PowerPoint slides, paragraphs, etc.

  5. Information Retrieval Cycle (diagram)
      - Stages: Source Selection → Query Formulation → Search → Selection → Examination → Delivery
      - Intermediate products: Resource, Query, Results, Documents, Information
      - Feedback loops: system discovery, vocabulary discovery, concept discovery, document discovery, source reselection

  6. The Central Problem in Search (diagram)
      - Searcher: concepts → query terms, e.g. “tragic love story”
      - Author: concepts → document terms, e.g. “fateful star-crossed romance”
      - Do these represent the same concepts?

  7. Abstract IR Architecture (diagram)
      - Online: Query → Representation Function → Query Representation
      - Offline: Documents → Representation Function → Document Representation → Index
      - Comparison Function: matches the Query Representation against the Index and returns Hits

  8. How do we represent text?
      - Remember: computers don’t “understand” anything!
      - “Bag of words”
        - Treat all the words in a document as index terms
        - Assign a “weight” to each term based on “importance” (or, in simplest case, presence/absence of word)
        - Disregard order, structure, meaning, etc. of the words
        - Simple, yet effective!
      - Assumptions
        - Term occurrence is independent
        - Document relevance is independent
        - “Words” are well-defined
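The bag-of-words idea above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and lowercasing (both simplifications); the function name `bag_of_words` is hypothetical and not from the slides.

```python
from collections import Counter


def bag_of_words(text):
    """Map each term to its count, ignoring word order and structure."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return Counter(text.lower().split())


bag = bag_of_words("red fish blue fish")

# Simplest-case weighting: presence/absence of a word.
presence = set(bag)

# Word order is disregarded: a reordered document yields the same bag.
same_bag = bag_of_words("blue fish red fish") == bag

print(bag["fish"], sorted(presence), same_bag)
```

Using raw counts as the “weight” is only one option; the TF.IDF slide later in the deck describes a weighting that also accounts for how common a term is across the collection.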

  9. What’s a word?
      - 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
      - ﻒﻴﺠﻳر كرﺎﻣ لﺎﻗو - ﻢﺳﺎﺑ ﻖﻃﺎﻨﻟا ﺔﻴﻠﻴﺋاﺮﺳﻹا ﺔﻴﺟرﺎﺨﻟا - ﻞﺒﻗ نورﺎﺷ نإ ةرﺎﻳﺰﺑ ﻰﻟوﻷا ةﺮﻤﻠﻟ مﻮﻘﻴﺳو ةﻮﻋﺪﻟا ﺮﻘﻤﻟا ﺔﻠﻳﻮﻃ ةﺮﺘﻔﻟ ﺖﻧﺎآ ﻲﺘﻟا ،ﺲﻧﻮﺗ مﺎﻋ نﺎﻨﺒﻟ ﻦﻣ ﺎﻬﺟوﺮﺧ ﺪﻌﺑ ﺔﻴﻨﻴﻄﺴﻠﻔﻟا ﺮﻳﺮﺤﺘﻟا ﺔﻤﻈﻨﻤﻟ ﻲﻤﺳﺮﻟا 1982.
      - Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.
      - भारत सरकार ने आिथरॎरॎक सवेरॎेरॎक्स क्सण मेःेः िवत्थ त्थीय वषरॎरॎ 2005-06 मेःेः सात फ़ीसदी िवकास दर हािसल करने का आकलन िकया है और कर सुधार पर ज़ोर िदया है
      - 日米連合で台頭中国に対処 … アーミテージ前副長官提言
      - 조재영 기자 = 서울시는 25 일 이명박 시장이 ` 행정중심복합도시 '' 건설안 에 대해 ` 군대라도 동원해 막고싶은 심정 '' 이라고 말했다는 일부 언론의 보도를 부인했다 .

  10. Sample Document and its “Bag of Words”

      Sample document:

      McDonald's slims down spuds

      Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

      NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

      But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

      But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

      Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …

      “Bag of Words”:
      - 14 × McDonalds
      - 12 × fat
      - 11 × fries
      - 8 × new
      - 7 × french
      - 6 × company, said, nutrition
      - 5 × food, oil, percent, reduce, taste, Tuesday
      - …

  11. Counting Words… (diagram)
      - Documents → case folding, tokenization, stopword removal, stemming → Bag of Words → Inverted Index
      - Left out of the bag of words: syntax, semantics, word knowledge, etc.
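The four preprocessing steps named on this slide can be chained as below. This is a rough sketch only: the stopword list is a tiny hypothetical one, and `crude_stem` is a toy suffix stripper standing in for a real stemmer (it is not the Porter algorithm or any library's implementation).

```python
import re

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "and", "in", "the", "of", "to"}


def crude_stem(token):
    """Toy stemmer: strip a single trailing plural 's' (illustration only)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    folded = text.lower()                              # case folding
    tokens = re.findall(r"[a-z0-9]+", folded)          # tokenization (ASCII only)
    kept = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [crude_stem(t) for t in kept]               # stemming


terms_ham = preprocess("Green eggs and ham")
terms_cat = preprocess("Cat in the hat")
print(terms_ham, terms_cat)
```

The ASCII-only tokenizer is a deliberate simplification: as the “What’s a word?” slide shows, many languages need far more than splitting on non-letters.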

  12. Boolean Retrieval
      - Users express queries as a Boolean expression
        - AND, OR, NOT
        - Can be arbitrarily nested
      - Retrieval is based on the notion of sets
        - Any given query divides the collection into two sets: retrieved, not-retrieved
        - Pure Boolean systems do not define an ordering of the results

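  13. Inverted Index: Boolean Retrieval
      - Doc 1: one fish, two fish
      - Doc 2: red fish, blue fish
      - Doc 3: cat in the hat
      - Doc 4: green eggs and ham

      Term-document incidence (columns 1 to 4) and the resulting postings lists:

      | Term | 1 | 2 | 3 | 4 | Postings |
      |---|---|---|---|---|---|
      | blue | | 1 | | | 2 |
      | cat | | | 1 | | 3 |
      | egg | | | | 1 | 4 |
      | fish | 1 | 1 | | | 1, 2 |
      | green | | | | 1 | 4 |
      | ham | | | | 1 | 4 |
      | hat | | | 1 | | 3 |
      | one | 1 | | | | 1 |
      | red | | 1 | | | 2 |
      | two | 1 | | | | 1 |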
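The postings lists for the four toy documents on this slide can be built with a short script. This is a sketch under stated assumptions: punctuation is already stripped from the document strings, the stopword set is hypothetical, and `normalize` hard-codes the single stemming case ("eggs" to "egg") that the slide's index implies.

```python
from collections import defaultdict

docs = {
    1: "one fish two fish",
    2: "red fish blue fish",
    3: "cat in the hat",
    4: "green eggs and ham",
}

# Assumption: these three words are treated as stopwords, as the slide's index suggests.
STOPWORDS = {"in", "the", "and"}


def normalize(token):
    # Stand-in for stemming: only the one case needed for this toy collection.
    return "egg" if token == "eggs" else token


def build_index(docs):
    """Return term -> sorted list of ids of documents containing the term."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending ids keep every postings list sorted
        terms = {normalize(t) for t in docs[doc_id].split() if t not in STOPWORDS}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)


index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Processing documents in ascending id order is what keeps each postings list sorted, which the linear-time traversal on the next slide relies on.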

  14. Boolean Retrieval
      - To execute a Boolean query:
        - Build query syntax tree, e.g. ( blue AND fish ) OR ham: OR at the root, with children ( blue AND fish ) and ham
        - For each clause, look up postings: blue → 2; fish → 1, 2
        - Traverse postings and apply Boolean operator
      - Efficiency analysis
        - Postings traversal is linear (assuming sorted postings)
        - Start with shortest posting first
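The query ( blue AND fish ) OR ham from this slide can be evaluated with two linear merges over sorted postings. The sketch below is one straightforward way to do it, not necessarily how any particular engine implements it; the function names are hypothetical, and the postings for ham are taken from the index on the previous slide.

```python
def intersect(p1, p2):
    """AND: walk two sorted postings lists in lockstep (linear time)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1, p2):
    """OR: merge two sorted postings lists, keeping each doc id once."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    out.extend(p1[i:])
    out.extend(p2[j:])
    return out


def intersect_all(postings_lists):
    """Multi-way AND over a non-empty collection, shortest list first."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result


postings = {"blue": [2], "fish": [1, 2], "ham": [4]}
result = union(intersect(postings["blue"], postings["fish"]), postings["ham"])
print(result)  # [2, 4]
```

Starting a multi-way AND with the shortest list keeps the intermediate result small, since an intersection can never be longer than its shortest input.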

  15. Strengths and Weaknesses
      - Strengths
        - Precise, if you know the right strategies
        - Precise, if you have an idea of what you’re looking for
        - Implementations are fast and efficient
      - Weaknesses
        - Users must learn Boolean logic
        - Boolean logic insufficient to capture the richness of language
        - No control over size of result set: either too many hits or none
        - When do you stop reading? All documents in the result set are considered “equally good”
        - What about partial matches? Documents that “don’t quite match” the query may be useful also

  16. Ranked Retrieval
      - Order documents by how likely they are to be relevant to the information need
        - Estimate relevance(q, d_i)
        - Sort documents by relevance
        - Display sorted results
      - User model
        - Present hits one screen at a time, best results first
        - At any point, users can decide to stop looking
      - How do we estimate relevance?
        - Assume document is relevant if it has a lot of query terms
        - Replace relevance(q, d_i) with sim(q, d_i)
        - Compute similarity of vector representations
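As a first approximation of “relevant if it has a lot of query terms”, sim(q, d_i) can simply count the distinct query terms a document contains. The sketch below ranks the toy collection that way; the scoring function, the tie-breaking rule (lower doc id first), and the decision to drop zero-score documents are all assumptions made for illustration.

```python
def overlap_score(query_terms, doc_terms):
    """Crude sim(q, d): number of distinct query terms present in the document."""
    return len(set(query_terms) & set(doc_terms))


def rank(query, docs):
    """Return doc ids ordered by descending score, dropping zero-score docs."""
    q_terms = query.lower().split()
    scored = [
        (overlap_score(q_terms, text.lower().split()), doc_id)
        for doc_id, text in docs.items()
    ]
    # Assumption: ties are broken by ascending doc id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


docs = {
    1: "one fish two fish",
    2: "red fish blue fish",
    3: "cat in the hat",
    4: "green eggs and ham",
}
ranking = rank("blue fish", docs)
print(ranking)  # [2, 1]
```

Unlike the Boolean query blue AND fish, which returns only document 2, this ranking also surfaces document 1 as a partial match, just lower in the list.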

  17. Vector Space Model
      - Figure: document vectors d1 to d5 drawn in a space with term axes t1, t2, t3; angles θ and φ are marked between vectors
      - Assumption: Documents that are “close together” in vector space “talk about” the same things
      - Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

  18. Similarity Metric
      - Use “angle” between the vectors:

        $$\cos(\theta) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|}$$

        $$\mathrm{sim}(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$$

      - Or, more generally, inner products:

        $$\mathrm{sim}(d_j, d_k) = \vec{d_j} \cdot \vec{d_k} = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}$$
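Both similarity formulas on this slide translate directly into code. The sketch below assumes documents are dense lists of term weights of equal length n; the example vectors are made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slide specifies.

```python
import math


def dot(u, v):
    """Inner product: sum over i of w_{i,j} * w_{i,k}."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


# Hypothetical weight vectors over three terms.
d_j = [1.0, 2.0, 0.0]
d_k = [2.0, 4.0, 0.0]   # same direction as d_j, twice the length
d_m = [0.0, 0.0, 3.0]   # shares no terms with d_j

print(dot(d_j, d_k), cosine(d_j, d_k), cosine(d_j, d_m))
```

The example shows why the normalization matters: d_k points in the same direction as d_j, so its cosine with d_j is (up to rounding) 1, while the raw inner product grows with vector length.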

  19. Term Weighting
      - Term weights consist of two components
        - Local: how important is the term in this document?
        - Global: how important is the term in the collection?
      - Here’s the intuition:
        - Terms that appear often in a document should get high weights
        - Terms that appear in many documents should get low weights
      - How do we capture this mathematically?
        - Term frequency (local)
        - Inverse document frequency (global)

  20. TF.IDF Term Weighting

      $$w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i}$$

      - $w_{i,j}$: weight assigned to term i in document j
      - $\mathrm{tf}_{i,j}$: number of occurrences of term i in document j
      - $N$: number of documents in entire collection
      - $n_i$: number of documents with term i
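The weighting formula can be computed directly from per-document term lists. In this sketch the function name `tf_idf` is hypothetical, the input is assumed to be already tokenized, and the natural logarithm is used because the slide does not state a base.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: doc_id -> list of terms. Returns doc_id -> {term: w_ij}."""
    N = len(docs)                                   # documents in the collection
    tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    n = Counter()                                   # n_i: documents containing term i
    for counts in tf.values():
        n.update(counts.keys())
    return {
        doc_id: {term: count * math.log(N / n[term]) for term, count in counts.items()}
        for doc_id, counts in tf.items()
    }


docs = {
    1: ["one", "fish", "two", "fish"],
    2: ["red", "fish", "blue", "fish"],
    3: ["cat", "hat"],
    4: ["green", "egg", "ham"],
}
weights = tf_idf(docs)
print(weights[1])
```

In document 1, "fish" occurs twice but appears in half the collection, while "one" occurs once and appears nowhere else; the two effects balance out here, which illustrates how the local and global components trade off. A term that appears in every document gets weight 0, since log(N / N) = 0.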
