
Retrieval Models: Outline (CS490W: Web Information Search & Management)



  1. Retrieval Models
     CS-490W: Web Information Search & Management
     Information Retrieval: Retrieval Models
     Luo Si, Department of Computer Science, Purdue University

Retrieval Models: Outline
- Exact-match retrieval method
  - Unranked Boolean retrieval method
  - Ranked Boolean retrieval method
- Best-match retrieval method
  - Vector space retrieval method
  - Latent semantic indexing

Retrieval Models
[Diagram: an information need is represented as a query; a selection of objects is represented as indexed objects; the retrieval model matches the query against the indexed objects to produce retrieved objects, which loop back through evaluation/feedback.]

Retrieval Models: Unranked Boolean
Unranked Boolean: exact-match method
- Retrieve a document iff it matches the precise query
- Often returns unranked documents (or documents in chronological order)
- Operators (code sketches of Boolean and proximity evaluation follow this page):
  - Logical operators: AND, OR, NOT
  - Proximity operators: #1(white house) (i.e., within one word distance, a phrase); #sen(Iraq weapon) (i.e., within a sentence)
  - String-matching operators: wildcard (e.g., ind* for "india" and "indonesia")
  - Field operators: title(information and retrieval), ...

Retrieval Models: Unranked Boolean
Unranked Boolean: exact-match method
- A query example:
  (#2(distributed information retrieval) OR #1(federated search)) AND author(#1(Jamie Callan) AND NOT (Steve))

Overview of Retrieval Models
- Boolean
- Vector space
  - Basic vector space: SMART, Lucene
  - Extended Boolean
- Probabilistic models
  - Statistical language models: Lemur
  - Two-Poisson model: Okapi
  - Bayesian inference networks: InQuery
- Citation/link analysis models
  - PageRank: Google
  - Hubs & authorities: Clever
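To make the exact-match semantics above concrete, here is a minimal sketch (not from the slides): an inverted index maps each term to the set of documents containing it, and AND/OR/NOT become set intersection, union, and difference. The toy corpus and helper names are illustrative assumptions.

    # Minimal sketch of unranked Boolean retrieval over an inverted index.
    docs = {
        1: "white house staff",
        2: "house of representatives",
        3: "white paper on retrieval",
    }

    # Inverted index: term -> set of doc ids containing the term.
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    def lookup(term):
        return index.get(term, set())

    # AND = intersection, OR = union, NOT = set difference:
    # (white AND house) AND NOT paper
    result = (lookup("white") & lookup("house")) - lookup("paper")
    print(sorted(result))  # [1] -- an unranked set of matching doc ids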

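The proximity operator #1(...) needs word positions, not just document sets. A hedged sketch, assuming a positional index of the form term -> {doc id: list of word positions} (the data structure and names are mine, not the course's):

    # Sketch of the #1 operator: the two terms must be adjacent (a phrase).
    positions = {
        "white": {1: [0], 3: [0]},
        "house": {1: [1], 2: [1]},
    }

    def within_one(term_a, term_b):
        """Doc ids where term_b occurs exactly one position after term_a."""
        hits = set()
        pa = positions.get(term_a, {})
        pb = positions.get(term_b, {})
        for doc_id in pa.keys() & pb.keys():
            if any(p + 1 in pb[doc_id] for p in pa[doc_id]):
                hits.add(doc_id)
        return hits

    print(within_one("white", "house"))  # {1}: contains the phrase "white house"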
  2. Retrieval Models: Unranked Boolean
WestLaw system: commercial legal/health/finance information retrieval system
- Logical operators
- Proximity operators: phrase, word proximity, same sentence/paragraph
- String-matching operator: wildcard (e.g., ind*)
- Field operator: title(#1("legal retrieval")) date(2000)
- Citations: Cite(Salton)

Retrieval Models: Ranked Boolean
Ranked Boolean: calculate a document score (a scoring sketch follows this page)
- Term evidence: evidence from term i occurring in doc j: (tf_ij) and (tf_ij * idf_i)
- AND weight: minimum of the argument weights
- OR weight: maximum of the argument weights
- Example with term evidence 0.2, 0.6, 0.4:
  AND gives min = 0.2; OR gives max = 0.6
  Query: (Thailand AND stock AND market)

Retrieval Models: Unranked Boolean
Advantages:
- Works well if the user knows exactly what to retrieve
- Predictable; easy to explain
- Very efficient
Disadvantages:
- It is difficult to design the query: high recall and low precision for a loose query; low recall and high precision for a strict query
- Results are unordered; hard to find the useful ones
- Users may be too optimistic about strict queries: a few very relevant documents are returned, but a lot more are missing

Retrieval Models: Ranked Boolean
Advantages:
- All the advantages of the unranked Boolean algorithm: works well when the query is precise; predictive; efficient
- Results in a ranked list (not a full list): easier to browse and find the most relevant documents than with Boolean
- The rank criterion is flexible: e.g., different variants of term evidence
Disadvantages:
- Still an exact-match (document selection) model: inverse correlation between recall and precision for strict and loose queries
- Predictability makes users overestimate retrieval quality

Retrieval Models: Ranked Boolean
Ranked Boolean: exact match
- Similar to unranked Boolean, but documents are ordered by some criterion
- Example: retrieve docs from the Wall Street Journal collection with the query (Thailand AND stock AND market). Which word is more important? Reflect the importance of a document by its words: many documents contain "stock" and "market", but fewer contain "Thailand"; the rarer word may be more indicative
- Term frequency (TF): number of occurrences in the query/doc; a larger number means more important
- Inverse document frequency (IDF): based on the total number of docs divided by the number of docs that contain the term; larger means more important
- There are many variants of TF and IDF: e.g., variants that consider document length (a tf*idf sketch follows this page)

Retrieval Models: Vector Space Model
Vector space model
- Any text object can be represented by a term vector: documents, queries, passages, sentences
  - A query can be seen as a short document
- Similarity is determined by distance in the vector space, e.g., the cosine of the angle between two vectors
- The SMART system: developed at Cornell University, 1960-1999; still quite popular
- The Lucene system: open-source information retrieval library (based on Java); works with Hadoop (Map/Reduce) in large-scale applications (e.g., Amazon Book)
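A minimal sketch of the ranked Boolean scoring rule above (AND = minimum of the argument weights, OR = maximum), using the slide's example term-evidence values; the nested-tuple query representation is my own assumption.

    # Sketch of ranked Boolean scoring over a query tree.
    # Term weights (e.g., tf*idf evidence) use the slide's example values.
    weights = {"Thailand": 0.2, "stock": 0.6, "market": 0.4}

    def score(node):
        op, args = node[0], node[1:]
        vals = [weights.get(a, 0.0) if isinstance(a, str) else score(a)
                for a in args]
        return min(vals) if op == "AND" else max(vals)

    query = ("AND", "Thailand", "stock", "market")
    print(score(query))                # 0.2 (AND: min of the term evidence)
    print(score(("OR",) + query[1:]))  # 0.6 (OR: max of the term evidence)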

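The TF and IDF definitions above translate directly into code. A small sketch; the toy corpus and the exact idf form log(N/df) are illustrative choices (the slides note that many variants exist):

    import math

    # Sketch of tf*idf term evidence: tf = occurrences of the term in the doc,
    # idf = log(N / df), where N = total docs, df = docs containing the term.
    docs = [
        "thailand stock market rises",
        "stock market report",
        "stock prices fall",
    ]
    N = len(docs)

    def tf(term, doc):
        return doc.split().count(term)

    def df(term):
        return sum(1 for d in docs if term in d.split())

    def tf_idf(term, doc):
        return tf(term, doc) * math.log(N / df(term))

    # "thailand" is rarer than "stock", so it carries more evidence:
    print(tf_idf("thailand", docs[0]))  # 1 * log(3/1) ~ 1.10
    print(tf_idf("stock", docs[0]))     # 1 * log(3/3) = 0.0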
  3. Retrieval Models: Vector Space Model
Vector space model vs. Boolean model
- Boolean models
  - Query: a Boolean expression that a document must satisfy
  - Retrieval: deductive inference
- Vector space model
  - Query: viewed as a short document in a vector space
  - Retrieval: find similar vectors/objects

Retrieval Models: Vector Space Model
Given two vectors, a query and a document,

  q = (q_1, q_2, ..., q_n)
  d_j = (d_j1, d_j2, ..., d_jn)

calculate the similarity as the cosine of the angle theta(q, d_j) between them (a code sketch follows this page):

  sim(q, d_j) = cos(theta(q, d_j)) = (q . d_j) / (|q| |d_j|)
              = (q_1 d_j1 + q_2 d_j2 + ... + q_n d_jn)
                / ( sqrt(q_1^2 + ... + q_n^2) * sqrt(d_j1^2 + ... + d_jn^2) )

Retrieval Models: Vector Space Model
Vector representation
[Figure: documents D1, D2, D3 and a query plotted in a three-dimensional term space with axes "Java", "Sun", and "Starbucks".]

Retrieval Models: Vector Space Model
Vector coefficients
- The coefficients (vector elements) represent term evidence / term importance
- Each coefficient is derived from several elements:
  - Document term weight: evidence of the term in the document/query
  - Collection term weight: importance of the term from observation of the collection
  - Length normalization: reduce document-length bias
- Naming convention for the coefficients in the product q_k * d_j,k: a triple DCL (Document term weight, Collection term weight, Length normalization) describes each side of the match, one triple for the document term and one for the query term
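A direct implementation of the cosine similarity formula above. The vectors are plain lists of term weights over the three example terms from the figure; the weight values themselves are illustrative assumptions.

    import math

    # Cosine similarity between a query vector and a document vector:
    # dot product divided by the product of the vector norms.
    def cosine(q, d):
        dot = sum(qk * dk for qk, dk in zip(q, d))
        norm_q = math.sqrt(sum(qk * qk for qk in q))
        norm_d = math.sqrt(sum(dk * dk for dk in d))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

    q  = [1.0, 0.0, 1.0]   # query weights over terms (Java, Sun, Starbucks)
    d1 = [0.9, 0.1, 0.8]
    d2 = [0.0, 1.0, 0.2]
    print(cosine(q, d1))   # ~0.99: small angle, very similar
    print(cosine(q, d2))   # ~0.14: nearly orthogonal, dissimilar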

  4. Retrieval Models: Vector Space Model
Advantages:
- Best-match method; it does not need a precise query
- Generates ranked lists; easy to explore the results
- Simplicity: easy to implement
- Effectiveness: often works well
- Flexibility: can utilize different types of term-weighting methods
- Used in a wide range of IR tasks: retrieval, classification, summarization, content-based filtering, ...

Retrieval Models: Vector Space Model
Common vector weight components: lnc.ltc, a widely used term weight (a code sketch follows this page)
- "l": log(tf) + 1
- "n": no weight/normalization
- "t": log(N/df)
- "c": cosine normalization

  sim(q, d_j) = [ sum_k (log(tf_q(k)) + 1)(log(tf_j(k)) + 1) log(N/df(k)) ]
                / ( sqrt(sum_k [(log(tf_q(k)) + 1) log(N/df(k))]^2)
                    * sqrt(sum_k (log(tf_j(k)) + 1)^2) )

Retrieval Models: Vector Space Model
Common vector weight components: dnn.dtb, which handles varied document lengths (a component sketch follows this page)
- "d": 1 + ln(1 + ln(tf))
- "t": log(N/df)
- "b": 1 / (0.8 + 0.2 * doclen / avg_doclen)

Retrieval Models: Vector Space Model
Disadvantages:
- Hard to choose the dimensions of the vector (the "basic concepts"); terms may not be the best choice
- Assumes an independence relationship among terms
- Heuristic choices of vector operations:
  - Choice of term weights
  - Choice of similarity function
- Assumes a query and a document can be treated in the same way

Retrieval Models: Vector Space Model
Standard vector space, in summary:
- Represent queries/documents in a vector space where each dimension corresponds to a term in the vocabulary
- Use a combination of components to represent the term evidence in both the query and the document
- Use a similarity function to estimate the relationship between queries/documents (e.g., cosine similarity)
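A minimal sketch of the lnc.ltc weight above, assuming the usual SMART reading in which the document side uses lnc and the query side uses ltc; the collection statistics (N, df) and term frequencies are illustrative assumptions.

    import math

    # lnc.ltc: "l" = log(tf)+1, "n" = no collection weight,
    # "t" = log(N/df), "c" = cosine normalization.
    N = 1000                            # collection size (assumed)
    df = {"stock": 500, "thailand": 20}

    def l(tf):  return math.log(tf) + 1 if tf > 0 else 0.0
    def t(term): return math.log(N / df[term])

    def c(weights):                     # cosine-normalize a weight vector
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {k: w / norm for k, w in weights.items()}

    doc_tf   = {"stock": 5, "thailand": 1}
    query_tf = {"stock": 1, "thailand": 1}

    d = c({term: l(tf) for term, tf in doc_tf.items()})              # lnc
    q = c({term: l(tf) * t(term) for term, tf in query_tf.items()})  # ltc
    print(sum(q[k] * d[k] for k in q))  # the lnc.ltc similarity score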

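The dnn.dtb components above can be written as three small functions; how they combine across the query and document sides follows the triple convention from the previous page, and the numbers below are illustrative assumptions only.

    import math

    def d_weight(tf):                   # "d": 1 + ln(1 + ln(tf))
        return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

    def t_weight(N, df):                # "t": log(N / df)
        return math.log(N / df)

    def b_norm(doclen, avg_doclen):     # "b": 1 / (0.8 + 0.2*doclen/avg)
        # Dampens the advantage of long documents relative to the average.
        return 1 / (0.8 + 0.2 * doclen / avg_doclen)

    # A term with tf=4 in a document twice the average length, df=50, N=1000:
    print(d_weight(4) * t_weight(1000, 50) * b_norm(2000, 1000))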