Models for Retrieval and Browsing - Fuzzy Set, Extended Boolean, Generalized Vector Space Models
Berlin Chen 2004
Reference: 1. Modern Information Retrieval. Chapter 2
Taxonomy of Classic IR Models
[Figure: taxonomy of IR models, organized by user task]
• User task: Retrieval (Adhoc, Filtering)
  – Classic Models: Boolean, Vector, Probabilistic
  – Set Theoretic: Fuzzy, Extended Boolean
  – Algebraic: Generalized Vector, Latent Semantic Indexing (LSI), Neural Networks
  – Probabilistic: Inference Network, Belief Network, Hidden Markov Model, Probabilistic LSI, Language Model (probability-based)
  – Structured Models: Non-Overlapping Lists, Proximal Nodes
• User task: Browsing
  – Browsing: Flat, Structure Guided, Hypertext
IR 2004 – Berlin Chen
Outline
• Alternative Set Theoretic Models
  – Fuzzy Set Model (Fuzzy Information Retrieval)
  – Extended Boolean Model
• Alternative Algebraic Models
  – Generalized Vector Space Model
Fuzzy Set Model
• Premises
  – Docs and queries are represented through sets of keywords, therefore the matching between them is vague
    • Keywords cannot completely describe the user's information need or the doc's main theme (aboutness)
    • The retrieval model has to match one set of terms ($w_s, w_p, w_q, \ldots$) against another ($w_i, w_j, w_k, \ldots$), e.g., the short forms "President Chen" (陳總統) and "North Second Freeway" (北二高) against the full forms "Chen Shui-bian" (陳水扁) and "Northern Second Freeway" (北部第二高速公路)
  – For each query term (keyword)
    • Define a fuzzy set in which each doc has a degree of membership (0~1)
Fuzzy Set Model (cont.)
• Fuzzy Set Theory
  – Framework for representing classes (sets) whose boundaries are not well defined
  – Key idea is to introduce the notion of a degree of membership associated with the elements of a set
  – This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
    • 0 → no membership
    • 1 → full membership
  – Thus, membership is now gradual rather than abrupt
    • Unlike conventional Boolean logic
Here we define a fuzzy set for each query (or index) term, so each doc has a degree of membership in this set.
Fuzzy Set Model (cont.)
[Figure: Venn diagram of fuzzy subsets A and B inside the universe U, with an element u]
• Definition
  – A fuzzy subset $A$ of a universe of discourse $U$ is characterized by a membership function $\mu_A : U \rightarrow [0,1]$
    • Which associates with each element $u$ of $U$ a number $\mu_A(u)$ in the interval [0,1]
  – Let $A$ and $B$ be two fuzzy subsets of $U$. Also, let $\bar{A}$ be the complement of $A$. Then,
    • Complement: $\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
    • Union: $\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
    • Intersection: $\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
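The three standard operations can be illustrated with a minimal Python sketch. The element names and membership values below are made up for illustration, and the function names are hypothetical, not part of the slides.

```python
# Standard fuzzy set operations on membership degrees in [0, 1].
# All names and sample values here are illustrative assumptions.

def fuzzy_complement(mu_a):
    """Membership in the complement of A: 1 - mu_A(u)."""
    return 1.0 - mu_a

def fuzzy_union(mu_a, mu_b):
    """Membership in A union B: max(mu_A(u), mu_B(u))."""
    return max(mu_a, mu_b)

def fuzzy_intersection(mu_a, mu_b):
    """Membership in A intersect B: min(mu_A(u), mu_B(u))."""
    return min(mu_a, mu_b)

# Two fuzzy subsets of a tiny universe U = {u1, u2}.
mu_A = {"u1": 0.2, "u2": 0.9}
mu_B = {"u1": 0.7, "u2": 0.4}

union = {u: fuzzy_union(mu_A[u], mu_B[u]) for u in mu_A}
inter = {u: fuzzy_intersection(mu_A[u], mu_B[u]) for u in mu_A}
comp_A = {u: fuzzy_complement(mu_A[u]) for u in mu_A}
```

With memberships restricted to 0 and 1, these reduce to the ordinary Boolean complement, OR, and AND.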
Fuzzy Set Model (cont.)
• Fuzzy information retrieval: defining term relationships
  – Fuzzy sets are modeled based on a thesaurus
  – This thesaurus can be constructed from a term-term correlation matrix (also called a keyword connection matrix)
    • $\vec{c}$: a term-term correlation matrix
    • $c_{i,l}$: a normalized correlation factor for terms $k_i$ and $k_l$, ranging from 0 to 1

      $$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$$

      where $n_i$ is the number of docs that contain $k_i$, $n_l$ is the number of docs that contain $k_l$, and $n_{i,l}$ is the number of docs that contain both $k_i$ and $k_l$ (the counting unit can be docs, paragraphs, sentences, ...)
• We now have the notion of proximity among index terms
  – The relationship is symmetric: $\mu_{k_i}(k_l) = c_{i,l} = c_{l,i} = \mu_{k_l}(k_i)$
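As a rough illustration of the correlation factor, the sketch below builds $c_{i,l}$ from co-occurrence counts. It assumes each doc is given as a set of terms; the function name `correlation_matrix` and the toy collection are hypothetical.

```python
from itertools import product

def correlation_matrix(docs):
    """Return {(k_i, k_l): c_il} with c_il = n_il / (n_i + n_l - n_il).

    docs is a list of sets of terms (an assumed input format).
    """
    vocab = sorted(set().union(*docs))
    # n_i: number of docs that contain k_i
    n = {k: sum(1 for d in docs if k in d) for k in vocab}
    c = {}
    for k_i, k_l in product(vocab, repeat=2):
        # n_il: number of docs that contain both k_i and k_l
        n_il = sum(1 for d in docs if k_i in d and k_l in d)
        c[(k_i, k_l)] = n_il / (n[k_i] + n[k_l] - n_il)
    return c

# Toy collection (made up for illustration).
docs = [{"fuzzy", "set"}, {"fuzzy", "logic"}, {"set", "theory"}]
c = correlation_matrix(docs)
```

Every term in the vocabulary occurs in at least one doc, so the denominator is never zero; a term is always fully correlated with itself.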
Fuzzy Set Model (cont.)
• The union and intersection operations are modified here
[Figure: Venn diagram of $A_1$ and $A_2$ in $U$; with $a = \mu_{A_1}(u)$ and $b = \mu_{A_2}(u)$, the three regions of the union contribute $ab$, $(1-a)b$ and $a(1-b)$]

  $$ab + (1-a)b + a(1-b) = ab + b - ab + a - ab = 1 - (1 - a - b + ab) = 1 - (1-a)(1-b)$$

  – Union: algebraic sum (instead of max)

    $$\mu_{A_1 \cup A_2}(u) = \mu_{A_1}(u)\mu_{A_2}(u) + \mu_{\bar{A}_1}(u)\mu_{A_2}(u) + \mu_{A_1}(u)\mu_{\bar{A}_2}(u) = 1 - (1 - \mu_{A_1}(u))(1 - \mu_{A_2}(u))$$

    $$\mu_{A_1 \cup A_2 \cup \cdots \cup A_n}(u) = \mu_{\cup_j A_j}(u) = 1 - \prod_{j=1}^{n} (1 - \mu_{A_j}(u))$$

    where the product $\prod_{j=1}^{n} (1 - \mu_{A_j}(u))$ is a negative algebraic product
  – Intersection: algebraic product (instead of min)

    $$\mu_{A_1 \cap A_2}(u) = \mu_{A_1}(u)\mu_{A_2}(u)$$

    $$\mu_{A_1 \cap A_2 \cap \cdots \cap A_n}(u) = \mu_{\cap_j A_j}(u) = \prod_{j=1}^{n} \mu_{A_j}(u)$$
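The modified operations can be sketched in a few lines of Python. The function names are hypothetical and the sample values are made up; `math.prod` requires Python 3.8 or later.

```python
import math

def algebraic_sum(memberships):
    """Union: 1 minus the product of (1 - mu_Aj(u)) over all sets A_j."""
    return 1.0 - math.prod(1.0 - m for m in memberships)

def algebraic_product(memberships):
    """Intersection: product of mu_Aj(u) over all sets A_j."""
    return math.prod(memberships)

a, b = 0.5, 0.4                  # illustrative membership degrees
u_ab = algebraic_sum([a, b])     # equals a + b - a*b
i_ab = algebraic_product([a, b])
```

Unlike max and min, both results change whenever any single membership changes, which gives a smoother ranking signal.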
Fuzzy Set Model (cont.)
  – The degree of membership between a doc $d_j$ and an index term $k_i$ is an algebraic sum (a doc is treated as a union of its index terms)
[Figure: term $k_i$ linked to the terms $k_a$, $k_b$ of doc $d_j$ with correlations $c_{i,a}$, $c_{i,b}$ and complements $1 - c_{i,a}$, $1 - c_{i,b}$]

    $$\mu_{k_i,d_j} = \mu_{\cup_{k_l \in d_j} k_l}(k_i) = 1 - \prod_{k_l \in d_j} (1 - \mu_{k_l}(k_i)) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$

    • Computes an algebraic sum over all terms in the doc $d_j$
      – Implemented as the complement of a negative algebraic product
  – A doc $d_j$ belongs to the fuzzy set associated with the term $k_i$ if its own terms are related to $k_i$
    • If there is at least one index term $k_l$ of $d_j$ which is strongly related to the index term $k_i$ ($c_{i,l} \sim 1$), then $\mu_{k_i,d_j} \sim 1$
      – $k_i$ is a good fuzzy index for doc $d_j$
      – And vice versa
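A minimal sketch of this membership computation follows. It assumes the correlations are stored in a dictionary keyed by term pairs, that missing pairs have correlation 0, and that a term is fully correlated with itself; the names `corr`, `membership`, and the sample values are hypothetical.

```python
import math

def corr(c, k_i, k_l):
    """Look up c_il; assumes symmetry, c_ii = 1, and 0 for unknown pairs."""
    if k_i == k_l:
        return 1.0
    return c.get((k_i, k_l), c.get((k_l, k_i), 0.0))

def membership(k_i, doc_terms, c):
    """mu_{k_i, d_j} = 1 - product over k_l in d_j of (1 - c_il)."""
    return 1.0 - math.prod(1.0 - corr(c, k_i, k_l) for k_l in doc_terms)

# Made-up correlations for illustration.
c = {("highway", "freeway"): 0.8, ("highway", "road"): 0.5}
doc = {"freeway", "road", "toll"}

mu = membership("highway", doc, c)   # 1 - (0.2 * 0.5 * 1.0)
```

The doc never mentions "highway", yet it gets a high membership in that term's fuzzy set because one of its own terms is strongly correlated with it.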
Fuzzy Set Model (cont.)
• Example:
  – Query $q = k_a \wedge (k_b \vee \neg k_c)$
  – Disjunctive normal form: $q_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \neg k_c) \vee (k_a \wedge \neg k_b \wedge \neg k_c) = cc_1 + cc_2 + cc_3$, where each $cc_i$ is a conjunctive component
[Figure: Venn diagram of the fuzzy sets $D_a$, $D_b$, $D_c$ with the regions $cc_1$, $cc_2$, $cc_3$ marked]
  – $D_a$ is the fuzzy set of docs associated with the term $k_a$
  – Degree of membership?
Fuzzy Set Model (cont.)
[Figure: Venn diagram of $D_a$, $D_b$, $D_c$ with the regions $cc_1$, $cc_2$, $cc_3$ that make up the fuzzy answer set $D_q$]
• Degree of membership for a doc $d_j$ in the fuzzy answer set $D_q$
  – Algebraic sum over the conjunctive components, computed via a negative algebraic product:

    $$\mu_{q,d_j} = \mu_{cc_1 \cup cc_2 \cup cc_3,\, d_j} = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i,d_j})$$

    $$= 1 - (1 - \mu_{a \cap b \cap c,\, d_j})(1 - \mu_{a \cap b \cap \neg c,\, d_j})(1 - \mu_{a \cap \neg b \cap \neg c,\, d_j})$$

  – Algebraic product within each conjunctive component:

    $$= 1 - (1 - \mu_{a,d_j}\mu_{b,d_j}\mu_{c,d_j}) \times (1 - \mu_{a,d_j}\mu_{b,d_j}(1 - \mu_{c,d_j})) \times (1 - \mu_{a,d_j}(1 - \mu_{b,d_j})(1 - \mu_{c,d_j}))$$
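The membership formula for this example query can be checked numerically with a short sketch. The function name `mu_query` and the sample memberships are made up for illustration.

```python
def mu_query(mu_a, mu_b, mu_c):
    """Membership of a doc in D_q for q = k_a AND (k_b OR NOT k_c).

    Each conjunctive component uses the algebraic product; the three
    components are combined with the algebraic sum.
    """
    cc1 = mu_a * mu_b * mu_c                # a AND b AND c
    cc2 = mu_a * mu_b * (1 - mu_c)          # a AND b AND NOT c
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)    # a AND NOT b AND NOT c
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

score = mu_query(0.8, 0.5, 0.3)  # illustrative memberships
```

With crisp memberships (0 or 1) the function agrees with the Boolean query: for example, `mu_query(1, 0, 1)` is 0 and `mu_query(1, 0, 0)` is 1.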
Fuzzy Set Model (cont.)
• Advantages
  – The correlations among index terms are considered
  – A degree of relevance between queries and docs can be obtained
• Disadvantages
  – Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
  – Experiments with standard test collections are not available
Extended Boolean Model (Salton et al., 1983)
• Motive
  – Extend the Boolean model with the functionality of partial matching and term weighting
    • E.g.: in the Boolean model, for the query $q = k_x \wedge k_y$ (e.g., "Chen Shui-bian AND Lu Hsiu-lien", 陳水扁 及 呂秀蓮), a doc that contains only one of $k_x$ or $k_y$ is as irrelevant as another doc that contains neither of them
    • How about the disjunctive query $q = k_x \vee k_y$ (e.g., "Chen Shui-bian OR Lu Hsiu-lien", 陳水扁 或 呂秀蓮)?
  – Combine Boolean query formulations with characteristics of the vector model
    • Term weighting → a ranking can be obtained
    • Algebraic distances for similarity measures
Extended Boolean Model (cont.)
• Term weighting
  – The weight for the term $k_x$ in a doc $d_j$ is

    $$w_{x,j} = tf_{x,j} \times \frac{idf_x}{\max_i idf_i}$$

    where $tf_{x,j}$ is the normalized frequency and $idf_x / \max_i idf_i$ is the normalized idf, both ranging from 0 to 1
    • $w_{x,j}$ is normalized to lie between 0 and 1
• Assume two index terms $k_x$ and $k_y$ are used
  – Let $x = w_{x,j}$ denote the weight of term $k_x$ on doc $d_j$
  – Let $y = w_{y,j}$ denote the weight of term $k_y$ on doc $d_j$
  – The doc vector is represented as $\vec{d}_j = (w_{x,j}, w_{y,j}) = (x, y)$
  – Queries and docs can be plotted in a two-dimensional map
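The weighting scheme can be sketched as follows. The slide does not spell out how tf and idf are computed, so this sketch assumes the common choices tf = raw count divided by the largest count in the doc, and idf = log(N / n_x); the function name and the toy docs are hypothetical.

```python
import math
from collections import Counter

def extended_boolean_weights(docs):
    """Return one {term: w_xj} dict per doc, with every weight in [0, 1].

    docs is a non-empty list of non-empty token lists (assumed format).
    Assumed definitions: tf = count / max count in the doc,
    idf = log(N / n_x), normalized by the largest idf in the collection.
    """
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))   # n_x per term
    idf = {t: math.log(n_docs / n) for t, n in df.items()}
    max_idf = max(idf.values()) or 1.0              # guard against all-zero idf
    weights = []
    for d in docs:
        counts = Counter(d)
        max_count = max(counts.values())
        weights.append({t: (f / max_count) * (idf[t] / max_idf)
                        for t, f in counts.items()})
    return weights

docs = [["kx", "ky", "kx"], ["kx", "kz"], ["kx", "ky"]]
w = extended_boolean_weights(docs)
```

In this toy collection "kx" occurs in every doc, so its idf and therefore its weight is 0 everywhere, while "kz" in the second doc gets the maximum weight of 1.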
Extended Boolean Model (cont.)
• If the query is $q = k_x \wedge k_y$ (conjunctive query)
  – The docs near the point (1,1) are preferred
  – The similarity measure is defined as (2-norm model, Euclidean distance)

    $$sim(q_{and}, d) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$$

[Figure: AND case in the $(k_x, k_y)$ plane with corners (0,0) and (1,1); docs $d_j$ and $d_{j+1}$ plotted at $x = w_{x,j}$, $y = w_{y,j}$; distances are measured from (1,1)]
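A small sketch of the conjunctive similarity, with a hypothetical function name and made-up weights:

```python
import math

def sim_and(x, y):
    """2-norm similarity for q = k_x AND k_y: 1 minus the normalized
    Euclidean distance from the doc point (x, y) to the ideal point (1, 1)."""
    return 1.0 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

both = sim_and(1.0, 1.0)     # doc fully weighted on both terms
one = sim_and(1.0, 0.0)      # doc containing only k_x
neither = sim_and(0.0, 0.0)  # doc containing neither term
```

A doc with only one of the two terms scores 1 - 1/sqrt(2), about 0.29, so it ranks above a doc with neither term instead of being treated as equally irrelevant.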
Extended Boolean Model (cont.)
• If the query is $q = k_x \vee k_y$ (disjunctive query)
  – The docs far from the point (0,0) are preferred
  – The similarity measure is defined as (2-norm model, Euclidean distance)

    $$sim(q_{or}, d) = \sqrt{\frac{x^2 + y^2}{2}}$$

[Figure: OR case in the $(k_x, k_y)$ plane with corners (0,0) and (1,1); docs $d_j$ and $d_{j+1}$ plotted at $x = w_{x,j}$, $y = w_{y,j}$; distances are measured from (0,0)]
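A matching sketch for the disjunctive similarity, again with a hypothetical function name and made-up weights:

```python
import math

def sim_or(x, y):
    """2-norm similarity for q = k_x OR k_y: normalized Euclidean
    distance from the doc point (x, y) to the worst point (0, 0)."""
    return math.sqrt((x ** 2 + y ** 2) / 2)

both = sim_or(1.0, 1.0)     # doc fully weighted on both terms
one = sim_or(1.0, 0.0)      # doc containing only k_x
neither = sim_or(0.0, 0.0)  # doc containing neither term
```

A doc with only one term scores 1/sqrt(2), about 0.71, so it is ranked below a doc with both terms, which a strict Boolean OR would not distinguish.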
Extended Boolean Model (cont.)
• The similarity measures $sim(q_{or}, d)$ and $sim(q_{and}, d)$ also lie between 0 and 1