information filtering

Information Filtering
Information Systems M, Prof. Paolo Ciaccia
http://www-db.deis.unibo.it/courses/SI-M/


  1. The Information Filtering (IF) problem:
     - Deliver to users only the information that is relevant to them, filtering out all irrelevant new data items (news, papers, advertisements, …)

     Although IF and IR share the common goal of providing users with relevant information, there are important differences:

     | | IR | IF |
     | --- | --- | --- |
     | Goal | Selecting relevant items (docs) for each query | Filtering out the many irrelevant data items |
     | Type of use | Ad-hoc use | Repetitive use |
     | Type of users | One-time users | Long-term users |
     | Representation of information needs | Queries | User profiles |
     | Index | Items | User profiles |

  2. IF techniques find applications in a variety of scenarios, including:
     - Automatic delivery of news/alerts
     - Online display advertising
     - Publish/subscribe systems
     - …

     Recommender systems are a specific type of IF system that will be discussed later on.

     Given the similarity of IF to IR, it is no surprise that the most common approaches to IF are based on the Boolean and the Vector Space models. However, a more detailed and structured description of the user profile is now needed in order to improve the effectiveness of matching.

     In the sequel we sketch the details of a recent approach based on the Boolean model; examples of use of the VSM will be given in the context of recommender systems.

  3. Reference: [WBS+09]

     Scenario:
     - A (profiled) user visiting a web site (also called an “assignment”)
     - Many advertisement campaigns managed by the site
     - Both specified using Boolean expressions (BE’s) over a multi-attribute space

     Alternatively (pub/sub system):
     - An incoming item
     - Many stored user profiles

     In both cases, one “assignment” has to be efficiently matched against many stored BE’s.

     (Figure: an assignment is submitted to an index built over the BE’s, which returns the matched BE’s.)

     Two types of Boolean predicates: ∈ and ∉
     - E.g.: state ∈ {CA,NY}, state ∉ {NY}
     - Ranges of values are converted into ∈ and ∉ predicates: age < 30 is converted into age ∈ {0,1,2} (0 = [0,9], 1 = [10,19], …)

     A BE is in either DNF or CNF, e.g. (with & = AND and | = OR):
       (state ∈ {CA,NY} & age ∈ {1,2}) | (state ∉ {NY} & gender ∈ {F})
     In the following we only discuss the DNF case.

     An assignment S is a set (conjunction) of attribute-value pairs
     - E.g.: S: state = CA & gender = F
     - An attribute-value pair is also called a key, e.g. (state,CA) is a key
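
The conversion of a range predicate into an ∈ predicate, and the view of an assignment as a set of keys, can be sketched as follows. This is only an illustration: the function names (`bucket_of`, `buckets_for_less_than`, `keys_of`) and the fixed bucket width of 10 are assumptions taken from the age example, not part of [WBS+09]. Note that when the bound is not aligned with a bucket boundary (e.g. age < 25), the resulting bucket set over-approximates the range.

```python
# Sketch only: names and the bucket width of 10 are assumptions of this example.

def bucket_of(value, width=10):
    """Map a non-negative integer to its bucket id (0 = [0,9], 1 = [10,19], ...)."""
    return value // width


def buckets_for_less_than(upper, width=10):
    """Bucket ids covering all integer values v with 0 <= v < upper."""
    if upper <= 0:
        return set()
    return set(range(bucket_of(upper - 1, width) + 1))


def keys_of(assignment):
    """An assignment is a conjunction of attribute = value pairs; each pair is a key."""
    return set(assignment.items())


# age < 30 becomes age ∈ {0, 1, 2}
age_buckets = buckets_for_less_than(30)

# S: state = CA & gender = F
s_keys = keys_of({"state": "CA", "gender": "F"})
```

A 27-year-old user would then carry the key (age, 2), since `bucket_of(27)` is 2, and would satisfy the converted predicate.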

  4. A BE E is satisfied by an assignment S if S makes E true
     - S: state = CA & gender = F
     - E1: state ∈ {CA,NY} is satisfied
     - E2: state ∈ {CA,NY} & gender ∈ {M} is not satisfied

     Since an assignment need not specify a value for every attribute, the semantics of matching needs to be refined:
     - Is (state ∈ {NY} & gender ∈ {F}) satisfied by gender = F? NO
     - Is (state ∉ {NY} & gender ∈ {F}) satisfied by gender = F? MAYBE…

     Two alternative interpretations for ∉ predicates:
     - Strong-∉ predicate: violated if no value is specified for the attribute
     - Weak-∉ predicate: satisfied if no value is specified for the attribute

     The default is the weak-∉ semantics. The strong-∉ semantics can be enforced by writing, e.g., state ∉ {NY,NULL}, which requires a value for state to be present in the assignment.

     The basic idea is to build an inverted index on BE’s that, for each key, stores the BE’s containing it. The basic case is when BE’s are simple conjunctions of ∈ predicates:
     - E1: A ∈ {1}
     - E2: A ∈ {1} & B ∈ {2} & C ∈ {3,4}
     - S: A = 1 & B = 2

     Inverted index:

     | Key | Posting list |
     | --- | --- |
     | (A,1) | E1, E2 |
     | (B,2) | E2 |
     | (C,3) | E2 |
     | (C,4) | E2 |

     The problem is that neither the intersection nor the union of the posting lists of the keys in S works here, since only E1 is satisfied by S:
     - Intersection: E2
     - Union: E1 and E2
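
The matching semantics and the failure of the plain inverted index can be checked with a short sketch. The representation is an assumption of this example: a conjunction is a list of `(attribute, op, values)` triples with `op` either `"in"` or `"not_in"`, an assignment is a dict, and Python's `None` plays the role of NULL. Treating a missing attribute as NULL yields the weak-∉ default, and adding NULL to the value set yields the strong-∉ behaviour.

```python
from collections import defaultdict

NULL = None  # assumption: None stands for "no value specified"


def satisfies(conjunction, assignment):
    """True if every predicate of the conjunction holds for the assignment."""
    for attr, op, values in conjunction:
        value = assignment.get(attr, NULL)
        if op == "in" and value not in values:
            return False
        if op == "not_in" and value in values:
            return False
    return True


def naive_index(bes):
    """Plain inverted index: key (attribute, value) -> names of BE's containing it."""
    index = defaultdict(set)
    for name, conjunction in bes.items():
        for attr, _op, values in conjunction:
            for v in values:
                index[(attr, v)].add(name)
    return index


# Weak vs. strong ∉ on the assignment gender = F
s1 = {"gender": "F"}
weak = [("state", "not_in", {"NY"}), ("gender", "in", {"F"})]
strong = [("state", "not_in", {"NY", NULL}), ("gender", "in", {"F"})]
weak_ok = satisfies(weak, s1)      # True: state is unspecified
strong_ok = satisfies(strong, s1)  # False: a value for state is required

# The basic case: E1, E2 and S: A = 1 & B = 2
bes = {
    "E1": [("A", "in", {1})],
    "E2": [("A", "in", {1}), ("B", "in", {2}), ("C", "in", {3, 4})],
}
s2 = {"A": 1, "B": 2}
index = naive_index(bes)
lists = [index[key] for key in s2.items()]
by_intersection = set.intersection(*lists)  # {"E2"}
by_union = set.union(*lists)                # {"E1", "E2"}
correct = {name for name, c in bes.items() if satisfies(c, s2)}  # {"E1"}
```

Neither set operation reproduces `correct`, which is the reason the index needs the extra bookkeeping on the number of conjuncts.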

  5. Entries are partitioned based on the number of conjuncts K in each BE. The partition of the inverted index storing information on BE’s with K conjuncts is called the “K-index”.

     BE’s (conjunctions):

     | ID | BE | K |
     | --- | --- | --- |
     | C1 | age ∈ {3} & state ∈ {NY} | 2 |
     | C2 | age ∈ {3} & gender ∈ {F} | 2 |
     | C3 | age ∈ {3} & gender ∈ {M} & state ∉ {CA} | 2 |
     | C4 | state ∈ {CA} & gender ∈ {M} | 2 |
     | C5 | age ∈ {3,4} | 1 |
     | C6 | state ∉ {CA,NY} | 0 |

     Inverted index:

     | K | Key | Posting list |
     | --- | --- | --- |
     | 0 | (state,CA) | (C6,∉) |
     | 0 | (state,NY) | (C6,∉) |
     | 0 | Z | (C6,∈) |
     | 1 | (age,3) | (C5,∈) |
     | 1 | (age,4) | (C5,∈) |
     | 2 | (state,NY) | (C1,∈) |
     | 2 | (age,3) | (C1,∈), (C2,∈), (C3,∈) |
     | 2 | (gender,F) | (C2,∈) |
     | 2 | (state,CA) | (C3,∉), (C4,∈) |
     | 2 | (gender,M) | (C3,∈), (C4,∈) |

     The “Z key” is used to handle the case K = 0 (notice that ∉ predicates do not contribute to the value of K).

     Given an assignment S with t keys, two basic conditions are used to check whether a conjunction C matches S:
     1. For a K-index with K ≤ t, C matches S only if there are K posting lists such that each list refers to a key (A,v) in S and contains the entry (C,∈)
     2. There is no key (A,v) in S whose posting list contains the entry (C,∉)

     Example, with S: age = 3 & gender = M & state = CA:
     - C1: (age ∈ {3} & gender ∈ {M}) matches S
     - C2: (age ∈ {3} & gender ∈ {M} & state ∉ {CA}) does not match S, since the posting list of the key (state,CA) includes the entry (C2,∉)

     The Conjunction algorithm iterates through the K-indexes, checking that the above conditions are satisfied. Further, it does not consider at all the K-indexes with K > t.
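
The construction of the K-indexes for the six conjunctions C1–C6 can be sketched as follows. The data layout (a dict of dicts of lists), the constant names and `build_k_index` are assumptions made for illustration; [WBS+09] may organise and order posting lists differently. K counts only the ∈ predicates, and conjunctions with K = 0 are additionally registered under the Z key.

```python
from collections import defaultdict

IN, NOT_IN = "∈", "∉"
Z = "Z"  # special key for conjunctions with K = 0

# Conjunctions from the table: id -> list of (attribute, op, values)
CONJUNCTIONS = {
    "C1": [("age", IN, {3}), ("state", IN, {"NY"})],
    "C2": [("age", IN, {3}), ("gender", IN, {"F"})],
    "C3": [("age", IN, {3}), ("gender", IN, {"M"}), ("state", NOT_IN, {"CA"})],
    "C4": [("state", IN, {"CA"}), ("gender", IN, {"M"})],
    "C5": [("age", IN, {3, 4})],
    "C6": [("state", NOT_IN, {"CA", "NY"})],
}


def build_k_index(conjunctions):
    """Return {K: {key: [(conjunction id, op), ...]}}."""
    k_index = defaultdict(lambda: defaultdict(list))
    for cid, predicates in conjunctions.items():
        # K is the number of ∈ predicates; ∉ predicates are not counted.
        k = sum(1 for _attr, op, _values in predicates if op == IN)
        for attr, op, values in predicates:
            for v in values:
                k_index[k][(attr, v)].append((cid, op))
        if k == 0:
            # Make the conjunction reachable even though it has no ∈ key.
            k_index[0][Z].append((cid, IN))
    return k_index


k_index = build_k_index(CONJUNCTIONS)
for k in sorted(k_index):
    for key, postings in k_index[k].items():
        print(k, key, postings)
```

Printing the structure lists the same keys and posting lists as the inverted index table, possibly in a different row order.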

  6. S: age = 3 & state = CA & gender = M

     First, all the relevant posting lists are obtained (one K-index at a time):

     | K | Key | Posting list |
     | --- | --- | --- |
     | 0 | (state,CA) | (C6,∉) |
     | 0 | Z | (C6,∈) |
     | 1 | (age,3) | (C5,∈) |
     | 2 | (age,3) | (C1,∈), (C2,∈), (C3,∈) |
     | 2 | (state,CA) | (C3,∉), (C4,∈) |
     | 2 | (gender,M) | (C3,∈), (C4,∈) |

     BE’s (conjunctions):

     | ID | BE | K |
     | --- | --- | --- |
     | C1 | age ∈ {3} & state ∈ {NY} | 2 |
     | C2 | age ∈ {3} & gender ∈ {F} | 2 |
     | C3 | age ∈ {3} & gender ∈ {M} & state ∉ {CA} | 2 |
     | C4 | state ∈ {CA} & gender ∈ {M} | 2 |
     | C5 | age ∈ {3,4} | 1 |
     | C6 | state ∉ {CA,NY} | 0 |

     - For K = 2 it is recognized that neither C1 nor C2 can be satisfied by S, since each of them appears in only one of the relevant posting lists
     - Although C3 satisfies condition 1, it violates condition 2
     - C4 satisfies both conditions
     - The same holds for C5 (K = 1)
     - C6 violates condition 2

     Result: {C4,C5}

     To process BE’s in DNF it is sufficient to observe that a BE E is satisfied by an assignment S iff at least one of its conjunctions of predicates is satisfied by S.

     Example: (state ∈ {CA} & gender ∈ {M}) | (state ∈ {NY} & gender ∈ {F}) is satisfied by S: age = 3 & state = CA & gender = M
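
The worked example and the DNF case can be reproduced with a simplified sketch of the two matching conditions. It counts ∈ hits per conjunction instead of scanning sorted posting lists, so it is not the Conjunction algorithm as specified in [WBS+09], only a check of the same conditions. All names are assumptions of this example, as is the conjunction C7, which is added here to hold the second disjunct of the DNF expression. The sketch also assumes one value per attribute in S and at most one predicate per attribute in each conjunction.

```python
from collections import defaultdict

IN, NOT_IN = "∈", "∉"
Z = "Z"

CONJUNCTIONS = {
    "C1": [("age", IN, {3}), ("state", IN, {"NY"})],
    "C2": [("age", IN, {3}), ("gender", IN, {"F"})],
    "C3": [("age", IN, {3}), ("gender", IN, {"M"}), ("state", NOT_IN, {"CA"})],
    "C4": [("state", IN, {"CA"}), ("gender", IN, {"M"})],
    "C5": [("age", IN, {3, 4})],
    "C6": [("state", NOT_IN, {"CA", "NY"})],
}


def build_k_index(conjunctions):
    """Return {K: {key: [(conjunction id, op), ...]}}; K counts ∈ predicates only."""
    k_index = defaultdict(lambda: defaultdict(list))
    for cid, predicates in conjunctions.items():
        k = sum(1 for _a, op, _v in predicates if op == IN)
        for attr, op, values in predicates:
            for v in values:
                k_index[k][(attr, v)].append((cid, op))
        if k == 0:
            k_index[0][Z].append((cid, IN))
    return k_index


def match_conjunctions(k_index, assignment):
    """Ids of the conjunctions satisfying conditions 1 and 2 for the assignment."""
    keys = set(assignment.items())
    t = len(keys)
    matched = set()
    for k, partition in k_index.items():
        if k > t:
            continue  # K-indexes with K > t are not considered at all
        hits = defaultdict(int)  # number of relevant lists holding (C, ∈)
        violated = set()         # conjunctions with a (C, ∉) entry
        for key in keys:
            for cid, op in partition.get(key, []):
                if op == IN:
                    hits[cid] += 1
                else:
                    violated.add(cid)
        if k == 0:
            candidates = {cid for cid, _op in partition.get(Z, [])}
        else:
            candidates = {cid for cid, n in hits.items() if n == k}  # condition 1
        matched |= candidates - violated                              # condition 2
    return matched


def matching_bes(dnf_bes, matched_conjunctions):
    """A DNF BE matches iff at least one of its conjunctions matches."""
    return {name for name, cids in dnf_bes.items()
            if any(cid in matched_conjunctions for cid in cids)}


S = {"age": 3, "state": "CA", "gender": "M"}
result = match_conjunctions(build_k_index(CONJUNCTIONS), S)  # {"C4", "C5"}

# DNF example: E = C4 | C7, with the hypothetical C7 = state ∈ {NY} & gender ∈ {F}
with_c7 = dict(CONJUNCTIONS, C7=[("state", IN, {"NY"}), ("gender", IN, {"F"})])
matched = match_conjunctions(build_k_index(with_c7), S)
satisfied = matching_bes({"E": ["C4", "C7"]}, matched)       # {"E"}
```

Evaluating a DNF expression therefore needs no extra index structure, only a mapping from each BE to the ids of its conjunctions.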
