Concept Search: Semantics Enabled Syntactic Search
Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu
June 2nd, 2008, Tenerife, Spain
Outline
- Information Retrieval (IR)
- Syntactic IR
- Problems of Syntactic IR
- Semantic Continuum
- Concept Search (C-Search)
- C-Search via Inverted Indices
- Preliminary Evaluation
- Conclusion and Future Work
Information Retrieval (IR)
IR can be represented as a mapping function:
  IR: Q → D
- Q: natural language queries that specify user information needs
- D: the set of documents in the document collection that meet these needs, (optionally) ordered according to their degree of relevance
(The slide shows an example document collection and example queries.)
Information Retrieval System
IR_System = <Model, Data_Structure, Term, Match>
- Model: the IR models used for document and query representations, for computing query answers, and for relevance ranking
  - bag of words model (representation)
  - Boolean model, vector space model, probabilistic model (retrieval)
- Data_Structure: the data structures used for indexing and retrieval
  - inverted index
  - signature file
- Term: an atomic element in document and query representations
  - a word or a multi-word phrase
- Match: the matching technique used for comparing terms
  - syntactic matching of words or phrases:
    - search for equivalent words
    - search for words with common prefixes
    - search for words within a certain edit distance of a given word
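A minimal sketch, in Python, of a syntactic IR system instantiating this tuple: bag of words as the model, an inverted index as the data structure, lowercased words as terms, and word equality as the matching technique. All names here are illustrative, not from the slides.

```python
from collections import defaultdict

def tokenize(text):
    # Term extraction: plain lowercased words.
    return text.lower().split()

def build_inverted_index(docs):
    # Map each term to the set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and_query(index, query):
    # Boolean retrieval: documents containing every query term.
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "A little dog barks", 2: "A huge cat sleeps", 3: "Dog and cat play"}
index = build_inverted_index(docs)
print(boolean_and_query(index, "dog"))  # {1, 3}
```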
Syntactic IR (Example: Inverted Index)
(The slide shows an example inverted index over the document collection and the processing of query Q3.)
Problems of Syntactic IR
(I) Ambiguity of natural language
- Polysemy: one word ↔ multiple meanings
  - e.g., baby is a young mammal or a human child
- Synonymy: different words ↔ same meaning
  - e.g., mark and print: a visible indication made on a surface
(II) Complex concepts
- Syntactic IR does not take into account complex concepts formed by natural language phrases (e.g., noun phrases)
  - e.g., computer table vs. "A laptop computer is on a coffee table"
(III) Related concepts
- Syntactic IR does not take into account related concepts
  - e.g., carnivores (flesh-eating mammals) is more general than dog OR cat
Syntactic IR
We can think of syntactic IR as a point in a space of IR approaches.
(Diagram: a three-dimensional space whose origin (0, 0, 0), labeled Pure Syntax, corresponds to NL, Word, and String Similarity.)
(1) Ambiguity: Natural Language → Formal Language (FL)
(Diagram: the NL2FL axis, from NL at the origin to FL at 1.)
E.g., baby → C(baby): a human child
      print → C(print): a visible indication made on a surface
(2) Complex Concepts: Words → Multi-word Phrases
(Diagram: the W2P axis, from Word at the origin through +Noun Phrase and +Verb Phrase to Free Text at 1.)
E.g., computer table → C(computer table)
      A laptop computer is on a coffee table → {C(laptop computer), C(coffee table)}
(3) Related Concepts: String Similarity → Knowledge
(Diagram: the KNOW axis, from String Similarity at the origin through +Statistical Knowledge, +Lexical Knowledge, and +Ontological Knowledge to Complete Knowledge at 1.)
E.g., "carnivores" ≠ "dog", but C(carnivores) ⊒ C(dog)
Semantic Continuum
(Diagram: the full space spanned by the NL2FL, W2P, and KNOW axes, from Pure Syntax at (0, 0, 0) to Full Semantics at (1, 1, 1), with C-Search positioned between them.)
C-Search in the Semantic Continuum
- NL2FL axis: lack of background knowledge
  - It is not always possible to find a concept corresponding to a given word (e.g., the concept does not exist in the lexical database). In this case, the word itself is used as the identifier for a concept.
- W2P axis: descriptive phrases
  - (Complex) concepts are extracted from descriptive phrases:
    descriptive phrase ::= noun phrase {OR noun phrase}
  - E.g., C(A little dog OR a huge cat) = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
- KNOW axis: lexical knowledge
  - We use synonyms, hyponyms, and hypernyms
  - Semantic matching → search for related complex concepts
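A minimal sketch of this extraction step, under simplifying assumptions: the hard-coded sense table and stop-word list stand in for WordNet-based disambiguation and real NLP, and a complex concept is encoded as a set of conjunctive clauses (a DNF). Note the fallback on the NL2FL axis: a word with no known sense is kept as its own concept identifier.

```python
SENSES = {"little": "little-2", "dog": "dog-1",
          "huge": "huge-1", "cat": "cat-3"}
STOP_WORDS = {"a", "an", "the"}

def phrase_to_concept(phrase):
    # descriptive phrase ::= noun phrase {OR noun phrase}
    clauses = set()
    for noun_phrase in phrase.lower().split(" or "):
        atoms = frozenset(SENSES.get(w, w)         # fall back to the word
                          for w in noun_phrase.split()
                          if w not in STOP_WORDS)  # itself (NL2FL gap)
        clauses.add(atoms)
    return clauses

# {frozenset({'little-2', 'dog-1'}), frozenset({'huge-1', 'cat-3'})}
print(phrase_to_concept("A little dog OR a huge cat"))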
C-Search in the Semantic Continuum
(Diagram: the same space, now locating C-Search at NL&FL on the NL2FL axis, +Descriptive Phrase on the W2P axis, and +Lexical Knowledge on the KNOW axis.)
C-Search via Inverted Indices
- Moving from syntactic IR to C-Search does not require the introduction of new data structures or retrieval models.
- The current implementation of C-Search:
  - Model: bag of concepts (representation), Boolean model (retrieval), vector space model (ranking)
  - Data_Structure: inverted index
  - Term: an atomic or a complex concept
  - Match: semantic matching of concepts
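One way to see why no new machinery is needed, as a hedged sketch: semantic matching can be reduced to expanding the query concept into the more specific concepts present in the index, followed by ordinary posting-list lookups. The MORE_SPECIFIC table below is a toy stand-in for WordNet-derived knowledge; all identifiers are illustrative.

```python
MORE_SPECIFIC = {  # concept -> concepts known to be more specific (⊑)
    "canine": {"dog-1", "wolf-1"},
}

def search(index, query_concept):
    # Match the query concept and everything subsumed by it,
    # then take the union of the ordinary posting lists.
    concepts = {query_concept} | MORE_SPECIFIC.get(query_concept, set())
    hits = set()
    for c in concepts:
        hits |= index.get(c, set())
    return hits

index = {"dog-1": {1, 3}, "cat-3": {2}}  # concept -> posting list
print(search(index, "canine"))           # {1, 3}
```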
C-Search (Example: Inverted Index)
(The slide revisits the inverted index example with concepts, rather than words, as terms.)
Concept Matching
Goal: find the set of document concepts matching a query concept:
  C_ms(C_q) = {C_d | C_d ⊑ C_q}
1st approach: directly via S-Match
- Sequentially iterate through all document concepts
- Compare each document concept with the query concept (using S-Match)
- Collect those concepts for which S-Match returns "more specific" (⊑)
- It can be slow, because the number of document concepts can exceed 10^6
2nd approach: via inverted indices (brief overview)
- A-index → indexes atomic concepts by more general atomic concepts
- ⊓-index → indexes conjunctive clauses by their components (i.e., atomic concepts)
- ⊔-index → indexes DNF formulas by their components (i.e., conjunctive clauses)
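The first approach can be rendered directly; `is_more_specific` below is a hypothetical stand-in for a call to S-Match, not its real API, and running it once per document concept is exactly what makes this approach slow on large collections.

```python
def match_sequentially(doc_concepts, query_concept, is_more_specific):
    # C_ms(C_q) = {C_d | C_d ⊑ C_q}, computed by brute force.
    return {c_d for c_d in doc_concepts
            if is_more_specific(c_d, query_concept)}
```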
Concept Indices (An Example)
Consider the following concept:
  C1 = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
Fragments of the concept indices for document concept C1 (sense numbers omitted):

  A-index (atomic concept → more specific atomic concepts):
    A5 (canine) → A2, …
    A6 (feline) → A4, …

  ⊓-index (atomic concept → conjunctive clauses containing it):
    A1 (little) → C2, …
    A2 (dog)    → C2, …
    A3 (huge)   → C3, …
    A4 (cat)    → C3, …

  ⊔-index (conjunctive clause → DNF formulas containing it):
    C2 (little ⊓ dog) → C1, …
    C3 (huge ⊓ cat)   → C1, …
Concept Retrieval (An Example)
0. Query concept: Cq = canine ⊔ feline
1. For each atomic concept → more specific atomic concepts
   - Search the A-index
   - E.g., canine → {dog, wolf, …} and feline → {cat, lion, …}
2. For each atomic concept → more specific conjunctive clauses
   - Search the ⊓-index
   - E.g., dog → {C2 = little ⊓ dog, …} and cat → {C3 = huge ⊓ cat, …}
   - (Note that canine → {C2 = little ⊓ dog, …} and feline → {C3 = huge ⊓ cat, …})
3. For each disjunctive clause → more specific conjunctive clauses
   - Take the union of the conjunctive clauses
   - E.g., canine ⊔ feline → {C2 = little ⊓ dog, C3 = huge ⊓ cat, …}
4. For each disjunctive clause → more specific DNF formulas
   - Search the ⊔-index
   - E.g., canine ⊔ feline → {C1 = (little ⊓ dog) ⊔ (huge ⊓ cat), …}
5. …
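A runnable sketch of these steps, under an assumed encoding: an atomic concept is a string, a conjunctive clause a frozenset of atoms, and a DNF document concept a frozenset of clauses. The three dicts play the roles of the A-, ⊓-, and ⊔-indices; their contents mirror the C1 example (sense numbers omitted) and are illustrative. A candidate formula from the ⊔-index is kept only if every one of its clauses was matched, which is what C_d ⊑ C_q requires for a DNF.

```python
LITTLE_DOG = frozenset({"little", "dog"})
HUGE_CAT = frozenset({"huge", "cat"})
C1 = frozenset({LITTLE_DOG, HUGE_CAT})

A_INDEX = {"canine": {"dog"}, "feline": {"cat"}}      # atom -> more specific atoms
AND_INDEX = {"dog": {LITTLE_DOG}, "cat": {HUGE_CAT},  # atom -> clauses containing it
             "little": {LITTLE_DOG}, "huge": {HUGE_CAT}}
OR_INDEX = {LITTLE_DOG: {C1}, HUGE_CAT: {C1}}         # clause -> DNF formulas

def retrieve(query_atoms):
    # Steps 1-3: atoms of the query disjunction -> more specific clauses.
    clauses = set()
    for atom in query_atoms:
        for specific in {atom} | A_INDEX.get(atom, set()):
            clauses |= AND_INDEX.get(specific, set())
    # Step 4: candidate DNF formulas via the ⊔-index, kept only if
    # all of their clauses were matched.
    candidates = set()
    for clause in clauses:
        candidates |= OR_INDEX.get(clause, set())
    return {f for f in candidates if f <= clauses}

print(retrieve({"canine", "feline"}) == {C1})  # True
```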
Evaluation: Settings
- Data_set_1: the Home subtree of the DMoz web directory
  - Document set: documents classified to nodes (29,506)
  - Query set: concatenations of a node's label and its parent's label (890)
  - Relevance judgments: node-document links
- Data_set_2: differs from Data_set_1 only in:
  - Document set: concatenations of the titles and descriptions of documents in DMoz
- WordNet is used as the lexical database
- GATE is used as the NLP tool
- Lucene is used as the inverted index
Evaluation Results
(The slide shows the evaluation results on Data_set_1 and Data_set_2.)
Conclusion and Future Work
Conclusion
- In C-Search, syntactic IR is extended with a semantics layer
- C-Search performs as well as syntactic search, while allowing for an improvement when semantics is available
- In principle, C-Search supports a continuum from purely syntactic IR to fully semantic IR, in which indexing and retrieval can be performed at any point of the continuum, depending on how much semantics is available
Future work
- Development of a more accurate concept extraction algorithm
- Development of document relevance metrics based on both the syntactic and the semantic similarity of query and document descriptions
- Allowing the semantic scope to be specified (e.g., equivalence, more/less general, disjoint)
- Comparing the performance of the proposed solution with state-of-the-art syntactic IR systems on a syntactic IR benchmark
Thank You!