Of Search and Semantics
Patrick Pantel
NSF Symposium on Semantic Knowledge Discovery, Organization and Use
November 15, 2008
Vannevar Bush proposes to build a body of knowledge for all mankind: Memex. Bush: "technical difficulties of all sorts have been ignored." Semantics captured by an associative trail, personal comments, and side trails. Memex, in the form of a desk, would instantly bring files and material on any subject to the operator’s fingertips.

Gerard Salton, father of modern search, introduces concepts such as the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), and relevance feedback.

1990: the first search engine, Archie, from McGill, matches keyword queries against a database of Web filenames.
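Salton's TF-IDF weighting can be sketched as follows. This is a minimal illustration of the idea, not his original SMART implementation; the toy documents are invented for the example:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the raw term frequency within a document; IDF is
    log(N / df), where df is the number of documents containing
    the term.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy corpus: "camera" appears in 2 of 3 documents, so its IDF
# is log(3/2), lower than the rarer terms "review" and "deal".
docs = [["cheap", "camera"], ["camera", "review"], ["cheap", "deal"]]
w = tf_idf(docs)
```

In the vector space model, each document becomes a vector of these weights, and queries are matched against documents by vector similarity rather than exact keyword lookup.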
Yahoo! is founded around a taxonomical organization of the Web… manually.

Key innovator: semantic models for ranking.
Key innovator: paid search results.
Advanced search features + first SE to allow natural language queries.
Semantics and Query Logs: technologies developed/applied during the DARPA Tipster program and TREC make it commercial: spelling correction, query reformulations, "also try", …
Future of search lies in a deep understanding and matching of the information request behind user queries. Natural language questions answered by editors.

Semantic repositories and user-annotated content grow rapidly. Semantic search engines emerge.
Search Assist Technology
Smart Snippets
Tapas Kanungo et al.
Aggregate star rating. Example review as summary. Current prices at merchant sites. Images, maps, specs, …

Opportunities:
• Marriage of information extraction, content analysis, and query intent modeling
• User experience design

Key Technologies:
• IE: entity detection and salience; attribute detection
• CA: text classification, aboutness, information fusion
• QIM: entity detection, intent/task understanding
Task Modeling
Ana-Maria Popescu

Task Modeling
• What is the user trying to do?
– Find product reviews of the Nikon D300?
– Buy a Nikon D300?
– Find support for her camera?
• UED: enriching the search experience
– How can we make use of this knowledge?
• Key Technologies
– Entity detection
– Document classification
– Intent modeling / detection
Research Problem: Intent Synonymy
Is battery life a synonym of image quality? How can we automatically discover intent synonyms? Example intent categories: Review intent, Transactional intent.

Intent Synonymy
Key Assumption: one intent per session; a searcher’s intent remains the same within a search session.
Intent Synonyms: methodology very similar to constructing a dictionary of distributionally similar words.
Intent Discovery and Chaining via Clustering. Example clusters:
• Review: best, thin, rate, compare kodak vs. canon, small portable, easy use, high zoom, rate best, battery life, user comments
• Price: where can I get cheap, black Friday sale, christmas sales, cheap, buy now pay later, dell discount, great deals, overstock.com
• Support: common problems, fault codes, installing new, operating instructions, won’t heat, schematics, keeps shutting
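The methodology above, treating intent phrases like words in a distributional-similarity dictionary, could be sketched as follows. The `contexts` co-occurrence counts are invented for illustration, not real query-log data:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical counts of entities each intent phrase co-occurs with
# across search sessions (under the one-intent-per-session assumption).
contexts = {
    "battery life":  {"nikon d300": 4, "canon g9": 3, "kodak": 1},
    "image quality": {"nikon d300": 5, "canon g9": 2, "kodak": 2},
    "great deals":   {"overstock.com": 6, "dell": 3},
}

sim = cosine(contexts["battery life"], contexts["image quality"])
```

Phrases that share many session contexts score high and become intent-synonym candidates; a phrase with disjoint contexts, like "great deals", scores zero against both review phrases.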
Road Ahead
Key Applications
Enabling Technologies: Entity Detection, Attribute Mining, Entailment, Distributional Similarity, Intent Synonymy, Anaphora Resolution, Content Analysis, Text Classification
Dynamic Similarity Modeling
Eric Crestan and Vishnu Vyas

SeeLEx: Seed List Expansion
• SeeLEx, developed at Yahoo!, is a weakly supervised system for mining entity lists based on distributional similarity.
• Seed list: Honda, Jaguar, Peugeot, Mercedes, Ford
SeeLEx: Seed List Expansion
From the seeds Honda, Jaguar, Peugeot, Mercedes, Ford, the system expands to: Opel, Nissan, Mazda, Fiat, Lexus, Toyota, Porsche, Volkswagen, Hyundai, Subaru, Renault, Saab. Errors creep in from ambiguous seeds: lion, cheetah, caracal, puma, Clinton, Carter, Eisenhower.

What is a caracal? Endangered, carnivore, fast, hunted, …
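A seed-list expansion step in this style can be sketched as: score every non-seed candidate by its average distributional similarity to the seed set and keep the top-ranked ones. This is a toy sketch, not Yahoo!'s SeeLEx implementation, and the `similarity` matrix is invented data:

```python
def expand(seeds, similarity, k=3):
    """Rank non-seed candidates by mean similarity to the seed set."""
    candidates = {c for s in seeds for c in similarity.get(s, {})} - set(seeds)
    scored = {
        c: sum(similarity.get(s, {}).get(c, 0.0) for s in seeds) / len(seeds)
        for c in candidates
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Toy pairwise distributional similarities from seed to candidate.
# 'jaguar' is ambiguous (car brand vs. animal), so it also gives
# weight to 'puma', illustrating how error terms creep in.
similarity = {
    "honda":   {"toyota": 0.9, "mazda": 0.8, "puma": 0.1},
    "jaguar":  {"toyota": 0.7, "mazda": 0.6, "puma": 0.6},
    "peugeot": {"toyota": 0.8, "renault": 0.9, "puma": 0.0},
}

expanded = expand(["honda", "jaguar", "peugeot"], similarity)
```

Averaging over the whole seed set is what dampens ambiguity: "puma" scores well against "jaguar" alone but poorly against the set, which is one intuition behind asking how many seeds are needed.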
Questions Asked and Conclusions
• What is the effect of corpus size on expansion accuracy? Significant boost in performance.
• What is the effect of corpus quality on expansion accuracy? Significant boost in performance.
• Does seed quality matter? Great variance in performance depending on seed set composition.
• How many seeds are on average optimal for expansion accuracy, and are more seeds better than fewer? Somewhat surprising: 5-20 seeds is in general sufficient; more seeds gain little but don’t hurt.

Gold Standard Sets
50 sets extracted from Wikipedia (2007/12); average of 208 instances per set; minimum of 11; maximum of 1,116; total of 10,377 instances.
Archbishops of Canterbury, Astronomers, Australian Airlines, Australian A-League Football Teams, Australian Cities, Australian Prime Ministers, Best Actress Academy Award Winners, Biology Disciplines, Bottled Water Brands, Boxing Weight Classes, California Counties, Canadian Stadiums, Canadian Universities, Charitable Foundations, Classical Pianists, Cocktails, Cognitive Scientists, Composers, Countries, Electronic Companies, Elements, English Cities, English Poets, English Premier Football Clubs, First Ladies, Formula One Drivers, French Artists, Greek Gods, Greek Islands, Irish Theatres, Italian Regions, Japanese Martial Arts, Japanese Prefectures, Male Tennis Players, Maryland Counties, New Zealand Songwriters, NHL Hockey Teams, North American Mountain Ranges, Presidents of Argentina, Rivers in England, Roman Emperors, Russian Authors, Spanish Provinces, Stars, Superheroes, Texas Counties, U.S. Army Generals, U.S. Federal Departments, U.S. Internet Companies
Corpora
Table: Corpora used to build our expansion models.

Corpus      Sentences (millions)   Words (millions)   Unique tokens (millions)
k100        5,201                  217,940            542
k020†       1,040                  43,588             108
k004†       208                    8,717              22
Wikipedia   30                     721                34
† Estimated from k100 statistics.

The Effect of: Corpus Size and Corpus Quality
System and Corpora Analysis (Precision vs. Recall)
[Precision-recall plot: full.k100, full.k020, full.k004, full.wikipedia]

Takeaway: Corpus Size Matters
• k100 yields 13% higher R-precision than a corpus 1/5th its size
• k100 yields 53% higher R-precision than a corpus 1/25th its size

Takeaway: Corpus Quality Matters
• Wikipedia yields similar performance to a web corpus 60 times its size

Opportunity: Model a more natural SeeLEx usage
• What is the performance of the system on typical sets searched by editors?

k100: List Effect (Precision vs. Recall)
[Precision-recall plot per gold-standard list]

Opportunity: Bucket sets to find predictable behaviors
• Our gold standard sets vary significantly in their expansion performance
• Look into predictability of open vs. closed class sets, large vs. small class sets, and types of sets such as locations, people, …
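R-precision, the metric behind the 13% and 53% figures above, is precision at cutoff R, where R is the size of the gold-standard set. A minimal version, with a toy ranked list for illustration:

```python
def r_precision(ranked, gold):
    """Precision at rank R = |gold| over a ranked candidate list."""
    r = len(gold)
    hits = sum(1 for item in ranked[:r] if item in gold)
    return hits / r if r else 0.0

gold = {"honda", "toyota", "mazda", "ford"}
ranked = ["toyota", "honda", "puma", "ford", "mazda"]
score = r_precision(ranked, gold)  # 3 of the top 4 are correct
```

Because the cutoff adapts to each gold set's size, R-precision lets sets as small as 11 instances and as large as 1,116 be averaged on an equal footing.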