Web, Semantic, and Social Information Retrieval Gerhard Weikum weikum@mpi-inf.mpg.de http://www.mpi-inf.mpg.de/~weikum/ EDBT 2007 Summer School, Bolzano, Italy, September 3, 2007
Adding Semantics to IR (or Adding Ranking to DB)
• IR systems, unstructured data (documents): unstructured search (keywords) in search engines + Digital Libraries + Enterprise Search + Web 2.0
• IR systems, structured data (records): keyword search on relational graphs (BANKS, Discover, DBexplorer, …)
• DB systems, structured data (records): structured search (SQL, XQuery) + Text + Relax. & Approx. + Ranking
• DB systems, unstructured data (documents): querying entities & relations from IE (Libra, ExDB, NAGA, …)
Trend: quadrants getting blurred towards DB&IR technology integration
Overview
• Part 1: Web IR
  • State of the Art
  • Scalability Challenge
  • Quality Challenge
  • Personalization
  • Research Opportunities
• Part 2: Semantic & Social IR
  • Ontologies in XML IR
  • Entity Search and Ranking
  • Graph IR
  • Web 2.0 Search and Mining
  • Research Opportunities
XML IR on Heterogeneous Data
Union of heterogeneous sources without global schema
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
Similarity-aware XPath:
//~Professor [//* = "~SB"]
[//~Course [//* = "~IR"]]
[//~Research [//* = "~XML"]]
[Figure: two example XML trees from different sources. A Professor element (Name: Gerhard Weikum; Address with City: SB and Country: Germany; Teaching with a Course titled IR, including Description, Syllabus, and Literature with Book and Article entries; Research with a Project titled "Intelligent Search of Heterogeneous XML Data", Funding: EU) and a Lecturer element (Name: Ralf Schenkel; Address: Max-Planck Institute for Informatics, Germany; Activities with a Seminar on "Ranked retrieval …", a Scientific activity named "INEX task coordinator (Initiative for the Evaluation of XML retrieval …)" with Sponsor: EU, and Other).]
XML IR on Heterogeneous Data
Union of heterogeneous sources without global schema
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
Similarity-aware XPath:
//~Professor [//* = "~Saarbruecken"]
[//~Course [//* = "~IR"]]
[//~Research [//* = "~XML"]]
Scoring and ranking:
• XML BM25 for content conditions
• ontological similarity for relaxed tag conditions
• score aggregation with probabilistic independence
• extended TA (threshold algorithm) for query execution
Query expansion model: disjunction of tags
Statistical edge weighting by Dice coefficient on the Web: 2 #(x,y) / (#x + #y)
[Figure: ontology neighborhood of "professor" with weighted edges, for example RELATED (0.48) and HYPONYM (0.749). Nodes shown: alchemist, primadonna, magician, director, artist, wizard, investigator, intellectual, researcher, scientist, scholar, lecturer, mentor, teacher, and the synonym set academic / academician / faculty member. The example XML trees are the same as on the previous slide.]
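The Dice-coefficient edge weight above can be made concrete with a small sketch. It assumes that #x and #y are the numbers of documents containing each term and #(x,y) the number containing both; the function name `dice_weight` and the toy corpus are hypothetical and only illustrate the formula, not the statistics actually gathered on the Web.

```python
from typing import Iterable, Set


def dice_weight(x: str, y: str, docs: Iterable[Set[str]]) -> float:
    """Dice coefficient 2 * #(x,y) / (#x + #y) over document-level counts."""
    n_x = n_y = n_xy = 0
    for terms in docs:
        has_x, has_y = x in terms, y in terms
        n_x += has_x            # documents containing x
        n_y += has_y            # documents containing y
        n_xy += has_x and has_y  # documents containing both
    if n_x + n_y == 0:
        return 0.0              # neither term occurs: no evidence for an edge
    return 2.0 * n_xy / (n_x + n_y)


# Hypothetical toy corpus: each document is reduced to its set of terms.
toy_docs = [
    {"professor", "lecturer", "course"},
    {"professor", "research"},
    {"lecturer", "seminar"},
    {"professor", "lecturer"},
    {"project", "xml"},
]

weight = dice_weight("professor", "lecturer", toy_docs)  # 2 * 2 / (3 + 3)
```

The weight is symmetric in x and y and lies between 0 (terms never co-occur) and 1 (terms always co-occur), which makes it convenient as a similarity score on ontology edges.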
Query Expansion with Incremental Merging [M. Theobald et al.: SIGIR 2005]
Relaxable query q: ~professor research
• expansions exp(i) = {w | sim(i,w) ≥ θ} based on ontology relatedness, modulating monotonic score aggregation
• TA scans of index lists for ∪ i∈q exp(i)
Better: dynamic query expansion with incremental merging of additional index lists
Ontology / meta-index entry for professor: lecturer: 0.7, scholar: 0.6, academic: 0.53, scientist: 0.5, ...
B+ tree index on terms, with index lists sorted by descending score (entries are id: score):
research: 92: 0.9, 67: 0.9, 52: 0.9, 44: 0.8, 55: 0.8, ...
professor: 37: 0.9, 44: 0.8, 22: 0.7, 23: 0.6, 51: 0.6, 52: 0.6, ...
lecturer: 12: 0.9, 14: 0.8, 28: 0.6, 17: 0.55, 61: 0.5, 44: 0.5, ...
scholar: 57: 0.6, 44: 0.4, 52: 0.4, 33: 0.3, 75: 0.3, ...
Efficient, robust, self-tuning
[Figure: ontology neighborhood of "professor" with edges such as Related (0.48) and Hyponym (0.749); nodes include alchemist, magician, primadonna, director, artist, wizard, investigator, intellectual, researcher, scientist, scholar, lecturer, mentor, teacher, and academic / academician / faculty member.]
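The idea of incremental merging can be sketched as follows. This is a simplified illustration, not the algorithm from the cited paper: it assumes that scores lie in [0, 1], that an entry from the list of expansion term w counts with sim(i,w) * score, and that an id keeps the maximum over all expansions. The function name `incremental_merge` is hypothetical; the postings and similarities are the ones shown on the slide.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

Posting = Tuple[int, float]  # (id, score); each index list is sorted by descending score


def incremental_merge(
    expansions: List[Tuple[str, float]],
    index: Dict[str, List[Posting]],
) -> Iterator[Tuple[int, float, str]]:
    """Yield (id, sim * score, term) in descending combined score, each id once."""
    pending = sorted(expansions, key=lambda e: -e[1])  # highest similarity first
    heap: List[Tuple[float, str, int, float]] = []     # (-combined, term, position, sim)
    seen = set()
    while True:
        # Open another list only while its best possible entry (sim * 1.0)
        # could still beat the current head of the merged stream.
        while pending and (not heap or -heap[0][0] < pending[0][1]):
            term, sim = pending.pop(0)
            postings = index.get(term, [])
            if postings:
                heapq.heappush(heap, (-sim * postings[0][1], term, 0, sim))
        if not heap:
            return
        neg, term, pos, sim = heapq.heappop(heap)
        postings = index[term]
        if pos + 1 < len(postings):
            heapq.heappush(heap, (-sim * postings[pos + 1][1], term, pos + 1, sim))
        doc = postings[pos][0]
        if doc not in seen:  # the first occurrence carries the maximum score
            seen.add(doc)
            yield doc, -neg, term


index = {
    "professor": [(37, 0.9), (44, 0.8), (22, 0.7), (23, 0.6), (51, 0.6), (52, 0.6)],
    "lecturer": [(12, 0.9), (14, 0.8), (28, 0.6), (17, 0.55), (61, 0.5), (44, 0.5)],
    "scholar": [(57, 0.6), (44, 0.4), (52, 0.4), (33, 0.3), (75, 0.3)],
}
expansions = [("professor", 1.0), ("lecturer", 0.7), ("scholar", 0.6)]

merged = list(incremental_merge(expansions, index))
top4 = [doc for doc, _, _ in merged[:4]]
```

In this sketch the lecturer list is not touched until the head of the merged stream drops below 0.7, and the scholar list not until it drops below 0.6, so a top-k consumer that stops early never opens the low-similarity lists.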
Query Expansion Example
From TREC 2004 Robust Track Benchmark:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.
Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20], ...}
135530 sorted accesses in 11.073s.
Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
...
Overview
• Part 1: Web IR
  • State of the Art
  • Scalability Challenge
  • Quality Challenge
  • Personalization
  • Research Opportunities
• Part 2: Semantic & Social IR
  ✓ Ontologies in XML IR
  • Entity Search and Ranking
  • Graph IR
  • Web 2.0 Search and Mining
  • Research Opportunities
Don't Let Me Be Misunderstood
Semantic Search:
• Keyword query: Max Planck, or concept query: Person = "Max Planck"
• Keyword query: Greek art Paris, or concept query: "Greek art" & Location = "Paris"
Entity Search: Example Google
What is lacking?
• data is not knowledge → extraction and organization
• keywords cannot express advanced user intentions → concepts, entities, properties, relations
Entity Search: Example NAGA
Query:
$x isa politician
$x isa scientist
Results: Benjamin Franklin, Paul Wolfowitz, Angela Merkel, …
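To make the query semantics concrete, the two conditions can be read as a conjunction of triple patterns that share the variable $x. The following is a minimal sketch of such a conjunctive match over a hand-written fact list; it is not NAGA's query processor and does no ranking. The names `match` and `query` and the toy facts are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def match(pattern: Triple, fact: Triple, binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend the binding so that pattern equals fact, or return None on a mismatch."""
    new = dict(binding)
    for p, f in zip(pattern, fact):
        if p.startswith("$"):             # variable: bind it, or check the existing binding
            if new.setdefault(p, f) != f:
                return None
        elif p != f:                      # constant: must match exactly
            return None
    return new


def query(patterns: List[Triple], facts: List[Triple],
          binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Enumerate all variable bindings that satisfy every pattern (nested-loop join)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    for fact in facts:
        extended = match(patterns[0], fact, binding)
        if extended is not None:
            yield from query(patterns[1:], facts, extended)


# Hand-written toy facts for illustration only.
facts = [
    ("Benjamin Franklin", "isa", "politician"),
    ("Benjamin Franklin", "isa", "scientist"),
    ("Paul Wolfowitz", "isa", "politician"),
    ("Paul Wolfowitz", "isa", "scientist"),
    ("Angela Merkel", "isa", "politician"),
    ("Angela Merkel", "isa", "scientist"),
    ("Max Planck", "isa", "scientist"),
]

results = [b["$x"] for b in query([("$x", "isa", "politician"),
                                   ("$x", "isa", "scientist")], facts)]
```

Max Planck is excluded because only one of the two patterns holds for him; a keyword query such as "politician scientist" could not express this conjunction over the same entity.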
Entity Search: Example DBLife
http://dblife.cs.wisc.edu
Entity Search
Instead of "interpreting" text with background knowledge, extract facts and search entities, attributes, and relations
Motivation and applications:
• Web search for vertical domains (products, traveling, entertainment, scholarly publications, intelligence agencies, etc.)
• preparation for natural-language QA
• step towards better Deep-Web search, digital libraries, e-science
Example systems:
• Libra (MSR), EntityRank (UIUC), ExDB (UW Seattle), NAGA (MPII), …
• probably all commercial search engines have some support for entities
Typical system architecture:
focused crawling & Deep-Web crawling → record extraction (named entity, attributes) → record linkage & aggregation (entity matching) → keyword / record search (faceted GUI) → entity ranking
Information Extraction (IE): Text to Records
Person | BirthDate | BirthPlace | ...
Max Planck | 4/23, 1858 | Kiel
Albert Einstein | 3/14, 1879 | Ulm
Mahatma Gandhi | 10/2, 1869 | Porbandar

Person | ScientificResult
Max Planck | Quantum Theory

Constant | Value | Dimension
Planck's constant | 6.626 × 10^-34 | Js

Person | Collaborator
Max Planck | Albert Einstein
Max Planck | Niels Bohr

Person | Organization
Max Planck | KWG / MPG

Extracted facts often have confidence < 1 → DB with uncertainty (probabilistic DB)
Combine NLP, pattern matching, lexicons, statistical learning
IE Technology: Rules, Patterns, Learning
For heterogeneous sources and for natural-language text:
• NLP techniques (parser, PoS tagging) for tokenization
• identify patterns (regular expressions) as features
• train statistical learners for segmentation and labeling (HMM, CRF, SVM, etc.), augmented with lexicons
• use learned model to automatically tag newly seen input
Training data (labels: <location>, <organization>, <person>, <event>, <lecture>):
The WWW conference takes place in Banff in Canada.
Today's keynote speaker is Dr. Berners-Lee from W3C.
The panel in Edinburgh, chaired by Ron Brachman from Yahoo!, …
…
Newly seen input:
PoS tags: NP NP NN IN DT NP VB IN DT ADJ NN PP NP IN CD
Sentence: Ian Foster, father of the Grid, talks at the GES conference in Germany on 05/02/07.
Extracted labels: <person> <event> <location> <date>
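The step "identify patterns (regular expressions) as features" can be illustrated with a small feature extractor. This is only a sketch under assumptions: the feature names, the tokenizer pattern, and the date pattern (two-digit fields separated by slashes, as in the example sentence) are hypothetical choices. A statistical learner such as an HMM or CRF would be trained on feature dictionaries like these; that training step is not shown.

```python
import re
from typing import Dict, List, Union

# Assumed patterns: dates like 05/02/07, otherwise word tokens or single punctuation marks.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{2}$")
TOKEN_RE = re.compile(r"\d{2}/\d{2}/\d{2}|\w+|[^\w\s]")

Features = Dict[str, Union[str, bool]]


def token_features(sentence: str) -> List[Features]:
    """Turn a sentence into one feature dictionary per token."""
    tokens = TOKEN_RE.findall(sentence)
    features: List[Features] = []
    for i, tok in enumerate(tokens):
        features.append({
            "token": tok,
            "is_capitalized": tok[0].isupper(),            # hint for names
            "is_all_caps": len(tok) > 1 and tok.isupper(),  # hint for acronyms
            "looks_like_date": bool(DATE_RE.match(tok)),     # regex pattern as feature
            "prev_token": tokens[i - 1].lower() if i > 0 else "<s>",  # left context
        })
    return features


sentence = ("Ian Foster, father of the Grid, talks at the GES conference "
            "in Germany on 05/02/07.")
feats = token_features(sentence)
by_token = {f["token"]: f for f in feats}
```

In this example the token Germany is capitalized and preceded by "in", and 05/02/07 matches the date pattern after "on"; these are the kinds of cues a learned model could weight when assigning <location> and <date> labels.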