
Cost-Effective Conceptual Design over Taxonomies
Yodsawalai Chodpathumwan (University of Illinois at Urbana-Champaign), Ali Vakilian (Massachusetts Institute of Technology), Arash Termehchy and Amir Nayyeri (Oregon State University)


  1. Cost-Effective Conceptual Design over Taxonomies
     Yodsawalai Chodpathumwan, University of Illinois at Urbana-Champaign
     Ali Vakilian, Massachusetts Institute of Technology
     Arash Termehchy and Amir Nayyeri, Oregon State University

  2. Users have to query over unstructured datasets.
     Examples: medical articles, HTML pages, …
     Wikipedia article excerpts:
     <article id=1> John Adams has been a former member of Ohio House of Representatives from 2007 to 2014 …
     <article id=2> John Adams is a composer whose music is inspired by nature …
     <article id=3> John Adams is a public high school located on the east side of Cleveland, Ohio, …
     Keyword query: “John Adams, politician”
     Ranked list: <article id=2>, <article id=3>, <article id=1>
     Precision = #returned relevant answers / #returned answers
     precision = 1/3. Poor ranking quality! Only article id 1 is about a politician.
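The precision figure on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `precision` and the identifiers `article1` to `article3` are illustrative names, not part of the slides.

```python
def precision(ranked, relevant, k=None):
    """Fraction of the returned answers (top k if k is given) that are relevant."""
    returned = ranked if k is None else ranked[:k]
    if not returned:
        return 0.0
    hits = sum(1 for doc in returned if doc in relevant)
    return hits / len(returned)


# Ranked list from the slide: articles 2, 3, 1; only article 1 is about a politician.
ranked_list = ["article2", "article3", "article1"]
relevant = {"article1"}
print(precision(ranked_list, relevant))
```

Passing `k=3` gives precision at 3, the metric used later in the experiments.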

  3. Annotating a dataset helps answer the query.
     We can annotate the dataset using concepts from a taxonomy.
     Taxonomy: DBpedia taxonomy
     • Tree-shaped graph
     • Vertex = concept
     • Edge = subclass relation
     thing
       place
         populated place
           state
           city
       agent
         person
           politician
           athlete
           artist
         organization
           legislature
           school
     Wikipedia article excerpts:
     <article id=1> John Adams has been a former member of Ohio House of Representatives from 2007 to 2014 … (annotated: politician, legislature)
     <article id=2> John Adams is a composer whose music is inspired by nature … (annotated: artist)
     <article id=3> John Adams is a public high school located on the east side of Cleveland, Ohio, … (annotated: school, city, state)

  4. Users can submit structured queries over the annotated dataset.
     Wikipedia article excerpts:
     <article id=1> John Adams has been a former member of Ohio House of Representatives from 2007 to 2014 … (annotated: politician, legislature)
     <article id=2> John Adams is a composer whose music is inspired by nature … (annotated: artist)
     <article id=3> John Adams is a public high school located on the east side of Cleveland, Ohio, … (annotated: school, city, state)
     Structured keyword query: Politician(“John Adams”)
     Ranked list: <article id=1>
     precision = 1/1 = 1. Perfect!

  5. Concept annotation is costly.
     Instances of concepts are annotated by a program called a concept annotator.
     It is costly to develop, execute, and maintain a concept annotator.
     • Hand-tuned program rules: need experts, time-consuming
     • Machine learning techniques: lots of relevant features, thousands of rules
     • Executing a concept annotator may take several days and require lots of computational resources
     • Datasets evolve over time: concept annotators must be rewritten and re-executed

  6. It is not always possible to annotate all concepts.
     Ideally, we would like to annotate instances of all concepts in a given taxonomy from a dataset to answer all queries effectively.
     In reality, we can only annotate instances of some concepts.
     DBpedia taxonomy: thing → place, agent; place → populated place; agent → person, organization; populated place → state, city; person → politician, athlete, artist; organization → legislature, school
     Wikipedia article excerpts:
     <article id=1> John Adams has been a former member of Ohio House of Representatives from 2007 to 2014 … (annotated: person)
     <article id=2> John Adams is a composer whose music is inspired by nature … (annotated: person)
     <article id=3> John Adams is a public high school located on the east side of Cleveland, Ohio, … (annotated: organization)

  7. Annotating a dataset with only a subset of concepts from a taxonomy still helps.
     Wikipedia article excerpts:
     <article id=1> John Adams has been a former member of Ohio House of Representatives from 2007 to 2014 … (annotated: person)
     <article id=2> John Adams is a composer whose music is inspired by nature … (annotated: person)
     <article id=3> John Adams is a public high school located on the east side of Cleveland, Ohio, … (annotated: organization)
     Structured keyword query: Politician(“John Adams”)
     Ranked list: <article id=2>, <article id=1>
     precision = 1/2 > 1/3, where 1/3 is the precision over the unannotated dataset.

  8. Many taxonomies contain a large number of concepts.
     • Medical Subject Headings (MeSH), Plant Ontology, …
     • An organization has a limited amount of resources
     • Annotate a dataset using only a subset of concepts from a given taxonomy: a conceptual design for the data

  9. Which conceptual design to pick?
     Find a cost-effective subset of concepts from an input taxonomy that maximizes the effectiveness of answering queries, measured by precision@k.
     [Figure: a dataset, the taxonomy (thing → place, agent; place → populated place; agent → person, organization; populated place → state, city; person → politician, athlete, artist; organization → legislature, school), and a query workload.]
     “I can only annotate a few concepts over this dataset.”
     “I want the largest average precision over these queries.”

  10. Problem of Cost-Effective Conceptual Design (CECD)
     Given a dataset, a query workload, a taxonomy, and a fixed budget, we would like to select a conceptual design $T$ such that
     • $\sum_{D \in T} x(D) \le C$, where $x$ is the cost function and $C$ is the fixed budget
     • $T$ provides the largest precision@k of answering queries among all designs that satisfy the budget constraint.
     Let’s quantify the amount of improvement for precision@k: the queriability of a design.

  11. Partitions of a conceptual design
     Given a design $T$ over a taxonomy $Y$, the partition of a concept $d \in T$, written $\mathrm{part}(d)$, is the subset of leaf nodes in $Y$ such that, for every concept $e \in \mathrm{part}(d)$, the lowest ancestor of $e$ in $T$ is $d$, or $e = d$.
     Taxonomy: thing → place, agent; place → populated place; agent → person, organization; populated place → state, city; person → politician, athlete, artist; organization → legislature, school
     Example: $T$ = {agent, person}
     • $\mathrm{part}(\text{agent})$ = {legislature, school}
     • $\mathrm{part}(\text{person})$ = {politician, athlete, artist}
     • $\mathrm{free}(T)$ = {state, city}
     Each leaf concept in $Y$ belongs to at most one partition of a design $T$.
     The set of leaf concepts that do not belong to any partition of $T$ is called $\mathrm{free}(T)$.
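The definition above can be sketched in Python by walking from each leaf toward the root until a concept of the design is found. The child-to-parent dictionary `PARENT` and the function names `leaves` and `partitions` are assumptions made for this illustration; the tree and the example design are the ones shown on the slide.

```python
# Child -> parent edges of the example taxonomy (the root "thing" has no parent).
PARENT = {
    "place": "thing", "agent": "thing",
    "populated place": "place",
    "person": "agent", "organization": "agent",
    "state": "populated place", "city": "populated place",
    "politician": "person", "athlete": "person", "artist": "person",
    "legislature": "organization", "school": "organization",
}


def leaves(parent):
    """Concepts that never appear as a parent."""
    internal = set(parent.values())
    return (set(parent) | internal) - internal


def partitions(parent, design):
    """Return (part, free): part maps each design concept to its leaf partition."""
    part = {d: set() for d in design}
    free = set()
    for leaf in leaves(parent):
        node = leaf
        # Climb from the leaf itself until we reach a concept in the design.
        while node is not None and node not in design:
            node = parent.get(node)
        if node is None:
            free.add(leaf)       # no concept of the design covers this leaf
        else:
            part[node].add(leaf)
    return part, free


part, free = partitions(PARENT, {"agent", "person"})
print(part, free)
```

Because the climb stops at the first design concept it meets, each leaf lands in at most one partition, which matches the statement on the slide.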

  12. Conceptual design $T$ helps answer queries whose concepts are in partitions of $T$.
     Example: $D$ = “politician”, $T$ = {person, …}
     • $v(d)$: popularity of concept $d$ in the query workload. The portion of queries about “politician” is $v(\text{politician})$.
     • $e(d)$: frequency of documents of concept $d$ in the dataset. The fraction of “politician” documents amongst “person” documents is $\frac{e(\text{politician})}{e(\text{person})}$.
     [Figure: a query workload with queries such as school(…), politician(…), artist(…), and a dataset whose documents are annotated with agent, person, and organization.]
     Improvement is $\frac{v(\text{politician})\, e(\text{politician})}{e(\text{person})}$.
     Total improvement from the partition of “person” is
     $\frac{v(\text{politician})\, e(\text{politician})}{e(\text{person})} + \frac{v(\text{artist})\, e(\text{artist})}{e(\text{person})} + \cdots = \sum_{d \in \mathrm{part}(\text{person})} \frac{v(d)\, e(d)}{e(\text{person})}$
     Total improvement from design $T$ is
     $\sum_{Q \in T} \sum_{d \in \mathrm{part}(Q)} \frac{v(d)\, e(d)}{e(Q)}$

  13. The contribution of a design for queries whose concepts are not in any partition of the design
     Generally, the concepts with more instances in the dataset are more likely to appear in the top answers. Thus, it is more likely that they contain some relevant answers for the query.
     [Figure: a dataset of “organization” and “person” documents, the returned answers, and the relevant answers among them.]
     The total improvement by concepts in $\mathrm{free}(T)$, the leaf concepts outside every partition of $T$, is
     $\sum_{d \in \mathrm{free}(T)} v(d)\, e(d)$
     where $v(d)$ is the portion of queries whose concepts are $d$ and $e(d)$ is the portion of instances in the dataset that belong to $d$.

  14. Formal definition of the Cost-Effective Conceptual Design Problem
     Given a taxonomy $Y$, a dataset $E$, a query workload $R$, and a budget $C$, find a conceptual design $T$ over $Y$ such that
     $\sum_{d \in T} x(d) \le C$
     and $T$ maximizes the queriability
     $RV(T) = \sum_{Q \in T} \sum_{d \in \mathrm{part}(Q)} \frac{v(d)\, e(d)\, \mathrm{qs}(Q)}{e(Q)} + \sum_{d \in \mathrm{free}(T)} v(d)\, e(d)$
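The queriability formula can be evaluated directly once the partitions of a design are known. The sketch below is illustrative only: the function name `queriability`, its argument layout, and every number in the example are hypothetical. The slides do not define the factor qs(Q), so the sketch treats it as an optional per-concept weight that defaults to 1, which is an assumption.

```python
def queriability(part, free, v, e, qs=None):
    """RV(T) for a design described by its partitions.

    part: design concept Q -> set of leaf concepts in the partition of Q
    free: leaf concepts outside every partition
    v:    concept -> popularity in the query workload
    e:    concept -> frequency of its documents in the dataset
    qs:   optional concept -> weight; assumed to be 1 when missing
    """
    qs = qs or {}
    total = 0.0
    for q, leaf_set in part.items():
        if e[q] == 0:
            continue  # no documents of Q, so nothing to gain from annotating it
        gain = sum(v[d] * e[d] for d in leaf_set)
        total += qs.get(q, 1.0) * gain / e[q]
    total += sum(v[d] * e[d] for d in free)
    return total


# Hypothetical popularities and frequencies, chosen only for illustration.
v = {"politician": 0.5, "artist": 0.25, "school": 0.25}
e = {"politician": 0.25, "artist": 0.25, "school": 0.5, "person": 0.5}

with_person = queriability({"person": {"politician", "artist"}}, {"school"}, v, e)
no_design = queriability({}, {"politician", "artist", "school"}, v, e)
print(with_person, no_design)
```

With these made-up numbers, annotating "person" raises the score from 0.3125 to 0.5, because dividing by e(person) rewards narrowing the candidate documents to those of one concept.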

  15. We have proposed an approximation algorithm called the “Level-wise Algorithm” (LW).
     Find a design whose concepts are all from the same level of the input taxonomy.
     [Figure: three levels of a taxonomy with one queriability value per level. Upper level: …, Infections, …, with $RV(\{\text{Infections}, \ldots\})$. Middle level: Skin-Infections, Bone-Infections, Eye-Infections, …, with $RV(\{\text{Eye-Infections}, \ldots\})$. Leaf level: Trachoma, Ecthyma, Erysipelas, Periostitis, Spondylitis, Hordeolum, …, with $RV(\{\text{Trachoma}, \ldots\})$.]
     • Find the design with maximum queriability for each level using the APM algorithm [Termehchy, SIGMOD’14]. APM returns a design with the largest queriability over a set of concepts.
     • $T_{\mathrm{level}}$ ← the design with $\max\{RV, RV, RV, \ldots\}$ over the levels
     • $T_{\mathrm{leaf}}$ ← the leaf concept with the largest popularity ($v$)
     • Return the design with $\max\{RV(T_{\mathrm{level}}), RV(T_{\mathrm{leaf}})\}$.
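The control flow of the level-wise steps can be sketched as follows. This is not the authors' implementation: APM is not described on the slides, so `best_on_level` is a brute-force stand-in that enumerates the affordable subsets of one level and is only usable on tiny inputs. The callable `rv`, the budget check on the leaf candidate, and all names and numbers in the example are assumptions for illustration.

```python
from itertools import combinations


def depth(node, parent):
    """Number of edges from node up to the root."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d


def best_on_level(concepts, cost, budget, rv):
    """Brute-force stand-in for APM: best affordable subset of one level."""
    best, best_score = frozenset(), rv(frozenset())
    for r in range(1, len(concepts) + 1):
        for subset in combinations(sorted(concepts), r):
            if sum(cost[c] for c in subset) <= budget:
                score = rv(frozenset(subset))
                if score > best_score:
                    best, best_score = frozenset(subset), score
    return best, best_score


def level_wise(parent, cost, budget, v, rv):
    nodes = set(parent) | set(parent.values())
    levels = {}
    for n in nodes:
        levels.setdefault(depth(n, parent), set()).add(n)

    # T_level: the best single-level design across all levels.
    t_level, level_score = frozenset(), rv(frozenset())
    for concepts in levels.values():
        cand, score = best_on_level(concepts, cost, budget, rv)
        if score > level_score:
            t_level, level_score = cand, score

    # T_leaf: the most popular leaf (assumed here to also respect the budget).
    leaf_nodes = nodes - set(parent.values())
    affordable = [leaf for leaf in leaf_nodes if cost[leaf] <= budget]
    if affordable:
        top = max(affordable, key=lambda leaf: v.get(leaf, 0.0))
        t_leaf = frozenset([top])
        if rv(t_leaf) > level_score:
            return t_leaf
    return t_level


# Toy input: hypothetical taxonomy, uniform costs, and a placeholder score
# function that stands in for the real queriability RV.
PARENT = {"person": "agent", "organization": "agent",
          "politician": "person", "artist": "person", "school": "organization"}
COST = {c: 1 for c in set(PARENT) | set(PARENT.values())}
SCORE = {"agent": 0.1, "person": 0.6, "organization": 0.2,
         "politician": 0.5, "artist": 0.3, "school": 0.1}
V = {"politician": 0.5, "artist": 0.3, "school": 0.2}


def toy_rv(design):
    return sum(SCORE[c] for c in design)


print(level_wise(PARENT, COST, 1, V, toy_rv))
```

In practice `rv` would be the queriability of the slides and `best_on_level` would be replaced by APM; the sketch only shows how the per-level candidates and the single-leaf candidate are compared.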

  16. The Level-wise algorithm has a bounded approximation ratio over a special case of the CECD problem.
     • Sometimes it is easier to use and manage a conceptual design whose concepts are not subclasses or superclasses of each other.
     • We call such a design a disjoint design.
     • We may restrict the solutions of the CECD problem to disjoint designs.
     • We call this problem the disjoint CECD problem.
     Theorem: The Level-wise algorithm is an $O(\log |D|)$-approximation for the disjoint CECD problem.

  17. Experiment Settings
     • 8 tree taxonomies extracted from the YAGO ontology (T1-T8)
     • Number of concepts between 10 and 400, with heights from 2 to 9
     • 8 datasets of articles from the English Wikipedia collection
     • Bing (bing.com) query log whose relevant answers are Wikipedia articles
     • Effectiveness metric: precision at 3 ($p@3$)
     • Two cost models: uniform cost and random cost
