Cost-Effective Conceptual Design over Taxonomies
Yodsawalai Chodpathumwan (University of Illinois at Urbana-Champaign), Ali Vakilian (Massachusetts Institute of Technology), Arash Termehchy and Amir Nayyeri (Oregon State University)

Most information on the web is unstructured.
• Examples: medical articles, HTML pages, …
• Users usually have to query over unstructured data.
Example: Wikipedia articles
• <article id=1>: John Adams has been a former member of Ohio House of Representatives from 2007 to 2014 …
• <article id=2>: John Adams is a composer whose music is inspired by nature …
• <article id=3>: John Adams is a public high school located on the east side of Cleveland, Ohio, …
Query: “John Adams, politician”
Ranked list: <article id=2>, <article id=3>, <article id=1>
$\text{precision@}k = \frac{\#\,\text{relevant answers returned in the top } k}{\#\,\text{answers returned in the top } k}$
precision@3 = 1/3: only <article id=1> is about a politician. Poor ranking quality!

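The precision@k definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `precision_at_k`, the use of integer article ids, and the convention of returning 0 for an empty result list are all assumptions made here.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the answers returned in the top k that are relevant."""
    top = list(ranked[:k])
    if not top:
        # Assumption: an empty result list gets precision 0 by convention.
        return 0.0
    hits = sum(1 for doc_id in top if doc_id in relevant)
    # The denominator is the number of answers actually returned in the top k,
    # which can be smaller than k.
    return hits / len(top)


# Running example: only article 1 is about the politician.
relevant = {1}
unannotated = precision_at_k([2, 3, 1], relevant, k=3)  # three answers, one relevant
shorter_list = precision_at_k([2, 1], relevant, k=3)    # two answers, one relevant
```

Because the denominator counts returned answers rather than k itself, a shorter ranked list that still contains the relevant article scores higher, which is the effect the annotated examples in the deck rely on.
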
Annotating a dataset improves the effectiveness of answering queries.
Taxonomy:
• A DAG
• Vertex = concept
• Edge = subclass relation
• We will consider tree taxonomies.
DBpedia taxonomy (figure):
thing
  place
    populated place
      city
      state
  agent
    person
      athlete
      artist
      politician
    organization
      school
      legislature
Annotated Wikipedia articles (figure):
• <article id=1>: John Adams has been a former member of Ohio House of Representatives from 2007 to 2014 … (annotations: politician, legislature)
• <article id=2>: John Adams is a composer whose music is inspired by nature … (annotation: artist)
• <article id=3>: John Adams is a public high school located on the east side of Cleveland, Ohio, … (annotations: school, city, state)

Users can submit queries with concepts over an annotated dataset.
[Figure: the three John Adams Wikipedia articles, annotated with the concepts politician, artist, school, legislature, city, and state.]
Query: Politician(“John Adams”)
Ranked list: <article id=1>
precision@3 = 1/1 = 1. Perfect!

Concept annotation is costly.
• Instances of concepts are annotated by a program called a concept annotator.
• Researchers estimate that annotating each article in the MEDLINE/PubMed dataset using concepts in the MeSH taxonomy costs about $9.4 [K. Liu, 2015].
It is costly to develop, execute, and maintain a concept annotator.
• Development:
  • Hand-tuned programming rules: need experts and thousands of rules
  • Machine learning techniques: need to find and extract many relevant features
• Execution: may take several days and require substantial computational resources
• Maintenance: datasets evolve over time, so concept annotators must be rewritten and re-executed

It is usually not possible to annotate all concepts.
• Ideally, we would like to annotate instances of all concepts in a given taxonomy from a dataset to answer all queries effectively.
• With a limited budget, we can only annotate instances of some concepts, because concept annotation is costly.
[Figure: the DBpedia taxonomy (thing; place, agent; person, populated place, organization; school, athlete, artist, politician, legislature, state, city) next to the three John Adams Wikipedia articles, with the labels person, politician, artist, legislature, school, organization, state, and city attached to the article text.]

Annotating a dataset with only a subset of the concepts in a taxonomy still improves the effectiveness of answering queries.
[Figure: taxonomy fragment (person: athlete, politician, artist; organization: school, legislature). <article id=1> and <article id=2> are annotated with person, and <article id=3> with organization.]
Query: Politician(“John Adams”)
Ranked list: <article id=2>, <article id=1>
precision@3 = 1/2 > 1/3 (the precision over the unannotated dataset)

A subset of the concepts in a taxonomy that is used to annotate a dataset is called a conceptual design for the data.
• S_1 = {politician, artist, school, city, state, legislature}: the three John Adams articles are annotated with politician, artist, school, city, state, and legislature.
• S_2 = {person, organization}: <article id=1> and <article id=2> are annotated with person, and <article id=3> with organization.

Which conceptual design to pick?
Given a dataset, a taxonomy, a sample of the query workload, and a budget, find a subset of concepts from the input taxonomy that maximizes the effectiveness of answering queries (precision@k).
[Figure: the inputs are the dataset, the taxonomy (thing; place, agent; person, organization, populated place; politician, athlete, artist, city, legislature, school, state), a sample of the query workload, and a budget. Candidate designs {person, agent}, {state, city}, {person, organization}, … are shown with p@3 values 0.1, 0.2, and 0.5.]
• “I want the largest average precision over these queries and under my budget!”
• “I will pick {person, organization} because it is the most effective.”

Problem of Cost-Effective Conceptual Design (CECD)
Given a dataset, a sample of the query workload, a taxonomy, and an available budget, we would like to select a conceptual design S such that
• $\sum_{C \in S} w(C) \le B$, where $w$ is the cost function and $B$ is the budget;
• S provides the largest improvement in the average precision@k of answering queries among all designs that satisfy the budget constraint.
Let’s quantify the amount of improvement in precision@k: the queriability of a design S, or QU(S).

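To make the problem statement concrete, here is a hedged sketch of an exhaustive search: it enumerates every subset of concepts, keeps those whose total cost fits the budget, and returns the best-scoring one. It is exponential in the number of concepts and is not presented as the authors' algorithm. The names `best_design`, `toy_cost`, `toy_gain`, and `toy_score` are hypothetical, the cost and gain numbers are invented for illustration, and the additive toy score merely stands in for the real effectiveness measure.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Sequence, Tuple


def best_design(
    concepts: Sequence[str],
    cost: Dict[str, float],
    budget: float,
    score: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Exhaustively pick the highest-scoring design whose total cost fits the budget."""
    best: FrozenSet[str] = frozenset()
    best_score = score(best)
    for size in range(1, len(concepts) + 1):
        for combo in combinations(concepts, size):
            design = frozenset(combo)
            if sum(cost[c] for c in design) > budget:
                continue  # violates the budget constraint
            value = score(design)
            if value > best_score:
                best, best_score = design, value
    return best, best_score


# Invented numbers, for illustration only.
toy_cost = {"person": 2.0, "organization": 2.0, "agent": 3.0}
toy_gain = {"person": 0.3, "organization": 0.2, "agent": 0.4}


def toy_score(design: FrozenSet[str]) -> float:
    # Assumption: an additive stand-in score, not the measure defined in the deck.
    return sum(toy_gain[c] for c in design)


design, value = best_design(list(toy_cost), toy_cost, budget=4.0, score=toy_score)
```

With these made-up inputs, {agent} is affordable on its own but cannot be combined with anything, so the pair {person, organization} wins within the budget of 4.
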
Partitions of a conceptual design
Annotating a concept in a taxonomy also improves the quality of answering queries whose concepts are subclasses or descendants of it.
Example: S_3 = {agent, person} over the taxonomy (thing; place, agent; person, organization, populated place; politician, athlete, artist, legislature, city, school, state)
• partition(agent) = {legislature, school}
• partition(person) = {politician, athlete, artist}
In this example, politician, athlete, and artist are assigned to person, their closest ancestor in the design, rather than to agent.
partition(S) is the set of the partitions of the concepts in the conceptual design S.

A conceptual design may not help all the queries.
Example: S_3 = {agent, person} over the taxonomy (thing; place, agent; person, organization, populated place; politician, athlete, artist, legislature, city, school, state)
• partition(agent) = {legislature, school}
• partition(person) = {politician, athlete, artist}
• free(S_3) = {state, city}
The set of leaf concepts that do not belong to any partition of S is called free(S).

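The partition and free sets in this example can be reproduced with a short sketch. It assumes, as the example suggests, that each leaf concept is assigned to its closest ancestor in the design, and that the taxonomy edges are the ones read off the flattened figure. The names `PARENT`, `leaf_concepts`, and `partitions_and_free` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

# Child -> parent edges of the example taxonomy (reconstructed from the figure; an assumption).
PARENT: Dict[str, str] = {
    "place": "thing", "agent": "thing",
    "populated place": "place", "city": "populated place", "state": "populated place",
    "person": "agent", "organization": "agent",
    "politician": "person", "athlete": "person", "artist": "person",
    "school": "organization", "legislature": "organization",
}


def leaf_concepts(parent: Dict[str, str]) -> Set[str]:
    """Concepts that are nobody's parent."""
    internal = set(parent.values())
    return {c for c in parent if c not in internal}


def partitions_and_free(
    parent: Dict[str, str], design: Set[str]
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Assign each leaf to its closest ancestor-or-self in the design, or to the free set."""
    parts: Dict[str, Set[str]] = {c: set() for c in design}
    free: Set[str] = set()
    for leaf in leaf_concepts(parent):
        node: Optional[str] = leaf
        # Walk up toward the root until a design concept is found.
        while node is not None and node not in design:
            node = parent.get(node)
        if node is None:
            free.add(leaf)
        else:
            parts[node].add(leaf)
    return parts, free


parts, free = partitions_and_free(PARENT, {"agent", "person"})
```

For the design {agent, person}, the walk from school or legislature reaches agent via organization, while city and state reach the root without meeting a design concept and end up in the free set.
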
Conceptual design S improves the effectiveness of answering queries whose concepts are in the partitions of S.
• Query: Politician(“John Adams”)
• S = {person, organization}, and politician ∈ partition(person)
• d(c): the fraction of documents of concept c in the dataset
• Over the dataset annotated by S, the likelihood of returning relevant answers for a query with concept “politician” is $\frac{d(\text{politician})}{d(\text{person})}$. This is the improvement over the unannotated dataset.

Conceptual design S improves the effectiveness of answering queries whose concepts are in the partitions of S.
• S = {person, organization}; the query workload in the figure contains school(…), politician(…), politician(…), artist(…)
• The portion of queries about “politician” is u(politician).
• Overall improvement for concept “politician”: $u(\text{politician}) \, \frac{d(\text{politician})}{d(\text{person})}$
• Total improvement from the partition of “person”: $\sum_{c \in \mathrm{partition}(\text{person})} \frac{u(c)\, d(c)}{d(\text{person})}$
• Total improvement from design S: $\sum_{P \in \mathrm{partition}(S)} \sum_{c \in \mathrm{partition}(P)} \frac{u(c)\, d(c)}{d(P)}$

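The total-improvement sum can be evaluated directly once the partitions are known. The sketch below is illustrative only: the function name `total_improvement` and all popularity and frequency numbers are invented, and it assumes that the frequency of an annotated concept equals the sum of the frequencies of the leaf concepts in its partition, which the slides do not state.

```python
from typing import Dict, List


def total_improvement(
    partitions: Dict[str, List[str]],
    popularity: Dict[str, float],  # portion of queries about each leaf concept
    frequency: Dict[str, float],   # fraction of documents of each leaf concept
) -> float:
    """Sum popularity * frequency / partition frequency over every leaf in every partition."""
    total = 0.0
    for leaves in partitions.values():
        # Assumption: the annotated concept's frequency is the sum over its partition's leaves.
        part_frequency = sum(frequency[c] for c in leaves)
        if part_frequency == 0:
            continue  # no documents for this partition, so it contributes nothing
        for c in leaves:
            total += popularity[c] * frequency[c] / part_frequency
    return total


# Invented numbers for the design {person, organization}.
partitions = {
    "person": ["politician", "athlete", "artist"],
    "organization": ["school", "legislature"],
}
popularity = {"politician": 0.4, "athlete": 0.1, "artist": 0.1, "school": 0.2, "legislature": 0.2}
frequency = {"politician": 0.1, "athlete": 0.2, "artist": 0.1, "school": 0.3, "legislature": 0.3}

improvement = total_improvement(partitions, popularity, frequency)
```

With these numbers the person partition contributes 0.1 + 0.05 + 0.025 and the organization partition contributes 0.1 + 0.1, for a total of about 0.375.
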