Doctoral Thesis: Learning Semantic Definitions for Information Sources on the Internet
Mark James Carman
Advisors: Prof. Paolo Traverso, Prof. Craig A. Knoblock
Outline: Motivation, Approach, Search, Scoring, Extensions, Experiments, Related Work, Conclusions
24 April 2007, Thesis Defense, Mark James Carman

Abundance of Information Sources
[Figure: a collage of online information sources. Legible labels: Google Base, Hotels, Weather Realtime Conditions, Stock Quotes, Hotel Deals, Exchange Rates, Package Deals, City Travel, Airfares, Earthquake Data, Currency Rates, Orbitz Travel Deals, Cheap Flights, Tsunami Warnings!, Flight Status, Last Minute Flights, Weather Forecasts, plus several classifieds and cars-for-sale listings.]
Bringing the Data Together
[Figure: sources shown: Google Base, Hotels, Hotel Deals, Exchange Rates, Package Deals, City Travel, Airfares, Orbitz Travel Deals, Cheap Flights, Flight Status, Last Minute Flights, Weather Forecasts.]
Mediators resolve Heterogeneity
[Figure: a Mediator shown together with the sources Google Base, City Travel, Hotels, Exchange Rates, Airfares, Hotel Deals, Package Deals, Cheap Flights, Orbitz Travel Deals, Last Minute Flights, Flight Status, Weather Forecasts.]
Mediators Require Source Definitions
• New service => no source definition!
• Can we discover a definition automatically?

[Figure: a query is sent to the Mediator, which sends reformulated queries to the sources.]
Query:
    SELECT MIN(price)
    FROM flight
    WHERE depart="LAX"
    AND arrive="MXP"
Reformulated queries:
    Orbitz Flight Search: lowestFare("LAX","MXP")
    United Airlines: calcPrice("LAX","MXP","economy")
    Qantas Specials
Source Definitions:
    - Orbitz Flight Search
    - United Airlines
    - Qantas Specials
Also shown, without a source definition: Alitalia
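The role of the mediator in the figure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two functions stand in for the wrapped services, the prices are made-up placeholders, and the SOURCES table is a hypothetical stand-in for real source definitions, not the mediator used in the thesis.

```python
# Hypothetical stand-ins for two wrapped services; prices are placeholders.
def lowest_fare(origin, destination):
    """Models the Orbitz Flight Search call lowestFare(origin, destination)."""
    return {("LAX", "MXP"): 950.0}.get((origin, destination))

def calc_price(origin, destination, cabin):
    """Models the United Airlines call calcPrice(origin, destination, cabin)."""
    return {("LAX", "MXP", "economy"): 1020.0}.get((origin, destination, cabin))

# Each entry plays the role of a source definition: it tells the mediator
# how to rewrite a (depart, arrive) flight query into a call on that source.
SOURCES = {
    "Orbitz Flight Search": lambda dep, arr: lowest_fare(dep, arr),
    "United Airlines": lambda dep, arr: calc_price(dep, arr, "economy"),
}

def min_price(depart, arrive, sources=SOURCES):
    """Answer SELECT MIN(price) FROM flight WHERE depart=... AND arrive=..."""
    prices = [call(depart, arrive) for call in sources.values()]
    prices = [p for p in prices if p is not None]  # drop sources with no answer
    return min(prices) if prices else None

print(min_price("LAX", "MXP"))
```

A service with no entry in SOURCES is simply never called, which is the problem the slide raises: without a source definition the mediator cannot use a new service.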
Inducing Source Definitions by Example
Known Source 1: source1($zip, lat, long) :- centroid(zip, lat, long).
Known Source 2: source2($lat1, $long1, $lat2, $long2, dist) :- greatCircleDist(lat1, long1, lat2, long2, dist).
Known Source 3: source3($dist1, dist2) :- convertKm2Mi(dist1, dist2).
New Source 4: source4($startZip, $endZip, separation)
• Step 1: classify input & output semantic types (zipcode, distance). Assume this problem has been solved!
Inducing Source Definitions - Step 2
Known Source 1: source1($zip, lat, long) :- centroid(zip, lat, long).
Known Source 2: source2($lat1, $long1, $lat2, $long2, dist) :- greatCircleDist(lat1, long1, lat2, long2, dist).
Known Source 3: source3($dist1, dist2) :- convertKm2Mi(dist1, dist2).
New Source 4: source4($zip1, $zip2, dist)
• Step 1: classify input & output semantic types
• Step 2: generate plausible definitions

Candidate definition in terms of the known sources:
    source4($zip1, $zip2, dist) :-
        source1(zip1, lat1, long1),
        source1(zip2, lat2, long2),
        source2(lat1, long1, lat2, long2, dist2),
        source3(dist2, dist).
The same definition unfolded into domain relations:
    source4($zip1, $zip2, dist) :-
        centroid(zip1, lat1, long1),
        centroid(zip2, lat2, long2),
        greatCircleDist(lat1, long1, lat2, long2, dist2),
        convertKm2Mi(dist2, dist).
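To make the candidate definition concrete, here is a minimal Python sketch of how the three known sources could be chained to predict the output of source4. Everything below is an assumption for illustration: the centroid table holds rough placeholder coordinates, greatCircleDist is approximated with the haversine formula on a spherical Earth, and the function names simply mirror the slide.

```python
import math

# Hypothetical stand-in for source1: zipcode -> centroid (lat, long) in degrees.
# The coordinates are rough placeholders, not values from the thesis.
CENTROIDS = {
    "80210": (39.68, -104.96),
    "90266": (33.89, -118.40),
}

def source1(zipcode):
    """centroid(zip, lat, long): look up the centroid of a zipcode."""
    return CENTROIDS[zipcode]

def source2(lat1, long1, lat2, long2):
    """greatCircleDist: haversine distance in km, mean Earth radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(long2 - long1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def source3(dist_km):
    """convertKm2Mi: kilometres to statute miles."""
    return dist_km / 1.609344

def candidate_source4(zip1, zip2):
    """Evaluate the candidate definition by chaining the known sources."""
    lat1, long1 = source1(zip1)
    lat2, long2 = source1(zip2)
    dist2 = source2(lat1, long1, lat2, long2)   # distance in km
    return source3(dist2)                        # distance in miles

print(round(candidate_source4("80210", "90266"), 1))
```

The chaining order follows the body of the candidate definition: each literal binds the variables that the next one consumes as inputs.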
Inducing Source Definitions - Step 3
• Step 1: classify input & output semantic types
• Step 2: generate plausible definitions
• Step 3: invoke service & compare output

Candidate definition in terms of the known sources:
    source4($zip1, $zip2, dist) :-
        source1(zip1, lat1, long1),
        source1(zip2, lat2, long2),
        source2(lat1, long1, lat2, long2, dist2),
        source3(dist2, dist).
The same definition unfolded into domain relations:
    source4($zip1, $zip2, dist) :-
        centroid(zip1, lat1, long1),
        centroid(zip2, lat2, long2),
        greatCircleDist(lat1, long1, lat2, long2, dist2),
        convertKm2Mi(dist2, dist).

    $zip1    $zip2    dist (predicted)    dist (actual)
    80210    90266    842.37              843.65
    60601    15201    410.31              410.83
    10005    35555    899.50              899.21
Result: match
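The comparison in Step 3 can be illustrated with the three rows from the table above. The sketch below uses a simple relative-error check; the 1% tolerance is an assumption chosen for illustration and is not the scoring function of the thesis, which the talk covers under Scoring.

```python
# (zip1, zip2, predicted distance, actual distance), copied from the slide.
ROWS = [
    ("80210", "90266", 842.37, 843.65),
    ("60601", "15201", 410.31, 410.83),
    ("10005", "35555", 899.50, 899.21),
]

def relative_error(predicted, actual):
    """Absolute difference as a fraction of the actual value."""
    return abs(predicted - actual) / abs(actual)

def matches(rows, tolerance=0.01):
    """True if every prediction is within `tolerance` (assumed: 1%) of the actual value."""
    return all(relative_error(p, a) <= tolerance for _, _, p, a in rows)

for zip1, zip2, predicted, actual in ROWS:
    print(zip1, zip2, f"{relative_error(predicted, actual):.4%}")
print("match" if matches(ROWS) else "no match")
```

All three rows differ by well under 1%, so under this assumed tolerance the candidate definition counts as a match.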
Overlapping Data Requirement
• Assumption: overlap between new & known sources
• Nonetheless, the technique is widely applicable:
    • Redundancy (Bloomberg Currency Rates, Yahoo Exchange Rates)
    • Scope or Completeness (Worldwide Hotel Deals, US Hotel Rates)
    • Binding Constraints (5* Hotels By State, Hotels By Zipcode)
    • Composed Functionality (Distance Between Zipcodes, Great Circle Distance, Centroid of Zipcode)
    • Access Time (Google Hotel Search, Government Hotel List)
Searching for Definitions
• Search space of conjunctive queries:
    target(X) :- source1(X1), source2(X2), ...
  Expressive Language: sufficient for modeling most online sources
• For scalability don't allow negation or union
• Perform Top-Down Best-First Search

1. First sample the New Source
2. Then perform best-first search through space of candidate definitions

    Invoke target with set of random inputs;
    Add empty clause to queue;
    while (queue not empty)
        v := best definition from queue;
        forall (v' in Expand(v))
            if (Eval(v') > Eval(v))
                insert v' into queue;
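The loop on the slide can be sketched in Python roughly as follows. This is a minimal sketch, not the thesis implementation: expand and evaluate are hypothetical callbacks supplied by the caller, sampling the new source is assumed to happen inside evaluate, and the visited set, the step limit and the tracking of the best candidate are additions that the slide does not show. The toy search space at the bottom (candidates as tuples of source names) is likewise made up for illustration.

```python
import heapq
from itertools import count

def best_first_search(expand, evaluate, max_steps=1000):
    """Top-down best-first search following the loop on the slide.

    expand(v)   -> iterable of refinements of candidate v (hypothetical callback)
    evaluate(v) -> numeric score, higher is better (hypothetical callback)
    """
    empty = ()                        # the empty clause: no literals yet
    tie = count()                     # tie-breaker so heapq never compares candidates
    best, best_score = empty, evaluate(empty)
    queue = [(-best_score, next(tie), empty)]   # max-heap via negated scores
    seen = {empty}                    # addition: avoid re-scoring duplicates
    steps = 0
    while queue and steps < max_steps:
        neg_score, _, v = heapq.heappop(queue)  # best definition from queue
        score = -neg_score
        if score > best_score:
            best, best_score = v, score
        for child in expand(v):
            if child in seen:
                continue
            seen.add(child)
            child_score = evaluate(child)
            if child_score > score:             # keep only improving refinements
                heapq.heappush(queue, (-child_score, next(tie), child))
        steps += 1
    return best, best_score

# Toy search space: a candidate is a tuple of source names (made up for illustration).
SOURCES = ("source1", "source2", "source3")
TARGET = ("source1", "source1", "source2", "source3")

def expand(v):
    """Refine a candidate by appending one more source literal."""
    return [v + (s,) for s in SOURCES] if len(v) < len(TARGET) else []

def evaluate(v):
    """Toy score: number of literals if v is a prefix of TARGET, else -1."""
    return len(v) if TARGET[:len(v)] == v else -1

print(best_first_search(expand, evaluate))
```

Because only refinements that score higher than their parent are queued, the toy run follows a single improving path from the empty clause to the four-literal definition.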