Navigation Retrieval with Site Anchor Text


  1. Navigation Retrieval with Site Anchor Text
     Hideki KAWAI, Kenji TATEISHI and Toshikazu FUKUSHIMA
     NEC Internet Systems Research Labs.

  2. Introduction
     - Navigation Retrieval Task in NTCIR-4 WEB (Task B)
       - Searching for one or more "representative Web pages."
       - Relevancy and representativeness of a document are both important.
     - Motivation
       - Verify the efficiency of referential information.
       - Retrieval system which indexes only site anchor text.
     - Two advantages:
       - The index size is very small.
       - A user can retrieve uncrawled documents as well as crawled documents.

  3. Site Anchor Text
     - Anchor text of links from external Web sites.
     - Site anchor text of page a: Anchor(d, a) + Anchor(e, a) + Anchor(f, a)
     - (Figure: pages d, e and f link to page a; the sites shown are www.a.com, www.b.com and www.c.com, and pages b and c also appear.)
     - Site anchor text summarizes the content and popularity of the Web site, so we can calculate relevancy and representativeness.
     - Note: We defined "external Web sites" simply as sites whose domain name is different from that of the target page.
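
The collection step described on this slide could be sketched roughly as follows. This is only an illustration under assumptions: the `(source_url, target_url, anchor_text)` tuple layout, the function names and the sample links are hypothetical, and "different domain name" is approximated by comparing URL host names.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    """Return the lower-cased host name of a URL (assumed proxy for 'domain name')."""
    return urlparse(url).netloc.lower()


def site_anchor_text(links):
    """Group anchor text by target page, keeping only links from external sites.

    links: iterable of (source_url, target_url, anchor_text) tuples (hypothetical layout).
    """
    collected = defaultdict(list)
    for source, target, text in links:
        # A link counts as external when the two host names differ.
        if domain(source) != domain(target):
            collected[target].append(text)
    return dict(collected)


# Made-up sample links: two external links and one internal link to the same page.
links = [
    ("http://www.a.com/d.html", "http://www.c.com/a.html", "C Corp home"),
    ("http://www.b.com/e.html", "http://www.c.com/a.html", "C Corp"),
    ("http://www.c.com/b.html", "http://www.c.com/a.html", "back to top"),
]
anchors = site_anchor_text(links)
print(anchors["http://www.c.com/a.html"])
```

The internal link from www.c.com is dropped, so only the two external anchor strings remain for the target page.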

  4. Retrieval Method
     - Step 1: Parse the query and search for pages.
     - Step 2: Determine the score of each page.
     - Step 3: Sort pages by score.
     - Score of page p: $\mathrm{Score}(p) = \mathrm{Rep}(p) \times \mathrm{Rel}(p, q)$
       - $\mathrm{Rep}(p)$: representativeness of page p, derived from link structure.
       - $\mathrm{Rel}(p, q)$: relevancy of page p and query q, based on two kinds of measures, reference consistency and specificity of word combination.
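
Steps 2 and 3 can be illustrated with a small sketch. It assumes the candidate pages from Step 1 are already available, and it takes `rep` and `rel` as plain callables; the function name `rank_pages` and the toy score tables are hypothetical, not values from the slides.

```python
def rank_pages(candidates, rep, rel, query):
    """Score each candidate as Rep(p) * Rel(p, q) and sort by descending score."""
    scored = [(p, rep(p) * rel(p, query)) for p in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy values chosen only to show the ordering (not taken from the evaluation).
rep_table = {"a": 303, "b": 1, "c": 11}
rel_table = {"a": 0.25, "b": 1.0, "c": 0.5}

ranking = rank_pages(
    ["a", "b", "c"],
    rep=lambda p: rep_table[p],
    rel=lambda p, q: rel_table[p],
    query="example query",
)
print(ranking)
```

Because the two factors are multiplied, a page with high relevancy but very low representativeness (page b here) can still rank below a more widely cited page.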

  5. Representativeness of page p
     - Derived from link structure: $\mathrm{Rep}(p) = C \times T$
     - C: citation frequency from external Web sites.
     - T: likelihood of being a top page, determined by the following heuristics:
       - (H1) Does the URL of the page consist of only a domain name?
       - (H2) Does the file name of the URL contain a string such as "index" or "default"?
       - (H3) Does the URL end with a slash "/"?
     - $T = w_1 \delta_1 + w_2 \delta_2 + w_3 \delta_3 + w_4$, where $\delta_i = 1$ if $H_i$ is true and $\delta_i = 0$ if $H_i$ is false, and $(w_1, w_2, w_3, w_4) = (1000, 100, 10, 1)$.
     - e.g. for http://www.c.com/abc/index.html only H2 holds, so $T = 100 + 1 = 101$; with $C = 3$, $\mathrm{Rep}(a) = 3 \times 101 = 303$.
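
The arithmetic on this slide can be checked with a short sketch. The way each heuristic is tested below is an assumption (for instance, H1 is taken to mean an empty or "/" path), and the function names are hypothetical; only the weights and the formula come from the slide.

```python
from urllib.parse import urlparse

# Weights (w1, w2, w3, w4) as given on the slide.
WEIGHTS = (1000, 100, 10, 1)


def top_page_likelihood(url):
    """T = w1*d1 + w2*d2 + w3*d3 + w4, with d_i = 1 when heuristic H_i holds."""
    path = urlparse(url).path
    file_name = path.rsplit("/", 1)[-1].lower()
    h1 = path in ("", "/")                               # H1: only a domain name (assumed test)
    h2 = "index" in file_name or "default" in file_name  # H2: file name hints at a top page
    h3 = url.endswith("/")                               # H3: URL ends with a slash
    w1, w2, w3, w4 = WEIGHTS
    return w1 * int(h1) + w2 * int(h2) + w3 * int(h3) + w4


def representativeness(citations, url):
    """Rep(p) = C * T, where C is the citation count from external Web sites."""
    return citations * top_page_likelihood(url)


url = "http://www.c.com/abc/index.html"
print(top_page_likelihood(url))    # 101, matching the slide's example
print(representativeness(3, url))  # 303, matching the slide's example
```

With these weights a bare domain URL always outranks an "index" file, which in turn outranks a URL that merely ends in a slash.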

  6. Relevancy of page p and query q
     Main concept: effective use of limited information to determine the relevancy.
     - Reference consistency
       - How consistently is the page referred to by external Web sites?
       - (How sharply does the site focus on a topic?)
     - Specificity of word combination
       - How specifically are pages identified by a given word combination?

  7. Reference consistency
     - Which page is relevant for the query "iPod"?
     - (Figure: pages x and y with the site anchor text pointing to each, from blogs and other sites; the anchor words shown include iPod, Clie, MBA, Matsui, Apple, LaVie and NEC.)
     - $\mathrm{Rel}(p, q) = \sum_{t \in q} kw_t \times \left( \dfrac{f_t}{N_{sa}} \right)^2$
       - $f_t$: frequency of word t in the site anchor text for page p.
       - $N_{sa}$: amount of site anchor text for page p.
       - $kw_t$: weight of the word in query q, $kw_i = 2^{-(n_q - i)}$.
     - In this case, $\mathrm{Rel}(x, \text{"iPod"}) < \mathrm{Rel}(y, \text{"iPod"})$.
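
A rough sketch of the reference consistency measure follows. Several points are assumptions rather than details from the slides: "amount of site anchor text" is taken to be the total word count, words are split on whitespace and lower-cased, and the query weights are passed in by the caller (defaulting to 1.0). The anchor strings are invented toy data.

```python
from collections import Counter


def reference_consistency(anchor_texts, query_words, weights=None):
    """Sum over query words of kw_t * (f_t / N_sa) ** 2."""
    words = [w.lower() for text in anchor_texts for w in text.split()]
    n_sa = len(words)  # assumed meaning of "amount of site anchor text"
    if n_sa == 0:
        return 0.0
    freq = Counter(words)
    if weights is None:
        weights = [1.0] * len(query_words)  # assumed default weight
    return sum(
        kw * (freq[t.lower()] / n_sa) ** 2
        for t, kw in zip(query_words, weights)
    )


# Page x is referred to with mixed words, page y almost always with "iPod".
anchors_x = ["iPod", "Clie", "MBA", "Matsui"]
anchors_y = ["iPod", "iPod", "Apple iPod", "iPod"]

rel_x = reference_consistency(anchors_x, ["iPod"])
rel_y = reference_consistency(anchors_y, ["iPod"])
print(rel_x < rel_y)
```

Squaring the ratio rewards pages whose anchor text concentrates on the query word, which is the sense in which page y is referred to more consistently than page x.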

  8. Specificity of word combination
     - How specifically are pages identified by a given word combination?
     - $\mathrm{Rel}(p, q) = \log \dfrac{N}{D(\tau(p, q))}$
       - $\tau(p, q)$: keyword group included in both page p and query q.
       - $D(\tau(p, q))$: number of pages that contain that keyword group.
     - If $D(t_1, t_2, t_3) < D(t_1, t_2) < D(t_1, t_3) < D(t_2, t_3)$ and $i \in D(t_1, t_2, t_3)$, $j \in D(t_1, t_2)$, $k \in D(t_1, t_3)$, $l \in D(t_2, t_3)$, then $\mathrm{Rel}(i, q) > \mathrm{Rel}(j, q) > \mathrm{Rel}(k, q) > \mathrm{Rel}(l, q)$.
     - Note: The traditional TF-IDF scheme tends to be biased toward words with high specificity ($t_2$ and $t_3$), so $\mathrm{Rel}(l, q) > \mathrm{Rel}(j, q)$ or $\mathrm{Rel}(k, q)$ in this case.
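
The ordering claimed on this slide can be reproduced with a minimal sketch. It assumes that N is the total number of pages in the collection and that each page is represented by the set of words in its site anchor text; the function name and the toy corpus are hypothetical.

```python
import math


def specificity(page_words, query_words, all_pages):
    """Rel(p, q) = log(N / D), where D counts pages containing the whole shared keyword group."""
    group = set(query_words) & set(page_words)  # keyword group shared by page and query
    if not group:
        return 0.0
    d = sum(1 for words in all_pages if group <= set(words))
    return math.log(len(all_pages) / d)  # N assumed to be the collection size


# Toy corpus: D(t1,t2,t3)=1 < D(t1,t2)=2 < D(t1,t3)=3 < D(t2,t3)=4.
corpus = [
    {"t1", "t2", "t3"},                            # page i
    {"t1", "t2"},                                  # page j
    {"t1", "t3"}, {"t1", "t3"},                    # page k and one more
    {"t2", "t3"}, {"t2", "t3"}, {"t2", "t3"},      # page l and two more
    {"other"},
]
query = ["t1", "t2", "t3"]

rel_i = specificity(corpus[0], query, corpus)
rel_j = specificity(corpus[1], query, corpus)
rel_k = specificity(corpus[2], query, corpus)
rel_l = specificity(corpus[4], query, corpus)
print(rel_i > rel_j > rel_k > rel_l)
```

The score depends on how rare the matched combination is as a whole, not on the rarity of each word separately, which is the contrast with TF-IDF drawn in the note.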

  9. Evaluation
     - Document collection: 100 GB, NW100G-01
     - Total size of site anchor text: 94 MB
     - Evaluation scales: WRR (and DCG)
       - "relevant", "partially relevant", "irrelevant"
     - Compared the following 4 systems:

     | ID  | Index                      | Relevancy calculation           |
     |-----|----------------------------|---------------------------------|
     | OKA | Full text of crawled pages | OKAPI                           |
     | ANC | Full text of crawled pages | High weight to anchor text      |
     | SAR | Site anchor text only      | Reference consistency           |
     | SAS | Site anchor text only      | Specificity of word combination |

  10. Result and discussion (1/4)
     - Site anchor text retrieval (SAR and SAS) has great advantages over simple full text retrieval (OKA).
     - (Figure: bar chart of WRR, series wrr.1-0 and wrr.1-1, for OKA-TT, ANC-TT, SAR-TT, SAS-TT, OKA-DS, ANC-DS, SAR-DS and SAS-DS.)
     - ※ TT: <TITLE> / DS: <DESC> for Topic Part

  11. Result and discussion (2/4)
     - Anchor weighted full text retrieval (ANC) was better than site anchor text retrieval (SAR and SAS).
     - Some important information in anchor text can be lost when site anchor text is extracted.
       - e.g. http://abc.jp/~usr1/ and http://abc.jp/~usr2/ are dealt with as the same site.
     - (Figure: bar chart of WRR, series wrr.1-0 and wrr.1-1, for OKA-TT, ANC-TT, SAR-TT, SAS-TT, OKA-DS, ANC-DS, SAR-DS and SAS-DS.)
     - ※ TT: <TITLE> / DS: <DESC> for Topic Part

  12. Result and discussion (3/4)
     - Despite a very small index, SAR and SAS were comparable with ANC (up to 88% on WRR).
     - In particular, the accuracy ratio tends to be higher in data series that give a score only for the "relevant" pages.
     - Site anchor text can pinpoint highly relevant documents.

     | Series  | SAR/ANC | SAS/ANC |
     |---------|---------|---------|
     | dcg.3-0 | 0.84    | 0.81    |
     | dcg.3-2 | 0.75    | 0.71    |
     | dcg.3-3 | 0.72    | 0.68    |
     | wrr.1-0 | 0.88    | 0.76    |
     | wrr.1-1 | 0.84    | 0.71    |

     ※ Topic Part is <TITLE>

  13. Result and discussion (4/4)
     - Some uncrawled pages are "relevant", and relevancy for the uncrawled pages can be determined based on reference information.
     - The gap in WRR between SAR and ANC (or SAS and ANC) increased for crawled documents.
     - (Figure: bar chart of WRR, series wrr.1-0 and wrr.1-1, for crawled documents only.)

  14. Conclusion and Future work
     - Site anchor text retrieval system ...
       - Has a very small index size (one thousandth of the original document set).
       - Outperforms simple full-text retrieval.
       - Is comparable with anchor text weighted full-text retrieval (up to 88% accuracy).
       - Tends to pinpoint highly relevant pages.
       - Can retrieve uncrawled pages as well as crawled pages based on only referential information.
     - In future work ...
       - Integrate site anchor text retrieval and traditional retrieval systems.
       - Address the problem of Web site boundaries.
