SSTUT at NTCIR-4 Web task

Yinghui Xu, Kyoji Umemura
Software System Lab. (Umemura Lab)
Information and Computer Science Dept.
Toyohashi University of Technology
June 3, 2004

Web Searching Using Term Entropy on Virtual Document and Query-Independent Importance
Is the page itself adequate for Web IR?
- No. Page ≠ Document.
- Page = textual page content + virtual document (VD).
Does every term in a query convey the same importance?
- Usually not. Weighting query terms may be helpful.
What does the linkage information of Web pages tell us?
- Link analysis has been a good searching function for ranking web resources.

Our interests
- Feasible augmentation of a general relevance ranking scheme by weighting query terms for Web IR.
- Effectiveness of VD information in boosting the precision of general page-content searching.
- Functionality of link analysis.

Our Approach
- Weight query terms by their term entropy in the virtual document collection space, and introduce the weights into the general OKAPI model.
- Combine the relevance ranking scores obtained by searching both the page content and the page's virtual document.
- Propose a literal-matching-aided link analysis model.

Sample Show of VD
[Figure: pages URL_X and URL_Y both link to www-nlp.stanford.edu/links/linguistics.html. The text in and around their anchors includes "Linguistics-Related Archives, Databases, Information Sources", "Linguistics, Natural Language, Computational Linguistics Meta-index", and "Meta index of linguistics resources. This index has well-chosen links and brief annotations". Together with the page contents, terms such as archives, databases, information, sources, computational, linguistics, natural language, meta-index, index, links, annotations, and resources form the virtual document for the page on www-nlp.stanford.edu/links/linguistics.html.]
A diagram showing the definition of the virtual document in our approach.

Definition of VD
A VD is comprised of the expanded anchor text from the pages that point to a page, plus some important words on the page itself.
- $AnchorText(i, j)$: set of terms that appear in and around the anchor of the link from $i$ to $j$.
- $BodyText(j)$: set of terms appearing in the "title" tag, the "meta" tag, and the "H1, H2" tags of page $j$.
- $VD(j)$: set of terms in virtual document $j$.
$$VD(j) = \left( \bigcup_{i} AnchorText(i, j) \right) \cup BodyText(j)$$
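The set definition above can be sketched in a few lines. This is only an illustration: the function name `build_virtual_documents`, the dictionary layout, and the sample terms are assumptions made for the example, and tokenization (done with Chasen in the original system) is taken as already finished.

```python
from typing import Dict, Iterable, Set, Tuple


def build_virtual_documents(
    anchor_text: Dict[Tuple[str, str], Iterable[str]],
    body_text: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Return VD(j) = (union over i of AnchorText(i, j)) | BodyText(j).

    anchor_text maps a (source page i, target page j) pair to the terms
    in and around the anchor of that link; body_text maps page j to the
    terms taken from its title, meta, and H1/H2 tags.
    """
    # Start every VD from the page's own important terms.
    vd: Dict[str, Set[str]] = {page: set(terms) for page, terms in body_text.items()}
    # Add the anchor terms of every in-link to the target page's VD.
    for (_source, target), terms in anchor_text.items():
        vd.setdefault(target, set()).update(terms)
    return vd


# Hypothetical input, loosely modeled on the sample slide.
anchor_text = {
    ("URL_X", "linguistics.html"): ["linguistics", "archives", "databases"],
    ("URL_Y", "linguistics.html"): ["meta-index", "linguistics", "resources"],
}
body_text = {"linguistics.html": ["linguistics", "meta-index", "links"]}
vd = build_virtual_documents(anchor_text, body_text)
```

A page that has in-links but no extracted body terms still receives a VD, built from anchor text alone.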

Assumption on VD
Characteristics of VD:
- Objective impressions of the page from other pages;
- Subjective presentation of the page author's motivation.
We assume:
- VD is a representative information resource for Web pages.
- VD is a good approximation of the kind of summarization that users present to the search system in most queries.

Functionality of VD
- Allows a different weighting scheme to be set up and a separate relevance ranking calculation to be performed.
- Predicts query term importance.
- Provides a representative summarization of Web pages for deciding the transition probabilities in our proposed link analysis model.

Ranker: relevance ranking
BASE: OKAPI's BM25
$$SIM(Q, d) = \sum_{w \in Q} \frac{tf}{0.5 + 1.5 \cdot \frac{dl}{ave\_dl} + tf} \times \frac{\log_2\left(\frac{N + 0.5}{df}\right)}{\log_2(N + 1.0)}$$
QTIBRF: query term importance based ranking function
$$SIM(Q, d) = \sum_{w \in Q} \frac{tf}{0.5 + 1.5 \cdot \frac{dl}{ave\_dl} + tf} \times \frac{\log_2\left(\frac{N + 0.5}{df}\right)}{\log_2(N + 1.0)} \times VDTW(w)$$
SMRF: score merging ranking function
$$FinalScore(p_i) = SIM(Q, VD(p_i)) + \lambda \, SIM(Q, AD(p_i)), \quad \lambda = 0.114$$
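The three ranking functions can be sketched as below. The sketch assumes the slide's formula reads as the common form tf / (0.5 + 1.5·dl/ave_dl + tf) × log2((N + 0.5)/df) / log2(N + 1.0), and that λ in SMRF scales the AD score. Both readings are assumptions, as are the function names `sim` and `final_score` and their parameters.

```python
import math
from typing import Dict, Iterable, Optional


def sim(
    query: Iterable[str],
    doc_tf: Dict[str, int],
    dl: float,
    ave_dl: float,
    df: Dict[str, int],
    n_docs: int,
    term_weight: Optional[Dict[str, float]] = None,
) -> float:
    """BM25-style SIM(Q, d); pass term_weight (VDTW) to get QTIBRF."""
    score = 0.0
    for w in query:
        tf = doc_tf.get(w, 0)
        if tf == 0 or df.get(w, 0) == 0:
            continue  # the term contributes nothing to this document
        tf_part = tf / (0.5 + 1.5 * dl / ave_dl + tf)
        idf_part = math.log2((n_docs + 0.5) / df[w]) / math.log2(n_docs + 1.0)
        # BASE uses weight 1.0; QTIBRF multiplies by VDTW(w).
        weight = 1.0 if term_weight is None else term_weight.get(w, 1.0)
        score += weight * tf_part * idf_part
    return score


def final_score(sim_vd: float, sim_ad: float, lam: float = 0.114) -> float:
    """SMRF: merge the VD and AD scores (assumed: lambda scales the AD score)."""
    return sim_vd + lam * sim_ad
```

Calling `sim` once against the VD index and once against the AD index, then passing both results to `final_score`, mirrors the score-merging step.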

Query term weighting in QTIBRF
Each query term is weighted by its entropy on the virtual document collection space.
$$VDTF(w, j) = \#\{ w \mid w \in VD(j) \}$$
$$P(w, j) = \frac{VDTF(w, j)}{\sum_{k=1}^{N} VDTF(w, k)}$$
$$VDET(w) = -\sum_{j=1}^{N} P(w, j) \log_N P(w, j)$$
$$VDTW(w) = 1 - VDET(w)$$
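A minimal sketch of the entropy-based weight, assuming the logarithm is taken to base N (the number of virtual documents) so that the entropy lies in [0, 1]. The function name `vd_term_weight`, the list-of-token-lists input, and the neutral weight of 1.0 for a term absent from every VD are assumptions of the example.

```python
import math
from typing import List


def vd_term_weight(term: str, vds: List[List[str]]) -> float:
    """Return VDTW(w) = 1 - VDET(w) for one term over a VD collection."""
    n = len(vds)
    counts = [vd.count(term) for vd in vds]  # VDTF(w, j) for every j
    total = sum(counts)
    if total == 0 or n < 2:
        return 1.0  # assumed fallback: unseen term keeps a neutral weight
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total  # P(w, j)
            entropy -= p * math.log(p, n)  # log base N keeps VDET in [0, 1]
    return 1.0 - entropy


vds = [["a", "b"], ["a"], ["a", "c"], ["a"]]
weight_common = vd_term_weight("a", vds)  # spread evenly over all VDs
weight_rare = vd_term_weight("b", vds)    # concentrated in a single VD
```

A term spread evenly across all virtual documents has entropy near 1 and weight near 0, while a term concentrated in one virtual document keeps a weight of 1.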

LinkAnalyzer - Literal Matching aided link analysis
What we hold:
- Inbound links from pages with a theme similar to our own have a larger influence on PageRank than links from unrelated pages.
Our approach:
- Combine the evidence from both content and link structure in the link analysis method.
- Modify the underlying Markov process by giving different weights to the different outgoing links of a page.

Assumption
- Users tend to choose the target that is relevant to what they picture in their mind.
- Searching is a process of gradually approaching the user's desired outcome.
- Accordingly, the user's mind is somewhat consistent along the search path.

Diagram of LMALA
[Diagram: the page being viewed, p, with virtual document VD(p), links to the target pages q1, q2, q3, which have virtual documents VD(q1), VD(q2), VD(q3).]
$$TranOdds(p \to q_k) \sim prob(VD(q_k) \mid p)$$
- Measures how likely the VD of the activated target page can be generated by the page being viewed.
$$\sum_{w \in VD(q_k) \cap VD(p)} prob(w \mid p)$$
- Indicates the degree of dependence between the two connected VDs; measures the consistency of the user's mind.

Computation Model
Based on the calculated values that indicate the transition likelihood of every possible connection on a page, we assign transition probabilities to the connections and regard them as the link weights in the Markov chain.
$$PR(j) = (1 - \lambda) \frac{1}{N} + \lambda \sum_{i \in B_j} PR(i) \, prob(i \to j), \quad \lambda = 0.85, \quad \gamma = 0.7$$
$$prob(i \to j) = \begin{cases} \gamma \times \dfrac{TranOdds(i \to j)}{\sum_{k \in F(i)} TranOdds(i \to k)}, & Liter(link(i, k)) = 1 \\[2ex] (1 - \gamma) \times \dfrac{1}{\#\left( F(i) - LiterLink(i) \right)}, & \text{otherwise} \end{cases}$$
The condition represents whether the link between i and k has relevant literal information or not.
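The computation can be sketched as a weighted PageRank iteration. The sketch makes several assumptions that the slide does not state: a link counts as "literal" when its TranOdds value is positive, TranOdds values are supplied precomputed, each page lists a target at most once, and the rows are not renormalized when a page has only literal or only non-literal links. The names `transition_probs` and `lmala_pagerank` are hypothetical.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def transition_probs(
    out_links: Dict[str, List[str]],
    tran_odds: Dict[Link, float],
    gamma: float = 0.7,
) -> Dict[Link, float]:
    """Assign prob(i -> j) to every link, following the two-case rule."""
    probs: Dict[Link, float] = {}
    for i, targets in out_links.items():
        # Assumed test for Liter(link) = 1: a positive TranOdds value.
        literal = [k for k in targets if tran_odds.get((i, k), 0.0) > 0.0]
        others = [k for k in targets if k not in literal]
        odds_sum = sum(tran_odds[(i, k)] for k in literal)
        for k in literal:
            probs[(i, k)] = gamma * tran_odds[(i, k)] / odds_sum
        for k in others:
            probs[(i, k)] = (1.0 - gamma) / len(others)
    return probs


def lmala_pagerank(
    pages: List[str],
    out_links: Dict[str, List[str]],
    tran_odds: Dict[Link, float],
    lam: float = 0.85,
    gamma: float = 0.7,
    iterations: int = 50,
) -> Dict[str, float]:
    """Iterate PR(j) = (1 - lam)/N + lam * sum_i PR(i) * prob(i -> j)."""
    n = len(pages)
    probs = transition_probs(out_links, tran_odds, gamma)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - lam) / n for p in pages}
        for (i, j), p_ij in probs.items():
            nxt[j] += lam * pr[i] * p_ij
        pr = nxt
    return pr


pages = ["a", "b", "c", "d"]
out_links = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
tran_odds = {("a", "b"): 0.6, ("a", "c"): 0.2}  # hypothetical values
probs = transition_probs(out_links, tran_odds)
scores = lmala_pagerank(pages, out_links, tran_odds)
```

With these inputs, page "a" passes more of its score to "b", whose VD overlaps most with its own, than to "c" or "d".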

Rank adjuster
Model 1 (RA1):
$$FScore(P_i) = SMRF(P_i) + \lambda \times \frac{\log\left( LMALA(P_i) * N \right)}{\log(1.8)}, \quad \lambda = 0.1$$
Model 2 (RA2):
$$FScore(P_i) = SMRF(P_i) - \lambda \times \frac{\tau_1(P_i) + \tau_2(P_i)}{\tau_1(P_i) - \tau_2(P_i) + 1}, \quad \lambda = 0.08$$
where
- $R$: set of documents returned for a given query
- $\tau_1$: documents in $R$ sorted by SMRF score
- $\tau_2$: documents in $R$ sorted by LMALA score
- $\tau_k(i)$: rank of $i$ in $\tau_k$
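As a rough illustration of Model 1, the sketch below reads the formula as FScore = SMRF + λ · log(LMALA · N) / log(1.8) with λ = 0.1. That reading, the function name `ra1`, and the sample values are assumptions. Under this reading a page whose LMALA score equals the uniform value 1/N receives no adjustment, and pages above or below it are pushed up or down.

```python
import math


def ra1(smrf: float, lmala: float, n_docs: int, lam: float = 0.1) -> float:
    """Model 1: add a log-scaled, query-independent bonus to the SMRF score.

    lmala is the page's LMALA (link analysis) score and n_docs is N; the
    ratio of two logarithms is the logarithm to base 1.8.
    """
    return smrf + lam * math.log(lmala * n_docs) / math.log(1.8)


# Hypothetical scores for a 100-page collection.
neutral = ra1(1.0, 1.0 / 100, 100)  # LMALA equal to the uniform score 1/N
boosted = ra1(1.0, 0.018, 100)      # LMALA 1.8 times the uniform score
demoted = ra1(1.0, 0.005, 100)      # LMALA below the uniform score
```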

Architecture
[Architecture diagram, six stages: (1) DOM Parser & Chasen, with Dic_Builder and the URL2DName Mapper, working from the EUC Web page repository and producing the LinkList, the Doclist file, the Dictionary, and the URL:DNAME map table; (2) Document Generator (VD Generator and AD Generator), producing VD and AD; (3) Indexer (INV_Indexer and DVEC_Indexer), producing VD.invf, VD.dvec, AD.invf, and AD.dvec; (4) Ranker, taking the query and producing the relevance score; (5) LinkAnalyzer, producing the query-independent score; (6) Rank Adjuster, producing the final score.]

Experiment results - BASE vs. QTIBRF

                       Virtual document (VD)        Actual document (AD)
Topic   Rank Fun.   Ave.P    P@10     P@20       Ave.P    P@10     P@20
tt      BASE        0.0621   0.2738   0.2206     0.2052   0.4550   0.3931
tt      QTIBRF      0.0705   0.2850   0.2431     0.2127   0.4487   0.3850
desc    BASE        0.0579   0.2550   0.2038     0.1839   0.4300   0.3713
desc    QTIBRF      0.0641   0.2825   0.2306     0.1987   0.4225   0.3625

- QTIBRF improved Ave.P on both VD and AD searching.
- QTIBRF is better suited to improving VD-based searching.

SMRF vs. QTIBRF

           Rank Fun.   Ave.P    P@10     P@20
VD only    QTIBRF      0.0705   0.2850   0.2431
AD only    QTIBRF      0.2127   0.4437   0.3750
VD+AD      SMRF        0.2208   0.4767   0.4184


SMRF vs. RA1 and RA2

                SMRF     RA1      RA2
Ave. P          0.1203   0.1212   0.1204
Recall 0.0      0.7036   0.7116   0.7226
Recall 0.1      0.4157   0.4246   0.4143
Recall 0.2      0.2576   0.2577   0.2557
Recall 0.3      0.1751   0.1759   0.1740
Prec. @5        0.4629   0.4457   0.4629
Prec. @10       0.4000   0.3943   0.4057
Prec. @20       0.3529   0.3514   0.3543
Prec. @30       0.3314   0.3286   0.3343

Rank comparison of relevant file


Conclusion
- Weighting query terms through entropy on the VD space improves search results. This indicates that a system that makes use of Web structure, such as anchor text and titles, will perform better than a content-only system that ignores it.
- Combining the query-independent score from our proposed link analysis model gave no clear improvement, but the results indicate its potential for improving search results.

Thank you
