

Index Structures for Information Filtering Under the Vector Space Model. Tak W. Yan and Hector Garcia-Molina, Department of Computer Science, Stanford University, Stanford, CA 94305.


  1. Index Structures for Information Filtering Under the Vector Space Model*

     Tak W. Yan and Hector Garcia-Molina
     Department of Computer Science
     Stanford University
     Stanford, CA 94305

     Abstract

     With the ever increasing volumes of electronic information generation, users of information systems are facing an information overload. It is desirable to support information filtering as a complement to traditional retrieval mechanism. The number of users, and thus profiles (representing users' long-term interests), handled by an information filtering system is potentially huge, and the system has to process a constant stream of incoming information in a timely fashion. The efficiency of the filtering process is thus an important issue.

     In this paper, we study what data structures and algorithms can be used to efficiently perform large-scale information filtering under the vector space model, a retrieval model established as being effective. We apply the idea of the standard inverted index to index user profiles. We devise an alternative to the standard inverted index, in which we, instead of indexing every term in a profile, select only the significant ones to index. We evaluate their performance and show that the indexing methods require orders of magnitude fewer I/Os to process a document than when no index is used. We also show that the proposed alternative performs better in terms of I/O and CPU processing time in many cases.

     1 Introduction

     Information is increasingly available in electronic form. The number and size of full text document databases are rapidly increasing. Users of such database systems are facing an information overload; it is becoming difficult for users to rely solely on traditional retrospective search and retrieval mechanisms to keep themselves apprised of new documents that are relevant to their interest. As a complement to conventional search mechanism, information systems can provide an information filtering mechanism, through which a user subscribes profiles, or queries that are continuously evaluated, to represent his long-term interests, and then passively receives information filtered by the system according to the profiles.

     Research in information filtering has received a lot of attention lately. However, previous work has focused on the effectiveness (precision and recall) of the filtering, and little has been done to address the efficiency (performance) aspect of the problem. We believe that information filtering is going to be used on a large scale and hence the efficiency issue must be addressed. In this paper, we present data structure and algorithms to support information filtering.

     Wide area information retrieval is now a reality; large-scale world-wide information filtering is also foreseeable. Consider a population of users and a number of information sources in a networked information filtering environment. The filtering can be done either at the information sources, at the user sites, or at an intermediate information filtering server (Figure 1). Relying solely on user filtering is expensive since network bandwidth is wasted to transmit irrelevant information and a lot of wasteful local processing is done. Relying on filtering at the sources themselves is also expensive since users need to replicate their profiles at all possible sources. The information filtering server is a good compromise. It collects information from a set of sources and routes it to interested users. Of course, there can be multiple information filtering servers on the network, each servicing a different set (maybe overlapping) of users and information sources.

     Figure 1: Information Filtering Server(s) (diagram labels: Users, Information Sources, Information Filtering Server(s))

     In this paper, we focus on one information filtering server and consider what data structure and algorithms

     * This research was sponsored by the Advanced Research Projects Agency (ARPA) of the Department of Defense under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI). The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of ARPA, the U.S. Government, or CNRI.
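
The abstract's idea of applying a standard inverted index to user profiles, rather than to documents, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cosine similarity over unit-normalized term weights and a positive per-profile threshold, and the names `ProfileIndex`, `add_profile`, and `match`, along with the sample profiles and weights, are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vector):
    """Scale a term -> weight mapping to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else {}


class ProfileIndex:
    """Hypothetical inverted index from terms to (profile_id, weight) postings."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> [(profile_id, weight)]
        self.thresholds = {}               # profile_id -> minimum similarity

    def add_profile(self, profile_id, term_weights, threshold):
        # Assumption: thresholds are positive, so a profile sharing no
        # term with a document can never match it.
        for term, weight in normalize(term_weights).items():
            self.postings[term].append((profile_id, weight))
        self.thresholds[profile_id] = threshold

    def match(self, doc_term_weights):
        """Return sorted ids of profiles whose cosine similarity to the
        document meets or exceeds their threshold."""
        scores = defaultdict(float)
        # Only the posting lists of terms in the document are visited.
        for term, doc_weight in normalize(doc_term_weights).items():
            for profile_id, weight in self.postings.get(term, ()):
                scores[profile_id] += doc_weight * weight
        return sorted(p for p, s in scores.items() if s >= self.thresholds[p])


index = ProfileIndex()
index.add_profile("alice", {"index": 1.0, "filtering": 1.0}, threshold=0.5)
index.add_profile("bob", {"network": 1.0, "bandwidth": 1.0}, threshold=0.5)
matched = index.match({"filtering": 2.0, "index": 1.0, "server": 1.0})
```

Each incoming document touches only the posting lists of the terms it contains, so profiles that share no term with the document are never examined. The alternative described in the abstract, which indexes only the significant terms of each profile, is not sketched here because this section does not give its details.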
