CS290N Summary 2015 Tao Yang

Text books
• [CMS] Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2010.
• [MRS] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (HTML edition available online).
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval (second edition), Addison-Wesley, 2011.
• Charles L. A. Clarke, Stefan Buettcher, Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, MIT Press.

Search Result Reply Pages
• Advertisements
• Main results
• Suggestions and recommendations

A Crawler Architecture
[Figure from: C. Olston and M. Najork, Web Crawling, Found. Trends Inf. Retr., 4(3):175-246, March 2010.]

A Crawler Architecture Week 8

Focused Crawling
• Attempts to download only those pages that are about a particular topic
  – Used by vertical search applications
  – E.g. crawl and collect the technical reports and papers that appear on computer science department websites
• Relies on the fact that pages about a topic tend to link to other pages on the same topic
  – Popular pages for a topic are typically used as seeds
• The crawler uses a text classifier to decide whether a page is on topic
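The focused-crawling loop above can be sketched in a few lines. This is a minimal illustration, not the crawler from the lecture: the in-memory `WEB` graph, the keyword-based `topic_score` (a stand-in for a trained text classifier), and the `threshold` value are all assumptions made for the example.

```python
import heapq

# Toy in-memory "web": URL -> (page text, outgoing links). Purely illustrative.
WEB = {
    "cs.example/a": ("technical report on information retrieval",
                     ["cs.example/b", "news.example/x"]),
    "cs.example/b": ("paper on web search and retrieval", ["cs.example/c"]),
    "cs.example/c": ("retrieval evaluation technical paper", []),
    "news.example/x": ("celebrity gossip and sports", ["news.example/y"]),
    "news.example/y": ("more sports scores", []),
}

# Hypothetical topic vocabulary standing in for a trained classifier.
TOPIC_TERMS = {"retrieval", "search", "paper", "report", "technical"}


def topic_score(text):
    """Stand-in classifier: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def focused_crawl(seeds, threshold=0.3, limit=10):
    """Crawl from the seeds, keeping and expanding only on-topic pages."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen, on_topic = set(seeds), []
    while frontier and len(on_topic) < limit:
        _, url = heapq.heappop(frontier)
        text, links = WEB.get(url, ("", []))
        score = topic_score(text)
        if score < threshold:
            continue  # off-topic: neither keep the page nor follow its links
        on_topic.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Links inherit the score of the page that cites them.
                heapq.heappush(frontier, (-score, link))
    return on_topic


print(focused_crawl(["cs.example/a"]))
```

Links found on off-topic pages are never enqueued, which is how the crawl stays inside the topical neighborhood of its seeds.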

Where/what to modify in this architecture for a focused crawler?

Offline Architecture at Ask

Offline Architecture at Ask
[Figure annotations: the components are covered in Weeks 2, 6, 8, and 9.]

Similarity Analysis
Document → the set of strings of length k that appear in the document → Signatures → Locality-sensitive hashing → Candidate pairs
• Signatures: short integer vectors that represent the sets and reflect their similarity
• Candidate pairs: those pairs of signatures that we need to test for similarity

Example of Shingling and Minhash
[Figure: the shingles of Document 1 and Document 2 are mapped to values in a 2^64 range, giving sets A and B. Are these equal? Test for 200 random permutations: p_1, p_2, …, p_200]
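A minimal sketch of shingling followed by minhashing, under some stated assumptions: character shingles of length `k=5`, SHA-1 truncated to 64 bits as the shingle hash, and 200 random linear hash functions `(a*x + b) mod PRIME` standing in for the 200 random permutations (a common approximation, not necessarily the scheme used in the lecture).

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # a large Mersenne prime used as the hash modulus


def shingles(text, k=5):
    """Set of all length-k character substrings of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def shingle_id(s):
    """Deterministic 64-bit integer for a shingle."""
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")


def make_hashes(n=200, seed=0):
    """n random (a, b) pairs, each defining x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def signature(shingle_set, hashes):
    """Minhash signature: the minimum hashed shingle under each function."""
    ids = [shingle_id(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hashes]


def estimate(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


def jaccard(s1, s2):
    """Exact Jaccard similarity of two sets."""
    return len(s1 & s2) / len(s1 | s2)


d1 = shingles("the quick brown fox jumps over the lazy dog")
d2 = shingles("the quick brown fox jumped over the lazy dogs")
hashes = make_hashes()
print(jaccard(d1, d2), estimate(signature(d1, hashes), signature(d2, hashes)))
```

The fraction of agreeing signature positions estimates the Jaccard similarity of the two shingle sets, so two 200-integer vectors can be compared instead of the full sets.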

Locality-Sensitive Hashing
• General idea: use a function f(x,y) that tells whether or not x and y are a candidate pair: a pair of elements whose similarity must be evaluated.
• Map each document to many buckets.
• Make elements of the same bucket candidate pairs.
  – Sample probability of collision:
    – 10% similarity → 0.1%
    – 1% similarity → 0.0001%
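The sample collision probabilities above are consistent with a bucket key built from three minhash values: 0.1^3 = 0.1% and 0.01^3 = 0.0001%. The slide does not state the key size, so `rows=3` below is an inference. The sketch uses the standard banding scheme, with hypothetical signatures as input.

```python
from collections import defaultdict
from itertools import combinations


def collision_probability(similarity, rows):
    """Chance that two documents agree on all `rows` minhash values of a band."""
    return similarity ** rows


def candidate_pairs(signatures, bands, rows):
    """signatures: dict doc_id -> list of ints of length bands * rows.

    Each band of `rows` values is used as a bucket key; documents sharing
    a bucket in any band become candidate pairs.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


# Hypothetical signatures: d1 and d2 agree on the first band only.
sigs = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [1, 2, 3, 9, 9, 9],
    "d3": [7, 7, 7, 8, 8, 8],
}
print(candidate_pairs(sigs, bands=2, rows=3))
print(collision_probability(0.10, 3), collision_probability(0.01, 3))
```

Only the candidate pairs are then compared in full, which avoids testing every pair of documents.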

Software Infrastructure Support at Ask.com
• Programming support (multi-threading, exception handling, Hadoop MapReduce)
• Data stores for managing billions of objects
  – Distributed hash tables, queues, etc.
• Communication and data exchange among machines/services
• Execution environment
  – Controllable (stop, pause, restart)
  – Service registration and invocation
  – Service monitoring
  – Logging and test framework

Requirements for Data Repository Support in Offline Systems
• Update
  – Handling large volumes of modified documents
  – Adding new content
• Random access
  – Request the content of a document based on its URL
• Compression and large files
  – Reducing storage requirements and efficient access
• Scan
  – Scan documents for text mining
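The four requirements above (update, random access by URL, compression, scan) can be illustrated with a tiny in-memory store. This is a sketch only: the class name `AppendOnlyStore` and its methods are hypothetical, and a real repository would write large files on a distributed file system rather than keep a Python list.

```python
import zlib


class AppendOnlyStore:
    """Toy document repository: append-only log plus a URL index."""

    def __init__(self):
        self._log = []    # list of (url, compressed content)
        self._index = {}  # url -> position of the latest version in the log

    def append(self, url, content):
        """Update: new or modified documents are appended, never rewritten."""
        self._index[url] = len(self._log)
        self._log.append((url, zlib.compress(content.encode("utf-8"))))

    def get(self, url):
        """Random access: fetch the latest content of a document by URL."""
        pos = self._index.get(url)
        if pos is None:
            return None
        return zlib.decompress(self._log[pos][1]).decode("utf-8")

    def scan(self):
        """Scan: yield (url, content) for the latest version of each document."""
        for pos, (url, blob) in enumerate(self._log):
            if self._index[url] == pos:  # skip superseded versions
                yield url, zlib.decompress(blob).decode("utf-8")


store = AppendOnlyStore()
store.append("http://example.com/a", "first version")
store.append("http://example.com/b", "another page")
store.append("http://example.com/a", "second version")
print(store.get("http://example.com/a"))
print(list(store.scan()))
```

Appending keeps writes sequential, while the index provides URL-based lookups without scanning the log.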

Options for Key-value Data Stores
• Supported operations: append or put, and get
• Bigtable at Google
• Dynamo at Amazon
• Open source software

Open source software   Technology         Language/Platform   Users/sponsors
Apache Cassandra       Bigtable, Dynamo   Java/Hadoop         Apache
Hypertable             Bigtable           C++/Hadoop          Baidu
HBase                  Bigtable           Java/Hadoop         Apache
LevelDB                Bigtable           C++                 Google
MongoDB                                   C++

Sample Requirements for Applications: Data repository for crawling
• Common data operations
  – Update: mainly append operations every day
  – Content read:
    – Typically scan and then transfer data to another cluster
    – Sometimes: random access to individual pages for inspection

Sample Requirements for periodic data reclassification
• Data repository hosting a large page collection with periodic page re-classification
  – Update: append-only operations for raw data
    – Metadata is modified periodically for selected pages (random access)
  – Read: scan-only operations for raw data processing
    – Occasional random reads for a small number of pages
[Figure: data repository feeding MapReduce for classification]

Online Engine Architecture
[Figure labels: Client queries; Traffic load balancer; Frontend (several); PageInfo; Hierarchical; Suggestions; Neptune Clustering Middleware; Cache (several); Document; Ranking; Abstract; Web page index; Abstract description; Classification; Structured DB]
Reference: L. Barroso, J. Dean, U. Hölzle, Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, vol. 23 (2003)

Online Engine Architecture
[Same figure, annotated with course weeks: Week 1; Weeks 2, 6, 7 summary. Slide dated 3/11/2015.]

Document Ranking with Text, Quality, and Click Features
• Text features
  – TF-IDF, BM25
  – Where do they appear? Title/body
  – Proximity (word distance)
• Document quality and classification
  – Web link scores (e.g. PageRank)
  – Page length, URL type, etc.
• User behavior data
  – Presentation: what a user sees before a click
  – Clickthrough: frequency and timing of clicks
  – Browsing: what users do after a click
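As one concrete text feature from the list above, here is a minimal BM25 scorer. It is a sketch under assumptions: whitespace tokenization, the commonly used defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant `log((N - n + 0.5)/(n + 0.5) + 1)`. The slides do not specify which BM25 variant the course used.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "web search engines rank web pages",
    "cooking pasta at home",
    "search ranking with click data",
]
print(bm25_scores("web search", docs))
```

In a learned ranker, a score like this would be one feature among many, computed separately for fields such as title and body.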

Learning to rank
• Convert the ranking problem to a classification problem.
  – Point-wise learning
    – Given a query-document pair, predict a score (e.g. relevancy score)
  – Pair-wise learning
    – The input is a pair of documents for a query
  – List-wise learning
• Bayes, SVM, decision trees, human rules.
• Bagging/boosting to combine multiple schemes
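Pair-wise learning can be sketched by turning each pair of documents for a query into a difference vector and training a binary classifier on those vectors. The example below uses a plain perceptron as the classifier; the two-dimensional feature vectors and relevance grades are made up for illustration, and the perceptron is a stand-in for the SVM or tree learners named above.

```python
def pairwise_examples(query_docs):
    """query_docs: list of (feature_vector, relevance_grade) for ONE query.

    For every pair where the first document is more relevant, emit the
    feature difference; a good weight vector scores each one positively.
    """
    diffs = []
    for fa, ga in query_docs:
        for fb, gb in query_docs:
            if ga > gb:
                diffs.append([x - y for x, y in zip(fa, fb)])
    return diffs


def train_perceptron(diffs, epochs=100):
    """Learn w such that w . d > 0 for every difference vector d."""
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        mistakes = 0
        for d in diffs:
            if sum(wi * di for wi, di in zip(w, d)) <= 0:
                w = [wi + di for wi, di in zip(w, d)]
                mistakes += 1
        if mistakes == 0:
            break  # every training pair is ordered correctly
    return w


def rank(w, query_docs):
    """Sort documents by descending learned score."""
    return sorted(query_docs,
                  key=lambda fg: -sum(wi * xi for wi, xi in zip(w, fg[0])))


# Hypothetical training data: two queries, three graded documents each.
q1 = [([0.9, 0.2], 2), ([0.5, 0.8], 1), ([0.1, 0.5], 0)]
q2 = [([0.8, 0.9], 2), ([0.4, 0.1], 1), ([0.2, 0.7], 0)]

w = train_perceptron(pairwise_examples(q1) + pairwise_examples(q2))
print(w, [grade for _, grade in rank(w, q1)])
```

Pairs are formed only within a query, since relevance grades are not comparable across queries.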

Learning Ensembles
• Learn multiple classifiers separately
• Combine decisions (e.g. using weighted voting)
• When combining multiple decisions, random errors cancel each other out and correct decisions are reinforced.
[Figure: Training Data → Data 1, Data 2, …, Data m → Learner 1, Learner 2, …, Learner m → Model 1, Model 2, …, Model m → Model Combiner → Final Model]
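The claim that random errors cancel out can be checked with a small calculation, assuming the classifiers err independently (an idealization; real learners trained on overlapping data are correlated). The sketch also shows weighted voting; the labels and weights are made up.

```python
from math import comb


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def majority_error(p, m):
    """Probability that a majority of m independent classifiers,
    each wrong with probability p, is wrong (m odd)."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m // 2 + 1, m + 1))


print(weighted_vote(["spam", "ham", "ham"], [0.6, 0.2, 0.1]))
print(majority_error(0.3, 3), majority_error(0.3, 21))
```

With an individual error rate of 0.3, a majority of three independent classifiers errs with probability 3(0.3)^2(0.7) + (0.3)^3 = 0.216, and the error keeps falling as more independent classifiers are added.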

Recommendation vs Search Ranking
• Item recommendation: user rating, content
• Web page ranking: user click data, text content, link popularity
• Collaborative filtering: similarity-guided recommendation. The predicted rating of user a for item i is

p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n} w_{a,u} (r_{u,i} - \bar{r}_u)}{\sum_{u=1}^{n} w_{a,u}}
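The collaborative-filtering prediction for user a and item i (the active user's mean rating plus a similarity-weighted average of other users' deviations from their own means) can be sketched as follows. The ratings and the similarity weights are made-up inputs; how the weights are computed (e.g. Pearson correlation) is left out, and positive weights are assumed.

```python
def mean_rating(user_ratings):
    """Average of the ratings a user has given."""
    return sum(user_ratings.values()) / len(user_ratings)


def predict(ratings, weights, active, item):
    """Predict the active user's rating of an item.

    ratings: dict user -> dict item -> rating
    weights: dict other_user -> similarity to the active user (assumed > 0)
    """
    base = mean_rating(ratings[active])
    num = den = 0.0
    for user, w in weights.items():
        if item in ratings[user]:
            # Deviation of this user's rating from their own mean.
            num += w * (ratings[user][item] - mean_rating(ratings[user]))
            den += w
    return base if den == 0 else base + num / den


ratings = {
    "a": {"i1": 4, "i2": 2},
    "u1": {"i1": 4, "i2": 3, "i3": 5},
    "u2": {"i1": 3, "i2": 5, "i3": 1},
}
weights = {"u1": 1.0, "u2": 0.25}  # hypothetical similarities to user "a"
print(predict(ratings, weights, "a", "i3"))
```

Subtracting each user's own mean corrects for users who rate systematically high or low.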

Content-Boosted Collaborative Filtering with a Sparse Rating Matrix
• Combine content-based prediction with user ratings
[Figure: the user-rated items in a user-ratings vector serve as training examples for a content-based predictor; the items with predicted ratings fill in the unrated items to form a pseudo user-ratings vector.]

Search Advertisement

Query-advertisement matching

User Behavior Analysis with Query Sessions
[Figure: levels of user behavior data, from coarse to fine: Session; Mission; Query level (queries); Click level (clicks); Eye-tracking level (fixations)]
Query-URL correlations:
• Query-to-pick
• Query-to-query
• Pick-to-pick

Topic Summary: Data-Driven & Large-Scale
• Information Retrieval and Web Search
  – Crawling, indexing, compression, and online retrieval/matching
  – Learning-to-rank with text/link/click analysis
• Text Mining
  – Similarity analysis. Text categorization and clustering. Recommendation
• Advertisement
• Systems Support
  – Online servers and offline computation
  – Caching. MapReduce. Key-value stores. Document parsing
  – Open source systems
Reference: T. Yang, A. Gerasoulis, Web Search Engines: Practice and Experience. Computer Science Handbook (T. Gonzalez, ed.), 2014. Chapman & Hall/CRC Press.
