Search Engines Session 5 INST 301 Introduction to Information - PowerPoint PPT Presentation

Search Engines Session 5 INST 301 Introduction to Information Science

Washington Post (2007)

So what is a search engine?

Query: the cat food
D1: cats eat canned food. the cat food is not good for dogs.
D2: Natural organic cat food available at petco.com

Find all the brown boxes: no structure and no index

How about here?
• This is what indexing does
• Makes data available in a structured format that is easily accessible through search

Building Index
Documents:
1: cats eat canned food. the cat food is not good for dogs.
2: natural organic cat food available at petco.com

Term – Document Index Matrix
TERM        D1   D2
available   0    1
canned      1    0
cat         2    1
dog         1    0
eat         ?    ?
food        ?    ?
…           …    …
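A minimal sketch of how such a term-document matrix could be built, assuming a naive tokenizer that lowercases, trims edge punctuation, and strips a plural "s" so that "cats" and "cat" count as the same term (the slide's count of 2 for "cat" in D1 implies some normalization of this kind). The names `tokenize` and `build_matrix` are illustrative, not from the slides.

```python
from collections import Counter

DOCS = {
    "D1": "cats eat canned food. the cat food is not good for dogs.",
    "D2": "natural organic cat food available at petco.com",
}


def tokenize(text):
    """Lowercase, split on whitespace, trim edge punctuation, strip a plural 's'."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,")
        # Naive stemming (an assumption): "cats" -> "cat", "dogs" -> "dog".
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        tokens.append(word)
    return tokens


def build_matrix(docs):
    """Map each term to its list of per-document counts, in document order."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    terms = sorted(set().union(*counts.values()))
    return {term: [counts[doc_id][term] for doc_id in docs] for term in terms}


matrix = build_matrix(DOCS)
for term in ("available", "canned", "cat", "dog"):
    print(f"{term:<10}", *matrix[term])
```

Running the same loop over "eat" and "food" gives the rows that the slide leaves as an exercise.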

Query: the cat food
D1: cats eat canned food. the cat food is not good for dogs.
D2: Natural organic cat food available at petco.com
D3: the the the the the the
Some terms are more informative than others

How Specific is a Term?
(N = 1,000,000 documents; logs are base 10)

TERM (t)     Document frequency   Inverse document frequency    Log of inverse document frequency
             of term t (df_t)     of term t (idf_t = N/df_t)    of term t [log(idf_t)]
cat          1                    1,000,000                     6
petco.com    100                  10,000                        4
food         1,000                1,000                         3
canned       10,000               100                           2
good         100,000              10                            1
the          1,000,000            1                             0

Magnitude of increase: the log column records the order of magnitude of each idf value.
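The columns of this table can be reproduced with a few lines of Python. This is a sketch that assumes N = 1,000,000 (the value implied by the table, since df = 1 gives idf = 1,000,000) and a base-10 logarithm; the function names are illustrative.

```python
import math

N = 1_000_000  # collection size implied by the table

DOCUMENT_FREQUENCY = {
    "cat": 1,
    "petco.com": 100,
    "food": 1_000,
    "canned": 10_000,
    "good": 100_000,
    "the": 1_000_000,
}


def idf(df, n=N):
    """Inverse document frequency: N / df_t."""
    return n / df


def log_idf(df, n=N):
    """Base-10 log of the inverse document frequency."""
    return math.log10(idf(df, n))


for term, df in DOCUMENT_FREQUENCY.items():
    print(f"{term:<10} {df:>9,} {idf(df):>11,.0f} {log_idf(df):>3.0f}")
```

A term that appears in every document ("the") ends up with a log idf of 0, so it contributes nothing to ranking.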

Putting it all together
• To rank, we obtain the weight for each term using tf-idf
• The tf-idf weight of a term is the product of its tf weight and its idf weight
  Weight(t) = tf_t × log(N / df_t)
• Using the term weights, we obtain the document weight
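A sketch of the ranking step for the query "the cat food", under several assumptions that the slides do not state: tf is the raw count of the term in the document, the document weight is the sum of the tf-idf weights of the query terms, the log is base 10, and the df values are the ones from the "How Specific is a Term?" table with N = 1,000,000. The tokenizer is the same naive, illustrative one (lowercase, trim punctuation, strip a plural "s").

```python
import math
from collections import Counter

N = 1_000_000
DF = {"cat": 1, "petco.com": 100, "food": 1_000,
      "canned": 10_000, "good": 100_000, "the": 1_000_000}

DOCS = {
    "D1": "cats eat canned food. the cat food is not good for dogs.",
    "D2": "natural organic cat food available at petco.com",
    "D3": "the the the the the the",
}


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, trim punctuation, strip plural 's'."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,")
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        tokens.append(word)
    return tokens


def tfidf_weight(tf, df, n=N):
    """Weight(t) = tf_t * log10(N / df_t)."""
    return tf * math.log10(n / df)


def score(query, text):
    """Document weight, assumed here to be the sum of the query-term weights."""
    counts = Counter(tokenize(text))
    return sum(tfidf_weight(counts[t], DF[t]) for t in tokenize(query) if t in DF)


QUERY = "the cat food"
ranking = sorted(DOCS, key=lambda doc_id: score(QUERY, DOCS[doc_id]), reverse=True)
for doc_id in ranking:
    print(doc_id, round(score(QUERY, DOCS[doc_id]), 2))
```

D3 consists only of "the", whose log idf is 0, so it scores 0 even though it matches a query term six times; D1 ranks first because it contains "cat" and "food" twice each.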

Finding Based on Metadata or Description
• A type of “document expansion”
  – Terms near links describe content of the target
• Works even when you can’t index content
  – Image retrieval, uncrawled links, …
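One way to picture this kind of document expansion is to collect the text inside each link and file it under the link's target, so the target can be found by those terms even if its own content was never indexed. The sketch below uses Python's standard `html.parser`; the class name, the sample page, and the example.com URL are made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the anchor text of every <a href=...> and groups it by target URL."""

    def __init__(self):
        super().__init__()
        self.descriptions = defaultdict(list)  # target URL -> list of anchor texts
        self._href = None                      # target of the link being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.descriptions[self._href].append(text)
            self._href = None


# Hypothetical page linking to an image that a text indexer cannot read.
PAGE = '<p>See this <a href="http://example.com/cat.jpg">photo of a cat eating canned food</a>.</p>'

collector = AnchorTextCollector()
collector.feed(PAGE)
collector.close()
print(dict(collector.descriptions))
```

The collected strings can then be tokenized and indexed as if they were part of the target document.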

Ways of Finding Information
• Searching content
  – Characterize documents by the words they contain
• Searching behavior
  – Find similar search patterns
  – Find items that cause similar reactions
• Searching description
  – Anchor text

Crawling the Web

Web Crawl Challenges
• Adversary behavior
  – “Crawler traps”
• Duplicate and near-duplicate content
  – 30-40% of total content
  – Check if the content is already indexed
  – Skip documents that do not provide new information
• Network instability
  – Temporary server interruptions
  – Server and network loads
• Dynamic content generation
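The "check if the content is already indexed" step can be sketched with a content fingerprint. This example only catches exact duplicates after whitespace and case normalization; near-duplicate detection needs a similarity measure and is not shown. The names `fingerprint` and `SeenContent` are illustrative.

```python
import hashlib


def fingerprint(text):
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenContent:
    """Remembers fingerprints of content that has already been indexed."""

    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        """Return True the first time a piece of content is seen, False afterwards."""
        fp = fingerprint(text)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenContent()
pages = [
    "Natural organic cat food",
    "natural   organic CAT food",  # same content, different spacing and case
    "Canned dog food",
]
to_index = [page for page in pages if seen.is_new(page)]
print(to_index)
```

A crawler would call `is_new` before handing a fetched page to the indexer and skip the page when it returns False.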

How does Google PageRank work?
Objective: estimate the importance of a webpage
• Inlinks are “good” (like recommendations)
• Inlinks from a “good” site are better than inlinks from a “bad” site
[Figure: link graph with pages labeled Pa, Px, P2, P1, Py, Pk, Pi, Pj]
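The slide gives only the intuition. One common formulation, sketched here as an assumption rather than as the method described in class, is power iteration with a damping factor of 0.85: each page repeatedly passes a share of its score to the pages it links to. The page names echo the figure labels, but the link structure below is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                targets = pages  # a page with no outlinks spreads its score evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: Px and Py each have exactly one inlink.
# Px is linked from Pa (which has several inlinks); Py is linked from Pk (which has none).
LINKS = {
    "P1": ["Pa"],
    "P2": ["Pa"],
    "Pa": ["Px"],
    "Px": ["P1", "P2"],
    "Pk": ["Py"],
    "Py": ["Pa"],
}

rank = pagerank(LINKS)
for page, value in sorted(rank.items(), key=lambda item: item[1], reverse=True):
    print(page, round(value, 3))
```

Px ends up well above Py even though both have a single inlink, which is the "inlinks from a good site are better" point of the slide.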

Link Structure of the Web
Nature 405, 113 (11 May 2000) | doi:10.1038/35012155

So, a Web search engine is an application composed of:
• CRAWLING component: important for defining the search space
• INDEXING component: of importance to developers and content-centric
• SEARCH component: of importance to the users and user-centric

Today: The “Search Engine”
[Diagram of the retrieval process, with labels: Source Selection, Query Formulation, Query, Search, Ranked List, Selection, Document, Examination, Delivery, IR System, Indexing, Index, Acquisition, Collection]

Next Session: “The Search”
[Diagram of the retrieval process, with labels: Source Selection, Query Formulation, Query, Search, Ranked List, Selection, Document, Examination, Delivery, IR System, Indexing, Index, Acquisition, Collection]

Before You Go
• Assignment H2
• On a sheet of paper, answer the following (ungraded) question (no names, please): What was the muddiest point in today’s class?
