Crawling HTML
Lecture slides, 1/22/2013

Standard Web Search Engine Architecture
[Diagram: crawl the web → store documents, check for duplicates, extract links → DocIds → create an inverted index; the search engine servers take a user query, consult the inverted index, and show results to the user. Slide adapted from Marti Hearst / UC Berkeley.]

Search Engine Architecture
• Crawler (Spider)
  – Searches the web to find pages. Follows hyperlinks. Never stops.
• Indexer
  – Produces data structures for fast searching of all words in the pages.
• Retriever
  – Query interface
  – Database lookup to find hits
  – Ranking, summaries
  – Front end
• Scale: ~300 million documents; 300 GB RAM, terabytes of disk.

Crawlers (aka Spiders, Bots)
• 1000s of spiders out there…
• Various purposes:
  – Search engines
  – Digital rights management
  – Advertising
  – Spam harvesters
  – Link checking / site validation

Open-Source Crawlers
• GNU Wget
  – Utility for downloading files from the Web.
  – Fine if you just need to fetch files from 2-3 sites.
• Heritrix
  – Open-source, extensible, Web-scale crawler.
  – Easy to get running.
  – Web-based UI.
• Nutch
  – Featureful, industrial-strength Web search package.
  – Includes the Lucene information-retrieval component:
    • TF/IDF and other document ranking (toy inverted-index and TF/IDF sketches follow below)
    • Optimized, inverted-index data store
  – You get complete control through easy programming.
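The indexer/retriever split above is easiest to see in miniature. Here is a minimal sketch, not the architecture from the slides: it builds a toy in-memory inverted index (term → set of document IDs) and answers a conjunctive keyword query. The function names and the sample documents are invented for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids (postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Boolean AND retrieval: doc_ids that contain every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

# Tiny made-up corpus standing in for crawled pages.
docs = {
    1: "Crawlers follow hyperlinks to find pages",
    2: "The indexer builds an inverted index of all words",
    3: "The retriever looks up query terms in the index",
}
index = build_index(docs)
print(search(index, "index query"))   # -> {3}
```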

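The slides mention that Lucene (inside Nutch) provides TF/IDF-style document ranking. As a rough illustration only, and not Lucene's actual scoring formula, here is a toy TF/IDF scorer over the same kind of tiny in-memory corpus; all names and documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def tfidf_scores(docs, query):
    """docs: dict doc_id -> text. Returns doc_id -> score for the query,
    using tf * idf with idf = log(N / df). A toy formula, not Lucene's."""
    n = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()                     # document frequency of each term
    for counts in term_counts.values():
        df.update(counts.keys())
    scores = {}
    for doc_id, counts in term_counts.items():
        score = sum(counts[t] * math.log(n / df[t])
                    for t in tokenize(query) if t in counts)
        if score > 0:
            scores[doc_id] = score
    return scores

docs = {1: "crawler crawler index", 2: "index retrieval ranking", 3: "ranking ranking ranking"}
print(tfidf_scores(docs, "ranking index"))   # doc 3 scores highest for "ranking"
```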
Danger, Will Robinson!!
• Consequences of a bug
• Max 6 hits/server/minute, plus… http://www.cs.washington.edu/lab/policies/crawlers.html

Robot Exclusion
• A person may not want certain pages indexed.
• Crawlers should obey the Robot Exclusion Protocol, but some don't.
• Look for the file robots.txt at the highest directory level:
  – If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from a crawler by adding the line:
  <META NAME="ROBOTS" CONTENT="NOINDEX">

Danger, Danger
• Ensure that your crawler obeys robots.txt.
• Be sure to:
  – Notify the CS Lab Staff
  – Provide contact info in the user-agent field
  – Monitor the email address
  – Honor all Do Not Scan requests
  – Post all "stop-scanning" requests to classmates
  – "The scanee is always right."
  – Max 6 hits/server/minute

Robots Exclusion Protocol
• Format of robots.txt: two fields.
  – User-agent to specify a robot
  – Disallow to tell the agent what to ignore
• To exclude all robots from a server:
  User-agent: *
  Disallow: /
• To exclude one robot from two directories:
  User-agent: WebCrawler
  Disallow: /news/
  Disallow: /tmp/
• View the robots.txt specification at http://info.webcrawler.com/mak/projects/robots/norobots.html
  (A robots.txt check using Python's standard library is sketched below.)

Crawling
• Queue := initial page URL₀
• Do forever:
  – Dequeue URL
  – Fetch P
  – Parse P for more URLs; add them to the queue
  – Pass P to a (specialized?) indexing program
• Issues…
  – Which page to look at next? (keywords, recency, focus, ???)
  – Politeness: avoiding overload, scripts
  – Parsing links
  – How deep within a site to go?
  – How frequently to visit pages?
  – Traps!
  (A Python sketch of this loop follows the robots.txt example below.)

Web Crawling Strategy
• Starting location(s)
• Traversal order
  – Depth first (LIFO)
  – Breadth first (FIFO)
  – Or ???
• Politeness
• Cycles?
• Coverage?
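Python's standard library already parses the Robots Exclusion Protocol described above. A minimal sketch of the check a polite crawler performs before each fetch; the user-agent string and the example URL are made up.

```python
from urllib import robotparser
from urllib.parse import urlparse

def allowed_to_fetch(url, user_agent="ExampleCrawler"):
    """Fetch <scheme>://<host>/robots.txt and ask whether user_agent may fetch url."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                   # downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)

# Hypothetical usage; honors rules such as "User-agent: * / Disallow: /tmp/".
if allowed_to_fetch("https://www.example.com/news/index.html"):
    print("OK to fetch")
```

Note that the <META NAME="ROBOTS" CONTENT="NOINDEX"> directive can only be honored after the page itself has been fetched and parsed; robots.txt is the only check available before the request.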

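A minimal sketch of the breadth-first crawl loop from the "Crawling" slide. The link extraction is deliberately crude, the politeness is a fixed sleep, and the robots.txt check is omitted (see the previous sketch); every helper name here is invented.

```python
import re
import time
import urllib.request
from collections import deque
from urllib.parse import urljoin

def extract_links(base_url, html):
    """Very crude href extraction; a real crawler would use an HTML parser."""
    return [urljoin(base_url, m) for m in re.findall(r'href=["\'](.*?)["\']', html)]

def index_page(url, html):
    """Stand-in for the 'pass P to (specialized?) indexing program' step."""
    print(f"indexed {url} ({len(html)} bytes)")

def crawl(seed_urls, max_pages=50, delay=1.0):
    """Breadth-first crawl: FIFO queue of URLs; skip URLs already seen."""
    frontier = deque(seed_urls)          # FIFO -> breadth-first traversal
    seen = set(seed_urls)
    while frontier and max_pages > 0:
        url = frontier.popleft()         # dequeue URL
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                     # fetch failed; move on
        max_pages -= 1
        index_page(url, html)
        for link in extract_links(url, html):
            if link not in seen:         # avoids cycles
                seen.add(link)
                frontier.append(link)
        time.sleep(delay)                # crude politeness between fetches

# crawl(["http://example.com/"])        # would fetch real pages; left commented out
```

Replacing popleft() with pop() turns the same loop into the depth-first (LIFO) traversal mentioned on the "Web Crawling Strategy" slide.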
Structure of Mercator Spider
(Built up over several slides; the full processing loop is:)
1. Remove URL from queue
2. Simulate network protocols & REP
3. Read w/ RewindInputStream (RIS): an I/O abstraction that allows multiple reads
4. Has document been seen before? (checksums and fingerprints)
5. Extract links
6. Download new URL?
7. Has URL been seen before?
8. Add URL to frontier
(Another diagram annotation reads "own thread.")

Duplicate Detection
• URL-seen test: has the URL been seen before?
  – To save space, store a hash.
• Content-seen test: different URL, same doc.
  – Suppress link extraction from mirrored pages.
• What to save for each doc?
  – 64-bit "document fingerprint"
  – Minimize the number of disk reads upon retrieval.
  (Sketches of the content-seen and URL-seen tests follow below.)

Outgoing Links?
• Parse HTML…
• Looking for… what?
[Illustration: a wall of raw page text and tag fragments with "a href = www.cs" buried inside, making the point that link extraction must find URL-bearing attributes in messy HTML.]
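A minimal sketch of the content-seen test: hash each downloaded page to a fixed-size fingerprint and skip link extraction when the fingerprint has been seen before. Mercator uses its own 64-bit fingerprint function; here an MD5 digest is truncated to 64 bits purely as a stand-in.

```python
import hashlib

class ContentSeenTest:
    """Remembers 64-bit fingerprints of document contents."""

    def __init__(self):
        self._fingerprints = set()

    @staticmethod
    def fingerprint(content: bytes) -> int:
        # Truncate an MD5 digest to 64 bits; Mercator's actual
        # fingerprint function differs, this is just an illustration.
        return int.from_bytes(hashlib.md5(content).digest()[:8], "big")

    def seen_before(self, content: bytes) -> bool:
        fp = self.fingerprint(content)
        if fp in self._fingerprints:
            return True        # same document under a different URL (e.g. a mirror)
        self._fingerprints.add(fp)
        return False

seen = ContentSeenTest()
print(seen.seen_before(b"<html>hello</html>"))   # False: first time
print(seen.seen_before(b"<html>hello</html>"))   # True: mirrored copy
```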

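The URL-seen test works the same way on the URL string itself: normalize the URL, store a compact hash rather than the full string to save space, and only enqueue URLs whose hash is new. A sketch under those assumptions; the normalization here is deliberately simplistic.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase the host and drop fragments; real canonicalization does more."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

class UrlSeenTest:
    def __init__(self):
        self._hashes = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL had not been seen before (and record it)."""
        h = hashlib.sha1(normalize(url).encode()).digest()[:8]  # 8-byte hash to save space
        if h in self._hashes:
            return False
        self._hashes.add(h)
        return True

seen = UrlSeenTest()
print(seen.add_if_new("HTTP://Example.com/index.html#top"))  # True: first time
print(seen.add_if_new("http://example.com/index.html"))      # False: same after normalization
```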
Which tags / attributes hold URLs?
• Anchor tag: <a href="URL" …> … </a>
• Option tag: <option value="URL" …> … </option>
• Map: <area href="URL" …>
• Frame: <frame src="URL" …>
• Link to an image: <img src="URL" …>
• Relative path vs. absolute path: <base href=…>
• Bonus problem: Javascript
• In our favor: Search Engine Optimization
(A link-extractor sketch covering these tags follows below.)

Structure of Mercator Spider (recap of steps 1–8 listed above)

URL Frontier (priority queue)
• Most crawlers do breadth-first search from seeds.
• Politeness constraint: don't hammer servers!
  – Obvious implementation: "live host table"
  – Will it fit in memory?
  – Is this efficient?
• Mercator's politeness (sketched below):
  – One FIFO subqueue per thread.
  – Choose subqueue by hashing the host's name.
  – Dequeue the first URL whose host has NO outstanding requests.

Fetching Pages
• Need to support http, ftp, gopher, …
  – Extensible!
• Need to fetch multiple pages at once.
• Need to cache as much as possible:
  – DNS
  – robots.txt
  – Documents themselves (for later processing)
• Need to be defensive!
  – Need to time out http connections.
  – Watch for "crawler traps" (e.g., infinite URL names).
  – See section 5 of the Mercator paper.
  – Use a URL filter module.
  – Checkpointing!

Mercator Statistics
(Chart label: "exponentially increasing size.")
PAGE TYPE     PERCENT
text/html       69.2%
image/gif       17.9%
image/jpeg       8.1%
text/plain       1.5%
pdf              0.9%
audio            0.4%
zip              0.4%
postscript       0.3%
other            1.4%

Nutch: A simple architecture
• Seed set
• Crawl
• Remove duplicates
• Extract URLs (minus those we've been to)
  – new frontier
• Crawl again
• Can do this with a Map/Reduce architecture
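A minimal link extractor covering the tags and attributes listed on the slide (a href, area href, frame src, img src, option value, plus base href for resolving relative paths). It uses only the standard-library HTML parser; JavaScript-generated links, as the slide notes, are out of reach for this approach. The sample HTML is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which (tag, attribute) pairs can hold a URL, per the slide.
URL_ATTRS = {("a", "href"), ("area", "href"), ("frame", "src"),
             ("img", "src"), ("option", "value")}

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url      # <base href=...> can override this
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = attrs["href"]
        for t, a in URL_ATTRS:
            if tag == t and attrs.get(a):
                # Resolve relative paths against the (possibly overridden) base.
                self.links.append(urljoin(self.base, attrs[a]))

html = '<base href="http://example.com/dir/"><a href="page.html">x</a><img src="/pic.gif">'
p = LinkExtractor("http://example.com/")
p.feed(html)
print(p.links)   # ['http://example.com/dir/page.html', 'http://example.com/pic.gif']
```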

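A sketch of the politeness idea described for Mercator's frontier: keep one FIFO subqueue per host and only hand out a URL when its host has no outstanding request. This simplifies the real design (Mercator keys subqueues by hashing the host name to a worker thread and also spaces requests in time); every name below is invented.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    """One FIFO subqueue per host; a host with an outstanding request is skipped."""

    def __init__(self):
        self.subqueues = defaultdict(deque)   # host -> FIFO of URLs
        self.busy_hosts = set()               # hosts with an outstanding fetch

    def add(self, url):
        self.subqueues[urlparse(url).netloc].append(url)

    def next_url(self):
        """Return a URL whose host is idle, or None if every queued host is busy."""
        for host, q in self.subqueues.items():
            if q and host not in self.busy_hosts:
                self.busy_hosts.add(host)
                return q.popleft()
        return None

    def done(self, url):
        """Mark the fetch finished so the host becomes eligible again."""
        self.busy_hosts.discard(urlparse(url).netloc)

f = PoliteFrontier()
f.add("http://a.example/1"); f.add("http://a.example/2"); f.add("http://b.example/1")
u1 = f.next_url()    # http://a.example/1
u2 = f.next_url()    # http://b.example/1 -- a.example is still busy
f.done(u1)
u3 = f.next_url()    # now http://a.example/2
print(u1, u2, u3)
```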
Advanced Crawling Issues
• Limited resources
  – Fetch the most important pages first
• Topic-specific search engines
  – Only care about pages which are relevant to the topic
  – Exploration / exploitation problem
  – "Focused crawling"
• Minimize stale pages
  – Efficient re-fetch to keep the index timely
  – How to track the rate of change for pages?

Focused Crawling
• Priority queue instead of FIFO.
• How to determine priority?
  – Similarity of the page to the driving query
    • Use traditional IR measures
  – Backlinks
    • How many links point to this page?
  – PageRank (Google)
    • Some links to this page count more than others (a toy PageRank sketch appears at the end)
  – Forward links of a page
  – Location heuristics
    • E.g., is the site in .edu?
    • E.g., does the URL contain 'home' in it?
  – Linear combination of the above (a priority-scoring sketch follows below)
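A sketch of a focused-crawling frontier: a priority queue instead of FIFO, where each URL's priority is a linear combination of the signals listed above (query similarity of the referring page, backlink count, and simple location heuristics). The weights and scoring functions are invented placeholders, not values from the slides.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical weights for the linear combination.
W_SIMILARITY, W_BACKLINKS, W_LOCATION = 1.0, 0.1, 0.5

def location_score(url):
    """Cheap heuristics from the slide: .edu sites and 'home' in the URL."""
    score = 0.0
    if urlparse(url).netloc.endswith(".edu"):
        score += 1.0
    if "home" in url.lower():
        score += 1.0
    return score

def priority(url, query_similarity, backlink_count):
    """Higher is better; query_similarity would come from an IR measure such as
    cosine similarity between the driving query and the page linking to this URL."""
    return (W_SIMILARITY * query_similarity
            + W_BACKLINKS * backlink_count
            + W_LOCATION * location_score(url))

class FocusedFrontier:
    def __init__(self):
        self._heap = []     # min-heap, so store negative priority

    def add(self, url, query_similarity=0.0, backlink_count=0):
        heapq.heappush(self._heap, (-priority(url, query_similarity, backlink_count), url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

f = FocusedFrontier()
f.add("http://news.example.com/sports", query_similarity=0.1)
f.add("http://www.cs.washington.edu/homes/", query_similarity=0.6, backlink_count=12)
print(f.pop())   # the .edu page with higher similarity comes out first
```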

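The slide's point that "some links count more than others" is the intuition behind PageRank. The following is a toy power-iteration sketch on a hand-made link graph, for illustration only; real implementations handle dangling pages, scale, and convergence tests far more carefully.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict page -> list of pages it links to. Returns page -> rank."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue                       # ignore dangling pages in this toy version
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                if target in new_rank:
                    new_rank[target] += share  # a link passes on part of its rank
        rank = new_rank
    return rank

# Hypothetical 3-page web: A and C both link to B, B links to C.
graph = {"A": ["B"], "B": ["C"], "C": ["B"]}
print(pagerank(graph))    # B ends up with the highest rank
```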