Web search: Web data management and distribution (Serge Abiteboul)

Web search
Web Data Management and Distribution
Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart
http://webdam.inria.fr/textbook
June 4, 2013
WebDam (INRIA)

Outline
1. The World Wide Web
2. Web crawling
3. Web Information Retrieval
4. Web Graph Mining
5. Conclusion

The World Wide Web: Internet and the Web
- Internet: physical network of computers (or hosts) that communicate using the IP protocol (and higher-level protocols)
- World Wide Web, Web, WWW: logical collection of hyperlinked documents
  - static and dynamic
  - public Web and private Webs
  - each document (or Web page, or resource) identified by a URL
  - HTML Web pages, but also media files, PDF documents, etc.
- Communication protocol: HTTP (and HTTPS), based on TCP/IP

The World Wide Web: Uniform Resource Locators
Example: https://www.example.com:443/path/to/doc?name=foo&town=bar, optionally followed by # and a fragment
- scheme (https): way the resource can be accessed; generally http or https
- hostname (www.example.com): domain name of a host (cf. DNS); the hostname of a website may start with www., but this is not a rule
- port (443): TCP port; defaults: 80 for http and 443 for https
- path (/path/to/doc): logical path of the document
- query string (name=foo&town=bar): additional parameters (dynamic documents)
- fragment (after the #): subpart of the document
Query strings and fragments are optional.
Empty path: root of the Web server.
Relative URIs are resolved with respect to a context (e.g., the URI above):
  /titi    resolves to    https://www.example.com/titi
  tata     resolves to    https://www.example.com/path/to/tata
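The decomposition and the relative-URI resolution above can be checked with Python's standard urllib.parse module. This is a small sketch: the fragment "frag" and the helper effective_port are placeholders introduced for the illustration and do not come from the slides.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

# "frag" is a placeholder fragment chosen for this sketch.
url = "https://www.example.com:443/path/to/doc?name=foo&town=bar#frag"
parts = urlsplit(url)

# Default TCP ports per scheme, as stated on the slide.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(split_url):
    """Return the explicit port, or the scheme's default when none is given."""
    return split_url.port or DEFAULT_PORTS.get(split_url.scheme)

# The query string decodes into a dictionary of parameter lists.
params = parse_qs(parts.query)

# Relative URIs are resolved against a context URI.
context = "https://www.example.com/path/to/doc"
from_root = urljoin(context, "/titi")   # absolute path: replaces the whole path
sibling = urljoin(context, "tata")      # relative path: replaces the last segment

print(parts.scheme, parts.hostname, parts.port, parts.path, params, parts.fragment)
print(from_root, sibling)
```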

Web crawling: Web Crawlers
- Crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web
- Basics of crawling:
  1. Start from a given URL or set of URLs
  2. Retrieve and process the corresponding page
  3. Discover new URLs (cf. next slide)
  4. Repeat on each found URL
- No real termination condition (virtually unlimited number of Web pages!)
- Graph-browsing problem:
  - depth-first: not well suited, risk of getting lost in robot traps
  - breadth-first
  - combination of both: breadth-first, with limited-depth depth-first on each discovered website
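The crawling loop can be sketched as a breadth-first traversal with a depth limit and a page budget, since there is no natural termination condition. This is a simplification of the combined strategy above, not a full crawler: fetch_links is a hypothetical callback standing in for "retrieve the page and discover its URLs", and TOY_WEB is an invented in-memory graph so that the example runs without network access.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100, max_depth=3):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)          # URLs already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)    # "retrieve and process" the page
        if depth == max_depth:
            continue           # depth limit: do not expand this page's links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Invented toy graph: each key links to the pages in its list.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d", "a"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

order = crawl(["a"], lambda url: TOY_WEB.get(url, []), max_depth=2)
print(order)
```

With max_depth=2, page "e" is never reached because it sits three links away from the seed; the depth limit and the page budget are what keep a crawler out of robot traps.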

Web crawling: Sources of new URLs
- From HTML pages:
  - hyperlinks: <a href="...">...</a>
  - media: <img src="..."> <embed src="..."> <object data="...">
  - frames: <frame src="..."> <iframe src="...">
  - JavaScript links: window.open("...")
  - etc.
- Other hyperlinked content (e.g., PDF files)
- Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): use regular expressions to extract them
- Referrer URLs
- Sitemaps [sit08]
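A minimal sketch of URL discovery in an HTML page, using only the Python standard library. It covers the tag attributes listed above plus a deliberately simple regular expression for non-hyperlinked URLs in the text; JavaScript links, PDF content, referrers and sitemaps are out of scope. The names URL_ATTRS, URL_RE, LinkExtractor and extract_urls are chosen for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# (tag, attribute) pairs that carry a URL, following the list above.
URL_ATTRS = {
    ("a", "href"), ("img", "src"), ("embed", "src"),
    ("object", "data"), ("frame", "src"), ("iframe", "src"),
}
# Rough pattern for absolute URLs appearing in plain text.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in URL_ATTRS:
                # Resolve relative URLs against the page's own URL.
                self.urls.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        # Non-hyperlinked URLs written out in the text.
        self.urls.extend(URL_RE.findall(data))

def extract_urls(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls

page = (
    '<p>See http://example.org/plain.txt for details.</p>'
    '<a href="toto.html">Toto</a><img src="/img/cat.png">'
)
found = extract_urls(page, "https://www.example.com/path/to/doc")
print(found)
```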

Web crawling: Crawling ethics
- Standard for robot exclusion: robots.txt at the root of a Web server [Kos94].
    User-agent: *
    Allow: /searchhistory/
    Disallow: /search
- Per-page exclusion (de facto standard).
    <meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
- Per-link exclusion (de facto standard).
    <a href="toto.html" rel="nofollow">Toto</a>
- Avoid Denial of Service (DoS): wait between 100 ms and 1 s between two repeated requests to the same Web server
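The robots.txt rules above can be checked with Python's standard urllib.robotparser. In this sketch the file content is fed in directly rather than downloaded, and the user-agent name "example-crawler" and the helper allowed are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content shown on the slide.
ROBOTS_TXT = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="example-crawler"):
    """True if the given agent may fetch this path on the example host."""
    return parser.can_fetch(agent, "https://www.example.com" + path)

for path in ("/searchhistory/today", "/search/results", "/index.html"):
    print(path, allowed(path))
```

Python's parser applies the first rule that matches, in file order, which is why the Allow line has to come before the broader Disallow line here. Other crawlers may resolve such conflicts differently, so keeping the more specific rule first is the safer layout.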

Web Information Retrieval
Outline of this part: Text Preprocessing, Inverted Index, Answering Keyword Queries, Clustering

Web Information Retrieval: Information Retrieval, Search
Problem: how to index Web content so as to answer (keyword-based) queries efficiently?
Context: a set of text documents
  d1  The jaguar is a New World mammal of the Felidae family.
  d2  Jaguar has designed four new engines.
  d3  For Jaguar, Atari was keen to use a 68K family device.
  d4  The Jacksonville Jaguars are a professional US football team.
  d5  Mac OS X Jaguar is available at a price of US $199 for Apple’s new “family pack”.
  d6  One such ruling family to incorporate the jaguar into their name is Jaguar Paw.
  d7  It is a big cat.

Web Information Retrieval: Text Preprocessing
- Initial text preprocessing steps
- A number of optional steps
- Highly dependent on the application
- Highly dependent on the document language (illustrated here with English)

Text Preprocessing: Language Identification
How to find the language used in a document?
- Meta-information about the document: often not reliable!
- Unambiguous scripts or letters: not very common! Examples: 한글 (Korean Hangul), カタカナ (Japanese Katakana), the Thaana script of Maldivian Dhivehi, Għarbi (Maltese), þorn (Icelandic)
- Extension of this: frequent characters, or, better, frequent k-grams
- Use standard machine learning techniques (classifiers)
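A toy illustration of the k-gram idea: each language is represented by the character trigram counts of a sample text, and a document is assigned to the language whose profile is closest under cosine similarity. The two one-sentence samples are invented for this sketch and far too small for real use; a practical classifier would be trained on large corpora, as the last point suggests.

```python
import math
from collections import Counter

def kgrams(text, k=3):
    """Count the character k-grams of a text, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + k] for i in range(len(padded) - k + 1))

def cosine(a, b):
    """Cosine similarity between two k-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented, tiny training samples (assumption of this sketch).
SAMPLES = {
    "en": "the jaguar is a big cat and the hens are in the garden",
    "fr": "les poules du couvent couvent et le jaguar est un grand chat",
}
PROFILES = {lang: kgrams(text) for lang, text in SAMPLES.items()}

def identify(text):
    """Return the language whose trigram profile is most similar to the text."""
    grams = kgrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))

print(identify("the cat is in the garden"))
print(identify("les poules et le chat"))
```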

Text Preprocessing: Tokenization
Principle: separate text into tokens (words).
Not so easy!
- In some languages (Chinese, Japanese), words are not separated by whitespace
- Deal consistently with acronyms, elisions, numbers, units, URLs, emails, etc.
- Compound words: hostname, host-name and host name. Break into two tokens or regroup them as one token? In any case, a lexicon and linguistic analysis are needed! Even more so in other languages such as German.
Usually, punctuation is removed and case is normalized at this point.
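A deliberately simple tokenizer for English, sketched with a regular expression: it lowercases the text, drops punctuation, and keeps a leading dollar sign and internal apostrophes so that $199 and apple’s each stay one token. The pattern and the name tokenize are choices made for this example; it does nothing about compound words, URLs, or languages without whitespace.

```python
import re

# A word, optionally preceded by "$", with optional apostrophe-joined parts.
TOKEN_RE = re.compile(r"\$?\w+(?:['’]\w+)*")

def tokenize(text):
    """Lowercase the text and return its tokens paired with 1-based positions."""
    tokens = TOKEN_RE.findall(text.lower())
    return list(enumerate(tokens, start=1))

d1 = "The jaguar is a New World mammal of the Felidae family."
d5 = "Mac OS X Jaguar is available at a price of US $199 for Apple’s new “family pack”."

print(tokenize(d1))
print(tokenize(d5))
```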

Text Preprocessing: Tokenization Example
Each token is followed by its position in the document.
  d1  the_1 jaguar_2 is_3 a_4 new_5 world_6 mammal_7 of_8 the_9 felidae_10 family_11
  d2  jaguar_1 has_2 designed_3 four_4 new_5 engines_6
  d3  for_1 jaguar_2 atari_3 was_4 keen_5 to_6 use_7 a_8 68k_9 family_10 device_11
  d4  the_1 jacksonville_2 jaguars_3 are_4 a_5 professional_6 us_7 football_8 team_9
  d5  mac_1 os_2 x_3 jaguar_4 is_5 available_6 at_7 a_8 price_9 of_10 us_11 $199_12 for_13 apple’s_14 new_15 family_16 pack_17
  d6  one_1 such_2 ruling_3 family_4 to_5 incorporate_6 the_7 jaguar_8 into_9 their_10 name_11 is_12 jaguar_13 paw_14
  d7  it_1 is_2 a_3 big_4 cat_5

Text Preprocessing: Stemming
Principle: merge different forms of the same word, or of closely related words, into a single stem.
- Not in all applications!
- Useful for retrieving documents containing geese when searching for goose
- Various degrees of stemming
- Possibility of building different indexes, with different stemming

Text Preprocessing: Stemming schemes (1/2)
Morphological stemming. Remove bound morphemes from words:
  - plural markers
  - gender markers
  - tense or mood inflections
  - etc.
Can be linguistically very complex, cf.: Les poules du couvent couvent. [The hens of the monastery brood.]
In English, somewhat easy:
  - Remove final -s, -’s, -ed, -ing, -er, -est
  - Take care of semiregular forms (e.g., -y/-ies)
  - Take care of irregular forms (mouse/mice)
But still some ambiguities: cf. stocking, rose
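The English rules above can be sketched as a naive suffix stripper. This is an illustration only, not a real morphological analyzer: the irregular-form table holds just two entries, the minimum stem length of three letters is an arbitrary guard chosen here, and the function name naive_stem is made up for the example.

```python
# Tiny table of irregular forms; a real system would use a full lexicon.
IRREGULAR = {"mice": "mouse", "geese": "goose"}
# Suffixes from the slide, tried in this order.
SUFFIXES = ("'s", "’s", "ing", "est", "ed", "er", "s")

def naive_stem(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Semiregular plural: families -> family.
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    for suffix in SUFFIXES:
        # Keep at least three letters so short words such as "us" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ("engines", "designed", "families", "mice", "stocking", "rose", "us"):
    print(w, "->", naive_stem(w))
```

The ambiguity is visible in the output: stocking is reduced to stock whether it is a garment or a verb form, and rose is left alone even when it is the past tense of rise. Resolving such cases requires knowing the part of speech.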

Text Preprocessing: Stemming schemes (2/2)
Lexical stemming. Merge lexically related terms of various parts of speech, such as policy, politics, political or politician.
  - For English, Porter’s stemming [Por80]; it stems university and universal to univers: not perfect!
  - Possibility of coupling this with lexicons to merge (near-)synonyms
Phonetic stemming. Merge phonetically related words: search despite spelling errors!
  - For English, Soundex [US 07] stems Robert and Rupert to R163. Very coarse!
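A compact sketch of Soundex in its usual American form: keep the first letter, encode the following consonants as digits, collapse adjacent identical codes (h and w do not separate them, vowels do), and pad or truncate to four characters. The function name soundex and the table layout are choices made for this example.

```python
# Consonant classes of Soundex; vowels, y, h and w carry no code.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}

def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code           # a vowel resets the run (code is "")
    return (result + "000")[:4]   # pad with zeros, keep four characters

print(soundex("Robert"), soundex("Rupert"))
```

Both names map to R163, as stated above. The coarseness is easy to see: any name starting with R followed by a labial, an r and a dental lands in the same class.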

Text Preprocessing: Stemming Example
Each stem is followed by its position in the document.
  d1  the_1 jaguar_2 be_3 a_4 new_5 world_6 mammal_7 of_8 the_9 felidae_10 family_11
  d2  jaguar_1 have_2 design_3 four_4 new_5 engine_6
  d3  for_1 jaguar_2 atari_3 be_4 keen_5 to_6 use_7 a_8 68k_9 family_10 device_11
  d4  the_1 jacksonville_2 jaguar_3 be_4 a_5 professional_6 us_7 football_8 team_9
  d5  mac_1 os_2 x_3 jaguar_4 be_5 available_6 at_7 a_8 price_9 of_10 us_11 $199_12 for_13 apple_14 new_15 family_16 pack_17
  d6  one_1 such_2 rule_3 family_4 to_5 incorporate_6 the_7 jaguar_8 into_9 their_10 name_11 be_12 jaguar_13 paw_14
  d7  it_1 be_2 a_3 big_4 cat_5
