
Technologies behind Internet Search Engine, Ming-Jer Lee, CTO, VisionNEXT Inc. (PowerPoint presentation)



  1. Technologies behind Internet Search Engine Ming-Jer Lee CTO VisionNEXT Inc.

  2. Type of Search Engine
  • Media: text, image, audio, video
  • Scope: general search engine, domain-specific topic, language
  • Scale: personal, content site, intranet, Internet; thousands, millions or billions (of documents, users, queries)
  • Structure: unstructured, semi-structured, structured
  • User interface: Web-based, standalone AP-based, voice-driven

  3. Type of Internet Search Engine
  • Manual index: Yahoo index, Looksmart, Open Directory
  • Automatic index
  • Metasearch
  • Answer by human expert
  • P2P
  • …

  4. Search Engine in Business World
  • Internet Search Engine (ISE): Google, Openfind, VisionNEXT's eefind
  • Enterprise Search Engine (ESE): Verity, Convera, Virage, Tornado

  ISE
  – Discovery: spider follows links; unstructured HTML, Office and PDF documents
  – Categorization: manual categorization of web resources
  – Information search: keyword search, Boolean search, search result ranking; web page popularity can be used as weighting
  ESE
  – Discovery: structured and non-structured data in file systems, databases, content management servers, collaboration servers, enterprise web sites, online news feeds
  – Categorization: generally a mix of human input and automatic algorithms to maintain content categories
  – Information search: user intention interpretation

  5. ESE: Verity UltraSeek
  Features: natural language search, application integration, rapid deployment, simple administration, database integration, security integration, customized interface, Java API

  6. ESE: Convera RetrievalWare

  7. Technologies used in Internet Search Engine
  • Information discovery
  – Distributed systems
  – Internet technology: networking, DNS, IP, HTML
  – Storage systems
  – Duplicate content detection
  – Information filtering
  • Index and search
  – Natural language processing: spelling check, term stemming, thesaurus handling
  – Data structures for fast retrieval: the inverted file is the industrial standard for text retrieval; distributed index; storage system designed to minimize disk access
  – Cluster computing for scalable search: Google uses more than 15000 Linux PCs; load-balance and high-availability issues
  – Multi-dimension index for multimedia content

  8. Spider: Information Discovery on Internet
  • Affairs of a spider
  – Crawl and explore the Web space
  – Maintain the freshness of the crawled pages
  (Diagram: Internet → Spider/Crawler → Mirrored Web Page Repository → Search Engine Indexer, Web Content/Link Analysis, …)

  9. View of Crawling Process in Internet
  (Diagram: crawling fans out from the seeds across the Internet)
  seeds: http://www.163.com/ http://www.yahoo.com.cn/ http://www.tsinghua.edu.cn/ ...

  10. One-Step Crawling Process
  (Diagram: Internet ↔ Robot → Parser → Dispatcher → Queue; the Queue is primed with seeds)

  11. One-Step Crawling Process (cont.)
  The DNS Resolver maps abc.com to 100.100.99.98; the Socket sends the request and the HTTP/MIME Parser reads the response:
  Request:
  GET /~john/ HTTP/1.1
  Host: abc.com
  Response:
  HTTP/1.1 200 OK
  Last-Modified: XX.XX.XX
  Content-Length: 102
  Content-Type: text/html
  <HTML><BODY>
  <A HREF="a.html">A</A>
  <A HREF="b.html">B</A>
  </BODY></HTML>
  The Link Extractor turns http://abc.com/~john/ into http://abc.com/~john/a.html and http://abc.com/~john/b.html, which the Dispatcher appends to the Queue.
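
A minimal Python sketch of this one step, working on a canned response instead of a live socket (the URLs, the `parse_response`/`extract_links` helpers and the simple regex are illustrative assumptions, not the deck's implementation; real responses also involve chunked encoding, charsets, etc.):

```python
from urllib.parse import urljoin
from collections import deque
import re

def parse_response(raw):
    """Split a raw HTTP response into status line, headers and body
    (a simplification for this sketch)."""
    head, _, body = raw.partition("\r\n\r\n")
    status = head.splitlines()[0]
    headers = dict(line.split(": ", 1) for line in head.splitlines()[1:])
    return status, headers, body

def extract_links(base_url, html):
    """Very rough link extraction; slide 17 lists more tags to handle."""
    return [urljoin(base_url, href)
            for href in re.findall(r'<A\s+HREF="([^"]+)"', html, re.I)]

# Canned response mirroring the slide's example exchange.
raw = ("HTTP/1.1 200 OK\r\n"
       "Content-Type: text/html\r\n"
       "\r\n"
       '<HTML><BODY><A HREF="a.html">A</A> <A HREF="b.html">B</A></BODY></HTML>')

queue = deque()
status, headers, body = parse_response(raw)
if status.endswith("200 OK") and headers["Content-Type"].startswith("text/html"):
    queue.extend(extract_links("http://abc.com/~john/", body))
# queue now holds the two absolute URLs for a.html and b.html
```

Note how `urljoin` resolves the relative `a.html` against the page URL, which is exactly the expansion the slide's diagram shows.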

  12. Parallel Crawling Process
  Parallel crawling approaches: multi-processes, multi-threads, or one process with asynchronous IO. Each parallel pipeline repeats Socket → HTTP → MIME Parser → Link Extractor; the DNS Resolver, Dispatcher and Queue are shared.

  13. Socket Review
  Client call sequence: socket() → connect() → read().
  Default IO state: synchronous IO. Drawback: the process is blocked on IO.

  14. IO-Driven Spider Infrastructure (Asynchronous IO Driver)
  Client side: socket(); fcntl() to set non-blocking IO; connect(); add the socket to the driver's waiting queue (s1 s2 s3 s4 …); register a write event with connect_callback(), then a read event with read_callback().
  Event loop (polling by select()):
  if (si ready for write) call connect_callback()
  if (si ready for read) read(), then call read_callback()
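
The event loop above can be sketched in Python with `select.select()`; a local listener stands in for a remote web server so the example runs offline, and the callback names follow the slide (the dictionary-based waiting queue and the demo payload are assumptions of this sketch):

```python
import select
import socket

# Local stand-in for a remote web server so the sketch runs offline.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.socket()
client.setblocking(False)          # Python's analogue of the slide's fcntl() step
try:
    client.connect(listener.getsockname())
except BlockingIOError:
    pass                           # connect in progress; select() tells us when done

received = []
callbacks = {}                     # socket -> (event, callback): the waiting queue

def read_callback(sock):
    chunk = sock.recv(4096)
    if chunk:
        received.append(chunk)
        callbacks[sock] = ("read", read_callback)   # keep reading until EOF
    else:
        sock.close()

def connect_callback(sock):
    callbacks[sock] = ("read", read_callback)       # connected: now wait for data

def accept_callback(sock):
    conn, _ = sock.accept()                         # server side of the demo
    conn.sendall(b"HTTP/1.1 200 OK\r\n\r\nhello")
    conn.close()

callbacks[client] = ("write", connect_callback)
callbacks[listener] = ("read", accept_callback)

# Event loop: polling by select(), dispatching to the registered callbacks.
while callbacks:
    rlist = [s for s, (ev, _) in callbacks.items() if ev == "read"]
    wlist = [s for s, (ev, _) in callbacks.items() if ev == "write"]
    readable, writable, _ = select.select(rlist, wlist, [], 5.0)
    for sock in readable + writable:
        _, cb = callbacks.pop(sock)
        cb(sock)
listener.close()
```

One process can drive hundreds of such sockets this way, which is the point of the slide's design.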

  15. HTTP & MIME Header
  Request:
  GET /~john/ HTTP/1.1
  Host: abc.com
  User-Agent: My Spider
  Connection: Keep-Alive
  Response:
  HTTP/1.1 200 OK
  Server: XXXX
  Last-Modified: XXXX
  Keep-Alive: timeout=15,max=100
  Content-Length: 102
  Content-Type: text/html
  <HTML><BODY>
  <A HREF="a.html">A</A>
  <A HREF="b.html">B</A>
  </BODY></HTML>

  16. Redirection
  Request:
  GET /~john/ HTTP/1.1
  Host: abc.com
  XXX: XXX
  YYY: YYY
  ZZZ: ZZZ
  Response:
  HTTP/1.1 302 Found
  Location: /~john/
  On a 302 response, the spider re-issues the request against the URL given in the Location header.

  17. Link Extractor
  • Parse the HTML document and extract all the links that we are interested in
  • Samples:
  – <A HREF="…">
  – <FRAME SRC="…">
  – <AREA HREF="…">
  – <META HTTP-EQUIV="refresh" CONTENT="0; Url=/index.shtml">
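
A small sketch of such an extractor using Python's standard `html.parser`, covering exactly the four tag patterns the slide lists (the class name and the sample document are this sketch's own):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect links from <A HREF>, <FRAME SRC>, <AREA HREF>,
    and <META HTTP-EQUIV="refresh"> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)          # HTMLParser lowercases tag/attr names
        if tag in ("a", "area") and "href" in attrs:
            self.links.append(urljoin(self.base, attrs["href"]))
        elif tag == "frame" and "src" in attrs:
            self.links.append(urljoin(self.base, attrs["src"]))
        elif tag == "meta" and attrs.get("http-equiv", "").lower() == "refresh":
            content = attrs.get("content", "")   # looks like "0; Url=/index.shtml"
            if "url=" in content.lower():
                self.links.append(urljoin(self.base, content.split("=", 1)[1].strip()))

extractor = LinkExtractor("http://abc.com/")
extractor.feed('<a href="a.html">A</a>'
               '<frame src="menu.html">'
               '<meta http-equiv="refresh" content="0; Url=/index.shtml">')
# extractor.links holds the three absolute URLs
```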

  18. Canonical Form of a URL
  • Normalization: a URL string is normalized by the following steps:
  – Removal of the protocol prefix (http://) if present
  – Removal of the :80 port number if present (however, non-standard port numbers are retained)
  – Conversion of the server name to lower case
  • Problem: many host names can stand for the same servers, e.g. cn.yahoo.com and www.yahoo.com.cn (202.106.184.4), cn.rc.yahoo.com and cn.rd.yahoo.com (210.77.38.3)
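
The three normalization steps can be sketched in Python with `urllib.parse` (a minimal reading of the slide's rules; real crawlers also normalize paths, fragments, query strings, etc.):

```python
from urllib.parse import urlsplit

def canonicalize(url):
    """Normalize per the slide: drop the http:// prefix, drop a default :80
    port (non-standard ports are retained), lowercase the server name."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != 80:
        host = "%s:%d" % (host, parts.port)
    return host + (parts.path or "/")

print(canonicalize("http://WWW.Example.COM:80/Index.html"))  # www.example.com/Index.html
print(canonicalize("http://abc.com:8080/a"))                 # abc.com:8080/a
```

Note the path's case is retained; only the server name is case-insensitive.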

  19. DNS Lookup
  Contact the Domain Name Service (DNS) to resolve the host name into its IP address.
  • Problem:
  – DNS resolution is a well-documented bottleneck of most web crawlers
  – Most system DNS lookup implementations are synchronous
  • Strategies:
  – Keep a local host-to-IP cache to decrease the overhead of the default DNS lookup routine (e.g. gethostbyname)
  – Or, implement an asynchronous DNS resolver
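
The caching strategy can be sketched as a thin wrapper around any host-to-IP lookup function; in real use the wrapped function would be `socket.gethostbyname`, while the demo below uses a fake lookup table so it runs offline (the helper names are this sketch's own):

```python
def make_cached_resolver(lookup):
    """Wrap a host->IP lookup (e.g. socket.gethostbyname) with an
    in-memory cache so repeated hosts skip the resolver."""
    cache = {}
    def resolve(host):
        if host not in cache:
            cache[host] = lookup(host)
        return cache[host]
    return resolve

# Offline demo with a fake lookup instead of real DNS.
calls = []
def fake_lookup(host):
    calls.append(host)                    # count how often DNS is actually hit
    return {"abc.com": "100.100.99.98"}[host]

resolve = make_cached_resolver(fake_lookup)
resolve("abc.com")
resolve("abc.com")                        # second call is served from the cache
# calls == ["abc.com"]: only one real lookup happened
```

Crawls have high host locality (many URLs per host), which is why such a cache pays off.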

  20. DNS Lookup (cont.)
  Special handling of multi-homed hosts: choose the fastest IP to connect.
  gethostbyname() for www.yahoo.com.tw returns official host name rc.yahoo.com and Internet addresses 204.71.201.7, 204.71.201.8, 204.71.201.9, each starting with weight 0.0.
  Weight update:
  1. w(n+1) = α·w(n) + (1−α)·δt
  2. w(n+1) = w(n)·decay
  where δt is the connection time and α is a sensitivity factor.
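
A sketch of the weighting scheme, under the assumption (not stated on the slide) that rule 1 applies to the address just tried and rule 2 decays the others so they are eventually retried; the α and decay values are illustrative:

```python
ALPHA = 0.7    # sensitivity factor (assumed value)
DECAY = 0.9    # decay factor for untried addresses (assumed value)

def pick_fastest(weights):
    """Choose the IP with the smallest weight, i.e. historically fastest."""
    return min(weights, key=weights.get)

def update(weights, tried_ip, dt):
    """Exponentially average the measured connection time dt into the tried
    IP's weight; decay all other weights toward zero."""
    for ip in weights:
        if ip == tried_ip:
            weights[ip] = ALPHA * weights[ip] + (1 - ALPHA) * dt
        else:
            weights[ip] *= DECAY

weights = {"204.71.201.7": 0.0, "204.71.201.8": 0.0, "204.71.201.9": 0.0}
ip = pick_fastest(weights)     # all tied at 0.0: first address wins
update(weights, ip, 0.25)      # pretend this connection took 0.25 s
# the tried IP now carries weight 0.075; the next pick tries another address
```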

  21. Filter
  Prevent reloading visited documents or downloading unnecessary ones.
  • Problem:
  – Host name aliases, i.e., multiple hosts correspond to the same IP
  – Alternative paths on the same host, i.e., symbolic links
  – Replication across different hosts, e.g., site mirroring
  – Non-indexed documents such as images, *.zip, *.mp3, etc.

  22. Filter (cont.)
  • Strategies:
  – URL constraints: specify regular-expression rules for domain, IP, prefix, protocol type, file suffix, etc., e.g.
  exclude="\.mp3$|\.jpg$|\.gif$"
  include="htm$|html$|/[^/\.]*$"
  dn-cst="\.cn$"
  ip-cst="2 ;211.100.0.0:0.0.127.255 ;61.128.0.0:0.3.255.255"
  – URL-seen test: check whether a URL has been fetched
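
The exclude/include patterns from the slide can be applied directly with Python's `re` module (the `accept` helper is this sketch's own; the case-insensitive flag on exclude is an assumption):

```python
import re

# Patterns taken from the slide, compiled once.
exclude = re.compile(r"\.mp3$|\.jpg$|\.gif$", re.I)
include = re.compile(r"htm$|html$|/[^/\.]*$")

def accept(url):
    """Keep a URL only if no exclude rule matches and an include rule does."""
    return not exclude.search(url) and bool(include.search(url))

print(accept("http://abc.com/song.mp3"))    # False: excluded suffix
print(accept("http://abc.com/index.html"))  # True: matches html$
print(accept("http://abc.com/cgi/view"))    # True: extensionless last segment
```

The third include alternative, `/[^/\.]*$`, is what lets extensionless paths (likely dynamic pages or directories) through.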

  23. Filter (cont.)
  – Content-seen test: check whether a document has been fetched
  • Represent a document as a fixed-size fingerprint (e.g., a 128-bit MD5 digest such as 01234567…ABCDEF) and perform a fast search on the document fingerprint set to measure document resemblance
  (Diagram: document → Message Digest → search in the document FP set, stored as a B-tree)
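
A minimal content-seen sketch with `hashlib`; a plain Python set stands in for the slide's B-tree of fingerprints (the class and sample page are this sketch's own):

```python
import hashlib

class ContentSeen:
    """Fixed-size MD5 fingerprints of fetched documents, kept in a set
    (the slide stores them in a B-tree for large scale)."""
    def __init__(self):
        self.fingerprints = set()

    def seen(self, document):
        fp = hashlib.md5(document.encode("utf-8")).digest()  # 128-bit fingerprint
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

checker = ContentSeen()
page = "<HTML><BODY><A HREF=a.html>A</A></BODY></HTML>"
print(checker.seen(page))   # False: first fetch
print(checker.seen(page))   # True: duplicate, e.g. from a mirror site
```

Comparing 16-byte digests is far cheaper than comparing whole documents, which is the point of fingerprinting.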

  24. Robots Exclusion
  Be polite in the crawling process.
  • Strategies:
  – Follow the Robots Exclusion Protocol, e.g. http://www.example.com/robots.txt:
  # robots.txt for http://www.example.com/
  User-agent: *
  Disallow: /privacy/
  Disallow: /personal.html
  – Obey the robots meta information, e.g. <META NAME="robots" CONTENT="nofollow,noindex">
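
Python's standard `urllib.robotparser` implements the Robots Exclusion Protocol; here the slide's robots.txt is parsed from a string rather than fetched (the "MySpider" user-agent name is an assumption):

```python
import urllib.robotparser

# The robots.txt text from the slide, parsed directly instead of fetched.
ROBOTS_TXT = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /privacy/
Disallow: /personal.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MySpider", "http://www.example.com/privacy/data.html"))  # False
print(rp.can_fetch("MySpider", "http://www.example.com/index.html"))         # True
```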

  25. Other Spider Issues
  • Dispatching URLs
  – Prevent overloading one particular web server: only one robot is responsible for one server at any one time
  • Recovering from failures
  • Keeping the network bandwidth in good use
  – Keep as many connections (roughly several hundred) open as possible at the same time
  • Cache strategies
  – Keep in-memory caches for the steps with high locality: DNS lookup and the URL-seen test qualify, but the content-seen test does not
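
The one-robot-per-server rule can be sketched as per-host queues plus a busy set (the `Dispatcher` class and its method names are this sketch's own, not from the deck):

```python
from collections import deque, defaultdict

class Dispatcher:
    """Per-host URL queues; at most one URL per host is handed out at a
    time, so no single web server faces several robots at once."""
    def __init__(self):
        self.queues = defaultdict(deque)
        self.busy = set()

    def add(self, host, url):
        self.queues[host].append(url)

    def next(self):
        for host, q in self.queues.items():
            if q and host not in self.busy:
                self.busy.add(host)
                return host, q.popleft()
        return None                  # every pending host already has a robot

    def done(self, host):
        self.busy.discard(host)      # robot finished; host is available again

d = Dispatcher()
d.add("abc.com", "http://abc.com/a.html")
d.add("abc.com", "http://abc.com/b.html")
first = d.next()                     # hands out a.html, marks abc.com busy
blocked = d.next()                   # None: abc.com already has a robot
d.done("abc.com")
second = d.next()                    # now b.html is handed out
```

With many distinct hosts queued, this scheme keeps hundreds of connections busy while still being polite to each server.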

  26. Human Index for the Internet
  • High precision, low recall
  • Subject directory tree (Yahoo!)
  • Expert guide (about.com)
  • Q&A search (ask.com)
  • "You ask, I answer" Q&A (ExpertCentral.com)

  27. Automatic Indexing for the Internet
  • Low precision, high recall
  • Full-text index/scan (Excite, Lycos, Infoseek)
  • Large-scale indexing (AltaVista)
  • Popularity-based indexing (Direct Hit)
  • Search result clustering (Northern Light)
  • Link analysis (Google)
