  1. Query Processing and Online Architectures
     • T. Yang 290N 2013
     • Partially from Croft, Metzler & Strohman's textbook

     Content
     • Query processing flow and data distribution.
     • Experience with Ask.com online architecture
       – Service programming with Neptune.
       – Zookeeper

     Query Processing
     • Document-at-a-time
       – Calculates complete scores for documents by processing all term lists, one document at a time
     • Term-at-a-time
       – Accumulates scores for documents by processing term lists one at a time
     • Both approaches have optimization techniques that significantly reduce time required to generate scores

     Document-At-A-Time
     [Figure: diagram not recoverable from the transcript]
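The two evaluation orders named on the Query Processing slide can be contrasted with a small sketch. The toy index, the additive scoring, and the function names below are assumptions made for illustration; they are not taken from the slides or the textbook.

```python
from collections import defaultdict

# Hypothetical toy index: term -> postings [(doc_id, term_score)], sorted by doc_id.
INDEX = {
    "query": [(1, 1.0), (3, 2.0), (4, 1.0)],
    "processing": [(1, 2.0), (2, 3.0), (3, 1.0)],
}


def term_at_a_time(index, terms):
    """Process one inverted list at a time, accumulating partial scores."""
    accumulators = defaultdict(float)  # one accumulator per document seen so far
    for term in terms:
        for doc_id, score in index.get(term, []):
            accumulators[doc_id] += score
    return dict(accumulators)


def document_at_a_time(index, terms):
    """Walk all lists in parallel and finish one document before the next."""
    lists = [index.get(term, []) for term in terms]
    positions = [0] * len(lists)
    scores = {}
    while True:
        # The next document is the smallest unprocessed doc_id across all lists.
        candidates = [lst[pos][0] for lst, pos in zip(lists, positions) if pos < len(lst)]
        if not candidates:
            break
        doc_id = min(candidates)
        total = 0.0
        for i, lst in enumerate(lists):
            if positions[i] < len(lst) and lst[positions[i]][0] == doc_id:
                total += lst[positions[i]][1]
                positions[i] += 1
        scores[doc_id] = total  # complete score, no partial state kept
    return scores
```

Both functions return the same scores for the same query. The difference is in what is held in memory: term-at-a-time keeps an accumulator per candidate document, while document-at-a-time keeps only one cursor per inverted list.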

  2. Term-At-A-Time
     [Figure: diagram not recoverable from the transcript]

     Optimization Techniques
     • Term-at-a-time uses more memory for accumulators, but its data access is more efficient
     • Optimization
       – Read less data from inverted lists
         – e.g., skip lists
         – better for simple feature functions
       – Calculate scores for fewer documents
         – Threshold-based elimination: avoid selecting documents with a low score when high-score documents are available.

     Other Approaches
     • Early termination of query processing
       – ignore high-frequency word lists in term-at-a-time
       – ignore documents at end of lists in doc-at-a-time
       – unsafe optimization
     • List ordering
       – order inverted lists by quality metric (e.g., PageRank) or by partial score
       – makes unsafe (and fast) optimizations more likely to produce good documents

     Distributed Evaluation
     [Figure: a coordinator connected to four index servers]
     • Basic process
       – All queries sent to a coordination machine
       – The coordinator then sends messages to many index servers
       – Each index server does some portion of the query processing
       – The coordinator organizes the results and returns them to the user
     • Two main approaches
       – Document distribution (by far the most popular)
       – Term distribution
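A minimal sketch of the unsafe early-termination ideas from the Other Approaches slide, assuming each inverted list is already ordered by a quality metric with the best documents first. The slide describes list truncation for document-at-a-time; applying it inside a term-at-a-time loop here is a simplification, and the function name and both parameters are hypothetical.

```python
from collections import defaultdict


def early_termination_scores(index, terms, max_list_len=None, max_postings=None):
    """Term-at-a-time scoring with two unsafe shortcuts.

    index maps term -> postings [(doc_id, score)], each list assumed to be
    ordered by a quality metric (best documents first).
    """
    accumulators = defaultdict(float)
    for term in terms:
        postings = index.get(term, [])
        if max_list_len is not None and len(postings) > max_list_len:
            continue  # shortcut 1: ignore high-frequency word lists
        if max_postings is not None:
            postings = postings[:max_postings]  # shortcut 2: ignore the end of the list
        for doc_id, score in postings:
            accumulators[doc_id] += score
    return dict(accumulators)
```

Both shortcuts are unsafe in the slide's sense: the returned ranking may differ from the exhaustive one. Ordering lists by quality is what makes the truncated prefix likely to contain the documents that matter.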

  3. Distributed Evaluation
     • Document distribution
       – each index server acts as a search engine for a small fraction of the total collection
       – A coordinator sends a copy of the query to each of the index servers, each of which returns the top-k results
       – results are merged into a single ranked list by the coordinator

     Distributed Evaluation
     • Term distribution
       – Single index is built for the whole cluster of machines
       – Each inverted list in that index is then assigned to one index server
         – in most cases the data to process a query is not stored on a single machine
       – One of the index servers is chosen to process the query
         – usually the one holding the longest inverted list
       – Other index servers send information to that server
       – Final results sent to director

     Caching
     • Query distributions similar to Zipf
       – Over 50% of queries repeat → cache hit
       – Some hot queries are very popular.
     • Caching can significantly improve response time
       – Cache popular query results
       – Cache common inverted lists
     • Inverted list caching can help with unique queries
     • Cache must be refreshed to prevent stale data

     Open-Source Search Engines
     • Apache Solr: http://lucene.apache.org/solr/
       – full-text search with highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search
       – distributed search and index replication.
       – Based on the Java Apache Lucene search library.
     • Constellio: http://www.constellio.com/
       – Open-source enterprise-level search based on Solr.
     • Zoie: sna-projects.com/zoie/
       – Real-time search indexing built on top of Lucene.
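The document-distribution flow above (each index server returns its own top-k, the coordinator merges) can be sketched as follows. The in-memory shard layout, the additive scoring, and the function names are assumptions for illustration; a real deployment would send the query over the network.

```python
import heapq


def search_shard(shard, query_terms, k):
    """One index server: score only its own documents and return its top-k.

    shard maps doc_id -> {term: score}; this layout is a stand-in for a
    per-server inverted index.
    """
    scores = {}
    for doc_id, term_scores in shard.items():
        score = sum(term_scores.get(term, 0.0) for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


def coordinator(shards, query_terms, k):
    """Send the same query to every shard and merge the partial top-k lists."""
    partial_results = []
    for shard in shards:
        partial_results.extend(search_shard(shard, query_terms, k))
    return heapq.nlargest(k, partial_results, key=lambda item: item[1])
```

Because every shard returns its k best documents, the merged list contains the global top-k as long as scores are comparable across shards, which in practice requires shared collection statistics.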

  4. Open-Source Search Engines
     • Lemur: http://www.lemurproject.org/
       – C/C++, running on Linux/Mac and Windows.
       – Indri search engine by U. Mass/CMU.
       – Parses PDF, HTML, XML, and TREC documents. Word and PowerPoint parsing (Windows only).
       – UTF-8
     • Sphinx: http://sphinxsearch.com/
       – Cross-platform open-source search server written in C++
       – search across various systems, including database servers, NoSQL storage, and flat files.
     • Xapian: xapian.org/
       – search library built on C++

     Fee-based Search Solutions
     • Google SiteSearch: http://www.google.com/sitesearch/
       – Site search is aimed primarily at websites, and not at intranets.
       – It is a fully hosted solution
       – Pricing for site search is on a query basis per year, starting at $100 for 20,000 queries a year
     • Google Mini
       – A server-based solution. Once deployed, Mini crawls your Web sites and file systems / internal databases.
       – Costs start at $1,995 (direct) plus a $995 yearly fee after the first year for indexing of 50,000 documents, and scale upwards

     Ask.com Search Engine
     [Figure: architecture diagram. Recoverable labels: Client queries, Traffic load balancer, Frontend, XML Cache, PageInfo Cache, Suggestion, Neptune, Aggregator, Retriever, Ranking, Graph Server, Document Abstract, Document description, PageInfo (HID)]

     Frontends and Cache
     • Front-ends
       – Receive web queries.
       – Direct queries through XML cache, compressed result cache, database retriever aggregators, page clustering/ranking.
       – Then present results to clients (XML).
     • XML cache:
       – Save previously-queried search results (dynamic Web content).
       – Use these results to answer new queries. Speed up result computation by avoiding content regeneration
     • Result cache
       – Contain all matched URLs for a query.
       – Given a query, find desired part of saved results. Frontends need to fetch description for each URL to compose the final XML result.
     Research Presentation, 4/22/2013
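The caching slides make two points that fit in one sketch: repeated queries should be answered from saved results, and cached entries must be refreshed so they do not go stale. The class below is a minimal illustration with a time-to-live refresh policy; the class name, the TTL policy, and the injectable clock are assumptions, not a description of the Ask.com caches.

```python
import time


class ResultCache:
    """Query-result cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the expiry logic can be tested
        self.entries = {}       # query -> (stored_at, results)
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        """Return cached results, or call compute(query) on a miss or stale entry."""
        now = self.clock()
        entry = self.entries.get(query)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        results = compute(query)             # regenerate the content
        self.entries[query] = (now, results)
        return results
```

A frontend would call `cache.get(query, run_backend_search)`, where `run_backend_search` is a hypothetical function that goes through the aggregators. With a Zipf-like query distribution, a small cache of hot queries already absorbs a large share of traffic.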

  5. Index Matching and Ranking
     • Retriever aggregators (Index match coordinator)
       – Gather results from online database partitions.
       – Select proper partitions for different customers.
     • Index database retrievers
       – Locate pages relevant to query keywords.
       – Select popular and relevant pages first.
       – Database can be divided into many content units
     • Ranking server
       – Classify pages into topics & rank pages
     • Snippet aggregators
       – Combine descriptions of URLs from different description servers.
     • Dynamic snippet servers
       – Extract proper description for a given URL.

     Programming Challenges for Online Services
     • Challenges/requirements for online services:
       – Data intensive, requiring large-scale clusters.
       – Incremental scalability.
       – 7×24 availability.
       – Resource management, QoS for load spikes.
     • Fault tolerance:
       – Operation errors
       – Software bugs
       – Hardware failures
     • Lack of programming support for reliable/scalable online network services and applications.

     The Neptune Clustering Middleware
     • Neptune: Clustering middleware for aggregating and replicating application modules with persistent data.
     • A simple and flexible programming model to shield complexity of service discovery, load scheduling, consistency, and failover management
     • www.cs.ucsb.edu/projects/neptune for code, papers, documents.
       – K. Shen et al., USENIX Symposium on Internet Technologies and Systems, 2001

     Example: a Neptune Clustered Service: Index match service
     [Figure: recoverable labels: Client, Front-end Web Servers, HTTP server, Neptune Client, Local-area Network, Neptune server, Index match, Ranking, Snippet generation, App]

  6. Neptune architecture for cluster-based services
     • Symmetric and decentralized:
       – Each node can host multiple services, acting as a service provider (Server)
       – Each node can also subscribe to internal services from other nodes, acting as a consumer (Client)
         – Advantage: Support multi-tier or nested service architecture
     • Neptune components at each node:
       – Application service handling subsystem.
       – Load balancing subsystem.
       – Service availability subsystem.

     Inside a Neptune Server Node (Symmetry and Decentralization)
     [Figure: node internals. Recoverable labels: Service Access Point, Service Consumers, Service Providers, Service Handling Module, Service Load-balancing Subsystem, Service Availability Subsystem, Polling Agent, Service Availability Directory, Service Availability Publishing, Service Runtime, Index Server, Client requests, Network to the rest of the cluster]

     Availability and Load Balancing
     • Availability subsystem:
       – Announcement once per second through IP multicast;
       – Availability info kept as soft state, expiring in 5 seconds;
       – Service availability directory kept in shared memory for efficient local lookup.
     • Load-balancing subsystem:
       – Challenging: medium/fine-grained requests.
       – Random polling with sampling.
       – Discarding slow-responding polls

     Programming Model in Neptune
     • Request-driven processing model: programmers specify service methods to process each request.
     • Application-level concurrency: Each service provider uses a thread or a process to handle a new request and respond.
     [Figure: Requests arrive at a Service method, which runs on the RUNTIME and accesses Data]
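The Availability and Load Balancing slide describes two mechanisms that are easy to illustrate in isolation: a soft-state directory whose entries expire unless re-announced, and random polling over a sample of providers. The sketch below is not Neptune's API. The class and function names, the explicit `now` argument, and the `poll` callback (returning a load value, or `None` for a poll that was too slow) are all assumptions.

```python
import random


class AvailabilityDirectory:
    """Soft-state directory: entries vanish unless re-announced in time."""

    def __init__(self, expiry_seconds=5.0):
        self.expiry = expiry_seconds
        self.last_seen = {}  # (service, node) -> time of last announcement

    def announce(self, service, node, now):
        """Record an announcement (the slides say nodes announce once per second)."""
        self.last_seen[(service, node)] = now

    def providers(self, service, now):
        """Nodes whose last announcement for this service has not expired."""
        return sorted(
            node
            for (svc, node), seen in self.last_seen.items()
            if svc == service and now - seen < self.expiry
        )


def pick_by_random_polling(nodes, poll, sample_size=2, rng=random):
    """Poll a random sample of nodes and pick the least-loaded responder."""
    sample = rng.sample(nodes, min(sample_size, len(nodes)))
    replies = []
    for node in sample:
        load = poll(node)
        if load is not None:          # discard slow-responding polls
            replies.append((load, node))
    if not replies:
        return sample[0] if sample else None  # fall back to any sampled node
    return min(replies)[1]
```

No explicit failure detection is needed with soft state: a crashed node simply stops announcing and drops out of `providers` after the expiry window. Polling only a small sample keeps the per-request overhead low, which matters for the medium/fine-grained requests the slide calls out.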
