Query Processing and Online Architectures
• T. Yang, 290N, 2013
• Partially from Croft, Metzler & Strohman's textbook

Content
• Query processing flow and data distribution.
• Experience with Ask.com online architecture
  – Service programming with Neptune
  – Zookeeper

Query Processing
• Document-at-a-time
  – calculates complete scores for documents by processing all term lists, one document at a time
• Term-at-a-time
  – accumulates scores for documents by processing term lists one at a time
• Both approaches have optimization techniques that significantly reduce the time required to generate scores
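To make the two strategies concrete, here is a minimal sketch (in Python, chosen only for illustration) using a toy in-memory inverted index and summed term frequencies as the score. The names INDEX, doc_at_a_time, and term_at_a_time are illustrative, not from the slides or the textbook.

```python
from collections import defaultdict

# Toy inverted index: term -> list of (doc_id, term_frequency) postings,
# each list sorted by doc_id. Scores are just summed term frequencies.
INDEX = {
    "search": [(1, 3), (2, 1), (4, 2)],
    "engine": [(2, 2), (3, 1), (4, 1)],
}

def doc_at_a_time(query_terms, k=2):
    """Compute a complete score for one document at a time by consulting
    all term lists for that document, then keep the top-k documents."""
    doc_ids = sorted({d for t in query_terms for d, _ in INDEX.get(t, [])})
    scores = []
    for d in doc_ids:
        score = 0
        for t in query_terms:
            # a linear scan stands in for a cursor / skip-pointer lookup
            score += next((tf for doc, tf in INDEX.get(t, []) if doc == d), 0)
        scores.append((score, d))
    return sorted(scores, reverse=True)[:k]

def term_at_a_time(query_terms, k=2):
    """Accumulate partial scores one term list at a time, using an
    accumulator table (more memory, but sequential list access)."""
    acc = defaultdict(int)
    for t in query_terms:
        for d, tf in INDEX.get(t, []):
            acc[d] += tf
    return sorted(((s, d) for d, s in acc.items()), reverse=True)[:k]

if __name__ == "__main__":
    q = ["search", "engine"]
    print(doc_at_a_time(q))   # both produce the same top-k ranking
    print(term_at_a_time(q))
```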
Term-At-A-Time Optimization Techniques
• Term-at-a-time uses more memory for accumulators, but data access is more efficient
• Optimizations
  – Read less data from inverted lists
    – e.g., skip lists; better for simple feature functions
  – Calculate scores for fewer documents
    – threshold-based elimination: avoid scoring documents with a low score when high-score documents are already available

Other Approaches
• Early termination of query processing
  – ignore high-frequency word lists in term-at-a-time
  – ignore documents at the end of lists in document-at-a-time
  – an unsafe optimization
• List ordering
  – order inverted lists by a quality metric (e.g., PageRank) or by partial score
  – makes unsafe (and fast) optimizations more likely to produce good documents

Distributed Evaluation
• [Diagram: a coordinator machine in front of several index servers]
• Basic process
  – all queries are sent to a coordinator machine
  – the coordinator then sends messages to many index servers
  – each index server does some portion of the query processing
  – the coordinator organizes the results and returns them to the user
• Two main approaches
  – document distribution: by far the most popular
  – term distribution
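A minimal sketch of this basic process, assuming the document-distribution approach: each index server scores its own partition and returns a local top-k, and the coordinator merges the partial lists. The IndexServer and Coordinator classes and the in-memory partitions are illustrative stand-ins for real RPC-connected servers.

```python
import heapq

class IndexServer:
    """Holds one document partition and answers queries over it."""
    def __init__(self, partition):
        self.partition = partition            # {doc_id: text}

    def search(self, query_terms, k):
        scores = []
        for doc_id, text in self.partition.items():
            words = text.split()
            score = sum(words.count(t) for t in query_terms)
            if score > 0:
                scores.append((score, doc_id))
        return heapq.nlargest(k, scores)      # local top-k only

class Coordinator:
    """Sends the query to every index server and merges the results."""
    def __init__(self, servers):
        self.servers = servers

    def search(self, query, k=3):
        query_terms = query.split()
        partial_results = [s.search(query_terms, k) for s in self.servers]
        merged = [hit for result in partial_results for hit in result]
        return heapq.nlargest(k, merged)      # single ranked list for the user

if __name__ == "__main__":
    servers = [
        IndexServer({1: "open source search engine", 2: "query processing"}),
        IndexServer({3: "search engine architecture", 4: "load balancing"}),
    ]
    print(Coordinator(servers).search("search engine"))
```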
Distributed Evaluation
• Document distribution
  – each index server acts as a search engine for a small fraction of the total collection
  – a coordinator sends a copy of the query to each of the index servers, each of which returns the top-k results
  – results are merged into a single ranked list by the coordinator

Distributed Evaluation
• Term distribution
  – a single index is built for the whole cluster of machines
  – each inverted list in that index is then assigned to one index server
    – in most cases the data needed to process a query is not stored on a single machine
  – one of the index servers is chosen to process the query
    – usually the one holding the longest inverted list
  – other index servers send information to that server
  – final results are sent to the coordinator (director)

Caching
• Query distributions are similar to Zipf
  – over 50% of queries repeat (cache hits)
  – some hot queries are very popular
• Caching can significantly improve response time (a small result-cache sketch follows after the next slide)
  – cache popular query results
  – cache common inverted lists
• Inverted list caching can help with unique queries
• Cache must be refreshed to prevent stale data

Open-Source Search Engines
• Apache Solr: http://lucene.apache.org/solr/
  – full-text search with highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search
  – distributed search and index replication
  – based on the Java Apache Lucene search library
• Constellio: http://www.constellio.com/
  – open-source enterprise-level search based on Solr
• Zoie: sna-projects.com/zoie/
  – real-time search indexing built on top of Lucene
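Since over half of queries repeat, even a small result cache in front of the evaluation path pays off. Below is a minimal sketch, assuming an LRU result cache with explicit invalidation on index refresh; the class and method names are illustrative and not taken from any of the engines listed above.

```python
from collections import OrderedDict

class QueryResultCache:
    """LRU cache for query results: entries are evicted when the cache is
    full and cleared when the underlying index is refreshed, so stale
    results are not served for long."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.entries = OrderedDict()           # query -> ranked result list
        self.hits = self.misses = 0

    def get(self, query, evaluate):
        if query in self.entries:
            self.hits += 1
            self.entries.move_to_end(query)    # mark as recently used
            return self.entries[query]
        self.misses += 1
        results = evaluate(query)              # fall back to full evaluation
        self.entries[query] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return results

    def invalidate(self):
        """Call after an index refresh to prevent stale data."""
        self.entries.clear()

if __name__ == "__main__":
    cache = QueryResultCache(capacity=2)
    backend = lambda q: [f"result for {q}"]
    for q in ["news", "news", "weather", "news", "maps"]:
        cache.get(q, backend)
    print(cache.hits, cache.misses)            # popular queries hit the cache
```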
Open-Source Search Engines
• Lemur: http://www.lemurproject.org/
  – C/C++, running on Linux/Mac and Windows
  – Indri search engine by U. Mass/CMU
  – parses PDF, HTML, XML, and TREC documents; Word and PowerPoint parsing (Windows only); UTF-8
• Sphinx: http://sphinxsearch.com/
  – cross-platform open-source search server written in C++
  – searches across various systems, including database servers, NoSQL storage, and flat files
• Xapian: xapian.org/
  – search library built on C++

Fee-based Search Solutions
• Google SiteSearch: http://www.google.com/sitesearch/
  – aimed primarily at websites, not at an intranet
  – a fully hosted solution
  – pricing is on a per-query basis per year, starting at $100 for 20,000 queries a year
• Google Mini
  – a server-based solution: once deployed, Mini crawls your Web sites and file systems / internal databases
  – costs start at $1,995 (direct) plus a $995 yearly fee after the first year for indexing of 50,000 documents, and scale upwards

Ask.com Search Engine
• [Architecture diagram: client queries, traffic load balancer, frontends, XML / PageInfo / Suggestion caches, Neptune-based cache aggregators, ranking servers, document abstract servers, ranking graph server, PageInfo (HID) retrievers]

Frontends and Cache
• Front-ends
  – receive web queries
  – direct queries through the XML cache, compressed result cache, database retriever aggregators, and page clustering/ranking
  – then present results to clients (XML)
• XML cache
  – saves previously-queried search results (dynamic Web content) and uses them to answer new queries
  – speeds up result computation by avoiding content regeneration
• Result cache
  – contains all matched URLs for a query; given a query, find the desired part of the saved results
  – frontends need to fetch a description for each URL to compose the final XML result
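The flow on the "Frontends and Cache" slide can be sketched as a small cache hierarchy: try the XML cache first, then rebuild the page from the result cache by fetching per-URL descriptions, and only fall back to the full retrieval and ranking path on a complete miss. This is an illustration of the described flow, not Ask.com's code; the Frontend class, render_xml, and the retriever/snippet interfaces are assumed names.

```python
class Frontend:
    def __init__(self, xml_cache, result_cache, retriever, snippet_server):
        self.xml_cache = xml_cache          # query -> final XML page
        self.result_cache = result_cache    # query -> list of matched URLs
        self.retriever = retriever          # full index match + ranking path
        self.snippet_server = snippet_server

    def handle_query(self, query, start=0, count=10):
        # 1. XML cache: previously generated dynamic content
        xml = self.xml_cache.get(query)
        if xml is not None:
            return xml

        # 2. Result cache: matched URLs are known, only descriptions are fetched
        urls = self.result_cache.get(query)
        if urls is None:
            # 3. Full evaluation through retriever aggregators and ranking
            urls = self.retriever.search(query)
            self.result_cache[query] = urls

        page = urls[start:start + count]
        snippets = [self.snippet_server.describe(u, query) for u in page]
        xml = self.render_xml(query, page, snippets)
        self.xml_cache[query] = xml
        return xml

    @staticmethod
    def render_xml(query, urls, snippets):
        items = "".join(
            f"<result><url>{u}</url><desc>{s}</desc></result>"
            for u, s in zip(urls, snippets)
        )
        return f"<results query='{query}'>{items}</results>"

if __name__ == "__main__":
    class StubRetriever:
        def search(self, query):
            return ["http://a.com/1", "http://b.org/2"]

    class StubSnippets:
        def describe(self, url, query):
            return f"... {query} ..."

    fe = Frontend({}, {}, StubRetriever(), StubSnippets())
    print(fe.handle_query("query processing"))
```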
Index Matching and Ranking
• Retriever aggregators (index match coordinators)
  – gather results from online database partitions
  – select the proper partitions for different customers
• Index database retrievers
  – locate pages relevant to the query keywords; select popular and relevant pages first
  – the database can be divided into many content units
• Ranking servers
  – classify pages into topics and rank pages
• Snippet aggregators
  – combine descriptions of URLs from different description servers
• Dynamic snippet servers
  – extract a proper description for a given URL
  (a single-process sketch of this pipeline appears below, after the Neptune slides)

Programming Challenges for Online Services
• Challenges/requirements for online services:
  – data intensive, requiring large-scale clusters
  – incremental scalability
  – 7×24 availability
  – resource management, QoS for load spikes
• Fault tolerance:
  – operation errors
  – software bugs
  – hardware failures
• Lack of programming support for reliable/scalable online network services and applications

The Neptune Clustering Middleware
• Neptune: clustering middleware for aggregating and replicating application modules with persistent data
• A simple and flexible programming model to shield the complexity of service discovery, load scheduling, consistency, and failover management
• www.cs.ucsb.edu/projects/neptune for code, papers, and documents
  – K. Shen, et al., USENIX Symposium on Internet Technologies and Systems, 2001

Example: a Neptune Clustered Service — Index Match Service
• [Diagram: clients, front-end Web servers (HTTP server with a Neptune client and application code), a local-area network, and Neptune servers hosting the index match, ranking, and snippet generation modules]
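To show how the components on the "Index Matching and Ranking" slide fit together, here is a minimal single-process sketch of the pipeline. The class names (IndexRetriever, RetrieverAggregator, RankingServer, SnippetServer) and the length-based "ranking" are illustrative stand-ins; in the real system each stage is a separate clustered Neptune service.

```python
class IndexRetriever:
    """One content unit / partition of the index database."""
    def __init__(self, pages):
        self.pages = pages                     # {url: page text}

    def match(self, query_terms):
        return [url for url, text in self.pages.items()
                if all(t in text for t in query_terms)]

class RetrieverAggregator:
    """Gathers matched URLs from the selected index partitions."""
    def __init__(self, retrievers):
        self.retrievers = retrievers

    def gather(self, query_terms):
        urls = []
        for r in self.retrievers:
            urls.extend(r.match(query_terms))
        return urls

class RankingServer:
    def rank(self, urls, query_terms):
        # stand-in scoring: prefer shorter URLs (real ranking uses topic
        # classification, link analysis, term features, ...)
        return sorted(urls, key=len)

class SnippetServer:
    def describe(self, url, query_terms):
        return f"snippet for {url} about {' '.join(query_terms)}"

def answer(query, aggregator, ranker, snippet_server, k=10):
    terms = query.split()
    urls = aggregator.gather(terms)            # index match across partitions
    ranked = ranker.rank(urls, terms)[:k]      # classify & rank
    return [(u, snippet_server.describe(u, terms)) for u in ranked]

if __name__ == "__main__":
    parts = [IndexRetriever({"a.com/x": "query processing notes"}),
             IndexRetriever({"b.org/y": "query processing and ranking"})]
    print(answer("query processing", RetrieverAggregator(parts),
                 RankingServer(), SnippetServer()))
```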
Neptune architecture for cluster-based services (Symmetry and Decentralization)
• Symmetric and decentralized:
  – each node can host multiple services, acting as a service provider (server)
  – each node can also subscribe to internal services from other nodes, acting as a consumer (client)
  – advantage: supports multi-tier or nested service architectures
• Neptune components at each node:
  – application service handling subsystem
  – load-balancing subsystem
  – service availability subsystem

Inside a Neptune Server Node
• [Diagram of a Neptune server node: a service access point with a polling agent and a service availability directory faces the network to the rest of the cluster (service consumers and service providers); the node runs a service handling module, a load-balancing subsystem, and a service availability subsystem (service availability publishing, service load) on top of the application service runtime, e.g., an index server handling client requests]

Availability and Load Balancing
• Availability subsystem:
  – announcement once per second through IP multicast
  – availability info kept as soft state, expiring in 5 seconds
  – service availability directory kept in shared memory for efficient local lookup
• Load-balancing subsystem:
  – challenging for medium/fine-grained requests
  – random polling with sampling
  – discarding slow-responding polls
  (a sketch of both subsystems appears at the end of this section)

Programming Model in Neptune
• Request-driven processing model: programmers specify service methods to process each request
• Application-level concurrency: each service provider uses a thread or a process to handle a new request and respond
• [Diagram: requests → service method → runtime → data]
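The availability and load-balancing subsystems above can be sketched compactly. This is an illustration of the ideas, not Neptune's implementation: announcements become soft state that expires after 5 seconds, and the client side picks a provider by randomly polling a small sample and discarding slow responses. All names and constants here (AvailabilityDirectory, pick_provider, the sample size and timeout) are assumptions for the sketch.

```python
import random
import time

ANNOUNCE_INTERVAL = 1.0   # providers announce once per second (IP multicast in Neptune)
SOFT_STATE_TTL = 5.0      # availability info expires after 5 seconds
POLL_SAMPLE = 3           # random polling with a small sample
POLL_TIMEOUT = 0.05       # discard slow-responding polls (seconds)

class AvailabilityDirectory:
    """Soft-state directory of live service providers, as kept (in shared
    memory) on every node for efficient local lookup."""
    def __init__(self):
        self.last_seen = {}                    # provider address -> last announcement

    def announce(self, provider):
        self.last_seen[provider] = time.time()

    def alive(self):
        now = time.time()
        return [p for p, t in self.last_seen.items() if now - t <= SOFT_STATE_TTL]

def pick_provider(directory, poll_load):
    """Random polling with sampling: poll a few live providers for their
    current load and pick the least loaded one that answered in time."""
    candidates = directory.alive()
    if not candidates:
        raise RuntimeError("no live providers")
    sample = random.sample(candidates, min(POLL_SAMPLE, len(candidates)))
    answers = []
    for p in sample:
        start = time.time()
        load = poll_load(p)                    # in Neptune this is a network poll
        if time.time() - start <= POLL_TIMEOUT:
            answers.append((load, p))
    if not answers:
        return random.choice(candidates)       # every poll was too slow
    return min(answers)[1]

if __name__ == "__main__":
    directory = AvailabilityDirectory()
    for p in ["node1:7000", "node2:7000", "node3:7000"]:
        directory.announce(p)
    loads = {"node1:7000": 0.9, "node2:7000": 0.2, "node3:7000": 0.5}
    print(pick_provider(directory, loads.get))
```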