Bing Search: An Engine in the Clouds
• Founded in January 2009 • Offices in London (Cardinal Place) and Munich (Rablstr., Gewürzmühlstr.) • More than 70 employees in total • Close collaborations with Microsoft Research in Redmond and Cambridge • Collaborations with various MSFT Product Groups (incl. Office, Skype, Windows, Xbox, etc.) • STC-E Munich: web ranking • Other STCs in Beijing, Hyderabad, and Silicon Valley
Applied Research branch of MSR working on: • IT-Security • Data-privacy • Mobility • Mobile Solutions • Web-Services
Enabling acquisition of data on all classes of devices/sensors, and providing industry-leading analytics and processing capabilities from the edge to cloud, leveraging: • Deep technical expertise (8 years of experience in platform and small devices) • Relationships with engineering groups (Windows, Office 365, Windows Azure) • Strategic customer scenarios (Telemetry, Stream Analytics, Early fault detection, Predictive maintenance)
http://research.microsoft.com/en-us/labs/atle/default.aspx
• Content Gathering: Crawling, Indexing • Matching Query Words to Content: Query modifications needed? • Ranking Results: Features used
• Comprehensiveness: Serving & discovery of hundreds of billions of documents • Frequency: Optimized towards freshness & politeness • Balance: Depth vs. breadth in the processing of document content
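One way to picture the tension between freshness and politeness is a scheduler that always picks the most overdue URL, unless its host was contacted too recently. The sketch below is illustrative only: `CrawlScheduler`, `min_host_delay`, and the "earliest due time first" rule are assumptions made for this example and do not describe the actual Bing crawler.

```python
import heapq
from urllib.parse import urlparse


class CrawlScheduler:
    """Toy scheduler: earliest-due URL first, but never hit a host too often."""

    def __init__(self, min_host_delay=10.0):
        self.min_host_delay = min_host_delay  # politeness: seconds between fetches per host
        self._queue = []                      # heap of (due_time, url)
        self._last_hit = {}                   # host -> time of the last fetch

    def add(self, url, due_time):
        """Schedule a (re)crawl; frequently changing pages would get earlier due times."""
        heapq.heappush(self._queue, (due_time, url))

    def next_url(self, now):
        """Return the most overdue URL that is polite to fetch at `now`, or None."""
        deferred = []
        chosen = None
        while self._queue and self._queue[0][0] <= now:
            due, url = heapq.heappop(self._queue)
            host = urlparse(url).netloc
            last = self._last_hit.get(host)
            if last is None or now - last >= self.min_host_delay:
                self._last_hit[host] = now
                chosen = url
                break
            deferred.append((due, url))       # host was hit too recently; retry later
        for item in deferred:                 # put skipped URLs back on the heap
            heapq.heappush(self._queue, item)
        return chosen
```

With two URLs due on the same host, the second one is held back until `min_host_delay` has elapsed, while a URL on a different host can be fetched in between.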
• Result counts are misleading • Intelligent document selection matters
• Most components are easily parallelizable such as crawling and document processing • Developed on top of Private Cloud • Leveraging Cosmos as a highly scalable storage and computing system • Running under highly performant Datacenter Management System
Petabyte Store and Computation System • About 62 physical petabytes stored (~275 logical petabytes stored in 2011) • Tens of thousands of computers across many datacenters • Massively parallel processing based on Dryad: Similar to MapReduce but can represent arbitrary DAGs of computation; Automatic computation placement with data • SCOPE (Structured Computation Optimized for Parallel Execution): SQL-like language with set-oriented record and column manipulation; Automatically compiled and optimized for execution over Dryad • Management of hundreds of “Virtual Clusters” for computation allocation • Source: http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
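The remark that Dryad can represent arbitrary DAGs of computation can be illustrated with a small in-memory sketch. This is not Dryad or SCOPE code: `run_dag`, the stage names, and the sample queries are hypothetical, and the example relies only on Python's standard `graphlib` module (Python 3.9 or later). It shows a diamond-shaped job in which one cleaned dataset feeds two independent stages whose outputs are then joined, a shape that does not fit a single map step followed by a single reduce step.

```python
from collections import Counter
from graphlib import TopologicalSorter


def run_dag(stages, deps, inputs):
    """Run each stage after all of its upstream stages.

    stages: name -> function taking the upstream outputs as positional arguments
    deps:   name -> list of upstream names (in argument order)
    inputs: name -> precomputed source data
    """
    results = dict(inputs)
    for name in TopologicalSorter(deps).static_order():
        if name in results:                   # source data, nothing to compute
            continue
        upstream = [results[d] for d in deps.get(name, [])]
        results[name] = stages[name](*upstream)
    return results


# Hypothetical diamond: queries -> clean -> (counts, lengths) -> report
deps = {
    "clean": ["queries"],
    "counts": ["clean"],
    "lengths": ["clean"],
    "report": ["counts", "lengths"],
}
stages = {
    "clean": lambda qs: [q.strip().lower() for q in qs],
    "counts": lambda qs: Counter(qs),
    "lengths": lambda qs: {q: len(q.split()) for q in set(qs)},
    "report": lambda counts, lengths: {q: (counts[q], lengths[q]) for q in counts},
}

out = run_dag(stages, deps, {"queries": [" Bing ", "bing", "web search"]})
print(out["report"])
```

A real system would additionally place each stage close to its data and run independent stages in parallel; the sketch only captures the dependency ordering.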
• Linguistic alterations: Stemming, morphological variations, plurals • Spell Corrections: Approximately one third of queries contain misspellings • Synonyms: Used sparingly, only in high confidence situations
• Web represents Knowledge: Approximately one third of queries contain misspellings; Users often correct themselves; Patterns of common (and not so common) misspellings are discovered • Linguistic Depth Less useful: Rules like “i” before “e” except after “c” do not scale; Instead develop computer models of correct spelling; Statistics & usage data help
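A minimal sketch of the "statistics instead of rules" idea: rather than encoding spelling rules, generate every string within one edit of the typed word and keep the candidate seen most often in usage data. This is the well-known textbook approach to statistical spelling correction, not the Bing speller; `edits1`, `correct`, and the word counts in the example are assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return word if it is known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


# Hypothetical counts, standing in for statistics mined from usage data.
counts = Counter({"receive": 50, "recipe": 30, "believe": 20})
print(correct("recieve", counts))  # receive
print(correct("beleive", counts))  # believe
```

No rule about “i” and “e” appears anywhere in the code; both corrections fall out of the counts alone, which is why the approach carries over to new words and new languages as long as usage data exists.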
• Heavy dependence on interaction logs (queries, sessions, …): User behavior is king • Leveraging Cosmos for processing: Billions of entries; Deriving query histogram; Deriving query reformulation graph • Leveraging Machine Learning for ranking suggestions: Training word-level/contextual-level models
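The two derived structures named on this slide, a query histogram and a query reformulation graph, can be sketched on a toy log. The session data, the function names, and the in-memory processing are assumptions for illustration; at the scale of billions of entries this would run as a distributed job rather than a Python loop.

```python
from collections import Counter, defaultdict


def query_histogram(sessions):
    """Count how often each query string occurs across all sessions."""
    return Counter(q for session in sessions for q in session)


def reformulation_graph(sessions):
    """Weighted edges prev_query -> next_query for consecutive queries in a session."""
    graph = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            if prev != nxt:                   # ignore plain repeats of the same query
                graph[prev][nxt] += 1
    return graph


# Hypothetical sessions: each list holds the queries one user issued in order.
sessions = [
    ["spagetti western", "spaghetti western"],
    ["spagetti western", "spaghetti western", "spaghetti western music"],
    ["spaghetti western"],
]

hist = query_histogram(sessions)
graph = reformulation_graph(sessions)
print(hist.most_common(1))                    # [('spaghetti western', 3)]
print(graph["spagetti western"])              # Counter({'spaghetti western': 2})
```

A heavy edge from a rare query to a frequent one, as in the example, is the kind of signal that suggests users are correcting themselves.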
1. Create lots of features (attributes) 2. Calculate them per query/document pair 3. Let the computer learn how to rank with millions/billions of examples 4. Rinse & Repeat
• Do the words appear in the title of the document? • Do the words appear in the order specified? • Are the words a substantial part of the title? • Are the words excessively repeated? • Is the title uncommonly long?
• Good: Title: Spaghetti Western - Wikipedia • Not so good: Title: Western Spaghetti Recipe – Taste of Home • Not so good: Title: The Spaghetti Western Orchestra Tickets 2013 - The Spaghetti Western Orchestra Concert tour 2013 Tickets
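The title questions can be turned into numeric features per query/document pair. The sketch below assumes the query behind the three example titles is "spaghetti western" (the slide does not state it) and uses a deliberately naive whitespace tokenizer; `title_features` and its feature names are hypothetical.

```python
def title_features(query, title):
    """Compute simple title features for one query/title pair."""
    q = query.lower().split()
    # Naive tokenization: split on whitespace and strip surrounding punctuation.
    t = [w.strip("-–,.:") for w in title.lower().split()]
    t = [w for w in t if w]
    return {
        # Do the words appear in the title?
        "all_words_in_title": all(w in t for w in q),
        # Do the words appear in the order specified (as one contiguous run)?
        "words_in_order": any(t[i:i + len(q)] == q
                              for i in range(len(t) - len(q) + 1)),
        # Are the words a substantial part of the title?
        "query_share_of_title": sum(1 for w in t if w in q) / len(t) if t else 0.0,
        # Are the words excessively repeated?
        "max_repetitions": max((t.count(w) for w in q), default=0),
        # Is the title uncommonly long?
        "title_length": len(t),
    }


query = "spaghetti western"  # assumed query for the three example titles
titles = [
    "Spaghetti Western - Wikipedia",
    "Western Spaghetti Recipe – Taste of Home",
    "The Spaghetti Western Orchestra Tickets 2013 - "
    "The Spaghetti Western Orchestra Concert tour 2013 Tickets",
]
for title in titles:
    print(title_features(query, title))
```

Under these assumptions the first title has the words in order and a short title, the second contains both words but in the wrong order, and the third repeats them in a 14-token title, which matches the good / not so good labels.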
Words in Title in Correct Order?
  yes: Is Title uncommonly long?
    no: +4
    yes: Are words repeated excessively?
      yes: -1
      no: +2
  no: 1000s of other features…
1000s of other trees…
• The document with highest cumulative score is ranked highest • Improvements to algorithms & features are made constantly & shipped frequently • Machine Learning allows for scalability across many dimensions
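The cumulative score can be pictured as the sum of many small decision trees, each contributing a few points. In the sketch below, `title_tree` follows one plausible reading of the example tree (leaf values +4, -1, +2), while the second tree, the feature names, and the documents are hypothetical; in practice the trees are learned from examples rather than written by hand.

```python
def title_tree(f):
    """One reading of the example tree over title features."""
    if not f["words_in_order"]:
        return 0  # placeholder: this branch continues into other features
    if not f["title_uncommonly_long"]:
        return 4
    return -1 if f["words_repeated_excessively"] else 2


def freshness_tree(f):
    """Hypothetical second tree standing in for the '1000s of other trees'."""
    return 1 if f["days_since_update"] < 30 else 0


TREES = [title_tree, freshness_tree]


def score(features):
    """Cumulative score: the sum of every tree's contribution."""
    return sum(tree(features) for tree in TREES)


def rank(docs):
    """Order documents by descending cumulative score."""
    return sorted(docs, key=lambda d: score(d["features"]), reverse=True)


docs = [
    {"id": "a", "features": {"words_in_order": True, "title_uncommonly_long": False,
                             "words_repeated_excessively": False, "days_since_update": 100}},
    {"id": "b", "features": {"words_in_order": True, "title_uncommonly_long": True,
                             "words_repeated_excessively": True, "days_since_update": 1}},
    {"id": "c", "features": {"words_in_order": False, "title_uncommonly_long": False,
                             "words_repeated_excessively": False, "days_since_update": 1}},
]
print([d["id"] for d in rank(docs)])  # ['a', 'c', 'b']
```

Because the score is additive, a new tree or a new feature can be introduced without rewriting the existing ones, which is one way to read the remark about scaling across many dimensions.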
• Titles • Images • Document Content • Visual Layout • Links • Co-occurrences of words • Clicks • Freshness • Word Frequencies • Word Proximity • … and many more
• Leveraging Hybrid Cloud Computing Platform • Built on top of heterogeneous computing resources: Single box, Cosmos (Map/Reduce), MPI, … • Easy deployment/management of applications (modules) • Powerful graphical layer of abstraction for defining the data workflow • Batch mode that scales across dimensions such as markets • Leveraging Machine Learning as a Service
Accelerating feature computation through Field Programmable Gate Arrays (FPGAs) • Sources: http://www.altera.com/technology/system-design/articles/2014/cpu-architects.html, http://www.enterprisetech.com/2014/09/03/microsoft-using-fpgas-speed-bing-search/
• Feature Fundamentals: Which features are “universal”, which are market-specific? • Multiple Queries: When does it make sense to issue multiple (altered) queries? How does one merge the results for the user? • Anchor & Link Signals: Which links and which anchor text are informative? What features need to be leveraged in this context? • Knowledge Modeling: View the task of ranking as a translation from “query language” to “document language”
• Global Ranking: Everything mentioned before … for International (no English-US, no CJK); Improving, Evaluating and Shipping Rankers in hundreds of markets • Universal Ranker: One Team and One Ranker
http://www.bingiton.com/