Scaling Java Based Recommendation Engine for Leading Q&A site
Anurag Gupta
Senior Architect, Yahoo Answers
anuragg@yahoo-inc.com
Agenda
• Yahoo Answers – A Snapshot
• Why Open Question Recommender?
• Technology
  – Technology Requirements
  – Architecture
  – Design
  – Flow Diagrams
  – Science Models
  – Infrastructure
  – Results
• Future
Yahoo! Answers – A Snapshot
• Total reach: 9 languages, 21 markets
• Over 200M users and 2B page views each month
• One of the top 40 sites globally in terms of traffic
• 5th most popular social website
• Over 300 million questions and 1 billion answers
• Mobile growth 2X YOY, 80M users, 400M PVs
Source: Yahoo Internal Data
Why Open Question Recommender?
• Difficult for answerers to find relevant questions
• Need to increase number of answers / question
• Increase signal-to-noise ratio
• Route questions to “right” answerers
Technology Requirements
• Use answerer's interests for recommendations
• Bucket (A/B test) aware
• Avoid questions already answered by the answerer
• A fraction of the recommended open questions should be diverse
• Submission to global availability in less than 1s
• 90th percentile serving latency below 100ms
• Diagnostics
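Two of the requirements above (skip questions the answerer has already answered, and keep a fraction of the list diverse) can be sketched as a small selection step. This is an illustrative sketch only, not the production code: the class name `CandidateFilter`, the `select` method, and the idea of a separate "diverse pool" are assumptions introduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; names and structure are assumptions, not the deck's code. */
final class CandidateFilter {

    /**
     * Builds a list of at most {@code limit} question IDs.
     * Already-answered questions are dropped, and floor(limit * diverseFraction)
     * slots are reserved for questions from a separate diverse pool.
     */
    static List<String> select(List<String> interestRanked,
                               List<String> diversePool,
                               Set<String> alreadyAnswered,
                               int limit,
                               double diverseFraction) {
        int diverseSlots = (int) Math.floor(limit * diverseFraction);
        List<String> result = new ArrayList<>();

        // Fill the interest-based slots first, preserving the ranked order.
        for (String q : interestRanked) {
            if (result.size() >= limit - diverseSlots) break;
            if (!alreadyAnswered.contains(q) && !result.contains(q)) result.add(q);
        }
        // Fill whatever slots remain from the diverse pool.
        for (String q : diversePool) {
            if (result.size() >= limit) break;
            if (!alreadyAnswered.contains(q) && !result.contains(q)) result.add(q);
        }
        return result;
    }
}
```

If the interest-ranked list runs short, the diverse pool fills the gap, so the diverse share is a minimum rather than an exact quota in this sketch.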
Architecture
[Architecture diagram. Labels recoverable from the slide, in extraction order: Users, Spam, Abuse, Security, Junk Detector, Front End, Caching, Yapache / Maple, Karma, Applications, API / YQL, Middle Tier, Tools, Java, Tomcat, Recommender Systems, Customer, User Care, Question, Database, Recommender, Editorial, SocDir, Related, Analytics, Questions, Storage, Tooldev, AutoCat, Oracle, NoSQL, Search, Hadoop Grid]
Open Question Recommender Design
[Design diagram. Components: Front-End, Middle-tier, Open Question Recommender Engine, R/W APIs, Search APIs, Mapping Algorithms, ActiveMQ, Data Store (NoSQL, Oracle), Cache, Machine Topics Model]
Mappings held in the data store:
• User → list of machine topics
• Q → list of machine topics
• Machine Topic → list of questions
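The mappings named in the design (user to machine topics, machine topic to questions) suggest a two-step lookup at serving time. The sketch below is a guess at that shape, with in-memory maps standing in for the NoSQL store; `TopicLookup` and `candidatesFor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: in-memory maps stand in for the NoSQL mappings. */
final class TopicLookup {
    private final Map<String, List<String>> userToTopics;
    private final Map<String, List<String>> topicToQuestions;

    TopicLookup(Map<String, List<String>> userToTopics,
                Map<String, List<String>> topicToQuestions) {
        this.userToTopics = userToTopics;
        this.topicToQuestions = topicToQuestions;
    }

    /** Returns candidate question IDs for a user, de-duplicated, in topic order. */
    List<String> candidatesFor(String userId) {
        Set<String> seen = new LinkedHashSet<>();
        for (String topic : userToTopics.getOrDefault(userId, List.of())) {
            seen.addAll(topicToQuestions.getOrDefault(topic, List.of()));
        }
        return new ArrayList<>(seen);
    }
}
```

A user with no stored topics simply gets an empty candidate list here; how the real system handles such users is not stated in the slides.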
Information Flow in Open Question Recommender
Sequence Flows
Enable Users to Discover Relevant Content
• Science model to infer user interest
  – Objective is to surface most relevant open questions for answerers
  – Model per top level Answers category
  – Clusters based on selectivity of words
  – Questions / users mapped to clusters
  – Relevant questions surfaced based on answerer’s affinity to cluster
• Success Metric
  – Average number of answers per question
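One plausible way to turn "affinity to cluster" into a ranking score is a dot product between a user's cluster weights and a question's cluster weights. The slides do not say which scoring function was used, so treat the following as an assumption; `ClusterAffinity` and `score` are hypothetical names.

```java
import java.util.Map;

/** Hypothetical sketch: dot-product affinity is an assumption, not the deck's stated method. */
final class ClusterAffinity {

    /** Dot product of two sparse cluster-weight vectors keyed by cluster ID. */
    static double score(Map<Integer, Double> userWeights,
                        Map<Integer, Double> questionWeights) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : questionWeights.entrySet()) {
            // Clusters the user has no weight for contribute nothing.
            sum += e.getValue() * userWeights.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }
}
```

Questions would then be sorted by descending score for each answerer, so a question in a cluster the user never touches scores zero.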
Science Models
• LDA (Latent Dirichlet Allocation)
• TF-IDF (Term Frequency-Inverse Document Frequency)
• Answers Category
• Diversity
• Decay
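For readers unfamiliar with two of the listed models, here is a minimal sketch of a textbook TF-IDF weight and of an exponential time decay. The slides do not specify the exact formulas or parameters, so the half-life form of decay and all names (`ScoringSketch`, `tfIdf`, `decayed`) are assumptions.

```java
/** Hypothetical sketch of textbook formulas; the deck does not give its exact variants. */
final class ScoringSketch {

    /**
     * Classic TF-IDF: (term count / document length) * ln(total docs / docs containing term).
     * Assumes docLength > 0 and docsWithTerm > 0.
     */
    static double tfIdf(int termCount, int docLength, int totalDocs, int docsWithTerm) {
        double tf = (double) termCount / docLength;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    /** Halves the score every {@code halfLifeHours} of question age (assumed decay shape). */
    static double decayed(double score, double ageHours, double halfLifeHours) {
        return score * Math.pow(0.5, ageHours / halfLifeHours);
    }
}
```

A term that appears in every document gets a weight of zero, which is the sense in which TF-IDF favors selective words; decay then pushes older open questions down the list.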
User Representation Model
Infrastructure
• Bucketing / Analytics infrastructure
  – Evaluate if differences in science algorithms, UI are statistically significant
• ActiveMQ
  – Publish / subscribe system that decouples serving from peripheral systems
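As an illustration of what "statistically significant" can mean when comparing two buckets, the sketch below runs a standard two-proportion z-test on click-through rates. The slides do not name the test actually used; the class `BucketSignificance`, its methods, and the 5% threshold are assumptions.

```java
/** Hypothetical sketch: a two-proportion z-test, one common choice for A/B CTR comparisons. */
final class BucketSignificance {

    /**
     * z-score for the difference in CTR between bucket B and bucket A.
     * Assumes both view counts are positive and the pooled CTR is strictly between 0 and 1.
     */
    static double zScore(long clicksA, long viewsA, long clicksB, long viewsB) {
        double pA = (double) clicksA / viewsA;
        double pB = (double) clicksB / viewsB;
        double pooled = (double) (clicksA + clicksB) / (viewsA + viewsB);
        double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / viewsA + 1.0 / viewsB));
        return (pB - pA) / se;
    }

    /** Two-sided test at the 5% level (critical value 1.96). */
    static boolean significantAt5Percent(double z) {
        return Math.abs(z) > 1.96;
    }
}
```

For example, 100 clicks on 1,000 views against 150 clicks on 1,000 views gives a z-score of roughly 3.4, which clears the 1.96 threshold.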
Results
• Global CTR increased by ~50%
• US CTR increased by ~3X
• 4% growth in answers / user
• 90th percentile serving latencies below 100ms
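The 90th percentile latency figure can be checked from raw samples with a nearest-rank percentile. This is a generic sketch, not the monitoring code behind the numbers above; `LatencyPercentile` and `percentile` are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical sketch: nearest-rank percentile over latency samples in milliseconds. */
final class LatencyPercentile {

    /** Returns the smallest sample such that at least p percent of samples are <= it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        long[] sorted = samplesMs.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

A check such as `LatencyPercentile.percentile(samples, 90) < 100` would then express the latency requirement directly.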
Sample Verbatim Feedback from Users
• “this is a good idea i like it.”
• “This is a cool idea.”
• “Good idea.”
• “Just now I saw this feature enabled for me. I liked it. Actually, when I came back from office, I will go to unanswered question or popular questions. Because I can’t check all the questions. This recommended feature is pretty nice.”
• “I find this feature quite handy. Also this brings up qusetions which I would not have known ever existed for me to answer.”
Future
• Re-architect front-end for mobile
  – HTML5 / JS / CSS
  – Driven off APIs
• Re-architect back-end
  – Scale across multiple colos
  – Reduce latency by caching content in colos closer to user
• Discovery