
Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch



  1. Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch § Nick Pentreath § Nov 14, 2016

  2. About § @MLnick § Principal Engineer, IBM § Apache Spark PMC § Focused on machine learning § Author of Machine Learning with Spark

  3. Agenda § Recommender systems & the machine learning workflow § Data modelling for recommender systems § Why Spark, Kafka & Elasticsearch? § Kafka & Spark Streaming § Spark ML for collaborative filtering § Deploying & scoring recommender models with Elasticsearch § Monitoring, feedback & re-training § Scaling model serving § Demo

  4. Recommender Systems & the ML Workflow

  5. Recommender Systems: Overview

  6. The Machine Learning Workflow: Perception § [Diagram labels: Data, Machine Learning, ???, ???, $$$]

  7. The Machine Learning Workflow: Reality § Stages: Data → Ingest → Data Processing → Model Training → Deploy → Live System § Data: historical; streaming § Data Processing: feature transformation & engineering § Model Training: model selection & evaluation § Deploy: pipelines, not just models; versioning § Live System: predict given new data; monitoring & live evaluation § Tool labels: Various, Spark DataFrames, Spark ML, ??? (missing piece!) § Feedback loop: Stream (Kafka)

  8. The Machine Learning Workflow: Recommender Version § Stages: Data → Ingest → Data Processing → Model Training → Deploy → Live System § Data: user & item metadata; events § Data Processing: aggregation; handle implicit data § Model Training: ALS; ranking-style evaluation § Deploy: model size & complexity § Live System: user & item recommendations; monitoring, filters § Tool labels: Spark ML, Elasticsearch, Spark DataFrames, Elasticsearch § Feedback => another event type § Stream (Kafka)

  9. Data Modeling for Recommender Systems

  10. User and Item Metadata: Data model

  11. User and Item Metadata: System Requirements § Filtering & Grouping § Business Rules

  12. Anatomy of a User Event: User Interactions § Implicit preference data: page view; eCommerce (cart, purchase); media (preview, watch, listen) § Intent data: search query § Explicit preference data: rating; review § Social network interactions: like; share; follow

  13. Anatomy of a User Event: Data model

  14. Anatomy of a User Event: How to handle implicit feedback?

  15. Why Kafka, Spark & Elasticsearch?

  16. Why Kafka? Scalability: § De facto standard for a centralized enterprise message / event queue Integration: § Integrates with just about every storage & processing system § Good Spark Streaming integration (1st-class citizen) § Including for Structured Streaming (but still very new & rough!)

  17. Why Spark? DataFrames: § Events & metadata are “lightly structured” data § Suited to DataFrames § Pluggable external data source support Spark ML: § Spark ML pipelines, including scalable ALS model for collaborative filtering § Implicit feedback & NMF in ALS § Cross-validation § Custom transformers & algorithms

  18. Why Elasticsearch? Storage: § Native JSON § Scalable § Good support for time-series / event data § Kibana for data visualisation § Integration with Spark DataFrames Scoring: § Full-text search § Filtering § Aggregations (grouping) § Search ~== recommendation (more later)

  19. Kafka for Recommender Systems

  20. Event Data Pipeline § Kafka → Spark Streaming → Event store; Dashboards; User analytics & aggregation; Item analytics & aggregation

  21. Write to Event Store § Spark Streaming → Event store

  22. Kibana Dashboards § Spark Streaming → Dashboards

  23. Item Metadata Analytics § Spark Streaming → Item analytics & aggregation § Aggregated activity metrics

  24. User Metadata Analytics § Spark Streaming → User analytics & aggregation § Aggregated activity metrics & item exclusions

  25. Structured Streaming: Status § Still early days § Initial Kafka support in Spark 2.0.2 § No ES support yet; not clear if it will be a full-blown datasource or ForeachWriter § For now, you can create a custom ForeachWriter for your needs

  26. Spark ML for Collaborative Filtering

  27. Collaborative Filtering: Matrix Factorization § [Figure: a ratings matrix shown alongside user and item factor matrices; numeric entries not recoverable]

  28. Collaborative Filtering: Prediction § [Figure: the same ratings matrix and factor matrices, used to compute a prediction; numeric entries not recoverable]
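
As a rough illustration of the prediction step in matrix factorization, the sketch below scores items for one user by taking the dot product of the user's factor vector with each item's factor vector. The function names and the tiny 2-dimensional factors are hypothetical and are not taken from the slides; a trained ALS model would supply the real factors.

```python
def predict(user_factors, item_factors):
    """Predicted preference: dot product of a user and an item factor vector."""
    if len(user_factors) != len(item_factors):
        raise ValueError("factor vectors must have the same dimension")
    return sum(u * v for u, v in zip(user_factors, item_factors))


def recommend(user_factors, item_factor_table, top_n=2):
    """Score every item for one user and return the top_n (item_id, score) pairs."""
    scored = [(item_id, predict(user_factors, factors))
              for item_id, factors in item_factor_table.items()]
    # Highest predicted preference first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical 2-dimensional factors, chosen only for illustration.
user = [1.0, 0.5]
items = {"a": [2.0, 0.0], "b": [0.0, 2.0], "c": [-1.0, 1.0]}
top = recommend(user, items)
```

Scoring every item this way is the brute-force approach; the later slides on scoring performance are about making this step cheaper at scale.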

  29. Collaborative Filtering: Loading Data in Spark ML

  30. Alternating Least Squares: Implicit Preference Data
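
One common way to treat implicit data, following the formulation in "Collaborative Filtering for Implicit Feedback Datasets" (listed in the references), is to turn raw interaction counts into a binary preference plus a confidence weight. The sketch below assumes that formulation; the function names, the sample events and the `alpha` value are hypothetical. In Spark ML, ALS exposes `implicitPrefs` and `alpha` parameters for this mode, but whether the talk used exactly these settings is not stated in the slides.

```python
from collections import Counter


def aggregate_events(events):
    """Count interactions per (user, item) pair from raw (user, item) events."""
    return Counter((user, item) for user, item in events)


def preference_and_confidence(count, alpha=1.0):
    """Map an interaction count to (preference, confidence).

    preference is 1 if the user interacted with the item at all, else 0;
    confidence grows linearly with the count: 1 + alpha * count.
    """
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence


# Hypothetical event stream: each tuple is one page view, play, purchase, etc.
events = [("u1", "i1"), ("u1", "i1"), ("u1", "i2"), ("u2", "i1")]
counts = aggregate_events(events)
weights = {pair: preference_and_confidence(c, alpha=2.0)
           for pair, c in counts.items()}
```

Unobserved pairs get preference 0 with the minimum confidence of 1, so missing data is treated as weak evidence of no interest rather than ignored.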

  31. Deploying & Scoring Recommendation Models

  32. Prelude: Search § Full-text Search & Similarity § Pipeline: Analysis → Term vectors → Scoring (similarity) → Ranking (sort results) § Example query: “cat videos” § [Figure: binary term vectors for the query and the documents]

  33. Recommendation: Can we use the same machinery? § Pipeline: Analysis (?) → Term vectors → Scoring (similarity) → Ranking (sort results) § Query: user (or item) vector § Dot product & cosine similarity … the same as we need for recommendations!

  34. Elasticsearch Term Vectors: Delimited Payload Filter § Raw vector (1.2 ⋯ −0.2 0.3) → Custom analyzer → Term vector with payloads (0|1.2 ⋯ 3|-0.2 4|0.3)
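
The `index|value` layout shown on this slide can be produced with a few lines of code before indexing. The sketch below is only an illustration of that text encoding: the helper names are hypothetical, and the actual analyzer configuration on the Elasticsearch side is not shown here.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a dense vector as space-separated 'index|value' tokens."""
    return " ".join(f"{i}{delimiter}{value}" for i, value in enumerate(vector))


def from_payload_string(text, delimiter="|"):
    """Decode 'index|value' tokens back into an {index: value} mapping."""
    pairs = (token.split(delimiter) for token in text.split())
    return {int(index): float(value) for index, value in pairs}


encoded = to_payload_string([1.2, -0.2, 0.3])
decoded = from_payload_string(encoded)
```

Each vector index becomes a "term" and each factor value becomes that term's payload, which is what lets a search engine store the model alongside ordinary item metadata.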

  35. Elasticsearch Scoring: Custom scoring function • Native script (Java), compiled for speed • Scoring function computes dot product by: § For each document vector index (“term”), retrieve payload § score += payload * query(i) • Normalizes with query vector norm and document vector norm for cosine similarity
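
The scoring loop described on this slide can be sketched outside Elasticsearch as follows. This is not the plugin's Java code; it is a minimal Python rendering of the same arithmetic, with hypothetical function names, assuming the document vector is available as an `{index: payload}` mapping and the query as a plain list.

```python
import math


def dot_product(doc_payloads, query):
    """Accumulate payload * query[i] over every document vector index."""
    score = 0.0
    for index, payload in doc_payloads.items():
        score += payload * query[index]
    return score


def cosine_similarity(doc_payloads, query):
    """Dot product normalized by the document and query vector norms."""
    doc_norm = math.sqrt(sum(p * p for p in doc_payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # Avoid division by zero for empty or all-zero vectors.
    return dot_product(doc_payloads, query) / (doc_norm * query_norm)


# Hypothetical vectors, chosen only for illustration.
doc = {0: 3.0, 1: 4.0}
query = [3.0, 4.0]
```

The unnormalized dot product corresponds to a predicted preference for a user vector, while the cosine form is the natural choice for item-to-item similarity.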

  36. Recommendation: Can we use the same machinery? § Pipeline: Analysis (delimited payload filter) → Term vectors → Scoring (custom scoring function) → Ranking (sort results) § Query: user (or item) vector § [Figure: real-valued factor vectors in place of binary term vectors]

  37. Elasticsearch Scoring: We get search engine functionality for free!

  38. Alternating Least Squares: Deploying to Elasticsearch

  39. Monitoring & Feedback

  40. System Events: Logging Recommendations Served

  41. System Events: Logging Recommendation Actions

  42. Tracking Performance § Kafka → Spark Streaming → Event store; Dashboards; Performance monitoring & alerts; Impression capping / fatigue

  43. Scaling Model Scoring

  44. Scoring Performance § [Chart: Scoring time per query, by factor dimension & number of items; x-axis: size of item set (100,000 and 1,000,000); y-axis: time (ms); series: k=20, k=50, k=100; setup: 3x nodes, 30x shards]

  45. Scoring Performance: Increasing number of shards § [Chart: Scoring time per query, by number of shards & number of items; x-axis: size of item set (100,000 and 1,000,000); y-axis: time (ms); series: 10, 30, 60 and 90 shards; setup: 3x nodes, k=50]

  46. Scoring Performance: Locality Sensitive Hashing • LSH hashes each input vector into L “hash tables”. Each table contains a “hash signature” created by applying k hash functions. • Standard for cosine similarity is Sign Random Projections • At indexing time, create a “bucket” by combining hash table id and hash signature • Store buckets as part of item model metadata • At scoring time, filter candidate set using term filter on buckets of query item • Tune LSH parameters to trade off speed / accuracy • LSH coming soon to Spark ML (SPARK-5992)
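
The bucket scheme on this slide can be sketched with sign random projections as below. Everything here is an illustrative assumption: the function names, the `tableId_signature` bucket format, the Gaussian hyperplanes and the fixed seed are not taken from the talk, and a production setup would store the buckets in the search index and apply a term filter there rather than filter in Python.

```python
import random


def make_hyperplanes(num_tables, num_hashes, dim, seed=42):
    """Draw num_tables groups of num_hashes random Gaussian hyperplanes."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_hashes)]
            for _ in range(num_tables)]


def lsh_buckets(vector, hyperplanes):
    """Return one bucket string per hash table: '<table id>_<bit signature>'."""
    buckets = []
    for table_id, planes in enumerate(hyperplanes):
        # Each hash function contributes one bit: the sign of a projection.
        bits = "".join(
            "1" if sum(p * x for p, x in zip(plane, vector)) >= 0 else "0"
            for plane in planes)
        buckets.append(f"{table_id}_{bits}")
    return buckets


def candidates(query_buckets, item_buckets):
    """Keep only items that share at least one bucket with the query."""
    wanted = set(query_buckets)
    return [item_id for item_id, buckets in item_buckets.items()
            if wanted & set(buckets)]


# Hypothetical parameters: L=3 tables, k=4 hash functions, 5-dimensional factors.
planes = make_hyperplanes(num_tables=3, num_hashes=4, dim=5)
query_vector = [0.3, -1.2, 0.7, 2.0, -0.5]
query_buckets = lsh_buckets(query_vector, planes)
```

More hash functions per table make buckets more selective (faster, lower recall), while more tables raise the chance that a truly similar item shares some bucket with the query; that is the speed / accuracy trade-off mentioned above.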

  47. Scoring Performance: Locality Sensitive Hashing § [Chart: Scoring time per query, brute force vs LSH; y-axis: time (ms); bars: Brute force, LSH; setup: 3x nodes, 30x shards, k=50, 1,000,000 items]

  48. Scoring Performance: Comparison to “score then search” § [Chart: Scoring time per query, LSH vs score-then-search; y-axis: time (ms); bars: Brute force, LSH, Score-then-search; components: Score, Sort, Search; setup: 3x nodes, 30x shards, k=50, 1,000,000 items]

  49. Demo

  50. Future Work

  51. Future Work • Apache Solr version of scoring plugin (any takers?) • Investigate ways to improve Elasticsearch scoring performance § Performance for LSH-filtered scoring should be better! § Can we dig deep into ES scoring internals to combine efficiency of matrix-vector math with ES search & filter capabilities? • Investigate more complex models § Factorization machines & other contextual recommender models § Scoring performance • Spark Structured Streaming with Kafka, Elasticsearch & Kibana § Continuous recommender application including data, model training, analytics & monitoring

  52. References • Elasticsearch • Elasticsearch Spark Integration • Spark ML ALS for Collaborative Filtering • Collaborative Filtering for Implicit Feedback Datasets • Factorization Machines • Elasticsearch Term Vectors & Payloads • Delimited Payload Filter • Vector Scoring Plugin • Kafka & Spark Streaming • Kibana
