GoodReads Book Recommendation Service
Yijun Tian, Vicky Bai, Zeynep Doganata
Introduction/Related Work
- “A subclass of information filtering system that seeks to predict the ‘rating’ or ‘preference’ that a user would give to an item” -- Wikipedia
- Recommendation systems drive significant engagement and revenue for companies such as Amazon, Netflix, and Goodreads.
- Approaches: collaborative filtering, content-based filtering, contextual filtering, social and demographic filtering
- Techniques: supervised learning, clustering/unsupervised learning, transfer learning, text classification, text embedding
Data
- Recommendation engine inputs:
  - Book information (book id, title, book description)
  - Book user information (user id, user’s shelf)
  - Book user behavior (rating, is_read)
- Datasets:
  ● Meta-Data of Books (2.36M books)
  ● User-Book Interactions (229M user-book interactions)
  ● Book Review Texts (15M records)
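As an illustration of how the book metadata could be loaded (the gzipped JSON-lines file name and the book_id/title/description field names are assumptions based on the UCSD GoodReads dump):

```python
import gzip
import json

def load_books(path="goodreads_books.json.gz", limit=None):
    """Stream the gzipped JSON-lines book metadata into a dict keyed by book_id."""
    books = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            books[record["book_id"]] = {
                "title": record.get("title", ""),
                "description": record.get("description", ""),
            }
            if limit is not None and i + 1 >= limit:
                break
    return books

# Example: load only the first 10k books for quick experiments
books = load_books(limit=10_000)
```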
Pipeline: Extract Similar Books
- Input: a book ID (Book A)
- Output: the most similar books, including their IDs, titles, and descriptions
- Ground truth: pairs (A, Similar Book B1) ... (A, Similar Book Bn), from both the GoodReads-provided and the reader-based similar books
- Model: InferSent embeds Book A’s description and each candidate book’s description, a similarity score is calculated between the embeddings, and the most similar books are returned
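A minimal sketch of the final scoring step, assuming the description embeddings have already been computed (the helper names are illustrative, not the actual project code):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar_books(query_id, embeddings, top_n=5):
    """Rank candidate books by similarity of their description embeddings to the query book."""
    query_vec = embeddings[query_id]
    scores = {
        book_id: cosine_similarity(query_vec, vec)
        for book_id, vec in embeddings.items()
        if book_id != query_id
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```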
Extract Similar Books - Ground Truth
(1) Ground Truth Similar Books
- Provided in the GoodReads dataset
- However, we don’t know how these similar books were generated (e.g. same series, topic, author?)
(2) Reader-based Similar Books
- Share the same readers
- Share the same high ratings (4/5 stars)
- Randomly select 200 similar books
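A hedged sketch of how the reader-based similar books could be assembled from the user-book interactions (the tuple layout, rating threshold handling, and sampling details are assumptions):

```python
from collections import defaultdict
import random

def reader_based_similar(interactions, book_id, min_rating=4, sample_size=200, seed=0):
    """Return books that share highly rating readers with the query book.

    `interactions` is an iterable of (user_id, book_id, rating) tuples.
    """
    by_user = defaultdict(set)
    for user_id, b_id, rating in interactions:
        if rating >= min_rating:              # keep only 4/5-star interactions
            by_user[user_id].add(b_id)

    candidates = set()
    for shelf in by_user.values():
        if book_id in shelf:
            candidates |= shelf - {book_id}   # books co-rated by the same reader

    random.seed(seed)
    pool = sorted(candidates)
    return random.sample(pool, min(sample_size, len(pool)))
```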
Model Exploration 1. Word embedding: Word2vec vs FastText 2. Transfer learning: ULMFit 3. Sentence embedding: InferSent
Word Embedding
- One-Hot Encoding, Co-Occurrence Matrix, Word2Vec
- fastText extends Word2Vec with subword (character n-gram) information, e.g. “apple” → <ap, app, ppl, ple, le>
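For comparison, both embeddings can be trained with gensim (4.x API assumed); fastText handles out-of-vocabulary words via its subword n-grams:

```python
from gensim.models import Word2Vec, FastText

# Toy corpus of tokenized book descriptions (the real corpus is the GoodReads descriptions)
sentences = [
    ["an", "international", "bestseller", "about", "friendship"],
    ["a", "classic", "love", "story", "with", "modern", "twists"],
]

# Word2Vec: one vector per whole word seen during training
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# fastText: vectors built from character n-grams, so unseen words still get embeddings
ft = FastText(sentences, vector_size=100, window=5, min_count=1)

print(ft.wv["bestsellers"][:5])  # out-of-vocabulary word handled via subword n-grams
```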
ULMFiT: Universal Language Model Fine-tuning for Text Classification
- Transfer learning in NLP
- Consists of 3 main phases:
  - A language model is trained on a general-domain corpus
  - Fine-tuning on the target task data using slanted triangular learning rates to learn task-specific features
  - Further fine-tuning with gradual unfreezing and slanted triangular learning rates, to preserve low-level features while adapting high-level representations
- Not yet widely used for unsupervised tasks such as Semantic Textual Similarity; most tasks implemented with ULMFiT involve classification
Paper: “Universal Language Model Fine-tuning for Text Classification”
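A rough sketch of the three phases using fastai v1’s ULMFiT implementation (the CSV path and column name are placeholders; fit_one_cycle stands in here for the slanted triangular schedule):

```python
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Book descriptions as the target-task corpus (file and column names are assumptions)
df = pd.read_csv("book_descriptions.csv")
data_lm = TextLMDataBunch.from_df(".", train_df=df, valid_df=df, text_cols="description")

# Phase 1 is already done: AWD_LSTM ships pretrained on a general-domain corpus (WikiText-103)
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

# Phase 2: fine-tune the top layer group on the target corpus
learn.fit_one_cycle(1, 1e-2)

# Phase 3: gradual unfreezing with lower learning rates for earlier layers
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-4, 1e-3))
```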
InferSent: sentence embedding
- Goal: general-purpose sentence embeddings that capture generic semantic information
- Pre-trained on the Stanford Natural Language Inference (SNLI) dataset: 570k human-generated English sentence pairs
- u: premise representation; v: hypothesis representation
- 3-class classifier: entailment, contradiction, and neutral
- Example: “A soccer game with multiple males playing” & “Some men are playing a sport.” (entailment)
Paper: “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data”
InferSent: sentence embedding
- Our accuracy: 0.73 over all similar books, 0.77 for the top 5 (test set: 3,000 books and their similar books)
Paper: “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data”
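A minimal encoding sketch using the facebookresearch/InferSent reference implementation (the model/word-vector paths and the version-2 setup are assumptions):

```python
import numpy as np
import torch
from models import InferSent  # models.py from the facebookresearch/InferSent repo

# Two toy descriptions; the real pipeline encodes every book description
descriptions = [
    "A tale of friendship between two women in a seaside town.",
    "A classic love story with some modern twists.",
]

params = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
          'pool_type': 'max', 'dpout_model': 0.0, 'version': 2}
model = InferSent(params)
model.load_state_dict(torch.load('encoder/infersent2.pkl'))  # downloaded model weights
model.set_w2v_path('fastText/crawl-300d-2M.vec')             # word vectors for the V2 model

model.build_vocab(descriptions, tokenize=True)
embeddings = model.encode(descriptions, tokenize=True)        # one 4096-d vector per description

# Cosine similarity between the two descriptions
sim = float(np.dot(embeddings[0], embeddings[1]) /
            (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])))
```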
Example Results

Original Book: Anita Diamant's international bestseller "The Red Tent" brilliantly re-created the ancient world of womanhood. Diamant brings her remarkable storytelling skills to "Good Harbor" -- offering insight to the precarious balance of marriage and career, motherhood and friendship in the world of modern women. The seaside town of Gloucester, Massachusetts is a place where the smell of the ocean lingers in the air and the rocky coast glistens in the Atlantic sunshine. When longtime Gloucester-resident Kathleen Levine is diagnosed with breast cancer, her life is thrown into turmoil. Frightened and burdened by secrets, she meets Joyce Tabachnik -- a freelance writer with literary aspirations -- and a once-in-a-lifetime friendship is born. Joyce has just bought a small house in Gloucester, where she hopes to write as well as vacation with her family. Like Kathleen, Joyce is at a fragile place in her life. A mutual love for books, humor, and the beauty of the natural world brings the two women together. They share their personal histories, and help each other to confront scars left by old emotional wounds. With her own trademark wisdom and humor, Diamant considers the nature, strength, and necessity of adult female friendship. "Good Harbor" examines the tragedy of loss, the insidious nature of family secrets, as well as the redemptive power of friendship.

Similar Book: In A Little Love Story, Roland Merullo--winner of the Massachusetts Book Award and the Maria Thomas Fiction Award--has created a sometimes poignant, sometimes hilarious tale of attraction and loyalty, jealousy and grief. It is a classic love story--with some modern twists. Janet Rossi is very smart and unusually attractive, an aide to the governor of Massachusetts, but she suffers from an illness that makes her, as she puts it, "not exactly a good long-term investment." Jake Entwhistle is a few years older, a carpenter and portrait painter, smart and good-looking too, but with a shadow over his romantic history. After meeting by accident--literally--when Janet backs into Jake's antique truck, they begin a love affair marked by courage, humor, a deep and erotic intimacy . . . and modern complications. Working with the basic architecture of the love story genre, Merullo--a former carpenter known for his novels about family life--breaks new ground with a fresh look at modern romance, taking liberties with the classic design, adding original lines of friendship, spirituality, and laughter, and, of course, probing the mystery of love. ... (Score: 0.8631)
API Demo
Service Hosting
Specs/Details:
● Flask application with model and data preloading
● GET endpoint with book_id and “top n” parameters
● Docker image ~ 10.4 GB
● InferSent model file size ~ 4.5 GB
● GoodReads data ~ 2.5 GB
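A stripped-down sketch of the service (the route name and response shape are illustrative; the real app preloads the InferSent model and GoodReads data at startup):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Preloaded at startup so each request only pays for inference;
# `similar_books` here is a stand-in for the InferSent-based ranking step.
BOOKS = {"1": {"title": "Good Harbor"}, "2": {"title": "A Little Love Story"}}

def similar_books(book_id, top_n):
    # Placeholder ranking; the real service compares InferSent embeddings
    return [b for b in BOOKS if b != book_id][:top_n]

@app.route("/similar", methods=["GET"])
def similar():
    book_id = request.args.get("book_id")
    top_n = int(request.args.get("top_n", 5))
    if book_id not in BOOKS:
        return jsonify({"error": "unknown book_id"}), 404
    return jsonify({"book_id": book_id, "similar": similar_books(book_id, top_n)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```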
AWS Fargate
● AWS serverless compute engine for containers
● Works with ECS (Elastic Container Service)
● UI configuration - not very intuitive
  ○ Task definitions
  ○ Container definitions
  ○ Soft and hard limits on resources at both layers
● 10 GB memory limit! :(
Kubernetes on DigitalOcean ● “The node had condition: [MemoryPressure]”
AWS SageMaker
● Targeted towards Data Scientists and ML engineers to provide serverless capabilities for:
  ○ Labeling
  ○ Building
  ○ Training
  ○ Sharing notebooks
  ○ Deploying models
  ○ Managing inference endpoints
  ○ Supports “custom” containers
More on ...
● Containers must be deployed to AWS ECR
● Must be organized in a compatible way:
  ○ An inference POST endpoint following the SageMaker spec
  ○ A model directory that gets packaged and uploaded to S3 as part of the deployment
  ○ A data directory
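SageMaker’s custom-container contract expects a GET /ping health check and a POST /invocations inference endpoint served on port 8080, with model artifacts unpacked under /opt/ml/model; a minimal Flask sketch of that shape (the payload fields are illustrative):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
MODEL_DIR = "/opt/ml/model"  # where SageMaker unpacks the model archive from S3

@app.route("/ping", methods=["GET"])
def ping():
    # Health check SageMaker calls before routing traffic to the container
    return "", 200

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.get_json(force=True)
    book_id = payload.get("book_id")
    top_n = int(payload.get("top_n", 5))
    # ... run the InferSent similarity lookup here ...
    return jsonify({"book_id": book_id, "similar": []})

if __name__ == "__main__":
    # SageMaker expects the container to serve on port 8080
    app.run(host="0.0.0.0", port=8080)
```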
Short-term solution: EC2 Instance
● SageMaker looked promising, but after deploying our container we saw it would require a non-trivial amount of refactoring to make it work
● To ensure we had our service deployed somewhere, we provisioned an EC2 instance
● Trade-offs:
  ○ Availability - intermittent crashing
  ○ Scaling requires:
    ■ fleet management
    ■ a load balancer
Hosting Enhancements
If we had more time…
● Dockerize better - layering analysis and pruning unnecessary base-image packages
● Host the model file externally (S3)
● Upload the GoodReads data externally or pickle the data structures (S3)
● Possibly use a small key-value DB for GoodReads data storage
● Try SageMaker, which has been optimized to do this for us
● Latency optimization:
  ○ GPU inference
  ○ Experiment with precomputing embeddings for our dataset
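A small sketch of the precompute-and-host-externally idea (the bucket and key names are placeholders):

```python
import pickle
import boto3

def publish_embeddings(embeddings, bucket, key="goodreads/embeddings.pkl"):
    """Pickle precomputed description embeddings and push them to S3.

    `embeddings` maps book_id -> numpy vector; bucket/key names are placeholders.
    """
    body = pickle.dumps(embeddings)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)

# At startup the service container would then download and unpickle this file
# instead of shipping the raw 2.5 GB GoodReads data inside the Docker image.
```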
Other Enhancements
- Recommendation engine enhancements
  - Combine with user reviews and other information
  - Use other metrics instead of accuracy
- Embedding enhancements
  - Incorporate the trained fastText model into InferSent
  - Combine the word embeddings and text embeddings
References
Datasets:
● Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", RecSys '18.
● Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", ACL '19.
● Common Crawl: https://commoncrawl.org/
Models:
● Howard, Jeremy and Ruder, Sebastian. "Universal Language Model Fine-tuning for Text Classification." ACL, 2018.
● Conneau, Alexis, Douwe Kiela, Holger Schwenk, Loïc Barrault and Antoine Bordes. "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data." EMNLP, 2017.
Questions
Appendix
Recommend
More recommend