Recommender System in KKBOX
From simple to complex: Collaborative Filtering → Attribute Based → Persona Aware → Context Aware
Approaches: Ranking Model / Embedding Representation
Scale: #items × #users × #attributes
Evaluation dimensions: Precision · Diversity · Serendipity/Novelty
Collaborative Filtering: Matrix Factorization
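A minimal sketch of matrix factorization with Spark's ALS, the kind of collaborative-filtering model sketched above. The (user_id, song_id, plays) schema and all parameter values are illustrative assumptions, not KKBOX's actual setup.

```python
# Collaborative filtering via low-rank matrix factorization (ALS).
# Assumed schema: (user_id, song_id, plays) with implicit feedback.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("cf-mf").getOrCreate()

plays = spark.createDataFrame(
    [(0, 10, 5.0), (0, 11, 1.0), (1, 10, 2.0), (1, 12, 4.0)],
    ["user_id", "song_id", "plays"],
)

# ALS factorizes the user x item matrix into low-rank user/item factors.
als = ALS(
    rank=32,                  # number of latent factors (illustrative)
    implicitPrefs=True,       # treat play counts as implicit feedback
    userCol="user_id",
    itemCol="song_id",
    ratingCol="plays",
    coldStartStrategy="drop",
)
model = als.fit(plays)
model.recommendForAllUsers(10).show()  # top-10 songs per user
```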
Word2Vec - “The results, to our own surprise, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts.” (Baroni et al.)
“You shall know a word by the company it keeps.” (Firth, J.R.)
CBOW / Skip-gram
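A minimal sketch contrasting the two word2vec variants with gensim. The toy "sentences" here (play sessions as song-ID sequences) are made up for illustration.

```python
from gensim.models import Word2Vec

sessions = [
    ["song_a", "song_b", "song_c"],
    ["song_b", "song_c", "song_d"],
]

# sg=0 -> CBOW: predict a word from its surrounding context.
# sg=1 -> skip-gram: predict the context from a word.
cbow = Word2Vec(sessions, vector_size=64, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sessions, vector_size=64, window=5, min_count=1, sg=1)

print(skipgram.wv.most_similar("song_b", topn=3))
```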
DeepWalk (Bryan Perozzi, Rami Al-Rfou & Steven Skiena, 2014): Random Walk + Word2Vec
Example random walk over songs: 青花瓷 → 給我一首歌的時間 → 珊瑚海 → 我不配 → 黃金甲 → 珊瑚海 → 雙截棍 → 天地一鬥
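A rough DeepWalk sketch: sample random walks like the one above, then treat each walk as a "sentence" for word2vec. The dict-of-neighbors graph, walk length, and counts are illustrative assumptions.

```python
import random
from gensim.models import Word2Vec

def random_walk(graph, start, length):
    """Uniform random walk of `length` steps from `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def deepwalk(graph, walks_per_node=10, walk_length=40, dim=64):
    # Each walk becomes a "sentence" of node IDs fed to word2vec.
    corpus = [
        random_walk(graph, node, walk_length)
        for node in graph
        for _ in range(walks_per_node)
    ]
    return Word2Vec(corpus, vector_size=dim, window=5, min_count=1, sg=1)

# Toy co-play graph; edges would come from user listening logs.
graph = {"青花瓷": ["珊瑚海", "雙截棍"], "珊瑚海": ["青花瓷"], "雙截棍": ["青花瓷"]}
model = deepwalk(graph)
```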
Scale: (#items + #users) × log(#items + #users) × #hidden_nodes × window_size
Cold start? Learn the relationships between latent factors and audio signals.
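A minimal sketch of that idea: regress from audio features (e.g., a flattened mel-spectrogram) to the latent factors the collaborative model produced, so brand-new songs get a vector before anyone plays them. All shapes and layers here are assumptions.

```python
import torch
import torch.nn as nn

n_mels, n_frames, latent_dim = 128, 512, 32  # illustrative sizes

# Simple regressor from audio features to CF latent factors.
audio_to_latent = nn.Sequential(
    nn.Flatten(),
    nn.Linear(n_mels * n_frames, 256),
    nn.ReLU(),
    nn.Linear(256, latent_dim),
)

spectrograms = torch.randn(8, n_mels, n_frames)  # batch of audio features
cf_factors = torch.randn(8, latent_dim)          # targets from the CF model

loss = nn.functional.mse_loss(audio_to_latent(spectrograms), cf_factors)
loss.backward()
```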
We have the features; ranking is another problem.
Click/Play Prediction
● Regression
● Classification
● Learn to Rank
Content Understanding
● Embedding
● Classification
● Topic Mining
User Understanding
● Embedding
● User Profiling
A free bonus: building a pipeline (see the sketch below)
Data pre-processing → ETL job
Feature extraction → numerical / categorical features...
Model fitting → Logistic Regression / GBDT
Validation stages → cross validation
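A minimal sketch of those four stages with pyspark.ml, assuming a DataFrame with one categorical column, one numeric column, and a label; all column names are placeholders.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Feature extraction: categorical -> index -> one-hot, then assemble.
indexer = StringIndexer(inputCol="genre", outputCol="genre_idx")
encoder = OneHotEncoder(inputCols=["genre_idx"], outputCols=["genre_vec"])
assembler = VectorAssembler(inputCols=["genre_vec", "play_count"],
                            outputCol="features")

# Model fitting: logistic regression (GBDT would slot in the same way).
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])

# Validation stage: grid search with cross validation.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
# model = cv.fit(training_df)  # training_df comes from the ETL job
```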
Challenges
● Big data
● Heterogeneous sources
● Various formats
● Data versioning
● Data quality
● Data freshness
● Cost
● Coding is hard, debugging is harder
Data sources:
● Logs: Parquet, JSON, TSV, text, ...
● External datasets / logs: genre, BPM, artist DB, Mixpanel, App Annie, ...
● Databases: songs, members, ...
ETL:
● Data cleaning, normalization
● Pre-aggregation / join
Output: Parquet files in S3, partitioned by date and service region if needed.
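A minimal sketch of that ETL step in PySpark; bucket paths and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl").getOrCreate()

logs = spark.read.json("s3://bucket/raw-logs/")           # heterogeneous input
songs = spark.read.parquet("s3://bucket/db-dump/songs/")  # replicated DB dump

# Data cleaning / normalization.
cleaned = (logs
           .dropDuplicates(["event_id"])
           .withColumn("date", F.to_date("timestamp")))

# Pre-aggregation / join, then write Parquet partitioned by date and region.
(cleaned.join(songs, "song_id")
        .write.mode("overwrite")
        .partitionBy("date", "region")
        .parquet("s3://bucket/etl-output/plays/"))
```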
Serving the data:
● ETL data (Parquet files on S3) + DB replication
● Thrift (or Protobuf, Avro) schema
● Hive table → Presto (or Amazon Athena)
Access:
● Apache Spark (Scala)
  ○ From files on S3 to RDD / DataFrame
  ○ Use the JDBC driver from Presto
● Python / R
  ○ Read files from S3, deserialize Parquet
  ○ Use the JDBC/ODBC driver from Presto
Example
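A hedged illustration of both access paths (Spark straight from S3, Python through Presto via pyhive); the host, bucket, and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-example").getOrCreate()

# 1) Spark: from Parquet files on S3 straight to a DataFrame.
df = spark.read.parquet("s3://bucket/etl-output/plays/")
df.filter(df.region == "TW").groupBy("song_id").count().show()

# 2) Python: query the same data through Presto.
from pyhive import presto

conn = presto.connect(host="presto.internal", port=8080)
cur = conn.cursor()
cur.execute("SELECT song_id, count(*) FROM plays GROUP BY 1 LIMIT 10")
print(cur.fetchall())
```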
Challenges → solutions
● Big data → EC2 + Spark + Hadoop family + Presto
● Heterogeneous sources → ETL
● Various formats → ETL
● Data versioning → ETL
● Data quality → ETL
● Data freshness → DB replication, data streaming
● Cost → EC2, good tool chain
● Coding is hard, debugging is harder → good design
Case Study
Nearest Neighbors of Songs (sketch of step 3 below)
1. Build a weighted bipartite graph of users and songs from logs
   ● Terabytes of data, billions of nodes and edges
   ● Spark cluster on EC2 (on-demand, hundreds of cores, I/O optimized)
2. Put each song in a vector space
   ● Random walks
   ● An embedding model (we use word2vec)
   ● On a very large instance with a lot of memory and cores
3. Find the K nearest neighbors of each song
   ● O(n^2) is impossible
   ● Approximation, for example Locality-Sensitive Hashing
   ● Spark cluster on EC2; the worker nodes are CPU optimized
All intermediate results are in Parquet format on S3, so we can inspect them with Presto.
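A minimal sketch of the LSH approximation with Spark's built-in BucketedRandomProjectionLSH; the toy song vectors, bucket length, and distance threshold are illustrative, not the production values.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("song-knn").getOrCreate()

songs = spark.createDataFrame(
    [(0, Vectors.dense([0.1, 0.9])),
     (1, Vectors.dense([0.2, 0.8])),
     (2, Vectors.dense([0.9, 0.1]))],
    ["song_id", "vector"],
)

lsh = BucketedRandomProjectionLSH(inputCol="vector", outputCol="hashes",
                                  bucketLength=0.5, numHashTables=4)
model = lsh.fit(songs)

# Self-join yields candidate pairs within the distance threshold
# instead of computing all n^2 exact distances.
pairs = model.approxSimilarityJoin(songs, songs, 0.5, distCol="dist")
pairs.filter("datasetA.song_id < datasetB.song_id").show()
```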
Songs a User Would Like to Listen to Again (sketch of step 2 below)
1. Extract features from logs, databases, and external datasets
   ● Join billions of transactions
   ● Spark cluster on EC2 (on-demand, hundreds of cores)
2. Train a model
   ● Spark MLlib (e.g., GBDT)
   ● Deep learning frameworks (TensorFlow)
3. Repeat: feature selection, parameter tuning
4. Predict from recent logs
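A minimal sketch of the GBDT option with Spark MLlib's GBTClassifier; the toy feature rows stand in for the joined ETL output, and all names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("replay-model").getOrCreate()

# Toy (features, label) rows; label = did the user replay the song.
features_df = spark.createDataFrame(
    [(Vectors.dense([3.0, 0.7]), 1.0),
     (Vectors.dense([1.0, 0.1]), 0.0),
     (Vectors.dense([5.0, 0.9]), 1.0),
     (Vectors.dense([0.0, 0.2]), 0.0)],
    ["features", "label"],
)

gbt = GBTClassifier(featuresCol="features", labelCol="label",
                    maxIter=50, maxDepth=5)
train, test = features_df.randomSplit([0.8, 0.2], seed=42)
model = gbt.fit(train)

# Evaluate, then loop back to feature selection / tuning (step 3).
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(
    model.transform(test))
print(f"AUC = {auc:.3f}")

# Step 4 would score recent logs: model.transform(recent_logs_df)
```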
Life cycle of ML-related features:
Define the Problem → Hypothesis → Inspect the Data → Train and Verify the Model → Deploy and A/B Testing → (repeat)
References
● Apache Spark
● Apache Parquet
● Apache Thrift
● Apache Hive
● Presto
● Amazon Elastic Compute Cloud (EC2)