Lyric-Based Music Recommendation
Paper Authors: Derek Gossi, Mehmet H. Gunes (University of Nevada, Reno)
By: Brendan Abraham, Abhijith Mandya
Overview
● The challenge of music recommendation
● Industry standard (collaborative filtering) vs. text mining approach (lyrical analysis)
● Choice of data (musiXmatch & Million Song Dataset - MSD)
● Feature engineering (TF-IDF, cosine similarity, subgraph analysis)
● Ranked recommendations using a KNN graph
● Performance comparison against each other and random recommendations
● Conclusions and way forward
Collaborative Filtering
● Today, most music recommendation systems use collaborative filtering
● Based on user preference, not on musical or lyrical content
● Doesn't scale well when songs lack user ratings
[Illustration: two listeners compare tastes. "Hey! I like songs A, B, C, and D." "Well! I like songs B, C, D and E." "You should totally check out E." "You should definitely listen to A then."]
Subjectivity in Music Recommendations
● Preferences can span a wide range of genres and styles
● User ratings and preferences limit the capabilities of recommendation tasks
● Find factors beyond genre that influence a listener's probability of enjoying a song
[Image caption: "Me trying to explain my taste in music"]
The Lyrical Approach
● Supervised classification of lyrical content into categories using tags
● Can show the importance of rhyme, repetition, and meter
● This paper builds a complex network from lyrics and compares clustering methods on it
Lyrics Dataset
● Lyrics for 237,662 tracks and 22,821 unique artists, linked to MSD
● Bag-of-words (BOW) format
● Stemmed using a modified Porter2 stemming algorithm
● Limited to the top 5,000 words, accounting for ~92% of all word occurrences
Methodology: Overview
Lyric pipeline: Lyric Data → Lyrical VSM → TF-IDF → Similarity Matrix → KNN Graph → Graph Analysis
User pipeline: User Listening Data → User-based VSM → Similarity Matrix → KNN Graph → Graph Analysis
Methodology: Vector Space Models
Goal: represent an artist's "vocabulary"
Song TFM (237k rows, one word vector per song) → group song vectors by artist → Artist TFM (22k unique artists, one word vector per artist) → stopword removal → Artist TF-IDF Matrix
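The aggregation and weighting steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the slides do not say which TF-IDF variant was used, so the normalized term frequency and plain log inverse document frequency here are assumptions, as are the function name `artist_tfidf` and the toy data.

```python
import math
from collections import Counter, defaultdict

def artist_tfidf(song_counts, song_artist, stopwords=frozenset()):
    """Sum per-song word counts by artist, then apply TF-IDF weights.

    song_counts: {song_id: {word: count}}   bag-of-words per song
    song_artist: {song_id: artist_id}
    Returns {artist_id: {word: weight}}.
    """
    # One term-frequency vector per artist: add up that artist's song vectors.
    artist_tf = defaultdict(Counter)
    for song_id, counts in song_counts.items():
        for word, n in counts.items():
            if word not in stopwords:
                artist_tf[song_artist[song_id]][word] += n

    # Document frequency: how many artists use each word at least once.
    df = Counter()
    for counts in artist_tf.values():
        df.update(counts.keys())

    n_artists = len(artist_tf)
    weights = {}
    for artist, counts in artist_tf.items():
        total = sum(counts.values())
        # Assumed variant: (count / total words) * log(N / df).
        weights[artist] = {
            word: (n / total) * math.log(n_artists / df[word])
            for word, n in counts.items()
        }
    return weights

# Toy data (hypothetical, for illustration only).
songs = {
    "s1": {"love": 2, "the": 5},
    "s2": {"love": 1, "street": 3},
    "s3": {"street": 1, "money": 4},
}
song_artist = {"s1": "A", "s2": "A", "s3": "B"}
weights = artist_tfidf(songs, song_artist, stopwords={"the"})
```

In this toy case "street" is used by both artists, so its weight is zero, while "love" and "money" each characterize a single artist.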
Methodology: Pairwise Artist Similarity Matrix
● An n×n matrix computed from the artist TFM
● Measures cosine similarity between artists Ai and Aj

        A1     A2     ...    An
A1      1      .8     ...    .2
A2      .32    1      ...    .4
...     ...    ...    ...    .6
An      .22    .6     ...    1
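For reference, cosine similarity between two artist vectors is cos(Ai, Aj) = (Ai · Aj) / (‖Ai‖ ‖Aj‖). A minimal sketch over sparse word-weight dictionaries follows; the function names and the toy vectors are hypothetical, and a real run over ~22k artists would more likely use a sparse matrix library.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    # Iterate over the shorter vector when computing the dot product.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise similarities as a nested dict: sim[a][b]."""
    artists = sorted(vectors)
    return {
        a: {b: cosine_similarity(vectors[a], vectors[b]) for b in artists}
        for a in artists
    }

# Toy vectors (hypothetical, for illustration only).
vectors = {
    "A1": {"love": 1.0, "night": 1.0},
    "A2": {"love": 1.0},
    "A3": {"street": 2.0},
}
sim = similarity_matrix(vectors)
```

Because cosine similarity is symmetric, sim[a][b] and sim[b][a] agree, and the diagonal is 1 for any non-zero vector.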
Methodology: Artist Graph Construction
1. Built a KNN graph based on the cosine similarity matrix
   a. Vector: cosine similarities to all other artists
   b. Each artist connected to its top K most similar artists
   c. Chose K=10 without much justification
2. Each artist node has:
   a. Outdegree (outgoing edges) of K
   b. Unknown indegree (incoming edges)
[Figure: example artist graph with nodes Dr. Dre, 50 Cent, Eminem, Metallica, Carlos Santana, Linkin Park, Los Lonely Boys, Shakira]
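The construction above can be sketched as a directed graph in which every artist points to its K nearest neighbors. This is an illustrative sketch only: the function names, the alphabetical tie-breaking rule, and the toy similarity values are assumptions, not details from the paper.

```python
def knn_graph(sim, k=10):
    """Directed KNN graph: each artist points to its k most similar other artists.

    sim: {artist: {other_artist: similarity}}
    Returns {artist: [neighbor, ...]} ordered from most to least similar.
    """
    graph = {}
    for artist, row in sim.items():
        candidates = [(s, other) for other, s in row.items() if other != artist]
        # Descending similarity; ties broken alphabetically (an assumption).
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[artist] = [other for _, other in candidates[:k]]
    return graph

def in_degrees(graph):
    """Count incoming edges per node; out-degree is fixed at k by construction."""
    counts = {artist: 0 for artist in graph}
    for neighbors in graph.values():
        for neighbor in neighbors:
            counts[neighbor] = counts.get(neighbor, 0) + 1
    return counts

# Toy similarity values (hypothetical, for illustration only).
toy_sim = {
    "A": {"A": 1.0, "B": 0.8, "C": 0.7, "D": 0.1},
    "B": {"A": 0.8, "B": 1.0, "C": 0.6, "D": 0.2},
    "C": {"A": 0.7, "B": 0.6, "C": 1.0, "D": 0.1},
    "D": {"A": 0.1, "B": 0.2, "C": 0.1, "D": 1.0},
}
toy_graph = knn_graph(toy_sim, k=2)
toy_in = in_degrees(toy_graph)
```

Every node has out-degree 2 here, but node D receives no incoming edges at all, which is the asymmetry the slide points out: out-degree is fixed while in-degree is not.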
Methodology: Subgraph Analysis
● 23 categories and 3 types
● Each artist has a set of tags (latin, spanish, etc.)
● Filter graph by category to only include artists with tags from that category
● Analyze # of incoming and outgoing edges to subgraph
[Table columns: Category, Type, Unique Tags]
[Figure: example artist graph with nodes Dr. Dre, 50 Cent, Eminem, Metallica, Carlos Santana, Linkin Park, Los Lonely Boys, Shakira]
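One way to picture the subgraph filtering and edge counting is the sketch below. The function name, the returned field names, and the toy graph and tags are all hypothetical; the slides do not specify how the paper tallied edges beyond "incoming and outgoing".

```python
def subgraph_edge_counts(graph, artist_tags, category_tags):
    """Classify directed edges relative to a category subgraph.

    graph: {artist: [neighbor, ...]}
    artist_tags: {artist: set of tags}
    category_tags: set of tags that define the category
    """
    # An artist belongs to the category if it carries at least one of its tags.
    members = {a for a, tags in artist_tags.items() if tags & category_tags}
    staying = leaving = entering = 0
    for src, neighbors in graph.items():
        for dst in neighbors:
            if src in members and dst in members:
                staying += 1      # recommendation stays within the category
            elif src in members:
                leaving += 1      # recommendation points out of the category
            elif dst in members:
                entering += 1     # outside artist recommends a member
    return {
        "members": len(members),
        "staying": staying,
        "leaving": leaving,
        "entering": entering,
    }

# Toy graph and tags (hypothetical, for illustration only).
tag_graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "D"], "D": ["C", "B"]}
tags = {"A": {"latin"}, "B": {"spanish"}, "C": {"rock"}, "D": {"metal"}}
counts = subgraph_edge_counts(tag_graph, tags, {"latin", "spanish"})
```

A category whose "staying" count is high relative to "leaving" is one where recommendations tend to remain within the category.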
Evaluation Approach
1. Compared network topologies of the lyric and collaborative-filtering graphs using subgraph analysis
   a. Each connection = a recommendation
   b. Measured how often recommendations stay within genre
      i. Compared # of edges 'leaving' subgraph to # of edges 'staying' in subgraph
2. For each artist, calculated the top 1k most similar artists from both graphs
   a. Calculated difference between lists
   b. Used Rank Biased Overlap (RBO)
3. Measured lyrical graph utility by comparing recs. to randomly generated recs.
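Rank Biased Overlap scores two ranked lists by their agreement at every depth d, weighting early ranks more heavily: RBO = (1 - p) Σ p^(d-1) · |A[:d] ∩ B[:d]| / d. The sketch below implements the truncated (non-extrapolated) form over the shared prefix of two lists. The persistence parameter p=0.9 is an assumption, since the slides do not state the value used, and the function name is hypothetical.

```python
def rbo_truncated(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap of two rankings without duplicate items.

    Evaluates agreement down to the length of the shorter list.
    p in (0, 1) controls how strongly the top ranks are weighted.
    """
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    overlap = 0          # size of the intersection of the two prefixes
    score = 0.0
    for d in range(1, depth + 1):
        a, b = list_a[d - 1], list_b[d - 1]
        if a == b:
            overlap += 1
        else:
            if a in seen_b:
                overlap += 1
            if b in seen_a:
                overlap += 1
        seen_a.add(a)
        seen_b.add(b)
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

# Toy rankings (hypothetical, for illustration only).
swapped = rbo_truncated(["a", "b", "c"], ["b", "a", "c"], p=0.5)
identical = rbo_truncated(["a", "b", "c"], ["a", "b", "c"], p=0.5)
disjoint = rbo_truncated(["a", "b", "c"], ["x", "y", "z"], p=0.5)
```

Note that the truncated form does not reach 1 even for identical finite lists (it gives 1 - p^depth), so absolute values depend on p and on list length; comparisons such as lyrical vs. random rankings are only meaningful under the same settings.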
Network Topology Comparison
● The lyrics network has a larger diameter but is more tightly clustered; users listen to a broad spectrum of categories, so the CF network clusters less
● Lesser connectivity between clusters in the lyrics network: niche lyrical content vs. pop genres

                  Network Diameter    Average Shortest Path    Clustering Coefficient
Lyrics Network    10                  4.52                     0.217
CF Network        6                   4.22                     0.119
Recommendation Performance against Random
● Lyrical network rankings agree with CF rankings about 12.5 times more closely than random rankings do (mean RBO 0.0649 vs. 0.0052)
● Advantageous to consider when determining initial recommendations to and from a new or emerging artist or song

Ranking Compared to CF    Mean RBO
Lyrical Ranking           0.0649
Random Ranking            0.0052
Improvements and Future Work
● Improvements
  ○ Did not justify chosen parameter values (e.g., the k-value)
  ○ Never explicitly explained how recommendations were made
  ○ Could incorporate more features other than TF-IDF
    ■ Sentiment analysis
    ■ Measures of repetition and word choice
    ■ Word embeddings for subtleties
  ○ Never mentioned how the random list was generated for RBO analysis
● Future Work
  ○ Can only extract so much information from lyrics
    ■ Using raw sound data could be more fruitful (like bpm)
  ○ Combine approaches to reap benefits of both
THANK YOU
Abhijith Mandya, Brendan Abraham
In-Degree Distribution
The in-degree distribution of the lyrics network is significantly more skewed than that of the collaborative filtering network: the top 10% of nodes receive 65.1% of the possible edges. In comparison, the top 10% of nodes in the collaborative filtering network receive only 22.6% of the possible edges.
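A concentration figure of this kind can be computed from an in-degree table as sketched below. This assumes "possible edges" means the share of all incoming edges in the graph, which the slide does not spell out; the function name and the toy degrees are hypothetical.

```python
def top_share(in_degree, fraction=0.10):
    """Share of all incoming edges received by the top `fraction` of nodes.

    in_degree: {node: number of incoming edges}
    """
    degrees = sorted(in_degree.values(), reverse=True)
    total = sum(degrees)
    if total == 0:
        return 0.0
    # Always include at least one node in the "top" group.
    n_top = max(1, int(len(degrees) * fraction))
    return sum(degrees[:n_top]) / total

# Toy in-degree tables with 10 nodes and 20 edges each (hypothetical).
skewed = {f"n{i}": d for i, d in enumerate([10, 2, 2, 1, 1, 1, 1, 1, 1, 0])}
uniform = {f"n{i}": 2 for i in range(10)}
skewed_share = top_share(skewed)
uniform_share = top_share(uniform)
```

In the skewed toy table a single hub collects half of all incoming edges, while in the uniform table the top 10% of nodes receive exactly 10%, which is the baseline against which figures like 65.1% and 22.6% can be read.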