Clova Music: An AI Assistant Like a Smart DJ. 김정명 (Adrian Kim), M.S., Clova AI Research (CLAIR), Naver Corp.
Clova: Cloud-based Virtual Assistant, a general-purpose AI platform
Clova: Cloud-based Virtual Assistant https://clova.ai
Clova Music • The biggest need on a smart speaker is MUSIC. Does that make the speaker a music listening platform?
Clova Music • Intelligent music recommendation service of Clova • Aims to be a human DJ-like curator • Powered with NAVER/LINE music user/content data
Contents • Part 1 Short Tutorial on Music modeling - What kind of data do we use? - What kind of models can we use? - What kind of problems can we solve? - Any industry research? • Part 2 Music Research in Clova - Recommendation Systems - Representation learning - Emotion recognition - Highlight extraction - Automatic DJ list generation
Introducing the Music Domain
Popular Domains...
Audio domain data +
Audio Domain Data: Wave • Basic data form is 16-bit integer; you can normalize to [-1, 1] • A 1D vector of samples at 16kHz, 22050Hz, ... • At 16kHz, 30 seconds = 480k datapoints! • Very information-inefficient
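A minimal sketch of the raw-wave representation described above, using synthetic 16-bit PCM samples in place of a real recording:

```python
import numpy as np

# 30 seconds of 16 kHz audio as raw 16-bit PCM samples (synthetic stand-in).
sr = 16000                      # sample rate (Hz)
duration = 30                   # seconds
pcm = np.random.randint(-2**15, 2**15, size=sr * duration, dtype=np.int16)

# Normalize 16-bit integers into the [-1, 1] float range.
wave = pcm.astype(np.float32) / 2**15

print(wave.shape)               # (480000,): 480k datapoints for just 30 seconds
```

Even at the relatively low 16 kHz rate, half a million values encode one short clip, which is the information inefficiency the slide points out.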
Audio Domain Data: Spectrograms • Expressive; carries more information!
Audio Domain Data: Mel-spectrograms • Reduce >1k frequency bins with mel filter banks down to 80, 96, or 128 mel bins
Audio Domain Data: Mel-spectrograms • Mel-spectrogram filter distributions give relative focus to lower frequency bins (image from Choi et al. '16)
Audio Domain Data: Transformation between data types
• wav (1323000,) → STFT → spectrogram (1025, 2584) = 2,648,600 values → mel filter bank → mel-spectrogram (128, 2584) = 330,752 values
• Going back is lossy:
- If complex spectrogram: inverse STFT
- If magnitude only: Griffin-Lim algorithm
- From mel-spectrogram: WaveNet vocoder (Shen et al. '17)
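A numpy-only sketch of the wav → spectrogram → mel-spectrogram pipeline with shapes close to those on the slide. In practice `librosa.stft` and `librosa.filters.mel` do this; the frame count here differs slightly from the slide's 2584 because librosa center-pads the signal:

```python
import numpy as np

# 30 s at 44.1 kHz with n_fft=2048, hop=512 roughly reproduces the slide's shapes.
sr, n_fft, hop, n_mels = 44100, 2048, 512, 128
wave = np.random.uniform(-1, 1, sr * 30).astype(np.float32)   # (1323000,)

# STFT magnitude: frame the signal, window each frame, take the real FFT.
n_frames = 1 + (len(wave) - n_fft) // hop
frames = np.stack([wave[i * hop:i * hop + n_fft] for i in range(n_frames)])
spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).T  # (1025, n_frames)

# Mel filter bank: triangular filters spaced evenly on the mel scale.
def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)

mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
fb = np.zeros((n_mels, n_fft // 2 + 1))
for i in range(n_mels):
    l, c, r = bins[i], bins[i + 1], bins[i + 2]
    fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
    fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

mel_spec = fb @ spec    # (128, n_frames): ~8x smaller, but not invertible
print(spec.shape, mel_spec.shape)
```

Because `mel_spec` discards both phase and fine frequency resolution, reconstructing audio from it needs a learned model such as a WaveNet vocoder, whereas the magnitude `spec` can be approximately inverted with Griffin-Lim.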
Issues with Audio Data
• Too large: storage and memory problems
• Low efficiency: information per data point is very small
• Not much open data; what exists is often low quality and weakly labeled (Choi et al. 2017), with dirty labels
• Takes a lot of time to produce high-quality data
• Convoluted: multiple sources mixed together
• Must hear it to evaluate
Issues: Comparing Simple Tasks (MNIST vs. GTZAN)
• Storage: 45MB vs. 1.2GB
• Data pairs: 60,000 vs. 1,000 (30-second clips)
• Classes: 10 digits vs. 10 genres (100 each)
• Preprocessing: fast vs. slow
• Testing: easy vs. hard
Issues: Comparing Speech and Music
• Speech (e.g., news audio): short, single source
• Music (e.g., Bad Boy – Red Velvet): long, multiple sources
Example Baselines
What kind of problems can we solve?
• Genre/Artist Classification
• Automatic Tagging
• Music generation
• Style transfer
• Source separation
• Onset detection
• Sound embedding
• Beat tracking
• and more...!
(Diagram: convolution & pooling layers, channel summation, stacked LSTM layers with softmax attention, and an attention-weighted LSTM output via element-wise multiplication)
Autotagging with Convnets • Input: mel-spectrogram (MSD dataset) • Output: tags (top 50 tags) • Stacked 2D convs https://github.com/keunwoochoi/music-auto_tagging-keras Automatic tagging using deep convolutional neural networks, Choi et al., ISMIR '16
Note: Filter design in CNNs
• 2D convs: n×m filters on 1 channel; slower training; assumes local structure in frequency
• 1D convs: n×1 filters with m channels (frequency bins treated as channels); faster training; treats frequencies as discrete
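The trade-off above can be made concrete by counting weights; the filter counts and sizes below are illustrative assumptions, not values from the slides:

```python
# Input: mel-spectrogram with 128 frequency bins over T time frames.
n_mels = 128
n_filters = 32

# 2D convs: small n x m filters over a 1-channel "image". The filter slides
# in both time AND frequency, so weights are shared across frequency,
# assuming local structure along the frequency axis.
n, m = 3, 3
params_2d = n_filters * (1 * n * m)          # 32 filters * 9 weights = 288

# 1D convs: n x 1 filters over time, with all 128 frequency bins as input
# channels. No weight sharing across frequency, since mel bins are discrete
# and patterns at 100 Hz need not look like patterns at 8 kHz.
params_1d = n_filters * (n_mels * n * 1)     # 32 * 384 weights = 12288

print(params_2d, params_1d)
```

The 1D design uses far more parameters per layer but trains fast (convolution only along time) and avoids the possibly wrong translation-invariance assumption over frequency.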
Automatic Music Transcription with Deep Complex Networks • Input: complex spectrogram • Network components (batchnorm, initialization, activations, convolution) are changed to match the complex domain • Real baseline: real and imaginary values as separate channels; complex: as suggested in the paper. Deep Complex Networks, Trabelsi et al., to appear at ICLR '18
WaveNet for TTS • Input: wav format data Image from https://kakalabblog.wordpress.com/2017/07/18/wavenetnsynth-deep-audio-generative-models/ WaveNet: A Generative Model for Raw Audio, Oord et al., https://arxiv.org/pdf/1609.03499.pdf
Industries focusing on Music Research and more!
NSynth: Encoding sounds with a WaveNet Autoencoder • WaveNet-based model made by Magenta to produce a neural synthesizer • Latent embeddings (z) from various sounds can be used to produce new sounds • New dataset with instrument, pitch, etc. tags on individual sounds https://magenta.tensorflow.org/nsynth
Performance RNN • Trained on the Yamaha e-Piano Competition dataset • MIDI of 1,400+ piano performances • Magenta used LSTMs to predict among 388 events occurring along the timeline • Generated example: https://magenta.tensorflow.org/performance-rnn
Discover Weekly • Spotify’s weekly personalized recommendation service • Collaborative Filtering • NLP modeling • Audio modeling http://benanne.github.io/2014/08/05/spotify-cnns.html#contentbased http://blog.galvanize.com/spotify-discover-weekly-data-science/
Any questions? • On to Part 2...
Clova Music Recommendation System
Recommendation in Clova Music • User logs as the main data; a hybrid with content data is possible • Large and sparse online data • Topics: • User log analysis • Music semantic embedding learning • Collaborative filtering with matrix factorization
* Reported Oct. 2017. Top queries with Music
• 노래 틀어줘 (Play a song)
• 자장가 틀어줘 (Play a lullaby)
• 동요 틀어줘 (Play children's songs)
• 신나는 노래 틀어줘 (Play an upbeat song)
• 조용한 노래 틀어줘 (Play a quiet song)
• 핑크퐁 노래 틀어줘 (Play Pinkfong songs)
• 아이유 노래 틀어줘 (Play IU songs)
• 클래식 틀어줘 (Play classical music)
• 분위기 좋은 음악 틀어줘 (Play music with a good vibe)
• 잔잔한 음악 틀어줘 (Play calm music)
• 발라드 틀어줘 (Play ballads)
Query frequency: Artists > Tracks; Genre, mood, themes > Artists; JUST PLAY > Genres
* Reported Oct. 2017. Device Usage Patterns (bar chart: share of music plays by device: NAVER_APP, NAVER_PC, WAVE, CLOVA_APP)
Device Usage Patterns (chart: genre distribution of plays, NAVER_APP vs. WAVE: 가요 (K-pop), 기능성음악 (functional music), 팝 (pop), 동요 (children's songs), OST, 클래식 (classical), 재즈 (jazz), 종교음악 (religious), 일렉트로… (electro…), 락 (rock), 힙합 (hip-hop), 기타 (others))
Device Usage Patterns (chart: playing ratio per artist) • Long-tail distribution over artists • The distribution itself is not so different across devices...
Device Usage Patterns (table: top artists by playing ratio, WAVE vs. NAVER MUSIC APP). The WAVE speaker list is dominated by kids and sleep content (핑크퐁 (Pinkfong), 동요 (children's songs), 트니트니, 오르골뮤직, 힐링피아노, 자장가 (lullabies)) alongside 아이유 (IU), 뉴이스트 (NU`EST), and 윤종신; the NAVER MUSIC APP list is led by idol and pop acts (EXO, 젝스키스, 방탄소년단, Wanna One, 볼빨간사춘기, 헤이즈, 선미, WINNER, 성시경)
Implication • Paradigm shift in music consumption on AI speaker devices • New markets: kids and new parents; lean-out and lounge music; classical and jazz • Music recommendation plays an important role on AI assistant platforms
Recommendation Challenges
• Lack of well-defined metadata → Musical Semantic Embedding
• Personalized Playlists → Multimodal Semantic Embedding
Semantic Embedding: lack of well-defined metadata → Music Semantic Embedding
• Mapping tracks, artists, and words to the same embedding space (Word2Vec)
• Feature learning
• Usages: item similarities; embeddings used as features
(Diagram: embedding space with keywords such as 가을 "autumn" and 신나는 "upbeat")
Semantic Embedding: Word2Vec with tagged playlists • JAMM playlists: user-created playlists in Naver Music, about 72,000 in total • Keywords from tags, artists from tracks • Treat track IDs as "words" within a playlist
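The playlist-to-"sentence" step above can be sketched as follows; the playlist contents and ID format are hypothetical, since the actual JAMM data layout is not shown in the slides:

```python
# Each playlist mixes its tag keywords with its track IDs, so tags, tracks
# (and, analogously, artists) all land in one shared embedding space once
# a skip-gram model is trained over these "sentences".
playlists = [
    {"tags": ["가을", "잔잔한"], "tracks": ["track:1042", "track:2210"]},
    {"tags": ["신나는"],         "tracks": ["track:2210", "track:3301"]},
]

sentences = [p["tags"] + p["tracks"] for p in playlists]

# These sentences would then be fed to Word2Vec, e.g. with gensim:
#   gensim.models.Word2Vec(sentences, vector_size=128, min_count=1)
print(sentences[0])
```

After training, nearest neighbors of a tag vector (e.g. 가을) are tracks that co-occur with it in playlists, which gives the item similarities and features mentioned on the previous slide.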
Semantic Embedding: that song in the charts • 벚꽃엔딩 (Cherry Blossom Ending) / 버스커버스커 (Busker Busker)
Semantic Embedding: Personalized Playlists → Multimodal Semantic Embedding
• We want to model different playlists for different personalities
• Example: the query 밤편지 maps to multiple senses, <밤편지_1> and <밤편지_2>, for different users
Semantic Embedding: Personalized Playlists, embedding with session data • User playing sequences as documents! • We use multimodal word distributions formed from Gaussian distributions (Ben Athiwaratkun and Andrew Gordon Wilson, Multimodal Word Distributions, 2017)
Collaborative Filtering Most popular method: Matrix Factorization
Collaborative Filtering: Matrix Factorization for Personalized Recommendation
• Basic MF objective
• Select tracks and artists the user prefers when generating a playlist
• Simple, but hard to apply:
- Sparsity
- Overfitting / underfitting
- Hard to evaluate (needs real feedback, not RMSE!)
- Combining with other models
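A standard form of the basic MF objective mentioned above (the slides do not spell out the exact variant used) is the regularized squared error over observed user-item interactions:

```latex
\min_{P, Q} \sum_{(u,i) \in \mathcal{O}}
    \left( r_{ui} - \mathbf{p}_u^{\top} \mathbf{q}_i \right)^2
    + \lambda \left( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2 \right)
```

Here \(r_{ui}\) is the observed preference (e.g. play count) of user \(u\) for item \(i\), \(\mathbf{p}_u\) and \(\mathbf{q}_i\) are the learned latent factors, \(\mathcal{O}\) is the set of observed pairs, and \(\lambda\) controls the regularization that fights the overfitting noted above.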
Collaborative Filtering: What can we do?
• Learn in two phases: long term via batch learning, short term via online learning
• Negative sampling: consider the item distribution when sampling negatives
• Remove abusive users: over-clicking users, Top-100-only users
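A minimal sketch of negative sampling that respects the item distribution, as suggested above. The play counts, the word2vec-style 0.75 smoothing exponent, and the item IDs are illustrative assumptions, not details from the slides:

```python
import numpy as np

# Sample negatives proportional to play_count^0.75, so popular tracks are
# preferred as negatives (they are more informative) but the long tail is
# not entirely ignored. Skip items the user actually played.
rng = np.random.default_rng(0)
play_counts = np.array([1000, 500, 50, 5, 1], dtype=np.float64)  # per item
user_positives = {0, 1}            # items this user interacted with

probs = play_counts ** 0.75
probs /= probs.sum()

def sample_negatives(k):
    out = []
    while len(out) < k:
        item = int(rng.choice(len(play_counts), p=probs))
        if item not in user_positives:   # keep only true negatives
            out.append(item)
    return out

print(sample_negatives(3))
```

Sampling negatives uniformly would mostly pick never-played tail items, which the model already scores low; popularity-weighted negatives force it to explain why a popular track was *not* played by this user.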
Remaining Challenges • Conventional problems • Sparsity • Top 100 songs • Cold-start problems • Explanatory recommendation • Music Recommendation for AI Speakers • Interaction • Lean-in / Lean-back • Personalizing level (Familiar vs New)
Music Modeling
Music Modeling • Audio data as main data • Topics: • Representation Vector Extraction (Park et al. 17) • Music Emotion Recognition (Jeon et al. 17) • Music Highlight Extraction (Ha et al. 17) • Automatic DJ mix Generation (Kim et al. 17)