NUS Sung and Spoken Lyrics Corpus: A quantitative comparison of sung and spoken lyrics
Zhiyan Duan, Haotian Fang, Bo Li, Khe Chai Sim, Ye Wang
Outline
❖ Motivation
❖ Dataset Description
❖ Duration Analysis
❖ Spectral Analysis
❖ Conclusion
❖ Future Work
Motivation
❖ Understanding the characteristics of singing voice
❖ Benefiting a wide range of research problems
❖ Lack of a comprehensive dataset with phoneme-level annotation
Dataset
❖ Diversity: in gender, accent, tempo, etc.
❖ Size: number of songs, subjects
❖ Balance the two
(Image by Digitalnative)
Song Selection
❖ Phonetic richness: to get the most out of selected songs
❖ Phonetic balance: to minimize bias
❖ Tempo balance: to cover songs with different tempos
❖ Popularity: easier to recruit subjects
❖ Ease of learning: easier for subjects to learn
Song Selection
❖ Songs: 20
❖ Estimated phoneme count: 140 to 980 per song
❖ Tempo: 68 to 150 bpm
Subjects
❖ 6 males, 6 females
❖ All levels of vocal experience: amateur to 10+ years of vocal training
❖ All common voice types: soprano, alto, tenor, baritone and bass
Subjects - Accents
[Figure: Number of subjects with different accents, in singing and in speech. Accent categories: North American, Mild Malay, Malay, Mild Singaporean, Singaporean, North Chinese.]
Recording
❖ Sound-proof recording studio
❖ 44.1 kHz, 16-bit
❖ Pro Tools 9
❖ Metronome with downbeat accent (through earphone)
❖ Lyrics printouts on music stand
Annotation
❖ Phoneme set: CMU Dictionary *
❖ Annotators: with musical & phonetic backgrounds
❖ Software: Audacity
* http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Annotation
❖ Annotated sung tracks: 48
❖ Subjects: 12 (6 male, 6 female), 4 tracks per subject
❖ Total length: 169 mins
❖ Phoneme count: 25,474
❖ Spoken data: alignment of labels from sung data
Duration Analysis
❖ Focus on consonants
❖ Stretching in time and subject variations
❖ Proportion in syllable and position effects
❖ Comparison among different types of consonants
Phoneme Classes
Class        CMU Phonemes
Vowels       AA, AE, AH, AO, AW, AY, EH, ER, EY, IH, IY, OW, OY, UH, UW
Semivowels   W, Y
Stops        B, D, G, K, P, T
Affricates   CH, JH
Fricatives   DH, F, S, SH, TH, V, Z, ZH
Aspirates    HH
Liquids      L, R
Nasals       M, N, NG
Consonant Stretching
❖ Intuitively, vowels can be stretched arbitrarily.
❖ Consonants are supposed to be less stretchable. Is that so?
Consonant Stretching
[Figure: Time (s) for vowels and consonants, speech vs. singing.]
Consonant Stretching
Stretching Ratio = Singing Duration / Speech Duration
[Figure: Average stretching ratio comparison of different types of consonants (semivowels, stops, affricates, fricatives, aspirates, liquids, nasals), for male, female and overall.]
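The stretching ratio defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `average_stretching_ratio`, the input layout (phoneme label, sung duration, spoken duration) and the choice to average per-phoneme ratios rather than divide average durations are all assumptions. Only the consonant class mapping is taken from the Phoneme Classes slide, and the durations are made up.

```python
from collections import defaultdict

# Consonant classes as listed on the Phoneme Classes slide.
CONSONANT_CLASSES = {
    "Semivowels": ["W", "Y"],
    "Stops": ["B", "D", "G", "K", "P", "T"],
    "Affricates": ["CH", "JH"],
    "Fricatives": ["DH", "F", "S", "SH", "TH", "V", "Z", "ZH"],
    "Aspirates": ["HH"],
    "Liquids": ["L", "R"],
    "Nasals": ["M", "N", "NG"],
}
PHONEME_TO_CLASS = {p: c for c, ps in CONSONANT_CLASSES.items() for p in ps}


def average_stretching_ratio(aligned):
    """Average sung/spoken duration ratio per consonant class.

    `aligned` is an iterable of (phoneme, sung_seconds, spoken_seconds)
    tuples; this layout is an assumption made for the sketch.
    """
    ratios = defaultdict(list)
    for phoneme, sung_s, spoken_s in aligned:
        cls = PHONEME_TO_CLASS.get(phoneme)
        if cls is None or spoken_s <= 0:
            # Skip vowels, unknown labels and zero-length spoken segments.
            continue
        ratios[cls].append(sung_s / spoken_s)
    return {cls: sum(v) / len(v) for cls, v in ratios.items()}


# Hypothetical durations in seconds, for illustration only.
example = [("S", 0.5, 0.25), ("Z", 0.25, 0.25), ("M", 0.75, 0.5), ("AA", 2.0, 0.25)]
result = average_stretching_ratio(example)
```

A ratio above 1 means the consonant class is longer when sung than when spoken. The vowel `AA` is ignored here because the slide restricts the analysis to consonants.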
Consonant Stretching - Subject Variations
[Figure: Comparison of the probability density functions of the consonant duration stretching ratio with respect to gender.]
Consonant Stretching - Subject Variations
Subject      Gender   Accent             Musical Exposure
Subject 05   Female   Malay              2 years of choral experience
Subject 08   Male     Northern Chinese   no vocal training
Consonant Stretching - Subject Variations
[Figure: Comparison of the consonant duration stretching ratio of subjects 05 and 08.]
Consonant Proportion
[Figure: Consonant proportion in syllable (%) for different types of consonants (semivowels, stops, affricates, fricatives, aspirates, liquids, nasals), for male, female and overall.]
Consonant Proportion
❖ Syllabic proportions of consonants are higher for males
❖ Absolute durations of both consonants and syllables are longer for males
Consonant Proportion - Position Effect
Type         Description                                             Example
Starting     At the beginning of a word                              /g/ in go
Preceding    Preceding a vowel, but not at the beginning of a word   /m/ in small
Succeeding   Succeeding a vowel, but not at the end of a word        /l/ in angel
Ending       At the end of a word                                    /t/ in at
Consonant Proportion - Position Effect
[Figure: The effect of position (starting, preceding, succeeding, ending) on consonant proportion in syllable (%), for semivowels, stops, affricates, fricatives, aspirates, liquids and nasals.]
Spectral Analysis
❖ Likelihood score comparison of sung and spoken phonemes
❖ Discrepancies between the effects of duration & pitch on MFCC features
Likelihood Score Comparison
❖ Using a GMM-HMM system trained on the WSJ0 corpus
❖ Perform alignment on both speech and singing data
❖ Phoneme boundaries are fixed for sung tracks
Likelihood Score Comparison
[Diagram: The spoken phoneme and the sung phoneme are each scored by the GMM-HMM system, and the two scores are subtracted.]
Likelihood Score Comparison
Average likelihood difference = |Average likelihood score (sung) - Average likelihood score (spoken)|
[Figure: Average likelihood difference comparison of different types of phonemes (vowels, semivowels, stops, affricates, fricatives, aspirates, liquids, nasals), for male, female and overall.]
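As a small worked example of the formula above, the following Python sketch computes the average likelihood difference for one phoneme class. The function name and the score values are hypothetical; in the study the scores would come from the GMM-HMM system, which is not reproduced here.

```python
from statistics import mean


def average_likelihood_difference(sung_scores, spoken_scores):
    """Absolute difference between the mean sung and mean spoken scores."""
    return abs(mean(sung_scores) - mean(spoken_scores))


# Hypothetical per-phoneme likelihood scores, for illustration only.
sung = [-80.0, -100.0]    # mean is -90.0
spoken = [-60.0, -70.0]   # mean is -65.0
difference = average_likelihood_difference(sung, spoken)  # 25.0
```

Because of the absolute value, the measure only says how far apart the sung and spoken scores are, not which of the two the model fits better.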
Effects of Duration & Pitch on Acoustic Features
❖ Discretize phoneme duration/pitch into 10 bins
❖ Ensure bins have balanced cumulative density mass
❖ Cluster using a decision tree
❖ A lower reduction rate indicates a larger impact on low-level acoustic features (i.e., MFCCs)
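The first two steps above (10 bins holding balanced mass) can be approximated with rank-based, equal-frequency binning. The sketch below is one possible reading of that step, not the authors' procedure: the function name `equal_mass_bins` and the sample values are hypothetical, and the decision-tree clustering is not shown.

```python
def equal_mass_bins(values, n_bins=10):
    """Assign each value a bin index in 0..n_bins-1 so that every bin
    receives (almost) the same number of samples.

    Values are ranked in ascending order and the ranks are cut into
    n_bins equal slices. Tied values may land in neighbouring bins.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n
    return bins


# Hypothetical phoneme durations in milliseconds: the integers 0..99
# in scrambled order, so each bin should receive exactly 10 samples.
durations_ms = [(k * 37) % 100 for k in range(100)]
bin_index = equal_mass_bins(durations_ms)
counts = [bin_index.count(b) for b in range(10)]
```

Equal-frequency bins avoid the near-empty bins that fixed-width bins would produce for skewed quantities such as phoneme duration.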
Effects of Duration & Pitch on Acoustic Features
[Figure: Model reduction rate for duration and pitch, sung vs. spoken.]
Conclusion
❖ Created the NUS-48E dataset of sung and spoken lyrics
❖ Conducted a comparative study of sung and spoken phonemes in both the time and frequency domains
Future Work
❖ Continue to annotate the remaining tracks (currently 80 out of 420 are annotated)
❖ Annotate the spoken data
❖ Repeat some previous work related to singing voice using the new dataset
❖ Further exploration based on current observations
Thank you!
Question & Answer