Speaker Movement Correlates with Prosodic Indicators of Engagement Rob Voigt, Robert J. Podesva, and Dan Jurafsky Linguistics Department, Stanford University
Links Between Acoustic and Visual Prosody • Gestural apices align with pitch accents Jannedy and Mendoza-Denton (2006) • Production of “visual beats” increases the prominence of the co-occurring speech Krahmer and Swerts (2007) • Speakers move their head and eyebrows more during prosodically focused words Cvejic et al. (2010)
Question 1: Is the Relationship Between Acoustic and Visual Prosody Continuous? • Previous research • Identified discrete relationships • Our proposal • Examine scalar relationships • Particularly between movement and affective measures of engagement Yu et al. (2004), Mairesse et al. (2007), Gravano et al. (2011), Oertel et al. (2011), MacFarland et al. (2013), etc.
Question 2: Methodological Barriers to Studying Visual Prosody • Prior studies generally employ • Time-intensive annotation schemes or • Expensive or invasive experimental hardware • Thus face limitations • Small amounts of data • Prohibitive expense
Our Solution: New Data Source Automatically extract visual and acoustic data from YouTube • Potentially huge amounts of data • Ecologically valid (“in the wild”) • Allows replicability
Our Solution: New Data Source • We chose “first day of school” video blogs (“vlogs”) • 14 videos, 95 minutes of data • Static backgrounds and stable cameras • Speakers are generally engaged and animated
Our Solution: Automatic Phrasal Units Approximate pause-bounded units (PBUs) • Our unit of prosodic analysis • Calculated with a simple iterative algorithm (sketched below) • Find silences (Praat) with a threshold of -30.0 dB; sounding portions are approximate PBUs • If average phrase length > 2 seconds, raise threshold by 3.0 dB and re-extract
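A minimal sketch of this iterative loop, assuming a hypothetical find_sounding_intervals() helper that wraps Praat's silence detection (the slide does not show the authors' actual implementation):

    def approximate_pbus(audio_path, threshold_db=-30.0, max_mean_len=2.0, step_db=3.0):
        """Iteratively raise the silence threshold until phrases are short enough."""
        while True:
            # find_sounding_intervals is a hypothetical wrapper around Praat's
            # silence detection; it returns a list of (start, end) times in seconds.
            phrases = find_sounding_intervals(audio_path, threshold_db)
            mean_len = sum(end - start for start, end in phrases) / len(phrases)
            if mean_len <= max_mean_len:
                return phrases
            threshold_db += step_db  # raise threshold and re-extract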
Our Solution: New Visual Feature Movement Amplitude • Assumes speaker is talking in front of a static background • Quantifies speaker movement as pixel-by-pixel difference between frames • Calculated in log space, z-scored per speaker
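A rough sketch of the per-video computation under these assumptions (grayscale frames from a fixed camera; the exact formula below is our reading of the slide, not the authors' released code):

    import numpy as np

    def movement_amplitude(frames):
        """Frame-to-frame pixel difference, log-scaled and z-scored.
        frames: list of 2D grayscale arrays from one speaker's video."""
        diffs = [np.abs(curr.astype(float) - prev.astype(float)).sum()
                 for prev, curr in zip(frames, frames[1:])]
        ma = np.log(np.array(diffs) + 1e-8)   # work in log space
        return (ma - ma.mean()) / ma.std()    # z-score per speaker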
Visualization: Continuous Measurements • Video at 30 FPS allows observations at 30 Hz
Visualization: Movement-Only Video • Coarse, but reasonable overall estimation
Acoustic Features • Following prior work on prosodic engagement • Pitch (fundamental frequency) and Intensity (loudness) • Eight features per phrase • max, min, mean, standard deviation (std) for both pitch and intensity
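A sketch of the per-phrase summaries, assuming the pitch and intensity contours for one pause-bounded unit have already been extracted as arrays (e.g., with Praat):

    import numpy as np

    def phrase_features(pitch, intensity):
        """Eight features per phrase: max, min, mean, std of pitch and intensity."""
        feats = {}
        for name, track in (("pitch", pitch), ("intensity", intensity)):
            track = track[~np.isnan(track)]   # drop unvoiced/undefined frames
            feats.update({f"{name}_max": track.max(), f"{name}_min": track.min(),
                          f"{name}_mean": track.mean(), f"{name}_std": track.std()})
        return feats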
Statistical Analysis • Movement amplitude measures (max, min, mean, std) are highly collinear • PCA for dimensionality reduction • Two components explain 96% of variance in MA
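A scikit-learn sketch of this step; X is an assumed feature matrix holding the four movement-amplitude summaries (max, min, mean, std) per phrase:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    components = pca.fit_transform(X)            # X: n_phrases x 4 MA summaries
    print(pca.explained_variance_ratio_.sum())   # ~0.96 per the slide
    overall_movement = components[:, 0]
    movement_variance = components[:, 1]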
Statistical Analysis • Series of linear regressions • Predicting acoustic variables from Overall Movement and Movement Variance • Controlling for speaker-specific variation by including speakers as random effects • Controlling for log(phrase length)
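One way to fit such a model in Python with statsmodels (a sketch, assuming a DataFrame df with one row per phrase and columns named as below; the slide does not specify the authors' modeling software):

    import statsmodels.formula.api as smf

    # Random intercept per speaker; log phrase length as a control.
    model = smf.mixedlm(
        "pitch_max ~ overall_movement + movement_variance + log_phrase_len",
        data=df, groups=df["speaker"])
    result = model.fit()
    print(result.summary())

The same formula would be refit with each of the eight acoustic features as the dependent variable.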
Experimental Pipeline • Download videos, extract frames and audio • Calculate approximate phrase units (PBUs) • Compute movement amplitude for each frame • Calculate MA principal components • Extract acoustic features • Run statistical models
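Putting the steps together, a high-level skeleton using the hypothetical helpers sketched above (download and frame/audio extraction, e.g. with ffmpeg, are assumed and not shown):

    for video in videos:                                      # hypothetical driver loop
        frames, audio_path = extract_frames_and_audio(video)  # assumed helper
        phrases = approximate_pbus(audio_path)
        ma = movement_amplitude(frames)
        # ...aggregate MA within each phrase, run PCA, extract acoustic
        # features, and fit the mixed-effects regressions as sketched above.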
Results (table legend: ** = p < 0.01, *** = p < 0.001, — = no significant relationship)
Results • During phrases with more Overall Movement, speakers use • higher and more variable pitch • louder and more variable intensity • Movement Variance was not predictive of any of our acoustic features
Visualization: Across Phrases • Notice light and dark vertical banding • Suggests sequence modeling as future work
Moving Forward • More advanced vision-based features • Face tracking • Gesture recognition • Expanding the data • Genre effects • Sociolinguistic variables • Movement in interaction
Discussion • Further empirical evidence for a rich link between acoustic and visual prosody • Adds a dimension of quantity (continuous association), in addition to the previously demonstrated temporal synchrony • Methodological contributions suggest new avenues for multi-modal analysis of prosody • Code and Corpus: nlp.stanford.edu/robvoigt/speechprosody