
Speaker Movement Correlates with Prosodic Indicators of Engagement

  1. Speaker Movement Correlates with Prosodic Indicators of Engagement Rob Voigt, Robert J. Podesva, and Dan Jurafsky (Linguistics Department, Stanford University)

  2. Links Between Acoustic and Visual Prosody • Gestural apices align with pitch accents Jannedy and Mendoza-Denton (2006) • Production of “visual beats” increases the prominence of the co-occurring speech Krahmer and Swerts (2007) • Speakers move their head and eyebrows more during prosodically focused words Cvejic et al. (2010)

  3. Question 1: Is the Relationship Between Acoustic and Visual Prosody Continuous? • Previous research • Identified discrete relationships • Our proposal • Examine scalar relationships • Particularly between movement and affective measures of engagement Yu et al. (2004), Mairesse et al. (2007), Gravano et al. (2011), Oertel et al. (2011), McFarland et al. (2013), etc.

  4. Question 2: Methodological Barriers to Studying Visual Prosody • Prior studies generally employ • Time-intensive annotation schemes or • Expensive or invasive experimental hardware • Thus face limitations • Small amounts of data • Prohibitive expense

  5. Our Solution: New Data Source • Automatically extract visual and acoustic data from YouTube • Potentially huge amounts of data • Ecologically valid (“in the wild”) • Allows replicability

  6. Our Solution: New Data Source • We chose “first day of school” video blogs (“vlogs”) • 14 videos, 95 minutes of data • Static backgrounds and stable cameras • Speakers are generally engaged and animated

  7. Our Solution: Automatic Phrasal Units • Approximate pause-bounded units (PBUs) • Our unit of prosodic analysis • Calculated with a simple iterative algorithm • Find silences (Praat) with a threshold of -30.0 dB; sounding portions are approximate PBUs • If average phrase length > 2 seconds, raise threshold by 3.0 dB and re-extract
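A minimal sketch of this iterative procedure is below, using the Parselmouth Python interface to Praat. The function name and the silence-detector settings other than the threshold (minimum pitch, minimum interval durations) are illustrative assumptions, not the paper's exact values.

```python
# Sketch: iterative pause-bounded unit (PBU) extraction via Praat/Parselmouth.
import parselmouth
from parselmouth.praat import call

def extract_pbus(wav_path, threshold_db=-30.0, max_mean_len=2.0, step_db=3.0):
    """Return (start, end) times in seconds of approximate PBUs."""
    sound = parselmouth.Sound(wav_path)
    while True:
        # Praat's silence detector: intervals quieter than threshold_db
        # (relative to the file's peak intensity) are labeled "silent".
        # 100 Hz minimum pitch and 0.1 s minimum durations are assumptions.
        tg = call(sound, "To TextGrid (silences)",
                  100, 0.0, threshold_db, 0.1, 0.1, "silent", "sounding")
        n_intervals = call(tg, "Get number of intervals", 1)
        pbus = []
        for i in range(1, n_intervals + 1):
            if call(tg, "Get label of interval", 1, i) == "sounding":
                start = call(tg, "Get starting point", 1, i)
                end = call(tg, "Get end point", 1, i)
                pbus.append((start, end))
        mean_len = sum(e - s for s, e in pbus) / max(len(pbus), 1)
        if mean_len <= max_mean_len:
            return pbus
        # Raising the (negative) threshold treats more audio as silence,
        # splitting long stretches into shorter phrases.
        threshold_db += step_db
```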

  8. Our Solution: New Visual Feature • Movement Amplitude • Assumes speaker is talking in front of a static background • Quantifies speaker movement as the pixel-by-pixel difference between consecutive frames • Calculated in log space, z-scored per speaker
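As a rough sketch, the feature could be computed with OpenCV and NumPy as below; summing the absolute frame difference and the +1.0 inside the log are assumptions, since the slide specifies only pixel differencing, log space, and per-speaker z-scoring.

```python
# Sketch: movement amplitude (MA) as log-summed frame differences, z-scored.
import cv2
import numpy as np

def movement_amplitude(video_path):
    """One MA value per frame transition (~30 per second for these vlogs)."""
    cap = cv2.VideoCapture(video_path)
    prev, amps = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diff = np.abs(gray - prev).sum()   # total pixel-by-pixel change
            amps.append(np.log(diff + 1.0))    # +1.0 guards against log(0)
        prev = gray
    cap.release()
    amps = np.asarray(amps)
    # z-score within the video, i.e. per speaker, to remove camera,
    # lighting, and framing offsets between vlogs
    return (amps - amps.mean()) / amps.std()
```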

  9. Visualization: Continuous Measurements • Video at 30 FPS allows observations at 30 Hz

  10. Visualization: Movement-Only Video • Coarse, but reasonable overall estimation

  11. Acoustic Features • Following prior work on prosodic engagement • Pitch (fundamental frequency) and Intensity (loudness) • Eight features per phrase • max, min, mean, standard deviation (std) for both pitch and intensity
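A sketch of this per-phrase extraction, again with Parselmouth; Praat's default pitch and intensity analysis settings are used here, which may not match the paper's.

```python
# Sketch: eight per-phrase acoustic features (pitch and intensity stats).
import numpy as np
import parselmouth

def phrase_features(sound, start, end):
    """max/min/mean/std of F0 (Hz) and intensity (dB) for one phrase."""
    part = sound.extract_part(from_time=start, to_time=end)
    f0 = part.to_pitch().selected_array['frequency']
    f0 = f0[f0 > 0]  # drop unvoiced frames (assumes some voicing per phrase)
    intensity = part.to_intensity().values.flatten()
    feats = {}
    for name, track in [("pitch", f0), ("intensity", intensity)]:
        feats[f"{name}_max"] = float(track.max())
        feats[f"{name}_min"] = float(track.min())
        feats[f"{name}_mean"] = float(track.mean())
        feats[f"{name}_std"] = float(track.std())
    return feats
```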

  12. Statistical Analysis • Movement amplitude measures (max, min, mean, std) are highly collinear • PCA for dimensionality reduction • Two components explain 96% of variance in MA
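A minimal sketch of this step with scikit-learn; the random matrix is a stand-in for the real per-phrase MA statistics.

```python
# Sketch: PCA over the four per-phrase MA statistics (max, min, mean, std).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ma_stats = rng.normal(size=(500, 4))  # stand-in for (n_phrases, 4) MA stats

pca = PCA(n_components=2)
components = pca.fit_transform(ma_stats)
# On the real data, two components explain ~96% of MA variance; the slides
# interpret them as Overall Movement and Movement Variance.
print(pca.explained_variance_ratio_)
```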

  13. Statistical Analysis • Series of linear regressions • Predicting acoustic variables from Overall Movement and Movement Variance • Controlling for speaker-specific variation by including speakers as random effects • Controlling for log(phrase length)
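One model in the series might look like the sketch below, using statsmodels' mixed linear model; the dataframe and its column names are hypothetical stand-ins.

```python
# Sketch: one mixed-effects regression from the series.
import statsmodels.formula.api as smf

def fit_engagement_model(df, outcome="pitch_mean"):
    """Predict an acoustic variable from the two MA components,
    controlling for log phrase length, with a per-speaker random
    intercept (speakers as random effects)."""
    model = smf.mixedlm(
        f"{outcome} ~ overall_movement + movement_variance + log_phrase_len",
        data=df,
        groups=df["speaker"],
    )
    return model.fit()
```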

  14. Experimental Pipeline • Download videos, extract frames and audio • Calculate approximate phrase units (PBUs) • Compute movement amplitude for each frame • Calculate MA principal components • Extract acoustic features • Run statistical models
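Tying the sketches above together, a hypothetical driver for one video might look like this; the hard-coded 30 fps alignment follows slide 9, and all function names are carried over from the earlier sketches.

```python
# Sketch: per-video pipeline producing one feature row per phrase.
import numpy as np
import parselmouth

def process_video(video_path, wav_path, speaker, fps=30):
    pbus = extract_pbus(wav_path)             # approximate phrase units
    ma = movement_amplitude(video_path)       # one value per frame transition
    sound = parselmouth.Sound(wav_path)
    rows = []
    for start, end in pbus:
        frames = ma[int(start * fps):int(end * fps)]  # align MA to the phrase
        if frames.size == 0:
            continue  # skip phrases shorter than one frame transition
        row = {"speaker": speaker,
               "log_phrase_len": np.log(end - start),
               "ma_max": float(frames.max()), "ma_min": float(frames.min()),
               "ma_mean": float(frames.mean()), "ma_std": float(frames.std())}
        row.update(phrase_features(sound, start, end))
        rows.append(row)
    return rows  # feed into the PCA and regression steps above
```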

  15. Results • (Results table shown as an image on the slide.) Legend: ** is p < 0.01, *** is p < 0.001, — is no significant relationship

  16. Results • During phrases with more Overall Movement, speakers use • higher and more variable pitch • louder and more variable intensity • Movement Variance was not predictive of any of our acoustic features

  17. Visualization: Across Phrases • Notice light and dark vertical banding • Suggests sequence modeling as future work

  18. Moving Forward • More advanced vision-based features • Face tracking • Gesture recognition • Expanding the data • Genre effects • Sociolinguistic variables • Movement in interaction

  19. Discussion • Further empirical evidence for a rich link between acoustic and visual prosody • Adds dimension of quantity / continuous association, in addition to previously demonstrated temporal synchrony • Methodological contributions suggest new avenues for multi-modal analysis of prosody • Code and Corpus: nlp.stanford.edu/robvoigt/speechprosody
