First experiments in audio/video features for phoneme recognition
Petr Motlíček, FIT VUT Brno, motlicek@fit.vutbr.cz
M4 meeting in Prague, January 22nd - 23rd 2004
Introduction
• Data: M4 - IDIAP, 41 min. of audio-video data (training, testing).
• Labels: 47 phoneme categories, obtained by forced alignment (models trained on ICSI data, adapted on M4 data).
• Audio: beam-formed recordings, 16kHz.
• Video: cut-out head regions.
Audio preprocessing
• Fs = 16kHz, frame rate 100Hz, 20ms frames of MFB log energies.
Video preprocessing
• Frame rate 25Hz, RGB frames of 70x70 pixels.
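The acoustic front end above (mel filter bank log energies, 20ms frames at a 100Hz frame rate) can be sketched as follows. This is a minimal numpy sketch under stated assumptions: the slides do not give the window type, FFT size, or filter count, so a Hamming window, a 512-point FFT, and 23 filters (matching the 23-dim. acoustic features) are assumed here.

```python
import numpy as np

def mel(f):
    # Hz -> mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=23, n_fft=512, fs=16000):
    # Triangular filters with centers evenly spaced on the mel scale
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfb_log_energies(x, frame_len=320, hop=160, n_fft=512, fs=16000):
    # 20 ms frames (320 samples at 16 kHz) every 10 ms -> 100 Hz frame rate
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    fb = mel_filterbank(23, n_fft, fs)
    return np.log(spec @ fb.T + 1e-10)   # shape: (n_frames, 23)
```

One second of 16kHz audio yields 99 frames of 23 log energies each, matching the acoustic feature stream in the system diagram.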
Bimodal speech recognition system
[Block diagram]
• Audio path: Audio signal (16kHz) → Acoustic parameterization → Acoustic features (23 dim., 100Hz).
• Visual path: Visual signal (25Hz) → Visual parameterization → Visual features (16 dim., 25Hz) → Interpolation to the audio frame rate.
• Fusion: Feature fusion → Acoustic-visual features (39 dim., 100Hz) → Neural Net → Recognition results.
• Visual parameterization: Grayscalization → Edge calculation → 2D cross-correlation → Maximum → Square cropping → Resize → 2D-DCT → LPF.
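The visual path and the fusion step can be sketched in numpy. This is a simplified, hedged sketch: the slides do not specify which 16 DCT coefficients are kept (the top-left 4x4 block is assumed here, skipping the edge/cross-correlation mouth-localization stages), and plain frame repetition stands in for the unspecified 25Hz → 100Hz interpolation.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    D[0] /= np.sqrt(2.0)
    return D

def visual_features(frame_rgb, block=4):
    # RGB -> grayscale, 2D-DCT, keep the low-frequency 4x4 block
    # (16 coefficients); assumes square frames, e.g. 70x70
    gray = frame_rgb @ np.array([0.299, 0.587, 0.114])
    D = dct_matrix(gray.shape[0])
    coef = D @ gray @ D.T
    return coef[:block, :block].ravel()

def upsample_and_fuse(acoustic, visual):
    # Repeat each 25 Hz visual vector 4x to reach the 100 Hz audio frame
    # rate, then concatenate: 23 acoustic + 16 visual = 39-dim features
    visual_100 = np.repeat(visual, 4, axis=0)[:len(acoustic)]
    return np.hstack([acoustic, visual_100])
```

Fusing one second of features (100 acoustic frames, 25 visual frames) yields the 39-dim., 100Hz acoustic-visual stream fed to the neural net.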
Recognition results - Accuracy

             Acoustic [%]   Visual [%]   Acoustic-Visual [%]
Phonemes        31.05         12.15          31.33
VAD             94.04         83.79          94.12
 - 0 (83%)      96.86         99.62          96.89
 - 1 (17%)      79.32          1.44          79.71
Problems & Current focus
• More data for acoustic-visual experiments.
• Incorporation of a robust mouth detection algorithm.
• Compensation algorithms to reduce lighting variations, rotation, etc.
• LDA to reduce dimensionality and improve discrimination among the speech classes.
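The planned LDA step above can be sketched as a classical Fisher discriminant projection. This is a minimal numpy illustration, not the authors' implementation; the small diagonal ridge added to the within-class scatter is an assumption to keep it invertible.

```python
import numpy as np

def lda_projection(X, y, n_out):
    # Fisher LDA: find directions maximizing between-class scatter
    # relative to within-class scatter
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))      # within-class scatter
    Sb = np.zeros((d, d))      # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # Solve the generalized eigenproblem Sb v = lambda Sw v
    # (ridge term keeps Sw numerically invertible)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:n_out]].real   # columns = discriminant directions

# Usage: X_lda = X @ lda_projection(X, labels, n_out)
```

Projecting the 39-dim. fused features through such a transform would both reduce dimensionality and sharpen the separation between the 47 phoneme classes.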