SPEAKER, ENVIRONMENT AND CHANNEL CHANGE DETECTION AND CLUSTERING VIA THE BAYESIAN INFORMATION CRITERION

Scott Shaobing Chen & P.S. Gopalakrishnan
IBM T.J. Watson Research Center
email: schen@watson.ibm.com

ABSTRACT

In this paper, we are interested in detecting changes in speaker identity, environmental condition and channel condition; we call this the problem of acoustic change detection. The input audio stream can be modeled as a Gaussian process in the cepstral space. We present a maximum likelihood approach to detect turns of a Gaussian process; the decision of a turn is based on the Bayesian Information Criterion (BIC), a model selection criterion well known in the statistics literature. The BIC criterion can also be applied as a termination criterion in hierarchical methods for clustering of audio segments: two nodes can be merged only if the merging increases the BIC value. Our experiments on the Hub4 1996 and 1997 evaluation data show that our segmentation algorithm can successfully detect acoustic changes; our clustering algorithm can produce clusters with high purity, leading to improvements in accuracy through unsupervised adaptation as much as the ideal clustering by the true speaker identities.

1. INTRODUCTION

Automatic segmentation of an audio stream and automatic clustering of audio segments according to speaker identities, environmental conditions and channel conditions have received quite a bit of attention recently [4, 8, 6, 10]. For example, in the task of automatic transcription of broadcast news [3], the data contains clean speech, telephone speech, music segments, speech corrupted by music or noise, etc. There are no explicit cues for the changes in speaker identity, environment condition and channel condition. Also, the same speaker may appear multiple times in the data. In order to transcribe the speech content in audio streams of this nature,

• we would like to segment the audio stream into homogeneous regions according to speaker identity, environmental condition and channel condition so that regions of different nature can be handled differently: for example, regions of pure music and noise can be rejected; also, one might design a separate recognition system for telephone speech.

• we would like to cluster speech segments into homogeneous clusters according to speaker identity, environment and channel; unsupervised adaptation can then be performed on each cluster. [8, 10] showed that a good clustering procedure can greatly improve the performance of unsupervised adaptation such as MLLR.

Various segmentation algorithms have been proposed in the literature [2, 4, 6, 8, 10, 14], which can be categorized as follows:

• Decoder-guided segmentation. The input audio stream can be first decoded; then the desired segments can be produced by cutting the input at the silence locations generated from the decoder [14, 8]. Other information from the decoder, such as the gender information, could also be utilized in the segmentation [8].

• Model-based segmentation. [2] proposed to build different models, e.g. Gaussian mixture models, for a fixed set of acoustic classes, such as telephone speech, pure music, etc., from a training corpus; the incoming audio stream can be classified by maximum likelihood selection over a sliding window; segmentation can be made at the locations where there is a change in the acoustic class.

• Metric-based segmentation. [4, 6, 10] proposed to segment the audio stream at maxima of the distances between neighboring windows placed at every sample; distances such as the KL distance and the generalized likelihood ratio distance have been investigated.

In our opinion, these methods are not very successful in detecting the acoustic changes present in the data. The decoder-guided segmentation only places boundaries at silence locations, which in general have no direct connection with the acoustic changes in the data. Both the model-based segmentation and the metric-based segmentation rely on thresholding of measurements which lack stability and robustness. Besides, the model-based segmentation does not generalize to unseen acoustic conditions.

Clustering of audio segments is often performed via hierarchical clustering [10, 8]. First, a distance matrix is computed; the common practice is to model each audio segment as one Gaussian in the cepstral space and to use the KL distance or the generalized likelihood ratio as the distance measure [6]. Then bottom-up hierarchical clustering can be performed to generate a clustering tree. It is often difficult to determine the number of clusters. One can heuristically pre-determine the number of clusters or the minimum size of each cluster; accordingly, one can go down the tree to obtain the desired clustering [14]. Another heuristic solution is to threshold the distance measures during the hierarchical process; the thresholding level is tuned on a training set [10]. Jin et al. [7] shed some light on automatically choosing a clustering solution.
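
As an illustration of the common practice described above (one Gaussian per audio segment, with the KL distance as the distance measure), the distance matrix could be computed roughly as follows. This is a minimal sketch and not the implementation used in the cited work: the function names, the symmetrized form of the KL distance, the full-covariance assumption and the small diagonal floor on the covariance are all assumptions made for the example.

```python
import numpy as np


def fit_gaussian(frames):
    """Fit one full-covariance Gaussian to a (n_frames, dim) array of
    cepstral vectors. Assumes dim >= 2 and n_frames > dim."""
    mean = frames.mean(axis=0)
    # The 1e-6 diagonal floor is an assumption to keep the covariance invertible.
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mean, cov


def kl_gaussian(mean_p, cov_p, mean_q, cov_q):
    """Closed-form KL(p || q) between two multivariate Gaussians."""
    dim = mean_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mean_q - mean_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + diff @ cov_q_inv @ diff
        - dim
        + logdet_q
        - logdet_p
    )


def symmetric_kl(g1, g2):
    """Symmetrized KL distance: KL(g1 || g2) + KL(g2 || g1)."""
    return kl_gaussian(*g1, *g2) + kl_gaussian(*g2, *g1)


def distance_matrix(segments):
    """Pairwise symmetric KL distances between segments, each a
    (n_frames, dim) array, modeled as one Gaussian per segment."""
    gaussians = [fit_gaussian(s) for s in segments]
    n = len(gaussians)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = symmetric_kl(gaussians[i], gaussians[j])
    return dist
```

The resulting matrix can then be handed to any bottom-up hierarchical clustering routine to build the clustering tree; deciding where to cut that tree is the open question raised in the paragraph above.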
