Tuesday, March 18, 2008

Cepstral mean subtraction worthless?

Several research papers (most notably Aucouturier and Pachet, 2006) report that performing cepstral mean subtraction (CMS) is damaging to music information retrieval, even though the technique is commonplace in automatic speech recognition. I've noticed the same damage with any algorithm that models global timbre. For example, Aucouturier and Pachet modeled each song with a Gaussian mixture model (GMM), compared songs using an estimated Kullback-Leibler divergence, and noticed a detrimental effect with CMS. Other researchers have verified this result, and so have I. However, there is an important point: the model is built at the global song level. When models are shared among several songs, as in acoustic segment modeling (ASM) (Reed and Lee, 2006), performing CMS is not only useful but necessary. If one does not perform CMS, the ASM approach does not work. In fact, most songs sent through the Viterbi decoder will not have surviving paths, and even when paths do survive, the decoder most often produces only a couple of "musiphones."
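For readers who have not used CMS, here is a minimal sketch of the operation in Python with NumPy. It assumes the cepstral features (e.g., MFCCs) for one song are already stored as a frames-by-coefficients array; the function name and the random stand-in data are my own illustration, not code from any of the papers mentioned above.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-song mean from each cepstral coefficient.

    cepstra: array of shape (n_frames, n_coeffs) for a single song.
    Returns an array of the same shape whose columns have zero mean.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Hypothetical stand-in for real MFCC frames: 500 frames, 13 coefficients.
rng = np.random.default_rng(0)
frames = rng.normal(loc=3.0, scale=1.0, size=(500, 13))
normalized = cepstral_mean_subtraction(frames)
```

After this step every song has the same (zero) long-term cepstral mean, so a global model such as a per-song GMM can no longer use that mean to tell songs apart.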

It should be remembered that CMS discards information (e.g., about the recording equipment) that is definitely useful for similarity: people who record similar types of music tend to use similar types of equipment. If one performs CMS on a global model, there is no gain to be had by discarding this information. On the other hand, if one wants to use dynamic information, then discarding it with CMS is necessary. I think a lot of researchers have been citing the conclusions of Aucouturier and Pachet a little unfairly. Their paper was based on global timbre models, and its results do not apply to approaches that take dynamics into account.
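To make the "recording equipment" point concrete, here is a small sketch under a standard simplifying assumption: a fixed recording chain acts roughly like a linear filter, which shows up as a constant vector added to every cepstral frame. The data and the `channel` offset below are made up for illustration.

```python
import numpy as np

def cms(cepstra):
    # Remove the per-song mean of each cepstral coefficient.
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
clean = rng.normal(size=(400, 13))     # hypothetical cepstra of a performance
channel = rng.normal(size=(1, 13))     # hypothetical constant equipment offset
recorded = clean + channel             # same performance through that equipment

# Before CMS the two versions differ by the channel offset in their means.
gap_before = np.abs(recorded.mean(axis=0) - clean.mean(axis=0)).max()
# After CMS the frames coincide, so the equipment cue is gone.
gap_after = np.abs(cms(recorded) - cms(clean)).max()
```

Under this assumption the offset is exactly what a global timbre model would pick up as a similarity cue, and exactly what a model shared across songs needs removed.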

However, it should be noted that just using HMMs does not necessarily bring useful dynamic information, either. One needs to use these intelligently, which will be the subject of tomorrow's post.
