Wednesday, March 19, 2008

Using an HMM != ASR

As I said yesterday, many MIR researchers have tried to copy the usual automatic speech recognition (ASR) paradigm by using hidden Markov models (HMMs). However, almost none of these approaches have used HMMs correctly... at least, not if the goal was to mirror what is done in ASR research. Most MIR researchers have modeled an entire song with a single HMM of between 3 and 10 states. It has usually been noted that these approaches fare no better than using a GMM for the entire song, and the conclusion for a long time has been that dynamic information is not important for music similarity.
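To make the criticized setup concrete, here is a minimal sketch of whole-song modeling, assuming an MFCC matrix per song (frames by coefficients) and using hmmlearn and scikit-learn; the state counts and function names are my own illustrative choices, not anything taken from the papers in question.

import numpy as np
from hmmlearn import hmm
from sklearn.mixture import GaussianMixture

def fit_song_hmm(mfcc, n_states=5):
    # One HMM per entire song, 3-10 states: the common MIR setup.
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(mfcc)  # mfcc: (n_frames, n_coeffs)
    return model

def fit_song_gmm(mfcc, n_mixtures=5):
    # One GMM per entire song: the bag-of-frames baseline it rarely beats.
    model = GaussianMixture(n_components=n_mixtures, covariance_type="diag")
    model.fit(mfcc)
    return model

def cross_likelihood(model_a, mfcc_b):
    # A crude, non-symmetric similarity: how well song B's frames fit song A's model.
    return model_a.score(mfcc_b)  # total or per-frame log-likelihood, depending on the library

Either way the result is one model per song, so any temporal structure longer than a few frames gets averaged away, which is exactly why the HMM rarely buys anything over the GMM in this setup.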

The problem I have is with the model itself. Using an HMM for an entire song (or, even worse, a genre) is NOT the same paradigm as in ASR. A song is typically 3 minutes long, but HMMs in speech rarely cover more than a single word or phone, so the span of time an HMM models is on the order of tens to hundreds of milliseconds. The reality is that HMMs in speech are shared among different utterances. If one wants to copy this for MIR, then HMMs need to be shared across songs. Of course, no one has come up with a good way to provide music transcriptions from which to train HMMs in this way.

Christopher Raphael presented a paper that did try to produce transcriptions of monophonic melodies using HMMs, but no one has really shown how this would apply to polyphonic music. Imagine a slow-moving bass line with a very fast staccato melody on top. Does one start a new model each time a new sound begins? Doing so would mean a very large number of models, because every possible combination of notes from every voice would need its own model. What about modeling only the lowest note? Then the densities under each state would need a very large number of mixtures to account for the different notes that may be played on top (more still if one includes different instruments). That means tons of data.

In reality, until something better comes along, unsupervised HMM tokenization is the best chance for modeling music in the same fashion as speech. The downside is that no one has a direct interpretation of what these models mean. However, there are language identification papers where phone models trained on one language are used to tokenize speech in another language, even when one language contains sounds that the other does not model. This gives hope to those studying music similarity and classification.
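As a rough illustration of what I mean by unsupervised tokenization, here is a sketch in which a single ergodic HMM is trained on frames pooled from many songs, its hidden states play the role of "musiphones," and each song is then Viterbi-decoded into a token sequence. The feature shapes, the number of units, and the collapsing of repeated states are my own assumptions, not the exact recipe of any published system.

import numpy as np
from hmmlearn import hmm

def train_tokenizer(mfcc_per_song, n_units=32):
    # Pool frames from all songs; 'lengths' tells EM training where songs begin and end.
    X = np.vstack(mfcc_per_song)
    lengths = [m.shape[0] for m in mfcc_per_song]
    model = hmm.GaussianHMM(n_components=n_units, covariance_type="diag", n_iter=10)
    model.fit(X, lengths)
    return model

def tokenize(model, mfcc):
    # Viterbi state sequence for one song, with runs of identical states collapsed
    # into a token string, e.g. [3, 3, 3, 7, 7] -> [3, 7].
    states = model.predict(mfcc)
    return [int(s) for i, s in enumerate(states) if i == 0 or s != states[i - 1]]

The resulting token sequences can then be compared with text-style methods (n-gram counts, for instance), which is where the analogy to phone tokenization in language identification comes from.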

Tuesday, March 18, 2008

Cepstral mean subtraction worthless?

It has been reported in several research papers (most notably Aucouturier and Pachet, 2006) that performing cepstral mean subtraction (CMS) is damaging to music information retrieval, even though it is commonplace in automatic speech recognition. I've noticed this effect with any algorithm that models global timbre. For example, Aucouturier and Pachet modeled each song with a Gaussian mixture model (GMM), compared songs with an estimated Kullback-Leibler divergence, and found a detrimental effect from CMS. This result has been verified by other researchers (and by me). However, there is an important point: their model is built at the global song level. When models are shared among several songs, as in acoustic segment modeling (ASM) (Reed and Lee, 2006), performing CMS is not only useful but necessary. Without CMS, the ASM approach does not work: most songs sent through the Viterbi decoder will have no surviving paths, and even when a path does survive, it usually passes through only a couple of "musiphones."
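For clarity, by CMS I mean nothing more than subtracting each cepstral coefficient's mean over the song, so the song-level offset (largely a channel effect) is removed. A minimal sketch, assuming an MFCC matrix of shape (n_frames, n_coeffs):

import numpy as np

def cepstral_mean_subtraction(mfcc):
    # Remove the per-song mean of each coefficient; ASR systems typically do the
    # same per utterance (often adding variance normalization as well).
    return mfcc - mfcc.mean(axis=0, keepdims=True)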

It should be remembered that CMS discards information (e.g., the effect of the recording equipment) that is genuinely useful for similarity; obviously, people who record similar types of music tend to use similar equipment. However, with a global model there is nothing to be gained by discarding this information, whereas an approach that uses dynamic information cannot do without CMS. I think a lot of researchers have been citing the conclusions of Aucouturier and Pachet a little unfairly: their paper was based on global timbre models, and its results are not applicable to approaches that take dynamics into account.

However, it should be noted that simply using HMMs does not necessarily bring useful dynamic information, either. One needs to use them intelligently, which will be the subject of tomorrow's post.