Wednesday, March 19, 2008

Using an HMM != ASR

As I said yesterday, many MIR researchers have tried to copy the usual automatic speech recognition (ASR) paradigm by using hidden Markov models (HMMs). However, almost all of these approaches have used HMMs incorrectly... at least, if the goal was to mirror what is done in ASR research. Most MIR researchers have modeled an entire song with a single HMM of somewhere between 3 and 10 states. It has usually been noted that these approaches fare no better than using a GMM for the entire song, and the conclusion for a long time has been that dynamic information is not important for music similarity.
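To make the comparison concrete, here is a rough sketch of that setup. The feature extraction and the model classes (librosa, hmmlearn, scikit-learn) are just convenient stand-ins on my part, not what any particular paper used:

    # A minimal sketch of the "one HMM (or GMM) per song" approach.
    import librosa
    from hmmlearn.hmm import GaussianHMM
    from sklearn.mixture import GaussianMixture

    def song_models(path, n_states=5, n_mix=5):
        y, sr = librosa.load(path)
        # Frames of MFCCs, shape (n_frames, n_coeffs)
        X = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

        # One HMM for the whole song, 3-10 states as in the papers discussed
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag")
        hmm.fit(X)

        # A plain GMM over the same frames (the bag-of-frames baseline)
        gmm = GaussianMixture(n_components=n_mix, covariance_type="diag")
        gmm.fit(X)

        # Per-frame average log-likelihoods; GaussianMixture.score is already
        # averaged per sample, so only the HMM score needs normalizing.
        return hmm.score(X) / len(X), gmm.score(X)

In practice the two scores come out close to each other, which is exactly the observation behind the "dynamics don't matter" conclusion.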

The problem I have is with the model itself. Using an HMM for an entire song (or even worse, an entire genre) is NOT the same paradigm as in ASR. A song is typically 3 minutes long, but HMMs in speech rarely cover more than a single word or phone, so the span of time modeled by one HMM is on the order of tens to hundreds of milliseconds. More importantly, HMMs in speech are shared across different utterances: the same phone model appears in many words and many recordings. If one wants to copy this paradigm in MIR, then HMMs need to be shared across songs. Of course, no one has come up with a good way to produce music transcriptions from which to train HMMs in this way.
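For contrast, here is roughly what the ASR-style alternative would look like: short HMMs for sub-song units, each trained on segments pooled from many songs. The unit labels here are entirely hypothetical, since, as I said, no such transcription exists:

    # Sketch of ASR-style shared unit models, assuming labeled segments
    # (unit_label, feature_matrix) pooled from many different songs.
    from collections import defaultdict
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_shared_unit_hmms(segments, n_states=3):
        pooled = defaultdict(list)
        for label, X in segments:
            pooled[label].append(X)

        models = {}
        for label, chunks in pooled.items():
            X = np.vstack(chunks)               # concatenate all segments
            lengths = [len(c) for c in chunks]  # keep segment boundaries
            m = GaussianHMM(n_components=n_states, covariance_type="diag")
            m.fit(X, lengths=lengths)           # one short model per unit,
            models[label] = m                   # trained on data from every song
        return models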

Christopher Raphael presented a paper that did try to produce transcriptions of monophonic melodies using HMMs, but no one has really shown how this would extend to polyphonic music. Imagine a slow-moving bass line with a very fast staccato melody on top. Does one start a new model each time a new sound begins? Doing so would require an enormous number of models, because every possible combination of notes from every voice would need its own model. What about modeling only the lowest note? Then the output densities under each state would need a very large number of mixture components to account for the different notes that might be played on top (and even more if one includes different instruments). That means tons of training data.
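A quick back-of-the-envelope count shows how badly the "one model per note combination" idea scales. The numbers below (50 candidate pitches, up to 4 simultaneous voices) are made up purely for illustration:

    # Counting distinct simultaneous note combinations.
    from math import comb

    pitches = 50        # distinct pitches that might sound at once
    max_voices = 4      # up to four simultaneous notes

    combos = sum(comb(pitches, k) for k in range(1, max_voices + 1))
    print(combos)       # 251175 -- already far too many models to train,
                        # before instruments or dynamics enter the picture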

In reality, until something better comes along, unsupervised HMM tokenization is the best chance for modeling music in the same fashion as speech. The downside is that no one has a direct interpretation of what these models mean. However, there are language identification papers in which phone models trained on one language are used to decode another language, even when the second language contains sounds the models never saw. This gives hope to those studying music similarity and classification.
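To be clear about what I mean by unsupervised tokenization, here is a minimal sketch: fit one HMM on unlabeled frames from a whole collection, then treat each song's decoded state sequence as its "transcription." The model size and the histogram comparison below are arbitrary choices on my part, just to show the shape of the idea:

    # Unsupervised HMM tokenization: states carry no assigned meaning,
    # they simply quantize the feature stream into discrete tokens.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def fit_tokenizer(song_features, n_tokens=32):
        # song_features: list of per-song MFCC matrices (n_frames, n_dims)
        X = np.vstack(song_features)
        lengths = [len(f) for f in song_features]
        hmm = GaussianHMM(n_components=n_tokens, covariance_type="diag")
        hmm.fit(X, lengths=lengths)      # trained with no labels at all
        return hmm

    def token_histogram(hmm, X, n_tokens=32):
        tokens = hmm.predict(X)          # Viterbi state sequence = "transcription"
        counts = np.bincount(tokens, minlength=n_tokens).astype(float)
        return counts / counts.sum()

    def similarity(hmm, Xa, Xb, n_tokens=32):
        a = token_histogram(hmm, Xa, n_tokens)
        b = token_histogram(hmm, Xb, n_tokens)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine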
