Recently, I presented a paper at ICASSP that discussed the importance of incorporating temporal information into the structure of automatic tag recommendation algorithms. Until this paper, all studies and systems designed to overcome the "cold start" problem ignored temporal information for the most part. In fact, I am aware of only three exceptions:
(1) Using derivatives of features, such as MFCCs. This incorporates temporal information on only a small scale (~50-100 ms). (2) Averaging features from multiple frames in a given temporal window (e.g., averaging 100 ms frames over a duration of 1 second). (3) Extracting song-level features, such as rhythmic features; e.g., estimated beat histograms.
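To make exceptions (1) and (2) concrete, here is a minimal sketch in plain Python. The function names, the toy two-dimensional frames, and the use of a simple first-order difference (rather than the regression-based deltas many toolkits compute) are all my assumptions for illustration; this is not taken from any particular system.

```python
from typing import List

Frame = List[float]  # one feature vector per analysis frame (e.g., MFCCs)


def delta(frames: List[Frame]) -> List[Frame]:
    """First-order difference between consecutive frames.

    A simple stand-in for delta features: each output vector only
    sees two neighboring frames, so the temporal context is tiny.
    """
    return [
        [cur_v - prev_v for prev_v, cur_v in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]


def window_average(frames: List[Frame], window: int) -> List[Frame]:
    """Mean of each non-overlapping block of `window` frames.

    A trailing partial block is dropped. The order of frames inside
    a block has no effect on the result.
    """
    averaged = []
    for start in range(0, len(frames) - window + 1, window):
        block = frames[start:start + window]
        averaged.append([sum(column) / window for column in zip(*block)])
    return averaged


# Toy input: four frames of a hypothetical two-dimensional feature.
frames = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [7.0, 14.0]]
deltas = delta(frames)
averaged = window_average(frames, 2)
print(deltas)    # [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
print(averaged)  # [[1.5, 3.0], [5.5, 11.0]]
```

Note that shuffling the frames inside a window leaves the average unchanged, so this kind of smoothing carries no information about ordering.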
The problem with the above approaches is that they incorporate information on a very small scale and do not incorporate "syntactic structure." One problem I have had with music recommendation research is that many of the systems are based on a rather faulty assumption. Researchers have taken an abstract from a presentation given by Gjerdingen and Perrot in 1999 and essentially taken the results way too far. Specifically, researchers have taken for granted the "bag-of-frames" approach, which essentially says that any small segment of a song is representative of the whole song. In other words, one can listen to 250 ms of a song and that will be representative. This is obviously a faulty assumption and it has been discussed here and here. Originally, this assumption was used in studies on genre classification. Since genre is an ill-defined concept anyway, it is difficult to verify this assumption. However, even if this assumption is true, it does not make sense to carry it over to tags, which have a better-defined meaning. For example, the Pandora tags of "repetitive melodic phrasing" and "extensive vamping" have obvious acoustic semantic structure.
So how does our study contrast with previous approaches?
In the paper, we build a vocabulary of acoustic tokens, which can be seen as acoustic generalizations of phonemes in automatic speech recognition (or musiphones as Doug Turnbull called them - yes, Doug, I consider this your terminology). Not only are the musiphones represented by a temporal model (i.e., a multi-state HMM), but syntax is also considered through the use of unigram and bigram counts. Compared to the baseline (Turnbull, Barrington, Torres, and Lanckriet), our algorithm performs substantially better, especially for tags that are considered to be temporal in nature (e.g., melody, solos, etc.). While this paper mirrors the implementation we proposed for genre detection in 2006, the results are more informative in the 2009 ICASSP paper.
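To illustrate why bigram counts capture something that a pure bag-of-frames (unigram) view cannot, here is a small Python sketch. It assumes a song has already been decoded into a sequence of token labels; the labels "a" and "b" and the function name ngram_counts are hypothetical, and the HMM training and decoding steps from the paper are not shown.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str]) -> Tuple[Counter, Counter]:
    """Count unigrams and adjacent-pair bigrams in a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Two hypothetical token sequences with the same tokens in a different order.
alternating = ["a", "b", "a", "b"]
grouped = ["a", "a", "b", "b"]

uni_alt, bi_alt = ngram_counts(alternating)
uni_grp, bi_grp = ngram_counts(grouped)

# Unigram counts are identical, so an order-free model cannot tell them apart.
print(uni_alt == uni_grp)  # True

# Bigram counts differ, so the ordering is visible to the model.
print(bi_alt == bi_grp)    # False
print(bi_alt[("a", "b")], bi_grp[("a", "b")])  # 2 1
```

In this toy case the alternating sequence might loosely correspond to a repeated short figure, which is the kind of structure a tag like "extensive vamping" describes, but that mapping is only an intuition and not a result from the paper.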
Note: The Gjerdingen and Perrot presentation has finally been published so that people can read the study in its entirety. Two things to note in addition to the papers linked above:
(1) Gjerdingen and Perrot performed a task of discriminating genres (i.e., a closed, forced choice) and not identification (i.e., an open, forced or unforced choice). (2) Only 10 genres were used and most were fairly easy to discriminate.