Wednesday, August 6, 2008

The Continued Popularity of USPop2002

To gather useful training data for my thesis, I need preference rankings for music recommendation. I also need tag information, such as the tags from Last.fm and Pandora. Further, I must be able to obtain audio (or at least some acoustic features) cheaply. The best data I have found is LabROSA's USPop2002. It's much larger than the RWC Database, and because its songs were chosen based on popularity in 2002, they are much more likely to have tags than songs from Magnatune. The downside is that I'm limited to Mel-frequency cepstral coefficients.

While LabROSA also has playlists from OpenNap, no preferences are given; a song is either on a person's playlist or it is not. I've been using Last.fm's API to try to remedy this. First, I gathered the top listeners for each of the 400 artists in the USPop2002 set. Over the past couple of weeks I have been extracting each listener's weekly chart lists and combining them to get the total number of plays of each song per listener. While the number of plays may not be a direct measure of preference (or rating), it is reasonable to assume that people listen to songs they like more than songs they do not. At the moment, I have downloaded only about 4000 listeners (I have to download several pages per listener, and Last.fm requests a 1-second wait between requests). Also, artist names appear in several variants. Rap and hip-hop seem to be exceptionally bad in this respect, since those artists seem unable to do any song without a guest star.
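To make the combining step concrete, here is a minimal sketch of how the weekly charts for one listener could be summed into total plays per song. It is not my actual download script: the `(artist, track, plays)` tuple layout, the function names, and the guest-star pattern are all assumptions made up for illustration, and the pattern only catches "feat."-style credits, not every variant that shows up in the data.

```python
import re
from collections import Counter

# Hypothetical pattern: a trailing guest credit such as "feat. X", "ft. X",
# or "featuring X". Real artist strings vary far more than this.
_GUEST_CREDIT = re.compile(r"\s+(?:feat\.?|ft\.?|featuring)\s+.*$", re.IGNORECASE)


def normalize_artist(name):
    """Trim whitespace, drop a trailing guest-star credit, and lower-case."""
    return _GUEST_CREDIT.sub("", name.strip()).lower()


def total_plays(weekly_charts):
    """Sum plays per (artist, track) across one listener's weekly charts.

    weekly_charts is an iterable of charts; each chart is a list of
    (artist, track, plays) tuples (an assumed layout, not Last.fm's format).
    """
    totals = Counter()
    for chart in weekly_charts:
        for artist, track, plays in chart:
            key = (normalize_artist(artist), track.strip().lower())
            totals[key] += plays
    return totals


if __name__ == "__main__":
    charts = [
        [("Jay-Z feat. Beyonce", "Bonnie & Clyde", 3), ("Radiohead", "Creep", 2)],
        [("Jay-Z", "bonnie & clyde", 1)],
    ]
    for (artist, track), plays in total_plays(charts).most_common():
        print(artist, track, plays)
```

Folding the guest credit into the main artist is a design choice: it keeps plays from being split across name variants, at the cost of losing the collaborator.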

There's tons of data to play with, but for now, let's look into which artists are popular. Note: there are still thousands of users to download, and some artists' top 50 listeners have not been reached yet. These results should be taken with caution so that we don't leap to Montauk monster conclusions (it's a raccoon, let it go, people).

This kind of continued success is what I would expect to see: superstars make up the vast majority of hits and the short-lived fame of others dies out. However, one should note the artists appearing at the bottom may have more plays due to the "rap problem" described above. I also wanted to see if the data was consistent with Zipf's Law, but it is not (the bend is not deep enough).
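For anyone who wants to try the Zipf check on their own counts, one rough way is to fit a line to log(plays) against log(rank): under Zipf's Law the slope should come out close to -1. The sketch below uses an ordinary least-squares fit with only the standard library. The function name and the sample counts are made up for illustration, and this is just one simple way to do the check, not necessarily how I did it.

```python
import math


def zipf_slope(play_counts):
    """Least-squares slope of log(count) versus log(rank).

    Counts are ranked from largest to smallest; zero or negative counts
    are dropped because their logarithm is undefined.
    """
    counts = sorted((c for c in play_counts if c > 0), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two positive counts")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Made-up counts that follow count = 1000 / rank exactly.
    ideal = [1000.0 / rank for rank in range(1, 51)]
    print(zipf_slope(ideal))
```

A slope well away from -1, or a log-log plot that visibly curves, suggests the counts do not follow Zipf's Law.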


One neat thing occurred in the top 5 artists: the Beatles, Radiohead, Pink Floyd, David Bowie, and Queen. Only the Beatles and possibly David Bowie have had enough users from their top-listener lists downloaded to explain such high results. Indeed, it appears that the other artists would be just as popular if I had taken a random group of users (note: I'm sure the same will hold for the Beatles once I extract more pages).

I'll have more later.
