Friday, August 8, 2008

Occam's Razor and the "Rap Problem"

Yesterday, I briefly described the "Rap problem": an artist's name appears several times in a database because the artist features other artists. After looking at the greatest violators, it's probably unfair to "pick" on rap, but there is a clear trend that rap is a fairly big violator. Note: I'm not saying rap sucks or anything like that. I'm just saying that this presents a problem for researchers dealing in search technology. In fact, as I'll show, the people who feature lots of guest artists make up a pretty impressive list of musicians and performers.

At first, I thought I would have to do an extensive literature search for an efficient solution to this problem, but my girlfriend proposed a quick one. She suggested that I just check whether the artist's name is the first one listed. On the surface this seemed reasonable, except that some artists have names that are prefixes of other artists' names (e.g., "Joe", "Pink"). But this led to an efficient solution to the problem: look for names that either match exactly or are followed by a special separator. For example, most of the feature problems can be dealt with by looking for the regular expression /^artistName_feat_/ or /^artistName_&_/ (the underscores and the ampersand are literal characters, not wildcards).
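As a rough illustration of the matching rule described above, here is a minimal Python sketch. The function names (build_matcher, is_primary_artist) are my own and purely hypothetical, and it assumes names are already lowercased with underscores in place of spaces, as in the list further down. It accepts an entry if it equals the artist's name exactly or starts with the name followed by the literal "_feat_" or "_&_".

```python
import re

def build_matcher(artist_name):
    """Compile a pattern for the artist alone or as the first-listed artist."""
    # Escape the name so any regex metacharacters in it are taken literally.
    escaped = re.escape(artist_name)
    # Either the entry ends right after the name, or the name is followed
    # by one of the two literal separators.
    return re.compile(rf"^{escaped}(?:$|_feat_|_&_)")

def is_primary_artist(artist_name, entry):
    """Return True if `entry` is the artist or the artist featuring others."""
    return build_matcher(artist_name).match(entry) is not None
```

With this rule, "nelly" matches "nelly" and "nelly_feat_someone" but not "nelly_furtado", which is exactly the prefix case that a plain "starts with" test gets wrong.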

This actually worked fairly well, since I am only looking for a group of users who listened to songs from my dataset. This is not a solution to the misspelling problem, but it's a fair assumption that most people will listen to correctly spelled entries when using a well-established site like Last.fm. This saved a great deal of time and proved once again that one should always try something quick and dirty first.

Looking at the top 20, there is a definite pattern:

mariah_carey: 135
busta_rhymes: 105
usher: 54
nelly: 52
madonna: 48
ludacris: 42
wyclef_jean: 40
santana: 39
michael_jackson: 37
bob_marley: 37
david_bowie: 35
ja_rule: 32
dmx: 31
nelly_furtado: 31
ricky_martin: 29
frank_sinatra: 29
sting: 28
cypress_hill: 27
elton_john: 27
outkast: 25

One should note that artists like Mariah Carey and Busta Rhymes have not necessarily played with over a hundred different artists, because the featured artists can have different spellings, which I did not correct for (e.g., "mariah_carey_feat_boys_2_men" vs. "mariah_carey_feat_boys_ii_men"). However, the likelihood of misspelling a featured artist's name is probably not an inherent trait of the first artist, so we can treat it as noise. I don't think Mariah Carey has a particular fondness for easily misspelled or varied names.
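To make the spelling caveat concrete, here is a small Python sketch of one way such counts could be produced. This is an assumption on my part about the counting, not a record of the exact script used: it counts distinct strings following the first separator found, so spelling variants of the same featured artist are counted separately. The function name count_feature_variants and the sample entries are hypothetical.

```python
from collections import defaultdict

def count_feature_variants(entries, separators=("_feat_", "_&_")):
    """Count distinct featured-artist strings per first-listed artist."""
    variants = defaultdict(set)
    for entry in entries:
        for sep in separators:
            # Split at the first occurrence of the separator, if present.
            primary, found, rest = entry.partition(sep)
            if found:
                variants[primary].add(rest)
                break  # use only the first separator that matches
    return {artist: len(names) for artist, names in variants.items()}

sample = [
    "mariah_carey",                   # no featured artist, not counted
    "mariah_carey_feat_boys_2_men",
    "mariah_carey_feat_boys_ii_men",  # same group, different spelling
]
counts = count_feature_variants(sample)
```

Here counts["mariah_carey"] is 2 even though only one group is featured, which is the kind of inflation described above.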

One can also divide this list into roughly three groups (with some overlap depending on personal genre definitions): hip-hop, rap, and old, established rock/pop artists. So, given the list above, the "rap" problem may not be so much a matter of taste. Also, voice and style are very central to the "musicalness" of rap and hip-hop, so featuring a different artist is probably comparable to a rock musician using an orchestra or a different instrument than usual.
