Wednesday, July 1, 2009

Are recommenders reaching their limit?... No!

At the recent UMAP 2009 conference, a paper raised the possibility that we are reaching the performance limits of recommendation systems (RS). If true, this would change the landscape for research and development in RS. In fact, some blogs discussed this paper before it was even presented! However, after reading the paper, I'm inclined to disagree with the hype surrounding it. It's true that the paper points to a performance limit for RS under the current way of collecting recommendation data. However, that does not mean that no one can build a better RS.

First, I want to discuss what the potential impact could be for RS if we do reach a true limit of performance. As an example, assume that for a particular task (e.g., music recommendation), people have a self-agreement of 90%. That is, if a person ranks 100 songs one day and then ranks the same 100 songs two weeks later, the two rankings will agree on 90 of them. Assume that tastes do not change, which the authors argue is the case in their setting (they make three measurements at different points in time). What does this mean? Some possible explanations:

(1) The user doesn't know if or how much he likes the item.
(2) The user can't map the degree to which he likes the item onto discrete, deterministic categories.

(1) is, by default, the wrong option, since the user's judgment is automatically the correct answer; (2), however, makes some sense. While people may know whether they really like or hate something, there is a rather large ambiguous area in the middle. How many people can consistently listen to a song and say, "I like that song 40%"? What does that even mean? Does the user like it 40% of the time he hears it? Does it mean the song would sit at the 40th percentile if the user ranked every song he has ever heard? If he ranked every song he's heard several times, would the average rank be the 40th percentile?

The authors of the paper demonstrate this when they show that disagreement occurs 34% of the time between ratings 2 and 3 and 25% of the time between ratings 3 and 4 on a 5-point scale. In other words, people aren't able to rate items consistently when they do not have a strong opinion. Usually, it is assumed that the fault lies with the user; that is, that a person is confused about what the categories imply. I disagree. I believe the rating itself is probabilistic, because opinions do change. People are not machines. They have emotions, and emotional states affect both how we interpret our environment and how we want to interpret it.
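
To make those numbers concrete, here is a minimal sketch (with invented ratings, not the paper's data) of how self-agreement and the instability of mid-scale ratings could be measured from two sessions on the same items:

```python
# Minimal sketch with invented data: estimating self-agreement and the
# instability of mid-scale ratings from two sessions on the same items.
session_1 = [5, 3, 4, 2, 3, 4, 1, 5, 3, 2]  # ratings given on day one (1-5 scale)
session_2 = [5, 2, 4, 3, 3, 4, 1, 5, 4, 2]  # same items, rated two weeks later

pairs = list(zip(session_1, session_2))
self_agreement = sum(a == b for a, b in pairs) / len(pairs)
print(f"self-agreement: {self_agreement:.0%}")

# How often does each initial rating change on the second pass?
# In the paper's data, the mid-scale ratings were the least stable,
# which is the "ambiguous middle" described above.
for rating in range(1, 6):
    subset = [(a, b) for a, b in pairs if a == rating]
    if subset:
        changed = sum(a != b for a, b in subset) / len(subset)
        print(f"first rating {rating}: changed {changed:.0%} of the time")
```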

Second, how can we measure the success of a RS when 100% accuracy is theoretically impossible? What does this even mean? This issue has come up several times with genre recognition. Until the reprint of Scanning the Dial and the accompanying criticism directed at the MIR community, some authors validated their algorithms by claiming they were more accurate than humans. As a couple of papers point out, this is nonsensical, since genres are ill-defined. Ultimately, our categorical dimensions of music are largely subjective and built over a lifetime of (often conflicting) feedback from society. Still, we can ask: what if a RS reports better accuracy than the documented limit? Does it know what people will like better than the people themselves? Of course not. It points to an error in the choice of evaluation criteria. Ultimately, a RS is measured at a moment in time. If a person likes something on Tuesday but not on Wednesday, it does not mean the user is confused. It means he liked it on Tuesday and not on Wednesday. Tastes change with mood, new information about the world around us, and so on. Future RS may be able to detect these changes and adapt their predictions.
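
As a rough illustration of why "beating" the documented limit signals an evaluation problem rather than a breakthrough, here is a toy simulation (my own sketch, not from the paper): even an oracle that knows a user's true preference cannot score above the user's own consistency, because the test labels are themselves just another noisy rating session.

```python
# Toy simulation (not from the paper): an oracle that knows the user's "true"
# preference is still capped near the user's self-agreement rate, because the
# labels it is scored against are themselves a noisy re-rating.
import random

random.seed(0)
n_items = 10_000
self_agreement = 0.90  # assumed probability the user repeats a like/dislike judgment

true_pref = [random.choice([0, 1]) for _ in range(n_items)]    # 1 = likes it
observed = [t if random.random() < self_agreement else 1 - t   # the "test set"
            for t in true_pref]

oracle_accuracy = sum(t == o for t, o in zip(true_pref, observed)) / n_items
print(f"oracle accuracy against noisy labels: {oracle_accuracy:.1%}")  # ~90%, never 100%
```

Any system that scores above that ceiling on such a test set is either fitting the noise or being evaluated on the wrong criterion.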

This again brings up a fundamental problem with RS. Every RS is built on the idea that a user will like things similar to what he liked in the past. Further, almost all RS either model the user as a single entity or require the user to maintain separate profiles for separate tastes. For example, Last.fm and Pandora cannot build a station or user profile that maintains two distinct personalities; it's up to the user to construct that himself. Netflix only allows one user profile per account. While my fiancée and I may like some of the same movies, we certainly do not like all of them. Heck, some days I want a good skeptical show like Bullshit, but on other days I may want pure magical fantasy.
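
One direction, sketched below with invented two-dimensional "content features", is for the RS itself to discover that a single account contains several distinct tastes instead of averaging everything into one profile:

```python
# Hypothetical sketch: cluster a user's liked items into several taste
# profiles instead of averaging them into a single profile vector.
# The item vectors are invented 2-D content features, purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

liked_items = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # e.g. skeptical documentaries
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],   # e.g. magical fantasy
])

single_profile = liked_items.mean(axis=0)  # the usual single-entity user model
profiles = KMeans(n_clusters=2, n_init=10, random_state=0).fit(liked_items)

print("one averaged profile:", single_profile)          # sits between the two tastes
print("two taste profiles:\n", profiles.cluster_centers_)
```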

Even with transparency, some things get muddled by small clusters of users with very pronounced behavior. For example, Netflix is currently telling me that I will like Wallace & Gromit because I like This is Spinal Tap. How are these two even connected? The first is in the category "Children & Family Suggestions" and the other is a movie about a fictitious failed hair-band. Granted, both are good, but the only relation here is crude discrimination by nationality (Wallace & Gromit is a UK show and Spinal Tap is an American movie about a British band). Apparently, all British humor is the same to Netflix users. The weirdest might be the prediction that I'll like a Talking Heads concert because I like the movie Fargo. Content analysis would obviously do a better job of filtering out these nonsense predictions.
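
As a sketch of how that filtering could work (the feature vectors and the 0.5 threshold are made up for illustration), a collaborative-filtering recommendation would only be kept if its content is reasonably similar to something the user actually liked:

```python
# Hypothetical hybrid filter: keep a collaborative-filtering recommendation
# only if its content features resemble something the user already liked.
# All vectors and the threshold are invented for illustration.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

liked = {
    "This is Spinal Tap": np.array([0.9, 0.1, 0.0]),   # mock-rockumentary
    "Fargo":              np.array([0.1, 0.0, 0.9]),   # dark crime comedy
}
cf_recommendations = {
    "Wallace & Gromit":      np.array([0.0, 0.9, 0.1]),  # family animation
    "Talking Heads concert": np.array([0.6, 0.1, 0.1]),  # concert film
}

THRESHOLD = 0.5
for title, features in cf_recommendations.items():
    similarity = max(cosine(features, v) for v in liked.values())
    verdict = "keep" if similarity >= THRESHOLD else "drop"
    print(f"{title}: max content similarity {similarity:.2f} -> {verdict}")
```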

So, in conclusion, research in recommendation systems is not reaching its performance limit. Rather, any recommender built on the idea that a user is a simple, static classifier is limited from the start. Smarter systems that can model complex user behavior, such as emotional state and judgments of quality, and that incorporate content analysis will be the building blocks of the next generation of recommendation systems.

1 comment:

zazi said...

Very good and interesting article, Jeremy. You mentioned several current topics that should be handled by modern RS.

Cheers zazi