Challenges of Collaborative Filtering

Previously, Vincent wrote about collaborative filtering here on Tech It Easy and made a really good business case on the topic of user-generated content (UGC) versus expert input. Here, I’ll go a bit deeper into the ways collaborative filtering is done and what the challenges are.

For simplicity, I have divided the ways to filter into two. There’s the Pandora way, where the approach is that a song can be explained by about 150 different genes and recommendations are (in very simple terms) other songs in the neighborhood in that multi-dimensional space. To accurately achieve this, they use expert opinions. Then there’s the Amazon, Last.fm, Netflix et al. way of clustering users with similar histories and recommending what other people in that cluster have liked.
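To make the "neighborhood in a multi-dimensional space" idea a bit more concrete, here is a minimal sketch in Python. The song names, the three attribute scores, and the `neighbors` helper are all made up for illustration (the real thing reportedly uses about 150 genes), and Euclidean distance is just one assumed way to measure closeness.

```python
import math

# Hypothetical "gene" vectors: each song scored by an expert on a few attributes.
# Three attributes are used here only to keep the example readable.
songs = {
    "song_a": [0.9, 0.1, 0.3],
    "song_b": [0.8, 0.2, 0.4],
    "song_c": [0.1, 0.9, 0.7],
}

def distance(u, v):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def neighbors(seed, radius):
    """Names of songs within `radius` of the seed song, nearest first."""
    found = [
        (distance(songs[seed], vec), name)
        for name, vec in songs.items()
        if name != seed
    ]
    return [name for dist, name in sorted(found) if dist <= radius]

print(neighbors("song_a", 0.5))  # small neighborhood
print(neighbors("song_a", 1.5))  # larger neighborhood
```

Note that no user data appears anywhere in this sketch, and that the `radius` argument is exactly the "how big should the neighborhood be" knob.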

The huge difference in these approaches is best illustrated by the fact that for the Pandora way to work, you don’t actually need any users. The expert’s role in the latter way is to somehow come up with a way to model these clusters accurately.

The latter is much more interesting, because it’s always a challenge to infer anything from user data. The Pandora way’s “only” major challenge is the assumption that people like similar things (i.e. how big the searched neighborhood should be).

The other main reason for interest in the Amazon/Netflix way is, of course, money. The $1,000,000 Netflix Prize is, simply put, a hunt for a certain RMSE (root mean squared error). When described this way, some interesting questions arise.
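For readers who haven’t met RMSE before, here is a small Python sketch of the calculation. The function name and the example ratings are my own illustration, not anything taken from the Netflix Prize data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions versus the stars a user actually gave.
predicted = [3.5, 4.0, 2.0]
actual = [4, 4, 1]
print(round(rmse(predicted, actual), 4))
```

Because the errors are squared before averaging, one prediction that is off by a full star hurts more than two that are off by half a star each.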

 

For the record, I liked Napoleon Dynamite

One question I think is important is what the theoretical limit for accuracy is in Netflix’s case. In other words, let’s assume that all users at Netflix rate, fully rationally, all and only the movies they have seen, on a cardinal scale. That’s a pretty heavy load of assumptions, and I’m pretty sure that’s not all of them. That’s why, even if Netflix could accurately forecast the data, it wouldn’t mean the forecast mirrors users’ true preferences. So what this upper limit on accuracy, or lower limit on RMSE, actually is in Netflix’s case is a good question.

 

For these reasons it shouldn’t be surprising that “just a guy in a garage”, a psychologist employing behavioral decision making assumptions instead of hard rationality, could get such good scores in the Netflix Prize. A pretty good story on that was in Wired a while ago.

For the reasons above, it’s also pretty backwards to think that the problem is fitting the data into the algorithm, so I wouldn’t really call it a “Napoleon Dynamite problem” as the NY Times did recently. But do note that the “Pragmatic Theory” team interviewed in this article, just like “just a guy in a garage”, didn’t actually invent anything new. They just realized they could use a method that people didn’t know about or had forgotten, in this case singular value decomposition. One such method is principal component analysis, which is available in pretty much any statistical software package (no, Excel doesn’t count). And yes, you could think of the Pandora way as something similar to factor analysis.
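As a rough illustration of what singular value decomposition does with ratings, the Python sketch below finds the single strongest "taste factor" in a tiny ratings matrix using power iteration. Everything here is an assumption for the sake of the example: the matrix is invented and fully observed (real rating data is mostly missing values, which needs a different fitting procedure), and the function names are mine. It is not a reconstruction of what any Netflix Prize team actually ran.

```python
import math

# Hypothetical, fully observed ratings: rows are users, columns are movies.
R = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
]

def top_singular_triple(a, iterations=200):
    """Leading singular value and vectors of `a`, via power iteration on a^T a."""
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # a v
        w = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # a^T (a v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

def rank_one_approximation(a):
    """Best rank-1 approximation of `a`: sigma * u * v^T."""
    sigma, u, v = top_singular_triple(a)
    return [[sigma * ui * vj for vj in v] for ui in u]

sigma, u, v = top_singular_triple(R)
print(round(sigma, 3))
for row in rank_one_approximation(R):
    print([round(x, 2) for x in row])
```

With only one factor, the approximation captures the dominant pattern and flattens the third user’s distinct taste, which is why practical systems keep many factors rather than one.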

One difficulty in Netflix’s case pretty much boils down to what’s in a number. Remember that in this case the teams work only on user rating data, but they are of course free to add more data from other sources as well. This doesn’t change the fact that the only user data they have are users’ ratings.

As a sidenote, I guess that one reason demographics aren’t used is legal issues. Vince pointed out that things like the “Napoleon Dynamite problem” could be solved with more data like demographics and mood. Now, usually more data means just more problems, but let’s forget about that for this discussion.

On this topic, I recently listened to a really interesting lecture about modern consumer analysis by Petri Vasara from Pöyry consulting. They had come up with a neat tool, ConsuNaut (PDF), to show what certain segments are doing at what times (compared to the old “your target audience watches TV x hours a day” way), what their mood was, etc. One “press release friendly” finding of this tool is that the Global Rush Hour, or when most of the world’s people are commuting, is at 18-19 Finnish time (UTC+2).

Anyway, back to the topic. What I also see as a problem is the actual “forecasting” part. Now, this doesn’t affect Netflix that much, because I assume that it is in their interest to get customers to rent whatever movies, even (using the out-of-fashion term) “long tail” ones. Even more so if there are inventory costs involved. What happens when a new movie enters the pool? Remember that for clustering to work, there has to be data, which is pretty sparse for a new movie. How long does it take for a new movie’s recommendations to be accurate, and how does it affect other recommendations?
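The new-movie problem is easy to see in a toy version of the "users with similar histories" approach. The Python sketch below is only an assumed, simplified scheme: the users, movies, and ratings are invented, similarity is plain cosine similarity over shared movies, and the prediction is a similarity-weighted average. Real systems are far more elaborate.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {movie: rating}.
ratings = {
    "ann": {"bond": 4, "transporter": 2, "wall_e": 5},
    "bob": {"bond": 5, "transporter": 5, "wall_e": 2},
    "cara": {"bond": 4, "wall_e": 5},
}

def similarity(u, v):
    """Cosine similarity over the movies both users have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in shared)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`.

    Returns None when nobody else has rated the movie (the new-movie case).
    """
    weighted_sum = total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        weight = similarity(user, other)
        weighted_sum += weight * their_ratings[movie]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

print(predict("cara", "transporter"))  # leans toward ann, whose history matches cara's
print(predict("cara", "new_release"))  # None: no ratings yet, nothing to go on
```

Note also that everyone here has rated the Bond movie highly, so it contributes to every pair’s similarity without telling the users apart.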

In other words, how stable is the solution for the problem? How does seeing the latest James Bond, because everyone goes to see that, change the recommendations to someone who doesn’t like other action movies? Is he recommended Transporter 2? Is a fan of Pixar movies offered Disney’s children’s animations, or worse yet, DreamWorks’ animations?

Wall-E

Not Madagascar 2

So, while the Netflix way is about fitting data and finding clusters, Pandora bases its approach on the idea that all music can be labeled accurately and objectively. The main criticism against this approach in my opinion is the post-modern philosophy of subjectiveness. Is there really one truth? (Also, how many genes does it need?)

I was attending a guest lecture by Andrzej P. Wierzbicki on “The Problem of Objective Ranking: Foundations, Approaches and Applications”, where he, for example, discussed the “dangers and errors of the subjectivist reduction of objectivity to power and money”. So he was painting with a broad brush, but there were lots of gems. He also noted that intersubjective rational ranking is difficult and full objectivity is impossible, which should demotivate the Pandora crowd a little.

So, what might on the surface look like a statistical challenge is deep down much more cross-disciplinary, and it goes all the way to our assumptions of reality. This is why it is important to keep in mind the most important thing, the end of all this: the business angle. It is not in Netflix’s or Pandora’s interest to 100% accurately predict anything; they only need to do it well enough. Well, not Netflix’s anyway. The whole reason for improving Cinematch is purely economic: they have found out that people actually rent more if the recommendations are good (enough). There’s a reason they’re offering one million dollars for a 10% improvement. I’d love to know how quickly that million pays itself back.

And, really, let’s face it. Most of the collaborative filtering things today are just toys so none of this really matters. There are a lot of assumptions and approximations and the results are good enough for the purpose. For example, iTunes’ Genius is certainly flawed and limited, but it’s way better than normal random or shuffle play. But if you want to go that extra mile, then you see that the challenge gets exponentially more difficult.

To top it off, in the end there’s the age-old problem of optimization, which is that on average, the solutions are “good”, but not “interesting” and definitely not varied. But to add “interestingness” we have to add uncertainty and that’s a whole new world of pain (Allais paradox being the least)… but risk should have its rewards, shouldn’t it?

Kari Silvennoinen is a Ph.D student at Helsinki School of Economics and is currently working on behavioral decision making topics.


Responses to “Challenges of Collaborative Filtering”

  1. I’m not a statistician, but isn’t it the case that if you study a phenomenon, in this case a user’s voting-history, long enough, then the errors average out? Playing devil’s advocate here (I know what I wrote last time), isn’t that kind of the power of many(!) users, that there are so many of them, hence you can generalise along a huge time and space dimension, as it were? (Devil’s advocate robe off: of course, you’ll need the users first.)

    In other words, I think it’s a mistake to draw generalisations from a single instance and a single user. We all have a Bambi, 2001, or Napoleon D. (I only liked the last 5 mins) in our closet; that doesn’t mean that our other ratings are wrong.

    About new movies. I don’t know if you ever look at IMDB ratings for blockbusters just as they come out? I’ve come to learn (= be trained) that these movies are usually overrated the first few weeks, after which their ratings normalise; they can’t all be in the IMDB top 100… About new movies, I’ve also come to learn that if only 50 people vote for a movie, they also usually overrate it, probably because they are the film-maker’s friends. In other words, users like me aren’t stupid and will know that new films can’t be rated perfectly through some magic voodoo.

    There was another factor I found interesting in the NYTimes-story, which is that human video-clerks make mistakes, but also bring a form of excitement to the table. People like to be surprised and even extremely bad films can lead to some pretty heated debates afterward—any experience is all about the memories, after all, and if everything is “good,” it eventually becomes mediocre.

    As far as pay-back is concerned, I think that either the Wired or the NYTimes story mentioned what kind of numbers Netflix is renting out; 1 million is peanuts to them, and it’s money well spent.

    Anyway, the more I learn and think about collab. filtering, the more I love it, because there are so many factors that these AIs don’t take into account yet.

  2. Yes, if you’re taking a sample average, the errors average out the bigger the sample is. And yes, random mistakes in users’ ratings are tolerable. And yes, your ratings are never “wrong” in a sense. The problem is when those errors are not random, like in the “new movie” overrating problem you described.

    It’s all just a number and the guys crunching the numbers are supposed to find out if someone’s just overrating movies. You also make a good point about why treating everyone’s ratings as objective is pretty flawed. But, with a large enough sample this averages out, right? =)

    The problem is that we’re not looking at aggregates, but trying to predict on individual level where averages aren’t that useful. Of course Netflix could optimize its operations based on just aggregate data, and that’s probably worth something too.

    When people realise they gave way too high scores just because they were still excited about the movie, do they go back and “correct” them? Or do they think that “well, now that I’ve seen this movie, I think I gave the wrong score to some of those movies”? Is it worth the effort to go back and re-rate movies after they’ve got new information on which to base their “more correct” rating?

    Human video clerks making mistakes but bringing excitement is the uncertainty part I briefly discussed at the end. The inverse of this is why all movies seem to be “safe bets” (sequels, generic story, etc.), because I guess it’s easy to estimate how much they reel in. For “interesting” movies, there’s always the risk. It’s like portfolio management from the movie studios’ side: with a high enough volume of safe bets, you can afford a couple of interesting movies.

    That’s a really good point that people aren’t really after good movies, but good experiences. One might wonder if it’s possible to forecast a good experience from previous movie ratings (is the experience visible in the number?). And true, if everything’s good, then everything’s mediocre, which leads to the question of whether recommendation systems should actually recommend you crap just to recalibrate your ratings. In the long run it’d be good for the user, but in the short run you’re just wasting the user’s time and money… =)

    Strictly speaking, this isn’t AI, but one approach that I didn’t cover here was AI-like methods like machine learning, neural networks, genetic algorithms, etc.

    And finally, true, there are countless factors here. One of the challenges is to find out what the important ones are.

  3. leafar says:

    One missing link : the wikipedia article

    http://en.wikipedia.org/wiki/Collaborative_filtering

    What you call the pandora way is called the content based approach.

    And I doubt a lot about this phrase: “Most of the collaborative filtering things today are just toys so none of this really matters.” Because I think you miss the point of the user experience by trying to focus too much on the theoretical approach.

    Your problem with a new movie should have ended up in a discussion about stock vs flux. Start with the content based approach, which will slowly give way to the user/user CF approach as the ratings flow in. Like in real life, interest in James Bond is based on James Bond, actors, director … etc. until we have some newspaper or friends’ reviews.

    It’s well written but for me it is too much or not enough.

    Will be happy to talk with you about it next time you write an article on the subject.

  4. Ha, Mr. Ulike, I was wondering when you were going to drop by. :) Thanks for introducing me to the concept of stock vs. flux, it is indeed a good way to look at the problem dynamically, from the no-users to some-users, to a-constant-flux-of-users perspective.

  5. Leafar, good points. What I meant by “toys” was that I’d like to see more profit-generating applications instead of just something added to a web app as an afterthought like “tag clouds” and stuff like that.

    I also tried to bring out the point that while news articles on the subject focus a lot on the programming side of things, there’s a strong foundation in, among others, psychology and statistics too, and that there are already good algorithms and one major challenge is the data in its own right.

    I did not focus that much on implementation or user experience, because those are things that I’ve no idea about. For example, how to solve stock vs. flux: I just threw those questions into the air. The approach you described sounds like a practical one.

    What do you see as the challenges in this field?
