Collaborative Filtering: An Introduction

Why we need recommendation systems

Information is growing at a rate we never imagined. Every second, the amount of data on the Web increases, and no human could process all of it. This explosion of information changes the world every day, and we surely don’t want to be ignorant of it. But the human ability to learn cannot keep up with the rate at which data grows. That’s why we build computers to do it for us.

In the ever-growing jungle of data, only a tiny part is actually important to us. Say you are a pharmacist. You don’t need information about new planets discovered in some faraway galaxy (assuming you’re not interested in astronomy), but you might want to be notified when a new paper about pharmacy is published. How do computers know which information is important to you?

In a sense, we need a computer system that takes all this information as input and selects the few pieces that suit us. A recommendation system does exactly this.

A recommendation system makes predictions about things that might be useful to you. Spotify predicts music you might enjoy, Amazon predicts items you might want to buy, Twitter shows you people you might want to follow, and so on. The results are often quite good. But how do these companies do it? How do they build such a system?

We consider one technique for building a recommendation system: collaborative filtering.

About collaborative filtering

Collaborative filtering works by finding users whose tastes or interests are similar to yours and making recommendations based on those users.

Figure: Collaborative Filtering Process

The figure above [2] shows a schematic diagram of the collaborative filtering process. A collaborative filtering algorithm takes a ratings table as input. The ratings table is typically represented as a user-item matrix, built from the preferences of existing users. There is also a distinguished user, called the active user, to whom we are trying to give recommendations. The output of the process can take two forms: a prediction or a recommendation. A prediction is a numerical value expressing the predicted rating of an item for the active user. A recommendation is a list of the items that the active user will probably like the most. The recommendation form is what we’ve seen in Spotify, Amazon, and Twitter.
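As a rough illustration of the input described above, the ratings table can be held as a nested dictionary and flattened into a user-item matrix. This is only a sketch: the users, items, 1-5 rating scale, and the `to_matrix` helper are all made up for the example.

```python
# Hypothetical ratings table: user -> {item -> rating on an assumed 1-5 scale}.
ratings = {
    "alice": {"Song A": 5.0, "Song B": 3.0},
    "bob":   {"Song A": 4.0, "Song C": 2.0},
    "carol": {"Song B": 4.0, "Song C": 5.0},
}


def to_matrix(ratings):
    """Return (users, items, matrix).

    matrix[i][j] is the rating users[i] gave items[j],
    or None when that user has not rated that item.
    """
    users = sorted(ratings)
    items = sorted({item for prefs in ratings.values() for item in prefs})
    matrix = [[ratings[user].get(item) for item in items] for user in users]
    return users, items, matrix


users, items, matrix = to_matrix(ratings)

# The active user is the one we want recommendations for.
active_user = "alice"

# Items the active user has not rated are the candidates for prediction.
unrated = [item for item in items if item not in ratings[active_user]]
```

Most entries in a real ratings matrix are empty, because each user rates only a small fraction of the items. That is why the missing cells are kept explicit here instead of being filled with zeros.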

There are several ways of deciding which people have similar interests and of combining their choices into a list. We will cover one workflow for building a recommendation system with collaborative filtering, called user-based collaborative filtering. Another useful technique to consider is item-based collaborative filtering.

  1. Collecting Preferences

The first step is to collect users and their preferences. A user’s preferences can be collected from explicit observations, such as ratings given to items, or inferred from implicit observations, such as the user’s behavior.

  2. Finding and Ranking Similar Users

After collecting the data, we need to find other users whose taste is similar to the active user’s. We do this by comparing the active user with every other user and calculating a similarity score. There are a few ways to calculate a similarity score, such as Euclidean distance, Pearson correlation, or the Tanimoto score.

Now that we have a similarity score for every other user, we select the few users with the highest scores. This can be done by simply sorting the scores.
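The comparison and ranking step might be sketched as follows. The data and function names are hypothetical. The Euclidean score uses 1 / (1 + distance), one common way to turn a distance into a similarity, and the Pearson score is computed only over items both users have rated. The Tanimoto score is not shown.

```python
from math import sqrt

# Hypothetical ratings table: user -> {item -> rating}.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 3.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def euclidean_similarity(ratings, u, v):
    """Similarity in (0, 1]; 1.0 means identical ratings on shared items."""
    shared = [item for item in ratings[u] if item in ratings[v]]
    if not shared:
        return 0.0  # nothing in common to compare
    distance = sqrt(sum((ratings[u][i] - ratings[v][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)


def pearson_similarity(ratings, u, v):
    """Pearson correlation in [-1, 1] over items rated by both users."""
    shared = [item for item in ratings[u] if item in ratings[v]]
    n = len(shared)
    if n < 2:
        return 0.0  # correlation is undefined for fewer than two points
    xs = [ratings[u][i] for i in shared]
    ys = [ratings[v][i] for i in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who gives every item the same rating
    return cov / sqrt(var_x * var_y)


def top_matches(ratings, active_user, n=2, similarity=pearson_similarity):
    """Return the n users most similar to active_user as (score, user) pairs."""
    scores = [
        (similarity(ratings, active_user, other), other)
        for other in ratings
        if other != active_user
    ]
    scores.sort(reverse=True)  # highest similarity first
    return scores[:n]


neighbours = top_matches(ratings, "alice")
```

In this toy data, bob rates every shared item exactly one point lower than alice, so Pearson correlation treats the two as perfectly similar while the Euclidean score does not. Which behavior is preferable depends on whether consistent offsets between users should count as disagreement.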

  3. Recommending Items

Recommending items is not done by simply looking for an item that the active user hasn’t seen yet. That would be too permissive. Instead, we score each item using the similarity scores obtained previously, so that ratings from more similar users count for more.
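One way to score items is a similarity-weighted average of the other users’ ratings. The sketch below assumes the similarity scores have already been computed. The values in `similarities`, the sample ratings, and the `recommend` function are illustrative assumptions, not a definitive implementation.

```python
# Hypothetical ratings table: user -> {item -> rating}.
ratings = {
    "alice": {"A": 5.0, "B": 3.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 4.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "C": 2.0},
}

# Assumed similarity of each other user to the active user "alice".
similarities = {"bob": 0.9, "carol": 0.3}


def recommend(ratings, similarities, active_user):
    """Return (item, predicted_rating) pairs, best first.

    Each prediction is the average of the other users' ratings for the item,
    weighted by how similar those users are to the active user.
    """
    totals = {}       # item -> sum of similarity * rating
    weight_sums = {}  # item -> sum of similarities of users who rated it
    for other, sim in similarities.items():
        if sim <= 0:
            continue  # ignore users who are dissimilar to the active user
        for item, rating in ratings[other].items():
            if item in ratings[active_user]:
                continue  # only score items the active user has not rated
            totals[item] = totals.get(item, 0.0) + sim * rating
            weight_sums[item] = weight_sums.get(item, 0.0) + sim
    predictions = {item: totals[item] / weight_sums[item] for item in totals}
    return sorted(predictions.items(), key=lambda pair: pair[1], reverse=True)


recommendations = recommend(ratings, similarities, "alice")
```

Here item D is predicted at about 5.0, since only bob rated it, and item C at about 3.5, since bob’s 4 outweighs carol’s 2. Each predicted value corresponds to the prediction output described earlier, and the sorted list corresponds to the recommendation output. Dividing by the sum of similarities keeps an item from ranking highly just because many users rated it.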

Check out the references for a more detailed implementation of this technique [1] and for the item-based approach, which is reported to perform better [2] than the user-based one explained here.

What we need for a better recommendation system: challenges and innovations

Collaborative filtering is one of the most promising technologies for recommendation systems. However, the technique must overcome many challenges. Data sparsity and scalability are two examples of problems faced by collaborative filtering, especially the user-based variant. As the number of users and items grows, both the accuracy and the performance of the technique degrade.

Another problem faced by collaborative filtering is shilling attacks (profile injection attacks). These are caused by users who manipulate ratings, for example by giving positive ratings to their own items and negative ratings to their competitors’ items. Robust collaborative filtering tries to tackle this problem, typically by building a spam user detection model. However, this technique is still an active research field and doesn’t have any major applications yet.

References

[1] Toby Segaran. Programming Collective Intelligence: Building Smart Web 2.0 Applications. O’Reilly.

[2] Sarwar B., Karypis G., Konstan J., and Riedl J. Item-Based Collaborative Filtering Recommendation Algorithms.

[3] https://en.wikipedia.org/wiki/Collaborative_filtering

[4] https://en.wikipedia.org/wiki/Robust_collaborative_filtering

[5] http://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify/10-Collaborative_Filtering10HeyI_like_tracks_P