Tutorial on a Recommendation system

Before starting the post I would like to mention that here I am no expert on recommendation systems and haven’t really read much about them. The motivation for writing a post on recommendation systems came from the idea of trying to be able to outline one based on the knowledge I have and concepts (like LDA) that I have read recently. Thus this is more like a learning exercise for me than a tutorial. Hence moving forward I will really appreciate the errors being highlighted by readers. So let’s get started.

Problem Definition: The plan is to design a recommendation system. Such a system should be capable of recommending items to users based on the evidence/information they provide. Suppose I tell the system that I bought books on statistics and machine learning (as on Amazon), it should recommend me more of machine learning or related books. The only training data we have is the product liking list of let’s say M users. Each user U_i has a product count p_{ij} (need not be integer) for product j, which can be interpreted as some statistics about a specific product. Adding to this, one of my friend Bill Lennon mentioned that these statistics can also be looked up as user feedback and be grouped into explicit and implicit. These statistics could be any one of the following:

1. Explicit Feedback: No of times they have bought it; no of likes.

2. Implicit: No of page visits; amount of time spent on a particular page.

This information will look like a matrix


Approach: My plan is to outline a general idea/intuition that behind the solution. This will be followed by discussing a naive solution, then a better one and finally concluded by a probabilistic approach that seems to solve the problem the best. I have tried to motivate the final approach with apt reasons, both technical and non-technical.

Intuition: One way of thing about this problem is through an example. Suppose I like product p_1, let’s say a fiction book, then it’s somewhat likely that I might like to try another fiction book p_2. So we would like to somehow capture the correlation between different products and it is very likely that two products that are highly correlated are close together. The methods that are discussed follow this intuition very closely, if not strictly.

Approach 1(Naive Approach): The first solution could employ simple methods like recommending products based on nearest neighbor. Suppose we project all the users on a Q dimensional space (assuming Q products), then let’s say a new user comes in and we just know that he has bought product p_j. In this case we can find his closest distance wrt to j^(th) dimension and then suggest items bought by closest users in training set. It is important to note here that for users in the training set we can have a list a of more products than those in testing set in which case we just measure the distance wrt few dimensions.

Limitation: A possible problem with this method is that it is unable to capture the correlations between different products and this might lead to errors. For instance, if we find closest distance wrt to the dimension containing a fiction book by Jeffery Archer, then this will find closest distance wrt to only this book. In such a case the system will recommend books only from those users who have bought this book without realizing that the same user might like books from other fiction authors. Thus there should be someway of capturing the information that this book belongs to a certain class (‘Fiction’ in this case) and a system taking care of this information can provide better recommendations.

Solution to limitation: Need a method that can discover common classes among books.

Note: I would like to mention again that I am trying to explain a naive system that won’t be using any matrix completion strategy. We can use it in different scenarios. In one case the users in the training set might have more number of products than those in testing, in which case we can still exploit correlations and suggest new products to user in test set. In second case a new user might not be aware of a product that exists in the catalog and thus can be made a new suggestion. Pl. remind me about more cases.

Here it will would also be interesting to highlight the difference between collaborative filtering and content based recommendation systems (courtesy Bill Lennon). In collaborative filtering we are only concerned about user behavior and not about the properties of items. On the other hand in recommendation systems it is important to exploit relationships between different products. Although two might be closely related I am sure there would be more subtle differences.

2. Approach 2 (Clustering): Since we concluded the last point with the idea that we require a method that can discover common classes among products. So the next approach that comes into mind is using clustering. Clustering can be understood well with the following graphical model (explanation below):


Although the above diagram might confound people unfamiliar with graphical models, it is fairly easy to understand. The above graph highlights the process of generating a list of products for a user in a mixture (clustering) model. The various steps and variables are:

1. C denotes the class from which current user is sampled. First a class C_i is sampled for user U_i from the prior distribution of class variable C_i \sim Pr(C_i).

2. Once we know the class of current example, we can sample the product liking list from the conditional distribution P_{ij} \sim Pr(P_{ij}|C_i).

It is important to note two things here. Firstly the N in the graph above means that the generation process is repeated N times. Thus the products are sampled independently from it’s conditional distribution.  Secondly there seems to be an ambiguity between the meaning of variables N and Q. Q represents the number of products but N represents the number of observations that have been taken from a user. For instance we can consider Q as number of bins of an histogram while N is the number of observations over which this histogram is build.

What did we discover: Moving ahead let’s try to get an idea of what kind of  clusters we will be discovering here. The clusters will look like as those shown below:


Limitation: This doesn’t look good for our application since we can expect users who might have a different combination of product liking and may not fit into any of these clusters. In such a case the recommendations will be erroneous. Although one option could have been to increase the number of clusters and discover clusters that correspond to only users liking certain kind of products like books. But it can be argued here that a user can have different kinds of product liking and a clustering model cannot allow so since a data-point can fall into only one cluster.

Next Step : The next step should be to identify approach that (1) allows users to lie in different clusters, and (2) somehow discover the common products/themes that are present in the training data we have. This two assumptions seem to be more apt to this problem since when a new user comes in and has a liking only for books, we can make him relevant recommendations only from the products lying in theme ‘books’. One such approach is latent topic models (LDA) that lie in the general class of models called Factor Models (like PCA, SVD, PLSA). Pl. note I shall be using the term themes and topics interchangeably in next section.

3. Approach 3 (Latent Topic Model):  Now we would like to select an approach that could help us in discovering the themes/topics present in the products. We shall use factor models for this case. In simple words factor models are used to find the underlying factors (or subspaces or manifolds) that are generating the data. It is usually the case the these factors lie in a lower dimensional subspace than data and it is hoped that the identity of these factors will yield meaningful information about the data. Factor models can also be considered as generalization of mixture models since they allow a data-point to be explained by multiple factors (with even sparse loadings).

In this case we shall be using LDA that has a lot of preferred properties and is a probabilstic version of factor models. For this problem we hope that the underlying factors will correspond to common themes/topics present inside the products. For instance given a user’s product liking list, we can obtain his liking for different product topics as shown below:


This seems to be advantageous for our application since knowing the kind of preferences a user has, we can recommend him new items from users having similar preferences. Now we shall invoke the great power of probability (and mathematics of-course) to understand the solution more clearly.

We shall follow the following probability model (generative process) for LDA:


The generative process for a user is explained below:

1. Random variable T represents a topic (or theme) wise probability for a particluar user. This is similar to the one shown in the previous figure that highlights the hidden product themes for a particular user. This is sampled from a Dirichlet distribution i.e. T \sim Dirc(\alpha) (parameters are not shown). I will try to explain the reason for various distributions later (may be in separate post). Note here that the number of topics N_{topics} are usually less than the number of products Q, which is obvious.

2. Once we know the underlying topic distribution T_i for a particular user U_i, we will repeat the following N times:

2.1. Sample a topic Z_{ij} from distribution Z_{ij} \sim Pr(Z_{ij}/T_i). For instance we get topic ‘books’ from this.

2.2 We next select a product P_{ij} from this topic by sampling from distribution P_{ij} \sim Pr(P_{ij}/Z_{ij}).

The above generation process assumes a hidden/latent (not visible during observations) topic distribution for each user which is generating topics and product list. This way we can hope for or capture various themes that a user is interested in as we wanted to do earlier. An earlier diagram is re-drawn to again stress on what we mean by topic/theme distribution:


I am avoiding details of training and inference procedure as such. We are interested in two things for our recommendation system:

1. Topic probability: This is the probability of topics given the product liking list of a particular user Pr(T_i/P_i). Using this we can get an idea of topics that the user in interested in.

2. Idea for recommending items to users:

1. Probability of a product given topic probability for a user: This could be used to recommend new products to user $U_i$ depending his discovered topic distribution T_i. A score could be obtain for product P_j by using probability Pr(P_j/T_i).

2. We could also use the probability Pr(P_j/P_{ik}) to estimate a score of recommending product P_j to a user knowing that he/she has like product $P_{ik}$. Note here that we are only interested in the likelihood of recommending an item based on just one item here.

I would also like to mention here the fact that this is the beauty I like about a probabilistic approach since it makes it possible to do different kinds of inferences).

I would like to point here that the above probabilities can be easily found using sum-product rules of probability along with independence properties implied by the graph.

Caveat: An important point to notice here is that the inference and learning is not trivial for such models and might require more work. However I have read that there are good implementations of LDA available.



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s