# Paper accepted in CVPR 2016 & my Thesis

I am elated to announce that my paper, co-authored with my collaborator Dr. Gaurav Sharma, has been accepted at CVPR'16. This work offers an apt end to my PhD thesis topic and brings me back to where I started, but with improved knowledge and insights.

In a few words, my PhD thesis (still to be defended) focuses on action classification in videos and is titled 'Dynamic Space-Time Volumes for Predicting Natural Expressions'. To explain this idea, let me first describe the standard/classic paradigm for video classification. In the classical vision approach to action classification, features are first extracted from each frame or from local space-time volumes in a video, and are then combined into a single feature (akin to Bag of Words or temporal pooling). An example is shown below (borrowed from my thesis talk).

Although such a paradigm works well for videos that are either pre-segmented or composed of a single uniform action, it may not work well for complex videos. Complex videos are usually composed of sub-events, are unsegmented, and may be recorded with phone cameras (e.g., those on YouTube) or during a clinical session. An ideal framework to classify such videos would have a human perform a perfect segmentation, followed by the bag-of-words approach. However, this would be expensive in both time and resources. In my thesis I approach this problem with weakly supervised learning, where the annotation labels are weak. For example, labels for a video marked as showing a 'pain' expression are weak because they give no information about where the pain occurs or how many frames in the video show this action. I also refer to this paradigm as a 'localized space-time' approach, since one deals with video segments locally instead of globally.

My first paper at IEEE AFGR 2013, titled 'Weakly Supervised Pain Localization using Multiple Instance Learning' (PS: see videos), tackled this by representing each video as a bag of multi-scale segments and then using Multiple Instance Learning (MIL) to automatically mine discriminative segments in pain videos. In simple words, this algorithm first identifies which parts of the video correspond to the label (inference) and then learns the model (learning). The paper was well received ('Best Student Paper Honorable Mention Award' at AFGR), and I decided to pursue this direction as my thesis topic.

Bag of segments representation
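To make the mine-then-learn loop concrete, here is a hypothetical MIL-style toy in NumPy. This is not the paper's actual algorithm: the data, the alternation between an inference step (pick the max-scoring segment per positive bag) and a learning step (here a simple least-squares linear model), and all parameters are my illustrative choices:

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy bags: each "video" is a bag of 5 two-dimensional segment features.
# Positive bags hide one discriminative segment near (3, 3); everything
# else (and all negative-bag segments) is noise near the origin.
def make_bag(positive):
    bag = rng.randn(5, 2)
    if positive:
        bag[rng.randint(5)] += np.array([3.0, 3.0])
    return bag

pos_bags = [make_bag(True) for _ in range(20)]
neg_bags = [make_bag(False) for _ in range(20)]

# Initialize the template as a difference of means.
w = np.vstack(pos_bags).mean(axis=0) - np.vstack(neg_bags).mean(axis=0)

for _ in range(10):
    # Inference: select the max-scoring segment from each positive bag.
    pos_inst = np.array([bag[np.argmax(bag @ w)] for bag in pos_bags])
    neg_inst = np.vstack(neg_bags)
    X = np.vstack([pos_inst, neg_inst])
    y = np.concatenate([np.ones(len(pos_inst)), -np.ones(len(neg_inst))])
    # Learning: refit a linear scoring function on the selected instances.
    w = np.linalg.lstsq(X, y, rcond=None)[0]

# A bag (video) is scored by its best segment.
bag_score = lambda bag: float(np.max(bag @ w))
```

After a few alternations, positive bags score clearly higher than negative ones, because the model has "localized" the discriminative segments on its own.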

In the previous work, no temporal information was explicitly encoded in the model. The follow-up work tried to solve the same problem, but using Hidden Markov Models (HMMs), which can encode temporal information, instead of MIL. The idea was to learn a single HMM for each video. Each HMM segments the frames into sub-events, such as neutral and apex in a smile video. These HMMs were then used for classification (PS: skipping details) and took into account the local structure present in each video sequence. Here is a diagram of Exemplar-HMM.

The advantages of this algorithm were that (i) it could explicitly model temporal information by modeling transitions between different events (such as neutral -> apex), and (ii) the kernel comparing HMMs facilitated comparison between different sub-events. A particular drawback was that the model wasn't learned end to end, so its generalization performance was somewhat low, and it couldn't work with high-dimensional features (HMMs are generative).
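The frame-to-sub-event segmentation an HMM produces can be illustrated with standard Viterbi decoding. This is a generic NumPy sketch with made-up parameters (two states, neutral vs. apex, binary intensity observations), not the thesis model:

```python
import numpy as np

# Toy 2-state HMM over per-frame "expression intensity" observations:
# state 0 = neutral (low intensity likely), state 1 = apex (high likely).
# All parameters here are invented for illustration.
log_pi = np.log([0.9, 0.1])                  # initial state probabilities
log_A = np.log([[0.8, 0.2], [0.2, 0.8]])     # transition matrix
log_B = np.log([[0.8, 0.2], [0.2, 0.8]])     # P(obs | state), obs in {low, high}

obs = np.array([0, 0, 0, 1, 1, 1, 0, 0])     # low/high intensity per frame

T, S = len(obs), 2
delta = np.zeros((T, S))                     # best log-prob ending in each state
psi = np.zeros((T, S), dtype=int)            # backpointers
delta[0] = log_pi + log_B[:, obs[0]]
for t in range(1, T):
    scores = delta[t - 1][:, None] + log_A   # scores[i, j]: reach j from i
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + log_B[:, obs[t]]

# Backtrack the most likely state sequence: the sub-event segmentation.
states = np.zeros(T, dtype=int)
states[-1] = delta[-1].argmax()
for t in range(T - 2, -1, -1):
    states[t] = psi[t + 1, states[t + 1]]
print(states)  # [0 0 0 1 1 1 0 0]
```

The decoded sequence carves the frames into neutral and apex runs, which is exactly the kind of segmentation each per-video HMM provides.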

Coming back to my thesis title, 'Dynamic Space-Time Volumes for Predicting Natural Expressions', let me now justify the name. The dynamic component comes from the fact that at run time we try both to identify the events in a video and to model their temporal structure. The work accepted at CVPR combines the best of both previous works by explicitly:

1. Learning multiple discriminative templates, unlike MIL, where only a single template is learned.
2. Modeling the temporal ordering between such sub-events.
3. Combining the above two into the classification score.

This model is named LOMo (Local Ordinal Model), as it explicitly models the ordering between sub-events. I have shown an example below where the states neutral, onset, and apex are ordered (identified automatically, without any additional training data).
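To give a flavor of this kind of scoring, here is a toy sketch assuming a hypothetical formulation (not the actual LOMo objective from the paper): template responses over segments, plus an ordinal bonus when the templates' best matches occur in the expected temporal order:

```python
import numpy as np

# Hypothetical illustration of the LOMo idea (not the paper's model):
# score a video by (i) the responses of several sub-event templates and
# (ii) an ordinal bonus when their best matches appear in expected order.
def lomo_style_score(segments, templates, order_weight=1.0):
    """segments: (T, d) per-segment features; templates: list of (d,)
    vectors in the expected temporal order (e.g. neutral, onset, apex)."""
    responses = segments @ np.stack(templates).T   # (T, n_templates)
    best_pos = responses.argmax(axis=0)            # where each template fires
    appearance = responses.max(axis=0).sum()       # template-match term
    # Ordinal term: +1 for each consecutive pair in the right order.
    ordinal = sum(best_pos[k] < best_pos[k + 1]
                  for k in range(len(templates) - 1))
    return float(appearance) + order_weight * float(ordinal)

# Toy example: 3 one-hot "templates" over 6 one-hot segments.
segs = np.eye(3)[[0, 0, 1, 1, 2, 2]]   # neutral, neutral, onset, onset, apex, apex
tmpl = [np.eye(3)[i] for i in range(3)]
in_order = lomo_style_score(segs, tmpl)
shuffled = lomo_style_score(segs[::-1].copy(), tmpl)  # reversed video loses the bonus
```

With identical appearance evidence, the correctly ordered video scores higher than the reversed one, which is the point of adding the ordinal term.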

This algorithm is not only more expressive but also outperforms MIL. I shall not go into the details here and will let you explore them once the camera-ready version is prepared. The algorithm can also be connected to the Actom Sequence Model and to HCRFs; I shall talk about these in the future.

To be truthful, I want to thank Gaurav, who was the real push behind this work. With his vision and expertise, we were able to bring this work to reality in less than a month (backed, of course, by years of experience!). I wish him good luck in his next job as an Assistant Professor at IIT Kanpur.

# Moving from MATLAB to Python

It all started with picking a tool for learning and doing some Deep Learning (DL) before my internship started. After some discussion, I decided to go with Theano, for its simplicity and greater scope for modification compared to Caffe. This meant learning Python, since Theano's API is written for Python. Actually, let me back up a bit: I now recollect that I started learning Python while coding for the Stanford DL course (http://cs231n.github.io/) (an excellent course!). Contrary to my expectations, Python was easy to pick up, for the following reasons:

1. My tons of experience coding in MATLAB surely helped me relate things between the two languages.
2. Python has excellent libraries/modules (NumPy etc.).
3. The most interesting part of Python is that it is an actual programming language; if you have ever done some data structures, you will know why I say so.

Now let's fast-forward a few months to my internship, where one of my colleagues suggested using Python instead of MATLAB when I had trouble finding suitable MATLAB image processing toolboxes. I took it to heart and converted all my code to Python, along with using OpenCV. The transition was easier than expected, and the speed at which Python operates impressed me. Here is what I like about it:

1. Python has excellent toolboxes (NumPy, Matplotlib, scikit-learn, etc.) that, I feel, implement more functionality than MATLAB and give you more power.
2. Python has excellent string-handling capabilities; parsing .csv files is easier than in MATLAB.
3. Python lists are an excellent data structure for doing tons of stuff. Of course, there are other standard data structures too, such as dictionaries, tuples, and sets.
4. For starters, Python has a MATLAB-like IDE called Spyder. The entire Python environment can easily be set up with the Anaconda package, which is available for all platforms.
5. Python starts faster than MATLAB, as it doesn't load all modules up front the way MATLAB does.
6. Python is FREE! Though this might not seem like a huge benefit during grad school, it is a huge plus while working at an industrial lab. Moreover, you are never limited by the number of available licenses.
7. Python has nice saving and loading tools. The most popular is pickle, but I would suggest using h5py, which is great at handling huge amounts of data: it lets you save huge NumPy arrays on disk and load only part of them into memory with ease.
8. Following on from the above, I feel Python is well suited to the Big Data domain.
9. There are excellent DL libraries/APIs in Python.
10. It is easier to convert Python code to C++ than MATLAB code.
11. A big community of open-source developers has built excellent modules for Python.
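To illustrate point 7, here is a minimal h5py sketch (assuming h5py and NumPy are installed; the file name is arbitrary) that saves a large array and later reads back only a slice:

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.gettempdir(), 'features.h5')
arr = np.arange(1000 * 1000, dtype=np.float32).reshape(1000, 1000)

# Save a big array to disk.
with h5py.File(path, 'w') as f:
    f.create_dataset('feats', data=arr)

# Later: open the file and pull only the rows you need into memory.
with h5py.File(path, 'r') as f:
    chunk = f['feats'][10:20]   # reads just these 10 rows from disk
print(chunk.shape)              # (10, 1000)
```

Slicing a dataset like this never loads the full array, which is what makes h5py pleasant for feature matrices that don't fit in RAM.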

Anyway, I could keep talking about the benefits, but in a nutshell: Python is great. This doesn't mean you should ditch MATLAB; you can always use it for very small-scale testing or for post-analysis of results generated by Python. I will be happy to respond to anyone (in the comments or by email: karan.sikka1@gmail.com) about transitioning from MATLAB to Python. It took me some convincing, but I wish I had done it earlier. I will write more about these tools in the future.

# A humble attempt to re-start my blog

Hello to anyone reading this in the vast internet space. I feel that I have many things to share and haven't been able to do so in some time. Looking back at my blog, I noticed that I haven't blogged for the past 3 years, which is pushing me to re-start it in 2016. I am currently in the final year of my PhD and interning at SRI International, Princeton. The past few months have been pretty interesting in terms of how much I have learned and been exposed to, especially Deep Learning, and in shifting toward a more computer-science way of solving things. A big push to my learning has been reading posts by other people who take the time to share their knowledge. I hope I will not let this spurt of enthusiasm go in vain.

# Running Ubuntu 12.04.2 LTS or higher with NVIDIA card

So before starting on this new project, I decided to upgrade my Ubuntu OS. Everything went smoothly; however, after installation I found that I could not reduce the resolution of my monitor.

Sounds strange? Sure it does, since most people worry about not being able to increase the resolution. The thing with my lab PC is that it has a huge monitor, and it is impossible for me to work at the highest resolution all the time. So this was a problem.

Since I am not a Linux geek, I decided to google my way through. Many people had reported this problem, and it seemed legitimate. I tried updating the NVIDIA drivers using Jockey (the built-in Ubuntu hardware manager), but with no success. Finally, I decided to format everything and have a fresh start.

On the suggestion of a few friends, I found out that Ubuntu 12.04 is a Long-Term Support (LTS) release, which receives updates for several years, unlike the regular releases (the Ubuntu folks are always trying something new, so those are supported for only about a year). So I decided to install 12.04 LTS and happily burned a CD.

However, booting from the CD was not working, and I had to edit the GRUB commands and add 'nomodeset', which lets the system boot in low-graphics mode. I was then able to install Ubuntu, after which I removed the default drivers Ubuntu uses (again, they don't work with the NVIDIA card). Fingers crossed, I installed the NVIDIA drivers and prayed. However, the problem remained the same.

I am getting impatient writing the whole story; in short, it took me more than a day to figure things out. The solution is:

1. Download and install the latest drivers from the NVIDIA Unix driver website.

2. Before installing, remove (rather, purge) any other NVIDIA drivers (including nvidia-settings and everything else).

3. After installation, run nvidia-settings and you can change the resolution.

Please note that the 2nd step is very important; otherwise you will end up with one version of the drivers and a different version of nvidia-settings.
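The sequence I followed looked roughly like the commands below. Treat this as a sketch, not exact commands: package and installer names vary with the driver version and Ubuntu release.

```shell
# 1. Purge every existing NVIDIA package first (drivers AND nvidia-settings).
sudo apt-get remove --purge 'nvidia*'

# 2. Stop the display manager before running the installer
#    (e.g. from a virtual console, Ctrl+Alt+F1).
sudo service lightdm stop

# 3. Run the driver installer downloaded from the NVIDIA Unix driver page
#    (file name depends on the version you downloaded).
sudo sh NVIDIA-Linux-x86_64-*.run

# 4. After rebooting, change the resolution from the driver's own tool.
nvidia-settings
```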

I am not a great supporter of Apple, but one place where it sweeps most of the points is in having a stable OS with a Unix shell and an amazing GUI. I hope Ubuntu will be able to cater to all types of users in due time. It has come a long way anyway.

Enjoy

Karan

# Some statistics for Face and Gestures 2013

The website for FG'13 lists the acceptance rate for this year as around 35%. Naturally, this should be the overall acceptance rate. Since some numbers are missing, I decided to do some simple mathematics here. The numbers are:

Total number of POSTER papers (from the program details) = 36

Total number of Oral presentations = 68

Total number of submitted papers (the 104 accepted papers are 35% of the total) ≈ 297

Acceptance rate for Orals ≈ 68/297 ≈ 23%

Acceptance rate for POSTERs ≈ 36/297 ≈ 12%
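The back-of-the-envelope arithmetic can be checked in a few lines, starting from the two accepted-paper counts and the 35% overall rate:

```python
accepted_posters = 36
accepted_orals = 68
overall_rate = 0.35

# Total submissions implied by the 35% overall acceptance rate.
total_submissions = (accepted_posters + accepted_orals) / overall_rate
print(round(total_submissions))                            # 297
print(round(100 * accepted_orals / total_submissions))     # 23 (% oral rate)
print(round(100 * accepted_posters / total_submissions))   # 12 (% poster rate)
```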

Please note that most of these numbers are approximations and may differ from the official statistics. I will try to write a more detailed review once I have had a chance to go to the conference and see things for myself.

Karan

# Tutorial on a Recommendation system

Before starting the post, I would like to mention that I am no expert on recommendation systems and haven't really read much about them. The motivation for writing this post came from the idea of trying to outline one based on the knowledge I have and the concepts (like LDA) that I have read about recently. Thus, this is more of a learning exercise for me than a tutorial. Moving forward, I will really appreciate readers highlighting any errors. So let's get started.

Problem Definition: The plan is to design a recommendation system. Such a system should be capable of recommending items to users based on the evidence/information they provide. Suppose I tell the system that I bought books on statistics and machine learning (as on Amazon); it should recommend more machine learning or related books. The only training data we have is the product-liking list of, let's say, $M$ users. Each user $U_i$ has a product count $p_{ij}$ (which need not be an integer) for product $j$, which can be interpreted as some statistic about that product. Adding to this, a friend of mine, Bill Lennon, mentioned that these statistics can also be viewed as user feedback and grouped into explicit and implicit. They could be any of the following:

1. Explicit feedback: number of times they have bought a product; number of likes.

2. Implicit feedback: number of page visits; amount of time spent on a particular page.

This information can be arranged as a user-by-product matrix.
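As a toy illustration (all numbers made up), such a matrix in NumPy could look like this, with rows as users and columns as products:

```python
import numpy as np

products = ['ml_book', 'fiction_book', 'camera', 'lens']
# counts[i, j]: how often user i interacted with product j (toy numbers)
counts = np.array([
    [5, 0, 0, 1],   # user 0: mostly ML books
    [0, 4, 0, 0],   # user 1: fiction
    [0, 0, 3, 2],   # user 2: photography gear
])
M, Q = counts.shape   # M users, Q products
```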

Approach: My plan is to outline the general idea/intuition behind the solution. This will be followed by a naive solution, then a better one, and finally a probabilistic approach that seems to solve the problem best. I have tried to motivate the final approach with apt reasons, both technical and non-technical.

Intuition: One way of thinking about this problem is through an example. Suppose I like product $p_1$, let's say a fiction book; then it is somewhat likely that I might like to try another fiction book $p_2$. So we would like to somehow capture the correlation between different products, and it is very likely that two highly correlated products are close together. The methods discussed below follow this intuition closely, if not strictly.

Approach 1 (Naive Approach): The first solution could employ a simple method such as recommending products based on nearest neighbors. Suppose we project all users into a $Q$-dimensional space (assuming $Q$ products). When a new user comes in and we only know that he has bought product $p_j$, we can find the closest training users with respect to the $j^{th}$ dimension and then suggest the items those users have bought. Note that users in the training set may have a longer product list than those in the test set, in which case we just measure the distance with respect to the few known dimensions.
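A minimal sketch of this naive scheme, on a made-up count matrix (the data, the single-dimension distance, and the scoring are all illustrative choices):

```python
import numpy as np

# Toy training matrix: rows = users, columns = products (counts).
train = np.array([
    [5, 4, 0, 0],   # user 0: fiction books p0, p1
    [4, 5, 1, 0],   # user 1: fiction books too
    [0, 0, 5, 4],   # user 2: cameras p2, p3
], dtype=float)

def recommend_naive(j, known_count, top=2):
    """A new user is only known to have bought product j `known_count`
    times: find the training user closest along dimension j and return
    that user's other top products."""
    nearest = np.argmin(np.abs(train[:, j] - known_count))
    scores = train[nearest].copy()
    scores[j] = -np.inf                  # don't re-recommend product j
    return np.argsort(scores)[::-1][:top].tolist()

recs = recommend_naive(j=0, known_count=5)   # fiction buyer gets user 0's list
```

For a new user who bought the fiction book `p0`, the nearest neighbor is user 0, so the top suggestion is that user's other fiction book `p1`.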

Limitation: A possible problem with this method is that it cannot capture correlations between different products, which may lead to errors. For instance, if we find the closest distance with respect to the dimension corresponding to a fiction book by Jeffrey Archer, then the system will recommend books only from users who have bought this particular book, without realizing that the same user might like books by other fiction authors. There should be some way of capturing the information that this book belongs to a certain class ('Fiction' in this case); a system that takes this information into account can provide better recommendations.

Solution to the limitation: we need a method that can discover common classes among products (e.g., books).

Note: I would like to mention again that I am explaining a naive system that does not use any matrix-completion strategy. We can use it in different scenarios. In one case, users in the training set might have more products than those in the test set, in which case we can still exploit correlations and suggest new products to a test user. In another case, a new user might not be aware of a product that exists in the catalog and can thus be given a new suggestion. Please remind me of more cases.

Here it is also worth highlighting the difference between collaborative filtering and content-based recommendation systems (courtesy of Bill Lennon). In collaborative filtering, we are only concerned with user behavior, not with the properties of the items. In content-based systems, on the other hand, it is important to exploit relationships between different products. Although the two are closely related, I am sure there are more subtle differences.

Approach 2 (Clustering): We concluded the last section with the idea that we need a method that can discover common classes among products. The next approach that comes to mind is clustering. Clustering can be understood well with the following graphical model (explanation below):

Although the above diagram might confound people unfamiliar with graphical models, it is fairly easy to understand. It illustrates the process of generating a list of products for a user in a mixture (clustering) model. The various steps and variables are:

1. $C$ denotes the class from which the current user is sampled. First, a class $C_i$ is sampled for user $U_i$ from the prior distribution of the class variable, $C_i \sim Pr(C_i)$.

2. Once we know the class of the current example, we can sample the product-liking list from the conditional distribution $P_{ij} \sim Pr(P_{ij}|C_i)$.

Two things are important to note here. First, the $N$ in the graph above means that the generation process is repeated $N$ times; thus the products are sampled independently from the conditional distribution. Second, there may seem to be an ambiguity between the meanings of $N$ and $Q$: $Q$ is the number of products, while $N$ is the number of observations taken from a user. For instance, we can think of $Q$ as the number of bins of a histogram, while $N$ is the number of observations over which the histogram is built.
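The generative process just described can be sketched in a few lines (all probabilities below are toy values I made up):

```python
import numpy as np

rng = np.random.RandomState(0)

Q = 4                                  # number of products
class_prior = np.array([0.5, 0.5])     # Pr(C)
# Pr(product | class): class 0 likes books (p0, p1), class 1 cameras (p2, p3)
prod_given_class = np.array([
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
])

def sample_user(N=20):
    c = rng.choice(2, p=class_prior)                      # sample class C_i
    draws = rng.choice(Q, size=N, p=prod_given_class[c])  # N product draws
    counts = np.bincount(draws, minlength=Q)              # histogram over Q bins
    return c, counts

c, counts = sample_user()
```

Note how $N$ (the 20 draws) and $Q$ (the 4 histogram bins) play exactly the roles distinguished in the paragraph above.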

What did we discover? Moving ahead, let's try to get an idea of the kind of clusters we will discover here. They will look like those shown below:

Limitation: This does not look good for our application, since we can expect users with a different combination of product likings who may not fit into any of these clusters; in such cases the recommendations will be erroneous. One option could have been to increase the number of clusters and discover clusters corresponding to users liking only certain kinds of products, such as books. But a user can have several kinds of product likings at once, and a clustering model cannot accommodate this, since a data point can fall into only one cluster.

Next step: The next step is to identify an approach that (1) allows users to lie in multiple clusters, and (2) discovers the common products/themes present in our training data. These two properties seem more apt for this problem: when a new user comes in who likes only books, we can make relevant recommendations from the products lying in the theme 'books'. One such approach is latent topic models (LDA), which lie in the general class of models called factor models (like PCA, SVD, PLSA). Please note that I shall use the terms themes and topics interchangeably in the next section.

Approach 3 (Latent Topic Model): Now we would like an approach that can help us discover the themes/topics present among the products. We shall use factor models. In simple words, factor models find the underlying factors (or subspaces, or manifolds) that generate the data. These factors usually lie in a lower-dimensional subspace than the data, and it is hoped that their identity will yield meaningful information about the data. Factor models can also be considered a generalization of mixture models, since they allow a data point to be explained by multiple factors (possibly with sparse loadings).

In this case we shall use LDA, which has many preferred properties and is a probabilistic version of a factor model. For this problem, we hope that the underlying factors will correspond to common themes/topics present among the products. For instance, given a user's product-liking list, we can obtain his liking for different product topics, as shown below:

This is advantageous for our application: knowing the kind of preferences a user has, we can recommend new items from users with similar preferences. Now we shall invoke the great power of probability (and mathematics, of course) to understand the solution more clearly.

We shall follow the following probability model (generative process) for LDA:

The generative process for a user is explained below:

1. The random variable $T$ represents the topic-wise (or theme-wise) probabilities for a particular user. This is similar to the previous figure, which highlights the hidden product themes for a particular user. It is sampled from a Dirichlet distribution, i.e., $T \sim Dir(\alpha)$ (parameters not shown). I will try to explain the reasons for the various distributions later (maybe in a separate post). Note that the number of topics $N_{topics}$ is usually less than the number of products $Q$, which is natural.

2. Once we know the underlying topic distribution $T_i$ for a particular user $U_i$, we repeat the following $N$ times:

2.1. Sample a topic $Z_{ij}$ from the distribution $Z_{ij} \sim Pr(Z_{ij}|T_i)$. For instance, we might get the topic 'books' here.

2.2. Next, select a product $P_{ij}$ from this topic by sampling from the distribution $P_{ij} \sim Pr(P_{ij}|Z_{ij})$.

The above generative process assumes a hidden/latent (not visible in the observations) topic distribution for each user, which generates the topics and the product list. This way we can hope to capture the various themes a user is interested in, as we wanted earlier. An earlier diagram is re-drawn to stress again what we mean by a topic/theme distribution:
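The LDA generative process above can be sketched as follows (toy parameters; the topic-product probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(1)

Q, n_topics, N = 4, 2, 30
alpha = np.full(n_topics, 0.5)               # Dirichlet hyperparameter
# Pr(product | topic): topic 0 = books (p0, p1), topic 1 = cameras (p2, p3)
prod_given_topic = np.array([
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
])

def sample_user():
    T = rng.dirichlet(alpha)                      # 1. per-user topic distribution
    counts = np.zeros(Q, dtype=int)
    for _ in range(N):                            # 2. repeat N times:
        z = rng.choice(n_topics, p=T)             # 2.1 sample a topic Z_ij
        p = rng.choice(Q, p=prod_given_topic[z])  # 2.2 sample a product P_ij
        counts[p] += 1
    return T, counts

T, counts = sample_user()
```

Unlike the mixture model, each user here mixes both topics in proportions given by $T$, which is exactly the "users can lie in multiple clusters" property we asked for.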

I am avoiding the details of the training and inference procedures as such. We are interested in two things for our recommendation system:

1. Topic probability: the probability of topics given the product-liking list of a particular user, $Pr(T_i|P_i)$. Using this, we can get an idea of the topics the user is interested in.

2. Ideas for recommending items to users:

2.1. Probability of a product given the topic distribution of a user: this could be used to recommend new products to user $U_i$ based on his discovered topic distribution $T_i$. A score can be obtained for product $P_j$ using the probability $Pr(P_j|T_i)$.

2.2. We could also use the probability $Pr(P_j|P_{ik})$ to score recommending product $P_j$ to a user knowing that he/she has liked product $P_{ik}$. Note that here we are only interested in the likelihood of recommending an item based on a single item.

This, I would like to mention, is the beauty of a probabilistic approach: it makes it possible to perform different kinds of inference.

I would also like to point out that the above probabilities can easily be found using the sum and product rules of probability, along with the independence properties implied by the graph.
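For instance, the score $Pr(P_j|T_i)$ is just a marginalization over the hidden topic (sum rule), which amounts to one line of linear algebra on toy numbers:

```python
import numpy as np

# Toy quantities: Pr(product | topic) and a user's topic distribution T_i.
prod_given_topic = np.array([
    [0.45, 0.45, 0.05, 0.05],   # topic 0: books
    [0.05, 0.05, 0.45, 0.45],   # topic 1: cameras
])
T_i = np.array([0.8, 0.2])      # this user mostly likes books

# Sum rule over the hidden topic: Pr(P_j | T_i) = sum_z Pr(P_j | z) Pr(z | T_i)
scores = T_i @ prod_given_topic  # one score per product, sums to 1
```

The book products end up with the higher scores, which is exactly the recommendation behavior described in 2.1.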

Caveat: An important point to note is that inference and learning are not trivial for such models and might require more work. However, I have read that there are good implementations of LDA available.

Karan