I am elated to announce that my paper with my collaborator, Dr. Gaurav Sharma, has been accepted to CVPR’16. This work offers an apt end to my PhD thesis topic and brings me back to where I started, but with improved knowledge and insights.
To put it in a few words, my PhD thesis (still to be defended) is focused on action classification in videos and is titled ‘Dynamic Space-Time Volumes for Predicting Natural Expressions‘. To explain this idea, let me first describe the standard/classic paradigm for video classification. In a classical vision approach to action classification, features are first extracted from each frame or from local space-time volumes in a video, and are then combined into a single feature vector (akin to Bag of Words or temporal pooling). An example is shown below (borrowed from my thesis talk).
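This pipeline can be sketched in a few lines. Below is a minimal toy version (the descriptors and codebook are made up, not the actual features used in the thesis): per-frame descriptors are quantized against a codebook and pooled into a single histogram that represents the whole video.

```python
import numpy as np

def bag_of_words(frame_features, codebook):
    """Quantize per-frame descriptors against a codebook and pool
    them into one normalized histogram: the video's global feature."""
    # Distance of every frame descriptor to every codeword.
    dists = np.linalg.norm(
        frame_features[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)  # nearest codeword per frame
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()            # L1-normalize

# Toy example: 4 frames of 2-D descriptors, codebook of 3 words.
frames = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(bag_of_words(frames, codebook).tolist())  # → [0.5, 0.5, 0.0]
```

Note how the pooling destroys all temporal ordering: the histogram is identical no matter how the frames are shuffled, which is precisely why this works only for uniform, pre-segmented clips.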
Although such a paradigm works well for videos that are pre-segmented or composed of a single uniform action, it might not work well for complex videos. Complex videos are usually composed of sub-events, are unsegmented, and may be recorded with phone cameras (e.g. those on YouTube) or during a clinical session. An ideal framework to classify such videos would have a human perform a perfect segmentation first, followed by the bag-of-words approach. However, this would be expensive in both time and resources. In my thesis I approach this problem with weakly supervised learning, where the annotation labels are weak. For example, a video labelled as containing a ‘pain’ expression carries only a weak label, since it gives no information about where the pain occurs or how many frames in the video show this expression. I also refer to this paradigm as a ‘localized space-time’ approach, since one deals locally with video segments instead of globally with the whole video.
My first paper, at IEEE AFGR 2013, titled ‘Weakly Supervised Pain Localization using Multiple Instance Learning‘ (PS: see videos), tackled this by representing each video as a bag of multi-scale segments and then using Multiple Instance Learning (MIL) to automatically mine discriminative segments in pain videos. In simple words, the algorithm first identified which parts of the video correspond to the label (inference) and then learned the model (learning). The paper was quite well received (‘Best Student Paper Honorable Mention Award’ at AFGR) and I decided to pursue this further as my thesis topic.
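The inference/learning alternation can be sketched as follows. This is a hedged toy version of the MIL idea, not the paper’s exact solver: a least-squares linear scorer stands in for the SVM, and all the data are synthetic.

```python
import numpy as np

def mil_train(pos_bags, neg_bags, n_iters=10):
    """MI-SVM-style alternation (a sketch, not the paper's solver):
    (1) Inference - pick the highest-scoring segment in each positive
    bag; (2) Learning - refit a linear scorer treating those segments
    as positives and all negative-bag segments as negatives."""
    neg = np.vstack(neg_bags)
    # Initialize positives with each bag's mean segment.
    pos = np.array([bag.mean(axis=0) for bag in pos_bags])
    for _ in range(n_iters):
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
        # Learning: least-squares linear fit (stand-in for an SVM).
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        # Inference: re-select the best segment per positive bag.
        pos = np.array([bag[np.argmax(bag @ w)] for bag in pos_bags])
    return w

# Toy data: each positive bag hides one "pain" segment among neutral ones.
rng = np.random.default_rng(0)
pain = lambda: rng.normal([3.0, 3.0], 0.1)
neutral = lambda: rng.normal([0.0, 0.0], 0.1)
pos_bags = [np.array([neutral(), pain(), neutral()]) for _ in range(5)]
neg_bags = [np.array([neutral(), neutral()]) for _ in range(5)]
w = mil_train(pos_bags, neg_bags)
# The mined template scores pain-like segments above neutral ones.
assert np.array([3.0, 3.0]) @ w > np.array([0.0, 0.0]) @ w
```

The key point is that the label is attached only to the bag, yet the alternation still localizes which segment inside the bag carries the expression.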
In that work no temporal information was explicitly encoded in the model. My follow-up work tried to solve the same problem using Hidden Markov Models (HMMs), which can encode temporal information, instead of MIL. The idea was to learn a single HMM for each video. Each HMM segments the frames into sub-events, such as neutral and apex in a smile video. These HMMs were then used for classification (PS: skipping details) and took into account the local structure present in each video sequence. Here is a diagram of Exemplar-HMM.
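The segmentation an HMM produces comes from decoding the most likely state path. Here is a minimal Viterbi sketch with made-up two-state parameters (the states standing in for neutral and apex, the observations for quantized facial-intensity levels); it is an illustration of the decoding step, not the thesis model.

```python
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most likely state path for a discrete-emission HMM: the
    decoding step that segments a video into sub-events."""
    n_states = len(log_start)
    T = len(obs)
    score = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            prev = score[t - 1] + log_trans[:, s]
            back[t, s] = prev.argmax()
            score[t, s] = prev.max() + log_emit[s, obs[t]]
    path = [score[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [int(s) for s in path[::-1]]

# Two states (0 = neutral, 1 = apex), two symbols (0 = low facial
# intensity, 1 = high). All parameters below are made up.
log = np.log
start = log([0.9, 0.1])
trans = log([[0.8, 0.2], [0.2, 0.8]])
emit = log([[0.9, 0.1],   # neutral mostly emits "low"
            [0.1, 0.9]])  # apex mostly emits "high"
obs = [0, 0, 1, 1, 1, 0]
print(viterbi(obs, start, trans, emit))  # → [0, 0, 1, 1, 1, 0]
```

The decoded path is exactly the per-frame sub-event segmentation (neutral, apex, back to neutral) that the classifier then builds on.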
The advantages of this algorithm were that (i) it could explicitly model temporal information by modeling the transitions between different events (such as neutral -> apex), and (ii) the kernel comparing HMMs facilitated comparison between different sub-events. A particular drawback of this model was that it wasn’t learned in an end-to-end manner, so its generalization performance was somewhat low; it also couldn’t work with high-dimensional features (HMMs are generative).
Coming back to my thesis title, ‘Dynamic Space-Time Volumes for Predicting Natural Expressions’, let me now justify the name. The dynamic component comes from the fact that at run-time we are trying both to identify the events in a video and to model their temporal structure. The current work accepted at CVPR combines the best of both previous works by explicitly:
- Learning multiple discriminative templates. This is different from MIL, where only a single template is learned.
- Modeling the temporal ordering between such sub-events.
- Combining the above two into the classification score.
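To make the combination concrete, here is a hedged toy sketch of such a score (my own illustration, not the paper’s actual objective or parameters): each template picks its best-matching frame, and an ordinal penalty is paid whenever the picked frames violate the expected sub-event ordering.

```python
import numpy as np

def ordinal_score(video, templates, order_penalty=1.0):
    """Toy appearance + ordering score (an illustration only):
    each template fires on its best frame, and a penalty is paid
    when the firing locations break the expected sub-event order
    (template 0 before template 1 before ...)."""
    responses = video @ templates.T        # (frames, templates)
    locations = responses.argmax(axis=0)   # best frame per template
    appearance = responses.max(axis=0).sum()
    # Count ordering violations between consecutive sub-events.
    violations = sum(int(locations[i] >= locations[i + 1])
                     for i in range(len(locations) - 1))
    return appearance - order_penalty * violations

# Toy video: 2-D frame features moving neutral -> onset -> apex.
video = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
templates = np.array([[1.0, 0.0],   # "onset-like" template
                      [0.0, 1.0]])  # "apex-like" template
print(ordinal_score(video, templates))  # → 2.0
```

Reversing the frames would leave the appearance term unchanged but incur the ordering penalty, which is exactly the extra signal a single-template MIL model cannot capture.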
This model is named LOMo (Latent Ordinal Model), as it explicitly models the ordering between sub-events. I have shown an example below where the states neutral, onset and apex are ordered (identified automatically, without any additional training data).
This algorithm is not only more expressive but also outperformed MIL. I shall not go into the details here; you can dig into them once the camera-ready version is prepared. The algorithm can be connected to the Actom Sequence Model and to HCRFs; I shall talk about these in the future.
In all honesty, I want to thank Gaurav, who was the real push behind this work. With his vision and expertise, we were able to bring this work to reality in under a month (plus, of course, years of experience!). I wish him good luck in his next job as an Assistant Professor at IIT Kanpur.