Summary: There are some interesting use cases where combining CNNs and RNN/LSTMs seems to make sense, and a number of researchers are pursuing this.  However, the latest trends in CNNs may make this obsolete.

When I first came across the idea of combining CNNs (convolutional neural nets) and RNNs (recurrent neural nets), it struck me as an odd pairing.  After all, they’re optimized for completely different problem types.

We all know to reach for the appropriate tool based on these very different problem types.  As it turns out, though, there are problems that call for both.  Most of these are readily identified as images that occur in a temporal sequence, in other words video.  But there are some other clever applications not directly related to video that may spark your imagination.  We’ll describe several of those below.

There are also several emerging models of how to combine these tools.  In most cases CNNs and RNNs have been married as separate layers, with the output of the CNN being used as input to the RNN.  But some researchers are cleverly combining the two capabilities within a single deep neural net.

The classical approach to scene labeling is to train a CNN to identify and classify the objects within a frame, and perhaps to further classify the objects into a higher-level logical group.  For example, the CNN identifies a stove, a refrigerator, a sink, etc. and also up-classifies them as a kitchen.

Clearly the element that’s missing is the meaning of the motion over several frames (time).  For example, several frames of a game of pool might correctly be summarized as ‘the shooter sinks the eight ball in the side pocket’.  Or several frames of a young person learning to ride a two-wheeler, followed by a frame of the rider on the ground, might reasonably be summarized as ‘boy falls off bike’.

Researchers have used layered CNN-RNN pairs where the output of the CNN is input to the RNN.  Logically, the RNN has also been replaced with LSTMs to create a more ‘in the moment’ description of each video segment.  Finally, there has been some experimentation with combined RCNNs, where the recurrent connection is directly in the kernels, as in the diagram above.
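To make the layered arrangement concrete, here is a minimal sketch of the data flow, with the CNN output for each frame fed as input to an RNN.  It is a toy written in plain Python under several assumptions: the function names (`conv2d_valid`, `cnn_features`, `rnn_step`, `encode_clip`) are hypothetical, the "CNN" is a single convolution with ReLU and global max pooling, the RNN is a simple Elman cell rather than an LSTM, and the weights are random and untrained.  It is not the method of any particular research group, and a real system would use a deep learning framework.

```python
import math
import random


def conv2d_valid(frame, kernel):
    """Slide a small kernel over a 2D frame (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    return [
        [
            sum(frame[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def cnn_features(frame, kernels):
    """Toy per-frame 'CNN': convolve, apply ReLU, then global max pool.

    Returns one feature value per kernel.
    """
    features = []
    for kernel in kernels:
        feature_map = conv2d_valid(frame, kernel)
        features.append(max(max(0.0, v) for row in feature_map for v in row))
    return features


def rnn_step(x, h, w_xh, w_hh):
    """One Elman RNN step: h_new = tanh(W_xh x + W_hh h)."""
    return [
        math.tanh(
            sum(w_xh[k][i] * x[i] for i in range(len(x)))
            + sum(w_hh[k][j] * h[j] for j in range(len(h)))
        )
        for k in range(len(h))
    ]


def encode_clip(frames, kernels, w_xh, w_hh):
    """Run the CNN on each frame, then feed its output to the RNN in order."""
    hidden = [0.0] * len(w_hh)
    for frame in frames:
        hidden = rnn_step(cnn_features(frame, kernels), hidden, w_xh, w_hh)
    return hidden  # final state summarizes the whole clip


# Demo with random, untrained weights (illustrative values only).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


kernels = [rand_matrix(3, 3) for _ in range(4)]   # 4 convolution kernels
w_xh = rand_matrix(5, 4)                          # CNN features -> hidden
w_hh = rand_matrix(5, 5)                          # hidden -> hidden
clip = [rand_matrix(8, 8) for _ in range(6)]      # 6 frames of 8x8 "pixels"
summary = encode_clip(clip, kernels, w_xh, w_hh)
```

The point of the sketch is the division of labor: `cnn_features` looks at one frame at a time and knows nothing about order, while the hidden state carried through `rnn_step` is what lets the result depend on the sequence of frames.  In a trained model that final state would feed a classifier or a caption generator.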

Judging the emotion of individuals or groups of individuals from video remains a challenge.  An annual competition on this problem, the EmotiW Grand Challenge, is held by the ACM International Conference on Multimodal Interaction.  Each year the target data changes somewhat in nature, and typically there are different tests for classifying groups of people versus individuals appearing in videos.
