Structural-RNN: Deep Learning on Spatio-Temporal Graphs

Paper Summary

Posted on Sunday, March 5, 2017. Tags: Summaries, ML

Details:

  • Authors: Ashesh Jain, Amir R. Zamir, Silvio Savarese, Ashutosh Saxena
  • Link: cs.stanford
  • Tags: RNNs, Motion Prediction, Structured ML
  • Year: 2015
  • Conference: CVPR 2016 (Best Student Paper Award)
  • Implementation: Official, in Theano

Summary

Problem

Spatio-temporal graphs are a popular tool for imposing high-level intuitions on the formulation of many real-world problems, but modeling such problems with plain RNNs does not work as well in practice because the structure is lost. In this work, the authors propose the Structural-RNN (S-RNN), which encodes the spatio-temporal structure of the problem into the architecture of the model itself. This is relevant to a wide variety of tasks, such as human motion prediction and human activity detection, as well as real-life problems such as driver maneuver prediction, which can be used to reduce accidents.

How they solve it.

The authors propose Structural-RNNs (S-RNNs), which are built directly from spatio-temporal graphs.

  • Spatio-temporal graphs: Defined in Figure 2 of the paper. A spatio-temporal graph is a tuple (V, E_t, E_s), where V is the set of vertices/nodes and E_t and E_s are the sets of temporal and spatial edges respectively. The graph captures prior knowledge about the problem and in turn determines the architecture of the network.
  • The nodes need to be designed with care depending on the problem. The authors describe the process for human activity detection in the second half of Section 3.1. Once the nodes are chosen, each node has a temporal edge with itself, passing its information from one time step to the next. The spatial connections can be treated as hyper-parameters (so can the nodes, but then the hyper-parameter space becomes very large). The authors fixed the connections through experimentation.
  • If nodes are semantically similar, they share the same edge factor (an RNN in the model). Take the human activity detection task (Fig. 3): if we modeled the two objects separately, there would be two distinct EdgeRNNs with respect to the human (think of them as just two sets of weights). By treating the objects as semantically the same, they share a single EdgeRNN with respect to the human.
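
To make the factor sharing concrete, here is a minimal sketch of how I would represent such a graph. It is not taken from the official Theano code: the class `STGraph`, the method `shared_factors`, and the node names `cup` and `bowl` are all names I made up for illustration. Edges whose endpoints have the same pair of semantic types are grouped under one factor.

```python
from dataclasses import dataclass


@dataclass
class STGraph:
    """Hypothetical container for a spatio-temporal graph (V, E_t, E_s)."""
    node_types: dict      # node name -> semantic type
    spatial_edges: list   # list of (u, v) node-name pairs

    def shared_factors(self):
        """Group edges so that semantically similar edges share one factor."""
        groups = {}
        # Spatial edges are keyed by the unordered pair of endpoint types.
        for u, v in self.spatial_edges:
            a, b = sorted((self.node_types[u], self.node_types[v]))
            groups.setdefault(("spatial", a, b), []).append((u, v))
        # Every node has one temporal edge with itself, keyed by its type.
        for v, t in self.node_types.items():
            groups.setdefault(("temporal", t), []).append((v, v))
        return groups


# Assumed toy scene: one human interacting with two objects.
graph = STGraph(
    node_types={"human": "human", "cup": "object", "bowl": "object"},
    spatial_edges=[("human", "cup"), ("human", "bowl"), ("cup", "bowl")],
)
factors = graph.shared_factors()
for key, edges in factors.items():
    print(key, edges)
```

Both human-object edges land under the single key `("spatial", "human", "object")`, which is the sharing described in the last bullet above.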

  • Converting from ST-graph to S-RNN: First we should understand the types of RNNs in an S-RNN:
    • Temporal EdgeRNN: These are the RNNs that represent the temporal edges in the ST-graph.
    • Spatial EdgeRNN: These represent the spatial edges in the graph.
    • NodeRNN: These model the nodes/vertices in the graph.
  • The edges among them are defined as:
    • Temporal EdgeRNN has only one edge, to its corresponding NodeRNN.
    • A Spatial EdgeRNN has one edge if it connects semantically similar nodes and two edges otherwise (one to each of the two NodeRNNs).
    • There are no edges between two EdgeRNNs or between two NodeRNNs, so the result is a bipartite graph.
  • This defines the overarching model as described in the paper. The following part is not mentioned explicitly in the paper and matters only from the implementation perspective.
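
The wiring rules above can be sketched as a small function. This is only an illustration under my own assumptions: the function name `build_srnn` and the tuple keys are hypothetical, and I assume that semantically similar nodes share one NodeRNN in the same way that similar edges share one EdgeRNN.

```python
def build_srnn(node_types, spatial_edges):
    """Return (node_rnns, edge_rnns, links) for a hypothetical S-RNN wiring.

    node_types    : dict, node name -> semantic type
    spatial_edges : list of (u, v) node-name pairs
    links         : (edge_rnn, node_rnn) pairs, so the graph is bipartite
    """
    # Assumption: one NodeRNN per semantic type.
    node_rnns = sorted(set(node_types.values()))
    edge_rnns = set()
    links = set()

    # A Temporal EdgeRNN links only to its corresponding NodeRNN.
    for t in node_rnns:
        edge = ("temporal", t)
        edge_rnns.add(edge)
        links.add((edge, t))

    # A Spatial EdgeRNN links to one NodeRNN if both endpoints have the
    # same semantic type, and to two NodeRNNs otherwise.
    for u, v in spatial_edges:
        a, b = sorted((node_types[u], node_types[v]))
        edge = ("spatial", a, b)
        edge_rnns.add(edge)
        links.add((edge, a))
        links.add((edge, b))

    return node_rnns, sorted(edge_rnns), sorted(links)


# Assumed toy scene: one human and two objects.
node_rnns, edge_rnns, links = build_srnn(
    {"human": "human", "cup": "object", "bowl": "object"},
    [("human", "cup"), ("human", "bowl"), ("cup", "bowl")],
)
for edge, node in links:
    print(edge, "->", node)
```

Every link goes from an EdgeRNN to a NodeRNN and never between two RNNs of the same kind, which is what makes the structure bipartite.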

  • Another important aspect of this work (as it deals with non-traditional problems) is how inputs and outputs flow through the component RNNs. I explain this with the help of the human motion forecasting example in Fig. 5. Here each node is a body part at time step t, and the model is fed the node positions at time steps 1, 2, …, t. The aim is to predict the positions at time step t+1. Since each constituent is an RNN, the model is unrolled over time in the same way as an ordinary RNN.
    • A NodeRNN is fed the concatenation of the outputs of all its connected EdgeRNNs (both temporal and spatial) and the node feature (i.e. the position of that body part) at that time step.
    • Each EdgeRNN, temporal or spatial, is fed the sum of the features f_{o1o2} over the members of its semantic group. The feature f_{o1o2} for each member of the group is defined as follows.
      • For temporal edges, it is the concatenation of the body part's coordinates at that time step and the difference between the coordinates at that time step and the previous one.
      • For spatial edges, it is just the concatenation of both body parts' coordinates.
    • The output of each NodeRNN is the mo-cap features at time step t+1.
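
The input rules above can be sketched in plain Python. This is my reading of the description, not the official implementation: all function names, the body parts, the coordinates, and the stand-in EdgeRNN outputs are made up for illustration, and the concatenation order is an assumption.

```python
def temporal_feature(pos_t, pos_prev):
    """Coordinates at time t concatenated with their change since t-1."""
    return list(pos_t) + [a - b for a, b in zip(pos_t, pos_prev)]


def spatial_feature(pos_a, pos_b):
    """Coordinates of the two connected body parts, concatenated."""
    return list(pos_a) + list(pos_b)


def edge_rnn_input(features):
    """Element-wise sum of the features that share one EdgeRNN."""
    return [sum(values) for values in zip(*features)]


def node_rnn_input(edge_rnn_outputs, node_feature):
    """Concatenate the connected EdgeRNN outputs with the node feature."""
    out = []
    for h in edge_rnn_outputs:
        out += list(h)
    return out + list(node_feature)


# Made-up 3D positions for three body parts.
spine_prev, spine_now = [0.0, 1.0, 0.0], [0.0, 1.5, 0.0]
arm_now, leg_now = [1.5, 2.0, 2.0], [0.5, -1.0, 0.0]

# Input to the spine's Temporal EdgeRNN at this time step.
t_in = temporal_feature(spine_now, spine_prev)

# Input to the Spatial EdgeRNN shared by the spine-arm and spine-leg edges.
s_in = edge_rnn_input([
    spatial_feature(spine_now, arm_now),
    spatial_feature(spine_now, leg_now),
])

# Stand-ins for EdgeRNN outputs; a real model would compute these
# by running t_in and s_in through the corresponding RNNs.
h_temporal, h_spatial = [0.1, 0.2], [0.3, 0.4]
n_in = node_rnn_input([h_temporal, h_spatial], spine_now)
print(t_in, s_in, n_in)
```

The NodeRNN would then map `n_in` to the predicted position of that body part at time step t+1.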