Multimodal Motion Prediction with Stacked Transformers
Yicheng Liu1*, Jinghuai Zhang2*, Liangji Fang2, Qinhong Jiang2, Bolei Zhou1
The Chinese University of Hong Kong1, SenseTime Research2
Computer Vision and Pattern Recognition (CVPR), 2021
We propose mmTransformer, a novel end-to-end framework for multimodal motion prediction. First, we use a stacked-transformer architecture to incorporate multiple channels of contextual information and to model multimodality at the feature level with a set of trajectory proposals. Then, we induce multimodality via a tailored region-based training strategy. By steering the training of the proposals, our model effectively mitigates the complexity of motion prediction while ensuring multimodal outputs.
The proposed architecture of mmTransformer (MultiModal Transformer). The backbone is composed of stacked transformers, which aggregate the contextual information progressively. The proposal feature decoder then generates the trajectory and the confidence score for each learned trajectory proposal through the trajectory generator and the selector, respectively.
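The decoder step in the caption above can be illustrated with a minimal sketch. It assumes that each head is a single linear layer with random weights and that the confidence scores are normalized with a softmax over the proposals; the feature dimension `D`, the horizon `T`, and all function names are hypothetical, and the stacked-transformer backbone is replaced by random proposal features.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector x (a list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_linear(rng, out_dim, in_dim):
    """Random weights and zero biases for a hypothetical linear head."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def decode_proposals(features, gen_params, sel_params, horizon):
    """Map K proposal features to K trajectories and K confidence scores."""
    trajectories, logits = [], []
    for feat in features:
        # Trajectory generator: one (x, y) pair per future time step.
        flat = linear(feat, *gen_params)
        trajectories.append([(flat[2 * t], flat[2 * t + 1]) for t in range(horizon)])
        # Selector: one scalar logit per proposal.
        logits.append(linear(feat, *sel_params)[0])
    return trajectories, softmax(logits)

rng = random.Random(0)
K, D, T = 6, 8, 30  # 6 proposals; D and T are illustrative values only
features = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]  # stand-in for backbone output
gen_params = init_linear(rng, 2 * T, D)
sel_params = init_linear(rng, 1, D)
trajectories, scores = decode_proposals(features, gen_params, sel_params, T)
```

Each proposal yields one trajectory of `T` points and one score, so the number of predicted modes equals the number of proposals.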
Overview of the Region-based Training Strategy (RTS). We assign each proposal to one of the M regions. These proposals, shown as colored rectangles, learn their corresponding proposal features through the stacked transformers. During training, we select the proposals assigned to the region in which the ground-truth (GT) endpoint lies, generate their trajectories and confidence scores, and then calculate the losses for them.
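The selection step of RTS can be sketched as follows. The caption does not say how the M regions are constructed, so this sketch assumes equal-angle sectors around the origin purely for illustration; M = 6, the even split of proposals across regions, and all function names are likewise assumptions.

```python
import math

def region_of(endpoint, num_regions):
    """Return the index of the angular sector containing an (x, y) endpoint."""
    x, y = endpoint
    angle = math.atan2(y, x) % (2 * math.pi)  # angle in [0, 2*pi]
    width = 2 * math.pi / num_regions
    # Clamp guards against floating-point rounding at the upper boundary.
    return min(int(angle / width), num_regions - 1)

def assign_proposals(num_proposals, num_regions):
    """Assign proposals evenly: proposal k belongs to region k // per_region.

    Assumes num_proposals is divisible by num_regions.
    """
    per_region = num_proposals // num_regions
    return [k // per_region for k in range(num_proposals)]

def select_for_loss(proposal_regions, gt_endpoint, num_regions):
    """Pick the proposals whose region contains the ground-truth endpoint."""
    target = region_of(gt_endpoint, num_regions)
    selected = [k for k, r in enumerate(proposal_regions) if r == target]
    return target, selected

proposal_regions = assign_proposals(num_proposals=36, num_regions=6)
target, selected = select_for_loss(proposal_regions, gt_endpoint=(-3.0, 4.0), num_regions=6)
# Only the proposals in `selected` would have trajectories and scores
# generated and contribute to the loss for this training sample.
```

Because only one region's proposals are trained on each sample, each group of proposals specializes in endpoints that fall inside its own region.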
Visualization of the multimodal prediction results on the Argoverse validation set. We use all trajectory proposals to generate multiple trajectories for each scenario and visualize all the predicted endpoints (black background) in the figures. Colored points indicate the predictions of a specific group of proposals (after filtering by score). We observe that the endpoints generated by each group of regional proposals fall within the associated region.
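The grouping behind such a visualization can be sketched in a few lines. The score threshold, the `(region_id, score, endpoint)` tuple layout, and the sample values below are made up for illustration and are not results from the paper.

```python
from collections import defaultdict

def group_endpoints(predictions, min_score):
    """Group endpoints by region id, keeping predictions with score >= min_score."""
    groups = defaultdict(list)
    for region_id, score, endpoint in predictions:
        if score >= min_score:
            groups[region_id].append(endpoint)
    return dict(groups)

# Hypothetical predictions: (region id of the proposal, confidence, endpoint).
predictions = [
    (0, 0.40, (12.0, 1.5)),
    (0, 0.02, (9.0, 0.5)),
    (1, 0.30, (6.0, 8.0)),
    (2, 0.05, (-2.0, 7.0)),
]
groups = group_endpoints(predictions, min_score=0.1)
```

Each key of `groups` corresponds to one color in the plot; regions whose proposals all score below the threshold do not appear.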
Qualitative comparison between mmTransformer (6 proposals) and mmTransformer+RTS (36 proposals):
Demo video of multimodal motion prediction by mmTransformer. For each moving vehicle near the ego car, three plausible future trajectories are visualized.

The demo video can also be found on bilibili.
  title={Multimodal Motion Prediction with Stacked Transformers},
  author={Liu, Yicheng and Zhang, Jinghuai and Fang, Liangji and Jiang, Qinhong and Zhou, Bolei},
  journal={Computer Vision and Pattern Recognition},
Related Work