Temporal Pyramid Network for Action Recognition
Ceyuan Yang*¹, Yinghao Xu*¹, Jianping Shi², Bo Dai¹, Bolei Zhou¹
¹The Chinese University of Hong Kong, ²SenseTime Group Limited
Overview
Visual tempo characterizes the dynamics and the temporal scale of an action, i.e., how fast an action proceeds. Modeling the visual tempos of different actions facilitates their recognition. In this work we propose a generic Temporal Pyramid Network (TPN) that operates at the feature level and can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. TPN shows consistent improvements over challenging baselines on several action recognition datasets. A further analysis reveals that TPN gains most of its improvement on action classes with large variance in visual tempo, validating the effectiveness of TPN.
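As a rough, feature-level illustration of this idea (a minimal sketch, not the official implementation: the module name, channel sizes, strides, and the simple additive fusion below are assumptions), a temporal pyramid can be built by sampling backbone features at several temporal strides, projecting each level, re-aligning them in time, and fusing them before classification:

import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalPyramidSketch(nn.Module):
    # Toy feature-level temporal pyramid: sample a clip feature map at several
    # temporal strides (i.e., tempos), project each level, resample back to a
    # common temporal length, and fuse by summation.
    def __init__(self, in_channels=2048, mid_channels=256, strides=(1, 2, 4), num_classes=400):
        super().__init__()
        self.strides = strides
        self.laterals = nn.ModuleList(
            [nn.Conv3d(in_channels, mid_channels, kernel_size=1) for _ in strides]
        )
        self.classifier = nn.Linear(mid_channels, num_classes)

    def forward(self, feat):  # feat: (N, C, T, H, W) from a 3D backbone
        levels = []
        for stride, lateral in zip(self.strides, self.laterals):
            x = feat[:, :, ::stride]                      # keep every `stride`-th frame feature
            x = lateral(x)                                # project to mid_channels
            x = F.interpolate(x, size=feat.shape[2:],     # re-align temporal length
                              mode="trilinear", align_corners=False)
            levels.append(x)
        fused = torch.stack(levels).sum(dim=0)            # simple additive fusion
        pooled = fused.mean(dim=[2, 3, 4])                # global average pooling
        return self.classifier(pooled)


logits = TemporalPyramidSketch()(torch.randn(2, 2048, 8, 7, 7))
print(logits.shape)  # torch.Size([2, 400])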
Results
  • Quantitative Results
  • Our TPN achieves 78.9%, 49.0% and 62.0% top-1 accuracy on the mainstream action recognition benchmarks Kinetics-400, Something-Something V1 and V2 respectively, outperforming other state-of-the-art methods. More detailed comparisons and ablation studies are presented in our paper.

  • Empirical Study
  • Per-class Performance Gain vs. Per-class Variance of Visual Tempos: Figure 4 indicates that the performance gain is clearly positively correlated with the variance of visual tempos. This study strongly supports our motivation that TPN brings significant improvements for actions with large variance in visual tempo (a toy version of this correlation check is sketched after this list).

    Robustness of TPN to Visual Tempo Variation: Figure 5 suggests that TPN improves the robustness of I3D-50, resulting in a curve with milder fluctuations. More discussion is presented in our experimental section.

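For illustration, the per-class analysis above can be reproduced in spirit by correlating each class's accuracy gain (TPN minus baseline) with the variance of its visual tempos. The arrays below are random placeholders, not the paper's measurements:

import numpy as np

num_classes = 400                                    # e.g. Kinetics-400
rng = np.random.default_rng(0)
tempo_variance = rng.random(num_classes)             # per-class variance of visual tempo (placeholder)
acc_baseline = 0.5 + 0.3 * rng.random(num_classes)   # baseline per-class top-1 accuracy (placeholder)
# Toy gains that grow with tempo variance, plus noise, mimicking the trend in Figure 4.
acc_tpn = acc_baseline + 0.05 * tempo_variance + 0.01 * rng.standard_normal(num_classes)

gain = acc_tpn - acc_baseline
pearson_r = np.corrcoef(tempo_variance, gain)[0, 1]  # a positive value matches the observed trend
print(f"Pearson correlation between tempo variance and per-class gain: {pearson_r:.3f}")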
    BibTeX
    @inproceedings{yang2020tpn,
      title   = {Temporal Pyramid Network for Action Recognition},
      author  = {Yang, Ceyuan and Xu, Yinghao and Shi, Jianping and Dai, Bo and Zhou, Bolei},
      booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
      year    = {2020}
    }