Top-Heavy CapsNets Based on Spatiotemporal Non-Local for Action Recognition




Action recognition, Attention mechanism, Capsule network, Deep neural network, Spatiotemporal


To recognize human actions effectively, we developed a Deep Neural Network (DNN) that exploits spatiotemporal non-local relations to capture meaningful semantic context. This work introduces the Top-Heavy CapsNet, a novel approach to video analysis that incorporates a 3D Convolutional Neural Network (3DCNN) so that actions can be classified from motion in the spatiotemporal context of videos. The DNN comprises three stages: a 3DCNN, a Spatial Depth-Based Non-Local (SBN) layer, and a Deep Capsule Network (DCapsNet). First, the 3DCNN extracts structured and semantic information from the RGB and optical flow streams. Second, the SBN layer processes feature blocks with spatial depth to emphasize visually informative cues, which may help differentiate actions. Finally, DCapsNet exploits vectorized prominent features to represent objects and action features for the final label determination. Experimental results show that the proposed DNN achieves an average accuracy of 97.6% on the traffic police dataset, surpassing conventional DNNs. The proposed DNN also attains average accuracies of 98.3% and 80.7% on the UCF101 and HMDB51 datasets, respectively. These results indicate that the proposed DNN is applicable to recognizing diverse actions performed by subjects in videos.
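The abstract gives no layer-level details, so the following is only a minimal sketch of two generic building blocks that a pipeline of this kind typically relies on: a non-local (self-attention) operation over flattened spatiotemporal positions, and the capsule "squash" nonlinearity. It is not the paper's SBN layer or DCapsNet. The function names (`non_local`, `squash`), the feature shape, the random weights, and the capsule dimension of 8 are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def non_local(features, w_theta, w_phi, w_g):
    """Generic non-local operation with a residual connection.

    features: (N, C) array, where N = T*H*W flattened spatiotemporal
    positions. The weight matrices are hypothetical linear embeddings.
    """
    theta = features @ w_theta              # (N, C') query embedding
    phi = features @ w_phi                  # (N, C') key embedding
    g = features @ w_g                      # (N, C)  value embedding
    attn = softmax(theta @ phi.T, axis=-1)  # (N, N) pairwise affinities
    return features + attn @ g              # every position attends to all others


def squash(v, axis=-1, eps=1e-8):
    """Capsule squash: keeps vector direction, maps length into [0, 1)."""
    sq = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * v / np.sqrt(sq + eps)


# Illustrative shapes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
T, H, W, C = 4, 7, 7, 16
x = rng.standard_normal((T, H, W, C))       # stand-in for 3DCNN features
flat = x.reshape(-1, C)                     # (196, 16)

w_theta = rng.standard_normal((C, C // 2)) * 0.1
w_phi = rng.standard_normal((C, C // 2)) * 0.1
w_g = rng.standard_normal((C, C)) * 0.1

y = non_local(flat, w_theta, w_phi, w_g)    # (196, 16) context-enriched features
caps = squash(y.reshape(-1, 8))             # regroup channels into 8-D capsules
print(y.shape, caps.shape)
```

In this sketch the attention matrix relates every spatiotemporal position to every other one, which is what "non-local" refers to, while the squashed vector length can be read as the confidence that the entity a capsule represents is present.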

Author Biography

Manh-Hung Ha, Vietnam National University

Faculty of Applied Sciences, International School, Vietnam National University, Hanoi 100000, Vietnam






How to Cite

Ha, M.-H. (2024). Top-Heavy CapsNets Based on Spatiotemporal Non-Local for Action Recognition. Journal of Computing Theories and Applications, 2(1), 39–50.