Body motion is an important channel for human communication and plays a crucial role in automatic emotion recognition. This work proposes a multiscale spatio-temporal network that captures coarse-grained and fine-grained affective information conveyed by full-body motion and decodes the complex mapping between emotion and body movement. The method introduces an energy-guided scale selection strategy, a hierarchical spatio-temporal network for posture covariance matrices and 3D posture images, and a spatio-temporal iterative fusion algorithm. Evaluations on five public datasets show that the proposed approach improves learning capability, generalization, and convergence, with average classification results exceeding 86% across datasets.
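To make the notion of a posture covariance matrix computed at several temporal scales more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a motion clip is stored as an array of shape (frames, joints, 3), that a covariance descriptor is the covariance of the flattened joint coordinates over a window of frames, and that scales correspond to non-overlapping windows of different lengths. The function names `posture_covariance` and `multiscale_covariances` are hypothetical, and the energy-guided scale selection described above is not reproduced here; window lengths are simply passed in by hand.

```python
import numpy as np


def posture_covariance(motion: np.ndarray) -> np.ndarray:
    """Covariance of flattened joint coordinates over time.

    motion: array of shape (frames, joints, 3) holding 3D joint positions.
    Returns a symmetric (3 * joints, 3 * joints) matrix.
    This layout is an illustrative assumption, not taken from the paper.
    """
    if motion.ndim != 3 or motion.shape[2] != 3:
        raise ValueError("expected an array of shape (frames, joints, 3)")
    frames, joints, _ = motion.shape
    if frames < 2:
        raise ValueError("need at least two frames to estimate a covariance")
    # Each row is one frame; each column is one coordinate of one joint.
    flat = motion.reshape(frames, joints * 3)
    return np.cov(flat, rowvar=False)


def multiscale_covariances(motion: np.ndarray, window_lengths: list[int]) -> dict[int, np.ndarray]:
    """Covariance descriptors over non-overlapping windows at several scales.

    Long windows give a coarse-grained summary of the clip; short windows
    give fine-grained ones. Trailing frames that do not fill a whole
    window are ignored.
    """
    total_frames = motion.shape[0]
    result = {}
    for length in window_lengths:
        if length < 2 or length > total_frames:
            raise ValueError(f"window length {length} does not fit the clip")
        starts = range(0, total_frames - length + 1, length)
        result[length] = np.stack(
            [posture_covariance(motion[s:s + length]) for s in starts]
        )
    return result


if __name__ == "__main__":
    # Synthetic clip: 40 frames, 5 joints (random data, for shape checks only).
    rng = np.random.default_rng(0)
    clip = rng.normal(size=(40, 5, 3))
    descriptors = multiscale_covariances(clip, [10, 20])
    for length, stack in descriptors.items():
        print(length, stack.shape)
```

Under these assumptions each scale yields a stack of symmetric matrices, one per window, which a downstream network could consume. How the scales are actually chosen and how the matrices are combined with the 3D posture images are specific to the proposed method and are not captured by this sketch.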