Depression recognition using a proposed speech chain model fusing speech production and perception features

Abstract

Body motion is an important channel for human communication and plays a crucial role in automatic emotion recognition. This work proposes a multiscale spatio-temporal network, which captures the coarse-grained and fine-grained affective information conveyed by full-body motion and decodes the complex mapping between emotion and body movement. The proposed method consists of three main components. First, a scale selection algorithm based on the pseudo-energy model is presented, which guides our network to focus not only on long-term macroscopic body expressions, but also on short-term subtle posture changes. Second, we propose a hierarchical spatio-temporal network that can jointly process posture covariance matrices and 3D posture images with different time scales, and then hierarchically fuse them in a coarse-to-fine manner. Finally, a spatio-temporal iterative (ST-ITE) fusion algorithm is developed to jointly optimize the proposed network. The proposed approach is evaluated on five public datasets. The experimental results show that the introduction of the energy-based scale selection algorithm significantly enhances the learning capability of the network. The proposed ST-ITE fusion algorithm improves the generalization and convergence of our model. The average classification results of the proposed method exceed 86% on all datasets and outperform the state-of-the-art methods.
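The abstract mentions posture covariance matrices computed at different time scales as one of the inputs to the hierarchical network. The snippet below is a minimal sketch of how such multiscale covariance descriptors could be built from a 3D skeleton sequence; the function name `posture_covariance`, the (T, J, 3) input layout, and the non-overlapping windowing are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def posture_covariance(joints, scale):
    """Illustrative sketch (not the paper's code): per-window posture
    covariance descriptors from a full-body motion sequence.

    joints : array of shape (T, J, 3) -- T frames, J joints, 3D coordinates
    scale  : window length in frames (a coarser or finer temporal scale)

    Returns an array of shape (num_windows, 3*J, 3*J), one covariance
    matrix per non-overlapping window of `scale` frames.
    """
    T, J, _ = joints.shape
    flat = joints.reshape(T, J * 3)                    # each frame becomes a 3J-dim posture vector
    windows = []
    for start in range(0, T - scale + 1, scale):
        seg = flat[start:start + scale]                # frames inside one temporal window
        windows.append(np.cov(seg, rowvar=False))      # (3J, 3J) covariance over posture dimensions
    return np.stack(windows)

# Example: a 120-frame clip with 25 joints, described at a coarse and a fine scale.
clip = np.random.randn(120, 25, 3)
coarse = posture_covariance(clip, scale=60)   # 2 long-term descriptors
fine = posture_covariance(clip, scale=15)     # 8 short-term descriptors
print(coarse.shape, fine.shape)               # (2, 75, 75) (8, 75, 75)
```

In the paper's pipeline, descriptors like these at coarse and fine scales would then be fused with the corresponding 3D posture images in a coarse-to-fine manner; that fusion stage is not reproduced here.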


Journal Article