This work presents an explainable affective body expression recognition framework that integrates multi-scale spatiotemporal encoding with LLM-based reasoning. The framework uses MSCMNet to encode body movement patterns across scales, bidirectional state-space modeling to capture temporal dependencies, and an Emotion-Action Interpreter to generate human-readable explanations. A spatiotemporal semantic understanding module and cross-dataset joint training further improve generalization. Experiments show accuracy improvements of up to 7.83% and stronger explainable reasoning than general-purpose multimodal large language models such as GPT-4o and Gemini 1.5 Pro.
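To make the idea of bidirectional state-space modeling concrete, the following is a minimal sketch and not the framework's actual implementation. It assumes a scalar linear recurrence h_t = a·h_{t-1} + b·x_t with readout y_t = c·h_t, scanned once forward and once backward over a 1-D feature sequence, with the two outputs summed. The function names `ssm_scan` and `bidirectional_ssm`, the parameters `a`, `b`, `c`, and the choice to sum the two directions are all illustrative assumptions; the source does not specify them.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Scan a scalar linear state-space recurrence over a sequence.

    Hypothetical toy model: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t,
    with the hidden state initialized to zero.
    """
    h = 0.0
    outputs = []
    for x_t in x:
        h = a * h + b * x_t   # state update
        outputs.append(c * h)  # readout
    return outputs


def bidirectional_ssm(x: List[float], a: float = 0.5,
                      b: float = 1.0, c: float = 1.0) -> List[float]:
    """Combine a forward scan and a backward scan of the same sequence.

    The backward pass scans the reversed input and is then re-reversed so
    that position t sees context from frames t, t+1, ...; summing the two
    directions is an assumption made for this sketch.
    """
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    # A single impulse at the first frame: the forward pass decays it over
    # later frames, while the backward pass only sees it at frame 0.
    print(bidirectional_ssm([1.0, 0.0, 0.0]))
```

With both directions, every time step receives information from earlier and later frames, which is the property a unidirectional scan lacks.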