Research on a Multimodal Question-Answering Emotion Recognition Method Based on User Preferences
Abstract
Current multimodal sentiment recognition methods fall short in handling dynamic changes in modality weights and in modeling cross-modal consistency. Specifically, existing approaches do not apply multiple rounds of structured preprocessing and feature optimization to the MELD dataset, and the Word2Vec similarity-based sentiment lexicon expansion strategy remains limited in semantic consistency and emotional accuracy. In addition, earlier experimental setups trained the model with cross-entropy loss alone, overlooking the uncertainty and inconsistency that arise when information is fused across modalities. To address these issues, this work proposes a multimodal question-answering sentiment recognition method based on user preferences. By introducing a multimodal attention mechanism guided by sentiment preferences and by sentiment prototypes in a three-dimensional Valence-Arousal-Dominance (VAD) representation space, the method strengthens multimodal information fusion and the modeling of cross-modal consistency. Furthermore, an extended sentiment lexicon strategy and a context-dependent modeling mechanism are designed to improve the accuracy and stability of dialogue sentiment recognition. Systematic ablation and comparative experiments on the standard multimodal dialogue emotion dataset MELD show that the proposed method outperforms representative existing models in accuracy, precision, and F1 score, validating its effectiveness and applicability to multimodal question-answering sentiment recognition tasks.
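To make the prototype-guided fusion idea concrete, the following minimal PyTorch sketch illustrates how emotion prototypes placed in a 3-D VAD space could dynamically weight text, audio, and visual features. The module name VADPrototypeAttention, the modality dimensions, and the distance-based weighting are illustrative assumptions and do not reproduce the paper's actual architecture.

    # Illustrative sketch only: a hypothetical prototype-guided fusion module,
    # not the paper's implementation; all names and dimensions are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VADPrototypeAttention(nn.Module):
        """Weights modality features by their affinity to emotion prototypes
        placed in a 3-D Valence-Arousal-Dominance (VAD) space."""
        def __init__(self, text_dim, audio_dim, visual_dim, num_prototypes=7):
            super().__init__()
            # Learnable emotion prototypes: one 3-D VAD vector per emotion class.
            self.prototypes = nn.Parameter(torch.randn(num_prototypes, 3))
            # Project each modality into the shared VAD space.
            self.to_vad = nn.ModuleDict({
                "text": nn.Linear(text_dim, 3),
                "audio": nn.Linear(audio_dim, 3),
                "visual": nn.Linear(visual_dim, 3),
            })

        def forward(self, feats):
            # feats: dict mapping modality name -> (batch, dim) feature tensor.
            vad = {m: self.to_vad[m](x) for m, x in feats.items()}      # (B, 3) each
            # Affinity of each modality to its nearest prototype (higher = closer).
            scores = []
            for m in feats:
                dist = torch.cdist(vad[m], self.prototypes)             # (B, P)
                scores.append(-dist.min(dim=-1).values)                 # (B,)
            # Softmax over modalities yields dynamic, utterance-level weights.
            weights = F.softmax(torch.stack(scores, dim=-1), dim=-1)    # (B, M)
            # Weighted sum of the VAD projections as the fused representation.
            fused = sum(w.unsqueeze(-1) * vad[m]
                        for w, m in zip(weights.unbind(-1), feats))     # (B, 3)
            return fused, weights

In the setting described by the abstract, such per-utterance modality weights would feed a downstream emotion classifier; the sketch is meant only to illustrate the kind of dynamic modality weighting the method refers to.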