Complementary Fusion of Multi-Features and Multi-Modalities in Sentiment Analysis

Sentiment analysis, mostly based on text, has developed rapidly over the last decade and has attracted widespread attention in both academia and industry. However, information in the real world usually comes in multiple modalities. In this paper, based on audio and text, we consider the task of multimodal sentiment analysis and propose a novel fusion strategy that combines multi-feature fusion with multi-modality fusion to improve the accuracy of audio-text sentiment analysis. We call this model DFF-ATMF (Deep Feature Fusion - Audio and Text Modality Fusion); the features it learns are complementary to each other and robust. Experiments on the CMU-MOSI dataset and the recently released CMU-MOSEI dataset, both collected from YouTube for sentiment analysis, show that the proposed DFF-ATMF model achieves very competitive results. Surprisingly, DFF-ATMF also achieves state-of-the-art results on the IEMOCAP dataset, indicating that the proposed fusion strategy generalizes well to multimodal emotion recognition.
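For readers who want a concrete picture of audio-text modality fusion, the sketch below shows a minimal late-fusion baseline in PyTorch. It is an illustrative assumption, not the DFF-ATMF architecture from the paper: the GRU encoders, feature dimensions, and concatenation-plus-MLP fusion head are placeholders standing in for the paper's multi-feature and multi-modality fusion modules.

```python
# Minimal, hypothetical audio-text late-fusion sketch (NOT the authors'
# DFF-ATMF model). Layer sizes, GRU encoders, and concatenation fusion
# are illustrative assumptions only.
import torch
import torch.nn as nn


class AudioTextFusion(nn.Module):
    def __init__(self, audio_dim=74, text_dim=300, hidden=128, num_classes=2):
        super().__init__()
        # Each modality gets its own sequence encoder (assumed: single-layer GRUs).
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.text_enc = nn.GRU(text_dim, hidden, batch_first=True)
        # Modality fusion here is plain concatenation followed by a small MLP.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, audio_seq, text_seq):
        # Use the final hidden state of each encoder as that modality's feature vector.
        _, h_audio = self.audio_enc(audio_seq)
        _, h_text = self.text_enc(text_seq)
        fused = torch.cat([h_audio[-1], h_text[-1]], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = AudioTextFusion()
    audio = torch.randn(4, 50, 74)   # batch of 4 clips, 50 acoustic frames each
    text = torch.randn(4, 20, 300)   # batch of 4 sentences, 20 word embeddings each
    logits = model(audio, text)
    print(logits.shape)              # torch.Size([4, 2])
```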

[Accepted by NeurIPS 2019 Workshop on NewInML, Vancouver, BC, Canada.]

citation:

@article{chen2019sentiment,
  title={Sentiment Analysis using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities},
  author={Chen, Feiyang and Luo, Ziqian},
  journal={CoRR},
  year={2019}
}

Download paper here