Dual-Channel Attention-Based Multimodal Sentiment Analysis Model Integrating Text and Image Features
Abstract
With the rise of multimodal data on social media, sentiment analysis based solely on text has become insufficient to capture the richness of human emotion. To address this limitation, this paper proposes a dual-channel multimodal sentiment analysis model based on attention mechanisms, named ACMSA (Attention Channel Multimodal Sentiment Analysis). The model integrates textual and visual features to improve emotional understanding and classification accuracy. Text features are extracted with BERT and processed through a CNN-BiGRU-Attention dual-channel architecture that captures both local and global semantic dependencies. Image features are obtained via ResNet152 and enhanced by a Channel-Spatial Attention Module (CSAM) that adaptively emphasizes salient regions. Multimodal features are fused through a Co-Attention mechanism, enabling fine-grained interaction between textual and visual representations. Experimental evaluations on the MVSA-Single and MVSA-Multi Twitter datasets show that ACMSA outperforms state-of-the-art baselines, achieving accuracies of 77.08% and 74.42%, respectively. The results verify that attention-guided dual-channel modeling strengthens cross-modal correlation and interpretability. The framework provides a robust, extensible solution for sentiment analysis in multimedia-rich environments, with applications in emotion recognition, social media monitoring, and intelligent interaction systems.
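The Co-Attention fusion step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the feature shapes, the dot-product affinity, and the mean-pooled fusion are illustrative assumptions standing in for BERT token features and ResNet152 region features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(text_feats, image_feats):
    """Sketch of co-attention between token features (n_t, d) and
    image-region features (n_v, d); returns a fused (2*d,) vector."""
    # Affinity matrix: similarity of every token to every region.
    affinity = text_feats @ image_feats.T          # (n_t, n_v)
    # Each modality attends over the other (rows sum to 1).
    text_to_image = softmax(affinity, axis=1)      # (n_t, n_v)
    image_to_text = softmax(affinity.T, axis=1)    # (n_v, n_t)
    attended_image = text_to_image @ image_feats   # (n_t, d)
    attended_text = image_to_text @ text_feats     # (n_v, d)
    # Pool the cross-attended features and concatenate as the fused
    # multimodal representation fed to the sentiment classifier.
    return np.concatenate([attended_image.mean(axis=0),
                           attended_text.mean(axis=0)])
```

In the full model these inputs would be the attention-refined outputs of the text and image channels, and the fusion would be learned rather than a fixed dot-product pooling.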