PUB-SalNet: A Pre-trained Unsupervised Self-Aware Backpropagation Network for Biomedical Salient Segmentation

Salient segmentation is a critical step in biomedical image analysis: it aims to segment the regions that are most interesting to humans. Recently, supervised methods have achieved promising results in biomedical areas, but they depend on annotated training data sets, which require labor and proficiency in the relevant background knowledge. In contrast, unsupervised learning makes data-driven decisions by obtaining insights directly from the data themselves. In this paper, we propose a completely unsupervised self-aware network based on pre-training and attentional backpropagation for biomedical salient segmentation, named PUB-SalNet. First, we aggregate a new biomedical data set, called SalSeg-CECT, from several simulated Cellular Electron Cryo-Tomography (CECT) data sets featuring rich salient objects, different signal-to-noise ratio (SNR) settings, and various resolutions. Based on the SalSeg-CECT data set, we then pre-train a model specially designed for biomedical tasks as a backbone module to initialize the network parameters. Next, we present a U-SalNet network that learns to selectively attend to salient objects; it includes two types of attention modules that facilitate learning saliency through global contrast and local similarity. Finally, we jointly refine the salient regions together with the feature representations from U-SalNet, with the parameters updated by self-aware attentional backpropagation. We apply PUB-SalNet to the analysis of 2D CECT images and achieve state-of-the-art performance on simulated biomedical data sets. Furthermore, PUB-SalNet can be easily extended to 3D images. The experimental results on both 2D and 3D data sets demonstrate the generalization ability and robustness of our method.
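The two attention cues named above (global contrast and local similarity) can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not U-SalNet's actual modules: global contrast scores each pixel by how far its feature vector deviates from the image-wide mean, and local similarity scores how coherent a pixel is with its spatial neighborhood.

```python
import numpy as np

def global_contrast_attention(feat):
    """Global-contrast cue: pixels whose features deviate most from the
    image-wide mean feature receive the highest attention (hypothetical
    simplification of the global attention module)."""
    mean_feat = feat.mean(axis=(0, 1))                    # global mean feature, (C,)
    contrast = np.linalg.norm(feat - mean_feat, axis=-1)  # per-pixel deviation, (H, W)
    return contrast / (contrast.max() + 1e-8)             # normalize to [0, 1]

def local_similarity_attention(feat, radius=1):
    """Local-similarity cue: a pixel is favored when its feature is
    similar to its neighborhood mean (a coherent region, not isolated noise)."""
    H, W, C = feat.shape
    att = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            nb = feat[i0:i1, j0:j1].reshape(-1, C).mean(axis=0)  # neighborhood mean
            center = feat[i, j]
            # cosine similarity between the pixel and its neighborhood mean
            att[i, j] = center @ nb / (np.linalg.norm(center) * np.linalg.norm(nb) + 1e-8)
    return att

feat = np.random.default_rng(0).random((8, 8, 16))  # toy (H, W, C) feature map
g = global_contrast_attention(feat)
s = local_similarity_attention(feat)
saliency = 0.5 * (g + s)  # naive fusion of the two cues into one saliency map
```

In the actual network, both cues are learned modules refined jointly with the features; this sketch only shows why the two signals are complementary (contrast finds distinctive pixels, similarity suppresses isolated noise).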

Is Deep Learning All You Need for Unsupervised Saliency Detection?

Pre-trained networks have recently achieved great success in computer vision. At present, most deep learning-based saliency detection methods, whether supervised or unsupervised, use pre-trained networks to extract features. However, we found that when unsupervised saliency detection is performed on grayscale biomedical images, pre-trained networks such as VGG cannot effectively extract salient features. We suggest that VGG is unable to learn salient information from grayscale biomedical images and that its performance depends heavily on RGB cues and the quality of the training set. To verify this hypothesis, we construct an adversarial data set featuring a low signal-to-noise ratio (SNR), low resolution, and rich salient objects, and conduct a series of probing experiments. Moreover, to further explore what VGG has learned, we visualize its intermediate feature maps. To the best of our knowledge, we are the first to investigate the reliability of deep learning methods for unsupervised saliency detection on grayscale biomedical images. It is worth noting that our adversarial data set also provides a more robust evaluation of saliency detection and may serve as a standard benchmark for future work on this task.
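The probing idea, passing a low-SNR grayscale image through a conv layer and inspecting the resulting feature maps, can be sketched as follows. This is not the paper's VGG pipeline: the filters here are random stand-ins (a real probe would load pre-trained VGG weights), and the "visualization" is just a channel-averaged activation map.

```python
import numpy as np

def conv2d(img, kernels):
    """Naive 'valid' 2-D convolution of a single-channel image with a bank
    of kernels, followed by ReLU; stands in for one conv layer of a
    pre-trained network."""
    kh, kw = kernels.shape[1:]
    H, W = img.shape
    out = np.zeros((kernels.shape[0], H - kh + 1, W - kw + 1))
    for c, k in enumerate(kernels):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return np.maximum(out, 0)  # ReLU

rng = np.random.default_rng(0)
# toy low-SNR grayscale "biomedical" image: a faint bright blob in heavy noise
img = rng.normal(0.0, 1.0, (32, 32))
img[12:20, 12:20] += 2.0
kernels = rng.normal(0.0, 0.1, (8, 3, 3))  # stand-in for pre-trained filters
fmap = conv2d(img, kernels)                # (channels, H', W') feature maps
heatmap = fmap.mean(axis=0)                # channel-averaged map for inspection
```

Comparing such heatmaps against the known object location is one simple way to check whether a layer's features carry any salient signal on grayscale, low-SNR inputs.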

Complementary Fusion of Multi-Features and Multi-Modalities in Sentiment Analysis

Sentiment analysis, mostly based on text, has developed rapidly in the last decade and has attracted widespread attention in both academia and industry. However, information in the real world usually comes in multiple modalities. In this paper, we consider the task of multimodal sentiment analysis based on audio and text, and propose a novel fusion strategy, comprising both multi-feature fusion and multi-modality fusion, to improve the accuracy of audio-text sentiment analysis. We call the resulting model DFF-ATMF (Deep Feature Fusion - Audio and Text Modality Fusion); the features it learns are complementary to each other and robust. Experiments on the CMU-MOSI dataset and the recently released CMU-MOSEI dataset, both collected from YouTube for sentiment analysis, show the very competitive results of the proposed DFF-ATMF model. Surprisingly, DFF-ATMF also achieves state-of-the-art results on the IEMOCAP dataset, indicating that the proposed fusion strategy generalizes well to multimodal emotion recognition.
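The distinction between the two fusion levels can be made concrete with a small sketch. This is a hypothetical stand-in for DFF-ATMF, not its actual architecture: multi-feature fusion is shown as concatenating several feature vectors within one modality, and multi-modality fusion as a convex combination of per-modality sentiment distributions (the mixing weight `alpha` is invented here for illustration).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_feature_fusion(features):
    """Multi-feature fusion: combine several feature vectors learned from
    one modality into a single branch representation (shown here as
    concatenation)."""
    return np.concatenate(features)

def multi_modality_fusion(audio_logits, text_logits, alpha=0.5):
    """Multi-modality fusion at the decision level: a convex combination
    of per-modality sentiment distributions. `alpha` is a hypothetical
    mixing weight, not a value from the paper."""
    return alpha * softmax(audio_logits) + (1 - alpha) * softmax(text_logits)

rng = np.random.default_rng(0)
audio_feat = multi_feature_fusion([rng.normal(size=64), rng.normal(size=32)])  # e.g. two kinds of acoustic features
text_feat  = multi_feature_fusion([rng.normal(size=128)])                      # e.g. word-embedding features
# stand-in classifier heads producing 3-way sentiment logits per modality
Wa, Wt = rng.normal(size=(3, 96)), rng.normal(size=(3, 128))
fused = multi_modality_fusion(Wa @ audio_feat, Wt @ text_feat)  # final sentiment distribution
```

The point of doing both levels is that fusing early (features) and fusing late (modalities) capture different complementarities, which is the intuition the abstract attributes to DFF-ATMF.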

Learning Robust Heterogeneous Signal Features from Parallel Neural Network for Audio Sentiment Analysis

Audio sentiment analysis is a popular research area that extends conventional text-based sentiment analysis to rely on the effectiveness of acoustic features extracted from speech. However, current progress in audio sentiment analysis mainly focuses on extracting homogeneous acoustic features or does not fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model with a parallel combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) based network, to obtain representative features, termed the Audio Sentiment Vector (ASV), that maximally reflect sentiment information in audio. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused creatively from the two branches. In the CNN branch, spectrum graphs produced from the signals are fed as inputs, while in the LSTM branch, the inputs include spectral features and cepstral coefficients extracted from individual utterances in the audio. In addition, a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism is used for feature fusion. Extensive experiments show that our model can recognize audio sentiment precisely and quickly, and demonstrate that our ASV outperforms traditional acoustic features and vectors extracted from other deep learning models. Furthermore, the experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the Multimodal Opinion-level Sentiment Intensity (MOSI) dataset.
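The fusion step, attending over the concatenated outputs of the two branches to produce a fixed-length ASV, can be sketched in numpy. This is a hedged simplification: the recurrent dynamics of the BiLSTM are replaced by pre-computed per-utterance features, and only the attention pooling is shown; all shapes and the scoring vector `w` are invented for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(hidden, w):
    """Attention over a sequence of hidden states: score each time step,
    softmax-normalize the scores, and take the weighted sum to get a
    fixed-length vector (here, a stand-in for the ASV)."""
    scores = hidden @ w        # one scalar score per time step, (T,)
    alpha = softmax(scores)    # attention weights summing to 1
    return alpha @ hidden      # weighted sum of hidden states, (D,)

rng = np.random.default_rng(1)
T = 10                                        # number of utterance segments
cnn_feats  = rng.normal(size=(T, 128))        # CNN branch: spectrogram-derived features
lstm_feats = rng.normal(size=(T, 128))        # LSTM branch: spectral/cepstral features
hidden = np.concatenate([cnn_feats, lstm_feats], axis=1)  # stand-in for BiLSTM states
w = rng.normal(size=256)                      # hypothetical attention scoring vector
asv = attention_pool(hidden, w)               # fused Audio Sentiment Vector
```

Attention pooling lets the fusion layer weight informative segments more heavily than silence or noise, which is the usual motivation for attention over plain mean pooling.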