Fusion of Multi-domain EEG Signatures Improves Emotion Recognition

Background: Affective computing has gained increasing attention in the area of the human-computer interface where electroencephalography (EEG)-based emotion recognition occupies an important position. Nevertheless, the diversity of emotions and the complexity of EEG signals result in unexplored relationships between emotion and multichannel EEG signal frequency, as well as spatial and temporal information. Methods: Audio-video stimulus materials were used that elicited four types of emotions (sad, fearful, happy, neutral) in 32 male and female subjects (age 21–42 years) while collecting EEG signals. We developed a multidimensional analysis framework using a fusion of phase-locking value (PLV), microstates, and power spectral densities (PSDs) of EEG features to improve emotion recognition. Results: An increasing trend of PSDs was observed as emotional valence increased, and connections in the prefrontal, temporal, and occipital lobes in high-frequency bands showed more differentiation between emotions. Transition probability between microstates was likely related to emotional valence. The average cross-subject classification accuracy of features fused by Discriminant Correlation Analysis achieved 64.69%, higher than that of single mode and direct-concatenated features, with an increase of more than 7%. Conclusions: Different types of EEG features have complementary properties in emotion recognition, and combining EEG data from three types of features in a correlated way, improves the performance of emotion classification.

Keywords

electroencephalography (EEG)

emotion

classification

power spectral density (PSD)

microstate

phase-locking value (PLV)

1. Introduction

A key element of advanced human-computer interaction is the communication of emotions. A reliable emotion recognition system with acceptable adaptability, robustness, and recognition accuracy is an important premise in realizing affective human-computer interaction [1].

Emotion can be evaluated by subjective feelings, behavioral tendencies, motor expressions, physiological reactions, and cognitive appraisals such as blood pressure, heart rate, eye activity, skin resistance measurements, and electroencephalography (EEG) [2]. Among these, EEG signals provide objective and abundant informational signatures in response to varying emotional states [3, 4]. EEG features used in most studies of emotion principally focus on frequency and/or spatial domains such as power spectral densities (PSD) [5] and functional connectivity [6] such as phase-locking value (PLV) [7, 8, 9], Pearson correlation coefficient [10], and phase lag index [9]. Nevertheless, abundant temporal transient topology information is ignored in these studies. More recently, microstate, which takes time information into account, has attracted increasing attention in the study of emotional EEG [11, 12, 13].

Fusion of biometrics information often requires more reliable recognition, and improved recognition performance can be achieved through feature-level fusion [14, 15, 16]. Therefore, herein we provide more comprehensive emotion recognition using spatial, temporal, and spectral-domain EEG features by integrating PSD, PLV, and microstate.

2. Materials and Methods

2.1 Study Participants

Thirty-two healthy, right-handed adults with normal sleeping patterns from Tianjin Artificial Intelligence Innovation Center (TAIIC) participated in this study. These participants included 21 males and 11 females between the ages of 21 and 42 years (Mean $\pm{}$ SD = 27.4 $\pm{}$ 4.2 years). All participants learned the experimental process and signed informed consent agreements prior to enrollment. Experimental approval was obtained from the Ethics Committee of Tianjin University (TJUE-2021-138).

2.2 EEG Data Acquisition and Preprocessing

A flow chart outlining experimental procedures is shown in Fig. 1. Seven members of our research group constructed an initial database of 129 movie clips. According to emotional movie database (EMDB) requirements [17], 29 of these videos were selected as alternative stimulating materials. Each video has been carefully reviewed to ensure the stability of content, consistent character portrayal, clear image and sound quality, and a fixed refresh rate [2]. Additionally, we have taken great care to ensure that each video clip does not contain both positive and negative emotional content. Twenty-seven annotators (age: 26.26 $\pm{}$ 3.30 years; 19 males, 8 females) assessed these 29 video clips by self-rating their emotions (valence, arousal) according to a self-assessment questionnaire, following the practice of decoding affective physiological responses (DECAF) [18]. Three annotators were regarded as outliers based on abnormal rating values. The consistency of the valence/arousal ratings among the remaining 24 annotators were 0.69 and 0.30 according to Krippendorff’s alpha metric [19] and suggested that the selected videos consistently induced designated emotions among annotators. More details regarding the emotion stimulating materials are outlined in previous work [20]. Twenty videos (sad $\times{}$ 5, fear $\times{}$ 5, happy $\times{}$ 5, neutral $\times{}$ 5) were further selected as stimulating materials according to their valence and arousal values in this study, with each video lasting for 150 seconds.

Fig. 1.

Flow chart of the experimental procedures. DCA, discriminant correlation analysis; PCA, principal component analysis; SVM, support vector machine.

Subjects wore a NEUROSCAN wireless electrode cap (Neuroscan SynAmps2 Model 8050, Compumedics USA, Inc., Charlotte, NC, USA) with 64 channels to collect EEG signals. The electrode cap followed the international 10–20 system, and the sampling rate was 1000 Hz. Four electrode channels (M1, M2, CB1, CB2) were excluded. Subjects watched video stimulating materials on a 15-inch screen with a refresh rate of 60 Hz. The ground electrode was located between Fz and FPz, and the reference electrode was located between Cz and CPz (Fig. 2A). Raw EEG data were preprocessed using MATLAB’s EEGLAB toolbox (https://sccn.ucsd.edu/eeglab) [21] with bandpass filtering between 0.3–50 Hz, 50 Hz trap filtering to remove powerline interference, down-sampling to 200 Hz to reduce computational complexity, and independent component analysis (ICA) to remove eye-movement artifacts. We only analyzed the EEG during the video viewing phase, and EEG signals labeled in the videos were considered ground truth.

Fig. 2.

The visualization of the extracted three types of EEG features. (A) Diagram of sixty-channel EEG cap. (B) Five frequency band topographic maps of group-averaged PSD values (left) and summing of PSDs across all channels (right). (C) The top 5% PLV connectivities and significant differences in the PLV feature of each band. (D) Eight microstates were extracted, and (E) Transition probability between microstates (Mean $\pm{}$ SEM). EEG, electroencephalography; PSD, power spectral density; PLV, phase-locking value; SEM, standard error of mean.

2.3 EEG Feature Extraction

2.3.1 PSD Extraction

PSD describes the power distribution of signals between different frequenciesc and is a widely used metric to analyze EEG signals in five frequency bands to obtain significant frequency-domain information: $\delta{}$ (1–3 Hz); $\theta{}$ (4–7 Hz); $\alpha{}$ (8–13 Hz); $\beta{}$ (14–30 Hz); $\gamma{}$ (31–50 Hz). For each trial, we first divided the EEG signal (150 seconds) into 30 non-overlapping short segments of 5 seconds each. PSD were extracted from these segments independently and averaged across all segments. As outlined by Sani et al. [22], PSD was calculated using this equation:

$S_{x}=20\log_{10}\left(\sqrt{\frac{1}{N}\sum_{n=1}^{N}|x(n)|^{2}}\right)$

where $N$ represents the number of time samples in signal $x$ .

2.3.2 PLV Extraction

Interaction between brain regions is an essential feature of human brain function. PLV provides phase coupling information between EEG signal pairs, and this measure is widely used to construct functional brain networks for EEG-based emotion recognition [7, 8]. As previously outlined [23], PLV between signal a and b was computed using the following equation:

$PLV_{(a,b)}=\left|\frac{1}{N}\sum_{j=1}^{N}e^{i\left(\phi_{a}(j)-\phi_{b}(j)% \right)}\right|$

where $i$ is the imaginary unit, $\phi{}$ denotes the instantaneous phase, and N is the total number of time samples.

2.3.3 Microstate Extraction

Microstate was extracted using the EEGLAB microstate 1.0 plug-in (https://sccn.ucsd.edu/eeglab). In brief, multi-channel EEG signals were decomposed into a series of transient potential topologies. The global field potential (GFP) was then obtained by calculating the standard deviation of the signal across all electrodes. The peak point of the GFP curve represented the moment of the highest signal-to-noise ratio, and the potential of each electrode at the peak point of the GFP curve was recorded. Then the topographic map set was empirically into eight microstates using a K-means clustering algorithm. Statistical characteristics of these microstates such as duration, occurrence, contribution, and transition probability, were used for subsequent analysis and classification.

2.4 Emotion Recognition

Discriminant Correlation Analysis (DCA) [16] was used to fuse multi-domain EEG features. The concept of DCA maximizes pairwise correlations between feature sets in the same class while eliminating correlation between feature sets in different classes. This results in limiting the intra-class correlation. Principal component analysis (PCA) was used for dimensionality reduction prior to feature fusion and for high-dimensional features that required large computing requirements, which may lead to the dimensionality curse. Following established practices [24], pairwise feature sets were first fused using DCA, and all fused features were then summed. We carried out leave-one-subject-out cross validation with linear-support vector machine (SVM) as a classifier. The average classification accuracy was calculated as an index for the performance evaluation of our emotion recognition model.

3. Results

3.1 Power Spectrum Analysis

We calculated the PSDs of 60 channels in the five frequency bands outlined above and obtained 5 $\times{}$ 60 = 300 PSD features. For the four emotion types, the group-average PSD from individual EEG signals was calculated. Topographic PSD maps of five bands are shown in Fig. 2B. By adding up the PSD of all channels; the PSD tends to increase with increases in emotional valence (negative emotions $<$ neutral $<$ positive emotions) in all frequency bands.

3.2 Functional Connectivity Analysis

The PLV was calculated in the form of a 60 $\times{}$ 60 matrix to represent functional connectivity for the classification of emotion. We extracted the upper triangular elements of five bands as features for classification, resulting in 5 $\times{}$ 60 $\times{}$ (60–1)/2 = 8850 dimensional feature vectors. Then we selected the top 5% strongest connectivity links for analysis (Fig. 2C). Highly correlated connections in the low-frequency bands $\delta{}$ and $\theta{}$ were principally distributed within the occipital lobe (electrodes of O, PO, and P), and showed strong similarities between different emotions. In contrast, highly correlated connections in the high-frequency bands $\alpha{}$ , $\beta{}$ and $\gamma{}$ primarily distributed within the prefrontal and occipital lobes and showed more differentiation between emotions.

We conducted a one-way analysis of variance (ANOVA) to test whether there were significant differences in the connectivity strength among different emotions at each frequency band (Fig. 2C, right column). We set p $<$ 0.01 as the threshold for statistical significance. As outlined above, few connectivity differences were observed among emotions in the $\delta{}$ and $\theta{}$ bands. As many connections pass the significance test in $\alpha{}$ , $\beta{}$ and $\gamma{}$ bands, we selected the top 5% significant connections, as calculated by F-statistic, for visualization to facilitate observation. Unlike group-averaged, highly correlated connections that are mainly distributed between neighboring electrodes (Fig. 2C, left three columns), significant connectivity differences among emotions are primarily distributed between distant electrodes (Fig. 2C, right column). Consistent with a previous study [23], these findings showed that significant differences were mainly observed in the connections between nodes located in the temporal lobe suggesting that the temporal lobe region contained more information regarding emotion recognition. In regards to the $\alpha{}$ band, in addition to connections related to the temporal lobe, regions within the prefrontal and occipital lobe showed more active interactions in emotional brain activities.

3.3 Microstate Analysis

We obtained eight microstates, designated microstate 1 (MS1) through MS8, which represent all topographic maps (Fig. 2D). MS1 occupied approximately 20% of occurrence and others occupied approximately 11% each. MS1 and MS2 were symmetric in the occipital-to-prefrontal orientation, and MS8 was symmetric with regards to the parietal-to-peripheral axis, while other microstates were lateralized. The microstates of four emotions were analyzed, and we observed variety in MS5 through MS8, while consistency was observed across emotions in MS1 through MS4.

We also computed the transition probability between microstates, and the results are shown in Fig. 2E. The transition probability of neutral was lower than the other three emotions in the first column (MS2~MS8 $\rightarrow$ MS1). We also observed an increasing trend in the 5th column as well as a decreasing trend in the 8th column with the increase of emotional valence.

3.4 Emotion Recognition

As shown in Fig. 3A, we measured an emotion classification accuracy of 44.30 $\pm{}$ 14.5% (Mean $\pm{}$ SD), 56.88 $\pm{}$ 13.79%, and 56.95 $\pm{}$ 14.12% for PSD, PLV and microstate, respectively. Two possibilities may account for the higher emotion classification accuracies of PLV and microstate compared with PSD. First, spatial and temporal features of EEG may be superior to spectral features in measuring emotion recognition. Second, the interaction of multiple brain regions may better represent changes in emotions rather than the activation of local brain regions. When we conducted pairwise feature fusion, PLV + microstate performed the best with an accuracy of 62.73 $\pm{}$ 13.45%. Although PSD reduced classification performance when used pairwise with either PLV or microstate, PSD still provided complementary information that improved the classification accuracy of PSD + PLV + microstate to 64.69 $\pm{}$ 11.99%. Of note, classification accuracy fell to 57.50 $\pm{}$ 13.46% when we directly concatenated the PSD + PLV + microstate features without conducting DCA, indicating that DCA improved EEG-based emotion classification performance (t (31) = 3.03, p $<$ 0.01, paired-sample t-test).

Fig. 3.

The classification accuracies and confusion matrix. Classification accuracy (A) and confusion matrix (B, C, D, E) of fused features and single mode features. *, p $<$ 0.01, **, p $<$ 0.001, paired-sample t-test.

We next generated a group-average confusion matrix using the fused features (Fig. 3B) and the single mode feature of PSD (Fig. 3C), PLV (Fig. 3D), and microstate (Fig. 3E). Columns and rows of the confusion matrix represented classified and ground truth labels, respectively. The entries in the diagonal showed percentages of emotion classes that were classified correctly. We observed that negative emotions, such as sadness and fear, exhibited relatively low classification accuracies in contrast to the other two non-negative emotions. This led us to develop the view that negative emotions are more difficult to recognize [25]. In general, PLV features had advantages in recognizing happy, sad, and neutral emotions, whereas microstate features had higher accuracy in recognizing fearful emotional states. This finding indicates the complementarity of different feature types. A feature-fusion approach stably improves the performance of emotion classification with different degrees of improvement; for example sad was 1.26%, fear was 3.14%, neutral was 6.92%, and happy was 5.03%. Taking into consideration that the emotion classification in this study was both cross-subject and four-class and that two types of negative emotions were included, our proposed model satisfied classification performance compared with other state-of-the-art work [23, 26]. For example, Chen et al. [23] conducted frequency-domain fusion on EEG features and obtained two-class (negative vs. positive) cross-subject emotion recognition accuracies of 71.14 $\pm{}$ 6.97%, 61.48 $\pm{}$ 10.97%, and 66.62 $\pm{}$ 9.10% on SEED, BCI2020-A, and BCI2020-B datasets, respectively. Based on EEG signals, Cai et al. [26] obtained subject-specific accuracies between 58% to 62% using a four-class (joy, peace, anger, and depression) cross-subject emotion recognition.

4. Discussion

In this study, we utilized leave-one-subject-out cross validation to conduct cross-subject emotion recognition. We used this approach rather than randomly dividing all segments into a number of folds and thus avoided information in the training set leaking to the test set. This ensured the reliability of the recognition performance [27]. Nevertheless, inter-subject variability in EEG signals may prevent inter-class separation and thus may have a significant adverse effect on emotion recognition performance [28]. Hence, building models that account for the problem of inter-subject variability and extract distinctive emotional features across subjects is a possible future direction for improving such modeling.

Negative emotions have adverse effects on mental stress as well as vigilance [29, 30], and recognizing negative emotions plays an important role in effective human-computer interaction. Nevertheless, negative emotions are easily confused [23]. Negative emotions are not only easily confused with other negative emotions but also have a considerable probability of being identified as non-negative emotions. For example, sad was wrongly classified as neutral, with a probability of more than 20%, when using a single mode feature. Our proposed fusion model not only achieved stable improvement for negative emotion recognition performance, but it greatly reduced the probability of identifying negative emotions as non-negative emotions. This development has practical significance as an early warning of negative emotions and an early diagnosis of mental stress.

Although multi-modal features can generally provide complementary information useful in improving recognition accuracy [14, 15, 23], the feature-fusion strategy is noteworthy. In comparison to single mode features and direct-concatenated features, this study used DCA-fused features and found these to perform better in classification. Specifically, we observed, with an accuracy increase of more than 7%, that the fusion, when based on the relationship between features, performed better than direct concatenation. We note that, despite good classification results, the feature-fusion strategy can be poorly interpreted as features were summed after mapping to have the same dimensions. In this study, although the neural activity of emotions was interpreted using EEG-based features from different angles independently, feature-fusion methodologies that combine recognition performance improvements with interpretability are encouraging.

5. Conclusions

In conclusion, we showed that different types of EEG features have complementary properties in regard to emotion recognition. Further, when EEG features from three different types were fused in a correlated way, we observed improved performance of emotion classification. Over 64% accuracy was achieved in the cross-subject experiments, a value which is significantly better than either corresponding single mode features or directly concatenated features.

Availability of Data and Materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Author Contributions

ZGL, LX and EWY designed the research study. XMW and YP performed the research. SKZ, SL, DM and YY provided help and advice on the experimental paradigm design. XMW and YP analyzed the data. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript. All authors have participated sufficiently in the work and agreed to be accountable for all aspects of the work.

Ethics Approval and Consent to Participate

All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Tianjin University (TJUE-2021-138).

Acknowledgment

We would like to express our gratitude to all those who helped me during the writing of this manuscript. Thanks to all the peer reviewers for their opinions and suggestions.

Funding

This study was supported by the National Natural Science Foundation of China under Grant 62076250, Grant 61901505 and Grant 61703407.

Conflict of Interest

The authors declare no conflict of interest.

References

[1]

Kim J. Bimodal emotion recognition using speech and physiological changes. 2007; 265: 280.