Related work: Lui, S. 2013. “A preliminary analysis of the continuous axis value of the three-dimensional PAD speech emotional state model”. The 16th edition of the International Conference on Digital Audio Effects (DAFx), Maynooth, Ireland. [session chair]
The traditionally used 2-D emotional model has two axes: Arousal (Energy) and Valence.
When the 2-D model is applied to classify the Big Six emotions (Joy, Anger, Fear, Disgust, Boredom, Sadness) and Neutral, it cannot separate Fear and Disgust very well.
We therefore propose a 3-D PAD model with a third axis, Aggressiveness. After several preliminary experiments, we define it as the fluctuation of the 2nd to 6th Log Frequency Power Coefficients (LFPC).
We use a German speech database with 800 clips for training. The results are as follows:
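As a rough illustration of the Aggressiveness definition, the sketch below computes log band powers on a log-spaced filter bank and measures how much the 2nd to 6th of them vary over time. It is not the implementation from the paper: the 12-band layout, the frame length and hop, the 100 Hz lower edge, and the reading of "fluctuation" as the standard deviation across frames are all assumptions, and the function names `lfpc` and `aggressiveness` are hypothetical.

```python
import numpy as np

def lfpc(signal, sample_rate, n_bands=12, frame_len=512, hop=256, f_min=100.0):
    """Return log band powers with shape (n_frames, n_bands).

    Assumed layout: band edges spaced uniformly on a log-frequency axis
    between f_min and the Nyquist frequency.
    """
    # Slice the signal into overlapping, Hann-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)

    # Power spectrum of every frame and the frequency of each FFT bin.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)

    edges = np.geomspace(f_min, sample_rate / 2.0, n_bands + 1)
    coeffs = np.empty((n_frames, n_bands))
    for b in range(n_bands):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        band_power = power[:, mask].sum(axis=1)
        # Small floor avoids log(0) for empty or silent bands.
        coeffs[:, b] = 10.0 * np.log10(band_power + 1e-12)
    return coeffs

def aggressiveness(signal, sample_rate):
    """Assumed 'fluctuation': mean over bands 2..6 of the std across frames."""
    coeffs = lfpc(signal, sample_rate)
    selected = coeffs[:, 1:6]  # 2nd to 6th coefficients (1-based numbering)
    return float(np.mean(np.std(selected, axis=0)))
```

Under these assumptions, a clip whose low-band energy swings strongly from frame to frame gets a higher score than a clip with steady energy, which is the behaviour the third axis is meant to capture.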
Figure 1. 3-D view of the average values of the 800 emotional German speech clips.
Figure 2. Another perspective of Figure 1.
Figure 3. Aggressiveness of the four negative emotions (400 clips).
Figure 4 shows that the classification rate is around 81%, with a significant improvement on Fear and Disgust. Using only the Energy and Valence axes of the 2-D model, other approaches can already classify every emotion except Fear and Disgust, which sit in almost the same position in the 2-D model. The third axis we define separates Fear from Disgust, which is why the gain is concentrated on these two emotions.
Figure 4. Classification result.
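The argument about Fear and Disgust can be illustrated with a toy nearest-centroid classifier. This is only a sketch: the centroid and sample coordinates below are made-up numbers chosen for illustration, not values from the paper, and nearest-centroid is an assumed stand-in because the classifier actually used is not stated here.

```python
import numpy as np

# Hypothetical centroids as (arousal, valence, aggressiveness).
# These are illustrative numbers, NOT measurements from the paper.
centroids = {
    "fear":    np.array([0.60, -0.50, 0.20]),
    "disgust": np.array([0.62, -0.48, 0.80]),
}

def nearest(sample, dims):
    """Label of the closest centroid, using only the first `dims` axes."""
    return min(
        centroids,
        key=lambda k: np.linalg.norm(sample[:dims] - centroids[k][:dims]),
    )

def margin(sample, dims):
    """Gap between the two centroid distances; a small gap means ambiguity."""
    d = sorted(np.linalg.norm(sample[:dims] - c[:dims]) for c in centroids.values())
    return d[1] - d[0]

# Hypothetical disgust-like clip: close to both centroids in 2-D,
# clearly on the disgust side along the third axis.
sample = np.array([0.605, -0.49, 0.75])

label_2d = nearest(sample, 2)  # decided on arousal and valence only
label_3d = nearest(sample, 3)  # decided with aggressiveness included
```

With these made-up numbers the two classes are almost indistinguishable on the first two axes, so the 2-D decision rests on a tiny margin, while the third axis makes the decision unambiguous.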
Figure 5 shows that the three axes are quite orthogonal to each other, but there is room for improvement.
Figure 5. PCC orthogonality of the three axes.
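An orthogonality check of this kind can be sketched as follows, assuming PCC stands for the Pearson correlation coefficient and that each clip is represented by one value per axis. The helper name `axis_correlations` and the random data are hypothetical placeholders, not the features from the experiments.

```python
import numpy as np

def axis_correlations(features):
    """Pairwise Pearson correlations between the columns of `features`.

    `features` has shape (n_clips, 3), with columns assumed to be
    arousal, valence, and aggressiveness. Off-diagonal entries near 0
    indicate axes that carry largely independent information.
    """
    return np.corrcoef(features, rowvar=False)

# Placeholder data standing in for per-clip axis values.
rng = np.random.default_rng(0)
arousal = rng.normal(size=200)
valence = rng.normal(size=200)
# Deliberately mix in some arousal to mimic imperfect orthogonality.
aggress = 0.3 * arousal + rng.normal(size=200)

pcc = axis_correlations(np.column_stack([arousal, valence, aggress]))
```

The result is a symmetric 3×3 matrix with ones on the diagonal; the closer the off-diagonal magnitudes are to zero, the more orthogonal the axes.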