Video-audio emotion recognition takes advantage of the complementary information carried by multiple modalities. Because the target features are time related but not strictly aligned in time, the combined video-audio features reduce to separate video features and audio features. Toward this goal, the spectrogram, an effective vocal representation in neural-network solutions, is selected so that convolutional filters can be applied to it. Inspired by LSTM-based image captioning, where embedded word information and image information are spatially aligned, we embed the audio spectrogram together with the image sequences, since the spectrogram converts temporal information into spatial information. We propose both an architecture and a framework that optimize the alignment of these temporal features, and we provide an analysis of the significant performance improvement along with a discussion of general video-audio emotion recognition tasks.
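As an illustration of this preprocessing step, the following is a minimal sketch assuming the librosa library; the sampling rate, mel-band count, and window parameters are illustrative assumptions, not values from the paper.

# Minimal sketch: converting an audio clip into a log-mel spectrogram so that
# temporal structure becomes spatial structure suitable for 2-D convolutions.
# Assumes librosa; all parameter values are illustrative.
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16000, n_mels=64, n_fft=1024, hop_length=256):
    """Load an audio file and return a (n_mels, time) log-mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log compression: time is now encoded along the second (spatial) axis,
    # so convolutional filters can be applied as with images.
    return librosa.power_to_db(mel, ref=np.max)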
This paper explores the practicality of transfer learning for the emotion recognition task. We demonstrate the superior performance of transfer learning from face identification, compared with feed-forward deep neural networks trained from scratch and with generic transfer learning from object classification. We show that a better-matched source domain helps to initialize the network, enabling more efficient learning from the target training samples. In this way, even a network with a complex architecture can overcome over-fitting and achieve better results than other solutions given the same amount of training data. We discuss detailed training strategies for obtaining the best performance from such transfer learning, using fine-tuning mechanisms on the classical VGG-16 architecture and the publicly accessible FER2013 emotion database.
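As a hedged sketch of such fine-tuning, the following PyTorch fragment replaces the VGG-16 classifier head and assigns a smaller learning rate to the transferred layers. Note that torchvision ships only ImageNet (object-classification) weights; the paper's face-identification weights would have to be loaded separately via load_state_dict. All hyper-parameters here are assumptions.

# Minimal VGG-16 fine-tuning sketch in PyTorch (illustrative, not the
# paper's exact training recipe).
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Replace the final classifier layer with a 7-way emotion head (FER2013 labels).
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 7)

# Fine-tuning strategy: smaller learning rate for transferred convolutional
# layers, larger one for the freshly initialized head.
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(), "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], momentum=0.9)
criterion = nn.CrossEntropyLoss()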
This paper addresses facial multi-expression recognition as an element of a human-computer interface. Discriminative features are extracted from each camera frame using the Candide-3 3D model of the human head. Namely, for selected parts, including the mouth, nose, eyes, and eyebrow areas, shape deformation units and animated motion units are specified; they are corrected versions of the units originally defined for the Candide-3 model. Scalar parameters for the affine motion, shape deformation, and animated motion are identified by the nonlinear least-squares Levenberg-Marquardt (LM) method. The error function is based on the orthographic projection of Candide 3D points corresponding to 68 facial landmarks detected on-line. The feature vectors consist of fewer than 10 coefficients, controlling only the animated motion of the Candide 3D model. The multi-expression classifier is based on a linear Structural Support Vector Machine (SSVM) model, trained on a few hundred images for each of four expression classes: idle, smile, anger, and surprise. On-line experiments with a web camera confirm a high correlation between the predicted class and the subjective impression of facial expression. Moreover, compared with the FP68/SSVM classifier, our new proposal outperforms it by a "three times 15" rule: its success rate is about 15% higher, while its feature vector is more than 15 times shorter, and therefore its recognition time is about 15 times shorter.
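A minimal sketch of the fitting step is given below, assuming SciPy; the deformation basis B, the parameter layout, and the weak-perspective projection details are illustrative assumptions rather than the paper's exact formulation.

# Sketch: identify pose/deformation parameters by nonlinear least squares
# (Levenberg-Marquardt) so that orthographically projected Candide-3 vertices
# match the detected 68 facial landmarks.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(p, X0, B, landmarks2d):
    """p = [s, tx, ty, rx, ry, rz, a_1..a_k]: scale, translation, rotation
    vector, and k deformation coefficients; X0 is the (N, 3) neutral mesh,
    B is the (k, N, 3) deformation basis, landmarks2d is (N, 2)."""
    s, tx, ty = p[0], p[1], p[2]
    rot = Rotation.from_rotvec(p[3:6])
    alpha = p[6:]
    X = X0 + np.tensordot(alpha, B, axes=1)    # apply shape/animation units
    # Orthographic (weak-perspective) projection: rotate, scale, keep x-y.
    proj = s * rot.apply(X)[:, :2] + np.array([tx, ty])
    return (proj - landmarks2d).ravel()

# Levenberg-Marquardt solver, as in the paper's LM optimization:
# fit = least_squares(residuals, p0, method='lm', args=(X0, B, landmarks2d))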
The paper presents the results of our recent work on the development of the contemporary computing platform DC2 for multimedia education using WebGL and Web Audio (the W3C standards). Using the literate programming paradigm, the WEBSA educational tools were developed. They offer the user (student) access to an expandable collection of WebGL shaders and Web Audio scripts. The unique feature of DC2 is the option of literate programming, offered to both the author and the reader, in order to improve the interactivity of lightweight WebGL and Web Audio components. For instance, users can define source audio nodes (including synthetic sources), destination audio nodes, and nodes for audio processing such as sound wave shaping, spectral band filtering, and convolution-based modification. In the case of WebGL, besides classic graphics effects based on mesh and fractal definitions, novel shader-based image processing and analysis is offered, such as nonlinear filtering, histograms of gradients, and Bayesian classifiers.
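As a rough illustration of one of the offered analyses, the following NumPy sketch computes a gradient-orientation histogram on the CPU; the actual DC2 implementation runs this kind of analysis as a WebGL shader, and the bin count and normalization used here are assumptions.

# CPU reference (NumPy, not GLSL) of a histogram-of-gradients analysis:
# per-pixel gradient orientation is quantized into bins and accumulated,
# weighted by gradient magnitude.
import numpy as np

def gradient_histogram(image, n_bins=9):
    """Return a normalized orientation histogram of a grayscale image."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation
    bins = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=n_bins)
    return hist / (hist.sum() + 1e-12)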
A novel smile recognition algorithm is presented, based on the extraction of 68 facial salient points (fp68) using an ensemble of regression trees. The smile detector exploits a linear Support Vector Machine model, trained with a few hundred exemplar images by the SVM algorithm working in a 136-dimensional space. Strict statistical analysis shows that this geometric detector depends strongly on the geometry of the mouth opening area, measured by triangulation of the outer lip contour. To this end, two Bayesian detectors were developed and compared with the SVM detector. The first uses the mouth area in the 2D image, while the second refers to the mouth area in a 3D animated face model. The 3D modeling is based on the Candide-3 model and is performed in real time along with the three smile detectors and statistics estimators. The mouth-area/Bayesian detectors exhibit a high correlation with the fp68/SVM detector, in the range [0.8, 1.0], depending mainly on lighting conditions and individual features, with an advantage for the 3D technique, especially in hard lighting conditions.
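A minimal sketch of the fp68/SVM stage follows, assuming scikit-learn and an external landmark extractor (e.g., dlib's ensemble-of-regression-trees shape predictor); the pipeline layout and regularization constant are illustrative assumptions.

# Sketch of the fp68/SVM smile detector: each face is described by 68 (x, y)
# landmark points flattened into a 136-dimensional vector, and a linear SVM
# separates "smile" from "no smile".
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def to_feature(landmarks68):
    """Flatten 68 (x, y) landmarks into the 136-D feature vector."""
    return np.asarray(landmarks68, dtype=float).ravel()

# X: (n_samples, 136) landmark features; y: 1 = smile, 0 = neutral.
detector = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
# detector.fit(X_train, y_train)
# detector.predict(to_feature(landmarks).reshape(1, -1))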