A Multi-Task Neural Approach for Emotion Attribution, Classification, and Summarization
IEEE Transactions on Multimedia, pp. 148-159, 2020.
Abstract:
Emotional content is a crucial ingredient in user-generated videos. However, the sparsely expressed emotions in user-generated videos make emotion analysis difficult. In this paper, we propose a new neural approach, the Bi-stream Emotion Attribution-Classification Network (BEAC-Net), to solve three related emotion analysis tasks: emotion attribution, classification, and summarization.
Introduction
- The explosive growth of user-generated video has created great demand for computational understanding of visual data and attracted significant research attention in the multimedia community.
- The ingredients that form emotions include interaction among cognitive processes, temporal succession of appraisals, and coping behaviors [21], [25]
- This may have inspired computational work like DeepSentiBank [26] and zero-shot emotion recognition [14], which broaden the emotion categories that can be recognized
Highlights
- The explosive growth of user-generated video has created great demand for computational understanding of visual data and attracted significant research attention in the multimedia community
- Extending our earlier work [17], we propose a multi-task neural architecture, the Bi-stream Emotion Attribution-Classification Network (BEAC-Net), which tackles both emotion attribution and classification at the same time, thereby allowing related tasks to reinforce each other
- We propose a novel two-stream neural architecture that employs the emotion segment selected by the attribution network in combination with the original video
- We suggest that the ability to locate emotional content is crucial for accurate emotion understanding
- We present a multi-task neural network with a novel bi-stream architecture, called Bi-stream Emotion Attribution-Classification Network (BEAC-Net)
- The attribution network locates the emotional content, which is processed in parallel with the original video within the bi-stream architecture; a minimal code sketch of this design follows these highlights
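To make the bi-stream design described above concrete, here is a minimal PyTorch sketch of one way an attribution stream and a whole-video context stream could be combined for emotion classification. It is a sketch under assumptions, not the paper's implementation: the layer sizes, the use of pre-extracted per-clip features, and the soft attribution weighting are illustrative (per the highlights, BEAC-Net selects an explicit emotion segment rather than a soft weighting).

```python
# Illustrative bi-stream attribution/classification sketch (assumed design, not BEAC-Net's exact layers).
import torch
import torch.nn as nn


class AttributionNet(nn.Module):
    """Scores each clip so that an emotion-carrying segment can be emphasized."""

    def __init__(self, feat_dim=4096, hidden=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, clips):                      # clips: (B, T, feat_dim)
        scores = self.scorer(clips).squeeze(-1)    # (B, T) per-clip relevance
        return scores.softmax(dim=-1)              # normalized attribution weights


class BiStreamClassifier(nn.Module):
    """Fuses the attributed segment with a summary of the full video."""

    def __init__(self, feat_dim=4096, n_classes=6, hidden=512):
        super().__init__()
        self.attribution = AttributionNet(feat_dim)
        self.segment_stream = nn.Linear(feat_dim, hidden)   # emotion-segment stream
        self.context_stream = nn.Linear(feat_dim, hidden)   # whole-video stream
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, clips):                               # clips: (B, T, feat_dim)
        weights = self.attribution(clips)                   # (B, T)
        segment = (weights.unsqueeze(-1) * clips).sum(1)    # weighted segment feature
        context = clips.mean(dim=1)                         # global context feature
        fused = torch.cat(
            [self.segment_stream(segment), self.context_stream(context)], dim=-1
        )
        return self.head(torch.relu(fused))                 # (B, n_classes) emotion logits


logits = BiStreamClassifier()(torch.randn(2, 100, 4096))    # 2 videos, 100 clips each
```

In a multi-task setup of the kind the highlights describe, an attribution loss on the per-clip weights would be trained jointly with the classification loss, which is how the two tasks can reinforce each other.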
Methods
- The authors conduct experiments on two video emotion datasets based on Ekman’s six basic emotions.
- The Emotion6 Video Dataset.
- The Emotion6 dataset [64] contains 1980 images that are labeled with a distribution over 6 basic emotions and a neutral category.
- The images do not contain facial expressions or text directly associated with emotions.
- The authors consider the emotion category with the highest probability as the dominant emotion; the short snippet after this list illustrates the rule
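As a concrete illustration of the dominant-emotion rule in the last bullet, the snippet below picks the highest-probability category from an Emotion6-style label distribution. The category names and probability values are made-up examples for illustration, not entries from the dataset.

```python
# Hypothetical example: select the dominant emotion from a distribution over the
# six basic emotions plus a neutral category (values are illustrative only).
import numpy as np

categories = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]
label_distribution = np.array([0.05, 0.02, 0.10, 0.55, 0.08, 0.15, 0.05])

dominant = categories[int(np.argmax(label_distribution))]
print(dominant)  # -> joy
```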
Results
- Fewer than 1% of the videos contain fewer than 100 frames, so padding is rarely necessary; a sketch of the sampling-and-padding step follows below.
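For context on that padding step, the sketch below shows one plausible preprocessing routine: uniform subsampling for long videos and last-frame padding for the rare videos with fewer than 100 frames. The 100-frame target comes from the result above, but the specific sampling and padding scheme is an assumption, not necessarily the paper's exact preprocessing.

```python
# Sketch of fixed-length frame selection (assumed preprocessing, not the paper's code).
import numpy as np

def select_frames(frames: np.ndarray, target_len: int = 100) -> np.ndarray:
    """frames: (T, H, W, C) array of decoded frames; returns exactly target_len frames."""
    t = frames.shape[0]
    if t >= target_len:
        idx = np.linspace(0, t - 1, target_len).astype(int)  # evenly spaced frame indices
        return frames[idx]
    pad = np.repeat(frames[-1:], target_len - t, axis=0)     # repeat last frame as padding
    return np.concatenate([frames, pad], axis=0)
```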
Conclusion
- Computational understanding of emotions in user-generated video content is a challenging task due to the sparsity of emotional content, the presence of multiple emotions, and the variable quality of user-generated videos.
- The authors suggest that the ability to locate emotional content is crucial for accurate emotion understanding.
- Toward this end, the authors present a multi-task neural network with a novel bi-stream architecture, called Bi-stream Emotion Attribution-Classification Network (BEAC-Net).
- An ablation study shows the bi-stream architecture provides significant benefits for emotion recognition and the proposed emotion attribution network outperforms traditional temporal attention.
Tables
- Table 1: Emotion recognition results
- Table 2: Transfer learning: out-of-domain BEAC-Net fine-tuned on 20%
- Table 3: Classification accuracy with different proportions of frames
Funding
- This work was supported in part by NSFC Projects (61572134, 61572138, U1611461), Shanghai Sailing Program (17YF1427500), Fudan University-CIOMP Joint Fund (FC2017-006), STCSM Project (16JC1420400), Shanghai Municipal Science and Technology Major Project (2017SHZDZX01, 2018SHZDZX01) and ZJLab
Reference
- A. R. Damasio, Descartes' Error: Emotion, Reason and the Human Brain. New York: Avon Books, 1994.
- G. L. Clore and J. E. Palmer, “Affective guidance of intelligent agents: How emotion controls cognition,” Cognitive Systems Research, no. 1, pp. 21–30, 2009.
- R. E. Guadagno, D. M. Rempala, S. Murphy, and B. M. Okdie, “What makes a video go viral? an analysis of emotional contagion and internet memes,” Computers in Human Behavior, vol. 29, no. 6, pp. 2312–2319, 2013.
- K. Yadati, H. Katti, and M. Kankanhalli, “CAVVA: Computational affective video-in-video advertising,” IEEE Transactions on Multimedia, vol. 16, no. 1, 2014.
- N. Ikizler-Cinbis and S. Sclaroff, “Web-based classifiers for human action recognition,” IEEE Transactions on Multimedia, vol. 14, pp. 1031– 1045, Aug 2012.
- W. Xu, Z. Miao, X. P. Zhang, and Y. Tian, “A hierarchical spatiotemporal model for human activity recognition,” IEEE Transactions on Multimedia, vol. 19, pp. 1494–1509, July 2017.
- K. Somandepalli, N. Kumar, T. Guha, and S. S. Narayanan, “Unsupervised discovery of character dictionaries in animation movies,” IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1–1, 2017.
- H. Joho, J. M. Jose, R. Valenti, and N. Sebe, “Exploiting facial expressions for affective video summarisation,” in Proc. ACM conference on Image and Video Retrieval, 2009.
- S. Zhao, H. Yao, X. Sun, P. Xu, X. Liu, and R. Ji, “Video indexing and recommendation based on affective analysis of viewers,” in Proceedings of the 19th ACM international conference on Multimedia, 2011.
- Q. Zhen, D. Huang, Y. Wang, and L. Chen, “Muscular movement model-based automatic 3D/4D facial expression recognition,” IEEE Transactions on Multimedia, vol. 18, pp. 1438–1450, July 2016.
- M. Liu, S. Shan, R. Wang, and X. Chen, “Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2014.
- X. Alameda-Pineda, E. Ricci, Y. Yan, and N. Sebe, “Recognizing emotions from abstract paintings using non-linear matrix completion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5240–5248, 2016.
- A. Yazdani, K. Kappeler, and T. Ebrahimi, “Affective content analysis of music video clips,” in Proc. 1st ACM workshop Music information retrieval with user-centered and multimodal strategies, 2011.
- B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal, “Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization,” IEEE Transactions on Affective Computing, 2017.
- Y. Jiang, B. Xu, and X. Xue, “Predicting emotions in user-generated videos,” in The AAAI Conference on Artificial Intelligence, 2014.
- B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal, “Video emotion recognition with transferred deep feature encodings,” in Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), 2016.
- J. Gao, Y. Fu, Y.-G. Jiang, and X. Xue, “Frame-transformer emotion classification network,” in Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, 2017.
- P. Ekman, “Universals and cultural differences in facial expressions of emotion,” Nebraska Symposium on Motivation, vol. 19, pp. 207–284, 1972.
- P. Ekman, “Basic emotions,” in Handbook of Cognition and Emotion, 1999.
- R. Plutchik and H. Kellerman, Emotion: Theory, research and experience. Vol. 1, Theories of emotion. Academic Press, 1980.
- J. J. Gross, “Emotion regulation: Affective, cognitive, and social consequences,” Psychophysiology, vol. 39, no. 3, pp. 281–291, 2002.
- L. F. Barrett, “Are emotions natural kinds?,” Perspectives on Psychological Science, vol. 1, no. 1, pp. 28–58, 2006.
- K. A. Lindquist, E. H. Siegel, K. S. Quigley, and L. F. Barrett, “The hundred-year emotion war: Are emotions natural kinds or psychological constructions? comment on Lench, Flores, and Bench (2011),” Psychological Bulletin, no. 1, pp. 255–263, 2013.
- L. Nummenmaa, E. Glerean, R. Hari, and J. K. Hietanen, “Bodily maps of emotions,” Proceedings of the National Academy of Sciences of the United States of America, vol. 111, no. 2, pp. 646–651, 2013.
- B. Li, “A dynamic and dual-process theory of humor,” in The 3rd Annual Conference on Advances in Cognitive Systems, pp. 57–74, 2015.
- T. Chen, D. Borth, T. Darrell, and S.-F. Chang, “DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks,” CoRR, 2014.
- J. A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161–1178, 1980.
- J. R. Fontaine, K. R. Scherer, E. B. Roesch, and P. C. Ellsworth, “The world of emotions is not two-dimensional,” Psychological Science, vol. 18, no. 12, pp. 1050–1057, 2007.
- H. Lövheim, “A new three-dimensional model for emotions and monoamine neurotransmitters,” Medical Hypotheses, vol. 78, no. 2, pp. 341–348, 2012.
- S. Chen, Q. Jin, J. Zhao, and S. Wang, “Multimodal multi-task learning for dimensional and continuous emotion recognition,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pp. 19– 26, 2017.
- J. Huang, Y. Li, J. Tao, Z. Lian, Z. Wen, M. Yang, and J. Yi, “Continuous multimodal emotion prediction based on long short term memory recurrent neural network,” in AVEC’17 Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pp. 11–18, 2017.
- Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “Liris-accede: A video database for affective content analysis,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 43–55, 2015.
- S. Benini, L. Canini, and R. Leonardi, “A connotative space for supporting movie affective recommendation,” IEEE Transactions on Multimedia, vol. 13, no. 6, pp. 1356–1370, 2011.
- J. Machajdik and A. Hanbury, “Affective image classification using features inspired by psychology and art theory,” in Proceedings of the 18th ACM international conference on Multimedia, pp. 83–92, 2010.
- X. Lu, P. Suryanarayan, R. B. Adams, J. Li, M. G. Newman, and J. Z. Wang, “On shape and the computability of emotions,” in Proceedings of the 20th ACM international conference on Multimedia, 2012.
- B. Jou, S. Bhattacharya, and S.-F. Chang, “Predicting viewer perceived emotions in animated GIFs,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014.
- W. Hu, X. Ding, B. Li, J. Wang, Y. Gao, F. Wang, and S. Maybank, “Multi-perspective cost-sensitive context-aware multi-instance sparse coding and its application to sensitive video recognition,” IEEE Transactions on Multimedia, vol. 18, no. 1, 2016.
- Y. Song, L.-P. Morency, and R. Davis, “Learning a sparse codebook of facial and body microexpressions for emotion recognition,” in Proceedings of the 15th ACM International conference on multimodal interaction, 2013.
- B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based speech emotion recognition,” in Proceedings of the 2003 International Conference on Multimedia and Expo - Volume 2, ICME ’03, pp. 401–404, IEEE Computer Society, 2003.
- Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning salient features for speech emotion recognition using convolutional neural networks,” IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014.
- S. Zhang, S. Zhang, T. Huang, and W. Gao, “Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching,” IEEE Transactions on Multimedia, vol. 20, no. 6, pp. 1576–1590, 2018.
- H.-L. Wang and L.-F. Cheong, “Affective understanding in film,” IEEE Transactions on Circuits and Systems for Video Technology, 2006.
- Z. Zeng, J. Tu, M. Liu, T. S. Huang, B. Pianfetti, D. Roth, and S. Levinson, “Audio-visual affect recognition,” IEEE Transactions on multimedia, vol. 9, no. 2, pp. 424–428, 2007.
- E. Acar, F. Hopfgartner, and S. Albayrak, “A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material,” Multimedia Tools and Applications, vol. 76, pp. 1–29, 2016.
- L. Pang, S. Zhu, and C.-W. Ngo, “Deep multimodal learning for affective analysis and retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 11, 2015.
- S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, C. Gulcehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari, et al., “Combining modality specific deep neural networks for emotion recognition in video,” in Proceedings of the 15th ACM International conference on multimodal interaction, pp. 543–550, ACM, 2013.
- Q. You, J. Luo, H. Jin, and J. Yang, “Robust image sentiment analysis using progressively trained and domain transferred deep networks,” in AAAI, 2015.
- D. Borth, R. Ji, T. Chen, T. M. Breuel, and S.-F. Chang., “Large-scale visual sentiment ontology and detectors using adjective noun pairs,” in Proceedings of the 21st ACM international conference on Multimedia, 2013.
- S. Wang and Q. Ji, “Video affective content analysis: a survey of state of the art methods,” IEEE Transactions on Affective Computing, vol. PP, no. 99, pp. 1–1, 2015.
- S. Arifin and P. Y. K. Cheung, “Affective level video segmentation by utilizing the pleasure-arousal-dominance information,” IEEE Transactions on Multimedia, vol. 10, no. 7, 2008.
- B. T. Truong and S. Venkatesh, “Video abstraction: A systematic review and classification,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 3, no. 1, pp. 79–82, 2007.
- Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, “A user attention model for video summarization,” in Proceedings of the 10th ACM international conference on Multimedia, 2002.
- J.-L. Lai and Y. Yi, “Key frame extraction based on visual attention model,” Journal of Visual Communication and Image Representation, vol. 23, no. 1, pp. 114–125, 2012.
- M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua, “Event driven web video summarization by tag localization and key-shot identification,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 975–985, 2012.
- F. Wang and C. W. Ngo, “Summarizing rushes videos by motion, object, and event understanding,” IEEE Transactions on Multimedia, vol. 14, pp. 76–87, Feb 2012.
- X. Wang, Y. Jiang, Z. Chai, Z. Gu, X. Du, and D. Wang, “Real-time summarization of user-generated videos based on semantic recognition,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014.
- M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems 28, pp. 2017–2025, 2015.
- K. K. Singh and Y. J. Lee, “End-to-end localization and ranking for relative attributes,” in European Conference on Computer Vision, pp. 753–769, Springer, 2016.
- K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, vol. 37, pp. 2048–2057, 2015.
- C.-H. Lin and S. Lucey, “Inverse compositional spatial transformer networks,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2017.
- K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, pp. 568–576, 2014.
- J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 4724–4733, 2017.
- Z. Li, G. M. Schuster, and A. K. Katsaggelos, “Minmax optimal video summarization,” IEEE Transactions on Circuits and Systems for Video Technology, 2005.
- K.-C. Peng, T. Chen, A. Sadovnik, and A. Gallagher, “A mixed bag of emotions: Model, predict, and transfer emotion distributions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 860–868, 2015.
- D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
- P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2018.
- L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515, 2015.
- F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970, 2015.
- G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” tech. rep., 2008.
- R. Cardona-Rivera and B. Li, “Plotshot: Generating discourse-constrained stories around photos,” in Proceedings of the 12th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2016.
Authors
- Yanwei Fu received the Ph.D. degree from Queen Mary University of London in 2014, and the M.Eng. degree from the Department of Computer Science and Technology, Nanjing University, China, in 2011. He held a post-doctoral position at Disney Research, Pittsburgh, PA, USA, from 2015 to 2016. He is currently a tenure-track Professor with Fudan University. His research interests are image and video understanding, and life-long learning.
- Boyang Li is a Senior Research Scientist at Baidu Research in Sunnyvale, California. Prior to Baidu, he directed the Narrative Intelligence research group at Disney Research Pittsburgh. His research interests lie broadly in machine learning and multimodal reasoning, and particularly in computational understanding and generation of content with complex semantic structures, such as narratives, human emotions, and the interaction between visual and textual information. He received his Ph.D. in Computer Science from Georgia Institute of Technology in 2014, and his B.Eng. from Nanyang Technological University, Singapore, in 2008. He has authored and co-authored more than 40 peer-reviewed papers in international journals and conferences.
- Guoyun Tu received his Bachelor’s degree in physics from Fudan University in 2018 and is now a graduate student in the EIT Digital Master Program (Eindhoven University of Technology & KTH Royal Institute of Technology). His research interests include machine learning theory and its applications.
- Yu-Gang Jiang is a Professor of Computer Science at Fudan University and Director of Fudan-Jilian Joint Research Center on Intelligent Video Technology, Shanghai, China. He is interested in all aspects of extracting high-level information from big video data, such as video event recognition, object/scene recognition and large-scale visual search. His work has led to many awards, including the inaugural ACM China Rising Star Award, the 2015 ACM SIGMM Rising Star Award, and the research award for outstanding young researchers from NSF China. He is currently an associate editor of ACM TOMM, Machine Vision and Applications (MVA) and Neurocomputing. He holds a PhD in Computer Science from City University of Hong Kong and spent three years working at Columbia University before joining Fudan in 2011.
- Xiangyang Xue received the BS, MS, and PhD degrees in communication engineering from Xidian University, Xi’an, China, in 1989, 1992, and 1995, respectively. He is currently a professor of computer science with Fudan University, Shanghai, China. His research interests include computer vision, multimedia information processing and machine learning.