DeepMEF: A Deep Model Ensemble Framework for Video Based Multi-modal Person Identification

Proceedings of the 27th ACM International Conference on Multimedia (2019)

Abstract
The goal of video based multi-modal person identification is to identify a person of interest using multi-modal video features, such as the person's face, body, audio or head features. This task is challenging due to many factors, for example, varying body or face poses, poor face image quality, and low frame resolution. To address these problems, we propose a deep model ensemble framework, namely DeepMEF. Specifically, the proposed framework includes three novel modules: the video feature fusion module, the multi-modal feature fusion module and the model ensemble module. The first and second modules form the basic deep model for the ensemble. The video feature fusion module fuses facial features from different frames into a single feature, and the multi-modal feature fusion module then fuses this face feature with features of other modalities for identification. In this work, we extract a scene feature ourselves and use it as the additional input of the multi-modal module. Finally, the model ensemble module improves the overall performance by combining the predictions of multiple multi-modal learners. The proposed method achieves a competitive result of 89.86% in mAP on the iQIYI-VID-2019 dataset, which earned us third place in the 2019 iQIYI Celebrity Video Identification Challenge.
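The three-stage data flow described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion or ensemble operators, so the choices here (mean pooling over frames, concatenation across modalities, averaging of predictions) and all function names are assumptions made for illustration only, not the authors' method.

```python
from typing import List, Sequence


def fuse_frames(frame_feats: Sequence[Sequence[float]]) -> List[float]:
    """Collapse per-frame face features into one video-level feature.

    Assumption: plain mean pooling stands in for the paper's fusion module.
    """
    n = len(frame_feats)
    dim = len(frame_feats[0])
    return [sum(f[i] for f in frame_feats) / n for i in range(dim)]


def fuse_modalities(face_feat: Sequence[float],
                    scene_feat: Sequence[float]) -> List[float]:
    """Join the face feature with a second modality (here a scene feature).

    Assumption: simple concatenation stands in for the learned fusion.
    """
    return list(face_feat) + list(scene_feat)


def ensemble(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Combine class-probability vectors from several learners.

    Assumption: unweighted averaging stands in for the ensemble module.
    """
    n = len(predictions)
    num_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / n for c in range(num_classes)]


# Toy example: two frames with 2-D face features and a 1-D scene feature.
video_feat = fuse_frames([[1.0, 3.0], [3.0, 5.0]])
joint_feat = fuse_modalities(video_feat, [0.5])
# Two hypothetical learners, each emitting probabilities over two identities.
avg_probs = ensemble([[0.25, 0.75], [0.5, 0.5]])
predicted_id = max(range(len(avg_probs)), key=lambda c: avg_probs[c])
```

In this sketch each stage only reshapes or averages numbers; in the framework itself the first two stages are part of a trained deep model, and the ensemble combines several such models.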
Keywords
ensemble learning, multi-modal learning, neural networks, video based person identification