Cross-media Similarity Metric Learning with Unified Deep Networks.

Multimedia Tools and Applications, no. 23 (2017): 25109–25127


Abstract

As a prominent research topic in the multimedia area, cross-media retrieval aims to capture the complex correlations among multiple media types. Learning better shared representations and distance metrics for multimedia data is important for boosting cross-media retrieval. Motivated by the strong ability of deep neural networks in feature...

Introduction
  • The rapid growth of multimedia data, such as text, image, video and audio, has generated a huge demand for cross-media retrieval.
  • Compared with single-media retrieval, cross-media retrieval can return results of multiple media types, which makes it more flexible and convenient for users.
  • To fully understand multimedia data and meet users' demand to search for whatever they want across different media types, it is increasingly important to model the similarity between different media types for cross-media retrieval.
  • Existing methods can be mainly divided into two categories, traditional methods and DNN-based methods, which are reviewed in the Related work section
Highlights
  • In recent years, the rapid growth of multimedia data, such as text, image, video and audio, has generated a huge demand for cross-media retrieval
  • To address the above problems, we propose the Unified Network for Cross-media Similarity Metric (UNCSM) to associate cross-media shared representation learning with distance metric learning in a unified framework (a minimal architecture sketch is given after this list)
  • We propose a unified framework to associate cross-media shared representation learning with distance metric learning, which leads to better accuracy in cross-media retrieval
  • Our UNCSM approach associates the representation learning with distance metric for further improving the cross-media retrieval accuracy
  • A cross-media similarity learning model, UNCSM, has been proposed. This UNCSM model associates the cross-media shared representation learning with distance metric in a unified framework
  • The experimental results show that our UNCSM approach outperforms 8 state-of-the-art methods on 4 widely-used datasets
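The following sketch illustrates the kind of unified architecture described in these highlights: a two-pathway network that maps image and text features into a shared space, plus a small metric network that scores a cross-media pair instead of relying on a fixed distance. It is a minimal PyTorch sketch; the class name UNCSMSketch, layer sizes, and activation choices are illustrative assumptions, not the authors' actual configuration.

```python
import torch
import torch.nn as nn

class UNCSMSketch(nn.Module):
    """Illustrative two-pathway network with a metric network on top.
    Dimensions and layer counts are assumptions, not the paper's exact setup."""

    def __init__(self, img_dim=4096, txt_dim=3000, shared_dim=128):
        super().__init__()
        # Image pathway: projects image features into the shared space.
        self.img_path = nn.Sequential(
            nn.Linear(img_dim, 1024), nn.ReLU(),
            nn.Linear(1024, shared_dim))
        # Text pathway: projects text features into the same shared space.
        self.txt_path = nn.Sequential(
            nn.Linear(txt_dim, 1024), nn.ReLU(),
            nn.Linear(1024, shared_dim))
        # Metric network: scores a pair of shared representations, allowing a
        # more complex similarity function than a fixed cosine distance.
        self.metric = nn.Sequential(
            nn.Linear(2 * shared_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, img_feat, txt_feat):
        img_shared = self.img_path(img_feat)
        txt_shared = self.txt_path(txt_feat)
        score = self.metric(torch.cat([img_shared, txt_shared], dim=1))
        return img_shared, txt_shared, score
```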
Methods
  • The authors compare the UNCSM approach with 8 existing cross-media methods. CCA, CFA and KCCA are the classical baselines, while the others are all based on DNN.
  • The source codes of Bimodal AE, Multimodal DBN and Corr-AE are available from [5], and the source codes of DCCA and DCCAE are from [27]
  • The authors briefly introduce the 8 compared methods as follows (a small CCA baseline sketch is given after this list):
  • CCA projects the data of two media types into a common subspace by maximizing the pairwise correlations.
  • CFA learns a common space for different modalities by minimizing the Frobenius norm between pairwise data in the transformed domain.
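As a rough illustration of the classical CCA baseline, the sketch below projects paired image and text features into a common subspace with scikit-learn's CCA and ranks images for a text query by cosine similarity. The feature dimensions and random data are placeholders for the example, not the datasets used in the paper.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Paired training features (row i of each matrix describes the same document).
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(500, 128))   # placeholder visual features
txt_feats = rng.normal(size=(500, 100))   # placeholder textual features

# Learn linear projections that maximize correlation between the two views.
cca = CCA(n_components=10)
cca.fit(img_feats, txt_feats)

# Project both media types into the common space, then rank images for a
# text query by cosine similarity (text-to-image retrieval).
img_proj, txt_proj = cca.transform(img_feats, txt_feats)
query = txt_proj[0]
sims = (img_proj @ query) / (
    np.linalg.norm(img_proj, axis=1) * np.linalg.norm(query) + 1e-12)
ranking = np.argsort(-sims)
print("Top-5 retrieved image indices:", ranking[:5])
```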
Results
  • Compared to the existing methods, the UNCSM approach achieves significant improvement for three reasons: pretraining the two-pathway network with a contrastive loss to model the pairwise similar and dissimilar constraints, fine-tuning the network with a double triplet similarity loss to preserve relative similarity and learn an optimized semantic shared representation, and using the metric network to embrace more complex similarity functions when calculating the cross-media similarity (a loss sketch is given after this list).
  • The authors' UNCSM approach associates the representation learning with distance metric for further improving the cross-media retrieval accuracy
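To make the two training stages concrete, here is a minimal PyTorch sketch of a contrastive loss over similar/dissimilar pairs and a triplet-style loss applied in both retrieval directions. The margin values and function names are assumptions that only loosely follow the description above, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, same_label, margin=1.0):
    """Pretraining-stage constraint (sketch): pulls representations of
    matching cross-media pairs together and pushes non-matching pairs at
    least `margin` apart. same_label is 1.0 for a matching pair, 0.0 otherwise."""
    dist = F.pairwise_distance(x, y)
    pos = same_label * dist.pow(2)
    neg = (1.0 - same_label) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

def double_triplet_loss(img, txt_pos, txt_neg, txt, img_pos, img_neg, margin=0.5):
    """Fine-tuning-stage constraint (sketch): preserves relative similarity in
    both directions, i.e. an image should be closer to a matching text than to
    a non-matching one, and vice versa."""
    loss_i2t = F.triplet_margin_loss(img, txt_pos, txt_neg, margin=margin)
    loss_t2i = F.triplet_margin_loss(txt, img_pos, img_neg, margin=margin)
    return loss_i2t + loss_t2i
```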
Conclusion
  • A cross-media similarity learning model, UNCSM, has been proposed.
  • This UNCSM model associates the cross-media shared representation learning with distance metric in a unified framework.
  • Compared to the existing methods, the UNCSM approach further improves the cross-media retrieval accuracy by preserving the relative similarity as well as embracing more complex similarity functions at the same time.
  • As for future work, the authors plan to integrate semi-supervised information into the unified metric framework to further boost the accuracy of cross-media retrieval
Tables
  • Table 1: The MAP scores on the Wikipedia dataset
  • Table 2: The MAP scores on the NUS-WIDE dataset
  • Table 3: The MAP scores on the NUS-WIDE-10k dataset
  • Table 4: The MAP scores on the Pascal Sentences dataset
  • Table 5: MAP scores of the cosine distance metric and our metric network on the shared representation obtained from the two-pathway network
  • Table 6: MAP scores of our UNCSM approach with and without the pretraining stage (the MAP computation is sketched after this list)
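All tables report MAP (mean average precision): for each query, items of the other media type are ranked by similarity, average precision is computed over the relevant items (those sharing the query's semantic category), and the result is averaged over all queries. A minimal NumPy sketch of this computation, with illustrative variable names, follows.

```python
import numpy as np

def mean_average_precision(similarity, query_labels, target_labels):
    """similarity: (n_queries, n_targets) score matrix; labels give semantic
    categories. A target is relevant if it shares the query's category."""
    ap_list = []
    for i in range(similarity.shape[0]):
        order = np.argsort(-similarity[i])                  # rank targets by score
        relevant = (target_labels[order] == query_labels[i]).astype(float)
        if relevant.sum() == 0:
            continue                                        # no relevant targets
        precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        ap_list.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(ap_list))
```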
Related work
  • To measure the cross-media similarity between data of different media types, most existing methods attempt to generate a shared representation for each media type in one common space; these methods can be divided into two categories: traditional methods and DNN-based methods.

    The first category mainly uses linear functions to project multimedia data into one common space. For example, a straightforward method is to adopt Canonical Correlation Analysis (CCA) [10], a traditional statistical correlation analysis method, to project the features of different media types into a lower-dimensional common space. Given the training pairs, CCA finds projection matrices that map the two media types into a space of the same dimension such that the projected pairs have maximum correlation, yielding the shared representation. A simple similarity measurement can then be adopted to perform cross-media retrieval. Some later works extend CCA by incorporating semantic categories, such as [19]. Besides, Cross-modal Factor Analysis (CFA) [11] minimizes the Frobenius norm between pairwise data in the transformed domain to learn a common space for different modalities. Joint graph regularized heterogeneous metric learning (JGRHML) [31], proposed by Zhai et al., constructs a joint graph regularization term using the data in the learned metric space, and this work is further improved as joint representation learning (JRL) [32] by modeling the correlations and semantic information in a unified framework.
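For concreteness, the CCA and CFA objectives summarized in this paragraph can be stated compactly as below, writing X and Y for the paired (centered) feature matrices of the two media types; this notation is assumed for illustration rather than taken from the cited papers.

```latex
% CCA: find projection directions maximizing the correlation of the two views
\max_{w_x, w_y}\;
  \frac{w_x^{\top} X^{\top} Y \, w_y}
       {\sqrt{w_x^{\top} X^{\top} X \, w_x}\;\sqrt{w_y^{\top} Y^{\top} Y \, w_y}}

% CFA: find transforms minimizing the Frobenius distance between paired data
\min_{A, B}\; \lVert X A - Y B \rVert_F^2
  \quad \text{subject to}\quad A^{\top} A = I,\; B^{\top} B = I
```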
Funding
  • This work was supported by National Natural Science Foundation of China under Grants 61371128 and 61532005, and National Hi-Tech Research and Development Program of China (863 Program) under Grant 2014AA015102
References
  • Andrew G, Arora R, Bilmes JA, Livescu K (2013) Deep canonical correlation analysis. In: International Conference on Machine Learning (ICML), pp 1247–1255
  • Bosch A, Zisserman A, Munoz X (2007) Image classification using random forests and ferns. In: International Conference on Computer Vision (ICCV), pp 1–8
  • Chua T, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) Nus-wide: a real-world web image database from national university of Singapore. In: ACM International Conference on Image and Video Retrieval (ACM-CIVR)
  • Farhadi A, Hejrati SMM, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth DA (2010) Every picture tells a story: Generating sentences from images. In: European Conference on Computer Vision (ECCV), pp 15–29
  • Feng F, Wang X, Li R (2014) Cross-modal retrieval with correspondence autoencoder. In: ACM international conference on Multimedia (ACM-MM), pp 7–16
  • Han X, Leung T, Jia Y, Sukthankar R, Berg AC (2015) Matchnet: Unifying feature and metric learning for patch-based matching. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp 3279–3286
  • Hardoon DR, Szedmak S (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Comput 16(12):2639–2664
  • Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
  • Hoffer E, Ailon N (2015) Deep metric learning using triplet network. In: Similarity-Based Pattern Recognition (SIMBAD), pp 84–92
  • Hotelling H (1936) Relations between two sets of variates. Biometrika 321–377
  • Li D, Dimitrova N, Li M, Sethi IK (2003) Multimedia content processing through cross-modal association. In: ACM international conference on Multimedia (ACM-MM), pp 604–611
  • Manjunath BS, Ohm JR, Vinod VV, Yamada A (2001) Color and texture descriptors. In: IEEE
  • Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: International
  • Nie L, Wang M, Zha Z, Chua T (2012) Oracle in image search: A content-based approach to performance prediction. ACM Trans Inf Syst 30(2):13:1–13:23
  • Nie L, Yan S, Wang M, Hong R, Chua T (2012) Harvesting visual concepts for image search with complex queries. In: ACM international conference on Multimedia (ACM-MM), pp 59–68
  • Oliva A, Torralba A (2001) Modeling the shape of the scene: A holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
  • Peng Y, Huang X, Qi J (2016) Cross-media shared representation by hierarchical learning with multiple deep networks. In: International Joint Conference on Artificial Intelligence (IJCAI), pp 3846–3853
  • Peng Y, Ngo CW (2006) Clip-based similarity measure for query-dependent clip retrieval and video summarization. IEEE Trans Circuits Syst Video Technol (TCSVT) 16(5):612–627
  • Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, Vasconcelos N (2010) A
  • Salakhutdinov R, Hinton GE (2012) An efficient learning procedure for deep boltzmann machines. Neural Comput 24(8):1967–2006
  • Srivastava N, Salakhutdinov R (2012) Learning representations for multimodal data with deep belief nets. In: International Conference on Machine Learning (ICML)
  • Srivastava N, Salakhutdinov R (2012) Multimodal learning with deep boltzmann machines. In: Advances in Neural Information Processing Systems (NIPS), pp 2222–2230
  • Typke R, Wiering F, Veltkamp RC (2005) A survey of music information retrieval systems. In: The International Society for Music Information Retrieval (ISMIR), pp 153–160
  • Vincent P, Larochelle H, Bengio Y, Manzagol P (2008) Extracting and composing robust features with denoising autoencoders. In: International Conference on Machine Learning (ICML), pp 1096–1103
  • Wang J, Song Y, Leung T, Rosenberg C, Wang J, Philbin J, Chen B, Wu Y (2014) Learning fine-grained image similarity with deep ranking. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp 1386–1393
  • Wang W, Arora R, Livescu K, Bilmes JA (2015) On deep multi-view representation learning. In: International Conference on Machine Learning (ICML), pp 1083–1092
  • Welling M, Rosen-Zvi M, Hinton GE (2004) Exponential family harmoniums with an application to information retrieval. In: Advances in Neural Information Processing Systems (NIPS), pp 1481–1488
  • Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3441–3450
  • Yu J, Tian Q (2008) Semantic subspace projection and its applications in image retrieval. IEEE Trans Circuits Syst Video Technol (TCSVT) 18(4):544–548
  • Zhai X, Peng Y, Xiao J (2013) Heterogeneous metric learning with joint graph regularization for cross-media retrieval. In: AAAI Conference on Artificial Intelligence (AAAI), pp 1198–1204
  • Zhai X, Peng Y, Xiao J (2014) Learning cross-media joint representation with sparse and semisupervised regularization. IEEE Trans Circuits Syst Video Technol (TCSVT) 24:965–978
  • Jinwei Qi received the B.S. degree in computer science and technology from Peking University, in Jul. 2016. He is currently pursuing the M.S. degree with the Institute of Computer Science and Technology (ICST), Peking University. His current research interests include cross-media retrieval and deep learning.
  • Xin Huang received the B.S. degree in computer science and technology from Peking University, in Jul. 2014. He is currently pursuing the Ph.D. degree in the Institute of Computer Science and Technology (ICST), Peking University. His research interests include cross-media retrieval and machine learning.
  • Besides, he won the first prize of the Beijing Science and Technology Award for Technological Invention in 2016 (ranked first). He has applied for 30 patents and obtained 14 of them. His current research interests mainly include cross-modal analysis and reasoning, image and video understanding and retrieval, and computer vision.