COBRA: Contrastive Bi-Modal Representation Algorithm

Udandarao Vishaal
Srivatsav Deepak
Vyalla Suryatej Reddy
Other Links: arxiv.org

Abstract:

There is a wide range of applications that involve multi-modal data, such as cross-modal retrieval, visual question-answering, and image captioning. Such applications are primarily dependent on aligned distributions of the different constituent modalities. Existing approaches generate latent embeddings for each modality in a joint fashion...

Introduction
  • Systems built on multi-modal data have been shown to perform better than systems that solely use uni-modal data [7, 49].
  • Keywords: latent representations, contrastive learning, bi-modal data
  • Motivation: The authors posit that preserving the relationship between representations of samples belonging to different classes, in a modality invariant fashion, can improve the quality of joint cross-modal embedding spaces.
Highlights
  • Systems built on multi-modal data have been shown to perform better than systems that solely use uni-modal data [7, 49]
  • Any similarity metric between the representations across modalities is intractable to compute [37]. Reducing this distributional shift boils down to two challenges: (1) projecting the representations of data belonging to different modalities onto a common manifold, and (2) retaining their semantic relationship with other samples from the same class as well as from different classes
  • Over the last few years, literature [18, 29, 36] has modeled representations in a joint embedding space, but existing methods perform less satisfactorily, as a significant semantic gap still exists among the learnt representations from different modalities
  • We believe this is because current cross-modal representation systems regularize the distance between pairs of representations of data samples that belong to the same class, but not between pairs of representations belonging to different classes
  • 4.1.3 Results We report the highest Mean Average Precision (mAP) for Text to Image (TTI) and Image to Text (ITT) retrieval on all four datasets
  • From the t-SNE [28] plot for Wikipedia given in Figure 3a, we observe that COBRA is able to effectively form joint embeddings for different classes across modalities, resulting in superior performances across the aforementioned datasets
Results
  • The authors propose a novel joint cross-modal embedding framework called COBRA (COntrastive Bi-modal Representation Algorithm) which represents the data across different modalities in a common manifold.
  • Gautam et al. [13] developed a novel decision diffusion technique on the CrisisMMD dataset [2, 34] to classify disaster-related data into informative and non-informative categories using image and text uni-modal models.
  • 3.2.3 Supervised Loss: since the authors model an orthogonal latent space for the joint embeddings, they utilize the one-hot labels of the data samples to encourage samples belonging to the same class but to different modalities to be grouped together in the same subspace.
  • Let y_i^m be the one-hot encoded label of the i-th sample of the m-th modality and z_i^m be its projected representation; the authors define the supervised loss over these quantities (Eq. 3 in the paper; a hedged sketch of this loss and the contrastive pair construction appears after this list).
  • Given projected representations z_i^m and z_j^n, a positive pair is defined as the representations of two data samples belonging to the same modality and the same class.
  • A negative pair is defined as the representations of two data samples belonging to the same or different modalities but to different classes.
  • The Wikipedia dataset [46] contains 2866 text-image pairs, divided into 10 semantic classes, such as warfare, art & architecture and media.
  • The PKU-Xmedia dataset [39, 69] contains 5000 text-image pairs, divided into 20 semantic classes.
  • The MS-COCO dataset [26] contains 82079 text-image pairs, divided into 80 semantic classes.
  • The NUS-Wide 10k dataset [9] contains 10000 text-image pairs, divided into 10 semantic classes.
  • From the t-SNE [28] plot for Wikipedia given in Figure 3a, the authors observe that COBRA is able to effectively form joint embeddings for different classes across modalities, resulting in superior performances across the aforementioned datasets.
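The supervised and contrastive terms described above can be sketched in a few lines. This is only an illustrative reading of the summary, not the authors' released code: the exact form of Eq. 3, the InfoNCE-style formulation, and the temperature value are assumptions, and z_text/z_image stand for the outputs of whatever projection networks COBRA uses.

```python
import torch
import torch.nn.functional as F

def supervised_loss(z, y_onehot):
    # Hedged guess at Eq. 3: a cross-entropy term that pushes each projected
    # representation z (batch x C, C = number of classes) toward its one-hot
    # label, so that classes occupy near-orthogonal subspaces.
    return F.cross_entropy(z, y_onehot.argmax(dim=1))

def contrastive_loss(z_text, z_image, labels, temperature=0.1):
    # labels: 1-D tensor of class indices shared by the paired text/image samples.
    # Positive pair: two samples from the same modality AND the same class.
    # Negative pair: two samples of different classes, same or different modality.
    z = F.normalize(torch.cat([z_text, z_image], dim=0), dim=1)   # (2N, d)
    y = torch.cat([labels, labels], dim=0)                        # (2N,)
    modality = torch.cat([torch.zeros(len(z_text)), torch.ones(len(z_image))])

    sim = z @ z.t() / temperature
    eye = torch.eye(len(z), dtype=torch.bool)
    same_class = y.unsqueeze(0) == y.unsqueeze(1)
    same_modality = modality.unsqueeze(0) == modality.unsqueeze(1)
    positives = same_class & same_modality & ~eye
    negatives = ~same_class

    loss, count = 0.0, 0
    for i in range(len(z)):
        pos, neg = sim[i][positives[i]], sim[i][negatives[i]]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # -log softmax of each positive against all positives and negatives
        loss += (torch.logsumexp(torch.cat([pos, neg]), dim=0) - pos).mean()
        count += 1
    return loss / max(count, 1)
```

In practice the two terms would be combined with weights tuned on a validation set; the paper's exact weighting is not given in this summary.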
Conclusion
  • In the task of multi-modal fake news detection, the authors use COBRA to determine whether a given bi-modal query corresponds to a real or fake news sample.
  • To visualize the purity of the joint embedding space for different classes and modality samples, the authors plot the joint embeddings of COBRA trained on both the Gossipcop and Politifact datasets (a t-SNE sketch follows this list).
  • For the MeTooMA sentiment task, the authors use a training, validation and test set of 4500, 1000 and 1000 text-image pairs respectively, across all models they test.
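For the embedding-space visualizations referenced above (the Figure 3a style t-SNE plots and the fake-news purity plots), a standard scikit-learn t-SNE projection is enough. The sketch below assumes the joint embeddings are already available as NumPy arrays; the paper's exact plotting configuration is not given in this summary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_joint_embeddings(embeddings, class_ids, modality_ids):
    # Project the joint embeddings to 2-D with t-SNE [28]; colour encodes the
    # semantic class, marker shape distinguishes the text and image modalities.
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    for mod, marker in [(0, "o"), (1, "^")]:   # 0 = text, 1 = image (assumed coding)
        mask = modality_ids == mod
        plt.scatter(coords[mask, 0], coords[mask, 1],
                    c=class_ids[mask], cmap="tab10", marker=marker, s=10)
    plt.title("COBRA joint embedding space (t-SNE)")
    plt.show()
```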
Tables
  • Table1: Performance (mAP) on the Wikipedia Dataset
  • Table2: Performance (mAP) on the MS-COCO Dataset
  • Table3: Performance (mAP) on the PKU-XMedia Dataset
  • Table4: Performance (mAP) on the NUS-Wide 10k dataset
  • Table5: Accuracy on the FakeNewsNet dataset
  • Table6: Accuracy on the MeTooMA dataset
  • Table7: Table 7
  • Table8: Dataset Descriptions - */*/* in the samples column denotes the number of training/validation/test samples used. ‘I’, ‘HC’ and ‘DS’ for the CrisisMMD dataset refer to the ‘Informativeness’, ‘Humanitarian Categories’ and ‘Disaster Severity’ tasks respectively
Funding
  • We empirically validate our model by achieving state-of-the-art results on four diverse downstream tasks: (1) cross-modal retrieval, (2) fine-grained multi-modal sentiment classification, (3) multi-modal fake news detection, and (4) multi-modal disaster classification
  • We achieve a 22% improvement over the previous state-of-the-art (DAML [66]) on the Wikipedia dataset (Table 1)
  • We achieve a 3% improvement over the previous state-of-the-art (SDML [18]) on the MS-COCO dataset (Table 2)
  • We achieve a 3.5% improvement over the previous state-of-the-art (SDML [18]) on the PKU-XMedia dataset (Table 3)
  • We achieve a 10.9% improvement over the previous state-of-the-art (ACMR [60]) on the NUS-Wide 10k dataset (Table 4)
  • 4.2.3 Results We achieve a 1.4% and a 1.1% improvement over the previous state-of-the-art (SpotFake+ [52]) on the Politifact and Gossipcop datasets, respectively (Table 5)
  • 4.3.3 Results We obtain an average classification accuracy of 88.32% across all classes on the MeTooMA Dataset
  • This is a 1.2% improvement over Early Fusion (Table 6)
Study subjects and analysis

4.1 Cross-modal Retrieval

The experimental results show that our proposed framework outperforms existing work by 1.09% - 22%, as it generates a robust and task-agnostic joint-embedding space.

4.1.1 Datasets We convert the images into 4096-dimensional feature vectors, generated using the fc7 layer of VGGnet [51].

• The Wikipedia dataset [46] contains 2866 text-image pairs, divided into 10 semantic classes, such as warfare, art & architecture and media. We use a training, validation and test set of 2173, 231 and 462 text-image pairs [46] respectively.

• The PKU-XMedia dataset [39, 69] contains 5000 text-image pairs, divided into 20 semantic classes. We use a training, validation and test set of 4000, 500 and 500 text-image pairs [39, 69] respectively.

• The MS-COCO dataset [26] contains 82079 text-image pairs, divided into 80 semantic classes. We use a training, validation and test set of 57455, 14624 and 10000 text-image pairs [18] respectively.

• The NUS-Wide 10k dataset [9] contains 10000 text-image pairs, divided into 10 semantic classes. We use a training, validation and test set of 8000, 1000 and 1000 text-image pairs [60] respectively.

4.1.2 Evaluation Metrics We compare our performance against state-of-the-art models based on Mean Average Precision (mAP). For a fair comparison, we ensure that we use the same features across models.

4.1.3 Results We report the highest mAP for Text to Image (TTI) and Image to Text (ITT) retrieval on all four datasets. From the t-SNE [28] plot for Wikipedia given in Figure 3a, we observe that COBRA is able to effectively form joint embeddings for different classes across modalities, resulting in superior performances across the aforementioned datasets. The average mAP scores on the Wikipedia dataset (Table 1) are:

Model          Average mAP (Wikipedia)
MCCA [47]      0.195
ml-CCA [42]    0.372
DDCAE [61]     0.299
JRL [70]       0.330
ACMR [60]      0.452
CMDN [36]      0.457
CCL [38]       0.481
D-SCMR [72]    0.499
SDML [18]      0.505
DAML [66]      0.520
COBRA          0.740

The comparisons on the remaining datasets (Tables 2-4) additionally include DCCA [4] and GSS-SL [71] as baselines. A sketch of the mAP computation follows.
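Since all the retrieval comparisons above are reported in mean Average Precision, here is a minimal sketch of how mAP is typically computed for cross-modal retrieval. The authors' evaluation script is not part of this summary; ranking by cosine similarity over the joint embeddings is an assumption.

```python
import numpy as np

def mean_average_precision(query_emb, gallery_emb, query_labels, gallery_labels):
    # Every query (e.g. a text embedding) ranks the whole gallery (e.g. all image
    # embeddings); a retrieved item is relevant if it shares the query's class.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                                     # cosine similarities

    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i])                   # best match first
        relevant = (gallery_labels[order] == query_labels[i]).astype(float)
        if relevant.sum() == 0:
            continue
        precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))

# e.g. Text to Image retrieval on the 462 Wikipedia test pairs:
# tti_map = mean_average_precision(text_emb, image_emb, labels, labels)
```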

4.2 Multi-modal Fake News Detection

In the task of multi-modal fake news detection, we use COBRA to determine whether a given bi-modal query (text and image) corresponds to a real or fake news sample.

4.2.1 Datasets For the multi-modal fake news detection task, we utilize the FakeNewsNet Repository [50]. It provides the Politifact and Gossipcop datasets, each of which contains two semantic classes, namely Real and Fake.

• The Politifact dataset contains 1056 text-image pairs. We get 321 Real and 164 Fake text-image pairs after pre-processing. We use a training, validation and test set of 381, 50 and 54 text-image pairs [52] respectively.

• The Gossipcop dataset contains 22140 text-image pairs. We get 10259 Real and 2581 Fake text-image pairs after pre-processing. We use a training, validation and test set of 10010, 1830 and 1000 text-image pairs [52] respectively.

4.2.2 Evaluation Metrics We compare our performance against existing state-of-the-art models based on the number of correctly classified queries (accuracy).
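How the joint embeddings are turned into a real/fake decision is not spelled out in this summary, so the following is one plausible setup rather than the authors' architecture: an assumed small MLP head over the concatenated text and image joint embeddings of a bi-modal query.

```python
import torch
import torch.nn as nn

class FakeNewsHead(nn.Module):
    # Assumed downstream classifier: concatenate the two joint embeddings of a
    # bi-modal query and predict Real (0) vs Fake (1).
    def __init__(self, embed_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 2),
        )

    def forward(self, z_text, z_image):
        return self.net(torch.cat([z_text, z_image], dim=1))

# Accuracy, as reported in Table 5:
# preds = FakeNewsHead(d)(z_text, z_image).argmax(dim=1)
# accuracy = (preds == targets).float().mean()
```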

4.3 Multi-modal Fine-grained Sentiment Classification

4.3.1 Datasets For the multi-modal fine-grained sentiment classification task, we analyze the performance of our model on the MeTooMA dataset [12]. This dataset contains 9973 tweets that have been manually annotated into 10 classes, namely text-only informative and image-only informative (Relevance); Support, Opposition and Neither (Stance); Directed Hate and Generalized Hate (Hate Speech); Allegation, Refutation and Justification (Dialogue Acts); and Sarcasm. We convert the images into 4096-dimensional feature vectors using the fc7 layer of VGGnet [51] and the texts into 300-dimensional feature vectors using Doc2Vec [25]. We use a training, validation and test set of 4500, 1000 and 1000 text-image pairs respectively, across all models that we test.

4.3.2 Evaluation Metrics We report the number of correctly classified queries (accuracy).

4.3.3 Results We observe only a small improvement on the text-only and image-only informative tasks because 53.2% of our training data consists of text-image pairs with conflicting labels, i.e., in a given text-image pair the text may be labelled as “relevant” whereas the corresponding image may be labelled as “irrelevant”. Furthermore, the classes under the Hate Speech, Sarcasm and Dialogue Acts categories each have fewer than 600 samples. In categories such as Stance, where the ‘Support’ class has over 3000 samples, we observe much larger improvements in performance.

4.4 Multi-modal Disaster Classification

4.4.1 Datasets For the multi-modal disaster classification task, we utilize the CrisisMMD dataset [2, 34]. It consists of 16058 tweets and 18082 images that were collected during natural disasters. Three classification tasks can be performed on this dataset: Informativeness, Humanitarian Categories and Disaster Severity. We convert the texts into 300-dimensional feature vectors using Doc2Vec [25]. We use a training set of 2000 text-image pairs, a validation set of 793 text-image pairs for the first two classification tasks, a validation set of size 697 for the third classification task, and a test set of 500 text-image pairs. Architectural details can be found in the supplementary material.
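Several of the tasks above encode images as 4096-dimensional fc7 features from VGGnet [51] and texts as 300-dimensional Doc2Vec [25] vectors. A hedged sketch of that preprocessing is given below, using torchvision's VGG-16 and gensim's Doc2Vec; the authors' exact Doc2Vec training corpus and hyper-parameters are not given in this summary.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Image side: 4096-d fc7 features from VGG-16 [51].
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = vgg.classifier[:5]   # keep layers up to fc7 (+ ReLU), 4096-d output
vgg.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_features(pil_image):
    with torch.no_grad():
        return vgg(preprocess(pil_image).unsqueeze(0)).squeeze(0)   # shape (4096,)

# Text side: 300-d Doc2Vec [25] features.
def train_doc2vec(token_lists):
    docs = [TaggedDocument(tokens, [i]) for i, tokens in enumerate(token_lists)]
    return Doc2Vec(docs, vector_size=300, epochs=20)   # hyper-parameters are assumptions

def text_features(doc2vec_model, tokens):
    return torch.tensor(doc2vec_model.infer_vector(tokens))        # shape (300,)
```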

Reference
  • Mansi Agarwal, Maitree Leekha, Ramit Sawhney, Rajiv Ratn Shah, Rajesh Kumar Yadav, and Dinesh Kumar Vishwakarma. 2020. MEMIS: Multimodal Emergency Management Information System. In Advances in Information Retrieval, Joemon M. Jose, Emine Yilmaz, João Magalhães, Pablo Castells, Nicola Ferro, Mário J. Silva, and Flávio Martins (Eds.). Springer International Publishing, Cham, 479–494.
    Google ScholarLocate open access versionFindings
  • Firoj Alam, Ferda Ofli, and Muhammad Imran. 2018. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM) (USA, 23-28).
    Google ScholarLocate open access versionFindings
  • Elad Amrani, Rami Ben-Ari, Daniel Rotman, and Alex Bronstein. 2020. Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning. arXiv preprint arXiv:2003.03186 (2020).
    Findings
  • Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. Deep Canonical Correlation Analysis. In Proceedings of the 30th International Conference on Machine Learning (Proceedings of Machine Learning Research), Sanjoy Dasgupta and David McAllester (Eds.), Vol. 28. PMLR, Atlanta, Georgia, USA, 1247–1255. http://proceedings.mlr.press/v28/andrew13.html
    Locate open access versionFindings
  • Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. 2019. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229 (2019).
    Findings
  • Devanshu Arya, Stevan Rudinac, and Marcel Worring. 2019. HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities. In Proceedings of the 27th ACM International Conference on Multimedia. 2245–2253.
    Google ScholarLocate open access versionFindings
  • Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 2 (Feb. 2019), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607
    Locate open access versionFindings
  • Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:cs.LG/2002.05709
    Google ScholarLocate open access versionFindings
  • Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. 2009. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval (Santorini, Fira, Greece) (CIVR ’09). Association for Computing Machinery, New York, NY, USA, Article 48, 9 pages. https://doi.org/10.1145/1646396.1646452
    Locate open access versionFindings
  • Keyan Ding, Ronggang Wang, and Shiqi Wang. 2019. Social Media Popularity Prediction: A Multiple Feature Fusion Approach with Deep Neural Networks. In Proceedings of the 27th ACM International Conference on Multimedia (Nice, France) (MM ’19). Association for Computing Machinery, New York, NY, USA, 2682–2686. https://doi.org/10.1145/3343031.3356062
    Locate open access versionFindings
  • Fangxiang Feng, Xiaojie Wang, and Ruifan Li. 2014. Cross-Modal Retrieval with Correspondence Autoencoder. In Proceedings of the 22nd ACM International Conference on Multimedia (Orlando, Florida, USA) (MM ’14). Association for Computing Machinery, New York, NY, USA, 7–16. https://doi.org/10.1145/2647868.2654902
    Locate open access versionFindings
  • Akash Gautam, Puneet Mathur, Rakesh Gosangi, Debanjan Mahata, Ramit Sawhney, and Rajiv Ratn Shah. 2019. #MeTooMA: Multi-Aspect Annotations of Tweets Related to the MeToo Movement. arXiv:cs.CL/1912.06927
    Google ScholarFindings
  • A. K. Gautam, L. Misra, A. Kumar, K. Misra, S. Aggarwal, and R. R. Shah. 2019. Multimodal Analysis of Disaster Tweets. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). 94–103.
    Google ScholarLocate open access versionFindings
  • Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2012. A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics. CoRR abs/1212.4522 (2012). arXiv:1212.4522 http://arxiv.org/abs/1212.4522
    Findings
  • Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297–304.
    Google ScholarLocate open access versionFindings
  • Xin Hong, Pengfei Xiong, Renhe Ji, and Haoqiang Fan. 2019. Deep Fusion Network for Image Completion. In Proceedings of the 27th ACM International Conference on Multimedia (Nice, France) (MM ’19). Association for Computing Machinery, New York, NY, USA, 2033–2042. https://doi.org/10.1145/3343031.3351002
    Locate open access versionFindings
  • Harold Hotelling. 1936. Relations Between Two Sets of Variates. Biometrika 28, 3/4 (1936), 321–377. http://www.jstor.org/stable/2333955
    Locate open access versionFindings
  • Peng Hu, Liangli Zhen, Dezhong Peng, and Pei Liu. 2019. Scalable Deep Multimodal Learning for Cross-Modal Retrieval. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France) (SIGIR’19). Association for Computing Machinery, New York, NY, USA, 635–644. https://doi.org/10.1145/3331184.3331213
    Locate open access versionFindings
  • Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aaron van den Oord. 2019. Data-Efficient Image Recognition with Contrastive Predictive Coding. arXiv:cs.CV/1905.09272
    Google ScholarLocate open access versionFindings
  • Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759 (2016).
    Findings
  • Onno Kampman, Elham J. Barezi, Dario Bertero, and Pascale Fung. 2018. Investigating Audio, Video, and Text Fusion Methods for End-to-End Automatic Personality Prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Melbourne, Australia, 606–611. https://doi.org/10.18653/v1/P18-2096
    Locate open access versionFindings
  • Meina Kan, Shiguang Shan, and Xilin Chen. 2016. Multi-view Deep Network for Cross-View Classification. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 4847–4855.
    Google ScholarLocate open access versionFindings
  • Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, and Vasudeva Varma. 2019. MVAE: Multimodal Variational Autoencoder for Fake News Detection. In The World Wide Web Conference (San Francisco, CA, USA) (WWW ’19). Association for Computing Machinery, New York, NY, USA, 2915–2921. https://doi.org/10.1145/3308558.3313552
    Locate open access versionFindings
  • Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised Contrastive Learning. arXiv preprint arXiv:2004.11362 (2020).
    Findings
  • Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. CoRR abs/1405.4053 (2014). arXiv:1405.4053 http://arxiv.org/abs/1405.4053
    Findings
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 740–755.
    Google ScholarLocate open access versionFindings
  • Fei Liu, Jing Liu, Richang Hong, and Hanqing Lu. 2019. Erasing-based Attention Learning for Visual Question Answering. In Proceedings of the 27th ACM International Conference on Multimedia. 1175–1183.
    Google ScholarLocate open access versionFindings
  • Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research (2008).
    Google ScholarLocate open access versionFindings
  • Sijie Mai, Haifeng Hu, and Songlong Xing. 2019. Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion. arXiv:cs.CV/1911.07848
    Google ScholarFindings
  • Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K. RoyChowdhury. 2018. Learning Joint Embedding with Multimodal Cues for CrossModal Video-Text Retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval (Yokohama, Japan) (ICMR ’18). Association for Computing Machinery, New York, NY, USA, 19–27. https://doi.org/10.1145/3206025.3206064
    Locate open access versionFindings
  • Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, and Amit K. Roy-Chowdhury. 2018. Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. In Proceedings of the 26th ACM International Conference on Multimedia (Seoul, Republic of Korea) (MM ’18). Association for Computing Machinery, New York, NY, USA, 1856–1864. https://doi.org/10.1145/3240508.3240712
    Locate open access versionFindings
  • Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems. 2265–2273.
    Google ScholarFindings
  • Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. 2016. Deep Multimodal Fusion for Persuasiveness Prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (Tokyo, Japan) (ICMI ’16). Association for Computing Machinery, New York, NY, USA, 284–288. https://doi.org/10.1145/2993148.2993176
    Locate open access versionFindings
  • Ferda Ofli, Firoj Alam, and Muhammad Imran. 2020. Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response. In 17th International Conference on Information Systems for Crisis Response and Management. ISCRAM, ISCRAM.
    Google ScholarLocate open access versionFindings
  • Liang Peng, Yang Yang, Zheng Wang, Xiao Wu, and Zi Huang. 2019. CRANet: Composed Relation Attention Network for Visual Question Answering. In Proceedings of the 27th ACM International Conference on Multimedia. 1202–1210.
    Google ScholarLocate open access versionFindings
  • Yuxin Peng, Xin Huang, and Jinwei Qi. 2016. Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (New York, New York, USA) (IJCAI’16). AAAI Press, 3846–3853.
    Google ScholarLocate open access versionFindings
  • Yuxin Peng and Jinwei Qi. 2019. CM-GANs: Cross-Modal Generative Adversarial Networks for Common Representation Learning. ACM Trans. Multimedia Comput. Commun. Appl. 15, 1, Article 22 (Feb. 2019), 24 pages. https://doi.org/10.1145/3284750
    Locate open access versionFindings
  • Yuxin Peng, Jinwei Qi, Xin Huang, and Yuxin Yuan. 2018. CCL: Cross-Modal Correlation Learning With Multigrained Fusion by Hierarchical Network. Trans. Multi. 20, 2 (Feb. 2018), 405–420. https://doi.org/10.1109/TMM.2017.2742704
    Findings
  • Y. Peng, X. Zhai, Y. Zhao, and X. Huang. 2016. Semi-Supervised Cross-Media Feature Learning With Unified Patch Graph Regularization. IEEE Transactions on Circuits and Systems for Video Technology 26, 3 (2016), 583–596.
    Google ScholarLocate open access versionFindings
  • Hai Pham, Thomas Manzini, Paul Liang, and Barnabas Poczos. 2018. Seq2Seq2Sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis. 53–63. https://doi.org/10.18653/v1/W18-3308
    Findings
  • S. Poria, I. Chaturvedi, E. Cambria, and A. Hussain. 2016. Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis. In 2016 IEEE
    Google ScholarLocate open access versionFindings
  • V. Ranjan, N. Rasiwasia, and C. V. Jawahar. 2015. Multi-label Cross-Modal Retrieval. In 2015 IEEE International Conference on Computer Vision (ICCV). 4094– 4102.
    Google ScholarLocate open access versionFindings
  • Jinfeng Rao, Hua He, and Jimmy Lin. 2016. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 1913– 1916.
    Google ScholarLocate open access versionFindings
  • Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy, and Nuno Vasconcelos. 2010. A New Approach to CrossModal Multimedia Retrieval. In Proceedings of the 18th ACM International Conference on Multimedia (Firenze, Italy) (MM ’10). Association for Computing Machinery, New York, NY, USA, 251–260. https://doi.org/10.1145/1873951.1873987
    Locate open access versionFindings
  • Jan Rupnik and John Shawe-Taylor. 2010. Multi-View Canonical Correlation Analysis. SiKDD (01 2010).
    Google ScholarFindings
  • Sebastian Schmiedeke, Pascal Kelm, and Thomas Sikora. 2012. Cross-Modal Categorisation of User-Generated Video Sequences. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval (Hong Kong, China) (ICMR ’12). Association for Computing Machinery, New York, NY, USA, Article 25, 8 pages. https://doi.org/10.1145/2324796.2324828
    Locate open access versionFindings
  • Rajiv Shah and Roger Zimmermann. 2017. Multimodal analysis of user-generated multimedia content. Springer.
    Google ScholarFindings
  • Kai Shu. 2019. FakeNewsNet. https://doi.org/10.7910/DVN/UEMMHS
  • [51] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:cs.CV/1409.1556
    Locate open access versionFindings
  • [52] Shivangi Singhal, Anubha Kabra, Mohit Sharma, Rajiv Ratn Shah, Tanmoy Chakraborty, and Ponnurangam Kumaraguru. 2020. SpotFake+: A Multimodal Framework for Fake News Detection via Transfer Learning (Student Abstract). (2020).
    Google ScholarFindings
  • [53] S. Singhal, R. R. Shah, T. Chakraborty, P. Kumaraguru, and S. Satoh. 2019. SpotFake: A Multi-modal Framework for Fake News Detection. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). 39–47.
    Google ScholarLocate open access versionFindings
  • [54] Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems. 1857–1865.
    Google ScholarFindings
  • [55] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (San Francisco, California, USA) (AAAI’17). AAAI Press, 4278–4284.
    Google ScholarLocate open access versionFindings
  • [56] Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive Multiview Coding. arXiv:cs.CV/1906.05849
    Google ScholarFindings
  • [57] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. 2020. What makes for good views for contrastive learning. arXiv:cs.CV/2005.10243
    Google ScholarLocate open access versionFindings
  • [58] Thi Quynh Nhi Tran, Hervé Le Borgne, and Michel Crucianu. 2016. Cross-Modal Classification by Completing Unimodal Representations. In Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion (Amsterdam, The Netherlands) (iV&L-MM ’16). Association for Computing Machinery, New York, NY, USA, 17–25. https://doi.org/10.1145/2983563.2983570
    Locate open access versionFindings
  • [59] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arXiv:cs.LG/1807.03748
    Google ScholarFindings
  • [60] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. 2017. Adversarial Cross-Modal Retrieval. In Proceedings of the 25th ACM International Conference on Multimedia (Mountain View, California, USA) (MM ’17). Association for Computing Machinery, New York, NY, USA, 154–162. https://doi.org/10.1145/3123266.3123326
    Locate open access versionFindings
  • [61] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. 2015. On Deep Multi-View Representation Learning. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (Lille, France) (ICML’15). JMLR.org, 1083–1092.
    Google ScholarLocate open access versionFindings
  • [62] Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao. 2018. EANN: Event Adversarial Neural Networks for MultiModal Fake News Detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New York, NY, USA, 849–857. https://doi.org/10.1145/3219819.3219903
    Locate open access versionFindings
  • [63] Zilong Wang, Zhaohong Wan, and Xiaojun Wan. 2020. TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 2514–2520. https://doi.org/10.1145/3366423.3380000
    Locate open access versionFindings
  • [64] Yiling Wu, Shuhui Wang, and Qingming Huang. 2018. Learning Semantic Structure-Preserved Embeddings for Cross-Modal Retrieval. In Proceedings of the 26th ACM International Conference on Multimedia (Seoul, Republic of Korea) (MM ’18). Association for Computing Machinery, New York, NY, USA, 825–833. https://doi.org/10.1145/3240508.3240521
    Locate open access versionFindings
  • [65] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L. Morency. 2013. YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context. IEEE Intelligent Systems 28, 3 (2013), 46–53.
    Google ScholarLocate open access versionFindings
  • [66] Xing Xu, Li He, Huimin Lu, Lianli Gao, and Yanli Ji. 2019. Deep Adversarial Metric Learning for Cross-Modal Retrieval. World Wide Web 22, 2 (March 2019), 657–672. https://doi.org/10.1007/s11280-018-0541-x
    Locate open access versionFindings
  • [67] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 5754–5764. http://papers.nips.cc/paper/8812-xlnet-generalized-autoregressivepretraining-for-language-understanding
    Locate open access versionFindings
  • [68] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 1103–1114. https://doi.org/10.18653/v1/D17-1115
    Locate open access versionFindings
  • [69] Xiaohua Zhai, Yuxin Peng, and Jianguo Xiao. 2014. Learning Cross-Media Joint Representation with Sparse and Semi-Supervised Regularization. IEEE Transactions on Circuits and Systems for Video Technology 24 (06 2014), 1–1. https://doi.org/10.1109/TCSVT.2013.2276704
    Locate open access versionFindings
  • [70] Xiaohua Zhai, Yuxin Peng, and Jianguo Xiao. 2014. Learning Cross-Media Joint Representation with Sparse and Semi-Supervised Regularization. IEEE Transactions on Circuits and Systems for Video Technology 24 (06 2014), 1–1. https://doi.org/10.1109/TCSVT.2013.2276704
    Locate open access versionFindings
  • [71] L. Zhang, B. Ma, G. Li, Q. Huang, and Q. Tian. 2018. Generalized Semi-supervised and Structured Subspace Learning for Cross-Modal Retrieval. IEEE Transactions on Multimedia 20, 1 (2018), 128–141.
    Google ScholarLocate open access versionFindings
  • [72] Liangli Zhen, Peng Hu, Xu Wang, and Dezhong Peng. 2019. Deep Supervised Cross-Modal Retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
    Google ScholarLocate open access versionFindings