
Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents

EMNLP 2020


Abstract

Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned […]

Introduction
  • Multimodal data consisting of text and images is ubiquitous but increasingly diverse: libraries are digitizing visual-textual collections (British Library Labs, 2016; The Smithsonian, 2020); news organizations release over 1M images per year to accompany news articles (The Associated Press, 2020); and social media messages are rarely sent without visual accompaniment.
  • The authors focus on one such specialized, multimodal domain: New York City real estate listings from the website StreetEasy.
  • Most prior image captioning work has focused on rare and expensive single-image, single-caption collections such as MSCOCO, which focuses on literal, context-free descriptions for 80 object types (Lin et al., 2014).
  • In the specialized real estate context, “pool” commonly refers to a swimming pool.
Highlights
  • Multimodal data consisting of text and images is ubiquitous but increasingly diverse: libraries are digitizing visual-textual collections (British Library Labs, 2016; The Smithsonian, 2020); news organizations release over 1M images per year to accompany news articles (The Associated Press, 2020); and social media messages are rarely sent without visual accompaniment.
  • We focus on one such specialized, multimodal domain: New York City real estate listings from the website StreetEasy.
  • As a result of this self-similarity, in §3, we find that image-text grounding is difficult for off-the-shelf image tagging methods like multinomial/softmax regression, which leverage variation in both lexical and visual features across documents.
  • We show that EntSharp outperforms both object detection and image tagging baselines at retrieving relevant images for given word types.
  • The method is effective at finding contextual lexical groundings of words in unlabeled multi-image, multi-sentence documents even in the presence of high cross-document similarity.
Methods
  • The authors selected words with a variety of frequencies and degrees of lexical/visual overlap with ImageNet categories: “kitchen”, “bedroom” (175k), “washer” (65k), “outdoor” (50k), “fitness” (49k), and “pool” (29k).
  • For each of these words of interest, the authors labeled a different random 1% subset of all images (2,943 images each): an image in a sample was labeled true if it corresponded with any sense of the associated word and false otherwise.
  • The authors perform evaluations on the entire samples of 2,943 images in order to avoid overstating performance.
Results
  • As shown in Table 1, EntSharp outperforms all baselines on PR AUC on all six of the evaluation words.
  • Table 2 shows the ImageNet object labels associated with each word in manually selected images.
  • Though “kitchen” is not a category in the ImageNet dataset, “microwave”, “refrigerator”, and “dishwasher” are, and these words are sufficiently close to “kitchen” to learn an association.
  • EntSharp achieves the highest PR AUC even in the case of “washer”, which is a category learned by the object detection baselines.
  • EntSharp’s performance increase is most pronounced for the words “outdoor”, “bedroom”, “pool”, and especially “fitness”, which have dissimilar visual manifestations in StreetEasy and ImageNet.
Conclusion
  • The authors present EntSharp, a simple clustering-based algorithm for learning image groundings for words.
  • It is motivated by the unlabeled multimodal data that exists in abundance rather than relying on expensive custom datasets.
  • The method is effective at finding contextual lexical groundings of words in unlabeled multi-image, multi-sentence documents even in the presence of high cross-document similarity.
Tables
  • Table 1: Area under the precision-recall curve (AUC) for each grounding method on each labeled random image subset. Best-in-column is bolded. Random performance results in an AUC equal to the percentage labeled true.
  • Table 2: Top DenseNet169 ImageNet class predictions for selected example images.
Related work
  • Learning image-text relationships is central to many applications, including image captioning/tagging (Kulkarni et al., 2013; Mitchell et al., 2013; Karpathy and Fei-Fei, 2015) and cross-modal retrieval/search (Jeon et al., 2003; Rasiwasia et al., 2010). While most captioning work assumes a supervised one-to-one corpus, recent works consider documents containing multiple images/sentences (Park and Kim, 2015; Shin et al., 2016; Agrawal et al., 2016; Liu et al., 2017; Chu and Kao, 2017; Hessel et al., 2019; Nag Chowdhury et al., 2020). Furthermore, compared to crowd-annotated captioning datasets, web corpora are more challenging, as image-text relationships often transcend literal description (Marsh and White, 2003; Alikhani and Stone, 2019).

    We consider a direct image-text grounding task: for each word type, we aim to retrieve images most associated with that word. Models are evaluated by their capacity to compute word-image similarities that align with human judgment.
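
As a concrete illustration of this retrieval setup, the sketch below ranks a labeled image set by cosine similarity to a single word's embedding in image-feature space and scores the ranking by area under the precision-recall curve, the metric reported in Table 1. The function and variable names, and the use of scikit-learn, are illustrative assumptions rather than the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def pr_auc_for_word(word_embedding, image_features, labels):
    """Rank images by similarity to one word and return PR AUC.

    word_embedding: (d,) vector for the word in image-feature space.
    image_features: (n, d) matrix of image feature vectors.
    labels:         (n,) binary array, 1 if the image was judged relevant.
    """
    # Cosine similarity between the word embedding and every image.
    w = word_embedding / np.linalg.norm(word_embedding)
    imgs = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    scores = imgs @ w

    # Precision-recall curve over the similarity scores, then its area.
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision)
```

For a random ranking this quantity approaches the fraction of images labeled true, which is the random baseline noted in the Table 1 caption.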

    EntSharp. For each image in a document, we iteratively infer a probability distribution over the words present in the document. During training, these distributions are encouraged to have low entropy. The output is an embedding of each word into image space: the model computes word-image similarities in this joint space. This can be thought of as a soft clustering, such that each word type is equivalent to a cluster but only certain clusters are available to certain images. This approach could also be situated within the framework of multiple-instance learning (Carbonneau et al., 2018).
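
The description above can be read as a document-constrained soft clustering in image-feature space: each word type owns a centroid, each image spreads probability only over the word types of its own document, the per-image distributions are sharpened toward low entropy, and centroids are re-estimated as weighted means of image features. The sketch below is a minimal NumPy rendering of that reading; the dot-product similarity, the sharpen-by-exponentiation step, and the unit-norm projection are illustrative assumptions, not the authors' exact EntSharp updates.

```python
import numpy as np

def entsharp_sketch(doc_images, doc_words, vocab, dim, n_iters=10, sharpen=2.0, seed=0):
    """Learn word embeddings in image-feature space by document-constrained
    soft clustering (a sketch of the idea, not the published algorithm).

    doc_images: list of (n_i, dim) arrays of image features, one per document.
    doc_words:  list of lists of word types in each document (all in vocab).
    vocab:      list of all word types.
    Returns a (len(vocab), dim) matrix: one image-space embedding per word.
    """
    rng = np.random.default_rng(seed)
    word_index = {w: i for i, w in enumerate(vocab)}
    centroids = rng.normal(size=(len(vocab), dim))
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    for _ in range(n_iters):
        weighted_sum = np.zeros_like(centroids)
        weight_total = np.zeros(len(vocab))
        for images, words in zip(doc_images, doc_words):
            idx = np.array([word_index[w] for w in set(words)])
            # Each image gets a distribution over only its document's words.
            sims = images @ centroids[idx].T                   # (n_i, k)
            probs = np.exp(sims - sims.max(axis=1, keepdims=True))
            probs /= probs.sum(axis=1, keepdims=True)
            # Sharpen toward low entropy, then renormalize.
            probs = probs ** sharpen
            probs /= probs.sum(axis=1, keepdims=True)
            # Accumulate soft assignments of images to word centroids.
            np.add.at(weighted_sum, idx, probs.T @ images)     # (k, dim)
            np.add.at(weight_total, idx, probs.sum(axis=0))
        updated = weight_total > 0
        centroids[updated] = weighted_sum[updated] / weight_total[updated, None]
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12
    return centroids
```

Ranking images by similarity to a word's row of the returned matrix yields that word's grounding, which is what the retrieval evaluation above measures.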
Funding
  • We would particularly like to thank Grant Long for putting together the StreetEasy data, as well as Ondrej Linda, Ramin Mehran, and Randy Puttick for helpful conversations.
  • This work was supported by Zillow Group and NSF #1652536.
Study subjects and analysis
Multimodal image-text datasets: 7
We identify domain-specific associations between words and images from unlabeled multi-sentence, multi-image documents. Documents in the StreetEasy dataset are much more visually similar to each other than documents in seven multimodal image-text datasets spanning storytelling, cooking, travel blogs, captioning, etc. (Lin et al., 2014; Huang et al., 2016; Yagcioglu et al., 2018; Hessel et al., 2018, 2019; Nag Chowdhury et al., 2020).
Figure captions: Examples from StreetEasy show that words like “kitchen” are frequent and grounded; black lines represent 99.99% CI. Top images for EntSharp and object detection baselines on the StreetEasy dataset: images in each word’s section come from the same evaluation set, and each row is ranked in decreasing order from left to right (for example, the three rows in the “kitchen” section are different orderings of the same 2,943 images); images with dark blue borders were labeled true with respect to the word, and those with light red borders were labeled false. E: EntSharp. W: word2vec object detection baseline. R: RoBERTa object detection baseline.

Reference
  • Harsh Agrawal, Arjun Chandrasekaran, Dhruv Batra, Devi Parikh, and Mohit Bansal. 2016. Sort story: Sorting jumbled images and captions into stories. In EMNLP.
  • Malihe Alikhani and Matthew Stone. 2019. “Caption” as a coherence relation: Evidence and implications. In Proceedings of the Second Workshop on Shortcomings in Vision and Language, pages 58–67.
  • British Library Labs. 2016. Digitised books. https://data.bl.uk/digbks/.
  • Marc-Andre Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. 2018. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353.
  • Yu Liu, Jianlong Fu, Tao Mei, and Chang Wen Chen. 2017. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI.
  • Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. 2018. Exploring the limits of weakly supervised pretraining. In ECCV, pages 181–196.
  • Wei-Ta Chu and Ming-Chih Kao. 2017. Blog article summarization with image-text alignment techniques. In 2017 IEEE International Symposium on Multimedia (ISM), pages 244–247. IEEE.
  • Jack Hessel, Lillian Lee, and David Mimno. 2019. Unsupervised discovery of multimodal links in multi-image, multi-sentence documents. In EMNLP.
  • Emily E. Marsh and Marilyn Domas White. 2003. A taxonomy of relationships between images and text. Journal of Documentation.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NeurIPS, pages 3111–3119.
  • Jack Hessel, David Mimno, and Lillian Lee. 2018. Quantifying the visual concreteness of words and topics in multimodal datasets. In NAACL.
  • Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In CVPR, pages 4700–4708.
  • Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In NAACL, pages 1233–1239.
  • Jiwoon Jeon, Victor Lavrenko, and Raghavan Manmatha. 2003. Automatic image annotation and retrieval using cross-media relevance models. In SIGIR.
  • Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Margaret Mitchell, Kees Van Deemter, and Ehud Reiter. 2013. Generating expressions that refer to visible objects. In NAACL. Association for Computational Linguistics (ACL).
  • Sreyasi Nag Chowdhury, William Cheng, Gerard De Melo, Simon Razniewski, and Gerhard Weikum. 2020. Illustrate your story: Enriching text with images. In ACM WSDM.
  • Douglas L. Nelson, Cathy L. McEvoy, and Thomas A. Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3):402–407.
  • Cesc C. Park and Gunhee Kim. 2015. Expressing an image stream with a sequence of natural sentences. In NeurIPS, pages 73–81.
  • Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R. G. Lanckriet, Roger Levy, and Nuno Vasconcelos. 2010. A new approach to cross-modal multimedia retrieval. In ACM MM, pages 251–260.
  • Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2013. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252.
  • Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In ICML.
  • Andrew Shin, Katsunori Ohnishi, and Tatsuya Harada. 2016. Beyond caption to narrative: Video captioning with multiple sentences. In 2016 IEEE International Conference on Image Processing (ICIP).
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV, pages 740–755.
  • The Associated Press. 2020. AP information: https://www.ap.org/en-us/, accessed May 14, 2020.
  • The Smithsonian. 2020. Smithsonian open access.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. 2018. RecipeQA: A challenge dataset for multimodal comprehension of cooking recipes. In EMNLP.
  • We compute a length-controlled version of word mover’s distance (Kusner et al., 2015) to measure the textual distances between documents. This was inspired by the simple extension to “image mover’s distance” enabled by swapping the word2vec token representations for CNN image representations.
  • After computing image/word mover’s distances, we noticed that these metrics were slightly correlated with document length; this correlation was also noted by Kusner et al. (2015), who mention that longer documents might be closer to others “as longer documents may contain several similar words.” To account for this, we implemented a version of mover’s distances that selects a bootstrap sample of b1 = 50 words and b2 = 10 images before computing distances. The scatterplot we report in Figure 2 is insensitive to reasonable choices of these parameters, as it looks largely the same for any b1, b2 ∈ {10, 30, 50} × {3, 5, 10}. (A minimal sketch of this bootstrap computation appears after these notes.)
  • The dataset consists of 29,347 English-language real estate listings from the StreetEasy website from June 2019. They contain a total of 294,279 images and 24,078,190 word tokens across 34,564 word types. We preprocess the text by removing […]
  • Image tagging. We reserved 20% of the StreetEasy corpus as a validation set. We don’t hold out a test set: this tasks the algorithms only with fitting the dataset, not generalizing beyond it. We use the validation set for early stopping, model selection, and hyperparameter optimization. We optimize learning rate (in {0.001, 0.0005, 0.0007}) and number of layers (in {0, 1, 2, 3, 4, 5}). We decay the learning rate upon validation loss plateau. We use the Adam optimizer (Kingma and Ba, 2015). (A hedged sketch of this training setup also follows these notes.)
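
To make the length-controlled mover's distance concrete, the sketch below computes word mover's distance between two documents after drawing a bootstrap sample of b1 = 50 tokens from each side, assuming precomputed word2vec-style vectors. Because both samples are uniform and of equal size, optimal transport reduces to a minimum-cost one-to-one assignment, so SciPy's assignment solver gives the exact cost; the function names and the Euclidean ground metric are assumptions, not the authors' implementation. Replacing token vectors with per-image CNN features (and b1 with b2 = 10) gives the analogous image mover's distance.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def bootstrap_wmd(doc_a, doc_b, vectors, b1=50, seed=0):
    """Length-controlled word mover's distance between two tokenized documents.

    doc_a, doc_b: lists of tokens (only tokens present in `vectors` are used).
    vectors:      dict mapping token -> embedding vector.
    b1:           bootstrap sample size drawn from each document.
    """
    rng = np.random.default_rng(seed)

    def sampled_vectors(doc):
        toks = [t for t in doc if t in vectors]
        picks = rng.choice(len(toks), size=b1, replace=True)
        return np.stack([vectors[toks[i]] for i in picks])

    xa, xb = sampled_vectors(doc_a), sampled_vectors(doc_b)
    cost = cdist(xa, xb)                       # pairwise Euclidean token distances
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one transport plan
    return cost[rows, cols].mean()             # average transport cost
```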
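
The image-tagging setup in the last note can likewise be sketched as a small training loop: Adam, a learning rate from the stated grid, decay on validation-loss plateau, and early stopping with model selection on the validation split. Everything beyond those stated choices (the synthetic data, the MLP head over precomputed DenseNet-169-sized features, the patience values) is an assumption for illustration.

```python
import torch
from torch import nn

# Synthetic stand-ins for precomputed image features and binary word tags.
feature_dim, vocab_size, n_train, n_val = 1664, 200, 512, 128
train_x = torch.randn(n_train, feature_dim)
train_y = torch.randint(0, 2, (n_train, vocab_size)).float()
val_x = torch.randn(n_val, feature_dim)
val_y = torch.randint(0, 2, (n_val, vocab_size)).float()

# A small tagging head; the paper tunes the number of layers in {0, ..., 5}.
model = nn.Sequential(nn.Linear(feature_dim, 512), nn.ReLU(), nn.Linear(512, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # grid: {1e-3, 5e-4, 7e-4}
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
criterion = nn.BCEWithLogitsLoss()                          # one binary tag per word type

best_val, patience_left = float("inf"), 5
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(train_x), train_y)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(val_x), val_y).item()
    scheduler.step(val_loss)                    # decay learning rate on validation plateau
    if val_loss < best_val:
        best_val, patience_left = val_loss, 5   # keep the best model on validation loss
    else:
        patience_left -= 1
        if patience_left == 0:
            break                               # early stopping
```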
Author
Gregory Yauney