Feature Deformation Meta-Networks in Image Captioning of Novel Objects

AAAI, pp. 10494-10501, 2020.

Abstract:

This paper studies the task of image captioning with novel objects, which exist only in testing images. Intrinsically, this task reflects the generalization ability of models in understanding and captioning the semantic meanings of visual concepts and objects unseen in the training set, sharing similarity with one/zero-shot learning. The...

Introduction
  • It is one of the long-term goals of the AI community to pursue an agent that can automatically and linguistically describe captured visual signals
  • This goal is formulated as the task of image captioning, which has made significant progress powered by deep architectures (Xu et al 2015).
  • Novel object captioning (Tran et al 2016; Anne Hendricks et al 2016) has recently been studied to generate appropriate descriptions for novel classes that have no training instances
  • To address this problem, extra resources have been investigated.
  • These resources should be easy to collect and contain abundant visual concepts
Highlights
  • It is one of the long-term goals of the AI community to pursue an agent that can automatically and linguistically describe captured visual signals
  • We propose feature deformation meta-networks (FDM-net) that aim to help image captioning models adapt to novel objects by generating deformed training data
  • In Tab. 4, comparing with popular methods grouped by whether constrained beam search is used, our framework with the mis-labelled probability strategy and the scene graph sentence reconstruction network achieves state-of-the-art scores in terms of out-of-domain SPICE, METEOR and average F1, as well as on all metrics on the in-domain subset (a sketch of the F1 computation is given after this list)
  • We introduce external knowledge through the scene graph sentence reconstruction network and the mis-labelled probability strategy to assist in generating reasonable pairs
  • We propose feature deformation meta-networks combined with the prevailing encoder-decoder framework to tackle the novel object captioning problem
  • Extensive experiments demonstrate that our approach achieves state-of-the-art performance on the novel object captioning task
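As a reference for the scores mentioned above, the following is a minimal sketch of the per-class F1 metric commonly used on this benchmark (following Anne Hendricks et al 2016): precision and recall are computed over whether a generated caption mentions the novel word versus whether the image actually contains the object, and the average F1 is the mean over the novel classes. The function and variable names are illustrative, not the authors' code.

```python
# Sketch of the novel-object F1 metric (per Anne Hendricks et al 2016).
# `mentions[i]` says whether caption i mentions the novel word;
# `contains[i]` says whether image i actually contains the novel object.
def novel_object_f1(mentions, contains):
    tp = sum(1 for m, c in zip(mentions, contains) if m and c)
    pred = sum(mentions)   # captions that mention the word
    gt = sum(contains)     # images that contain the object
    precision = tp / pred if pred else 0.0
    recall = tp / gt if gt else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# The average F1 reported in Tab. 4 is the mean of this score over the
# eight novel classes of the DCC split (see Table 1).
```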
Methods
  • The authors propose a framework that combines a feature deformation sub-net and a scene graph sentence reconstruction sub-net (SGSR) to caption images with novel objects.
  • The pipeline of generating training instances with visual and text pairs is shown in Fig. 2 (a simplified sketch of the deformation step is given after this list).
  • In the following, both the feature deformation sub-net and the scene graph sentence reconstruction sub-net are introduced in detail
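The following is a minimal sketch of the feature-level deformation idea described above, not the authors' implementation: the RoI feature of a seen object that is similar to the novel class is overwritten with an RoI feature of the novel object, and the corresponding word in the paired caption is substituted, yielding a new feature-text training pair. All names are illustrative; in the paper this step is realized by the feature deformation sub-net and the resulting sentence is further refined by SGSR.

```python
# Minimal sketch of feature-level deformation (illustrative names only).
import numpy as np

def deform_training_pair(roi_feats, caption_tokens, similar_class,
                         novel_class, novel_feat):
    """roi_feats      : dict mapping object class -> RoI feature of the image
       caption_tokens : list of words in the paired caption
       similar_class  : seen class judged similar to the novel class (e.g. 'cup')
       novel_class    : class without paired training captions (e.g. 'bottle')
       novel_feat     : an RoI feature of the novel object from unpaired data
    """
    # Visual deformation: overwrite the similar object's RoI feature
    # with the novel object's feature.
    new_feats = dict(roi_feats)
    new_feats[similar_class] = np.asarray(novel_feat, dtype=np.float32)

    # Text deformation: substitute the class word so the caption now
    # mentions the novel object.
    new_caption = [novel_class if tok == similar_class else tok
                   for tok in caption_tokens]
    return new_feats, new_caption

# Example: "a cup on the table" becomes a training caption for "bottle".
feats = {"cup": np.random.rand(2048), "table": np.random.rand(2048)}
_, caption = deform_training_pair(feats, ["a", "cup", "on", "the", "table"],
                                  similar_class="cup", novel_class="bottle",
                                  novel_feat=np.random.rand(2048))
print(caption)  # ['a', 'bottle', 'on', 'the', 'table']
```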
Results
  • Several experiments are conducted by adding different numbers of augmented instances and applying different strategies, as shown in Tab. 3.
  • The authors use the results generated by the general captioning model on the NOC split as the baseline.
  • Different numbers of augmented feature-text pairs are added into the NOC training split.
  • The MLS strategy means that the authors use the features of the top three nearest objects for the subsequent replacement, whereas when MLS is not used a single most similar object is chosen for each novel object based on human common sense (a sketch of this selection is given after this list).
  • The constrained beam search decoder is an optional strategy at the testing stage
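The selection of nearest objects referred to above can be sketched as follows, under one plausible reading of the mis-labelled probability strategy: regions of the novel object are classified by a classifier trained only on seen classes, and the three seen classes it most often (mis-)predicts are kept as the nearest concepts. The classifier interface and names are placeholders, not the authors' code.

```python
# Sketch of a mis-labelled-probability style selection (placeholder names).
from collections import Counter

def top3_similar_seen_classes(novel_regions, seen_classifier):
    """novel_regions  : iterable of RoI features cropped around the novel object
       seen_classifier: callable mapping an RoI feature -> predicted seen class
    """
    votes = Counter(seen_classifier(region) for region in novel_regions)
    return [cls for cls, _ in votes.most_common(3)]

# With MLS, all three returned classes serve as replacement targets for the
# deformation; without MLS, a single most-similar class is picked by hand
# (e.g. bottle -> cup, as in Table 1).
```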
Conclusion
  • The authors propose FDM-net combined with the prevailing encoder-decoder framework to tackle the novel object captioning problem.
  • It is a conceptually simple but powerful approach that generates additional training instances on the feature level.
  • The authors' FDM-net aims to solve the mismatching problem that arises when deformation is performed at the spatial level in vision-language tasks.
  • Extensive experiments demonstrate that the approach achieves state-of-the-art performance on the novel object captioning task.
Tables
  • Table1: Top three similar concepts for each novel class, as decided by the Mis-labeled Probability Strategy. Novel classes: bottle, bus, couch, microwave, pizza, racket, suitcase, zebra; their most similar seen concepts include cup, truck, chair, stove, sandwich, bat and bag
  • Table2: Evaluation of the sentence reconstruction network, where B@1, B@4, M, R, C and S denote BLEU-1, BLEU-4, METEOR, ROUGE, CIDEr-D and SPICE, respectively
  • Table3: Results on the test dataset of the DCC split. The three check-mark columns indicate whether the Mis-Labeled Strategy (MLS), the scene graph sentence reconstruction network (SGSR) and Constrained Beam Search (CBS) are applied, respectively, and No denotes the number of augmented examples per novel object
  • Table4: Evaluating performance with popular methods
  • Table5: Human evaluation results, where ‘coco’ means the test set comes from the MSCOCO dataset and ‘open image’ means it comes from the Open Images dataset
Related work
  • Novel Object Captioning

    General image captioning aims to describe images with sentences. To increase scalability to diversified objects, novel object captioning (Anne Hendricks et al 2016; Lu et al 2018; Wu et al 2018) has recently attracted much attention. However, most proposed methods are architectural in essence: researchers have designed template-based caption models (Lu et al 2018), multi-task models (Anne Hendricks et al 2016) and novel sampling algorithms (Koehn 2016). These structures diverge from the standard image captioning task to varying degrees, which causes poor performance on in-domain scores. Inspired by recent deformation strategies (Chen et al 2019b; 2019a; Satoshi Tsutsui 2019), we design deformation meta-networks for the novel object image captioning task that can be seamlessly integrated into popular encoder-decoder captioning models. However, unlike image deformation methods, we deform RoI features in our proposed feature deformation sub-net.
Funding
  • Acknowledgment: This work was supported in part by the Shanghai Research and Innovation Functional Program under Grant 17DZ2260900, the Shanghai Municipal Science and Technology Major Project (19511120700 and 2018SHZDZX01), and ZJLab
References
  • Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016a. Guided open vocabulary image captioning with constrained beam search. arXiv preprint arXiv:1612.00576.
  • Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016b. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, 382– 398. Springer.
  • Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6077–6086.
  • Anderson, P.; Gould, S.; and Johnson, M. 2018. Partiallysupervised image captioning. In Neural Information Processing Systems.
  • Anne Hendricks, L.; Venugopalan, S.; Rohrbach, M.; Mooney, R.; Saenko, K.; and Darrell, T. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1–10.
  • Banerjee, S., and Lavie, A. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 65–72.
  • Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollar, P.; and Zitnick, C. L. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • Chen, Z.; Fu, Y.; Chen, K.; and Jiang, Y.-G. 2019a. Image block augmentation for one-shot learning. In AAAI.
  • Chen, Z.; Fu, Y.; Wang, Y.-X.; Ma, L.; Liu, W.; and Hebert, M. 2019b. Image deformation meta-networks for one-shot learning. In CVPR.
  • Koehn, P. 2016. Statistical machine translation. arXiv preprint arXiv:1612.00576.
  • Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; Bernstein, M.; and Fei-Fei, L. 2016. Visual genome: Connecting language and vision using crowdsourced dense image annotations. In arXiv preprint arXiv:1602.07332.
  • Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, 740–755. Springer.
  • Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2018. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7219–7228.
  • Marr, D. 1982. Vision: A computational investigation into the human representation and processing of visual information. Cambridge, Massachusetts.
  • Plummer, B. A.; Wang, L.; Cervantes, C. M.; Caicedo, J. C.; Hockenmaier, J.; and Lazebnik, S. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, 2641–2649.
  • Ren, S.; He, K.; Girshick, R. B.; and Sun, J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497.
  • Rom, H.; Uijlings, J.; Popov, S.; Veit, A.; Belongie, S.; Gomes, V.; Gupta, A.; Sun, C.; Chechik, G.; Cai, D.; Feng, Z.; Narayanan, D.; and Murphy, K. 2017. Openimages: A public dataset for large-scale multi-label and multi-class image classification.
  • Tsutsui, S.; Fu, Y.; and Crandall, D. 2019. Meta-reinforced synthetic data for one-shot fine-grained visual recognition. In Neural Information Processing Systems.
  • Tran, K.; He, X.; Zhang, L.; Sun, J.; Carapcea, C.; Thrasher, C.; Buehler, C.; and Sienkiewicz, C. 2016. Rich image captioning in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4566–4575.
  • Venugopalan, S.; Anne Hendricks, L.; Rohrbach, M.; Mooney, R.; Darrell, T.; and Saenko, K. 2017. Captioning images with diverse objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5753–5761.
  • Wu, Y.; Zhu, L.; Jiang, L.; and Yang, Y. 2018. Decoupled novel object captioner. In Proceedings of the 26th ACM international conference on Multimedia, 1029–1037.
  • Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2048–2057.
  • Yang, X.; Tang, K.; Zhang, H.; and Cai, J. 2018. Autoencoding scene graphs for image captioning. arXiv preprint arXiv:1812.08658.
  • Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2017. Incorporating copying mechanism in image captioning for learning novel objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6580–6588.