Storytelling from an Image Stream Using Scene Graphs

National Conference on Artificial Intelligence (AAAI), 2020.


Abstract:

Visual storytelling aims at generating a story from an image stream. Most existing methods tend to represent images directly with extracted high-level features, which is neither intuitive nor easy to interpret. We argue that translating each image into a graph-based semantic representation, i.e., a scene graph, which explicitly encodes...
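To make the graph-based semantic representation concrete, the snippet below is a minimal sketch of how a scene graph for a single image is commonly stored: detected objects with bounding boxes plus (subject, predicate, object) triplets. The class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneGraph:
    """Minimal scene-graph container: detected objects plus relation triplets."""
    objects: List[str]                              # e.g. ["man", "surfboard", "wave"]
    boxes: List[Tuple[float, float, float, float]]  # one (x1, y1, x2, y2) box per object
    relations: List[Tuple[int, str, int]] = field(default_factory=list)
    # each relation is (subject_index, predicate, object_index), e.g. (0, "riding", 1)

# toy example for one image of a stream
graph = SceneGraph(
    objects=["man", "surfboard", "wave"],
    boxes=[(10, 20, 120, 300), (30, 250, 140, 330), (0, 200, 400, 400)],
    relations=[(0, "riding", 1), (1, "on", 2)],
)
```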

Introduction
  • For most people, being shown images and asked to compose a reasonable story about them is not a difficult task.
  • Though recent advances in deep neural networks have achieved encouraging results, it is still nontrivial for a machine to summarize the meaning of the images and generate a narrative story.
  • Different from image captioning (Karpathy and Fei-Fei 2015; Vinyals et al. 2017; Yao et al. 2018; Fan et al. 2019), which aims at generating a literal description for a single image, visual storytelling is more challenging: it further investigates a machine's capability to understand a sequence of images and generate a coherent story with multiple sentences
Highlights
  • For most people, being shown images and asked to compose a reasonable story about them is not a difficult task
  • We observe the images in order and reason about the relationships among them. Taking this idea as motivation, we propose a novel graph-based architecture named SGVST for visual storytelling, which first translates each image into a graph-based semantic representation, i.e., a scene graph, and models relationships at the within-image level and the cross-images level, as shown in Figure 1
  • We propose a framework based on scene graphs that enriches fine-grained representations by modeling visual relationships through a Graph Convolution Network at the within-image level and through a Temporal Convolution Network at the cross-images level (a minimal sketch of this two-level design follows this list)
  • The results indicate that our proposed SGVST model achieves superior performance over other state-of-the-art models optimized with MLE and RL, which directly demonstrates that our graph-based model helps story generation
  • We propose a novel graph-based method named SGVST for visual storytelling, which parses images to scene graphs, and models the relationships on scene graphs at two levels, i.e., within-image and cross-images levels
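The highlights above only name the two levels, so the following PyTorch sketch shows one plausible realization: a single round of message passing over (subject, predicate, object) triplets within an image, followed by a 1-D temporal convolution across the per-image embeddings of the photo stream. Layer sizes, the triplet-fusion MLP, and the class names TripletGCN and CrossImageTCN are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TripletGCN(nn.Module):
    """Within-image level: one round of message passing over scene-graph triplets."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())

    def forward(self, obj_feats, rel_feats, edges):
        # obj_feats: (num_objects, dim), rel_feats: (num_relations, dim)
        # edges: (num_relations, 2) long tensor of (subject_idx, object_idx)
        subj, obj = obj_feats[edges[:, 0]], obj_feats[edges[:, 1]]
        msg = self.fuse(torch.cat([subj, rel_feats, obj], dim=-1))
        out = obj_feats.clone()
        out.index_add_(0, edges[:, 0], msg)  # send messages back to subject nodes
        out.index_add_(0, edges[:, 1], msg)  # and to object nodes
        return out

class CrossImageTCN(nn.Module):
    """Cross-images level: temporal convolution over per-image graph embeddings."""
    def __init__(self, dim=512, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, image_embeds):
        # image_embeds: (num_images, dim) -> Conv1d expects (batch, dim, num_images)
        x = image_embeds.t().unsqueeze(0)
        return self.conv(x).squeeze(0).t()

# toy usage: one image with 4 objects and 3 relations, then a 5-image story
obj_updated = TripletGCN()(torch.randn(4, 512), torch.randn(3, 512),
                           torch.tensor([[0, 1], [1, 2], [2, 3]]))
story_ctx = CrossImageTCN()(torch.randn(5, 512))
```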
Methods
  • (1) Pairwise Comparison: In pairwise comparison, the workers are asked to compare two stories generated by the corresponding methods and choose the one that is more human-like and descriptive.
  • The workers are asked to rate the quality of each story by indicating how much they agree or disagree with each question, on a scale of 1–5 (a toy aggregation example follows this list).
  • The reported scores show that the SGVST model performs best on all six characteristics, which further indicates that the stories generated by the model are more informative and of higher quality
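As a hedged illustration of how such judgments are often aggregated (the raw numbers below are invented and the procedure is a guess, not the paper's exact one), pairwise choices reduce to win rates and the agree/disagree answers to per-question means:

```python
from statistics import mean

# hypothetical worker judgments
pairwise = ["SGVST", "AREL", "SGVST", "Tie", "SGVST"]       # choice per story pair
likert = {"relevance": [5, 4, 4], "coherence": [4, 4, 5]}   # 1-5 ratings per question

win_rates = {m: pairwise.count(m) / len(pairwise) for m in set(pairwise)}
avg_scores = {q: round(mean(scores), 2) for q, scores in likert.items()}
print(win_rates)   # e.g. {'SGVST': 0.6, 'AREL': 0.2, 'Tie': 0.2}
print(avg_scores)  # e.g. {'relevance': 4.33, 'coherence': 4.33}
```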
Results
  • Figure 4 shows some examples with an image stream, scene graphs, the ground-truth story, and the stories generated by three approaches, i.e., seq2seq, AREL, and SGVST, where seq2seq (Huang et al. 2016) is implemented by the authors and AREL (Wang et al. 2018b) is trained and evaluated using its publicly available code.
  • The authors randomly select 150 stories, each evaluated by 3 workers
Conclusion
  • The authors propose a novel graph-based method named SGVST for visual storytelling, which parses images to scene graphs, and models the relationships on scene graphs at two levels, i.e., within-image and cross-images levels.
  • The authors plan to extend the method to other multi-modal tasks, e.g., video captioning
Tables
  • Table 1: Overall performance of story generation on the VIST dataset for different models in terms of BLEU (B), METEOR (M), ROUGE-L (R), and CIDEr (C) (a generic metric-computation sketch follows this list)
  • Table 2: Human evaluation results. Workers on AMT rate the quality of the story by indicating how much they agree or disagree with each question, on a scale of 1–5
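For reference, scores like those in Table 1 are commonly computed with the pycocoevalcap toolkit; the snippet below is a generic sketch under that assumption (the story IDs and texts are made up, and the paper's own evaluation script may differ).

```python
# pip install pycocoevalcap  (METEOR additionally requires Java)
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

# references and generated stories, keyed by a made-up story id
gts = {"story_1": ["the family went to the beach . the kids played in the sand ."]}
res = {"story_1": ["the family had a great day at the beach ."]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # Bleu returns a list of BLEU-1..4, the others a single float
```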
Funding
  • This work is partially supported by National Natural Science Foundation of China (No. 61751201, No. 61702106) and Science and Technology Commission of Shanghai Municipality Grant (No. 18DZ1201000, No. 17JC1420200, No. 16JC1420401)
References
  • Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271.
  • Banerjee, S., and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, 65–72.
  • Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  • Fan, Z.; Wei, Z.; Li, P.; Lan, Y.; and Huang, X. 2018a. A question type driven framework to diversify visual question generation. In IJCAI, 4048–4054.
  • Fan, Z.; Wei, Z.; Wang, S.; Liu, Y.; and Huang, X.-J. 2018b. A reinforcement learning framework for natural question generation using bi-discriminators. In COLING, 1763–1774.
  • Fan, Z.; Wei, Z.; Wang, S.; and Huang, X.-J. 2019. Bridging by word: Image grounded vocabulary construction for visual captioning. In ACL, 6514–6524.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 1026–1034.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • Huang, T.-H. K.; Ferraro, F.; Mostafazadeh, N.; Misra, I.; Agrawal, A.; Devlin, J.; Girshick, R.; He, X.; Kohli, P.; Batra, D.; et al. 2016. Visual storytelling. In NAACL, 1233–1239.
  • Huang, Q.; Gan, Z.; Celikyilmaz, A.; Wu, D.; Wang, J.; and He, X. 2019. Hierarchically structured reinforcement learning for topically coherent visual story generation. In AAAI, 8465–8472.
  • Johnson, J.; Krishna, R.; Stark, M.; Li, L.-J.; Shamma, D.; Bernstein, M.; and Fei-Fei, L. 2015. Image retrieval using scene graphs. In CVPR, 3668–3678.
  • Johnson, J.; Gupta, A.; and Fei-Fei, L. 2018. Image generation from scene graphs. In CVPR, 1219–1228.
  • Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR, 3128–3137.
  • Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123(1):32–73.
  • Li, Y.; Ouyang, W.; Bolei, Z.; Jianping, S.; Chao, Z.; and Wang, X. 2018. Factorizable Net: An efficient subgraph-based framework for scene graph generation. In ECCV, 346–363.
  • Lin, C.-Y., and Och, F. J. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, 605.
  • Liu, Y.; Fu, J.; Mei, T.; and Chen, C. W. 2017. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI, 1445–1452.
  • Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR, 3431–3440.
  • Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2018. Neural baby talk. In CVPR, 7219–7228.
  • Modi, Y., and Parde, N. 2019. The steep road to happily ever after: An analysis of current visual storytelling models. In NAACL Workshop on SiVL, 47–57.
  • Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, 311–318.
  • Park, C. C., and Kim, G. 2015. Expressing an image stream with a sequence of natural sentences. In NIPS, 73–81.
  • Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 91–99.
  • Rocktaschel, T.; Grefenstette, E.; Hermann, K. M.; Kocisky, T.; and Blunsom, P. 2015. Reasoning about entailment with neural attention. CoRR abs/1509.06664.
  • Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. In CVPR, 4566–4575.
  • Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2017. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. PAMI 39(4):652–663.
  • Wang, J.; Fu, J.; Tang, J.; Li, Z.; and Mei, T. 2018a. Show, reward and tell: Automatic generation of narrative paragraph from photo stream by adversarial training. In AAAI, 7396–7403.
  • Wang, X.; Chen, W.; Wang, Y.-F.; and Wang, W. Y. 2018b. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL, 899–909.
  • Wang, B.; Ma, L.; Zhang, W.; Jiang, W.; and Zhang, F. 2019. Hierarchical photo-scene encoder for album storytelling. In AAAI, 8909–8916.
  • Xu, D.; Zhu, Y.; Choy, C.; and Fei-Fei, L. 2017. Scene graph generation by iterative message passing. In CVPR.
  • Yang, X.; Tang, K.; Zhang, H.; and Cai, J. 2019. Auto-encoding scene graphs for image captioning. In CVPR, 10685–10694.
  • Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2018. Exploring visual relationship for image captioning. In ECCV, 684–699.
  • Yu, L.; Bansal, M.; and Berg, T. 2017. Hierarchically-attentive RNN for album summarization and storytelling. In EMNLP, 966–971.
  • Zellers, R.; Yatskar, M.; Thomson, S.; and Choi, Y. 2018. Neural motifs: Scene graph parsing with global context. In CVPR.