Storytelling from an Image Stream Using Scene Graphs
AAAI Conference on Artificial Intelligence (AAAI), 2020.
Abstract:
Visual storytelling aims at generating a story from an image stream. Most existing methods tend to represent images directly with the extracted high-level features, which is not intuitive and difficult to interpret. We argue that translating each image into a graph-based semantic representation, i.e., scene graph, which explicitly encodes…
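To make the scene-graph representation mentioned in the abstract concrete, here is a minimal, hypothetical Python illustration: objects are nodes and (subject, predicate, object) triplets are edges. The labels and relations below are invented examples, not taken from the paper's data.

```python
# A toy data structure for a graph-based semantic representation of an image:
# object labels as nodes, (subject, predicate, object) triplets as edges.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneGraph:
    objects: List[str]                      # detected object labels (nodes)
    relations: List[Tuple[int, str, int]]   # (subject_idx, predicate, object_idx) edges


graph = SceneGraph(
    objects=["man", "dog", "frisbee"],
    relations=[(0, "throwing", 2), (1, "chasing", 2)],
)
print(graph.relations)  # [(0, 'throwing', 2), (1, 'chasing', 2)]
```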
Introduction
- For most people, being shown a set of images and asked to compose a reasonable story about them is not a difficult task.
- Though recent advances in deep neural networks have achieved encouraging results, it is still nontrivial for a machine to summarize the meaning of images and generate a narrative story.
- Different from image captioning (Karpathy and Fei-Fei 2015; Vinyals et al. 2017; Yao et al. 2018; Fan et al. 2019), which aims at generating a literal description for a single image, visual storytelling is more challenging: it further investigates a machine's capabilities of understanding a sequence of images and generating a coherent story with multiple sentences.
Highlights
- For most people, being shown a set of images and asked to compose a reasonable story about them is not a difficult task.
- We observe the images in order and reason about the relationships among them. Taking this idea as motivation, we propose a novel graph-based architecture named SGVST for visual storytelling, which first translates each image into a graph-based semantic representation, i.e., a scene graph, and then models relationships at the within-image level and the cross-images level, as shown in Figure 1.
- We propose a framework based on scene graphs that enriches fine-grained representations by modeling visual relationships through a Graph Convolution Network at the within-image level and through a Temporal Convolution Network at the cross-images level (a minimal sketch of this two-level design follows this list).
- The results indicate that our proposed SGVST model achieves superior performance over other state-of-the-art models optimized with MLE and RL, which directly demonstrates that our graph-based model benefits story generation.
- We propose a novel graph-based method named SGVST for visual storytelling, which parses images into scene graphs and models the relationships on scene graphs at two levels, i.e., the within-image and cross-images levels.
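Below is a minimal PyTorch-style sketch of the two-level relational modeling described above. It is not the released implementation: the GCN step is simplified to an adjacency-matrix aggregation rather than typed scene-graph edges, the story decoder is omitted, and module names such as WithinImageGCN and CrossImageTCN are illustrative.

```python
import torch
import torch.nn as nn


class WithinImageGCN(nn.Module):
    """One graph-convolution layer over the object nodes of a single image's scene graph."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (num_objects, dim); adj: (num_objects, num_objects)
        agg = adj @ node_feats                 # aggregate neighbor features
        return torch.relu(self.linear(agg))    # transform + nonlinearity


class CrossImageTCN(nn.Module):
    """1-D temporal convolution over the sequence of per-image features."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, image_feats):
        # image_feats: (num_images, dim); Conv1d expects (batch, channels, length)
        x = image_feats.t().unsqueeze(0)
        return torch.relu(self.conv(x)).squeeze(0).t()


# Toy usage: 5 images, each with 4 detected objects of dimension 512.
dim, num_images, num_objects = 512, 5, 4
gcn, tcn = WithinImageGCN(dim), CrossImageTCN(dim)
per_image = []
for _ in range(num_images):
    nodes = torch.randn(num_objects, dim)      # object features from a detector
    adj = torch.eye(num_objects)               # placeholder scene-graph edges
    refined = gcn(nodes, adj)                  # within-image relational reasoning
    per_image.append(refined.mean(dim=0))      # pool objects into one image vector
story_context = tcn(torch.stack(per_image))    # cross-images temporal reasoning
print(story_context.shape)                     # torch.Size([5, 512])
```

The refined per-image features would then condition a sentence decoder, one sentence per image, to produce the multi-sentence story.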
Methods
- (1) Pairwise Comparison: the workers are asked to compare two stories generated by the corresponding methods and choose the one that is more human-like and descriptive.
- The workers are also asked to rate the quality of each story by indicating how much they agree or disagree with each question, on a scale of 1 to 5 (a toy aggregation of both protocols is sketched after this list).
- The reported scores show that the SGVST model outperforms the other methods on all six criteria, which further proves that the stories generated by the model are more informative and of higher quality.
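A toy sketch of how the two human-evaluation protocols could be aggregated; the vote and rating values below are invented placeholders, not the paper's results.

```python
from collections import Counter
from statistics import mean

# Pairwise comparison: each record is the option a worker preferred.
pairwise_votes = ["SGVST", "AREL", "SGVST", "Tie", "SGVST"]
counts = Counter(pairwise_votes)
for choice, n in counts.items():
    print(f"{choice}: {100 * n / len(pairwise_votes):.1f}%")

# Likert ratings: per-criterion scores on a 1-5 scale, averaged per criterion.
ratings = {"Relevance": [4, 5, 4], "Expressiveness": [3, 4, 4]}
for criterion, scores in ratings.items():
    print(f"{criterion}: mean = {mean(scores):.2f}")
```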
Results
- Figure 4 shows some examples with an image stream, scene graphs, the ground-truth story, and the stories generated by three approaches, i.e., seq2seq, AREL, and SGVST, where the seq2seq baseline (Huang et al. 2016) is implemented by the authors and AREL (Wang et al. 2018b) is trained and evaluated according to its publicly available code.
- The authors randomly select 150 stories, each evaluated by 3 workers.
Conclusion
- The authors propose a novel graph-based method named SGVST for visual storytelling, which parses images into scene graphs and models the relationships on scene graphs at two levels, i.e., the within-image and cross-images levels.
- The authors plan to extend the method to other multi-modal tasks, e.g., video captioning.
Tables
- Table 1: Overall performance of story generation on the VIST dataset for different models in terms of BLEU (B), METEOR (M), ROUGE-L (R), and CIDEr (C) (a BLEU scoring sketch follows this list)
- Table 2: Human evaluation results. Workers on AMT rate the quality of the story by indicating how much they agree or disagree with each question, on a scale of 1 to 5
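As an illustration of the automatic metrics listed in Table 1, here is a small sketch that scores a candidate sentence against a reference with BLEU-1 through BLEU-4 using NLTK. The paper's evaluation toolkit and preprocessing may differ, so these numbers are not comparable to Table 1; the sentences are invented examples.

```python
# Smoothing avoids zero scores when a higher-order n-gram has no match.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the family spent a fun day at the beach .".split()
candidate = "the family had a great day at the beach .".split()

smooth = SmoothingFunction().method1
for n in (1, 2, 3, 4):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu([reference], candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```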
Related work
- Many works focus on vision-to-language tasks, e.g., VQA (Fan et al. 2018a; 2018b) and image captioning. Some earlier works (Karpathy and Fei-Fei 2015; Vinyals et al. 2017) propose CNN-RNN frameworks for image captioning. Further, some works (Yao et al. 2018; Lu et al. 2018) explore visual relationships for image captioning. Different from image captioning, visual storytelling aims at generating a narrative story from an image stream. The pioneering work was done by Park and Kim (2015), and Huang et al. (2016) introduced the VIST dataset for visual storytelling.
Funding
- This work is partially supported by the National Natural Science Foundation of China (No. 61751201, No. 61702106) and the Science and Technology Commission of Shanghai Municipality Grants (No. 18DZ1201000, No. 17JC1420200, No. 16JC1420401)
References
- Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
- Banerjee, S., and Lavie, A. 2005. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL workshop, 65–72.
- Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Fan, Z.; Wei, Z.; Li, P.; Lan, Y.; and Huang, X. 2018a. A question type driven framework to diversify visual question generation. In IJCAI, 4048–4054.
- Fan, Z.; Wei, Z.; Wang, S.; Liu, Y.; and Huang, X.-J. 2018b. A reinforcement learning framework for natural question generation using bi-discriminators. In COLING, 1763–1774.
- Fan, Z.; Wei, Z.; Wang, S.; and Huang, X.-J. 2019. Bridging by word: Image grounded vocabulary construction for visual captioning. In ACL, 6514–6524.
- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 1026–1034.
- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
- Huang, T.-H. K.; Ferraro, F.; Mostafazadeh, N.; Misra, I.; Agrawal, A.; Devlin, J.; Girshick, R.; He, X.; Kohli, P.; Batra, D.; et al. 2016. Visual storytelling. In NAACL, 1233–1239.
- Huang, Q.; Gan, Z.; Celikyilmaz, A.; Wu, D.; Wang, J.; and He, X. 2019. Hierarchically structured reinforcement learning for topically coherent visual story generation. In AAAI, 8465–8472.
- Johnson, J.; Krishna, R.; Stark, M.; Li, L.-J.; Shamma, D.; Bernstein, M.; and Fei-Fei, L. 2015. Image retrieval using scene graphs. In CVPR, 3668–3678.
- Johnson, J.; Gupta, A.; and Fei-Fei, L. 2018. Image generation from scene graphs. In CVPR, 1219–1228.
- Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR, 3128–3137.
- Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123(1):32–73.
- Li, Y.; Ouyang, W.; Bolei, Z.; Jianping, S.; Chao, Z.; and Wang, X. 2018. Factorizable net: An efficient subgraph-based framework for scene graph generation. In ECCV, 346–363.
- Lin, C.-Y., and Och, F. J. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, 605.
- Liu, Y.; Fu, J.; Mei, T.; and Chen, C. W. 2017. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI, 1445–1452.
- Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR, 3431–3440.
- Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2018. Neural baby talk. In CVPR, 7219–7228.
- Modi, Y., and Parde, N. 2019. The steep road to happily ever after: An analysis of current visual storytelling models. In NAACL Workshop on SiVL, 47–57.
- Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, 311–318.
- Park, C. C., and Kim, G. 2015. Expressing an image stream with a sequence of natural sentences. In NIPS, 73–81.
- Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 91–99.
- Rocktäschel, T.; Grefenstette, E.; Hermann, K. M.; Kocisky, T.; and Blunsom, P. 2015. Reasoning about entailment with neural attention. CoRR abs/1509.06664.
- Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In CVPR, 4566–4575.
- Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2017. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. PAMI 39(4):652–663.
- Wang, J.; Fu, J.; Tang, J.; Li, Z.; and Mei, T. 2018a. Show, reward and tell: Automatic generation of narrative paragraph from photo stream by adversarial training. In AAAI, 7396–7403.
- Wang, X.; Chen, W.; Wang, Y.-F.; and Wang, W. Y. 2018b. No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling. In ACL, 899–909.
- Wang, B.; Ma, L.; Zhang, W.; Jiang, W.; and Zhang, F. 2019. Hierarchical photo-scene encoder for album storytelling. In AAAI, 8909–8916.
- Xu, D.; Zhu, Y.; Choy, C.; and Fei-Fei, L. 2017. Scene graph generation by iterative message passing. In CVPR.
- Yang, X.; Tang, K.; Zhang, H.; and Cai, J. 2019. Autoencoding scene graphs for image captioning. In CVPR, 10685–10694.
- Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2018. Exploring visual relationship for image captioning. In ECCV, 684–699.
- Yu, L.; Bansal, M.; and Berg, T. 2017. Hierarchically-attentive RNN for album summarization and storytelling. In EMNLP, 966–971.
- Zellers, R.; Yatskar, M.; Thomson, S.; and Choi, Y. 2018. Neural motifs: Scene graph parsing with global context. In CVPR.