Object Counts! Bringing Explicit Detections Back into Image Captioning.

North American Chapter of the Association for Computational Linguistics (2018)

Abstract
The use of explicit object detectors as an intermediate step to image captioning – which used to constitute an essential stage in early work – is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that encoding the frequency, size and position of objects are complementary and all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways towards image captioning.
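
To make the cues named in the abstract concrete, the following is a minimal sketch of one way to turn object detections into a fixed-length vector that encodes frequency, size and position per category. It is an illustration under stated assumptions, not the authors' implementation: the category list, the `(label, box)` detection format, the function name `encode_detections`, and the choice of four features per category are all hypothetical.

```python
# Hypothetical category vocabulary; the category set used in the paper is not given here.
CATEGORIES = ["person", "dog", "frisbee"]


def encode_detections(detections, image_width, image_height, categories=CATEGORIES):
    """Encode detections as per-category [count, relative area, mean x-centre, mean y-centre].

    `detections` is assumed to be a list of (label, (x_min, y_min, x_max, y_max))
    tuples in pixel coordinates. Labels outside `categories` are ignored.
    """
    stats = {c: {"count": 0, "area": 0.0, "cx": 0.0, "cy": 0.0} for c in categories}
    image_area = float(image_width * image_height)

    for label, (x_min, y_min, x_max, y_max) in detections:
        if label not in stats:
            continue
        s = stats[label]
        s["count"] += 1                                               # frequency cue
        s["area"] += (x_max - x_min) * (y_max - y_min) / image_area   # size cue
        s["cx"] += (x_min + x_max) / 2.0 / image_width                # position cue (x)
        s["cy"] += (y_min + y_max) / 2.0 / image_height               # position cue (y)

    features = []
    for c in categories:
        s = stats[c]
        n = s["count"]
        features.extend([
            float(n),
            s["area"],                     # summed area as a fraction of the image
            s["cx"] / n if n else 0.0,     # mean normalised x-centre
            s["cy"] / n if n else 0.0,     # mean normalised y-centre
        ])
    return features
```

A vector of this kind could stand in for the mid-level image embedding when probing which cues matter; dropping the size or position entries would leave a plain object-count representation.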