Object Counts! Bringing Explicit Detections Back into Image Captioning.
North American Chapter of the Association for Computational Linguistics (2018)
Abstract
The use of explicit object detectors as an intermediate step to image captioning – which used to constitute an essential stage in early work – is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that encoding the frequency, size and position of objects are complementary and all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways towards image captioning.
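To make the idea of encoding object frequency, size and position concrete, here is a minimal sketch of one possible detection-based image representation. It is an illustration under stated assumptions, not the encoding used in the paper: the abstract does not specify one. The category list, the `encode_detections` function name, the normalized `(x, y, w, h)` box format, and the choice of four features per category are all hypothetical.

```python
from collections import defaultdict

# Hypothetical category vocabulary; a real detector would define its own.
CATEGORIES = ["person", "dog", "frisbee"]


def encode_detections(detections, categories=CATEGORIES):
    """Build a flat feature vector from object detections.

    Each detection is (category, x, y, w, h), where (x, y) is the top-left
    corner of the box and (w, h) its size, all normalized to [0, 1] by the
    image dimensions (an assumed input format).

    Per category, the vector holds four values:
    count (frequency), largest box area (size),
    and mean box-centre x and y (position).
    """
    grouped = defaultdict(list)
    for category, x, y, w, h in detections:
        grouped[category].append((x, y, w, h))

    features = []
    for category in categories:
        boxes = grouped.get(category, [])
        if not boxes:
            # Category absent from the image: all-zero slot.
            features.extend([0.0, 0.0, 0.0, 0.0])
            continue
        count = float(len(boxes))
        max_area = max(w * h for _, _, w, h in boxes)
        mean_cx = sum(x + w / 2 for x, _, w, _ in boxes) / count
        mean_cy = sum(y + h / 2 for _, y, _, h in boxes) / count
        features.extend([count, max_area, mean_cx, mean_cy])
    return features
```

A vector like this is interpretable slot by slot, so dropping the size or position entries for a category is one simple way to probe which cue a downstream captioning model depends on.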