Object Counts! Bringing Explicit Detections Back into Image Captioning.

Josiah Wang, Pranava Swaroop Madhyastha, Lucia Specia

North American Chapter of the Association for Computational Linguistics (2018)

Abstract

The use of explicit object detectors as an intermediate step to image captioning – which used to constitute an essential stage in early work – is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that encoding the frequency, size and position of objects are complementary and all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways towards image captioning.
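
To make the cues named in the abstract concrete, the following is a minimal sketch of how frequency, size and position might be derived from a list of object detections. It is an illustration under stated assumptions, not the authors' implementation: the detection tuple format, the function name `encode_detections`, and the choice to describe each category by its largest box are all hypothetical.

```python
# Hypothetical detection format (an assumption, not from the paper):
# (category, x_min, y_min, x_max, y_max), with coordinates in pixels.

def encode_detections(detections, categories, img_w, img_h):
    """Return per-category frequency, size and position cues.

    freq[i]: number of detected instances of categories[i]
    size[i]: area of the largest instance, as a fraction of the image area
    pos[i]:  (x, y) centre of that largest instance, normalised to [0, 1]
    """
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    freq = [0] * n
    size = [0.0] * n
    pos = [(0.0, 0.0)] * n
    for cat, x0, y0, x1, y1 in detections:
        if cat not in index:
            continue  # ignore categories outside the chosen vocabulary
        i = index[cat]
        freq[i] += 1
        area = ((x1 - x0) * (y1 - y0)) / (img_w * img_h)
        if area > size[i]:
            # keep the size and centre of the largest box seen so far
            size[i] = area
            pos[i] = (((x0 + x1) / 2) / img_w, ((y0 + y1) / 2) / img_h)
    return freq, size, pos


detections = [
    ("dog", 0, 0, 50, 50),
    ("dog", 0, 0, 100, 100),
    ("cat", 100, 100, 200, 200),
]
freq, size, pos = encode_detections(
    detections, ["cat", "dog", "person"], img_w=200, img_h=200
)
```

Concatenating the three lists would give one fixed-length, interpretable vector per image, in which each dimension can be traced back to a specific object category.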
