Exploiting Image Representations Learned from Sentences

user-5f03edee4c775ed682ef5237(2014)

Abstract
A recent focus of computer vision has been on leveraging textual data to improve the quality of image features. An important motivation is that, for the same object, the number of possible variations in the visual domain is exponentially higher than the number of variations in the text domain. For example, Figure 1 shows three different kinds of chairs, and at the pixel level the images have little in common. However, the word “chair” is an adequate linguistic representation of all three images. One could argue that alternative words such as “seat”, “sofa” or “couch” could be used instead. Even so, the number of synonyms for the word “chair” is far smaller than the number of different chair images, once we take into account the appearance of the chair itself, lighting, viewpoint and other perturbations.