Dense Models from Videos: Can YouTube be the Font of All Knowledge Bases?

ICMR(2015)

Abstract
Many recent advances in computer science have been driven by the convergent availability of large amounts of data and of fast machines on which to analyze them. This availability has enabled us to acquire implicit partial models of the underlying generators for the data and apply those models to tasks such as translation, transcription, and image captioning. To date, though, few if any of these models have been dense, in the sense of thoroughly modelling some aspect of the world in a way that can facilitate any relevant task. Dense models should support: a) Prediction: What might happen next in this situation, or what might be true in the vicinity? b) Interpolation: What may have happened between these situations? What might be located between these things? c) Causal reasoning: Why did this happen? d) Purpose reasoning: What is this configuration of things for? For what purpose is that happening? e) Task performance: The model should be able to aid (e.g.) a robot performing a domain task. f) Explanation: The model should be at a level that supports communication. In short, a dense model is the sort of model, including both implicit and explicit components, that humans form about aspects of their worlds: aspects like meetings, plants, lawnmowers, rivers and kitchens. These models support pretty much any kind of relevant reasoning. These are also the sorts of models that builders of large-scale "commonsense" knowledge bases have been working to construct. But, to date, although some such knowledge bases support particular instances of each kind of reasoning task, they do not approach doing so comprehensively, even within quite narrow domains. Although some work is being done on automating KB construction, this generally aims at breadth rather than density.
Similarly, although machine vision and NLP researchers have long discussed the potential use of background knowledge in scene and text understanding, demonstrating that utility in any general way has been hampered by the vast incompleteness of available KBs. The time is ripe for a 5-10 year AI challenge problem in the production of dense models directly from data. As a particular example, kitchens are somewhat limited in complexity, from a human point of view, and are densely modelled by most humans; we are not frequently surprised by what we find in a kitchen, or by what happens there. And we are not lacking for data; there are more than 6 million YouTube hits for "kitchen" and around 5 million for "cooking". If each were a mere 1 minute long, this would represent 22 years of kitchen video. Dull perhaps, but also, presumably, enough grist for building a very dense model. The proposed challenge is this: to have computers automatically build, from just the vast amount of video found on the web, a sufficiently dense local world model to enable that video to be thoroughly understood for prediction, interpolation, explanation and other tasks.