
Video Retrieval for Everyday Scenes With Common Objects

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval (2023)

We propose a video retrieval system for everyday scenes with common objects. Our system applies natural language processing (NLP) to the predictions made by deep neural networks for image understanding tasks. It aims to capture the relationships between objects in a video scene as well as the ordering of the matching scenes. For each video in the database, it identifies and generates a sequence of key scene images. For each such scene, it generates the most probable captions using state-of-the-art models for image captioning. The captions are parsed and represented as tree structures using NLP techniques. These trees are then stored and indexed in a database system. When a user poses a query video, a sequence of key scenes is generated. For each scene, its caption is generated using deep learning and parsed into its corresponding tree structure. After that, optimized tree-pattern queries are constructed and executed on the database to retrieve a set of candidate videos. Finally, these candidate videos are ranked using a combination of the longest common subsequence of scene matches and the tree-edit distance between parse trees. We evaluated the performance of our system on the MSR-VTT dataset, which contains everyday scenes. We observed that our system achieved higher mean average precision (mAP) than two recent techniques, namely CSQ and DnS.
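The ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two signals are combined, so the weighted sum, the `alpha` weight, the match `threshold`, and all function names here are assumptions. Each scene's parse tree is also reduced to a set of node labels, and a Jaccard distance stands in for a real tree-edit distance, which could be passed in through the `dist` parameter instead.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Toy stand-in: a scene's parse tree reduced to its set of node labels.
Scene = FrozenSet[str]
Distance = Callable[[Scene, Scene], float]


def toy_distance(a: Scene, b: Scene) -> float:
    """Jaccard distance in [0, 1]; a placeholder for tree-edit distance."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def lcs_matches(query: Sequence[Scene], candidate: Sequence[Scene],
                dist: Distance, threshold: float) -> List[Tuple[int, int]]:
    """Longest order-preserving list of (query, candidate) scene index pairs.

    Two scenes count as a match when their distance is at most `threshold`.
    """
    n, m = len(query), len(candidate)
    # table[i][j] = LCS length of query[i:] and candidate[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = max(table[i + 1][j], table[i][j + 1])
            if dist(query[i], candidate[j]) <= threshold:
                best = max(best, 1 + table[i + 1][j + 1])
            table[i][j] = best
    # Walk the table forward to recover one optimal alignment.
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n and j < m:
        matched = dist(query[i], candidate[j]) <= threshold
        if matched and table[i][j] == 1 + table[i + 1][j + 1]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def rank_score(query: Sequence[Scene], candidate: Sequence[Scene],
               dist: Distance = toy_distance, threshold: float = 0.5,
               alpha: float = 0.5) -> float:
    """Assumed combination: weighted sum of ordering and closeness terms."""
    if not query:
        return 0.0
    pairs = lcs_matches(query, candidate, dist, threshold)
    if not pairs:
        return 0.0
    order_term = len(pairs) / len(query)  # share of query scenes matched in order
    mean_dist = sum(dist(query[i], candidate[j]) for i, j in pairs) / len(pairs)
    return alpha * order_term + (1.0 - alpha) * (1.0 - mean_dist)


query = [frozenset({"dog", "run", "park"}), frozenset({"man", "throw", "ball"})]
video_a = list(query)            # same scenes, same order
video_b = list(reversed(query))  # same scenes, reversed order
video_c = [frozenset({"cat", "sleep"})]

ranked = sorted(
    {"video_a": video_a, "video_b": video_b, "video_c": video_c}.items(),
    key=lambda item: rank_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

In this toy setup, a candidate that contains the query scenes in the same order scores higher than one that contains them in reverse order, which is the effect the longest-common-subsequence term is meant to capture.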
Video retrieval, indexing, scene captioning, NLP, XML, ranking