VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

CVPR 2024 (2024)

Abstract

Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we built VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.
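
To make item (2) more concrete, the following is a minimal sketch of how offline temporal annotations might be turned into a streaming dialogue. It is not the authors' data pipeline: the `Annotation` class, the `to_streaming_dialogue` function, the 2 FPS sampling rate, and the choice to place each response at the end of its annotated span are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical offline annotation: a caption tied to a time span in seconds."""
    start: float
    end: float
    text: str


def to_streaming_dialogue(annotations, duration, fps=2.0):
    """Interleave per-frame slots with responses placed at each annotation's end.

    Returns a list of (timestamp, role, content) tuples. A frame with no
    annotation ending on it carries no response, so a model trained on this
    format would be expected to stay silent at that frame.
    (Illustrative assumption, not the paper's actual format.)
    """
    n_frames = int(duration * fps)
    # Map each annotation to the frame index at its end time, clamped to the last frame.
    responses = {}
    for ann in annotations:
        idx = min(int(ann.end * fps), n_frames - 1)
        responses.setdefault(idx, []).append(ann.text)

    dialogue = []
    for i in range(n_frames):
        t = i / fps
        dialogue.append((t, "frame", i))
        for text in responses.get(i, []):
            dialogue.append((t, "assistant", text))
    return dialogue


if __name__ == "__main__":
    anns = [Annotation(0.0, 1.0, "picks up knife"), Annotation(1.0, 2.5, "cuts onion")]
    for turn in to_streaming_dialogue(anns, duration=3.0):
        print(turn)
```

In this sketch most frames have no attached response, which mirrors the streaming setting described in the abstract, where the model sees a continuous stream and speaks only at the relevant moments.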

Key words

Language Model, Online Video, Large Language Models, Narrative, Benchmark, Learning Methods, Forecasting, Video Clips, Continuous Stream, Temporal Alignment, Decoding, Validation Set, Slow Speed, Video Frames, Action Recognition, Training Objective, Video Dataset, Memory Cost, Efficient Inference, Video Understanding, Image Encoder, Spatial Understanding, Action Detection
