VideoLLM-online: Online Video Large Language Model for Streaming Video
CVPR 2024 (2024)
Abstract
Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we built the VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.
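As a rough illustration of item (2), the sketch below shows one plausible way that offline temporal annotations could be turned into per-frame streaming targets. It is not the authors' implementation: the `Annotation` type, the `to_streaming_dialogue` function, and the rule that the assistant responds once at the first sampled frame at or after an annotation's start time (and stays silent otherwise) are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Hypothetical offline annotation: a text label attached to a time span in seconds."""
    start: float
    end: float
    text: str


def to_streaming_dialogue(
    annotations: List[Annotation], duration: float, fps: float
) -> List[Optional[str]]:
    """Return one target per sampled frame: a response string, or None for 'stay silent'.

    Assumed rule (not taken from the paper): each annotation produces a single
    response at the first sampled frame at or after its start time.
    """
    num_frames = int(duration * fps)
    targets: List[Optional[str]] = [None] * num_frames
    for ann in sorted(annotations, key=lambda a: a.start):
        frame = math.ceil(ann.start * fps)
        if 0 <= frame < num_frames:
            if targets[frame] is None:
                targets[frame] = ann.text
            else:
                # Two annotations landed on the same frame: merge their texts.
                targets[frame] = targets[frame] + " " + ann.text
    return targets


# Example: two annotated actions in a 6-second clip sampled at 2 frames per second.
example = to_streaming_dialogue(
    [Annotation(1.0, 3.0, "pick up knife"), Annotation(4.0, 5.0, "cut onion")],
    duration=6.0,
    fps=2.0,
)
```

In this toy layout most frames carry no response, which reflects the general idea of a dialogue aligned to a continuous stream: the model is expected to speak only at the moments the annotations mark.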
Key words
Language Model, Online Video, Large Language Models, Narrative, Benchmark, Learning Methods, Forecasting, Video Clips, Continuous Stream, Temporal Alignment, Decoding, Validation Set, Slow Speed, Video Frames, Action Recognition, Training Objective, Video Dataset, Memory Cost, Efficient Inference, Video Understanding, Image Encoder, Spatial Understanding, Action Detection