Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
arXiv (2024)
Abstract
Transformers have revolutionized computer vision and natural language
processing, but their high computational complexity limits their application in
high-resolution image processing and long-context analysis. This paper
introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the
NLP field with necessary modifications for vision tasks. Similar to the Vision
Transformer (ViT), our model is designed to efficiently handle sparse inputs
and demonstrate robust global processing capabilities, while also scaling up
effectively, accommodating both large-scale parameters and extensive datasets.
Its distinctive advantage lies in its reduced spatial aggregation complexity,
which renders it exceptionally adept at processing high-resolution images
seamlessly, eliminating the necessity for windowing operations. Our evaluations
in image classification demonstrate that VRWKV matches ViT's classification
performance with significantly faster speeds and lower memory usage. In dense
prediction tasks, it outperforms window-based models, maintaining comparable
speeds. These results highlight VRWKV's potential as a more efficient
alternative for visual perception tasks. Code is released at .
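The abstract's central efficiency claim is that spatial aggregation costs grow linearly rather than quadratically in the number of tokens, which is what removes the need for windowing at high resolution. As a hedged illustration only, the sketch below contrasts standard softmax attention (O(N²)) with a generic linear-attention mixer (O(N)); this is not the paper's actual Bi-WKV operator, and all names and the feature map are illustrative assumptions.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard attention: forms an N x N score matrix, so cost is O(N^2).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def linear_attention(q, k, v):
    # Linear-complexity mixing (illustrative stand-in for an RWKV-like
    # operator): summarize K/V once into a (d, d) matrix, then reuse it
    # for every query, avoiding the N x N score matrix entirely.
    phi = lambda x: np.maximum(x, 0) + 1e-6  # positive feature map (assumed)
    kv = phi(k).T @ v             # (d, d) summary, built in O(N)
    z = phi(k).sum(axis=0)        # (d,) normalizer
    return (phi(q) @ kv) / (phi(q) @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 16, 8                      # n tokens (e.g. image patches), d channels
q, k, v = rng.standard_normal((3, n, d))
out = linear_attention(q, k, v)
print(out.shape)                  # same (n, d) shape as softmax attention
```

The point of the sketch is the asymptotic difference: doubling the token count doubles the work of the linear mixer but quadruples the score matrix of softmax attention, which is why window-free high-resolution processing becomes feasible.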