Subobject-level Image Tokenization
CoRR(2024)
摘要
Transformer-based vision models typically tokenize images into fixed-size
square patches as input units, which lacks the adaptability to image content
and overlooks the inherent pixel grouping structure. Inspired by the subword
tokenization widely adopted in language models, we propose an image tokenizer
at a subobject level, where the subobjects are represented by semantically
meaningful image segments obtained by segmentation models (e.g., segment
anything models). To implement a learning system based on subobject
tokenization, we first introduced a Sequence-to-sequence AutoEncoder (SeqAE) to
compress subobject segments of varying sizes and shapes into compact embedding
vectors, then fed the subobject embeddings into a large language model for
vision language learning. Empirical results demonstrated that our
subobject-level tokenization significantly facilitates efficient learning of
translating images into object and attribute descriptions compared to the
traditional patch-level tokenization. Codes and models will be open-sourced at
https://github.com/ChenDelong1999/subobjects.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要