VinVL: Making Visual Representations Matter in Vision-Language Models
Abstract:
This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better-desig...