VersaT2I: Improving Text-to-Image Models with Versatile Reward
arXiv (2024)
Zhejiang University | University of Washington | Hong Kong University of Science and Technology (Guangzhou) | Fudan University
- Pretraining has recently greatly advanced the development of natural language processing (NLP)
- We show that M6 outperforms the baselines in multimodal downstream tasks, and that the large M6 with 10 billion parameters reaches even better performance
- We propose a method called M6 that is able to process information from multiple modalities and perform both single-modal and cross-modal understanding and generation
- The model is scaled to 10 billion parameters with sophisticated deployment, and this 10-billion-parameter M6-large is the largest pretrained model in Chinese
- Experimental results show that our proposed M6 outperforms the baselines on a number of downstream tasks involving both single and multiple modalities. We will continue pretraining extremely large models on more data to explore the limits of their performance

Outrageously Large Neural Networks: the Sparsely-Gated Mixture-of-Experts Layer.
Cited by 2870
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Cited by 1424
Mplug: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections.
Cited by 139
MSP-Former: Multi-Scale Projection Transformer for Single Image Desnowing
Cited by 64
SnowFormer: Context Interaction Transformer with Scale-awareness for Single Image Desnowing
Cited by 8
DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models
Cited by 12
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
Cited by 499
ReVersion: Diffusion-Based Relation Inversion from Images.
Cited by 37
NightHazeFormer: Single Nighttime Haze Removal Using Prior Query Transformer.
Cited by 70
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation
Cited by 75
StyleDrop: Text-to-Image Generation in Any Style
Cited by 82
Holistic Evaluation of Text-To-Image Models
Cited by 25
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Cited by 82
Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now
Cited by 34
VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook
Cited by 12
VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
Cited by 45