A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)
NeurIPS 2023
Abstract
Contrastive Language-Image Pre-training (CLIP) models have demonstrated
remarkable generalization capabilities across multiple challenging distribution
shifts. However, their robustness to variations in specific visual factors
remains largely unexplored. In real-world
applications, reliable and safe systems must consider other safety objectives
beyond classification accuracy, such as predictive uncertainty. Yet, the
effectiveness of CLIP models on such safety objectives is less explored.
Driven by the above, this work comprehensively investigates the safety
objectives of CLIP models, specifically focusing on three key properties:
resilience to visual factor variations, calibrated uncertainty estimations, and
the ability to detect anomalous inputs. To this end, we study 83 CLIP models
and 127 ImageNet classifiers that are diverse in architecture, (pre)training
distribution, and training strategy. We consider 10 visual factors (e.g.,
shape and pattern), 5 types of out-of-distribution data, and 8 natural and
challenging test conditions with different shift types, such as texture, style,
and perturbation shifts. Our study has unveiled several previously unknown
insights into CLIP models. For instance, they are not consistently more
calibrated than other ImageNet models, which contradicts existing findings.
Additionally, our analysis underscores the significance of training source
design by showcasing its profound influence on the three safety-related
properties. We believe our comprehensive study can shed light on and help guide
the development of more robust and reliable CLIP models.
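To make two of the evaluated safety objectives concrete, below is a minimal, self-contained sketch (not the authors' released code) of standard metrics for them: Expected Calibration Error (ECE) for confidence calibration, and the maximum softmax probability (MSP) baseline score for out-of-distribution detection. The function names and the 15-bin default are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import rankdata

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence, then average the per-bin gap
    between accuracy and mean confidence, weighted by bin size.

    probs:  (N, C) softmax probabilities, e.g. from a zero-shot CLIP head.
    labels: (N,) integer ground-truth classes.
    """
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(accuracies[in_bin].mean()
                                       - confidences[in_bin].mean())
    return ece

def msp_ood_auroc(probs_in, probs_out):
    """AUROC of the MSP score for separating in-distribution inputs (which
    should receive a higher max softmax probability) from OOD inputs,
    computed via the rank-based Mann-Whitney U statistic."""
    scores = np.concatenate([probs_in.max(axis=1), probs_out.max(axis=1)])
    n_in, n_out = len(probs_in), len(probs_out)
    ranks = rankdata(scores)  # average ranks handle ties correctly
    u = ranks[:n_in].sum() - n_in * (n_in + 1) / 2
    return u / (n_in * n_out)
```

In a typical zero-shot evaluation, `probs` would come from softmaxed cosine similarities between image embeddings and text embeddings of class prompts; the study additionally varies architectures and (pre)training sources when comparing these metrics across models.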