Improving Dialog Safety using Socially Aware Contrastive Learning
CoRR (2024)
Abstract
State-of-the-art conversational AI systems raise concerns due to their
potential risks of generating unsafe, toxic, unethical, or dangerous content.
Previous works have developed datasets to teach conversational agents the
appropriate social paradigms to respond effectively to specifically designed
hazardous content. However, models trained on these adversarial datasets still
struggle to recognize subtle unsafe situations that arise naturally in
conversations, or they introduce inappropriate responses in casual contexts. To
understand the extent of this problem, we study prosociality in both
adversarial and casual dialog contexts and audit the response quality of
general-purpose language models in terms of propensity to produce unsafe
content. We propose a dual-step fine-tuning process to address these issues
using a socially aware n-pair contrastive loss. Subsequently, we train a base
model that integrates prosocial behavior by leveraging datasets such as the Moral
Integrity Corpus (MIC) and ProsocialDialog. Experimental results on several
dialog datasets demonstrate the effectiveness of our approach in generating
socially appropriate responses.
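
The abstract names an n-pair contrastive loss but does not spell it out. As a rough illustration only, the sketch below computes the generic N-pair loss for a single anchor, log(1 + sum_i exp(a·n_i - a·p)). It is not the paper's socially aware variant. The function name `n_pair_loss` and the reading of the anchor as a dialog-context embedding, the positive as a prosocial response, and the negatives as unsafe responses are assumptions made for this example.

```python
import math


def n_pair_loss(anchor, positive, negatives):
    """Generic N-pair contrastive loss for one anchor embedding.

    Assumption for illustration: `anchor` stands for a dialog context,
    `positive` for a prosocial response, and `negatives` for unsafe
    responses. All are plain lists of floats of equal length.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    pos_score = dot(anchor, positive)
    # Each margin is how much a negative outscores the positive.
    margins = [dot(anchor, neg) - pos_score for neg in negatives]

    # Stable evaluation of log(1 + sum(exp(margin))): the constant 1 is exp(0),
    # so 0.0 is included when picking the shift value.
    shift = max([0.0] + margins)
    total = math.exp(-shift) + sum(math.exp(m - shift) for m in margins)
    return shift + math.log(total)


# The loss is small when the positive scores above every negative,
# and grows when a negative scores above the positive.
good = n_pair_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = n_pair_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy case `good` is lower than `bad`, which is the behavior a contrastive objective would reward: responses treated as prosocial are pulled toward the context and unsafe ones are pushed away.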