Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
arXiv (2023)
Abstract
This paper makes three contributions. First, it presents a generalizable,
novel framework dubbed toxicity rabbit hole that iteratively elicits
toxic content from a wide suite of large language models. Spanning a set of
1,266 identity groups, we first conduct a bias audit of
guardrails, presenting key insights. Next, we report generalizability across
several other models. Through the elicited toxic content, we present a broad
analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia,
homophobia, and transphobia. Finally, driven by concrete examples, we discuss
potential ramifications.