Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
arXiv (2023)
Abstract
This paper makes three contributions. First, it presents a generalizable,
novel framework dubbed toxicity rabbit hole that iteratively elicits
toxic content from a wide suite of large language models. Spanning a set of
1,266 identity groups, we first conduct a bias audit of
guardrails, presenting key insights. Next, we report generalizability across
several other models. Through the elicited toxic content, we present a broad
analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia,
homophobia, and transphobia. Finally, driven by concrete examples, we discuss
potential ramifications.