RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
CoRR(2024)
摘要
Individual neurons participate in the representation of multiple high-level
concepts. To what extent can different interpretability methods successfully
disentangle these roles? To help address this question, we introduce RAVEL
(Resolving Attribute-Value Entanglements in Language Models), a dataset that
enables tightly controlled, quantitative comparisons between a variety of
existing interpretability methods. We use the resulting conceptual framework to
define the new method of Multi-task Distributed Alignment Search (MDAS), which
allows us to find distributed representations satisfying multiple causal
criteria. With Llama2-7B as the target language model, MDAS achieves
state-of-the-art results on RAVEL, demonstrating the importance of going beyond
neuron-level analyses to identify features distributed across activations. We
release our benchmark at https://github.com/explanare/ravel.
更多查看译文
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要