
A Canary in the AI Coal Mine: American Jews May Be Disproportionately Harmed by Intellectual Property Dispossession in Large Language Model Training

Computing Research Repository (CoRR), 2024

Faculty of Computing & Data Sciences | Northwestern University | School of Computing Science

Abstract
Systemic property dispossession from minority groups has often been carried out in the name of technological progress. In this paper, we identify evidence that the current paradigm of large language models (LLMs) likely continues this long history. Examining common LLM training datasets, we find that a disproportionate amount of content authored by Jewish Americans is used for training without their consent. The degree of over-representation ranges from around 2x to around 6.5x. Given that LLMs may substitute for the paid labor of those who produced their training data, they have the potential to cause even more substantial and disproportionate economic harm to Jewish Americans in the coming years. This paper focuses on Jewish Americans as a case study, but it is probable that other minority communities (e.g., Asian Americans, Hindu Americans) may be similarly affected and, most importantly, the results should likely be interpreted as a "canary in the coal mine" that highlights deep structural concerns about the current LLM paradigm whose harms could soon affect nearly everyone. We discuss the implications of these results for the policymakers thinking about how to regulate LLMs as well as for those in the AI field who are working to advance LLMs. Our findings stress the importance of working together towards alternative LLM paradigms that avoid both disparate impacts and widespread societal harms.
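The over-representation figures cited above (roughly 2x to 6.5x) are ratios comparing a group's share of training-corpus content to its share of the general population. The sketch below illustrates that arithmetic on toy data; the function name, document labels, and population shares are assumptions made for illustration and do not reproduce the paper's actual datasets, author-identification procedure, or results.

```python
# Hypothetical illustration only: computing an over-representation ratio for
# author groups in a training corpus, given documents already labeled by
# author group. This is not the paper's pipeline.
from collections import Counter

def over_representation(doc_author_groups, population_shares):
    """Return group -> (share of corpus documents) / (share of population).

    A value of 1.0 means parity; 2.0 means the group's content appears in the
    corpus at twice its population share.
    """
    counts = Counter(doc_author_groups)
    total_docs = sum(counts.values())
    return {
        group: (counts[group] / total_docs) / share
        for group, share in population_shares.items()
        if total_docs > 0 and share > 0
    }

if __name__ == "__main__":
    # Toy numbers, not the paper's data: 12 of 100 documents come from group_a,
    # which makes up 2% of the population -> 6.0x over-representation.
    docs = ["group_a"] * 12 + ["group_b"] * 88
    shares = {"group_a": 0.02, "group_b": 0.98}
    print(over_representation(docs, shares))
```

In practice the difficult step is the author-group labeling itself, which this sketch simply assumes as input.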
Key words
large language models, economic impacts, dataset documentation
Related Papers
OpenAI,Josh Achiam,Steven Adler,Sandhini Agarwal,Lama Ahmad,Ilge Akkaya, Florencia Leoni Aleman,Diogo Almeida,Janko Altenschmidt,Sam Altman, Shyamal Anadkat, Red Avila,
2023

被引用5481 | 浏览

Harry Jiang, Lauren Brown, Jessica Cheng, Anonymous Artist, Mehtab Khan,Abhishek Gupta,Deja Workman,Alex Hanna,Jonathan Flowers,Timnit Gebru
2023

被引用30 | 浏览
