DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Annual Meeting of the Association for Computational Linguistics (2024)
Abstract
Large language models (LLMs) have become a dominant and important tool for NLP researchers in a wide range of tasks. Today, many researchers use LLMs in synthetic data generation, task evaluation, fine-tuning, distillation, and other model-in-the-loop research workflows. However, challenges arise when using these models that stem from their scale, their closed source nature, and the lack of standardized tooling for these new and emerging workflows. The rapid rise to prominence of these models and these unique challenges has had immediate adverse impacts on open science and on the reproducibility of work that uses them. In this paper, we introduce DataDreamer, an open source Python library that allows researchers to write simple code to implement powerful LLM workflows. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science and reproducibility. The library and documentation are available at https://github.com/datadreamer-dev/DataDreamer.