Self-learning Data Foundation for Scientific AI.

Annmary Justine, Sergey Serebryakov, Cong Xu, Aalap Tripathy, Suparna Bhattacharya, Paolo Faraboschi, Martin Foltin

SMC (2022)

Abstract

The "Self-Learning Data Foundation for AI" is an open-source platform for managing Machine Learning (ML) metadata in complex end-to-end pipelines. It includes the intelligence to optimize data gradation, pipeline configuration, and compute performance. The work addresses several challenges: prioritizing data to reduce data movement, tracking lineage to optimize complex ML pipelines, and enabling reproducibility and portability of data selection and ML model development. Off-the-shelf AI metadata management frameworks (such as MLflow or Weights & Biases) focus on fine-grained, stage-level metadata and track only parts of the pipeline and its lineage. Our proposed software layer sits between ML workflows and pipelines on one side and storage and data access on the other. The first implementation of the Data Foundation is the Common Metadata Framework (CMF), which automatically captures and tracks metadata alongside references to data artifacts and application code. Its git-like nature allows parallel model development by different teams and is well suited to federated environments. It includes intelligence to optimize pipelines and storage: it can learn access patterns from pipeline execution to inform optimizations such as prestaging and caching, and it learns from model inference metrics to iteratively build more robust models. Through a data shaping use case for I/O optimization and an active learning use case to reduce labelling (on DeepCam AI model training on climate data running on NERSC Cori), we show the versatility of the data foundation layer, its potential benefits (4x reduction in training time and 2x reduction in labelling effort), and its central role in complex ML pipelines.
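
To make the idea of tracking metadata alongside references to data artifacts and code more concrete, the following is a minimal sketch of content-addressed lineage tracking. It is not CMF's API: `MetadataStore`, `Execution`, `content_hash`, the stage names, and the code version string are all hypothetical names chosen for illustration, and the store is held in memory only.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


def content_hash(data: bytes) -> str:
    """Identify an artifact by its content, similar in spirit to git blobs."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Execution:
    """One run of a pipeline stage (hypothetical record layout)."""
    stage: str
    code_version: str
    inputs: list[str]
    outputs: list[str]
    metrics: dict[str, float] = field(default_factory=dict)


class MetadataStore:
    """In-memory store mapping each artifact hash to the run that produced it."""

    def __init__(self) -> None:
        self.executions: list[Execution] = []
        self.producer: dict[str, Execution] = {}

    def log(self, stage, code_version, inputs, outputs, metrics=None) -> Execution:
        ex = Execution(
            stage,
            code_version,
            [content_hash(d) for d in inputs],
            [content_hash(d) for d in outputs],
            dict(metrics or {}),
        )
        self.executions.append(ex)
        for h in ex.outputs:
            self.producer[h] = ex
        return ex

    def lineage(self, artifact_hash: str) -> list[str]:
        """Return the stages that led to an artifact, most recent first."""
        stages: list[str] = []
        seen: set[int] = set()
        frontier = [artifact_hash]
        while frontier:
            ex = self.producer.get(frontier.pop())
            if ex is None or id(ex) in seen:
                continue  # raw input with no recorded producer, or already visited
            seen.add(id(ex))
            stages.append(ex.stage)
            frontier.extend(ex.inputs)
        return stages


# Placeholder byte strings stand in for real datasets and model files.
raw, shaped, model = b"raw samples", b"shaped samples", b"model weights"
store = MetadataStore()
store.log("data_shaping", "abc123", [raw], [shaped])
store.log("train", "abc123", [shaped], [model])
print(store.lineage(content_hash(model)))  # ['train', 'data_shaping']
```

Keying artifacts by content hash rather than by file path is what lets two teams record runs independently and later merge their records without conflicts, which is one plausible reading of the "git-like" property described in the abstract.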

Keywords

AI metadata, Trustworthy AI, MLOps
