DStore: A Lightweight Scalable Learning Model Repository with Fine-Grained Tensor-Level Access

Proceedings of the 37th International Conference on Supercomputing (ACM ICS 2023), 2023

Abstract
The ability to share and reuse deep learning (DL) models is a key driver that facilitates the rapid adoption of artificial intelligence (AI) in both industrial and scientific applications. However, state-of-the-art approaches to storing and accessing DL models efficiently at scale lag behind. Most often, DL models are serialized using various formats (e.g., HDF5, SavedModel) and stored as files on POSIX file systems. While simple and portable, such an approach exhibits high serialization and I/O overheads, especially under concurrency. Additionally, the emergence of advanced AI techniques (transfer learning, sensitivity analysis, explainability, etc.) introduces the need for fine-grained access to tensors in order to extract and reuse individual tensors or subsets of tensors. Such patterns are underserved by state-of-the-art approaches: requiring tensors to be read in bulk incurs suboptimal performance, scales poorly, and/or overutilizes network bandwidth. In this paper, we propose a lightweight, distributed, RDMA-enabled learning model repository that addresses these challenges. Specifically, we introduce several ideas: a compact architecture graph representation with stable hashing and client-side metadata caching, scalable load balancing across multiple providers, RDMA-optimized data staging, and direct access to raw tensor data. We evaluate our proposal in extensive experiments that involve different access patterns using learning models of diverse shapes and sizes. Our evaluations show a significant improvement (between 2x and 30x) over a variety of state-of-the-art model storage approaches, while scaling to half of the Cooley cluster at the Argonne Leadership Computing Facility.
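To illustrate the kind of fine-grained, tensor-level access the abstract describes, the sketch below shows one plausible way to derive stable per-tensor keys by hashing a model's architecture graph, so a client can request a single layer's tensors instead of deserializing a whole checkpoint. This is a minimal sketch under assumptions, not the paper's implementation: the layer description format, the `stable_layer_key` helper, and the `repository.fetch_tensor` call are all hypothetical names introduced here for illustration.

```python
# Minimal sketch (assumed, not the authors' code): stable hashing of a layer's
# structural description (not its weights) yields a key that every client can
# compute locally from cached metadata, enabling a fine-grained tensor read.
import hashlib
import json

def stable_layer_key(model_name: str, layer_spec: dict) -> str:
    """Return a stable key for one layer of a model's architecture graph."""
    canonical = json.dumps(layer_spec, sort_keys=True)  # order-independent encoding
    return hashlib.sha256(f"{model_name}:{canonical}".encode()).hexdigest()

# Example: a transfer-learning client that only needs one dense layer computes
# the key locally and issues a single fine-grained request to the repository.
layer_spec = {"type": "Dense", "index": 12, "units": 1024, "activation": "relu"}
key = stable_layer_key("resnet50-v2", layer_spec)
# tensor = repository.fetch_tensor(key)   # hypothetical repository API call
print(key[:16])
```

Because the key depends only on the model name and the layer's structural description, repeated lookups of the same layer hit the same provider and can be cached client-side, which is in the spirit of the metadata-caching and load-balancing ideas listed in the abstract.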
Keywords
DL model repository, fine-grained tensor storage and access, benchmarking