Storing Parquet Tile by Tile: Application-Aware Storage with Deduplication

2019 29th International Conference on Field Programmable Logic and Applications (FPL), 2019

Abstract
Distributed storage in the cloud must offer low-latency, high-bandwidth access to data while using storage capacity efficiently in order to keep up with emerging big data workloads. Deduplication has been used successfully to address the capacity requirement, but it is often at odds with low-latency data access. Deduplication ratios can be increased significantly if the storage nodes are aware of the file format and of the ways clients interact with it, but implementing file-type-specific parsing on FPGAs for multiple tenants can be infeasible due to area constraints. We show the benefits of making the storage system aware of the application through the example of Parquet files, a columnar format used in machine learning and big data frameworks to store and transfer datasets. We achieve high deduplication ratios by using a companion software library that allows Parquet files to be stored in a "divided" way. This makes deduplication more efficient and enables clients to selectively access individual columns or metadata fields. At the same time, the storage nodes remain general-purpose and can store and deduplicate arbitrary data. This work paves the way for in-storage processing of Parquet files and other columnar formats, because the different columns can be accessed in a streaming fashion and their processing requires no specialized logic on the FPGA.
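The idea of storing a columnar file in a "divided" way on a general-purpose deduplicating store can be illustrated with a minimal sketch. This is not the companion library described above: the names `DedupStore`, `store_divided`, and `read_column` are hypothetical, the columns are plain byte strings rather than real Parquet column chunks, and the store is an in-memory dictionary rather than an FPGA-based storage node. The sketch only assumes that the client splits a file into per-column pieces and that the store deduplicates by content hash without knowing anything about the file format.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical chunks are kept only once.

    The store is format-agnostic; it sees only opaque byte strings.
    """

    def __init__(self):
        self.chunks = {}        # content digest -> stored bytes
        self.logical_bytes = 0  # total bytes written by clients

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.logical_bytes += len(data)
        # Only the first copy of a given content is physically stored.
        self.chunks.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self.chunks[digest]

    def physical_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


def store_divided(store: DedupStore, columns: dict) -> dict:
    """Client-side step (hypothetical): store each column separately.

    Returns a manifest mapping column name -> content digest.
    """
    return {name: store.put(payload) for name, payload in columns.items()}


def read_column(store: DedupStore, manifest: dict, name: str) -> bytes:
    """Fetch a single column without touching the others."""
    return store.get(manifest[name])


# Two made-up datasets that share an identical "city" column.
file_a = {"id": b"\x01\x02\x03\x04", "city": b"Zurich;Madrid;Zurich;"}
file_b = {"id": b"\x05\x06\x07\x08", "city": b"Zurich;Madrid;Zurich;"}

store = DedupStore()
manifest_a = store_divided(store, file_a)
manifest_b = store_divided(store, file_b)

# Logical bytes written divided by physical bytes kept.
dedup_ratio = store.logical_bytes / store.physical_bytes()
```

Because the shared column is stored as its own unit, it hashes to the same digest in both files and is kept once, and a client can fetch it through `read_column` without reading the rest of the file. If the two files were stored as single blobs, the differing `id` bytes would give each blob a distinct hash, and this whole-object scheme would find no duplicates.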
Keywords
FPGA, distributed storage, deduplication, column stores, near-data processing