Accelerating Spark Datasets by Inlining Deserialization

Jan Wroblewski, Kazuaki Ishizaki, Hiroshi Inoue, Moriyoshi Ohara

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2017)

Abstract

Apache Spark is a framework for distributed computing that supports the map-reduce programming model. The SQL module of Spark contains Datasets, i.e., distributed collections of records stored in a serialized low-level format in a manually managed chunk of memory. However, the functions users provide to the map-reduce computations expect Java objects. Datasets perform an additional deserialization step beforehand to support the user-provided function, which increases the overhead. We tackled this problem by replacing map functions with their counterparts that accepted the serialized data. This allowed us to skip the unnecessary part of deserialization and achieve faster data processing speeds.
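The idea in the abstract can be illustrated with a toy model. The sketch below is not Spark code and not the authors' transformation: the fixed-width record layout, the `Order` class, and all function names are assumptions made up for illustration. It contrasts a map that rebuilds a full object for every record with a counterpart that reads only the field it needs straight from the serialized bytes.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed-width layout (an assumption, not Spark's row format):
# int64 id, float64 price, int32 quantity, little-endian, no padding.
RECORD = struct.Struct("<qdi")
PRICE = struct.Struct("<d")
PRICE_OFFSET = 8  # the price field starts right after the 8-byte id


@dataclass
class Order:
    id: int
    price: float
    quantity: int


def serialize(orders):
    """Pack the records back to back into one contiguous buffer."""
    buf = bytearray(RECORD.size * len(orders))
    for i, order in enumerate(orders):
        RECORD.pack_into(buf, i * RECORD.size, order.id, order.price, order.quantity)
    return bytes(buf)


def map_with_deserialization(buf, fn):
    """Baseline: rebuild a full Order object per record, then call fn on it."""
    results = []
    for offset in range(0, len(buf), RECORD.size):
        order = Order(*RECORD.unpack_from(buf, offset))
        results.append(fn(order))
    return results


def map_double_price_inlined(buf):
    """Counterpart of `lambda o: o.price * 2` that accepts serialized data.

    It reads only the price field from each record, so no Order object
    is ever allocated and the other fields are never decoded.
    """
    return [
        PRICE.unpack_from(buf, offset + PRICE_OFFSET)[0] * 2
        for offset in range(0, len(buf), RECORD.size)
    ]


orders = [Order(1, 2.5, 3), Order(2, 4.0, 1)]
buf = serialize(orders)
baseline = map_with_deserialization(buf, lambda o: o.price * 2)
inlined = map_double_price_inlined(buf)
```

Both variants produce the same result here; the second one simply avoids materializing objects whose remaining fields the function never touches, which is the kind of work the abstract describes as unnecessary.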

Keywords

Apache Spark, escape analysis, program transformation, serialization
