Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads

2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2019)

Graph processing is an important analysis technique for a wide range of big data applications. The ability to explicitly represent relationships between entities gives graph analytics a significant performance advantage over traditional relational databases. However, at the microarchitecture level, the performance of single-machine in-memory graph analytics is bounded by inefficiencies in the memory subsystem. This paper makes two contributions: an analysis and an optimization of the memory hierarchy for graph processing workloads.

First, we perform an in-depth data-type-aware characterization of graph processing workloads on a simulated multi-core architecture. We analyze 1) the memory-level parallelism in an out-of-order core and 2) the request reuse distance in the cache hierarchy. We find that load-load dependency chains involving different application data types form the primary bottleneck in achieving high memory-level parallelism. We also observe that different graph data types exhibit heterogeneous reuse distances. As a result, the private L2 cache contributes negligibly to performance, whereas the shared L3 cache shows higher performance sensitivity.

Second, based on our profiling observations, we propose DROPLET, a Data-awaRe decOuPLed prEfeTcher for graph applications. DROPLET prefetches different graph data types differently according to their inherent reuse distances. In addition, DROPLET is physically decoupled to overcome the serialization caused by the dependency chains between different data types. DROPLET achieves 19%-102% performance improvement over a no-prefetch baseline, 9%-74% over a conventional stream prefetcher, 14%-74% over a Variable Length Delta Prefetcher, and 19%-115% over a delta correlation prefetcher implemented as a global history buffer. DROPLET performs 4%-12.5% better than a monolithic L1 prefetcher similar to the state-of-the-art prefetcher for graphs.
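
The two quantities the characterization focuses on, load-load dependency chains and reuse distance, can be illustrated with a small sketch. This is not the paper's simulator or methodology. It assumes a compressed sparse row (CSR) graph layout, which the abstract does not specify, and the names `build_csr`, `sum_neighbor_properties`, and `reuse_distances` are hypothetical helpers introduced only for illustration.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def build_csr(num_vertices: int,
              edges: Sequence[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (offsets, neighbors) from (src, dst) edge pairs."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix sum turns per-vertex degrees into start offsets.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    neighbors = [0] * len(edges)
    cursor = offsets[:-1]  # slicing copies, so offsets is not modified
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors


def sum_neighbor_properties(offsets: List[int], neighbors: List[int],
                            prop: List[float], v: int) -> float:
    """Sum the property values of v's neighbors.

    Each step needs the result of the previous load to form its address,
    which is the kind of load-load dependency chain across data types
    that the abstract describes.
    """
    total = 0.0
    start, end = offsets[v], offsets[v + 1]  # load 1: offsets array
    for i in range(start, end):
        u = neighbors[i]   # load 2: index comes from load 1
        total += prop[u]   # load 3: index comes from load 2
    return total


def reuse_distances(trace: Sequence[Hashable]) -> List[Optional[int]]:
    """Reuse distance per access: the number of distinct addresses touched
    since the previous access to the same address (None on first access).
    Quadratic in the worst case; fine for a toy trace."""
    last_seen = {}
    distances: List[Optional[int]] = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            distances.append(len(set(trace[last_seen[addr] + 1:i])))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


if __name__ == "__main__":
    offsets, neighbors = build_csr(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
    print(sum_neighbor_properties(offsets, neighbors, [1.0, 2.0, 4.0], 0))
    print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
```

In this toy layout, the neighbor array is read sequentially while the property array is read at data-dependent indices, so a trace of property accesses would typically show larger and less regular reuse distances than a trace of neighbor-array accesses. That difference is one plausible reading of the "heterogeneous reuse distances" observation, not a reproduction of the paper's measurements.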
Prefetching, Random access memory, Arrays, Parallel processing, Hardware, Benchmark testing