Performance Evaluation of OpenCL Standard Support (and Beyond)
Proceedings of the International Workshop on OpenCL (2019)
Abstract
In this talk, we will discuss how support (or lack of it) for various OpenCL (OCL) features affects the performance of graph applications executing on GPU platforms. Given that adoption of OCL features varies widely across vendors, our results can help quantify the performance benefits and potentially motivate the timely adoption of these OCL features.
Our findings are drawn from the experience of developing an OCL backend for a state-of-the-art graph application DSL, IrGL, originally developed with a CUDA backend [1]. IrGL allows competitive algorithms for applications such as breadth-first search, PageRank, and single-source shortest path to be written at a high level. A series of optimisations can then be applied by the compiler to generate OCL code. These user-selectable optimisations exercise various features of OCL: on one end of the spectrum, applications compiled without optimisations require only core OCL version 1.1 features; on the other end, a certain optimisation requires inter-workgroup forward progress guarantees, which are yet to be officially supported by OCL, but have been empirically validated and are relied upon, e.g., to achieve global device-wide synchronisation [3]. Other optimisations require OCL features such as: fine-grained memory consistency guarantees (added in OCL 2.0) and subgroup primitives (added to core in OCL 2.1).
Our compiler can apply 6 independent optimisations (Table 1), each of which requires an associated minimum version of OCL to be supported. Increased OCL support enables progressively more optimisations: 2 optimisations are supported with OCL 1.x; 1 additional optimisation with OCL 2.0; and a further 2 with OCL 2.1. With OCL FP, denoting v2.1 extended with forward progress guarantees (not officially supported at present), the final optimisation is enabled. We will discuss the OCL features required for each optimisation and the idioms in which the features are used. Use-case discussions of these features (e.g. memory consistency and subgroup primitives) are valuable as there appear to be very few open-source examples: a GitHub search yields only a small number of results.
Our compiler enables us to carry out a large and controlled study, in which the performance benefit of various levels of OCL support can be evaluated. We exhaustively gather runtime data for all combinations of: the optimisations, 17 applications, 3 graph inputs, and 6 different GPUs spanning 4 vendors: Nvidia, AMD, Intel, and ARM (Table 2).
We show two notable results in this abstract. Our first result, summarised in Figure 1, shows that all optimisations can be beneficial across a range of GPUs, despite significant architectural differences (e.g. subgroup size, as seen in Table 2). This provides motivation that previous vendor-specific approaches (e.g. for Nvidia) can be ported to OCL and achieve speedups on a range of devices.
Our second result, summarised in Figure 2, shows that if feature support is limited to OCL 2.0 (or below), the available optimisations (fg, wg, sz256) fail to achieve any speedups in over 70% of the chip/application/input benchmarks. If support for OCL 2.1 is considered (adding the sg and coop-cv optimisations), this number drops to 60%, but the observed speedups are modest, rarely exceeding 2x. Finally, if forward progress guarantees are assumed (adding the oitergb optimisation), speedups are observed in over half of the cases, including impressive speedups of over 14x for AMD and Intel GPUs. This provides compelling evidence for forward progress properties to be considered for adoption in a future OCL version.
An extended version of this material can be found in [2, ch. 5].