A Study on the Performance Implications of AArch64 Atomics.

ISC (2023)

Abstract
Atomic operations are indivisible operations guaranteed to execute as a whole. One of the most important and widely used atomic operations is "compare-and-swap" (CAS), which allows threads to perform concurrent read-modify-write operations on the same memory location, free of data races. On recent Arm architectures, CAS operations can be implemented either directly via CAS instructions or via load-linked/store-conditional (LL-SC) instruction pairs. In this paper we explore the performance of the CAS and LL-SC approaches to implementing CAS operations on recent high-performance Arm-based CPUs, namely the A64FX, ThunderX2, and Graviton3. We observe that CAS and LL-SC instructions can lead to fundamentally different performance profiles. On the A64FX, for example, the newer CAS instructions (often preferred by compilers over the older LL-SC pairs) can lead to a quadratic increase in average time per successful CAS operation as the number of threads increases, whereas the older LL-SC approach shows the expected linear scaling. For high thread counts, this difference translates into a speedup of more than 20× when using LL-SC instructions. We characterise the conditions under which the LL-SC or CAS approaches are superior on each CPU, and the speedup that can be realised by preferring one strategy over the other.
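
To make the CAS operation described above concrete, the following is a minimal C++ sketch of a CAS retry loop on a shared counter. It is not the paper's benchmark: the names `cas_increment` and `run_contended_counter` are hypothetical and chosen only for illustration. The source code does not fix how the operation is lowered. On AArch64 the compiler may emit either a single CAS instruction or an LL-SC pair for `compare_exchange_weak`, depending on the target options (for example, GCC and Clang enable the Armv8.1-A atomic instructions with `-march=armv8.1-a`).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: atomically add 1 to `counter` with a CAS retry loop.
// Returns how many CAS attempts failed before this call succeeded.
std::uint64_t cas_increment(std::atomic<std::uint64_t>& counter) {
    std::uint64_t retries = 0;
    std::uint64_t expected = counter.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak stores the value it observed into
    // `expected`, so every retry starts from a fresh snapshot.
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        ++retries;
    }
    return retries;
}

// Hypothetical driver: `threads` threads each perform `iters` successful
// CAS increments on one shared location, i.e. the fully contended case.
std::uint64_t run_contended_counter(unsigned threads, std::uint64_t iters) {
    assert(threads > 0);
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                cas_increment(counter);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter.load();
}
```

Calling `run_contended_counter(4, 10000)` leaves the counter at 40000 regardless of interleaving, because every increment is retried until its CAS succeeds. The retry count returned by `cas_increment` is the quantity that grows under contention, and timing the driver for increasing thread counts is one rough way to observe the "time per successful CAS" metric the abstract refers to.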