A Study on the Performance Implications of AArch64 Atomics.
ISC (2023)
Abstract
Atomic operations are indivisible operations guaranteed to execute as a whole. One of the most important and widely used atomic operations is “compare-and-swap” (CAS), which allows threads to perform concurrent read-modify-write operations on the same memory location, free of data races. On recent Arm architectures, CAS operations can be implemented either directly via CAS instructions, or via load-linked/store-conditional (LL-SC) instruction pairs. In this paper we explore the performance of the CAS and LL-SC approaches to implementing CAS operations on recent high-performance Arm-based CPUs, namely the A64FX, ThunderX2, and Graviton3. We observe that CAS and LL-SC instructions can lead to fundamentally different performance profiles. On the A64FX, for example, the newer CAS instructions (often preferred by compilers over the older LL-SC pairs) can lead to a quadratic increase in average time per successful CAS operation as the number of threads increases, whereas the older LL-SC approach shows the expected linear scaling. For high thread counts, this difference translates into a speedup of more than 20x when using LL-SC instructions. We characterise the conditions under which the LL-SC or CAS approaches are superior on each CPU, and the speedup that can be realised by preferring one strategy over the other.
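To make the CAS operation described above concrete, here is a minimal C++ sketch of a contended CAS retry loop. It is an illustration only, not the paper's benchmark: the names `cas_increment` and `run_contended` are hypothetical, and the thread and iteration counts are arbitrary. With GCC or Clang on AArch64, a loop like this is typically compiled to a CAS instruction when LSE is enabled (for example `-march=armv8.1-a`), and to an LL-SC (`LDXR`/`STXR` family) loop when targeting `-march=armv8-a` with `-mno-outline-atomics`. The exact code generated depends on the compiler version and flags, so it is worth checking the disassembly.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Atomically increment `counter` with an explicit CAS retry loop.
// Returns the number of failed CAS attempts this call needed.
inline std::uint64_t cas_increment(std::atomic<std::uint64_t>& counter) {
    std::uint64_t retries = 0;
    std::uint64_t expected = counter.load(std::memory_order_relaxed);
    // The exchange stores expected + 1 only if `counter` still equals
    // `expected`; on failure, `expected` is refreshed with the current value.
    while (!counter.compare_exchange_strong(expected, expected + 1,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed)) {
        ++retries;
    }
    return retries;
}

// Hypothetical helper: `threads` workers each perform `iters` successful
// CAS increments on one shared location, then the final value is returned.
inline std::uint64_t run_contended(unsigned threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                cas_increment(counter);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter.load();
}
```

Every call to `cas_increment` ends with exactly one successful CAS, so `run_contended(threads, iters)` returns `threads * iters`. Timing the whole run and dividing by that product gives an average time per successful CAS, which is the kind of metric the abstract refers to.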