Efficient, Distributed, and Non-Speculative Multi-Address Atomic Operations

Eduardo José Gómez-Hernández,Juan M. Cebrian,J. Rubén Titos Gil,Stefanos Kaxiras,Alberto Ros

MICRO（2021）

引用 4|浏览7

暂无评分

摘要

ABSTRACT Critical sections that read, modify, and write (RMW) a small set of addresses are common in parallel applications and concurrent data structures. However, to escape from the intricacies of fine-grained locks, which require reasoning about all possible thread interleavings, programmers often resort to coarse-grained locks to ensure atomicity. This results in atomic protection of a much larger set of potentially conflicting addresses, and, consequently, increased lock contention and unneeded serialization. As many before us have observed, these problems would be solved if only general RMW multi-address atomic operations were available, but current proposals are impractical because of deadlock scenarios that appear due to resource limitations. Alternatively, transactional memory can detect conflicts at run-time aiming to maximize concurrency, but it has significant overheads in highly-contended critical sections. In this work, we propose multi-address atomic operations (MAD atomics). MAD atomics achieve complexity-effective, non-speculative, non-deadlocking, fine-grained locking for multiple addresses, relying solely on the coherence protocol and a predetermined locking order. Unlike prior works, MAD atomics address the challenge of enabling atomic modification over a set of cachelines with arbitrary addresses, simultaneously locking all of them while side-stepping deadlock. MAD atomics only require a small storage per core (around 68 bytes), while significantly outperforming typical lock implementations. Indeed, our evaluation using gem5-20 shows that MAD atomics can improve performance by up to 18 × (3.4 ×, on average, for the applications and concurrent data structures evaluated in this work) over a baseline implemented with locks running on 16 cores. More importantly, the improvement still reaches 2.7 ×, on average, compared to an Intel hardware transactional memory implementation running on 16 cores.

查看译文

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要