A shared BTB design for multicore systems

Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (2019)

Abstract
With increasing use of runtime polymorphism and reliance on runtime type interpretation, the presence and importance of indirect branches have risen considerably in recent workloads. As a result, accurate target prediction for indirect branches has emerged as an important problem. While direction prediction of direct branches has received considerable research attention, leading to efficient prediction policies and hardware structures implemented inside modern processors, proposals for target prediction of indirect branches have been relatively few. Accurate target prediction for indirect branches is considerably harder because these branches transfer control to an address stored in a register that is known only at runtime. Unlike conditional direct branches, an indirect branch can have more than two targets to be resolved at runtime, so prediction must produce a full 32-bit or 64-bit address rather than just the taken or not-taken decision needed for direction prediction of direct branches. Recent research shows that indirect branches, being mispredicted more frequently, can start to dominate the overall branch misprediction cost. In modern processors, the only hardware structure available to facilitate target address prediction for indirect branches is a fixed-size Branch Target Buffer (BTB). The BTB is often designed as a set-associative cache that stores recent target addresses of branch instructions encountered during execution, so that the same addresses can be reused for future instances, thereby saving latency cycles. Designing efficient indexing schemes and replacement mechanisms for BTB structures is therefore crucial, all the more so for indirect branches, since the BTB serves as their only prediction handle.
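As a minimal sketch of the set-associative BTB organization described above, the following Python model stores one target per branch PC and evicts the least recently used entry within a set. The class name `SetAssociativeBTB`, the index function, the use of the full PC as the tag, and the LRU policy are all illustrative assumptions and are not taken from the paper.

```python
from collections import OrderedDict


class SetAssociativeBTB:
    """Toy BTB model (hypothetical): maps a branch PC to its last seen target."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set; its ordering doubles as the LRU order.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _index(self, pc):
        # Assumed indexing scheme: drop the 2 low bits (4-byte aligned
        # instructions) and take the remainder modulo the number of sets.
        return (pc >> 2) % self.num_sets

    def predict(self, pc):
        entries = self.sets[self._index(pc)]
        if pc in entries:  # the full PC acts as the tag in this sketch
            entries.move_to_end(pc)  # mark as most recently used
            return entries[pc]
        return None  # BTB miss: no target prediction available

    def update(self, pc, target):
        entries = self.sets[self._index(pc)]
        if pc not in entries and len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used entry
        entries[pc] = target
        entries.move_to_end(pc)


btb = SetAssociativeBTB(num_sets=4, ways=2)
btb.update(0x1000, 0x2000)
first = btb.predict(0x1000)    # hit: the stored target
btb.update(0x1000, 0x3000)     # the indirect branch resolves to a new target
second = btb.predict(0x1000)   # hit: the most recent target
missing = btb.predict(0x1004)  # never seen, so no prediction
```

Because each entry holds only the most recent target, an indirect branch that alternates between several targets would mispredict often in this model, which illustrates why indexing and replacement choices matter for such branches.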
In this paper, our objective is to examine a hierarchical BTB design for multicores with a total size comparable to what exists today, motivated by a possible improvement in prediction accuracy through collaborative, constructive learning between programs that execute on different cores and encounter similar histories. Specifically, we propose a small on-chip L1 BTB inside each core, supported by a larger off-chip L2 BTB shared among the cores. Our motivation for a hierarchical BTB design is twofold. On one hand, in a multiprogramming environment where the same program is executed on multiple cores with different test cases, there is significant potential for reuse of target information for indirect branches, as is evident from our experiments on SPEC 2006 workloads. This is particularly useful for machine-learning workloads, where the training phase is often executed on different cores with the same neural network being trained on different training sets. On the other hand, with different programs on different cores, there is some chance of reuse as well, due to sharing of system libraries. The essence of our idea rests on the fact that programs executed on different cores have similar history patterns that can similarly influence the target addresses. Our design aims to decrease the on-chip L1 BTB size while investing more storage in the off-chip shared L2 BTB. Suitable allocation and replacement policies support our hierarchical design. Initial experiments show the expected accuracy benefits.
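The two-level lookup outlined above can be sketched as follows. This is a simplified illustration, not the paper's design: the names `LRUTable` and `HierarchicalBTB`, the table sizes, the fully associative LRU tables, the fill-L1-on-L2-hit policy, and the write-through update are all assumptions. Entries are keyed by branch PC alone, which assumes that cores running the same program see the same branch addresses.

```python
from collections import OrderedDict


class LRUTable:
    """Fully associative LRU table, used here only for brevity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[key] = value
        self.entries.move_to_end(key)


class HierarchicalBTB:
    """Per-core private L1 tables backed by one L2 table shared by all cores."""

    def __init__(self, num_cores, l1_entries=2, l2_entries=8):
        self.l1 = [LRUTable(l1_entries) for _ in range(num_cores)]
        self.l2 = LRUTable(l2_entries)  # shared among the cores

    def predict(self, core, pc):
        target = self.l1[core].get(pc)
        if target is not None:
            return target, "L1"
        target = self.l2.get(pc)
        if target is not None:
            self.l1[core].put(pc, target)  # assumed policy: fill L1 on L2 hit
            return target, "L2"
        return None, "miss"

    def update(self, core, pc, target):
        # Assumed policy: write the resolved target through to both levels.
        self.l1[core].put(pc, target)
        self.l2.put(pc, target)


hb = HierarchicalBTB(num_cores=2)
hb.update(0, 0x400, 0x900)         # core 0 resolves an indirect branch
from_l2 = hb.predict(1, 0x400)     # core 1 reuses the target via shared L2
from_l1 = hb.predict(1, 0x400)     # now served from core 1's private L1
not_found = hb.predict(1, 0x500)   # unknown branch misses in both levels
```

In this sketch, core 1 obtains a prediction for a branch it has never executed because core 0 already recorded the target in the shared table, which is the cross-core reuse the hierarchical design relies on.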
Keywords
BTB, Prediction accuracy