Toward Automatic Optimized Code Generation for Multiprecision Modular Exponentiation on a GPU

Parallel and Distributed Processing Symposium Workshops & PhD Forum (2013)

Abstract
Multiprecision modular exponentiation has a variety of uses, including cryptography, primality testing, and computational number theory. It is also a very costly operation to compute. GPU parallelism can accelerate these computations, but to use the GPU efficiently, a problem must involve a large number of simultaneous exponentiation operations. Handling a large number of TLS/SSL encrypted sessions in a data center is a significant problem that fits this profile. We have developed a framework that generates highly efficient NVIDIA PTX implementations of exponentiation operations for different GPU architectures and problem instances. One challenge in generating such code is that PTX is not a true assembly language; it is a virtual instruction set that is compiled and optimized in different ways for different generations of GPU hardware. As a result, the same PTX code runs with different levels of efficiency on different machines. In addition, as the precision of the exponentiation values changes, each architecture has its own break-even points where a different algorithm or parallelization strategy must be employed. Making the code efficient for a given problem instance and architecture therefore requires searching a multidimensional space of algorithms and configurations: generating thousands of lines of carefully constructed PTX code for each combination, executing it, validating the numerical result, and evaluating its actual performance. Our framework automates much of this process and produces exponentiation code that is up to six times faster than the best known hand-coded implementations. More importantly, the framework lets users find the best configuration for each new GPU architecture relatively quickly. Our framework is also the basis for the eventual creation of a multiprecision matrix arithmetic package for GPU cluster systems that will be portable across multiple generations of GPU.
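The generate, execute, validate, and measure loop described in the abstract can be illustrated with a small CPU-only sketch. This is not the authors' framework and it emits no PTX: the two candidate algorithms (binary and fixed-window exponentiation), the function names, and the timing harness are all assumptions chosen for illustration, with Python's built-in `pow` standing in as the reference for validation.

```python
import time


def modexp_binary(base, exp, mod):
    """Left-to-right square-and-multiply (hypothetical candidate 1)."""
    result = 1 % mod
    base %= mod
    for bit in bin(exp)[2:]:
        result = (result * result) % mod
        if bit == "1":
            result = (result * base) % mod
    return result


def modexp_window(base, exp, mod, w=4):
    """Fixed-window exponentiation with 2**w precomputed powers (hypothetical candidate 2)."""
    base %= mod
    table = [1 % mod]
    for _ in range((1 << w) - 1):
        table.append((table[-1] * base) % mod)
    # Split the exponent into w-bit digits, least significant first.
    digits = []
    e = exp
    while e:
        digits.append(e & ((1 << w) - 1))
        e >>= w
    result = 1 % mod
    for d in reversed(digits):
        for _ in range(w):
            result = (result * result) % mod
        result = (result * table[d]) % mod
    return result


def _time_once(fn, cases):
    """Wall-clock time for one pass over all test cases."""
    start = time.perf_counter()
    for b, e, m in cases:
        fn(b, e, m)
    return time.perf_counter() - start


def pick_fastest(candidates, cases, repeats=3):
    """Validate each candidate against pow(), then return the fastest valid one."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # Reject any candidate that produces a wrong result.
        if any(fn(b, e, m) != pow(b, e, m) for b, e, m in cases):
            continue
        elapsed = min(_time_once(fn, cases) for _ in range(repeats))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time


if __name__ == "__main__":
    cases = [(3, (1 << 1024) - 159, (1 << 1024) - 105), (4, 13, 497)]
    candidates = {"binary": modexp_binary, "window4": modexp_window}
    print(pick_fastest(candidates, cases))
```

Which candidate wins depends on the operand size and the machine, which mirrors the break-even behavior the abstract describes; a real search would also vary parallelization strategy and target architecture.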
Keywords
different algorithm, new gpu architecture, different level, problem instance, multiprecision modular exponentiation, different gpu architecture, gpu hardware, automatic optimized code generation, different generation, gpu parallelism, gpu cluster system, ptx code, break-even points, modular exponentiation, generators, registers, parallel processing, assembly, computer architecture