SpeedLimit: Neural Architecture Search for Quantized Transformer Models

Yuji Chai, Luke Bailey, Yunho Jin, Matthew Karle, Glenn G. Ko

arXiv (Cornell University), 2022

Abstract
While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
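The abstract describes searching over transformer architectures while (a) quantizing each candidate to 8-bit integers and (b) discarding candidates whose measured latency exceeds an upper bound. Below is a minimal sketch of that idea, assuming PyTorch dynamic int8 quantization as a stand-in for the paper's quantization scheme; the candidate grid, the latency budget, and the evaluate_accuracy stub are illustrative placeholders, not the authors' actual search space or training procedure.

import itertools
import time

import torch
import torch.nn as nn


def build_candidate(num_layers: int, d_model: int, n_heads: int) -> nn.Module:
    """Build a small transformer encoder for one point in the search space."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=n_heads,
        dim_feedforward=4 * d_model, batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)


def measure_latency_ms(model: nn.Module, d_model: int,
                       seq_len: int = 128, runs: int = 10) -> float:
    """Average wall-clock latency of a single-sample forward pass on CPU."""
    x = torch.randn(1, seq_len, d_model)
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1000.0


def evaluate_accuracy(model: nn.Module) -> float:
    """Placeholder: a real search would fine-tune and score the quantized
    candidate on a downstream task; here it returns a random proxy score."""
    return torch.rand(1).item()


def search(latency_budget_ms: float = 50.0):
    """Keep the best-scoring candidate that satisfies the latency bound."""
    best = None
    # Hypothetical search space over depth and attention-head count.
    for num_layers, d_model, n_heads in itertools.product([2, 4], [256], [4, 8]):
        fp32_model = build_candidate(num_layers, d_model, n_heads)
        # Quantize to int8 before checking the constraint, so candidates are
        # scored as they would actually be deployed.
        int8_model = torch.quantization.quantize_dynamic(
            fp32_model, {nn.Linear}, dtype=torch.qint8
        )
        latency = measure_latency_ms(int8_model, d_model=d_model)
        if latency > latency_budget_ms:
            continue  # violates the upper-bound latency constraint
        score = evaluate_accuracy(int8_model)
        if best is None or score > best[0]:
            best = (score, latency, (num_layers, d_model, n_heads))
    return best


if __name__ == "__main__":
    print(search())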
Keywords
neural architecture search, models