A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE
CoRR(2024)
Abstract
The Transformer is an emerging neural network model built on an attention
mechanism. It has been applied to various tasks and achieves favorable accuracy compared to
CNNs and RNNs. While the attention mechanism is recognized as a general-purpose
component, many Transformer models require significantly more
parameters than CNN-based ones. To mitigate the computational
complexity, a hybrid approach has recently been proposed, which uses ResNet as
a backbone architecture and replaces a part of its convolution layers with an
MHSA (Multi-Head Self-Attention) mechanism. In this paper, we significantly
reduce the parameter size of such models by using Neural ODE (Ordinary
Differential Equation) as a backbone architecture instead of ResNet. The
proposed hybrid model reduces the parameter size by 94.6% compared to the
CNN-based ones without degrading the accuracy. We then deploy the proposed
model on a modest-sized FPGA device for edge computing. To further reduce FPGA
resource utilization, we quantize the model following a QAT (Quantization-Aware
Training) scheme instead of PTQ (Post-Training Quantization) to suppress the
accuracy loss. As a result, an extremely lightweight Transformer-based model
can be implemented on resource-limited FPGAs. The weights of the feature
extraction network are stored on-chip to minimize memory transfer overhead,
so inference runs without interruption and is accelerated. The
proposed FPGA implementation achieves a 12.8x speedup and 9.21x higher energy
efficiency compared to an ARM Cortex-A53 CPU.
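
The parameter saving from a Neural ODE backbone comes from reusing one block's weights at every solver step, where a ResNet stage stacks blocks that each hold their own weights. The following is a minimal sketch of that idea, not the paper's architecture: the channel width, block count, two-convolution block layout, and the fixed-step Euler solver are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return in_ch * out_ch * k * k + out_ch


def resnet_stage_params(channels, num_blocks):
    """A ResNet stage: every residual block owns two 3x3 convolutions."""
    return num_blocks * 2 * conv_params(channels, channels)


def ode_stage_params(channels):
    """A Neural ODE stage: one block whose weights are reused at every step."""
    return 2 * conv_params(channels, channels)


def euler_solve(f, z0, num_steps, t0=0.0, t1=1.0):
    """Integrate dz/dt = f(z, t) from t0 to t1 with fixed-step Euler.

    Each step z <- z + h * f(z, t) has the same form as a residual
    connection, but the same f (same weights) is applied every time.
    """
    h = (t1 - t0) / num_steps
    z, t = list(z0), t0
    for _ in range(num_steps):
        dz = f(z, t)
        z = [zi + h * di for zi, di in zip(z, dz)]
        t += h
    return z


if __name__ == "__main__":
    channels, blocks = 64, 8  # assumed sizes, not taken from the paper
    print("ResNet stage parameters:", resnet_stage_params(channels, blocks))
    print("ODE stage parameters:   ", ode_stage_params(channels))
    # Toy dynamics dz/dt = -z stand in for the learned block.
    print(euler_solve(lambda z, t: [-v for v in z], [1.0], num_steps=10))
```

In this toy setting the ODE stage needs 1/num_blocks of the ResNet stage's parameters. The 94.6% figure reported above depends on the full model and is not reproduced by this sketch.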