Generation-based Differential Fuzzing for Deep Learning Libraries.

ACM Trans. Softw. Eng. Methodol. (2024)

Abstract
Deep learning (DL) libraries have become a key component in developing and deploying DL-based software. With the growing popularity of DL models in both academia and industry across various domains, any bug in a DL library can potentially cause unexpected severe outcomes. As such, there is an urgent demand for improving the software quality of DL libraries. Although some existing approaches are specifically designed for testing DL libraries, their focus is usually limited to one specific domain, such as computer vision (CV). It is still unclear how well, and to what extent, existing approaches detect bugs in different DL libraries across different task domains. To bridge this gap, we first conduct an empirical study on four representative, state-of-the-art DL library testing approaches. Our results reveal that existing approaches generalize poorly to other task domains. We also find that the test inputs generated by these approaches usually lack diversity and expose only a few types of bugs. What is worse, the false-positive rate of existing approaches is high (up to 58%). To address these issues, we propose a guided, generation-based differential fuzzing approach, namely Gandalf. To generate test inputs across diverse task domains effectively, Gandalf adopts a context-free grammar to ensure validity and utilizes a Deep Q-Network to maximize diversity. Gandalf also includes 15 metamorphic relations so that the generated test cases generalize across different DL libraries. This design reduces the false positives caused by semantic differences between the libraries' APIs. We evaluate the effectiveness of Gandalf on nine versions of three representative DL libraries, covering 309 operators from computer vision, natural language processing, and automated speech recognition.
The evaluation results demonstrate that Gandalf can effectively and efficiently generate diverse test inputs. Meanwhile, Gandalf successfully detects five categories of bugs with a false-positive rate of only 3.1%. We report all 49 new unique bugs found during the evaluation to the DL libraries’ developers, and most of these bugs have been confirmed. Details about our empirical study and evaluation results are available on our project website.
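To make the idea of grammar-based generation combined with differential testing more concrete, the following is a minimal sketch and not Gandalf's actual implementation. Everything in it is an assumption made for illustration: the toy grammar, the operator names (relu, scale2, softmax), and the three pure-Python "backends" standing in for real DL libraries. Production rules are picked uniformly at random here, whereas the abstract describes a Deep Q-Network guiding that choice, and outputs are compared directly rather than through metamorphic relations.

```python
import math
import random

# Hypothetical toy grammar: a model is a chain of one or more layers.
GRAMMAR = {
    "<model>": [["<layer>"], ["<layer>", "<model>"]],
    "<layer>": [["relu"], ["scale2"], ["softmax"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand a grammar symbol into a list of terminal operator names."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol
    rules = GRAMMAR[symbol]
    # Past max_depth, always take the first (shortest) rule so expansion stops.
    rule = rules[0] if depth >= max_depth else rng.choice(rules)
    ops = []
    for part in rule:
        ops.extend(generate(part, rng, depth + 1, max_depth))
    return ops


def softmax_direct(xs):
    # Shifted-exponential formulation.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_logsumexp(xs):
    # Mathematically equivalent log-sum-exp formulation.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [math.exp(x - lse) for x in xs]


# Two stand-in "libraries" that should agree up to rounding error.
BACKEND_A = {
    "relu": lambda xs: [max(0.0, x) for x in xs],
    "scale2": lambda xs: [2.0 * x for x in xs],
    "softmax": softmax_direct,
}
BACKEND_B = {
    "relu": lambda xs: [x if x > 0 else 0.0 for x in xs],
    "scale2": lambda xs: [x + x for x in xs],
    "softmax": softmax_logsumexp,
}
# A third stand-in with a deliberately seeded bug in relu.
BACKEND_BUGGY = dict(BACKEND_A, relu=lambda xs: [abs(x) for x in xs])


def run(backend, ops, xs):
    """Apply the operator chain to the input using one backend."""
    for op in ops:
        xs = backend[op](xs)
    return xs


def agree(backend_1, backend_2, ops, xs, tol=1e-9):
    """Differential oracle: True if both backends match within tol."""
    out_1 = run(backend_1, ops, xs)
    out_2 = run(backend_2, ops, xs)
    return all(abs(a - b) <= tol for a, b in zip(out_1, out_2))


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        ops = generate("<model>", rng)
        xs = [rng.uniform(-3.0, 3.0) for _ in range(4)]
        print(ops, "A vs B agree:", agree(BACKEND_A, BACKEND_B, ops, xs))
```

Because every operator chain is derived from the grammar, each generated test case is valid by construction. The tolerance in the oracle is what keeps harmless rounding differences between backends from being reported as bugs, which is one simple way to limit false positives in a differential setting.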
Keywords
differential fuzzing, generation-based