Learning Approximate Execution Semantics From Traces for Binary Function Similarity

IEEE Transactions on Software Engineering (2023)

Abstract
Detecting semantically similar binary functions, a crucial capability with broad security applications including vulnerability detection, malware analysis, and forensics, requires understanding function behaviors and intentions. This task is challenging because semantically similar functions can be compiled to run on different architectures and with diverse compiler optimizations or obfuscations. Most existing approaches match functions based on syntactic features without understanding the functions' execution semantics. We present Trex, a transfer-learning-based framework that automates learning approximate execution semantics explicitly from functions' traces collected via forced execution (i.e., by violating the control flow semantics) and transfers the learned knowledge to match semantically similar functions. While forced-execution traces are known to be too imprecise to be used directly to detect semantic similarity, our key insight is that these traces can instead be used to teach an ML model the approximate execution semantics of diverse instructions and their compositions. We thus design a pretraining task that trains the model to learn approximate execution semantics from the two modalities of the function (i.e., forced-executed code and traces). We then finetune the pretrained model to match semantically similar functions. We evaluate Trex on 1,472,066 functions from 13 popular software projects, compiled to run on 4 architectures (x86, x64, ARM, and MIPS), with 4 optimizations (O0-O3) and 5 obfuscations. Trex outperforms the state-of-the-art solutions by 7.8%, 7.2%, and 14.3% in cross-architecture, optimization, and obfuscation function matching, respectively, while running 8x faster. Ablation studies suggest that the pretraining significantly boosts the function matching performance, underscoring the importance of learning execution semantics. Our case studies demonstrate practical use cases of Trex: on 180 real-world firmware images, Trex uncovers 14 vulnerabilities not disclosed by previous studies. We release the code and dataset of Trex at https://github.com/CUMLSec/trex.
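As an illustration of the final matching step only, the sketch below ranks candidate functions by the similarity of their embedding vectors to a query function's embedding. It makes several assumptions that the abstract does not state: that the finetuned model yields one fixed-length vector per function, and that cosine similarity is the comparison metric. The function names, the helper names `cosine_similarity` and `rank_candidates`, and the vector values are all hypothetical and are not taken from the Trex code base.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings; a real model would produce much longer vectors.
query_embedding = [0.9, 0.1, 0.3]
candidate_embeddings = {
    "memcpy_arm_O2": [0.8, 0.2, 0.25],
    "strlen_mips_O0": [-0.1, 0.9, 0.2],
    "memcpy_x86_O3": [0.85, 0.05, 0.35],
}

ranking = rank_candidates(query_embedding, candidate_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up vectors, the two `memcpy` variants rank above `strlen`, which is the behavior a cross-architecture or cross-optimization matcher aims for: the same source function should score highly regardless of how it was compiled.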
Keywords
Semantics, Task analysis, Computer architecture, Optimization, Codes, Behavioral sciences, Computational modeling, Binary analysis, Large language models, Software security