STEAM: Observability-Preserving Trace Sampling

PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023(2023)

引用 0|浏览22
暂无评分
摘要
In distributed systems and microservice applications, tracing is a crucial observability signal employed for comprehending their internal states. To mitigate the overhead associated with distributed tracing, most tracing frameworks utilize a uniform sampling strategy, which retains only a subset of traces. However, this approach is insufficient for preserving system observability. This is primarily attributed to the long-tail distribution of traces in practice, which results in the omission or rarity of minority yet critical traces after sampling. In this study, we introduce an observability-preserving trace sampling method, denoted as STEAM, which aims to retain as much information as possible in the sampled traces. We employ Graph Neural Networks (GNN) for trace representation, while incorporating domain knowledge of trace comparison through logical clauses. Subsequently, we employ a scalable approach to sample traces, emphasizing mutually dissimilar traces. STEAM has been implemented on top of OpenTelemetry, comprising approximately 1.6K lines of Golang code and 2K lines of Python code. Evaluation on four benchmark microservice applications and a production system demonstrates the superior performance of our approach compared to baseline methods. Furthermore, STEAM is capable of processing 15,000 traces in approximately 4 seconds.
更多
查看译文
关键词
distributed tracing,trace sampling,graph neural network
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要