Unlocking the Power of Multi-institutional Data: Integrating and Harmonizing Genomic Data Across Institutions
CoRR (2024)
Abstract
Cancer is a complex disease driven by genomic alterations, and tumor
sequencing is becoming a mainstay of clinical care for cancer patients. The
emergence of multi-institution sequencing data presents a powerful resource for
learning real-world evidence to enhance precision oncology. GENIE BPC, led by
the American Association for Cancer Research, establishes a unique database
linking genomic data with clinical information for patients treated at multiple
cancer centers. However, leveraging such multi-institutional sequencing data
presents significant challenges. Variations in gene panels result in loss of
information when the analysis is conducted on common gene sets. Additionally,
differences in sequencing techniques and patient heterogeneity across
institutions add complexity. High data dimensionality, sparse gene mutation
patterns, and weak signals at the individual gene level further complicate
matters. Motivated by these real-world challenges, we introduce the Bridge
model. It uses a quantile-matched latent variable approach to derive integrated
features that preserve information beyond the common genes and make use of all
available data, while sharing information to improve both learning efficiency
and the model's capacity to generalize. By
extracting harmonized and noise-reduced lower-dimensional latent variables, the
model captures the true mutation pattern unique to each individual. We assess the
model's performance and parameter estimation through extensive simulation
studies. The extracted latent features from the Bridge model consistently excel
in predicting patient survival across six cancer types in GENIE BPC data.