RNA-to-image multi-cancer synthesis using cascaded diffusion models

bioRxiv (2023)

Abstract
Synthetic data generation offers a solution to the data scarcity problem in biomedicine, where data are often expensive or difficult to obtain. By increasing the dataset size, more powerful and generalizable machine learning models can be trained, improving their performance in clinical decision support systems. The generation of synthetic data for cancer diagnosis has been explored in the literature, but typically in a single-modality setting (e.g. whole-slide image tiles or RNA-Seq data). Given the success of text-to-image synthesis models for natural images, where one modality is used to generate a related one, we propose RNA-to-image synthesis (RNA-CDM) in a multi-cancer context. First, we trained a variational auto-encoder to reduce the dimensionality of the patient's gene expression profile, showing that this representation can accurately differentiate between cancer types. Then, we trained a cascaded diffusion model to synthesize realistic whole-slide image tiles conditioned on the latent representation of the patient's RNA-Seq data. We show that the generated tiles preserved the cell-type distribution found in real-world data, with important cell types detectable by a state-of-the-art cell identification model in the synthetic samples. Next, we successfully used this synthetic data to pretrain a multi-cancer classification model, observing an improvement in performance compared to training from scratch across 5-fold cross-validation. Our results demonstrate the potential utility of synthetic data for developing multi-modal machine learning models in data-scarce settings, as well as the possibility of imputing missing data modalities by leveraging the information present in available modalities.

### Competing Interest Statement

The authors have declared no competing interest.
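The pipeline described in the abstract, an encoder compressing a gene expression profile into a latent vector that then conditions a diffusion model, can be sketched minimally in NumPy. Everything here is an illustrative assumption: the dimensions, the toy linear encoder standing in for the VAE, and the standard DDPM forward-noising formula are not taken from the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): a gene expression vector
# is compressed into a small latent code, standing in for the VAE bottleneck.
n_genes, latent_dim = 1000, 64
W = rng.normal(0.0, 0.01, size=(n_genes, latent_dim))  # toy encoder weights
b = np.zeros(latent_dim)

def encode(expr: np.ndarray) -> np.ndarray:
    """Toy linear 'encoder': a stand-in for the VAE mean head that maps
    an RNA-Seq profile to the latent conditioning vector z."""
    return expr @ W + b

def noise_image(x0: np.ndarray, alpha_bar_t: float, eps: np.ndarray) -> np.ndarray:
    """Standard DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    A trained denoiser would receive (x_t, t, z) and predict eps,
    with z injected e.g. via cross-attention; that network is omitted here."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Usage: encode one expression profile, then noise one image tile.
expr = rng.random(n_genes)
z = encode(expr)                      # conditioning vector, shape (64,)

x0 = rng.random((32, 32, 3))          # toy "tile"
eps = rng.normal(size=x0.shape)
xt = noise_image(x0, alpha_bar_t=0.5, eps=eps)
```

In a cascaded setup, one such conditional diffusion model generates a low-resolution tile and subsequent models upsample it, each stage again conditioned on z.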