
Basal Contamination of Bulk Sequencing: Lessons from the GTEx dataset

bioRxiv (2020)

Abstract
Background: One of the challenges of next generation sequencing (NGS) is contaminating reads from other samples. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, as a resource to understand the factors that contribute to contamination.

Results: We obtained 11,340 RNA-Seq samples, DNA variant call files (VCF) of 635 individuals, and technical metadata from GTEx, as well as read count data from the Human Protein Atlas (HPA) and a pharmacogenetics study. We analyzed 48 tissues in GTEx. Of these, 24 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and CELA3A). Fifteen additional highly expressed genes from other tissues were also indicative of contamination (KRT4, KRT13, PGC, CPA1, GP2, PRL, LIPF, CTRB2, FGA, HP, CKM, FGG, MYBPC1, MYH2, ZG16B). Sample contamination by non-native genes was highly associated with a sample being sequenced on the same day as a tissue that natively has high levels of those genes. This was highly significant for both pancreas genes (p = 2.7E-75) and esophagus genes (p = 8.9E-154). We used genetic polymorphism differences between individuals as validation of the contamination. Specifically, 11 SNPs in five genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes. Low-level contamination affected 1,841 (15.8%) samples (defined as ≥500 PRSS1 read counts). It also led to eQTL assignments in inappropriate tissues among these 19 genes. In support of this type of contamination occurring widely, pancreas gene contamination (PRSS1) was also observed in the HPA dataset, where pancreas samples were sequenced, but not in the pharmacogenomics dataset, where they were not.

Conclusions: Highly expressed, tissue-enriched genes basally contaminate the GTEx dataset, affecting some downstream GTEx data analyses. This type of contamination is not unique to GTEx, being shared with other datasets. Awareness of this process will reduce the misattribution of variable, low-level contaminating gene expression to disease processes.
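
The two checks described in the abstract (the ≥500 PRSS1 read-count cutoff and the comparison against same-day sequencing) can be illustrated with a minimal sketch. This is not the authors' pipeline: the record fields (`id`, `tissue`, `run_date`, `prss1_reads`), the restriction to non-pancreas samples, and the toy records are all assumptions made for illustration only.

```python
from collections import Counter

# Cutoff quoted in the abstract: >= 500 PRSS1 read counts.
PRSS1_THRESHOLD = 500


def flag_contaminated(samples, threshold=PRSS1_THRESHOLD):
    """Return IDs of non-pancreas samples whose PRSS1 count meets the cutoff.

    Assumption: the cutoff is applied only to tissues where PRSS1 is
    non-native, i.e. everything except pancreas.
    """
    return [
        s["id"]
        for s in samples
        if s["tissue"] != "Pancreas" and s["prss1_reads"] >= threshold
    ]


def same_day_table(samples, threshold=PRSS1_THRESHOLD):
    """Count non-pancreas samples by (shared a run date with pancreas, flagged)."""
    pancreas_days = {s["run_date"] for s in samples if s["tissue"] == "Pancreas"}
    table = Counter()
    for s in samples:
        if s["tissue"] == "Pancreas":
            continue
        shared_day = s["run_date"] in pancreas_days
        flagged = s["prss1_reads"] >= threshold
        table[(shared_day, flagged)] += 1
    return table


# Hypothetical toy records, invented for this sketch (not GTEx data).
samples = [
    {"id": "P1", "tissue": "Pancreas", "run_date": "day1", "prss1_reads": 900000},
    {"id": "L1", "tissue": "Lung", "run_date": "day1", "prss1_reads": 1200},
    {"id": "H1", "tissue": "Heart", "run_date": "day1", "prss1_reads": 40},
    {"id": "L2", "tissue": "Lung", "run_date": "day2", "prss1_reads": 12},
    {"id": "M1", "tissue": "Muscle", "run_date": "day2", "prss1_reads": 650},
]

flagged_ids = flag_contaminated(samples)
table = same_day_table(samples)
```

The resulting 2x2 counts could then be passed to a contingency test of one's choice; the abstract does not state which test produced the reported p-values, so none is assumed here.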
Keywords
GTEx, RNA-Seq, Contamination, scRNA-Seq, eQTL, PEER factors