Assessment of mapping strategies for determining the 5′-end of mRNAs and long-noncoding RNAs with short read sequences

bioRxiv (2020)

Abstract
Background: Genome mapping is an essential step in data processing for transcriptome analysis, and many previous studies have evaluated various methods and strategies for mapping RNA-seq data. Cap Analysis of Gene Expression (CAGE) is a sequencing-based protocol designed specifically to capture the 5′-ends of transcripts for quantitatively measuring the expression levels of transcription start sites genome-wide. Because CAGE analysis can also predict the activities of promoters and enhancers, this protocol has been an essential tool in studies of transcriptional regulation. Typically, the same mapping software is used to align both RNA-seq data and CAGE reads to a reference genome, but which mapping software and options are most appropriate for mapping the 5′-end sequence reads obtained through CAGE has not previously been evaluated systematically.

Results: Here we assessed various strategies for aligning CAGE reads, particularly ~50-bp sequences, with the human genome by using the HISAT2, LAST, and STAR programs both with and without a reference transcriptome. One of the major inconsistencies among the tested strategies involves alignments to pseudogenes and parent genes: some of the strategies prioritized alignments with pseudogenes even when the read could be aligned with coding genes with fewer mismatches. Another inconsistency concerned the detection of exon-exon junctions. These preferences depended on the program applied and whether a reference transcriptome was included. Overall, the choice of strategy yielded different mapping results for approximately 2% of all promoters.

Conclusions: Although the various alignment strategies produced very similar results overall, we noted several important and measurable differences. In particular, using the reference transcriptome in STAR yielded alignments with the fewest mismatches. In addition, the inconsistencies among the strategies were especially noticeable regarding alignments to pseudogenes and novel splice junctions. Our results indicate that the choice of alignment strategy is important because it might affect the biological interpretation of the data.

Abbreviations:
CAGE: Cap Analysis of Gene Expression
CTSS: CAGE tag start sites
FANTOM: Functional Annotation of the Mammalian Genome
lincRNA: Long intervening non-coding RNA
nAnT-iCAGE: No-amplification non-tagging CAGE
RAMPAGE: RNA Annotation and Mapping of Promoters for Analysis of Gene Expression
TSS: Transcriptional start site
Keywords
Transcription Start Site, CAGE, Mapping, Transcriptome, Reference transcriptome