SHRiMP: Accurate Mapping of Short Color-space Reads.
PLoS Computational Biology, no. 5 (2009)
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25-70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads ...
- Next generation sequencing (NGS) technologies are revolutionizing the study of variation among individuals in a population.
- The ability of sequencing platforms such as AB SOLiD and Illumina (Solexa) to sequence one billion basepairs or more in a few days has enabled the cheap re-sequencing of human genomes, with the genomes of a Chinese individual, a Yoruban individual, and matching tumor and healthy samples from a female individual sequenced in the last few months.
- While matching with up to a few differences is sufficient in regions of low polymorphism, these methods fail when the polymorphism level is high.
- One of the main application areas of next generation sequencing (NGS) technologies is the discovery of genomic variation within a given species.
- The first step in discovering this variation is the mapping of reads sequenced from a donor individual to a known ("reference") genome.
- These algorithms often sacrifice sensitivity for fast running time. While they are successful at mapping reads from organisms that exhibit low polymorphism rates, they do not perform well at mapping reads from highly polymorphic organisms.
- We present a novel read mapping method, SHRiMP, that can handle much greater amounts of polymorphism.
- Using Ciona savignyi as our target organism, we demonstrate that our method discovers significantly more variation than other methods.
- Details of the SHRiMP algorithm: The algorithm starts with a rapid k-mer hashing step to localize potential areas of similarity between the reads and the genome.
- For each k-mer in the genome, all matches of that k-mer among the reads are found.
- If a read has at least a specified number of k-mer matches within a given window of the genome, the authors execute a vectorized Smith-Waterman step to score and validate the similarity.
- SHRiMP was able to accurately map >46% of all reads with either 4 SNPs or 5 bp indels, despite the large number of sequencing errors in the dataset (doi:10.1371/journal.pcbi.1000386.t003).
- SHRiMP was able to accurately map 76% of reads with 2 SNPs and 0 indels, at 84% precision, and nearly half of all reads with 2 SNPs and 3 bp indels at 74% precision.
- Next Generation Sequencing (NGS) technologies are revolutionizing the way biologists acquire and analyze genomic data.
- Since the introduction of NGS technologies, many methods have been devised for mapping reads to reference genomes.
- These algorithms often sacrifice sensitivity for fast running time.
- The authors develop color-space extensions to classical alignment algorithms, allowing them to map color-space, or "dibase", reads generated by AB SOLiD sequencers.
- Table 1: Running time of SHRiMP for mapping 500,000 35 bp SOLiD C. savignyi reads to the 180 Mb reference genome on a single Core2 2.66 GHz processor
- Table 2: Mapping results for 135 million 35 bp SOLiD reads from Ciona savignyi using SHRiMP and the SOLiD mapper provided by Applied Biosystems
- Table 3: Color-space mapping accuracy of SHRiMP
- Table 4: Performance (in millions of cells per second) of the various Smith-Waterman implementations, including a regular implementation (not vectorized), Wozniak's diagonal implementation with memory lookups, Farrar's method and our diagonal approach without score lookups
- Funding: This work was sponsored by Natural Sciences and Engineering Research Council (NSERC) of Canada Undergraduate Student Research Awards, Canadian Institute for Health Research (CIHR), Applied Biosystems, NSERC Discovery Grant, MITACS, and a Canada Foundation for Innovation equipment grant
- 1. Wang J, Wang W, Li R, Li Y, Tian G, et al. (2008) The diploid genome sequence of an Asian individual. Nature 456: 60–65.
- 2. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53–59.
- 3. Ley TJ, Mardis ER, Ding L, Fulton B, Mclellan MD, et al. (2008) DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456: 66–72.
- 4. Bowtie. http://bowtie-bio.sourceforge.net.
- 5. mapreads. http://www.solidsoftwaretools.com/gf/project/mapreads.
- 6. Maq. http://maq.sourceforge.net.
- 7. Li M, Ma B, Kisman D, Tromp J (2004) Patternhunter ii: highly sensitive and fast homology search. J Bioinform Comput Biol 2: 417–439.
- 8. Li R, Li Y, Kristiansen K, Wang J (2008) SOAP: short oligonucleotide alignment program. Bioinformatics.
- 9. Lin H, Zhang Z, Zhang MQ, Ma B, Li M (2008) Zoom! zillions of oligos mapped. Bioinformatics 24: 2431–2437.
- 10. Ma B, Tromp J, Li M (2002) Patternhunter: faster and more sensitive homology search. Bioinformatics 18: 440–445.
- 11. Small KS, Brudno M, Hill MM, Sidow A (2007) Extreme genomic variation in a natural population. PNAS 104: 5698–5703.
- 12. Buhler J, Tompa M (2002) Finding motifs using random projections. J Comput Biol 9: 225–242.
- 13. Ondov B, Varadarajan A, Passalacqua KDD, Bergman NHH (2008) Efficient mapping of applied biosystems solid sequence data to a reference genome for functional genomic applications. Bioinformatics (Oxford, England).
- 14. Rasmussen K, Stoye J, Myers EW (2006) Efficient q-gram filters for finding all ε-matches over a given length. J of Computational Biology 13: 296–308.
- 15. Califano A, Rigoutsos I (1993) Flash: a fast look-up algorithm for string homology. Computer Vision and Pattern Recognition, 1993 Proceedings CVPR ’93, 1993 IEEE Computer Society Conference on. pp 353–359.
- 16. Rognes T, Seeberg E (2000) Six-fold speed-up of smith-waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics 16: 699–706.
- 17. Farrar M (2007) Striped smith-waterman speeds database searches six times over other simd implementations. Bioinformatics 23: 156–161.
- 18. Wozniak A (1997) Using video-oriented instructions to speed up sequence comparison. Comput Appl Biosci. pp 145–150.
- 19. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147: 195–197.
- 20. Yanovsky V, Rumble SM, Brudno M (2008) Read mapping algorithms for single molecule sequencing data. In: WABI. Springer, volume 5251 of Lecture Notes in Computer Science, 38–49. URL http://dblp.uni-trier.de/db/conf/wabi/wabi2008.html.
- 21. Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 87: 2264–2268.
- 22. Small KS, Brudno M, Hill MM, Sidow A (2007) A haplome alignment and reference sequence of the highly polymorphic ciona savignyi genome. Genome Biology 8: R41.
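
The seed-and-validate procedure summarized earlier (k-mer hashing of the reads, a scan over the genome, then a Smith-Waterman check on regions with enough k-mer hits) can be illustrated with a minimal Python sketch. This is not SHRiMP's implementation: it uses contiguous k-mers and a plain scalar Smith-Waterman rather than the vectorized one, and the function names (`build_read_index`, `smith_waterman`, `map_reads`), the scoring values, and the default thresholds are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


def build_read_index(reads, k):
    """Map each k-mer to the set of read ids that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].add(read_id)
    return index


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (scalar version)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def map_reads(genome, reads, k=4, min_hits=2, window=40, min_score=20):
    """Return (read_id, region_start, region_end, score) for validated hits.

    All thresholds here are illustrative assumptions, not SHRiMP defaults.
    """
    index = build_read_index(reads, k)
    recent = defaultdict(deque)    # read id -> genome positions of recent k-mer hits
    skip_until = defaultdict(int)  # read id -> end of the last region already scored
    candidates = []
    for g in range(len(genome) - k + 1):
        for read_id in index.get(genome[g:g + k], ()):
            if g < skip_until[read_id]:
                continue  # this stretch of the genome was already scored for the read
            hits = recent[read_id]
            hits.append(g)
            while hits[0] <= g - window:
                hits.popleft()  # forget hits that fell out of the window
            if len(hits) >= min_hits:
                read = reads[read_id]
                start = max(0, hits[0] - len(read))
                end = min(len(genome), g + k + len(read))
                score = smith_waterman(read, genome[start:end])
                if score >= min_score:
                    candidates.append((read_id, start, end, score))
                skip_until[read_id] = end
                hits.clear()
    return candidates


if __name__ == "__main__":
    genome = "G" * 10 + "ACGTTGCAATCGGATCCTAG" + "G" * 10
    reads = ["ACGTTGCAATCGGATCCTAG", "ACGTTGCAATCGCATCCTAG", "T" * 20]
    for hit in map_reads(genome, reads):
        print(hit)
```

In this toy input the first read matches the genome exactly, the second carries one substitution, and the third shares no k-mer with the genome, so only the first two reach the Smith-Waterman step. The filter is what keeps the expensive alignment off most read and genome pairs; lowering `min_hits` or `k` trades running time for sensitivity.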