ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework


In this work, we show how a commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to scale better through the shared-memory Plasma Object Store, which avoids large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we present an in-memory SAM representation, which we call ArrowSAM, and integrate the Apache Arrow framework into genome pre-processing applications, including BWA-MEM, sorting, and Picard, as use cases. Our implementation comprises three components. First, we integrated Apache Arrow into BWA-MEM to write its output SAM data in ArrowSAM. Second, we sorted all the ArrowSAM data by coordinate in parallel through pandas dataframes. Finally, we integrated Apache Arrow into the HTSJDK library (used in Picard for disk I/O handling), where all ArrowSAM data is processed in parallel for duplicate removal. This implementation gives promising performance improvements for genome data pre-processing in terms of both speedup and system resource utilization. Owing to the columnar data format, better cache locality is exploited in both applications, and shared memory objects enable parallel processing.
Genomics, Whole Genome/Exome Sequencing, Big Data, Apache Arrow, In-Memory, Parallel Processing
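
The columnar layout and coordinate sort described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: plain Python lists stand in for Arrow arrays so that the example runs with the standard library only, the function names `parse_sam_lines` and `sort_by_coordinate` are hypothetical, the sample records are made up, and only the first four SAM fields (QNAME, FLAG, RNAME, POS) are kept. Reference names are ordered lexicographically here, which is a simplification; a real coordinate sort follows the reference order given in the SAM header.

```python
# Columnar stand-in for a few SAM fields. Plain Python lists play the role
# of Arrow arrays (an assumption made to keep this sketch dependency-free).

def parse_sam_lines(lines):
    """Split SAM alignment lines into one list per field (first four fields only)."""
    columns = {"qname": [], "flag": [], "rname": [], "pos": []}
    for line in lines:
        if line.startswith("@"):  # header lines carry no alignment record
            continue
        fields = line.rstrip("\n").split("\t")
        columns["qname"].append(fields[0])
        columns["flag"].append(int(fields[1]))
        columns["rname"].append(fields[2])
        columns["pos"].append(int(fields[3]))
    return columns


def sort_by_coordinate(columns):
    """Reorder every column by (reference name, position).

    Simplification: reference names are compared as strings, not by
    their order in the SAM header.
    """
    n = len(columns["pos"])
    order = sorted(range(n), key=lambda i: (columns["rname"][i], columns["pos"][i]))
    return {name: [values[i] for i in order] for name, values in columns.items()}


# Made-up records for illustration only.
sam_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read3\t0\tchr2\t150\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read1\t0\tchr1\t900\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t100\t60\t4M\t*\t0\t0\tTTGA\tIIII",
]

columns = parse_sam_lines(sam_lines)
sorted_columns = sort_by_coordinate(columns)
print(sorted_columns["qname"])
```

Because each field lives in its own contiguous column, the sort only reads the `rname` and `pos` columns to compute the ordering, which is the access pattern behind the cache-locality benefit the abstract mentions. If pyarrow is installed, a dictionary of lists like `columns` can be passed to `pyarrow.table` to obtain an actual Arrow table.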