Storing Semi-structured Data on Disk Drives

semanticscholar(2008)

Abstract
ion provided by a general-purpose object storage manager [Carey et al. 1994], or use a combination of flat files and indices (e.g., XALAN [Xalan 2007], XT [XT 2007], Galax [Galax 2007], BLAST [Altschul et al. 1990], Timber [Jagadish et al. 2002] and Natix [Kanne and Moerkotte 2006]). Since these approaches retrofit existing storage mechanisms to work with semi-structured data, their scope is restricted to the underlying mechanisms, which are predominantly optimized for sequential accesses. Consequently, these approaches may result in a mismatch between the structure and navigational primitives of semi-structured data and the access characteristics of disk drives. In particular, semi-structured data have a tree (or graph) structure with tree-type operations. Relational databases, on the other hand, store structured tables that are optimized for row-based access, and flat files are unstructured and optimized for sequential access. Further complicating this mismatch, the underlying storage devices, i.e., disk drives, store information in circular tracks that are accessed with mechanical seek and rotational overhead.
Given the growing amount of semi-structured data, there is a need to re-examine the current storage and access machinery that supports them. In this paper, we explore strategies to optimize the storage and retrieval of semi-structured data on disk drives by explicitly accounting for the mismatch between the structure of the data and the storage and access characteristics of disk drives. In particular, we present algorithms that, given the physical characteristics of a disk drive (number of tracks, sectors per track, and rotational speed), place semi-structured data on the disk drive in a way that facilitates navigation of the data by reducing access overheads. Such low-level control of data layout is made possible by information provided by standard disk profiling tools [Worthington et al. 1995; Talagala et al. 1999; Dimitrijevic et al. 2004].
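To make the role of these physical parameters concrete, the following is a minimal sketch of how seek time and rotational delay might combine for a single access. It is not the paper's placement algorithm. The `DiskGeometry` class, the linear seek cost, and the single-zone, single-surface layout are all simplifying assumptions introduced here for illustration; real drives have nonlinear seek curves and zoned recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskGeometry:
    """Hypothetical single-zone, single-surface drive model (illustrative only)."""
    tracks: int
    sectors_per_track: int
    rpm: float
    seek_ms_per_track: float  # assumed linear seek cost; real seek curves are not linear

    @property
    def rotation_ms(self) -> float:
        """Time for one full revolution, in milliseconds."""
        return 60_000.0 / self.rpm

    @property
    def sector_ms(self) -> float:
        """Time for one sector to pass under the head, in milliseconds."""
        return self.rotation_ms / self.sectors_per_track


def positioning_ms(disk: DiskGeometry, head_track: int, head_sector: int,
                   track: int, sector: int) -> float:
    """Seek time plus rotational delay needed to reach (track, sector).

    The head starts at the beginning of head_sector on head_track, and the
    platter keeps rotating while the arm seeks.
    """
    if not (0 <= track < disk.tracks and 0 <= sector < disk.sectors_per_track):
        raise ValueError("target lies outside the modeled geometry")
    seek = abs(track - head_track) * disk.seek_ms_per_track
    # Angular positions expressed as time offsets within one revolution.
    head_angle_ms = head_sector * disk.sector_ms + seek
    target_angle_ms = sector * disk.sector_ms
    wait = (target_angle_ms - head_angle_ms) % disk.rotation_ms
    if disk.rotation_ms - wait < 1e-9:  # guard against floating-point round-off
        wait = 0.0
    return seek + wait


if __name__ == "__main__":
    disk = DiskGeometry(tracks=1000, sectors_per_track=100, rpm=6000,
                        seek_ms_per_track=1.0)
    # Same one-track seek, two different target sectors on the neighboring track.
    print(positioning_ms(disk, 0, 0, 1, 10))
    print(positioning_ms(disk, 0, 0, 1, 5))
```

In this toy model, a block on the neighboring track whose sector offset matches the seek duration is reached with no rotational wait, while a block a few sectors earlier on that track costs almost a full extra revolution. This is the kind of difference that a layout aware of track and sector geometry can try to exploit.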
The proposed technique first addresses the problem of grouping nodes of semi-structured data trees so that they can be mapped to disk blocks. We develop and experimentally evaluate our proposed grouping strategies and compare them with the Enhanced Kundu Misra (EKM) grouping strategy [Kanne and Moerkotte 2006]. Second, our proposed on-disk layout strategy for node groups optimizes common tree navigation operations such as parent-to-child and node-to-next-sibling traversals. Our on-disk layout strategies make use of the semi-sequential disk access technique [Schindler et al. 2004], which allows the reduction and even elimination of rotational delay overhead during disk accesses.
Given that our approach requires circumventing the prevalent logical block abstraction, applying our layout strategy to a general-purpose storage system is not straightforward. Our goal in this paper is simply to expose the merits and demerits of this approach. Through experiments we show that our proposed approach is superior for a dedicated single-user storage system with standard caching and prefetching capabilities, for instance, a specialized system for analysis of biological data (suffix trees) [Bedathur and Haritsa 2006]. Based on this study, we believe that our approach provides a fresh perspective on the problem of storing semi-structured data that is worth the attention and research time of the community.
Footnote: Prior research has made a similar argument in favor of fine-grained data layout by circumventing the logical block abstraction, for the case of tabular data [Schindler et al. 2004].
ACM Transactions on Storage, Vol. V, No. N, Month 20YY.
To evaluate the proposed native data layout techniques, we used XML as a case study. XML is becoming increasingly popular due to its ability to represent arbitrary semi-structured data.
It is the de facto data representation format for many modern applications, including Geographic Information Systems Markup Language (GML) [GML 2008], Medical Markup Language (MML) [MML 2008], Health Level 7 (HL7) [HL7 2008], Clinical Document Architecture (CDA) [Dolin et al. 2006] used to represent Electronic Health Records (EHRs), Open Document Format (ODF) [ODS 2008; OOX 2008], and Scalable Vector Graphics (SVG) [SVG 2008] used to describe two-dimensional graphics and graphical applications. Despite the widespread use of XML, optimizing access to XML data stores remains a key challenge, one also identified in the latest report [Abiteboul et al. 2005] on future directions in database research, published every few years by the database research community.

Table I. Query classification of popular XML benchmarks.

Benchmark    Workload           Document size     Total # queries   # Non-deep-focused   # Deep-focused
TPoX         Financial app      2 - 25 KB         11                4                    7
XMach-1      E-commerce app     2 - 100 KB        7                 4                    3
XMark        Auction Website    10 MB - 10 GB     20                13                   7
XPathMark    Education app      10 MB - 10 GB     54                20                   34
XOO7         Web app            4 MB - 1 GB       23                4                    19
XBench       Publications DB    1 KB - 10 GB      17                11                   6
MemBeR       Synthetic          11 MB             7                 0                    7
MBench       Synthetic          50 MB - 50 GB     37                37                   0
Total                                             176               93                   83

Recent surveys of popular XML benchmarks [Afanasiev and Marx 2006; Böhme and Rahm 2003; Nambiar et al. 2001] show that all queries to XML data can be classified into deep-focused and non-deep-focused queries. In Table I, we summarize the key XML benchmarks available in the public domain. The Transaction Processing over XML (TPoX) benchmark [Nicola et al. 2007] evaluates the performance of XML stores, XML databases, indexes, etc. by generating a mix of XQueries for various financial transactions on the generated XML documents. XMach-1 [Böhme and Rahm 2001; 2003], XOO7 [Bressan et al. ], XMark [Schmidt et al. 2002a] and XPathMark [Franceschet 2005] are typically used to evaluate query optimizations in XML.
XMach-1 is based on an E-commerce website, while XMark generates queries for an E-commerce website with information on bids, items, brokers and customers. XPathMark [Franceschet 2005] is an XPath-based benchmark for XMark and generates an educational document that represents the English alphabet. The XBench [Yao et al. 2003] benchmark is an application-oriented benchmark for XML databases. Finally, MemBeR [Afanasiev et al. 2005; Manolescu et al. 2006] and the Michigan Benchmark (MBench) [Runapongsa et al. 2003] are both micro-benchmarks that generate synthetic workloads wherein document structure can be finely controlled (by varying depth and fan-out) so as to reproduce the access patterns of a variety of real-world workloads.
[Authors: Medha Bhadkamkar et al.]
This collection of well-accepted and standardized XML benchmarks demonstrates (i) that XML documents can be fairly large, sometimes running into tens of gigabytes; combined with the fact that XML parsers can consume as much as 5X the original size of the XML document in main memory during parsing [Nicola and John 2003], this implies that secondary storage accesses must be optimized if at all possible, and (ii) that non-deep-focused queries form at least half of the total queries suggested within these popular XML benchmarks; this implies that optimizing accesses for the non-deep-focused query class is at least as important as optimizing for the deep-focused class. Further, in the event that a workload generates both classes of queries with similar frequency, the storage system could conceivably store data using both the traditional approach and the tree-based approach, with the caveat that this dual approach requires more consideration for write-dominant workloads, which can incur an unacceptable amount of overhead for maintaining consistency.
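The "at least half" claim can be checked directly against the query counts in Table I. The short script below is only a sanity check on the published totals; the variable and function names are introduced here for illustration.

```python
# (benchmark, total queries, non-deep-focused, deep-focused), copied from Table I.
TABLE_I = [
    ("TPoX", 11, 4, 7),
    ("XMach-1", 7, 4, 3),
    ("XMark", 20, 13, 7),
    ("XPathMark", 54, 20, 34),
    ("XOO7", 23, 4, 19),
    ("XBench", 17, 11, 6),
    ("MemBeR", 7, 0, 7),
    ("MBench", 37, 37, 0),
]


def summarize(rows):
    """Return (total, non_deep, deep, non_deep_share) after checking each row."""
    for name, total, non_deep, deep in rows:
        if non_deep + deep != total:
            raise ValueError(f"row {name} does not add up")
    total = sum(r[1] for r in rows)
    non_deep = sum(r[2] for r in rows)
    deep = sum(r[3] for r in rows)
    return total, non_deep, deep, non_deep / total


if __name__ == "__main__":
    total, non_deep, deep, share = summarize(TABLE_I)
    print(f"{non_deep} of {total} queries are non-deep-focused ({share:.1%})")
    # prints "93 of 176 queries are non-deep-focused (52.8%)"
```

Note that the figure is an aggregate over all eight benchmarks; the split within an individual benchmark varies widely, from none of the MemBeR queries to all of the MBench queries.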
For evaluating our native layout proposals, we employ XPath queries [XPath 2007] obtained from the XPathMark benchmark. We examine the relative performance of native layout against the default approach, which stores XML files sequentially. To do so, we augmented an existing XML parsing engine to implement the grouping techniques that we propose. To evaluate disk I/O performance, we used an instrumented DiskSim disk simulator [Bucy et al. 2003] and replayed the block access traces generated by XML query processing engines. Our evaluation also addresses I/O performance in the presence of query parallelism, as would be typical for server environments. In summary, these experiments reveal that while the default sequential layout provides superior performance for the deep-focused class of XML queries (or access patterns retrieving entire subtrees of semi-structured data), the proposed native layout techniques outperform the default for all other query access patterns.
The rest of the paper is organized as follows. Section 2 presents the architecture of a native semi-structured storage system and the model used for semi-structured data and their access. In Section 3, we present native data-layout strategies for semi-structured data on disk drives. In Section 4, we present strategies for organizing and grouping nodes in the tree so that they can be mapped to disk blocks. In Section 5, we conduct a theoretical analysis of the performance impact of data layout. In Section 6, we evaluate the proposed approach for the case of XML data by comparing it against the default sequential layout. We survey related work in Section 7. We conclude and discuss future directions in Section 8.
2. SYSTEM ARCHITECTURE AND DATA MODEL
In this section, we propose an architecture for building a native semi-structured storage system that allows the use of our layout techniques with minimal changes to the current storage stack.
We also present the semi-structured data and access model abstractions.
2.1 Modifying the Storage Stack
Modern disk drives provide a high-level logical block abstraction to the operating system, which does not export information about the physical data layout, performance characteristics, and internal operation of the disk drive. We propose a modified