Strabo 2: Distributed Management of Massive Geospatial RDF Datasets.

IEEE International Semantic Web Conference (2022)

Abstract
We present STRABO 2, a distributed geospatial RDF store able to process GeoSPARQL queries over massive RDF datasets. STRABO 2 is based on robust technologies that can scale to TBs of data distributed over hundreds of nodes. Specifically, we use the Spark framework, enhanced with the geospatial library SEDONA, for distributed in-memory processing on Hadoop clusters, and Hive for compact persistent storage of RDF data. STRABO 2 employs a flexible design that can store and partition thematic RDF data using different relational schemas, and stores spatial data in a separate Hive table by taking the GeoSPARQL vocabulary into consideration. STRABO 2 is cluster-friendly in terms of both memory and disk, since it compresses triples using a partial encoding technique in addition to the compression schemes of the Parquet data file format. GeoSPARQL queries are translated into the Spark SQL dialect, enhanced with the spatial functions and predicates offered by SEDONA. During this process the system takes into consideration SEDONA's capabilities for both spatial selections and spatial joins, in order to apply optimizations that result in efficient query processing. We experimentally test STRABO 2 on an award-winning Hadoop-based cluster environment and demonstrate STRABO 2's excellent scalability while handling massive synthetic and real-world datasets. We also show that STRABO 2 clearly outperforms state-of-the-art centralized engines in a single-server setup once the dataset size increases beyond a few GBs.
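
The storage layout described in the abstract can be illustrated with a small sketch. The abstract does not say which relational schemas STRABO 2 supports, so this example assumes vertical partitioning (one subject/object table per predicate) as one possible schema, and it assumes that geometries are identified by the GeoSPARQL property geo:asWKT. The function name partition_triples and the sample data are hypothetical, and plain Python lists stand in for Hive tables; this is not STRABO 2's actual implementation.

```python
from collections import defaultdict

# GeoSPARQL property that links a geometry to its WKT serialization.
GEO_AS_WKT = "http://www.opengis.net/ont/geosparql#asWKT"


def partition_triples(triples):
    """Split (subject, predicate, object) triples into per-predicate
    thematic tables and one separate geometry table.

    Assumption: vertical partitioning is used for thematic data; the
    abstract only says that "different relational schemas" are possible.
    """
    thematic = defaultdict(list)  # predicate -> list of (subject, object) rows
    geometries = []               # list of (geometry IRI, WKT literal) rows
    for s, p, o in triples:
        if p == GEO_AS_WKT:
            geometries.append((s, o))
        else:
            thematic[p].append((s, o))
    return dict(thematic), geometries


# Hypothetical sample data, for illustration only.
sample = [
    ("ex:athens", "rdf:type", "ex:City"),
    ("ex:athens", "ex:hasName", "Athens"),
    ("ex:athens", "http://www.opengis.net/ont/geosparql#hasGeometry", "ex:athensGeom"),
    ("ex:athensGeom", GEO_AS_WKT, "POINT(23.72 37.98)"),
]

thematic, geometries = partition_triples(sample)
print(sorted(thematic))  # names of the per-predicate tables
print(geometries)        # rows of the separate geometry table
```

Keeping the WKT literals in their own table means that spatial selections and joins only need to scan geometry rows, while thematic predicates are answered from the smaller per-predicate tables.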