Pushing Data-Induced Predicates Through Joins in Big-Data Clusters

Proceedings of The Vldb Endowment, pp. 252-265, 2019.

Cited by: 0|Bibtex|Views92
Other Links: academic.microsoft.com
Weibo:
We present a new technique that extends the gains from data skipping; the predicate on a table is converted into new data-induced predicates that can apply on joining tables

Abstract:

Using data statistics, we convert predicates on a table into data induced predicates (diPs) that apply on the joining tables. Doing so substantially speeds up multi-relation queries because the bene ts of predicate pushdown can now apply beyond just the tables that have predicates. We use diPs to skip data exclusively during query optimiz...More

Code:

Data:

0
Introduction
  • The authors seek to extend the bene ts of predicate pushdown beyond just the tables that have predicates.
  • Some systems implement a form of sideways information passing over joins [ , ] during query execution
  • They may build a bloom lter over the values of the join column partkey in the rows that satisfy the predicate on the part table and use this bloom lter to skip rows from the lineitem table.
  • This technique only applies during query execution, does not extend to general joins and has high overheads, especially during parallel execution on large datasets because constructing the bloom lter becomes a scheduling barrier delaying the scan of lineitem until the bloom lter has been constructed
Highlights
  • In this paper, we seek to extend the bene ts of predicate pushdown beyond just the tables that have predicates
  • We present an optimal schedule for tree-like join graphs which converges to xed point and achieves all possible data skipping while computing the fewest number of data induced predicates (diPs)
  • We show the plan generated using magic-set transformations which push group-by above the join. diPs complement magic-set transformations; we see here that magic-set tx cannot skip partitions of lineitem but because group-by has been pushed above the join, moving diPs sideways once is enough unlike the case in Figure
  • While data-induced predicates are similar to the implied integrity constraints used by [ ], there are some key di erences and additional contributions. ( ) [ ] only exchanges constraints between a pair of relations; we o er a method which exchanges diPs between multiple relations, handles cyclic joins and supports queries having group-by’s, union’s and other operations. ( ) [ ] uses zone maps and two bucket histograms; we o er a new statistic that performs better. ( ) [ ] shows no query performance improvements; we show speed-ups in both a big-data cluster and a DBMS. ( ) [ ] o ers no results in the presence of data updates; we design and evaluate two maintenance techniques that can be built into transactional systems
  • We present a new technique that extends the gains from data skipping; the predicate on a table is converted into new data-induced predicates that can apply on joining tables
  • Data-induced predicates are possible, at a fundamental level, because of implicit or explicit clustering that already exists in datasets
Methods
  • Queries: The authors report results on TPC-H [ ], TPC-DS [ ] and the join order benchmark (JOB) [ ].
  • The authors use all queries from TPC-H but because TPC-DS and JOB have many more queries the authors pick from them and queries respectively.
  • The authors choose JOB for its cyclic join queries.
  • The authors choose TPC-DS because it has complex queries.
  • Query predicates are complex; e.g., q from TPC-H has clauses over columns from multiple relations.
  • While inner-joins dominate, the queries have self-, semi- and outer joins
Results

  • The authors will show that using diPs leads to sizable gains across queries from TPC-H, TPC-DS and Join Order Benchmark, across di erent data distributions and physical layouts and across statistics (§ . ).

    e costs to achieve these gains are small and range-sets o er more gains in more cases than zone-maps (§ . ).
  • The authors will show that using diPs leads to sizable gains across queries from TPC-H, TPC-DS and Join Order Benchmark, across di erent data distributions and physical layouts and across statistics (§ .
  • ). The authors show that diPs are complementary to and sometimes better than using join indexes, materializing denormalized views or clustering rows in § .
  • ; these alternatives have much higher maintenance costs unlike diPs which work in-situ using small per-table statistics and a small increase to QO time
  • The authors show that diPs are complementary to and sometimes better than using join indexes, materializing denormalized views or clustering rows in § . ; these alternatives have much higher maintenance costs unlike diPs which work in-situ using small per-table statistics and a small increase to QO time
Conclusion
  • Human-digestible insights increasingly use queries with selective predicates.
  • Data-induced predicates are possible, at a fundamental level, because of implicit or explicit clustering that already exists in datasets.
  • The authors' method to construct diPs leverages data statistics and works with a variety of simple statistics, some of which are already maintained in today’s clusters.
  • In contrast to prior work that o ers data skipping only in the presence of complex auxiliary structures, workload-aware adaptations and changes to query execution, using diPs is radically simple.
  • The authors' results in a large dataparallel cluster and a DBMS show that large gains are possible across a wide variety of queries, data distributions and layouts
Summary
  • Introduction:

    The authors seek to extend the bene ts of predicate pushdown beyond just the tables that have predicates.
  • Some systems implement a form of sideways information passing over joins [ , ] during query execution
  • They may build a bloom lter over the values of the join column partkey in the rows that satisfy the predicate on the part table and use this bloom lter to skip rows from the lineitem table.
  • This technique only applies during query execution, does not extend to general joins and has high overheads, especially during parallel execution on large datasets because constructing the bloom lter becomes a scheduling barrier delaying the scan of lineitem until the bloom lter has been constructed
  • Methods:

    Queries: The authors report results on TPC-H [ ], TPC-DS [ ] and the join order benchmark (JOB) [ ].
  • The authors use all queries from TPC-H but because TPC-DS and JOB have many more queries the authors pick from them and queries respectively.
  • The authors choose JOB for its cyclic join queries.
  • The authors choose TPC-DS because it has complex queries.
  • Query predicates are complex; e.g., q from TPC-H has clauses over columns from multiple relations.
  • While inner-joins dominate, the queries have self-, semi- and outer joins
  • Results:


    The authors will show that using diPs leads to sizable gains across queries from TPC-H, TPC-DS and Join Order Benchmark, across di erent data distributions and physical layouts and across statistics (§ . ).

    e costs to achieve these gains are small and range-sets o er more gains in more cases than zone-maps (§ . ).
  • The authors will show that using diPs leads to sizable gains across queries from TPC-H, TPC-DS and Join Order Benchmark, across di erent data distributions and physical layouts and across statistics (§ .
  • ). The authors show that diPs are complementary to and sometimes better than using join indexes, materializing denormalized views or clustering rows in § .
  • ; these alternatives have much higher maintenance costs unlike diPs which work in-situ using small per-table statistics and a small increase to QO time
  • The authors show that diPs are complementary to and sometimes better than using join indexes, materializing denormalized views or clustering rows in § . ; these alternatives have much higher maintenance costs unlike diPs which work in-situ using small per-table statistics and a small increase to QO time
  • Conclusion:

    Human-digestible insights increasingly use queries with selective predicates.
  • Data-induced predicates are possible, at a fundamental level, because of implicit or explicit clustering that already exists in datasets.
  • The authors' method to construct diPs leverages data statistics and works with a variety of simple statistics, some of which are already maintained in today’s clusters.
  • In contrast to prior work that o ers data skipping only in the presence of complex auxiliary structures, workload-aware adaptations and changes to query execution, using diPs is radically simple.
  • The authors' results in a large dataparallel cluster and a DBMS show that large gains are possible across a wide variety of queries, data distributions and layouts
Tables
  • Table1: Data statistics maintained by several systems
  • Table2: Constructing diPs using partition statistics
  • Table3: Notation used in this paper
  • Table4: Greedily growing a range-set in the presence of updates
  • Table5: The InputCut from diPs for different benchmarks and different layouts; each query and data layout (see §5.1) contribute a point to a CDF and the table shows values at various percentiles
  • Table6: The additional latency to derive diPs in seconds compared to the baseline QO latency; see §5.3
  • Table7: Additional results for experiments in Figure 11, Table 5 and Figure 12. The table shows data from our SCOPE cluster
  • Table8: The time to greedily update range-sets of various sizes (in nanoseconds) measured on a desktop
Download tables as Excel
Related work
  • To the best of our knowledge, this paper is the rst system to skip data across joins for complex queries during query optimization.

    ese are fundamental di erences: diPs rely only on simple percolumn statistics, are built on-the- y in the QO, can skip partitions of multiple joining relations, support di erent join types and work with complex queries; the resulting plans only read subsets of the input relations and have no execution-time overhead.

    Some research works discover data properties such as functional dependencies and column correlations and use them to improve query plans [ , , , ]. Inferring such data properties is a sizable cost (e.g., [ ] uses student t-test between every pair of columns). It is unclear if these properties can be maintained when data evolves. More importantly, imprecise data properties are less useful for QO (e.g., a so functional dependency does not preserve set multiplicity and hence cannot guarantee correctness of certain plan transformations over group-bys and joins). A SQL server option [ ] uses the fact that the l shipdate attribute of lineitem is between to days larger than o orderdate from orders [ ] to convert predicates on l shipdate to predicates on o orderdate and vice versa. Others discover similar constraints more broadly [ ,
Reference
  • [ ] big-data and analytics forecast. https://bit.ly/2TtKyjB.
    Findings
  • [ ] Apache orc spec. v. https://bit.ly/2J5BIkh.[ ] Apache spark join guidelines and performance tuning.
    Locate open access versionFindings
  • https://bit.ly/2Jd87We.[ ] Band join.https://bit.ly/2kixJJn.[ ] Bitmap join indexes in oracle.https://bit.ly/2TLBBTF.[ ] Clustered and nonclustered indexes described.
    Locate open access versionFindings
  • https://bit.ly/2Drdb9o.[ ] Columnstore index performance: Rowgroup elimination.
    Findings
  • https://bit.ly/2VFpljV.[ ] Columnstore indexes described.
    Findings
  • https://bit.ly/2F7LZuI.[ ] Data skipping index in spark.https://bit.ly/2qONacb.[ ] Date correlation optimzation in sql server &.
    Locate open access versionFindings
  • https://bit.ly/2VodSVN.[ ] Imdb datasets.https://imdb.to/2S3BzSF.[ ] Join order benchmark.https://bit.ly/2tTRyIb.[ ] e junction tree algorithm.https://bit.ly/2lPHNtA.[ ] Oracle database guide: Using zone maps.
    Findings
  • https://bit.ly/2qMeO9E.[ ] Oracle: Using zone maps.https://bit.ly/2vsUWKK.[ ] Parquet thri format.https://bit.ly/2vm6D5U.[ ] Presto: Repartitioned and replicated joins.
    Locate open access versionFindings
  • https://bit.ly/2JauYll.[ ] Processing petabytes of data in seconds with databricks delta.
    Findings
  • https://bit.ly/2Pryf2E.[ ] Pushing data-induced predicates through joins in bigdata clusters; extended version.https://bit.ly/2WhTwP1.[ ] Query in tpc-h, see page.https://bit.ly/2kJRV72.[ ] Query execution bitmap lters.https://bit.ly/2NJzzgF.[ ] Redshi: Choosing the best sort key.
    Locate open access versionFindings
  • https://amzn.to/2AmYbXh.[ ] Teradata: Join index.https://bit.ly/2FbalDT.[ ] TPC-DS Benchmark.http://www.tpc.org/tpcds/.[ ] Tpc-ds query.https://bit.ly/2U0rIk6.[ ] TPC-H Benchmark.http://www.tpc.org/tpch.[ ] Vertica: Choosing sort order: Best practices.
    Findings
  • https://bit.ly/2yrvPtG.[ ] Views in sql server.https://bit.ly/2CnbmIo.[ ] Program for tpc-h data generation with skew.
    Locate open access versionFindings
  • https://bit.ly/2wvdNVo,.[ ] A. Aboulnaga and S. Chaudhuri. Self-tuning histograms: Building histograms without looking at data. In SIGMOD,.
    Findings
  • [ ] A. Nanda. Oracle exadata: Smart scans meet storage indexes. http://bit.ly/2ha7C5u,.
    Findings
  • [ ] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. https://bit.ly/2yurPIS,.
    Findings
Full Text
Your rating :
0

 

Tags
Comments