Computing Data Distribution from Query Selectivities
CoRR(2024)
Abstract
We are given a set $\mathcal{Z}=\{(R_1,s_1),\ldots,(R_n,s_n)\}$, where each
$R_i$ is a range in $\mathbb{R}^d$, such as a rectangle or a ball, and
$s_i\in[0,1]$ denotes its selectivity. The goal is to compute a small-size
discrete data distribution $\mathcal{D}=\{(q_1,w_1),\ldots,(q_m,w_m)\}$,
where $q_j\in\mathbb{R}^d$ and $w_j\in[0,1]$ for each $1\le j\le m$, and
$\sum_{1\le j\le m} w_j=1$, such that $\mathcal{D}$ is the most consistent
with $\mathcal{Z}$, i.e.,
$\mathrm{err}_p(\mathcal{D},\mathcal{Z})=\frac{1}{n}\sum_{i=1}^{n}\left|s_i-\sum_{j=1}^{m}w_j\cdot\mathbb{1}(q_j\in R_i)\right|^p$
is minimized. In a database setting, $\mathcal{Z}$ corresponds to a workload
of range queries over some table, together with their observed selectivities
(i.e., the fraction of tuples returned), and $\mathcal{D}$ can be used as a
compact model for approximating the data distribution within the table
without accessing the underlying contents.
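To make the objective concrete, the following is a minimal Python sketch that evaluates err_p for a given discrete distribution against a workload of axis-aligned rectangles. It only evaluates the error; it is not the algorithm from the paper. The names `in_rect`, `estimated_selectivity`, and `err_p`, the closed-rectangle convention, and the toy data are illustrative assumptions. For p=∞ the sketch assumes the usual convention of taking the maximum absolute deviation, which the abstract does not spell out.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # (lower corner, upper corner); assumed closed and axis-aligned


def in_rect(q: Point, rect: Rect) -> bool:
    """Indicator 1(q in R) for an axis-aligned rectangle R."""
    lo, hi = rect
    return all(l <= x <= h for l, x, h in zip(lo, q, hi))


def estimated_selectivity(dist: List[Tuple[Point, float]], rect: Rect) -> float:
    """Total weight of the distribution's points that fall inside rect."""
    return sum(w for q, w in dist if in_rect(q, rect))


def err_p(dist: List[Tuple[Point, float]],
          workload: List[Tuple[Rect, float]],
          p: float) -> float:
    """Mean of |s_i - estimate_i|^p over the workload.

    Assumption: for p = infinity, return the maximum absolute deviation.
    """
    residuals = [abs(s - estimated_selectivity(dist, r)) for r, s in workload]
    if p == float("inf"):
        return max(residuals)
    return sum(r ** p for r in residuals) / len(residuals)


# Toy instance in d = 2 (made-up numbers, for illustration only).
dist = [((0.25, 0.25), 0.5), ((0.75, 0.75), 0.5)]
workload = [
    (((0.0, 0.0), (0.5, 0.5)), 0.5),   # matched exactly by the first point
    (((0.0, 0.0), (1.0, 1.0)), 1.0),   # matched exactly by both points
    (((0.5, 0.5), (1.0, 1.0)), 0.25),  # estimate is 0.5, so the residual is 0.25
]

for p in (1, 2, float("inf")):
    print(p, err_p(dist, workload, p))
```

On this toy instance only the third query contributes a residual (0.25), so err_1 is 0.25/3, err_2 is 0.0625/3, and the p=∞ error under the assumed convention is 0.25.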
In this paper, we obtain both upper and lower bounds for this problem. In
particular, we show that the problem of finding the best data distribution from
selectivity queries is $\mathsf{NP}$-complete. On the positive side, we
describe a Monte Carlo algorithm that constructs, in time
$O((n+\delta^{-d})\,\delta^{-2}\,\mathrm{polylog})$, a discrete
distribution $\tilde{\mathcal{D}}$ of size $O(\delta^{-2})$, such that
$\mathrm{err}_p(\tilde{\mathcal{D}},\mathcal{Z})\le\min_{\mathcal{D}}\mathrm{err}_p(\mathcal{D},\mathcal{Z})+\delta$
(for $p=1,2,\infty$), where the minimum is taken over all discrete
distributions. We also establish conditional lower bounds, which strongly
indicate the infeasibility of relative approximations as well as of removing
the exponential dependency on the dimension for additive approximations. This
suggests that significant improvements to our algorithm are unlikely.