# Learning-Augmented Data Stream Algorithms

ICLR, 2020.

Keywords:

streaming algorithms, heavy hitters, F_p moment, distinct elements, cascaded norms

Abstract:

The data stream model is a fundamental model for processing massive data sets with limited memory and fast processing time. Recently, Hsu et al. (2019) incorporated machine learning techniques into the data stream model in order to learn relevant patterns in the input data. Such techniques were encapsulated by training an oracle to predict...

Introduction

- Processing data streams has been an active research field in the past two decades. This is motivated by the increasingly common scenario where the size of the data far exceeds the size of the available storage, and the only feasible access to the data is to make a single or a few passes over the data.
- The authors note that the O-bit algorithm is optimal even given a heavy hitter oracle; see the Ω lower bound in Alon et al. (1999), which holds even if all item frequencies are in the set {0, 1, 2}.

Highlights

- Processing data streams has been an active research field in the past two decades
- In the data stream model, we assume there is an underlying frequency vector $x \in \mathbb{Z}^n$, initialized to $0^n$, which evolves throughout the course of a stream
- The space complexity of the algorithm is measured in bits and the goal is to use much less than the trivial n bits of space required to store x
- Our results: we show that a heavy hitter oracle can greatly improve the complexity of a wide array of commonly studied problems in the data stream model, leading to the first optimal bounds for several important problems and breaking lower bounds that have stood in the way of further progress on them
- The goal is to estimate the frequency moments of the vector that records the number of occurrences of each search query
- Half of the buckets are used to store the heavy items, and the other half are used by the sub-sampling algorithm with the precision sampling estimators to estimate the frequency moment of the light elements
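
The data stream model described above can be illustrated with a minimal sketch. It keeps the whole frequency vector, so it is the trivial baseline that streaming algorithms aim to beat, not a small-space algorithm. The function names (`apply_stream`, `f_p_moment`, `l0`) and the `(index, delta)` update format are assumptions chosen for illustration.

```python
from collections import defaultdict


def apply_stream(updates):
    """Apply (index, delta) updates to a frequency vector that starts at zero.

    Only nonzero coordinates are kept, so the result is a sparse view of x.
    """
    x = defaultdict(int)
    for i, delta in updates:
        x[i] += delta
        if x[i] == 0:
            del x[i]  # a fully deleted item no longer counts as present
    return dict(x)


def f_p_moment(x, p):
    """Exact F_p moment: the sum of |x_i|^p over nonzero coordinates."""
    return sum(abs(v) ** p for v in x.values())


def l0(x):
    """Exact number of distinct elements (nonzero coordinates)."""
    return len(x)


# A small stream with insertions and deletions (hypothetical data).
stream = [(3, +2), (7, +1), (3, -1), (7, -1), (5, +4)]
x = apply_stream(stream)
```

Here item 7 is inserted and then deleted, so it contributes to neither the distinct-element count nor the moments.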

Results

- $F_p$-Moment Estimation, $0 < p < 2$: There is a long line of work on this problem, with the best known bounds given in Kane et al. (2011), which achieve an optimal $O(\epsilon^{-2} \log n)$ bits of space and $O(\log^2(1/\epsilon) \log\log(1/\epsilon))$ time to process each element.
- In the higher-level ROUGHL0ESTIMATOR algorithm, whenever an update is made to a heavy coordinate i identified by the oracle, the corresponding bucket inside EXACTCOUNT is marked as nonempty regardless of the counter value. This is safe because the heavy item will never be entirely deleted: by definition, it is heavy at the end of the stream.
- The main idea is that the authors separately estimate the Fp-moment of the heavy hitters, and for the remaining light elements, the authors use sub-sampling to estimate their contribution with sampling rate 1/ρ.
- With a noisy heavy hitter oracle with error probability δ, the authors can estimate the $F_p$-moment within a factor $1 \pm 2\epsilon$ in $O(\epsilon^{-4} n^{1/2-1/p} \log(n) \log(M))$ bits of space when δ =
- To further demonstrate the advantage of having a heavy hitter oracle, the authors run the previous algorithm due to Kane et al. (2010b) and the modified algorithm on the synthetic data designed as follows: first, the authors generate an input vector x of dimension $n = 10^6$ with i.i.d. entries uniform on {0, 1, .
- They use the first 5 days for training, the following day for validation, and estimate the number of times different search queries appear in subsequent days.
- Half of the buckets are used to store the heavy items, and the other half are used by the sub-sampling algorithm with the precision sampling estimators to estimate the frequency moment of the light elements.
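
The split between heavy and light elements can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the light part uses plain per-item sub-sampling with exact counters in place of precision sampling, the `keep` dictionary stands in for a hash function (so the sketch does not achieve sublinear space), and `oracle_fp_estimate`, `is_heavy`, and `rho` are hypothetical names.

```python
import random


def oracle_fp_estimate(updates, is_heavy, p, rho, seed=0):
    """Estimate the F_p moment with help from a heavy hitter oracle.

    Items flagged by `is_heavy` are counted exactly. Each remaining (light)
    item is kept with probability 1/rho, and the moment of the kept items is
    scaled by rho, which makes the light-part estimate unbiased.
    """
    rng = random.Random(seed)
    heavy = {}    # exact counters for oracle-flagged items
    sampled = {}  # exact counters for light items that survive sub-sampling
    keep = {}     # per-item sampling decision (stand-in for a hash function)
    for i, delta in updates:
        if is_heavy(i):
            heavy[i] = heavy.get(i, 0) + delta
            continue
        if i not in keep:
            keep[i] = rng.random() < 1.0 / rho
        if keep[i]:
            sampled[i] = sampled.get(i, 0) + delta
    heavy_part = sum(abs(v) ** p for v in heavy.values())
    light_part = rho * sum(abs(v) ** p for v in sampled.values())
    return heavy_part + light_part


# Hypothetical stream: item 0 is heavy, items 1 and 2 are light.
updates = [(0, 100), (1, 1), (2, 1), (1, 1)]
exact = oracle_fp_estimate(updates, lambda i: i == 0, p=2, rho=1)
```

With `rho = 1` every light item is kept, so the estimate equals the exact moment. Larger values of `rho` trade accuracy on the light part for fewer stored counters, while the heavy part stays exact.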

Conclusion

- It is clear from the plots that the estimation error of the oracle-aided algorithm is about 1% even when the total number of buckets is small, demonstrating a strong advantage over classical precision sampling estimators.
- Using $2 \cdot 10^6$ buckets, i.e., 2% of the space needed to store the entire vector x, the algorithm achieves a 15% relative estimation error, while classical precision sampling achieves only a 27% relative error.

- Table 1: Summary of previous results and the results obtained in this work. We assume that m, M = poly(n). In the column of result type, S denotes space complexity and T denotes time complexity. We view $\epsilon$ as a constant for the listed results of the $F_p$ Moment and Cascaded Norm problems
- Table 2: Summary of the threshold of the heavy hitter oracles used in each problem. Note that two oracles are used for the $F_p$ Moment Fast Update problem

Funding

- Li was supported in part by Singapore Ministry of Education (AcRF) Tier 2 grant MOE2018-T2-1-013
- Woodruff acknowledges partial support from the National Science Foundation under Grant No. CCF-1815840 and the Office of Naval Research (ONR) under grant N00014-181-2562

Reference

- Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
- Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David P. Woodruff. Efficient sketches for earthmover distance, with applications. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pp. 324–330, 2009.
- Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. In IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, pp. 363–372, 2011.
- Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pp. 603–614, 2018a.
- Maria-Florina Balcan, Travis Dick, and Colin White. Data-driven clustering via parameterized Lloyd's families. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montreal, Canada, pp. 10664–10674, 2018b.
- Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702–732, 2004.
- Vladimir Braverman and Rafail Ostrovsky. Recursive sketching for frequency moments. CoRR, 2010.
- CAIDA. Caida internet traces 2016 chicago. http://www.caida.org/data/monitors/passive-equinix-chicago.xml.
- Amit Chakrabarti, Khanh Do Ba, and S. Muthukrishnan. Estimating entropy and entropy norm on data streams. Internet Mathematics, 3(1):63–78, 2006.
- Graham Cormode and S. Muthukrishnan. Space efficient mining of multigraph streams. In Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 13-15, 2005, Baltimore, Maryland, USA, pp. 271–282, 2005.
- Travis Dick, Mu Li, Venkata Krishna Pillutla, Colin White, Nina Balcan, and Alexander J. Smola. Data driven resource allocation for distributed learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pp. 662–671, 2017.
- Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, Cambridge, MA, USA, January 14-16, 2016, pp. 123–134, 2016.
- Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
- Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pp. 189–197. IEEE, 2000.
- Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005, pp. 202–208, 2005.
- Piotr Indyk, Ali Vakilian, and Yang Yuan. Learning-based low-rank approximations. CoRR, abs/1910.13984, 2019.
- T. S. Jayram and D. P. Woodruff. The data stream space complexity of cascaded norms. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pp. 765–774, Oct 2009.
- Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pp. 1161–1178, 2010a.
- Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’10, pp. 41–52, New York, NY, USA, 2010b. ACM.
- Daniel M. Kane, Jelani Nelson, Ely Porat, and David P. Woodruff. Fast moment estimation in data streams in optimal space. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 745–754. ACM, 2011.
- Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. CoRR, abs/1712.01208, 2017.
- Ping Li. Estimators and tail bounds for dimension reduction in $l_\alpha$ ($0 < \alpha \le 2$) using stable random projections. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008, pp. 10–19, 2008.
- Andrew McGregor, A. Pavan, Srikanta Tirthapura, and David P. Woodruff. Space-efficient estimation of statistics over sub-sampled streams. Algorithmica, 74(2):787–811, 2016.
- S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
- John Nolan. Stable distributions: models for heavy-tailed data. Birkhauser Boston, 2003.
- Anna Ostlin and Rasmus Pagh. Uniform hashing in constant time and linear space. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pp. 622–628. ACM, 2003.
- Srikanta Tirthapura and David Woodruff. Rectangle-efficient aggregation in spatial data streams. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems, pp. 283–294. ACM, 2012.
- David P. Woodruff and Guang Yang. Separating k-player from t-player one-way communication, with applications to data streams. In 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, pp. 97:1–97:14, 2019.
- Vladimir M. Zolotarev. One-dimensional stable distributions, volume 65. American Mathematical Soc., 1986.
- Proof of Theorem 2. The proof is almost identical to that in Kane et al. (2010b). The space and time complexities follow from the description of the algorithm.
- Next we show correctness. Let $L_0(j)$ denote the true number of distinct elements in the $j$-th scale. Then $\mathbb{E}\, L_0(j) = L_0/2^j$. Let $j^* = \max\{j : \mathbb{E}\, L_0(j) \ge 1\}$ and $j^{**} = \max\{j < j^* : \mathbb{E}\, L_0(j) \ge 55\}$. As shown in Kane et al. (2010b), if $j^{**}$ exists, it holds that $55 \le \mathbb{E}\, L_0(j^{**}) < 110$ and $\Pr\{32 < L_0(j^{**}) < 142\} \ge 8/9$. With our choices of $c = 141$ and $\eta = 1/16$, EXACTCOUNT returns a nonzero value for the $j^{**}$-th scale with probability at least $8/9 - 1/16 > 13/16$ by Lemma 1. It follows that the deepest scale $j$ we find satisfies $j^{**} \le j \le j^*$. Hence, $L_0 = 2^{j^{**}}\, \mathbb{E}\, L_0(j^{**}) \le 110 \cdot 2^{j}$ and $L_0 = 2^{j^*}\, \mathbb{E}\, L_0(j^*) \ge 2^{j}$, as desired. If $j^{**}$ does not exist, then $L_0 < 55$, and the estimate $1$ is a 55-approximation in this case.
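
The arithmetic in the correctness argument can be checked on a concrete instance. The sketch below assumes only the definitions stated in the proof, with $\mathbb{E}\, L_0(j) = L_0/2^j$; the helper name `scale_indices` and the example values of $L_0$ are hypothetical.

```python
from fractions import Fraction


def scale_indices(l0):
    """Return (j_star, j_star_star) for E L0(j) = l0 / 2**j, with l0 >= 1.

    j_star is the largest j with E L0(j) >= 1. j_star_star is the largest
    j < j_star with E L0(j) >= 55, or None when no such scale exists.
    """
    def expected(j):
        return Fraction(l0, 2 ** j)

    j_star = 0
    while expected(j_star + 1) >= 1:
        j_star += 1
    candidates = [k for k in range(j_star) if expected(k) >= 55]
    j_star_star = max(candidates) if candidates else None
    return j_star, j_star_star


# The probability bound used in the proof: 8/9 - 1/16 = 119/144 > 117/144 = 13/16.
success = Fraction(8, 9) - Fraction(1, 16)

# Example with L0 = 1000: any scale j between j** and j* gives 2^j <= L0 <= 110 * 2^j.
j_star, j_star_star = scale_indices(1000)
bounds_hold = all(2 ** j <= 1000 <= 110 * 2 ** j
                  for j in range(j_star_star, j_star + 1))
```

For $L_0 = 1000$ this gives $j^* = 9$ and $j^{**} = 4$, with $\mathbb{E}\, L_0(j^{**}) = 62.5$ inside the interval $[55, 110)$ claimed in the proof.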
