iBTune: individualized buffer tuning for large-scale cloud databases

Jian Tan, Tieying Zhang, Jie Chen, Qixing Zheng, Honglin Qiao, Rui Zhang

Proceedings of the VLDB Endowment, pp. 1221-1234, 2019.

DOI: https://doi.org/10.14778/3339490.3339503

Abstract:

Tuning the buffer size appropriately is critical to the performance of a cloud database, since memory is usually the resource bottleneck. For large-scale databases supporting heterogeneous applications, configuring the individual buffer sizes for a significant number of database instances presents a scalability challenge. Manual optimizat…

Introduction
  • Buffer pool is a critical resource for an OLTP database, serving as a data caching space to guarantee desirable system performance.
  • Existing buffer pool configurations are almost always based on database administrators' (DBAs') experience and typically take one of a small, fixed set of recommended values.
  • This manual process is neither efficient nor effective, and it is not even feasible for large cloud clusters, especially when the workload may change dynamically on individual database instances.
  • Table 1 (usage of different memory pools) shows that the buffer pool, at roughly 29.6 GB, dwarfs every other pool; the sort buffer, for example, occupies only 1.25 MB.
Highlights
  • Buffer pool is a critical resource for an OLTP database, serving as a data caching space to guarantee desirable system performance
  • We see that response time (RT) increases by around 30% ∼ 50%, but the latency still remains relatively low
  • The performance (RT and queries per second (QPS)) still meets the quality of service after we reduce the buffer size
  • We propose iBTune to adjust DBMS buffer pool sizes by using a large deviation analysis for least recently used (LRU) caching models and by leveraging similar instances, matched on performance metrics, to find tolerable miss ratios (a minimal sketch follows this list)
  • The deployment on our large-scale production environment shows that this solution can save more than 17% of memory resources compared to the original system, which relies only on experienced database administrators (DBAs)
  • This paper focuses on shrinking buffer pool sizes to reduce cost, which by far is the most important issue with our production deployment
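As a rough illustration of the highlighted approach, the sketch below converts a tolerable miss ratio into a target buffer pool size. It assumes a power-law relation between miss ratio and LRU buffer size (mr ≈ c · size^(-α)), which is the kind of scaling that large deviation analysis suggests for LRU caches under Zipf-like requests; the function name, the exponent value, and the numbers are illustrative, not taken from the paper.

```python
# Hedged sketch, not the paper's exact procedure: assume an LRU buffer of size s
# serves a Zipf-like workload so that miss ratio mr ~ c * s**(-alpha). Given the
# current (size, miss ratio) operating point and a tolerable miss ratio borrowed
# from a similar instance, solve for the smallest size that still meets the target.
def target_buffer_size_gb(cur_size_gb, cur_miss_ratio, tolerable_miss_ratio, alpha=1.0):
    """Solve cur_mr * (new_size / cur_size) ** (-alpha) = tolerable_mr for new_size."""
    return cur_size_gb * (cur_miss_ratio / tolerable_miss_ratio) ** (1.0 / alpha)

# Illustrative numbers only: tolerating a ~12% higher miss ratio shrinks a 96 GB pool
# to roughly 86 GB, in the same ballpark as the adjustment reported in the Results.
print(round(target_buffer_size_gb(96.0, 0.00100, 0.00112), 1))
```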
Results
  • The authors first examine the online buffer pool adjustments in the production environment in Section 4.2.1.
  • 4.2.1 Online adjustment of buffer pool sizes.
  • The authors compare the performance before and after adjusting the buffer pool sizes, using the sizes computed by iBTune.
  • The authors' algorithm adjusts the buffer pool size from 96 GB to 86 GB, about a 10% reduction.
  • Most RT values after adjustment stay below, yet close to, the predicted upper bound of the response time (see the sketch after this list)
  • This indicates that the algorithm predicts the upper bound of RT reasonably well.
  • The performance (RT and QPS) still meets the quality of service after the authors reduce the buffer size
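The online adjustment described above only keeps a reduced buffer pool if the observed response time stays within the predicted upper bound. A minimal guardrail sketch follows; the function, the 95th-percentile choice, and the sample values are hypothetical, not the paper's actual rollback policy.

```python
# Hypothetical guardrail (not the paper's exact logic): after shrinking the buffer
# pool, compare a high quantile of the observed RT samples against the predicted
# RT upper bound and decide whether to roll the resize back.
def should_roll_back(observed_rt_ms, predicted_rt_upper_ms, quantile=0.95):
    """Return True if the chosen RT quantile exceeds the predicted upper bound."""
    rts = sorted(observed_rt_ms)
    idx = min(int(quantile * len(rts)), len(rts) - 1)
    return rts[idx] > predicted_rt_upper_ms

# Example: hourly RT samples (ms) collected after a 96 GB -> 86 GB adjustment.
print(should_roll_back([0.61, 0.72, 0.80, 0.95, 1.10], predicted_rt_upper_ms=1.0))  # True
```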
Conclusion
  • The authors propose iBTune to adjust DBMS buffer pool sizes by using a large deviation analysis for LRU caching models and by leveraging similar instances, matched on performance metrics, to find tolerable miss ratios.
  • The deployment on the large-scale production environment shows that this solution can save more than 17% of memory resources compared to the original system, which relies only on experienced DBAs.
  • Future work: this paper focuses on shrinking buffer pool sizes to reduce cost, which is by far the most important issue with the production deployment.
  • For now, the authors rely on DBAs to manually analyze the system's expansion requirements before taking important actions.
  • The authors will explore how to automatically expand the buffer pools in the future.
Tables
  • Table1: Usage of different memory pools
  • Table2: Average QPS from different business units
  • Table3: Machine configurations
  • Table4: Average memory saving ratios for different sizes
  • Table5: Online performance
  • Table6: Training set performance (%)
  • Table7: Testing set performance (%)
Related work
  • Database parameter tuning has been an active research area in recent years. Pavlo et al. proposed a framework [31] for self-driving DBMSs comprising several key components, such as the runtime architecture, workload modeling, and the control framework. They extended this framework to automatically tune DBMS knob configurations in a system called OtterTune [42]. OtterTune uses a LASSO algorithm to select the most impactful knobs and recommends knob settings based on Gaussian Processes (a sketch of the LASSO-based knob ranking follows below). OtterTune trains its model on hundreds of metrics collected under different configurations. OtterTune's objective is to achieve good performance for a single DBMS instance by tuning important parameters in the configuration file of a DBMS kernel, while our goal is to optimize memory usage by tuning the buffer pool sizes of many different database instances.
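For context, the snippet below sketches a LASSO-based knob ranking step in the spirit of OtterTune [42]: knobs whose regression coefficients survive the L1 penalty are treated as the most impactful. The knob names, synthetic data, and hyperparameters are illustrative assumptions, not OtterTune's actual implementation.

```python
# Illustrative LASSO knob ranking (synthetic data; not OtterTune's code).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
knobs = ["buffer_pool_size", "log_file_size", "io_capacity", "sort_buffer_size"]
X = rng.uniform(size=(200, len(knobs)))                               # normalized knob settings
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)  # synthetic latency metric

model = Lasso(alpha=0.01).fit(StandardScaler().fit_transform(X), y)
ranking = sorted(zip(knobs, np.abs(model.coef_)), key=lambda kv: -kv[1])
print(ranking)  # knobs with larger absolute coefficients are considered more impactful
```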
Funding
  • The successful deployment on a production environment, which safely reduces the memory footprint by more than 17% compared to the original system that relies on manual configurations, demonstrates the effectiveness of our solution
  • Since iBTune is deployed online, we have successfully reduced the memory consumption by more than 17% while still satisfying the required quality of service for our diverse business applications
  • Compared with a model that only uses a single data point in the original data set, we improve performance by utilizing all observations from similar environments, as demonstrated by real experiments in Section 4.2.2 (see the sketch after this list)
  • We see that RT increases by around 30% ∼ 50%, but the latency still remains relatively low (under 1 ms)
  • Most of the observed results (more than 83% for RT and 85% for MR) are consistent with the predictions
  • The average of the hourly RT varies by more than 70% on the training and testing data sets due to workload change
  • The deployment on our large-scale production environment shows that this solution can save more than 17% memory resource compared to the original system that only relies on experienced DBAs
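The "similar environments" idea above can be pictured as a nearest-neighbor lookup over per-instance performance metrics: the tolerable miss ratio for a target instance is borrowed from instances with the closest metric profiles. The metric set, distance, and numbers below are assumptions for illustration; the paper's actual similarity computation is more elaborate.

```python
# Minimal sketch (assumed metric set and distance; not the paper's exact method):
# borrow a tolerable miss ratio from the instances whose performance metrics are
# closest to the target instance.
import numpy as np

# rows: database instances; columns: normalized metrics (e.g., QPS, logical reads, CPU)
metrics = np.array([
    [0.90, 0.80, 0.70],
    [0.10, 0.20, 0.30],
    [0.85, 0.75, 0.72],
])
miss_ratios = np.array([0.020, 0.100, 0.030])

def tolerable_miss_ratio(target_idx, k=1):
    """Average the miss ratios of the k nearest instances (Euclidean distance)."""
    dist = np.linalg.norm(metrics - metrics[target_idx], axis=1)
    dist[target_idx] = np.inf                 # never match the instance with itself
    neighbors = np.argsort(dist)[:k]
    return float(miss_ratios[neighbors].mean())

print(tolerable_miss_ratio(0))  # instance 2 is the closest neighbor -> 0.03
```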
References
  • Docker. https://www.docker.com.
  • M. Akdere, U. Cetintemel, M. Riondato, E. Upfal, and S. B. Zdonik. Learning-based query performance modeling and prediction. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, pages 390–401, Washington, DC, USA, 2012. IEEE Computer Society.
  • M. Arlitt and C. L. Williamson. Internet web servers: Workload characterization and performance implications. IEEE/ACM Transactions on Networking, October 1997.
  • C. Berthet. Approximation of LRU caches miss rate: Application to power-law popularities. arXiv:1705.10738, 2017.
  • L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions: evidence and implications. In Proceedings of the 18th Conference on Information Communications, 1999.
  • T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
  • F. J. Corbato. A paging experiment with the multics system. MIT Project MAC Report, MAC-M-384, 1968.
  • G. Dan and N. Carlsson. Power-law revisited: Large scale measurement study of p2p content popularity. In Proceedings of the 9th International Conference on Peer-to-peer Systems, IPTPS’10, pages 12–12, Berkeley, CA, USA, 2010. USENIX Association.
  • S. Das, F. Li, V. R. Narasayya, and A. C. Konig. Automated demand-driven resource scaling in relational database-as-a-service. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 1923–1934, New York, NY, USA, 2016. ACM.
  • K. G. Derpanis. Overview of the ransac algorithm. Image Rochester NY, 4(1):2–3, 2010.
  • E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11):2765–2781, 2013.
  • C. Fricker, P. Robert, and J. Roberts. A versatile and accurate approximation for lru cache performance. In Proceedings of the 24th International Teletraffic Congress, page 8. International Teletraffic Congress, 2012.
  • A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pages 592–603, Washington, DC, USA, 2009. IEEE Computer Society.
  • S. Garcia, J. Derrac, J. Cano, and F. Herrera. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE transactions on pattern analysis and machine intelligence, 34(3):417–435, 2012.
  • Y. Geng, S. Liu, Z. Yin, A. Naik, B. Prabhakar, M. Rosenblum, and A. Vahdat. Exploiting a natural network effect for scalable, fine-grained clock synchronization. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 81–94, Renton, WA, 2018. USENIX Association.
  • P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine learning, 63(1):3–42, 2006.
  • G. Huang, X. Cheng, J. Wang, Y. Wang, D. He, T. Zhang, F. Li, S. Wang, W. Cao, and Q. Li. X-engine: An optimized storage engine for large-scale e-commerce transaction processing. In Proceedings of the 2019 ACM International Conference on Management of Data, SIGMOD ’19. ACM, 2019.
  • P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 03 1964.
  • P. R. Jelenkovic. Least-recently-used caching with Zipf's law requests. In The Sixth INFORMS Telecommunications Conference, Boca Raton, Florida, 2002.
  • A. Kadiyala and A. Kumar. Applications of python to evaluate the performance of bagging methods. Environmental Progress & Sustainable Energy, 37(5):1555–1559, 2018.
  • T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In SIGMOD, pages 489–504, 2018.
  • S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to Optimize Join Queries With Deep Reinforcement Learning. ArXiv e-prints, Aug. 2018.
  • L. Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, 1998.
  • D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. On the existence of a spectrum of policies that subsumes the least recently used (lru) and least frequently used (lfu) policies. In Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’99, pages 134–143, New York, NY, USA, 1999. ACM.
  • Z. L. Li, M. C.-J. Liang, W. He, L. Zhu, W. Dai, J. Jiang, and G. Sun. Metis: Robustly tuning tail latencies of cloud systems. In ATC (USENIX Annual Technical Conference). USENIX, July 2018.
  • A. Liaw, M. Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
  • L. Ma, D. Van Aken, A. Hefny, G. Mezerhane, A. Pavlo, and G. J. Gordon. Query-based workload forecasting for self-driving database management systems. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, pages 631–645, New York, NY, USA, 2018. ACM.
  • V. Narasayya, I. Menache, M. Singh, F. Li, M. Syamala, and S. Chaudhuri. Sharing buffer pool memory in multi-tenant relational database-as-a-service. PVLDB, 8(7):726–737, 2015.
  • D. Narayanan, E. Thereska, and A. Ailamaki. Continuous resource monitoring for self-predicting dbms. In 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 239–248, Sept 2005.
  • E. J. O'Neil, P. E. O'Neil, and G. Weikum. The LRU-K page replacement algorithm for database disk buffering. ACM SIGMOD Record, 22(2):297–306, 1993.
  • A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. Mowry, M. Perron, I. Quah, S. Santurkar, A. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. Wu, R. Xian, and T. Zhang. Self-driving database management systems. In Proceedings of the 2017 Conference on Innovative Data Systems Research, CIDR ’17, 2017.
  • J. Petrovic. Using Memcached for data distribution in industrial environment. In Proceeding ICONS ’08 Proceedings of the Third International Conference on Systems, pages 368–372, April 2008.
  • S. Podlipnig and L. Boszormenyi. A survey of web cache replacement strategies. ACM Computing Surveys (CSUR), 35(4):374–398, Dec. 2003.
  • L. Rokach and O. Z. Maimon. Data mining with decision trees: theory and applications, volume 69. World scientific, 2008.
  • D. L. Shrestha and D. P. Solomatine. Experiments with AdaBoost.RT, an improved boosting scheme for regression. Neural Computation, 18(7):1678–1710, 2006.
  • Y. Smaragdakis, S. Kaplan, and P. Wilson. The eelru adaptive replacement algorithm. Perform. Eval., 53(2):93–123, July 2003.
  • A. J. Storm, C. Garcia-Arellano, S. S. Lightstone, Y. Diao, and M. Surendra. Adaptive self-tuning memory in db2. In Proceedings of the 32Nd International Conference on Very Large Data Bases, VLDB ’06, pages 1081–1092. VLDB Endowment, 2006.
  • T. Sugimoto and N. Miyoshi. On the asymptotics of fault probability in least-recently-used caching with Zipf-type request distribution. Random Structures & Algorithms, 29(3):296–323, 2006.
  • R. Taft, N. El-Sayed, M. Serafini, Y. Lu, A. Aboulnaga, M. Stonebraker, R. Mayerhofer, and F. Andrade. P-store: An elastic database system with predictive provisioning. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, pages 205–219, New York, NY, USA, 2018. ACM.
  • J. Tan, G. Quan, K. Ji, and N. Shroff. On resource pooling and separation for LRU caching. In Proceedings of the 2018 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science. ACM, 2018.
  • D. N. Tran, P. C. Huynh, Y. C. Tay, and A. K. H. Tung. A new approach to dynamic self-tuning of database buffers. Trans. Storage, 4(1):3:1–3:25, May 2008.
  • D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1009–1024, New York, NY, USA, 2017. ACM.
  • J. Wang. A survey of web caching schemes for the internet. SIGCOMM Computer Communication Review, 29(5):36–46, Oct. 1999.
  • W. Wu, Y. Chi, H. Hacıgumus, and J. F. Naughton. Towards predicting query execution time for concurrent and dynamic database workloads. PVLDB, 6(10):925–936, 2013.
  • Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Characterizing facebook’s memcached workload. IEEE Internet Computing, 18(2):41–49, 2014.
  • Y. Yang and J. Zhu. Write skew and zipf distribution: Evidence and implications. ACM Trans. Storage, 12(4):21:1–21:19, June 2016.
  • J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 2061–2064. ACM, 2009.
  • H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.