PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems.

OSDI, pp. 611-626, 2018.

Abstract

Machine Learning models are often composed of pipelines of transformations. While this design allows single model components to be executed efficiently at training time, prediction serving has different requirements, such as low latency, high throughput, and graceful performance degradation under heavy load. Current prediction serving systems …

Introduction
  • Many Machine Learning (ML) frameworks such as Google TensorFlow [4], Facebook Caffe2 [6], Scikit-learn [48], or Microsoft ML.Net [14] allow data scientists to declaratively author pipelines of transformations to train models from large-scale input datasets.
  • Model pipelines are internally represented as Directed Acyclic Graphs (DAGs) of operators comprising data transformations, featurizers, and ML models (a minimal sketch of this representation follows this list).
  • Pipelines have different system characteristics depending on the phase in which they are employed: at training time, ML models run complex algorithms to scale over large datasets, while, once trained, they behave like any other featurizer or data transformation; during inference, pipelines are often surfaced to directly serve users and require low latency, high throughput, and graceful performance degradation under load spikes
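To make the DAG abstraction concrete, here is a minimal C# sketch. The names (IOperator, Pipeline, Normalizer, LinearModel) are hypothetical and illustrative, not ML.Net's actual API: a pipeline is a topologically ordered chain of operators, and a trained model is simply the last operator in the chain.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical operator interface: every node in the DAG maps a
    // feature vector to another feature vector.
    interface IOperator
    {
        float[] Transform(float[] input);
    }

    // A featurizer: rescales the input to unit max-norm (illustrative).
    class Normalizer : IOperator
    {
        public float[] Transform(float[] x)
        {
            float max = x.Max(v => Math.Abs(v));
            return max == 0f ? x : x.Select(v => v / max).ToArray();
        }
    }

    // A trained model behaves like any other operator once training is done.
    class LinearModel : IOperator
    {
        private readonly float[] weights;
        private readonly float bias;
        public LinearModel(float[] weights, float bias)
        {
            this.weights = weights;
            this.bias = bias;
        }
        public float[] Transform(float[] x) =>
            new[] { weights.Zip(x, (w, v) => w * v).Sum() + bias };
    }

    // A pipeline is the DAG in topological order; for a linear chain,
    // prediction is a simple left fold over the operators.
    class Pipeline
    {
        private readonly List<IOperator> ops;
        public Pipeline(params IOperator[] operators) => ops = operators.ToList();
        public float[] Predict(float[] input) =>
            ops.Aggregate(input, (acc, op) => op.Transform(acc));
    }

    class Demo
    {
        static void Main()
        {
            var pipeline = new Pipeline(
                new Normalizer(),
                new LinearModel(new[] { 0.5f, -0.25f, 1.0f }, bias: 0.1f));
            Console.WriteLine(pipeline.Predict(new[] { 2f, 4f, 8f })[0]); // 1.1
        }
    }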
Highlights
  • Many Machine Learning (ML) frameworks such as Google TensorFlow [4], Facebook Caffe2 [6], Scikit-learn [48], or Microsoft ML.Net [14] allow data scientists to declaratively author pipelines of transformations to train models from large-scale input datasets
  • Scenarios: the goal of the experimental evaluation is to assess how the white-box approach performs compared to the black-box one
  • Memory: in the first scenario, we want to show how much memory saving PRETZEL's white-box approach is able to provide with respect to regular ML.Net and ML.Net boxed into Docker containers managed by Clipper
  • Latency: this scenario simulates an online pattern, with a client submitting one request at a time. Throughput: this scenario simulates a batch pattern (e.g., [8]) and is used to assess the throughput of PRETZEL compared to ML.Net. Heavy load: we mix the above experiments and show PRETZEL's ability to maintain high throughput and graceful degradation of latency as the load increases
  • As for the latency experiment, we first report PRETZEL's performance using a micro-benchmark, and then compare it against the containerized version of ML.Net in an end-to-end setting
  • Inspired by the growth of ML applications and ML-as-a-service platforms, this paper identified how existing systems fall short of key requirements for ML prediction serving, disregarding the optimization of model execution in favor of ease of deployment
Results
  • PRETZEL's implementation is a mix of C# and C++. In its current version, the system comprises 12.6K LOC (11.3K in C#, 1.3K in C++) and supports about two dozen ML.Net operators, including linear models (e.g., linear/logistic/Poisson regression), tree-based models, clustering models (e.g., K-Means), Principal Components Analysis (PCA), and several featurizers (a sketch of the logical/physical operator split follows this list).

    Scenarios: the goal of the experimental evaluation is to assess how the white-box approach performs compared to the black-box one.
  • As for the latency experiment, the authors first report PRETZEL's performance using a micro-benchmark, and then compare it against the containerized version of ML.Net in an end-to-end setting
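The split between logical operators and their physical implementations, which a white-box system can exploit, can be sketched as below. ILinearScorer, DenseScorer, SparseScorer, and PlanChooser are hypothetical names, and the sparsity heuristic is an assumption for illustration, not PRETZEL's actual plan-selection logic.

    using System;
    using System.Linq;

    // Hypothetical interface for a physical implementation of the logical
    // "linear scoring" operator.
    interface ILinearScorer
    {
        float Score(float[] x);
    }

    // Dense plan: straight dot product, best when most features are set.
    class DenseScorer : ILinearScorer
    {
        private readonly float[] w;
        public DenseScorer(float[] weights) => w = weights;
        public float Score(float[] x)
        {
            float acc = 0f;
            for (int i = 0; i < w.Length; i++) acc += w[i] * x[i];
            return acc;
        }
    }

    // Sparse plan: skips zero entries, which pays off for inputs that are
    // mostly zeros (e.g., hashed text features).
    class SparseScorer : ILinearScorer
    {
        private readonly float[] w;
        public SparseScorer(float[] weights) => w = weights;
        public float Score(float[] x)
        {
            float acc = 0f;
            for (int i = 0; i < x.Length; i++)
                if (x[i] != 0f) acc += w[i] * x[i];
            return acc;
        }
    }

    class PlanChooser
    {
        // Stand-in for plan selection: pick the physical operator from input
        // statistics, the kind of decision a white-box system can make
        // because it sees inside the pipeline rather than a sealed binary.
        public static ILinearScorer For(float[] sample, float[] weights) =>
            sample.Count(v => v != 0f) < sample.Length / 2
                ? (ILinearScorer)new SparseScorer(weights)
                : new DenseScorer(weights);

        static void Main()
        {
            var weights = new[] { 0.3f, -0.1f, 0.7f, 0.2f };
            var sparseInput = new[] { 0f, 0f, 0f, 1f };
            var scorer = PlanChooser.For(sparseInput, weights);
            Console.WriteLine($"{scorer.GetType().Name}: {scorer.Score(sparseInput)}");
        }
    }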
Conclusion
  • Inspired by the growth of ML applications and ML-as-a-service platforms, this paper identified how existing systems fall short of key requirements for ML prediction serving, disregarding the optimization of model execution in favor of ease of deployment.
  • This work casts the problem of serving inference as a database problem where end-to-end and multi-query optimization strategies are applied to ML pipelines.
  • The authors have developed an optimizer and compiler framework generating efficient model plans end-to-end.
  • To decrease memory footprint and increase resource utilization and throughput, the authors allow pipelines to share parameters and physical operators, and defer inference execution to a scheduler that runs multiple predictions concurrently on shared resources (parameter sharing is sketched below)
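A minimal sketch of what cross-pipeline parameter sharing can look like, assuming a content-keyed store; SharedParams is a hypothetical name, a simplified stand-in for the kind of shared store the conclusion describes, not PRETZEL's actual API.

    using System;
    using System.Collections.Concurrent;
    using System.Linq;

    // Identical weight vectors are registered once, keyed by their content,
    // so N pipelines built from the same trained featurizer hold one copy
    // instead of N.
    static class SharedParams
    {
        private static readonly ConcurrentDictionary<string, float[]> store =
            new ConcurrentDictionary<string, float[]>();

        public static float[] Register(float[] weights)
        {
            // Content key; a real system would use a compact digest instead
            // of the full value string used here for clarity.
            string key = string.Join(",", weights.Select(w => w.ToString("R")));
            return store.GetOrAdd(key, _ => weights);
        }
    }

    class Demo
    {
        static void Main()
        {
            var w1 = new[] { 0.1f, 0.2f, 0.3f };   // pipeline A's copy
            var w2 = new[] { 0.1f, 0.2f, 0.3f };   // pipeline B's copy
            float[] a = SharedParams.Register(w1);
            float[] b = SharedParams.Register(w2);
            // Both pipelines now alias one array: memory grows with distinct
            // parameters, not with the number of deployed models.
            Console.WriteLine(ReferenceEquals(a, b)); // True
        }
    }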
Tables
  • Table 1: Characteristics of pipelines in experiments, e.g., structured-text inputs (40 dimensions), sizes 10KB–20MB (mean: 9MB), and operators including Principal Components Analysis, KMeans, and ensembles of multiple models
Related Work
  • Prediction Serving: as discussed in the Introduction, current ML prediction systems [9, 32, 5, 46, 17, 30, 18, 43, 59, 15] aim to minimize the cost of deployment and maximize code re-use between training and inference phases [65]. Conversely, PRETZEL casts prediction serving as a database problem and applies end-to-end and multi-query optimizations to maximize performance and resource utilization (a minimal sketch of such sub-plan sharing follows this list).
  • Clipper and Rafiki deploy pipelines as Docker containers connected through RPC to a front end. Both systems apply external, model-agnostic techniques to achieve better latency, throughput, and accuracy. While we employed similar techniques in the FrontEnd, in PRETZEL we have not yet explored "best effort" techniques such as ensembles, straggler mitigation, and model selection.
  • TensorFlow Serving deploys pipelines as Servables, which are units of execution scheduling and version management. A Servable is executed as a black box, although users are allowed to split model pipelines and surface them as separate Servables, similarly to PRETZEL's stage-based execution; this optimization is, however, not automatic.
  • LASER [22] enables large-scale training and inference of logistic regression models, applying system optimizations specific to the problem at hand (i.e., advertising, where multiple ad campaigns are run on each user), such as caching of partial results and graceful degradation of accuracy.
  • Runtimes such as Core ML [10] and Windows ML [21] provide on-device inference engines and accelerators. To our knowledge, only single-operator optimizations are enforced (e.g., using target mathematical libraries or hardware), while neither end-to-end nor multi-model optimizations are applied.
  • Like PRETZEL, TVM [20, 28] provides a set of logical operators and related physical implementations, backed by an optimizer based on the Halide language [49]. TVM specializes in neural network models and supports neither featurizers nor "classical" models.
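To illustrate the multi-query flavor of these optimizations, the following sketch shares a common featurization prefix across models, so the prefix runs once per request and every model scores its output. PrefixSharedServer and the toy featurizer are hypothetical names; real systems make this decision in an optimizer rather than by hand.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Sketch of multi-query-style sub-plan sharing: several deployed
    // pipelines start with the same featurization prefix, which is
    // computed once per request instead of once per model.
    class PrefixSharedServer
    {
        private readonly Func<string, float[]> sharedPrefix;
        private readonly List<Func<float[], float>> models =
            new List<Func<float[], float>>();

        public PrefixSharedServer(Func<string, float[]> featurizer) =>
            sharedPrefix = featurizer;

        public void Deploy(Func<float[], float> model) => models.Add(model);

        public float[] PredictAll(string input)
        {
            float[] features = sharedPrefix(input); // computed once, not per model
            return models.Select(m => m(features)).ToArray();
        }
    }

    class Demo
    {
        static void Main()
        {
            // Shared featurizer: letter/digit ratios of the input text.
            var server = new PrefixSharedServer(text => new[]
            {
                text.Count(c => char.IsLetter(c)) / (float)Math.Max(1, text.Length),
                text.Count(c => char.IsDigit(c)) / (float)Math.Max(1, text.Length),
            });
            server.Deploy(f => 0.9f * f[0] + 0.1f * f[1]); // model A
            server.Deploy(f => 0.2f * f[0] + 0.8f * f[1]); // model B
            Console.WriteLine(string.Join(", ", server.PredictAll("osdi 2018")));
        }
    }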
Funding
  • Yunseong Lee and Byung-Gon Chun were partly supported by the MSIT (Ministry of Science and ICT), Korea, under the SW Starlab support program (IITP-2018-R0126-18-1093) supervised by the IITP (Institute for Information & communications Technology Promotion), and by the ICT R&D program of MSIT/IITP (No.2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test)
References
  • H2O. https://www.h2o.ai/.
  • Michelangelo. https://eng.uber.com/michelangelo/.
  • TensorFlow XLA. https://www.tensorflow.org/performance/xla/.
  • TensorFlow. https://www.tensorflow.org, 2016.
  • TensorFlow Serving. https://www.tensorflow.org/serving, 2016.
  • Caffe2. https://caffe2.ai, 2017.
  • Open Neural Network Exchange (ONNX). https://onnx.ai, 2017.
  • Batch Python API in Microsoft Machine Learning Server, 2018.
  • Clipper. http://clipper.ai/, 2018.
  • Core ML. https://developer.apple.com/documentation/coreml, 2018.
  • Docker. https://www.docker.com/, 2018.
  • EC2 large instances and NUMA. https://forums.aws.amazon.com/thread.jspa?threadID=144982, 2018.
  • Keras. https://www.tensorflow.org/api_docs/python/tf/keras, 2018.
  • ML.Net. https://dot.net/ml, 2018.
  • MXNet Model Server (MMS). https://github.com/awslabs/mxnet-model-server, 2018.
  • .Net Core ahead-of-time compilation with CrossGen. https://github.com/dotnet/coreclr/blob/master/Documentation/building/crossgen.md, 2018.
  • PredictionIO. https://predictionio.apache.org/, 2018.
  • Redis-ML. https://github.com/RedisLabsModules/redis-ml, 2018.
  • Request-response Python API in Microsoft Machine Learning Server. https://docs.microsoft.com/en-us/machine-learning-server/operationalize/python/how-to-consume-web-services, 2018.
  • TVM. https://tvm.ai/, 2018.
  • Windows ML. https://docs.microsoft.com/en-us/windows/uwp/machine-learning/overview, 2018.
  • D. Agarwal, B. Long, J. Traupman, D. Xin, and L. Zhang. LASER: A scalable response prediction platform for online advertising. In WSDM, 2014.
  • Z. Ahmed et al. Machine learning for applications, not containers (under submission), 2018.
  • M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational data processing in Spark. In SIGMOD, 2015.
  • B. Babcock, S. Babu, M. Datar, R. Motwani, and D. Thomas. Operator scheduling in data stream systems. The VLDB Journal, 13(4):333–353, Dec. 2004.
  • P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, pages 225–237, 2005.
  • T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, 2015.
  • T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Q. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy. TVM: End-to-end optimization stack for deep learning. CoRR, 2018.
  • R. Chirkova and J. Yang. Materialized views. Foundations and Trends in Databases, 4(4):295–405, 2012.
  • D. Crankshaw, P. Bailis, J. E. Gonzalez, H. Li, Z. Zhang, M. J. Franklin, A. Ghodsi, and M. I. Jordan. The missing piece in complex analytics: Low latency, scalable model management and serving with Velox. In CIDR, 2015.
  • D. Crankshaw and J. Gonzalez. Prediction-serving systems. Queue, 16(1):70:83–70:97, Feb. 2018.
  • D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica. Clipper: A low-latency online prediction serving system. In NSDI, 2017.
  • A. Crotty, A. Galakatos, K. Dursun, T. Kraska, C. Binnig, U. Cetintemel, and S. Zdonik. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, Aug. 2015.
  • A. Deshpande and S. Madden. MauveDB: Supporting model-based user views in database systems. In SIGMOD, 2006.
  • B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In OSDI, 1999.
  • G. Graefe. Volcano: An extensible and parallel query evaluation system. IEEE Trans. on Knowl. and Data Eng., 6(1):120–135, Feb. 1994.
  • A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, Dec. 2001.
  • R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, 2016.
  • D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: Optimizing neural network queries over video at scale. PVLDB, 10(11):1586–1597, Aug. 2017.
  • A. Kemper, T. Neumann, J. Finis, F. Funke, V. Leis, H. Mühe, T. Mühlbauer, and W. Rödiger. Processing in the hybrid OLTP & OLAP main-memory database system HyPer. IEEE Data Eng. Bull., 36(2):41–47, 2013.
  • K. Krikellas, S. Viglas, and M. Cintra. Generating code for holistic query evaluation. In ICDE, 2010.
  • E. Meijer, B. Beckman, and G. Bierman. LINQ: Reconciling objects, relations and XML in the .NET framework. In SIGMOD, 2006.
  • A. N. Modi, C. Y. Koo, C. Y. Foo, C. Mewald, D. M. Baylor, E. Breck, H.-T. Cheng, J. Wilkiewicz, L. Koc, L. Lew, M. A. Zinkevich, M. Wicke, M. Ispir, N. Polyzotis, N. Fiedel, S. E. Haykal, S. Whang, S. Roy, S. Ramesh, V. Jain, X. Zhang, and Z. Haque. TFX: A TensorFlow-based production-scale machine learning platform. In SIGKDD, 2017.
  • G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, M. Ballesteros, D. Chiang, D. Clothiaux, T. Cohn, K. Duh, M. Faruqui, C. Gan, D. Garrette, Y. Ji, L. Kong, A. Kuncoro, G. Kumar, C. Malaviya, P. Michel, Y. Oda, M. Richardson, N. Saphra, S. Swayamdipta, and P. Yin. DyNet: The dynamic neural network toolkit. ArXiv e-prints, 2017.
  • T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539–550, June 2011.
  • C. Olston, F. Li, J. Harmsen, J. Soyke, K. Gorovoy, L. Lao, N. Fiedel, S. Ramesh, and V. Rajashekhar. TensorFlow-Serving: Flexible, high-performance ML serving. In Workshop on ML Systems at NIPS, 2017.
  • K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, low latency scheduling. In SOSP, 2013.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, Nov. 2011.
  • J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In PLDI, 2013.
  • B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
  • A. Rheinländer, A. Heise, F. Hueske, U. Leser, and F. Naumann. SOFA: An extensible logical optimizer for UDF-heavy data flows. Inf. Syst., 52:96–125, 2015.
  • S. Ruder. An overview of gradient descent optimization algorithms. CoRR, 2016.
  • A. Scolari, Y. Lee, M. Weimer, and M. Interlandi. Towards accelerating generic machine learning prediction pipelines. In IEEE ICCD, 2017.
  • S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res., 14(1):567–599, Feb. 2013.
  • E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, and B. Recht. KeystoneML: Optimizing pipelines for large-scale advanced analytics. In ICDE, 2017.
  • T. Um, G. Lee, S. Lee, K. Kim, and B.-G. Chun. Scaling up IoT stream processing. In APSys, 2017.
  • N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR, 2018.
  • S. Wanderman-Milne and N. Li. Runtime code generation in Cloudera Impala. IEEE Data Eng. Bull., 37:31–37, 2014.
  • W. Wang, S. Wang, J. Gao, M. Zhang, G. Chen, T. K. Ng, and B. C. Ooi. Rafiki: Machine learning as an analytics service system. ArXiv e-prints, Apr. 2018.
  • M. Welsh, D. Culler, and E. Brewer. SEDA: An architecture for well-conditioned, scalable internet services. In SOSP, 2001.
  • J.-M. Yun, Y. He, S. Elnikety, and S. Ren. Optimal aggregation policy for reducing tail latency of web search. In SIGIR, 2015.
  • M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.
  • M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
  • C. Zhang, A. Kumar, and C. Ré. Materialization optimizations for feature selection workloads. ACM Trans. Database Syst., 41(1):2:1–2:32, Feb. 2016.
  • Rules of machine learning. https://developers.google.com/machine-learning/rules-of-ml.
  • M. Zukowski, P. A. Boncz, N. Nes, and S. Héman. MonetDB/X100 - A DBMS in the CPU cache. IEEE Data Eng. Bull., 28(2):17–22, 2005.