Resource Management with Deep Reinforcement Learning

HotNets, pp. 50-56, 2016.

DOI: https://doi.org/10.1145/3005745.3005750

Abstract:

Resource management problems in systems and networking often manifest as difficult online decision making tasks where appropriate solutions depend on understanding the workload and environment. Inspired by recent advances in deep reinforcement learning for AI problems, we consider building systems that learn to manage resources directly from experience.

Introduction
  • Examples include job scheduling in compute clusters [17], bitrate adaptation in video streaming [23, 39], relay selection in Internet telephony [40], virtual machine placement in cloud computing [20, 6], congestion control [38, 37, 13], and so on
  • The majority of these problems are solved today using meticulously designed heuristics.
  • Perusing recent research in the field, the typical design flow is: (1) come up with clever heuristics for a simplified model of the problem; and (2) painstakingly test and tune the heuristics for good performance in practice.
  • The state transitions and rewards are stochastic and are assumed to have the Markov property; i.e., the state transition probabilities and rewards depend only on the current state of the environment s_t and the action taken by the agent a_t (a minimal sketch of this interaction loop follows the list).
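
A minimal sketch of the agent-environment loop implied by this formulation is shown below. The toy environment, its transition rule, and its reward are illustrative placeholders rather than the paper's DeepRM scheduling simulator; they serve only to make the Markov assumption concrete.

```python
import random


class ToyEnvironment:
    """Illustrative stochastic environment with the Markov property:
    the next state and the reward depend only on (s_t, a_t)."""

    def __init__(self, num_states=5, num_actions=2, seed=0):
        self.rng = random.Random(seed)
        self.num_states = num_states
        self.num_actions = num_actions
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Transition probabilities and rewards are functions of the
        # current (state, action) pair only, not of earlier history.
        next_state = (self.state + action + self.rng.choice([0, 1])) % self.num_states
        reward = 1.0 if next_state == self.num_states - 1 else 0.0
        self.state = next_state
        return next_state, reward


def run_episode(env, policy, horizon=20):
    """Generic RL interaction loop: observe s_t, take a_t, receive r_t."""
    state = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


if __name__ == "__main__":
    env = ToyEnvironment()
    random_policy = lambda s: random.randrange(env.num_actions)
    print(run_episode(env, random_policy)[:5])
```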
Highlights
  • Resource management problems are ubiquitous in computer systems and networks
  • We take a step back to understand some of the reasons why real-world resource management problems are challenging.
  • We briefly review Reinforcement Learning (RL) techniques that we build on in this paper; we refer readers to [34] for a detailed survey and rigorous derivations
  • We focus on a class of Reinforcement Learning algorithms that learn by performing gradient descent on the policy parameters (a minimal sketch of such an update follows this list).
  • This paper shows that it is feasible to apply state-of-the-art Deep Reinforcement Learning techniques to large-scale systems
  • Our early experiments show that the Reinforcement Learning agent is comparable to, and sometimes better than, ad-hoc heuristics for a multi-resource cluster scheduling problem.
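
These policy-gradient algorithms adjust the policy parameters theta in the direction of E[sum_t G_t * grad_theta log pi_theta(a_t | s_t)], where G_t is the return from step t [35]. Below is a minimal tabular REINFORCE-style sketch of that update, assuming a softmax policy over per-state logits and trajectories of (state, action, reward) tuples. The paper's agent uses a deep neural network policy and its own reward and baseline definitions, so this is a simplified stand-in rather than the paper's implementation.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


class SoftmaxPolicy:
    """Tabular softmax policy pi_theta(a | s); theta holds one logit per (state, action)."""

    def __init__(self, num_states, num_actions, lr=0.1, seed=0):
        self.theta = np.zeros((num_states, num_actions))
        self.lr = lr
        self.rng = np.random.default_rng(seed)

    def act(self, state):
        probs = softmax(self.theta[state])
        return int(self.rng.choice(len(probs), p=probs))

    def update(self, trajectory, gamma=0.99):
        """REINFORCE: theta <- theta + lr * (G_t - baseline) * grad log pi(a_t | s_t).
        A simple mean-return baseline is subtracted to reduce variance."""
        # Discounted returns G_t, computed backwards over the trajectory.
        returns, G = [], 0.0
        for _, _, reward in reversed(trajectory):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        baseline = float(np.mean(returns))

        for (state, action, _), G_t in zip(trajectory, returns):
            probs = softmax(self.theta[state])
            grad_log = -probs                 # d/d theta_{s,a'} log softmax
            grad_log[action] += 1.0
            self.theta[state] += self.lr * (G_t - baseline) * grad_log
```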
Methods
  • The simulated cluster has two resource types, i.e., with capacity {1r, 1r}.
  • The average job arrival rate is chosen such that the average load varies between 10% and 190% of cluster capacity.
  • Job durations and resource demands are chosen as follows: 80% of the jobs have durations chosen uniformly between 1t and 3t; the remaining jobs have durations chosen uniformly between 10t and 15t.
  • Each job has a dominant resource which is picked independently at random.
  • The demand for the dominant resource is chosen uniformly between 0.25r and 0.5r, and the demand for the other resource is chosen uniformly between 0.05r and 0.1r (a sketch of this sampling procedure follows the list).
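
A short sketch of how jobs with these distributions could be sampled is shown below. The function names, the representation of the two resources as a length-2 list, and the use of continuous rather than integer durations are assumptions made for illustration; job arrival times, which the load parameter above governs, are not modeled here.

```python
import random


def generate_job(rng):
    """Sample one job according to the workload described above:
    80% short jobs (1t-3t), 20% long jobs (10t-15t); a randomly chosen
    dominant resource demands 0.25r-0.5r, the other resource 0.05r-0.1r."""
    if rng.random() < 0.8:
        duration = rng.uniform(1, 3)      # short job (use rng.randint for integer steps)
    else:
        duration = rng.uniform(10, 15)    # long job

    dominant = rng.randrange(2)           # which of the two resources dominates
    demand = [0.0, 0.0]
    demand[dominant] = rng.uniform(0.25, 0.5)
    demand[1 - dominant] = rng.uniform(0.05, 0.1)
    return {"duration": duration, "demand": demand}


def generate_workload(num_jobs=100, seed=0):
    rng = random.Random(seed)
    return [generate_job(rng) for _ in range(num_jobs)]


if __name__ == "__main__":
    for job in generate_workload(num_jobs=3):
        print(job)
```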
Conclusion
  • The authors elaborate on current limitations of the solution, which motivate several challenging research directions.

    Machine boundaries and locality.
  • The authors' problem formulation assumes a single "large resource pool"; this abstracts away machine boundaries and potential resource fragmentation.
  • This formulation is more practical than might appear since many cluster schedulers make independent scheduling decisions per machine.
  • Learning resource management strategies directly from experience, if the authors can make it work in a practical context, could offer a real alternative to current heuristic-based approaches.
Related work
  • RL has been used for a variety of learning tasks, ranging from robotics [25, 24] to industrial manufacturing [26] and computer game playing [34]. Of specific relevance to our work is Zhang and Dietterich’s paper [42] on allocating human resources to tasks before and after NASA shuttle missions. Our job scheduling setup has similarities (e.g., multiple jobs and resources), but differs crucially in being an online problem, whereas the NASA task is offline (all input is known in advance). Some early work uses RL for decentralized packet routing in a switch [8], but the problem sizes were small and neural network machinery was not needed. Recently, learning has been applied to designing congestion control protocols using a large number of offline [37] or online [13] experiments. RL could provide a useful framework for learning such congestion control algorithms as well.
Funding
  • This work was funded in part by NSF grants CNS-1617702 and CNS-1563826
References
  • [1] Terminator, http://www.imdb.com/title/tt0088247/.
  • [2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [3] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems, 2007.
  • [4] S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Reoptimizing data parallel computing. In NSDI, pages 281–294, San Jose, CA, 2012. USENIX.
  • [5] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI, 2010.
  • [6] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al. A view of cloud computing. Communications of the ACM, (4), 2010.
  • [7] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming: an overview. In Decision and Control. IEEE, 1995.
  • [8] J. A. Boyan and M. L. Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. In Advances in Neural Information Processing Systems, 1994.
  • [9] A. R. Cassandra and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Machine Learning Proceedings 1995, page 362. Morgan Kaufmann, 2016.
  • [10] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, pages 571–582, Broomfield, CO, Oct. 2014. USENIX Association.
  • [11] J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, pages 74–80, 2013.
  • [12] C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In ASPLOS '14, pages 127–144, New York, NY, USA, 2014. ACM.
  • [13] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira. PCC: Re-architecting congestion control for consistent high performance. In NSDI, pages 395–408, Oakland, CA, May 2015. USENIX Association.
  • [14] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys). ACM, 2012.
  • [15] J. Gao and R. Evans. DeepMind AI reduces Google data centre cooling bill by 40%. https://deepmind.com/blog/deepmind-ai-reducesgoogle-data-centre-cooling-bill-40/.
  • [16] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In NSDI '11, pages 323–336, Berkeley, CA, USA, 2011. USENIX Association.
  • [17] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In SIGCOMM '14, pages 455–466, New York, NY, USA, 2014. ACM.
  • [18] M. T. Hagan, H. B. Demuth, M. H. Beale, and O. De Jesús. Neural Network Design. PWS Publishing Company, Boston, 1996.
  • [19] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, (1), 1970.
  • [20] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown. ElasticTree: Saving energy in data center networks. In NSDI '10, Berkeley, CA, USA, 2010. USENIX Association.
  • [22] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In ACM SIGOPS, 2009.
  • [23] J. Junchen, D. Rajdeep, A. Ganesh, C. Philip, P. Venkata, S. Vyas, D. Esbjorn, G. Marcin, K. Dalibor, V. Renat, and Z. Hui. A control-theoretic approach for dynamic adaptive video streaming over HTTP. In SIGCOMM '15, New York, NY, USA, 2015. ACM.
  • [24] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996.
  • [25] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.
  • [26] S. Mahadevan and G. Theocharous. Optimizing production manufacturing using reinforcement learning. In FLAIRS Conference, 1998.
  • [27] I. Menache, S. Mannor, and N. Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, (1), 2005.
  • [28] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, 2016.
  • [29] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, 2013.
  • [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
  • [31] G. E. Monahan. State of the art - a survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, (1), 1982.
  • [32] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
  • [33] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
  • [34] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • [35] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
  • [36] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In SOCC '13, pages 5:1–5:16, New York, NY, USA, 2013. ACM.
  • [37] K. Winstein and H. Balakrishnan. TCP ex Machina: Computer-generated congestion control. In SIGCOMM, 2013.
  • [38] K. Winstein, A. Sivaraman, and H. Balakrishnan. Stochastic forecasts achieve high throughput and low delay over cellular networks. In NSDI, pages 459–471, Lombard, IL, 2013. USENIX.
  • [39] S. Yi, Y. Xiaoqi, J. Junchen, S. Vyas, L. Fuyuan, W. Nanshu, L. Tao, and B. Sinopoli. CS2P: Improving video bitrate selection and adaptation with data-driven throughput prediction. In SIGCOMM, New York, NY, USA, 2016. ACM.
  • [40] X. Yin, A. Jindal, V. Sekar, and B. Sinopoli. Via: Improving internet telephony call quality using predictive relay selection. In SIGCOMM '16, 2016.
  • [41] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.
  • [42] W. Zhang and T. G. Dietterich. A reinforcement learning approach to job-shop scheduling. In IJCAI, 1995.