# Resource Management with Deep Reinforcement Learning

HotNets, pp. 50-56, 2016.

Abstract:

Resource management problems in systems and networking often manifest as difficult online decision making tasks where appropriate solutions depend on understanding the workload and environment. Inspired by recent advances in deep reinforcement learning for AI problems, we consider building systems that learn to manage resources directly from experience.


Introduction

- Examples include job scheduling in compute clusters [17], bitrate adaptation in video streaming [23, 39], relay selection in Internet telephony [40], virtual machine placement in cloud computing [20, 6], congestion control [38, 37, 13], and so on
- The majority of these problems are solved today using meticulously designed heuristics.
- Perusing recent research in the field, the typical design flow is: (1) come up with clever heuristics for a simplified model of the problem; and (2) painstakingly test and tune the heuristics for good performance in practice.
- The state transitions and rewards are stochastic and are assumed to have the Markov property; i.e., the state transition probabilities and rewards depend only on the state of the environment $s_t$ and the action $a_t$ taken by the agent
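
Spelled out in standard RL notation (cf. [34]), the Markov property in the last bullet says that the next state and reward depend only on the current state-action pair:

$$\Pr\left(s_{t+1}, r_t \mid s_0, a_0, \ldots, s_t, a_t\right) = \Pr\left(s_{t+1}, r_t \mid s_t, a_t\right)$$

with the agent acting to maximize the expected cumulative discounted reward $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$, where $\gamma \in (0, 1]$ is the discount factor.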

Highlights

- Resource management problems are ubiquitous in computer systems and networks
- We take a step back to understand some of the reasons why real-world resource management problems are challenging
- We briefly review Reinforcement Learning (RL) techniques that we build on in this paper; we refer readers to [34] for a detailed survey and rigorous derivations
- We focus on a class of Reinforcement Learning algorithms that learn by performing gradient descent on the policy parameters
- This paper shows that it is feasible to apply state-of-the-art Deep Reinforcement Learning techniques to large-scale systems
- Our early experiments show that the Reinforcement Learning agent is comparable to, and sometimes better than, ad-hoc heuristics for a multi-resource cluster scheduling problem
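
The policy-gradient idea referenced above (REINFORCE-style updates, see [35]) can be illustrated with a minimal sketch. This is not the paper's DeepRM implementation (which trains a neural network on images of the cluster state); it is a toy tabular softmax policy on an invented two-state, one-step environment, with all names and parameters chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class SoftmaxPolicy:
    """Tabular softmax policy pi(a | s; theta) over discrete states/actions."""
    def __init__(self, n_states, n_actions):
        self.theta = np.zeros((n_states, n_actions))

    def probs(self, s):
        return softmax(self.theta[s])

    def act(self, s):
        return int(rng.choice(self.theta.shape[1], p=self.probs(s)))

    def grad_log_pi(self, s, a):
        # Gradient of log pi(a | s; theta): one-hot(a) - pi(. | s), in row s.
        g = np.zeros_like(self.theta)
        g[s] = -self.probs(s)
        g[s, a] += 1.0
        return g

def run_episode(policy, horizon=20):
    """Toy one-step environment: reward 1 iff the action matches the state."""
    traj = []
    for _ in range(horizon):
        s = int(rng.integers(0, 2))
        a = policy.act(s)
        r = 1.0 if a == s else 0.0
        traj.append((s, a, r))
    return traj

def reinforce(policy, n_episodes=300, lr=0.1):
    # REINFORCE update: theta += lr * grad log pi(a_t | s_t) * G_t.
    # In this one-step environment the return G_t is just the reward r_t.
    for _ in range(n_episodes):
        for s, a, r in run_episode(policy):
            policy.theta += lr * policy.grad_log_pi(s, a) * r

policy = SoftmaxPolicy(n_states=2, n_actions=2)
reinforce(policy)
# After training, the policy strongly prefers the matching action in each state.
```

In the paper's full episodic setting, $G_t$ is the discounted return from step $t$, and a baseline is subtracted from it to reduce the variance of the gradient estimate.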

Methods

- The average job arrival rate is chosen such that the average load varies between 10% and 190% of cluster capacity.
- The cluster has two resource types, with capacity {1r, 1r}.
- Job durations and resource demands are chosen as follows: 80% of the jobs have duration uniformly chosen between 1t and 3t; the remaining are chosen uniformly from 10t to 15t.
- Each job has a dominant resource which is picked independently at random.
- The demand for the dominant resource is chosen uniformly between 0.25r and 0.5r, and the demand for the other resource uniformly between 0.05r and 0.1r
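
The workload described in the bullets above can be reproduced with a short generator. This is a sketch, not the authors' code; it assumes integer job durations in timestep units ('t') and expresses demands as fractions of each resource's capacity (1r). The function name and the sampling helpers are inventions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_job():
    """Sample one job following the workload model in the Methods bullets.

    Durations are in timesteps ('t'); demands are fractions of the
    capacity of each of the two resource types (capacity {1r, 1r}).
    """
    # 80% short jobs, uniform over {1t, 2t, 3t}; 20% long, over {10t, ..., 15t}.
    if rng.random() < 0.8:
        duration = int(rng.integers(1, 4))
    else:
        duration = int(rng.integers(10, 16))
    # Each job's dominant resource is picked independently at random.
    dominant = int(rng.integers(0, 2))
    demand = np.empty(2)
    demand[dominant] = rng.uniform(0.25, 0.5)      # dominant-resource demand
    demand[1 - dominant] = rng.uniform(0.05, 0.1)  # other-resource demand
    return duration, demand

jobs = [sample_job() for _ in range(1000)]
```

Varying the per-timestep arrival probability of such jobs then sweeps the average load across the 10%-190% range studied in the paper.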

Conclusion

- The authors elaborate on current limitations of the solution, which motivate several challenging research directions.

Machine boundaries and locality.

- The authors' problem formulation assumes a single "large resource pool"; this abstracts away machine boundaries and potential resource fragmentation.
- This formulation is more practical than it might appear, since many cluster schedulers make independent scheduling decisions per machine.
- Learning resource management strategies directly from experience, if the authors can make it work in a practical context, could offer a real alternative to current heuristic-based approaches


Related work

- RL has been used for a variety of learning tasks, ranging from robotics [25, 24] to industrial manufacturing [26] and computer game playing [34].
- Of specific relevance to our work is Zhang and Dietterich's paper [42] on allocating human resources to tasks before and after NASA shuttle missions. Our job scheduling setup has similarities (e.g., multiple jobs and resources), but differs crucially in being an online problem, whereas the NASA task is offline (all input is known in advance).
- Some early work uses RL for decentralized packet routing in a switch [8], but the problem sizes were small and neural network machinery was not needed.
- Recently, learning has been applied to designing congestion control protocols using a large number of offline [37] or online [13] experiments. RL could provide a useful framework for learning such congestion control algorithms as well.

Funding

- This work was funded in part by NSF grants CNS-1617702 and CNS-1563826

References

- [1] Terminator, http://www.imdb.com/title/tt0088247/.
- [2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
- [3] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems, 2007.
- [4] S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Reoptimizing data parallel computing. In NSDI, pages 281–294, San Jose, CA, 2012. USENIX.
- [5] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI, 2010.
- [6] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al. A view of cloud computing. Communications of the ACM, (4), 2010.
- [7] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming: an overview. In Decision and Control. IEEE, 1995.
- [8] J. A. Boyan and M. L. Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. In Advances in Neural Information Processing Systems, 1994.
- [9] A. R. Cassandra and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Machine Learning Proceedings 1995, page 362. Morgan Kaufmann, 2016.
- [10] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, pages 571–582, Broomfield, CO, Oct. 2014. USENIX Association.
- [11] J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, pages 74–80, 2013.
- [12] C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In ASPLOS '14, pages 127–144, New York, NY, USA, 2014. ACM.
- [13] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira. PCC: Re-architecting congestion control for consistent high performance. In NSDI, pages 395–408, Oakland, CA, May 2015. USENIX Association.
- [14] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: Guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys). ACM, 2012.
- [15] J. Gao and R. Evans. DeepMind AI reduces Google data centre cooling bill by 40%. https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/.
- [16] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In NSDI '11, pages 323–336, Berkeley, CA, USA, 2011. USENIX Association.
- [17] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In SIGCOMM '14, pages 455–466, New York, NY, USA, 2014. ACM.
- [18] M. T. Hagan, H. B. Demuth, M. H. Beale, and O. De Jesús. Neural Network Design. PWS Publishing Company, Boston, 1996.
- [19] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, (1), 1970.
- [20] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown. ElasticTree: Saving energy in data center networks. In NSDI '10, Berkeley, CA, USA, 2010. USENIX Association.
- [22] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In ACM SIGOPS, 2009.
- [23] X. Yin, A. Jindal, V. Sekar, and B. Sinopoli. A control-theoretic approach for dynamic adaptive video streaming over HTTP. In SIGCOMM '15, New York, NY, USA, 2015. ACM.
- [24] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 1996.
- [25] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.
- [26] S. Mahadevan and G. Theocharous. Optimizing production manufacturing using reinforcement learning. In FLAIRS Conference, 1998.
- [27] I. Menache, S. Mannor, and N. Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, (1), 2005.
- [28] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, 2016.
- [29] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, 2013.
- [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
- [31] G. E. Monahan. State of the art - a survey of partially observable markov decision processes: theory, models, and algorithms. Management Science, (1), 1982.
- [32] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
- [33] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershevlvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
- [34] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
- [35] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
- [36] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache hadoop yarn: Yet another resource negotiator. SOCC ’13, pages 5:1–5:16, New York, NY, USA, 2013. ACM.
- [37] K. Winstein and H. Balakrishnan. TCP Ex Machina: Computer-generated Congestion Control. In SIGCOMM, 2013.
- [38] K. Winstein, A. Sivaraman, and H. Balakrishnan. Stochastic forecasts achieve high throughput and low delay over cellular networks. In NSDI, pages 459–471, Lombard, IL, 2013. USENIX.
- [39] Y. Sun, X. Yin, J. Jiang, V. Sekar, F. Lin, N. Wang, T. Liu, and B. Sinopoli. CS2P: Improving video bitrate selection and adaptation with data-driven throughput prediction. In SIGCOMM, New York, NY, USA, 2016. ACM.
- [40] J. Jiang, R. Das, G. Ananthanarayanan, P. A. Chou, V. Padmanabhan, V. Sekar, E. Dominique, M. Goliszewski, D. Kukoleca, R. Vafin, and H. Zhang. Via: Improving Internet telephony call quality using predictive relay selection. In SIGCOMM '16, 2016.
- [41] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.
- [42] W. Zhang and T. G. Dietterich. A reinforcement learning approach to job-shop scheduling. In IJCAI. Citeseer, 1995.
