When and How to Retrain Machine Learning-based Cloud Management Systems

Lidia Kidane, Paul Townend, Thijs Metsch, Erik Elmroth

2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2022), 2022


Abstract

Cloud management systems increasingly rely on machine learning (ML) models to predict incoming workload rates, load, and other system behaviours for efficient dynamic resource management. Current state-of-the-art prediction models demonstrate high accuracy but assume that data patterns remain stable. In production use, however, systems may face hardware upgrades, changes in user behaviour, and other events that lead to concept drift: significant changes in the characteristics of data streams over time. To mitigate prediction deterioration, ML models need to be updated, but the questions of when and how best to retrain these models are unsolved in the context of cloud management. We present a pilot study that addresses these questions for one of the most common models for adaptive prediction, Long Short-Term Memory (LSTM), using synthetic and real-world workload data. Our analysis of when to retrain explores approaches for detecting that retraining is required, using both concept drift detection and prediction error thresholds, and examines at what point retraining should actually take place. Our analysis of how to retrain focuses on the data required for retraining and on what proportion should be taken from before and after the need for retraining is detected. We present initial results indicating that retraining existing models can achieve prediction accuracy close to that of newly trained models at much lower cost, and we offer initial advice on how to provide cloud management systems with support for automatic retraining of ML-based models.
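The two questions in the abstract, when to retrain (here via a prediction error threshold) and how to retrain (what share of the data comes from before and after detection), can be illustrated with a minimal sketch. This is not the authors' implementation. The names `ErrorThresholdTrigger` and `build_retraining_set`, the rolling-window size, the threshold, and the 75% post-detection share are all hypothetical choices made for illustration, and a constant predictor stands in for the LSTM.

```python
from collections import deque
from statistics import mean


class ErrorThresholdTrigger:
    """Hypothetical trigger: signal retraining when the rolling mean
    absolute prediction error exceeds a fixed threshold."""

    def __init__(self, threshold, window=5):
        self.threshold = threshold
        self.errors = deque(maxlen=window)  # keeps only the latest errors

    def update(self, actual, predicted):
        self.errors.append(abs(actual - predicted))
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and mean(self.errors) > self.threshold


def build_retraining_set(series, detect_idx, size, post_fraction):
    """Hypothetical helper: mix samples from before and after detection.

    `series` holds every observation available at retraining time, so
    retraining is assumed to be delayed until post-detection data exists.
    """
    n_post = min(round(size * post_fraction), len(series) - detect_idx)
    n_pre = min(size - n_post, detect_idx)
    pre = series[detect_idx - n_pre:detect_idx]
    post = series[detect_idx:detect_idx + n_post]
    return pre + post


# Synthetic workload with an abrupt shift (assumed data, not from the paper).
series = [10.0] * 20 + [20.0] * 20

trigger = ErrorThresholdTrigger(threshold=3.0, window=5)
detect_idx = None
for i, actual in enumerate(series):
    predicted = 10.0  # stand-in for a model trained on the stable regime
    if trigger.update(actual, predicted):
        detect_idx = i
        break

# Take 8 samples, 75% of them from after the detection point.
retrain_data = build_retraining_set(series, detect_idx, size=8, post_fraction=0.75)
print(detect_idx, retrain_data)
```

Detection lags the actual shift because the rolling window must fill with large errors first, which is one reason the split between pre- and post-detection data matters. A concept drift detector could replace the threshold trigger behind the same `update` interface.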


Keywords

cloud computing, cloud workload prediction, concept drift, machine learning, time series prediction
