Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

Zhou Feihu
Wu Haiyang
Lan Rui
Li Fan
Zhang Han
Yang Yuekui
Guo Zhenyu
Wang Di

SIGIR '20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, July 2020, pp. 911-920.

DOI: https://doi.org/10.1145/3397271.3401113

Abstract:

We present Distributed Equivalent Substitution (DES) training, a novel distributed training framework for large-scale recommender systems with dynamic sparse features. DES introduces fully synchronous training to large-scale recommendation systems for the first time by reducing communication, thus making the training of commercial recommender systems …

Introduction
  • Large-scale recommender systems are critical tools to enhance user experience and promote sales/services for many online websites and mobile applications.
  • The recommender system returns a list of items that the user can further interact with or ignore.
  • These user operations, queries, and interactions are recorded in logs as training data for future use.
  • This paper mainly studies the core component of a recommender system: the models used for ranking and online learning.
Highlights
  • Large-scale recommender systems are critical tools to enhance user experience and promote sales/services for many online websites and mobile applications
  • We show that for the different types of models that recommender systems use, we can always find computationally equivalent substitutions and splitting strategies for their weights-rich operators that reduce the amount of data sent over the network
  • We propose a novel framework for models with large-scale sparse dynamic features in recommender systems
  • We take advantage of the observation that for all models in recommender systems, the first one or few weights-rich layers perform only straightforward computation and can be replaced by multiple sub-operators that form a computationally equivalent substitution (see the sketch after this list)
  • The application of Distributed Equivalent Substitution (DES) to popular DLRMs such as logistic regression (LR), factorization machines (FM), DNN, Wide&Deep, and DeepFM shows the generality of our algorithm
  • Experiments that compare our implementation with a popular mesh-based implementation show that our framework achieves up to 83% communication savings and can bring up to a 4.5x improvement in throughput for deep models
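
For intuition, here is a minimal sketch (an illustration under stated assumptions, not the authors' implementation) of what such a substitution can look like for LR, whose weights-rich first layer is the sparse dot product z = w·x + b: rather than gathering remote weight shards to rebuild the full w, each worker computes a partial sum over the feature weights it owns, and only scalar partials cross the network. mpi4py is an assumed dependency and all names are hypothetical.

```python
# Hedged sketch of a computationally equivalent substitution for LR (z = w.x + b).
# Instead of gathering all weight shards onto one node, each worker evaluates a
# partial dot product over the features it owns, and the scalar partials are
# summed with Allreduce. Not the paper's code; every name here is hypothetical.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

def owns(feature_id):
    # Hypothetical sharding rule: each sparse feature's weight lives on exactly one worker.
    return feature_id % world == rank

# Toy sparse sample as (feature_id, value) pairs, plus this worker's weight shard.
sample = [(3, 1.0), (7, 1.0), (12, 0.5)]
local_weights = {fid: 0.01 * fid for fid, _ in sample if owns(fid)}
bias = 0.1

# Equivalent substitution: compute the local partial sum of w_i * x_i ...
partial = sum(local_weights[fid] * val for fid, val in sample if owns(fid))

# ... and combine only the scalars; the result equals the dot product a single
# node would get after gathering the full weight vector.
logit = comm.allreduce(partial, op=MPI.SUM) + bias
if rank == 0:
    print("logit:", logit)
```

Launched with, e.g., `mpirun -np 4 python des_lr_sketch.py`, each process contributes one scalar per sample instead of a shard of weights; the paper applies the same idea, with model-specific sub-operators, to FM and to the first layers of deep models.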
Methods
  • Software: The authors' DES framework is based on an enhanced version of TensorFlow 1.13.1 and standard OpenMPI 4.0.1.
  • The mesh-based strategy the authors compare against is implemented using a popular open-source framework, DiFacto (Li et al., 2016).
  • Dataset: To verify the performance of DES in a real industrial context, the authors extract a continuous segment of samples from a recommender system in internal use.
  • The dataset contains 10,809,440 samples and is stored on a remote sample server.
Results
  • Evaluation Metrics: The authors use AUC as the evaluation metric for all experiments (a minimal computation sketch follows this list).
  • Performance Summary: Figure 10 compares the framework with a mesh-based implementation using DiFacto on three widely adopted models in mainstream recommender systems: LR, W&D, and DeepFM.
  • On all three models, DES achieves better AUC in a smaller number of iterations with an order of magnitude lower communication cost.
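
AUC here is the area under the ROC curve of the binary click prediction. A minimal offline sketch, assuming scikit-learn is available (the paper does not name its AUC implementation):

```python
# Toy AUC computation with scikit-learn (an assumed dependency, not named by the
# paper). y_true are click labels; y_score are predicted click probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0, 1]
y_score = [0.1, 0.8, 0.2, 0.3, 0.9]
# 5 of the 6 positive/negative pairs are ranked correctly, so this prints ~0.833.
print("AUC:", roc_auc_score(y_true, y_score))
```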
Conclusion
  • The authors propose a novel framework for models with large-scale sparse dynamic features in recommender systems.
  • The authors' framework achieves efficient synchronous distributed training thanks to its core component, the Distributed Equivalent Substitution (DES) algorithm.
  • The authors take advantage of the observation that for all models in recommender systems, the first one or few weights-rich layers perform only straightforward computation and can be replaced by multiple sub-operators that form a computationally equivalent substitution.
  • Experiments that compare the implementation with a popular mesh-based implementation show that the framework achieves up to 83% communication savings and can bring up to a 4.5x improvement in throughput for deep models.
Tables
  • Table 1: Some common components that are shared among different recommender system models
  • Table 2: The number of unique features per batch for different batch sizes on a real-world recommender system
  • Table 3: The network reduction ratio of different models using a 4-node cluster
  • Table 4: Average AUC for three models after a 7-day training session
  • Table 5: Throughput of DES and DiFacto on three models
Related work
  • Large-scale recommender systems are distributed systems designed specifically for training recommendation models. This section reviews related work from the perspectives of both fields.

    2.1 Large-Scale Distributed Training Systems

    Data parallelism splits training data on the batch dimension and keeps a replica of the entire model on each device. The popularity of ring-based AllReduce (Gibiansky, 2017) has enabled large-scale data-parallel training (Goyal et al., 2017; Xianyan Jia, 2018; You et al., 2019). Parameter Server (PS) is a primary method for training large-scale recommender systems due to its simplicity and scalability (Dean et al., 2012; Li et al., 2014). Each worker processes a subset of the input data and is allowed to use stale weights and to update either its own weights or those of a parameter server. Model parallelism is another commonly used distributed training strategy (Krizhevsky, 2014; Dean et al., 2012). More recent model-parallelism strategies learn the device placement (Mirhoseini et al., 2017) or use pipelining (Huang et al., 2018). These works usually focus on enabling the system to process complex models with a large number of weights.
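
As a reference point for the data-parallel pattern described above (illustrative only, not code from any of the cited systems), the following sketch averages per-worker gradients with an Allreduce before each synchronous update; mpi4py and NumPy are assumed dependencies and all names are hypothetical.

```python
# Hedged sketch of synchronous data parallelism: every worker holds a full
# replica of a toy dense model, computes a gradient on its own mini-batch,
# and all replicas apply the same averaged gradient.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
world = comm.Get_size()

weights = np.zeros(4)            # toy model replica, identical on every worker
local_grad = np.random.rand(4)   # stand-in for a per-worker backward pass

avg_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)  # sum gradients across workers
avg_grad /= world                                  # then average

weights -= 0.01 * avg_grad       # same update everywhere keeps replicas in sync
```

For recommendation models whose first layers hold huge, dynamically changing sparse weight tables, keeping a full replica per worker is typically impractical, which is why parameter servers, and in this paper DES, partition those weights instead.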
References
  • Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Revisiting distributed synchronous SGD. CoRR, abs/1604.00981, 2016. URL http://arxiv.org/abs/1604.00981.
  • Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., and Shah, H. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, pp. 7–10, New York, NY, USA, 2016. ACM. doi: 10.1145/2988450.2988454. URL http://doi.acm.org/10.1145/2988450.2988454.
  • Covington, P., Adams, J., and Sargin, E. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, pp. 191–198, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4035-9. doi: 10.1145/2959100.2959190. URL http://doi.acm.org/10.1145/2959100.2959190.
  • Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pp. 1223–1231, USA, 2012. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999271.
  • Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. URL http://dl.acm.org/citation.cfm?id=1953048.2021068.
  • Gholami, A., Azad, A., Jin, P., Keutzer, K., and Buluc, A. Integrated model, batch, and domain parallelism in training neural networks. In SPAA'18: 30th ACM Symposium on Parallelism in Algorithms and Architectures, 2018. URL http://eecs.berkeley.edu/~aydin/integrateddnn_spaa2018.pdf.
  • Gibiansky, A. Bringing HPC techniques to deep learning, 2017. URL http://research.baidu.com/bringing-hpc-techniques-deep-learning.
  • Goyal, P., Dollar, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.
  • Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. DeepFM: A factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 1725–1731. AAAI Press, 2017. ISBN 978-0-9992411-0-3. URL http://dl.acm.org/citation.cfm?id=3172077.3172127.
  • Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018. URL http://arxiv.org/abs/1811.06965.
  • Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR, abs/1802.04924, 2018. URL http://arxiv.org/abs/1802.04924.
  • Juan, Y., Zhuang, Y., Chin, W.-S., and Lin, C.-J. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, pp. 43–50, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4035-9. doi: 10.1145/2959100.2959134. URL http://doi.acm.org/10.1145/2959100.2959134.
  • Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014. URL http://arxiv.org/abs/1404.5997.
  • Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI'14, pp. 583–598, Berkeley, CA, USA, 2014. USENIX Association. ISBN 978-1-931971-16-4. URL http://dl.acm.org/citation.cfm?id=2685048.2685095.
  • Li, M., Liu, Z., Smola, A. J., and Wang, Y.-X. DiFacto: Distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM '16, pp. 377–386, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3716-8. doi: 10.1145/2835776.2835781. URL http://doi.acm.org/10.1145/2835776.2835781.
  • Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., and Sun, G. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pp. 1754–1763, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5552-0. doi: 10.1145/3219819.3220023. URL http://doi.acm.org/10.1145/3219819.3220023.
  • McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov, E., Golovin, D., Chikkerur, S., Liu, D., Wattenberg, M., Hrafnkelsson, A. M., Boulos, T., and Kubica, J. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pp. 1222–1230, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2174-7. doi: 10.1145/2487575.2488200. URL http://doi.acm.org/10.1145/2487575.2488200.
  • Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 2430–2439. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305890.3305932.
  • Jia, X., et al. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. CoRR, abs/1807.11205, July 2018.
  • You, Y., Li, J., Hseu, J., Song, X., Demmel, J., and Hsieh, C. Large batch optimization for deep learning: Training BERT in 76 minutes. CoRR, abs/1904.00962, 2019. URL http://arxiv.org/abs/1904.00962.
  • Zhang, W., Du, T., and Wang, J. Deep learning over multi-field categorical data: A case study on user response prediction. ArXiv, abs/1601.02376, 2016.
  • Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pp. 1059–1068, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5552-0. doi: 10.1145/3219819.3219823. URL http://doi.acm.org/10.1145/3219819.3219823.
  • Oord, A. v. d., Dieleman, S., and Schrauwen, B. Deep content-based music recommendation. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pp. 2643–2651, USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999792.2999907.
  • Rendle, S. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, pp. 995–1000, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4256-0. doi: 10.1109/ICDM.2010.127. URL http://dx.doi.org/10.1109/ICDM.2010.127.
  • Richardson, M., Dominowska, E., and Ragno, R. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pp. 521–530, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-654-7. doi: 10.1145/1242572.1242643. URL http://doi.acm.org/10.1145/1242572.1242643.
  • Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., Sepassi, R., and Hechtman, B. Mesh-TensorFlow: Deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp. 10435–10444, USA, 2018. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=3327546.3327703.