
Ranking Distillation: Learning Compact Ranking Models With High Performance for Recommender System.

KDD, (2018): 2289-2298


Abstract

We propose a novel way to train ranking models, such as recommender systems, that are both effective and efficient. Knowledge distillation (KD) was shown to be successful in image recognition to achieve both effectiveness and efficiency. We propose a KD technique for learning to rank problems, called ranking distillation (RD). Specifically...

Introduction
  • In recent years, information retrieval (IR) systems have become a core technology at many large companies: web page retrieval for Google and Yahoo, and personalized item retrieval (a.k.a. recommender systems) for Amazon and Netflix
  • The core of such systems is a ranking model that computes a relevance score for each (q, d) pair, where q is a query and d is a document.
  • The size of such models has grown by an order of magnitude or more compared to previous methods
  • While such models achieve better ranking performance by capturing more query-document interactions, their larger size incurs higher latency in the online inference phase when responding to user requests
Highlights
  • In recent years, information retrieval (IR) systems have become a core technology at many large companies, such as web page retrieval for Google and Yahoo, and personalized item retrieval (a.k.a. recommender systems) for Amazon and Netflix
  • We study knowledge distillation for the learning to rank problem that is the core in recommender systems and many other IR systems
  • Besides the training data, we introduce extra information generated from a well-trained teacher model and make the student model as effective as the teacher
  • Under the paradigm of ranking distillation, for a certain query q, besides the labeled documents, we use a top-K ranking of unlabeled documents generated by a well-trained teacher ranking model MT as extra information to guide the training of the student ranking model MS, which has fewer parameters
  • We adopt as many parameters as possible for the teacher model to achieve a good performance on each data set
  • This paper focused on several key issues of ranking distillation, i.e., the problem formulation, the representation of the teacher's supervision, and the balance between trust in the training data and trust in the teacher, and presented our solutions
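The core idea in the highlights above, using the teacher's top-K ranking over unlabeled documents as extra supervision for a smaller student, can be sketched as a combined training objective. The sketch below is illustrative only, not the paper's exact equations: the point-wise cross-entropy form, the function and parameter names, and the single balancing weight `alpha` are all assumptions.

```python
import numpy as np

def ranking_distillation_loss(student_scores, labels, teacher_topk_idx,
                              weights, alpha=0.5):
    """Hypothetical sketch of a ranking-distillation objective.

    Combines a point-wise ranking loss on labeled documents with a
    weighted loss that pushes the student to score the teacher's
    top-K unlabeled documents highly.
    """
    probs = 1.0 / (1.0 + np.exp(-student_scores))  # sigmoid relevance
    eps = 1e-12
    # Point-wise cross-entropy on the labeled documents.
    l_rank = -np.mean(labels * np.log(probs + eps)
                      + (1 - labels) * np.log(1 - probs + eps))
    # Distillation: treat the teacher's top-K documents as positives,
    # each weighted by a per-position importance weight.
    topk_probs = probs[teacher_topk_idx]
    l_distill = -np.sum(weights * np.log(topk_probs + eps)) / np.sum(weights)
    return (1 - alpha) * l_rank + alpha * l_distill
```

Here `alpha` plays the role of the balance the paper discusses between trust in the training data and trust in the teacher; `alpha = 0` recovers ordinary training of the student alone.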
Methods
  • The authors use recommendation as the task for evaluating the performance of ranking distillation.
  • To measure online inference efficiency, the authors count the number of parameters in each model and report the wall time for generating a recommendation list for every user based on her/his last 5 actions in the training data set.
  • The authors apply the proposed ranking distillation to two sequential recommendation models that have been shown to have strong performances:
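The paper also compares different schemes for weighting the positions in the teacher's top-K ranking (see Table 4). A minimal sketch of one plausible scheme is shown below, weighting higher teacher ranks more heavily via exponential decay; the exponential form and the decay parameter `lam` are assumptions for illustration, not the paper's exact formula.

```python
import math

def position_importance(rank, lam=5.0):
    """Position-importance weight for a zero-indexed teacher rank.

    Documents the teacher ranks higher receive larger weights,
    decaying exponentially as rank grows.
    """
    return math.exp(-(rank + 1) / lam)

def topk_weights(k, lam=5.0):
    """Normalized weights for the teacher's top-k positions."""
    raw = [position_importance(r, lam) for r in range(k)]
    total = sum(raw)
    return [w / total for w in raw]  # weights sum to 1
```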
Results
  • The results of each method are summarized in Table 2.
  • The authors included three non-sequential recommendation baselines: popularity-based item recommendation (POP), item-based collaborative filtering (ItemCF) [32], and Bayesian personalized ranking (BPR) [30].
  • The performance of these non-sequential baselines is worse than that of the sequential recommenders, i.e., Fossil and Caser
Conclusion
  • Under the paradigm of ranking distillation, for a certain query q, besides the labeled documents, the authors use a top-K ranking of unlabeled documents generated by a well-trained teacher ranking model MT as extra information to guide the training of the student ranking model MS, which has fewer parameters.
  • The authors found that a pair-wise distillation loss, which focuses heavily on the partial order within the teacher's ranking, produces both upward and downward gradients, making training unstable and sometimes even failing to converge.
Tables
  • Table1: Statistics of the data sets
  • Table2: Performance comparison. (1) The performance of the models with ranking distillation, Fossil-RD and Caser-RD, always has statistically significant improvements over the student-only models, Fossil-S and Caser-S. (2) The performance of the models with ranking distillation, Fossil-RD and Caser-RD, has no significant degradation from that of the teacher models, Fossil-T and Caser-T. We use the one-tail t-test with significance level at 0.05
  • Table3: Model compactness and online inference efficiency. Time (seconds) indicates the wall time used for generating a recommendation list for every user. Ratio is the student model’s parameter size relative to the teacher model’s parameter size
  • Table4: Performance of Caser-RD with different choices of weighting scheme on two data sets
Related work
  • In this section, we compare our work with several related research areas.

  • Knowledge Distillation. Knowledge distillation has been used in image recognition [3, 15, 31] and neural machine translation [20] as a way to generate compact models. As pointed out in the Introduction, it is not straightforward to apply KD to ranking models, and new issues must be addressed. In the context of ranking problems, the most relevant work is [6], which uses knowledge distillation for image retrieval. That method applies a sampling technique, ranking a sample of the images from all data each time. In general, training on a sample works if the sample shares similar patterns with the rest of the data through some content information, such as image contents in the case of [6]. But this technique is not applicable to training a recommender model when items and users are represented by IDs with no content information, as in collaborative filtering: in that case, training on a sample cannot easily generalize to all users and items.

  • Semi-Supervised Learning. Another related research area is semi-supervised learning [4, 43]. Unlike the teacher-student learning paradigm in knowledge distillation and in our work, semi-supervised learning usually trains a single model and utilizes weakly labeled or unlabeled data, in addition to the labeled data, to gain better performance. Several works in information retrieval followed this direction, using weakly labeled or unlabeled data to construct test collections [2] and to provide extra features [11] and labels [10] for ranking model training. The basic ideas of ranking distillation and semi-supervised learning are similar in that both utilize unlabeled data, though with different purposes.

  • Transfer Learning for Recommender Systems. Transfer learning has been widely used in the field of recommender systems [5, 12]. These methods mainly focus on how to transfer knowledge (e.g., user rating patterns) from a source domain (e.g., movies) to a target domain (e.g., music) to improve recommendation performance. If we consider the student as a target model and the teacher as a source model, our teacher-student learning can be seen as a special kind of transfer learning. However, unlike transfer learning, our teacher-student learning does not require two domains, because the teacher and student models are learned from the same domain. Having a compact student model to enhance online inference efficiency is another purpose of our teacher-student learning.
Funding
  • The work of the second author is partially supported by a Discovery Grant from Natural Sciences and Engineering Research Council of Canada
Reference
  • [1] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormándi, George E. Dahl, and Geoffrey E. Hinton. 2018. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235.
  • [2] Nima Asadi, Donald Metzler, Tamer Elsayed, and Jimmy Lin. 2011. Pseudo test collections for learning web search ranking functions. In International Conference on Research and Development in Information Retrieval. ACM, 1073–1082.
  • [3] Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems. 2654–2662.
  • [4] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. 2009. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks 20, 3 (2009), 542–542.
  • [5] Xu Chen, Zheng Qin, Yongfeng Zhang, and Tao Xu. 2016. Learning to rank features for recommendation over multiple categories. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 305–314.
  • [6] Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. 2017. DarkRank: Accelerating deep metric learning via cross sample similarities transfer. arXiv preprint arXiv:1707.01220.
  • [7] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In International Conference on Knowledge Discovery and Data Mining. ACM, 1082–1090.
  • [8] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. 2015. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics. 192–204.
  • [9] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In ACM Conference on Recommender Systems. 191–198.
  • [10] Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017. Neural ranking models with weak supervision. arXiv preprint arXiv:1704.08803.
  • [11] Fernando Diaz. 2016. Learning to rank with labeled features. In International Conference on the Theory of Information Retrieval. ACM, 41–44.
  • [12] Ignacio Fernández-Tobías, Iván Cantador, Marius Kaminskas, and Francesco Ricci. 2012. Cross-domain recommender systems: A survey of the state of the art. In Spanish Conference on Information Retrieval. 24.
  • [13] Ruining He and Julian McAuley. 2016. Fusing similarity models with Markov chains for sparse sequential recommendation. In International Conference on Data Mining. IEEE.
  • [14] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In International Conference on World Wide Web. ACM, 173–182.
  • [15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • [16] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative metric learning. In International Conference on World Wide Web. 193–201.
  • [17] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition. 1725–1732.
  • [18] Kenji Kawaguchi. 2016. Deep learning without poor local minima. In Advances in Neural Information Processing Systems. 586–594.
  • [19] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Conference on Empirical Methods in Natural Language Processing. ACL, 1746–1751.
  • [20] Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
  • [21] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
  • [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
  • [23] Hui Li, Tsz Nam Chan, Man Lung Yiu, and Nikos Mamoulis. 2017. FEXIPRO: Fast and exact inner product retrieval in recommender systems. In International Conference on Management of Data. ACM, 835–850.
  • [24] David C. Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C. Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. 2017. Related pins at Pinterest: The evolution of a real-world recommender system. In International Conference on World Wide Web. 583–592.
  • [25] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech.
  • [26] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. 2013. Learning with noisy labels. In Advances in Neural Information Processing Systems. 1196–1204.
  • [27] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In AAAI Conference on Artificial Intelligence. AAAI Press, 2793–2799.
  • [28] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. 2017. DeepRank: A new deep architecture for relevance ranking in information retrieval. arXiv preprint arXiv:1710.05649.
  • [29] Steffen Rendle and Christoph Freudenthaler. 2014. Improving pairwise learning for item recommendation from implicit feedback. In International Conference on Web Search and Data Mining. ACM, 273–282.
  • [30] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
  • [31] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
  • [32] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In International Conference on World Wide Web. ACM, 285–295.
  • [33] Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In ACM International Conference on Web Search and Data Mining.
  • [34] Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk. 2015. LEMP: Fast retrieval of large entries in a matrix product. In International Conference on Management of Data. ACM, 107–122.
  • [35] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
  • [36] Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning 81, 1 (2010), 21–35.
  • [37] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 55–64.
  • [38] Quan Yuan, Gao Cong, and Aixin Sun. 2014. Graph-based point-of-interest recommendation with geographical and temporal influences. In International Conference on Information and Knowledge Management. ACM, 659–668.
  • [39] Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. 2016. Discrete collaborative filtering. In International Conference on Research and Development in Information Retrieval. ACM, 325–334.
  • [40] Yan Zhang, Defu Lian, and Guowu Yang. 2017. Discrete personalized ranking for fast collaborative filtering from implicit feedback. In AAAI Conference on Artificial Intelligence. AAAI Press, 1669–1675.
  • [41] Zhiwei Zhang, Qifan Wang, Lingyun Ruan, and Luo Si. 2014. Preference preserving hashing for efficient recommendation. In International Conference on Research and Development in Information Retrieval. ACM, 183–192.
  • [42] Ke Zhou and Hongyuan Zha. 2012. Learning binary codes for collaborative filtering. In International Conference on Knowledge Discovery and Data Mining. ACM, 498–506.
  • [43] Xiaojin Zhu. 2005. Semi-supervised learning literature survey. (2005).