Ranking Distillation: Learning Compact Ranking Models With High Performance for Recommender System.
KDD, (2018): 2289-2298
We propose a novel way to train ranking models, such as recommender systems, that are both effective and efficient. Knowledge distillation (KD) was shown to be successful in image recognition to achieve both effectiveness and efficiency. We propose a KD technique for learning to rank problems, called ranking distillation (RD). Specifically...
- In recent years, information retrieval (IR) systems have become a core technology at many large companies, such as web page retrieval for Google and Yahoo, and personalized item retrieval (a.k.a. recommender systems) for Amazon and Netflix
- The core of such systems is a ranking model that computes a relevance score for each (q, d) pair, where q is the query and d is a document, for future use
- The size of such models is an order of magnitude (or more) larger than that of previous methods
- While such models achieve better ranking performance by capturing more query-document interactions, their larger size incurs a higher latency at the online inference phase when responding to user requests
- We study knowledge distillation for the learning-to-rank problem, which is at the core of recommender systems and many other IR systems
- Unlike traditional model training, which uses only the training data, knowledge distillation introduces extra information generated from a well-trained teacher model to make the student model as effective as the teacher
- Under the paradigm of ranking distillation, for a given query q, besides the labeled documents, we use a top-K ranking of unlabeled documents generated by a well-trained teacher ranking model MT as extra information to guide the training of a student ranking model MS with fewer parameters
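The training objective implied by this setup combines an ordinary ranking loss on the labeled documents with a distillation loss that rewards the student for scoring the teacher's top-K documents highly. A minimal NumPy sketch, assuming a point-wise logistic loss and illustrative weights (the function and variable names are not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ranking_distillation_loss(scores_pos, scores_topk, weights, lam=0.5):
    """Combined objective: a ranking loss on the labeled (positive) documents
    plus a weighted distillation loss on the teacher's top-K unlabeled
    documents, which the student is encouraged to rank highly."""
    # Point-wise logistic ranking loss on the ground-truth positives.
    l_rank = -np.mean(np.log(sigmoid(scores_pos)))
    # Distillation loss: treat the teacher's top-K documents as soft
    # positives, weighted by per-position importance weights.
    l_distill = -np.sum(weights * np.log(sigmoid(scores_topk))) / np.sum(weights)
    return l_rank + lam * l_distill

# Toy student scores for 2 labeled positives and the teacher's top-3 docs.
loss = ranking_distillation_loss(
    scores_pos=np.array([2.0, 1.5]),
    scores_topk=np.array([1.8, 0.9, 0.3]),
    weights=np.array([0.5, 0.3, 0.2]),
)
```

The balancing factor `lam` here stands in for the paper's trade-off between trusting the training data and trusting the teacher.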
- We adopt as many parameters as possible for the teacher model to achieve a good performance on each data set
- This paper focused on several key issues of ranking distillation, namely the problem formulation, the representation of the teacher's supervision, and the balance between trust in the training data and trust in the teacher, and presented solutions to each
- The authors use recommendation as the task for evaluating the performance of ranking distillation.
- To measure the online inference efficiency, the authors count the number of parameters in each model and report the wall time for making a recommendation list to every user based on her/his last 5 actions in the training data set.
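These two efficiency measures can be reproduced in a few lines; a minimal sketch with toy weight matrices (the array names and shapes are illustrative, not the paper's actual models):

```python
import time
import numpy as np

def count_params(model_weights):
    """Total number of learnable parameters across all weight arrays."""
    return sum(w.size for w in model_weights.values())

# Toy 'teacher' and 'student': same item vocabulary, different widths.
teacher = {"item_emb": np.zeros((1000, 100)), "out": np.zeros((100, 1000))}
student = {"item_emb": np.zeros((1000, 20)), "out": np.zeros((20, 1000))}

# Parameter-size ratio of student to teacher (the 'Ratio' column in Table 3).
ratio = count_params(student) / count_params(teacher)

# Wall time for scoring all items for a batch of users.
start = time.perf_counter()
scores = np.random.rand(64, 20) @ student["out"]  # user states x item scores
elapsed = time.perf_counter() - start
```

With these toy shapes the student keeps 20% of the teacher's parameters, and the matrix product scoring all 1000 items is correspondingly cheaper.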
- The authors apply the proposed ranking distillation to two sequential recommendation models that have been shown to have strong performance:
- The results of each method are summarized in Table 2.
- The authors included three non-sequential recommendation baselines: popularity-based item recommendation (POP), item-based collaborative filtering (ItemCF), and Bayesian personalized ranking (BPR)
- The performance of these non-sequential baselines is worse than that of the sequential recommenders, i.e., Fossil and Caser
- The authors found that using a pair-wise distillation loss, which places heavy focus on the partial order within the teacher's ranking, produces both upward and downward gradients, making training unstable and sometimes even failing to converge
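The instability above can be illustrated numerically. A hedged sketch with toy scores (not from the paper), contrasting a point-wise distillation loss, which pushes every teacher-top-K score up, with a pair-wise loss on the teacher's order, which can also push some of those scores down:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([1.0, 0.5, 2.0])  # student scores for teacher's top-3 docs

# Point-wise distillation: every top-K doc is pushed UP, since the gradient
# of -log sigmoid(s) w.r.t. s is sigmoid(s) - 1 < 0, so gradient descent
# always increases each score.
grad_point = sigmoid(scores) - 1.0

# Pair-wise distillation on the teacher's order (doc i should beat doc i+1):
# loss = -log sigmoid(s_i - s_{i+1}). Here the student already ranks doc 3
# above doc 2, so the pair (2, 3) pushes doc 3 DOWN even though the teacher
# ranks it in the top-K.
g = sigmoid(scores[:-1] - scores[1:]) - 1.0  # d loss / d s_i per pair
grad_pair = np.zeros(3)
grad_pair[:-1] += g   # upward pressure on the higher-ranked doc of each pair
grad_pair[1:] -= g    # downward pressure on the lower-ranked doc

mixed_signs = (grad_pair > 0).any() and (grad_pair < 0).any()
```

The point-wise gradients all share one sign (every top-K score rises), while the pair-wise gradients carry both signs, which is the mix of upward and downward pressure the bullet above describes.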
- Table1: Statistics of the data sets
- Table2: Performance comparison. (1) The performance of the models with ranking distillation, Fossil-RD and Caser-RD, always has statistically significant improvements over the student-only models, Fossil-S and Caser-S. (2) The performance of the models with ranking distillation, Fossil-RD and Caser-RD, has no significant degradation from that of the teacher models, Fossil-T and Caser-T. We use the one-tail t-test with significance level at 0.05
- Table3: Model compactness and online inference efficiency. Time (seconds) indicates the wall time used for generating a recommendation list for every user. Ratio is the student model’s parameter size relative to the teacher model’s parameter size
- Table4: Performance of Caser-RD with different choices of weighting scheme on two data sets
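One weighting scheme consistent with the paper's discussion of Table 4 is position importance, where documents higher in the teacher's top-K receive larger weights in the distillation loss. A minimal sketch, assuming an exponential decay with hyperparameter `lam` (the names and exact form are illustrative):

```python
import numpy as np

def position_importance(K, lam=1.0):
    """Weight w_r for the document at teacher rank r (0-based), decaying
    exponentially so higher-ranked documents dominate the distillation loss."""
    ranks = np.arange(K)
    w = np.exp(-ranks / lam)
    return w / w.sum()  # normalize the weights to sum to 1

w = position_importance(K=5, lam=2.0)
```

A uniform scheme (`w_r = 1/K`) is the natural baseline to compare against: the decaying scheme concentrates the student's attention on the documents the teacher is most confident about.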
- In this section, we compare our work with several related research areas.

Knowledge Distillation. Knowledge distillation has been used in image recognition [3, 15, 31] and neural machine translation as a way to generate compact models. As pointed out in the Introduction, it is not straightforward to apply KD to ranking models, and new issues must be addressed. In the context of ranking problems, the most relevant work uses knowledge distillation for image retrieval, applying a sampling technique to rank a sample of the images from all data each time. In general, training on a sample works if the sample shares similar patterns with the rest of the data through some content information, such as image contents. But this technique is not applicable to training a recommender model when items and users are represented by IDs with no content information, as in collaborative filtering: in that case, training on a sample cannot easily generalize to all users and items.

Semi-Supervised Learning. Another related research area is semi-supervised learning [4, 43]. Unlike the teacher-student learning paradigm in knowledge distillation and in our work, semi-supervised learning usually trains a single model and utilizes weakly labeled or unlabeled data alongside the labeled data to gain better performance. Several works in information retrieval followed this direction, using weakly labeled or unlabeled data to construct test collections, or to provide extra features and labels for ranking model training. The basic ideas of ranking distillation and semi-supervised learning are similar in that both utilize unlabeled data, though with different purposes.

Transfer Learning for Recommender Systems. Transfer learning has been widely used in the field of recommender systems [5, 12]. These methods mainly focus on how to transfer knowledge (e.g., user rating patterns) from a source domain (e.g., movies) to a target domain (e.g., music) to improve recommendation performance. If we consider the student as a target model and the teacher as a source model, our teacher-student learning can be seen as a special form of transfer learning. However, unlike transfer learning, our teacher-student learning does not require two domains, because the teacher and student models are learned from the same domain. Having a compact student model to enhance online inference efficiency is another purpose of our teacher-student learning.
- The work of the second author is partially supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada
- Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormándi, George E. Dahl, and Geoffrey E. Hinton. 2018. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235 (2018).
- Nima Asadi, Donald Metzler, Tamer Elsayed, and Jimmy Lin. 2011. Pseudo test collections for learning web search ranking functions. In International Conference on Research and development in Information Retrieval. ACM, 1073–1082.
- Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep?. In Advances in neural information processing systems. 2654–2662.
- Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. 2009. Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks 20, 3 (2009), 542–542.
- Xu Chen, Zheng Qin, Yongfeng Zhang, and Tao Xu. 2016. Learning to rank features for recommendation over multiple categories. In International ACM SIGIR conference on Research and Development in Information Retrieval. 305–314.
- Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. 2017. DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer. arXiv preprint arXiv:1707.01220 (2017).
- Eunjoon Cho, Seth A Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In International Conference on Knowledge Discovery and Data Mining. ACM, 1082–1090.
- Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. 2015. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics. 192–204.
- Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In ACM Conference on Recommender systems. 191–198.
- Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W Bruce Croft. 2017. Neural Ranking Models with Weak Supervision. arXiv preprint arXiv:1704.08803 (2017).
- Fernando Diaz. 2016. Learning to Rank with Labeled Features. In International Conference on the Theory of Information Retrieval. ACM, 41–44.
- Ignacio Fernández-Tobías, Iván Cantador, Marius Kaminskas, and Francesco Ricci. 2012. Cross-domain recommender systems: A survey of the state of the art. In Spanish Conference on Information Retrieval. sn, 24.
- Ruining He and Julian McAuley. 2016. Fusing Similarity Models with Markov Chains for Sparse Sequential Recommendation. In International Conference on Data Mining. IEEE.
- Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In International Conference on World Wide Web. ACM, 173–182.
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
- Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative metric learning. In International Conference on World Wide Web. 193–201.
- Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In IEEE conference on Computer Vision and Pattern Recognition. 1725–1732.
- Kenji Kawaguchi. 2016. Deep learning without poor local minima. In Advances in Neural Information Processing Systems. 586–594.
- Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Conference on Empirical Methods in Natural Language Processing. ACL, 1746–1751.
- Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947 (2016).
- Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
- Hui Li, Tsz Nam Chan, Man Lung Yiu, and Nikos Mamoulis. 2017. FEXIPRO: Fast and Exact Inner Product Retrieval in Recommender Systems. In International Conference on Management of Data. ACM, 835–850.
- David C Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. 2017. Related pins at pinterest: The evolution of a real-world recommender system. In International Conference on World Wide Web. 583–592.
- Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model.. In Interspeech.
- Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. 2013. Learning with noisy labels. In Advances in neural information processing systems. 1196–1204.
- Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching As Image Recognition. In AAAI Conference on Artificial Intelligence. AAAI Press, 2793–2799.
- Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. 2017. DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval. arXiv preprint arXiv:1710.05649 (2017).
- Steffen Rendle and Christoph Freudenthaler. 2014. Improving pairwise learning for item recommendation from implicit feedback. In International Conference on Web Search and Data Mining. ACM, 273–282.
- Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
- Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In International Conference on World Wide Web. ACM, 285–295.
- Jiaxi Tang and Ke Wang. 2018. Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding. In ACM International Conference on Web Search and Data Mining.
- Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk. 2015. Lemp: Fast retrieval of large entries in a matrix product. In International Conference on Management of Data. ACM, 107–122.
- Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
- Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning 81, 1 (2010), 21–35.
- Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In International ACM SIGIR conference on Research and Development in Information Retrieval. 55–64.
- Quan Yuan, Gao Cong, and Aixin Sun. 2014. Graph-based point-of-interest recommendation with geographical and temporal influences. In International Conference on Information and Knowledge Management. ACM, 659–668.
- Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. 2016. Discrete collaborative filtering. In International Conference on Research and Development in Information Retrieval. ACM, 325–334.
- Yan Zhang, Defu Lian, and Guowu Yang. 2017. Discrete Personalized Ranking for Fast Collaborative Filtering from Implicit Feedback.. In AAAI Conference on Artificial Intelligence. AAAI Press, 1669–1675.
- Zhiwei Zhang, Qifan Wang, Lingyun Ruan, and Luo Si. 2014. Preference preserving hashing for efficient recommendation. In International Conference on Research and Development in Information Retrieval. ACM, 183–192.
- Ke Zhou and Hongyuan Zha. 2012. Learning binary codes for collaborative filtering. In International Conference on Knowledge Discovery and Data Mining. ACM, 498–506.
- Xiaojin Zhu. 2005. Semi-supervised learning literature survey. (2005).