Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation

SIGIR '20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, July 2020, pp. 1469–1478.

DOI: https://doi.org/10.1145/3397271.3401156

Abstract:

Inductive transfer learning has had a big impact on computer vision and NLP but has not been used in the area of recommender systems. Even though there has been a large body of research on generating recommendations based on modeling user-item interaction sequences, few of these attempt to represent and transfer such models for serving downstream tasks …

Introduction
  • The last 10 years have seen the ever-increasing use of social media platforms and e-commerce systems, such as TikTok, Amazon, or Netflix.
  • Most past work has focused on recommending items on the same platform from which the data came.
  • Few of these methods exploit the data to learn a universal user representation that could be used for a different downstream task, such as the cold-start user problem on a different recommendation platform or the prediction of a user profile
Highlights
  • The last 10 years have seen the ever-increasing use of social media platforms and e-commerce systems, such as TikTok, Amazon, or Netflix
  • We propose a universal user representational learning architecture, a method that can be used to achieve NLP or computer vision (CV)-like transfer learning for various downstream tasks
  • To evaluate the performance of PeterRec in the downstream tasks, we randomly split the target dataset into training (70%), validation (3%) and testing (27%) sets
  • PeterRec outperforms MTL in all tasks, which implies that the proposed two-stage pre-training & fine-tuning paradigm is more powerful than the joint training in MTL (a minimal sketch of the two-stage setup follows this list). We argue this is because the optimal parameters learned for the two objectives in MTL do not guarantee optimal performance for fine-tuning
  • We have shown that (1) it is possible to learn universal user representations by modeling only unsupervised user sequential behaviors; and (2) it is possible to adapt the learned representations for a variety of downstream tasks
  • The core idea of learning-to-learn is that the parameters of a deep neural network can be predicted from those of another [4, 26]; [6] demonstrated that it is possible to predict more than 95% of the parameters in a layer of a network given the remaining 5%
  • By releasing both high-quality datasets and code, we hope PeterRec serves as a benchmark for transfer learning in the recommender system domain
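The two-stage paradigm contrasted with MTL above amounts to freezing the pre-trained backbone and updating only the small inserted patches plus a new task head during fine-tuning. Below is a minimal PyTorch-style sketch of that setup, assuming a pre-trained `backbone`, small `patches` modules, and a `task_head`; these names are illustrative, not the authors' released code.

```python
import torch

def build_finetune_optimizer(backbone, patches, task_head, lr=1e-3):
    """Stage 2 of the pre-training & fine-tuning paradigm (hypothetical helper):
    keep the pre-trained backbone frozen and optimize only the small patch
    modules and the new task-specific output layer."""
    for p in backbone.parameters():
        p.requires_grad = False          # pre-trained weights are shared, not updated
    trainable = list(patches.parameters()) + list(task_head.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```

Because only a tiny fraction of parameters is updated, one pre-trained backbone can serve many downstream tasks without storing a full copy of the network per task.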
Methods
  • Having developed the model patch (MP) architecture, the question is how to inject it into the existing dilated convolutional (DC) layers; one serial-insertion variant is sketched after this list.
  • [Figure residue: insertion variants of the model patch into a DC block built from Layer-Norm, 1×3 DC layers, and ReLU units; panel (a) shows the original block, and panels (b) and (c) show serial insertions of the MP.]
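As a concrete illustration of the serial insertion referenced above, the following PyTorch-style snippet places a small bottleneck patch after each dilated convolution inside a residual block. The module names, reduction factor, and exact insertion points are assumptions made for illustration and may differ from the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelPatch(nn.Module):
    """Small bottleneck adapter ('MP'); the residual form lets it start near identity."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.down = nn.Linear(channels, channels // reduction)
        self.up = nn.Linear(channels // reduction, channels)

    def forward(self, x):                              # x: (batch, seq_len, channels)
        return x + self.up(F.relu(self.down(x)))

class PatchedDCBlock(nn.Module):
    """Residual block of two 1x3 dilated convolutions with serially inserted patches."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.ln1 = nn.LayerNorm(channels)
        self.dc1 = nn.Conv1d(channels, channels, kernel_size=3,
                             dilation=dilation, padding=dilation)
        self.mp1 = ModelPatch(channels)
        self.ln2 = nn.LayerNorm(channels)
        self.dc2 = nn.Conv1d(channels, channels, kernel_size=3,
                             dilation=2 * dilation, padding=2 * dilation)
        self.mp2 = ModelPatch(channels)

    def forward(self, x):                              # x: (batch, seq_len, channels)
        h = F.relu(self.dc1(self.ln1(x).transpose(1, 2)).transpose(1, 2))
        h = self.mp1(h)
        h = F.relu(self.dc2(self.ln2(h).transpose(1, 2)).transpose(1, 2))
        h = self.mp2(h)
        return x + h                                   # residual connection over the block
```

During fine-tuning only the ModelPatch layers (and the new output head) would be trainable, which is why the number of tuned parameters stays a small fraction of the full network, as reported in Table 3.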
Results
  • To evaluate the performance of PeterRec in the downstream tasks, the authors randomly split the target dataset into training (70%), validation (3%) and testing (27%) sets.
  • Note that, to speed up the item recommendation experiments, the authors follow the common strategy in [13] by randomly sampling negative examples for each true example and evaluating top-5 accuracy among the sampled candidate items (a sketch of this protocol follows this list).
  • To answer RQ1, the authors compare PeterRec in two cases: well-pre-trained and non-pre-trained settings.
  • The authors refer to PeterRec with randomly initialized weights as PeterZero
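For clarity, here is a minimal PyTorch-style sketch of the sampled top-5 evaluation described above; the function name and argument layout are hypothetical, and the number of sampled negatives is left to the caller because the summary does not state it.

```python
import torch

def sampled_top5_hit(item_scores, true_item, sampled_negatives):
    """Rank the true item against randomly sampled negatives; count a hit if it
    lands in the top 5 of that candidate set (illustrative helper, not the
    authors' evaluation code)."""
    candidates = torch.cat([true_item.view(1), sampled_negatives])  # 1 positive + N negatives
    top5 = torch.topk(item_scores[candidates], k=5).indices
    return bool((candidates[top5] == true_item).any())
```

Averaging the hits over all test users gives the reported top-5 accuracy (HR@5); MRR@5 can be computed analogously from the rank of the true item within the candidate set.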
Conclusion
  • The authors have shown that (1) it is possible to learn universal user representations by modeling only unsupervised user sequential behaviors; and (2) it is possible to adapt the learned representations for a variety of downstream tasks.
  • The authors have evaluated several alternative designs of PeterRec and made insightful observations through extensive ablation studies
  • By releasing both high-quality datasets and code, the authors hope PeterRec serves as a benchmark for transfer learning in the recommender system domain.
  • Given the video-watching behaviors of a teenager, PeterRec may help predict whether he or she shows signs of depression or a propensity for violence, without resorting to much feature engineering or human-labeled data.
  • The authors may explore PeterRec on more tasks.
Summary
  • Objectives:

    The authors aim to demonstrate how to modify the pre-trained network to obtain better accuracy on related but very different tasks by training only a few parameters.
Tables
  • Table 1: Number of instances. Each instance in S and T represents a (u, x^u) pair and a (u, y) pair, respectively. The number of source items |X| = 191K, 645K, 645K, 645K, 645K (K = 1000), and the number of target labels |Y| = 20K, 17K, 2, 8, 6 for the five datasets from left to right in the table below. M = 1000K
  • Table 2: Impacts of pre-training: FineZero vs. FineAll (with the causal CNN architectures). Unless otherwise mentioned, in the following we report only ColdRec-1 with HR@5 and ColdRec-2 with MRR@5 for demonstration
  • Table 3: Performance comparison (with the non-causal CNN architectures). The fine-tuned parameters of PeterRec account for 9.4%, 2.7%, 0.16%, 0.16%, 0.16% of those of FineAll on the five datasets from left to right
  • Table 4: Results regarding user profile prediction
  • Table 5: Top-5 accuracy in the cold user scenario
  • Table 6: PeterRecal vs. PeterRecon. The results of the first and last two columns are on the ColdRec-1 and AgeEst datasets, respectively
  • Table 7: Performance of different insertions in Figure 3 on AgeEst
Related work
  • PeterRec tackles two research questions: (1) training an effective and efficient base model, and (2) transferring the learned user representations from the base model to downstream tasks with a high degree of parameter sharing. Since we choose sequential recommendation models to perform the upstream task, we briefly review the related literature. We then recapitulate work on transfer learning and user representation adaptation.

    2.1 Sequential Recommendation Models

    A sequential recommendation (SR) model takes in a sequence (session) of user-item interactions and, taking each item of the sequence as input in turn, aims to predict the next item(s) that the user will like. SR models have demonstrated clear accuracy gains over traditional content- or context-based recommendation when modeling users' sequential actions [18]. Another merit of SR is that sequential models do not necessarily require user profile information, since user representations can be implicitly reflected by past sequential behaviors. Among these models, researchers have paid special attention to three lines of work: RNN-based [14], CNN-based [34, 39, 40], and pure attention-based [18] sequential models. In general, typical RNN models strictly rely on sequential dependencies during training and thus cannot take full advantage of modern computing hardware, such as GPUs or TPUs [40]. CNN- and attention-based recommendation models do not have this problem, since the entire sequence can be observed during training and computation can be fully parallelized. One well-known obstacle that prevents CNNs from being strong sequential models is the limited receptive field caused by their small kernel sizes (e.g., 3 × 3). This issue has been cleverly addressed by introducing dilated convolutions, which enable an exponentially increasing receptive field with an unchanged kernel size [39, 40]. By contrast, self-attention-based sequential models, such as SASRec [18], may have time-complexity and memory issues, since these grow quadratically with the sequence length. We therefore choose a dilated convolution-based sequential neural network to build the pre-trained model, investigating both causal (i.e., NextItNet [40]) and non-causal (i.e., the bidirectional encoder of GRec [39]) convolutions in this paper.
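To make the receptive-field argument concrete, the snippet below sketches a NextItNet-style causal dilated 1D convolution in PyTorch; the class name and channel handling are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    """Left-padded (causal) dilated convolution: each position only sees past items.
    Stacking layers with dilations 1, 2, 4, 8 and kernel size 3 covers
    1 + (3 - 1) * (1 + 2 + 4 + 8) = 31 past positions, i.e. the receptive field
    grows exponentially with depth while the kernel size stays fixed."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, seq_len)
        return self.conv(F.pad(x, (self.left_pad, 0)))  # pad only the past side
```

A non-causal variant (as in the bidirectional encoder of GRec) would instead pad symmetrically, so that each position can use both past and future items during pre-training.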
Funding
  • This work is partly supported by the National Natural Science Foundation of China (61972372, U19A2079)
References
  • [1] John Anderson, Qingqing Huang, Walid Krichene, Steffen Rendle, and Li Zhang. 2020. Superbloom: Bloom filter meets Transformer. arXiv preprint arXiv:2002.04723 (2020).
  • [2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
  • [3] Yoshua Bengio and Samy Bengio. 2000. Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems. 400–406.
  • [4] Luca Bertinetto, João F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. 2016. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems. 523–531.
  • [5] Chong Chen, Min Zhang, Chenyang Wang, Weizhi Ma, Minming Li, Yiqun Liu, and Shaoping Ma. 2019. An Efficient Adaptive Transfer Neural Network for Social-aware Recommendation. (2019).
  • [6] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando De Freitas. 2013. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems. 2148–2156.
  • [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • [8] Guibing Guo, Shichang Ouyang, Xiaodong He, Fajie Yuan, and Xiaohua Liu. 2019. Dynamic item block and prediction enhancing block for sequential recommendation. International Joint Conferences on Artificial Intelligence Organization.
  • [9] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
  • [11] Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 355–364.
  • [12] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (2020).
  • [13] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
  • [14] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
  • [15] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. arXiv preprint arXiv:1902.00751 (2019).
  • [16] Guangneng Hu, Yu Zhang, and Qiang Yang. 2018. CoNet: Collaborative cross networks for cross-domain recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 667–676.
  • [17] Guangneng Hu, Yu Zhang, and Qiang Yang. 2018. MTNet: a neural approach for cross-domain recommendation with unstructured text.
  • [18] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
  • [19] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • [20] Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan, and Tat-Seng Chua. 2020. Estimation-Action-Reflection: Towards deep interaction between conversational and recommender systems. In Proceedings of the 13th International Conference on Web Search and Data Mining. 304–312.
  • [21] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. 2018. K For The Price Of 1: Parameter Efficient Multi-task And Transfer Learning. arXiv preprint arXiv:1810.10703 (2018).
  • [22] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML. 807–814.
  • [23] Yabo Ni, Dan Ou, Shichen Liu, Xiang Li, Wenwu Ou, Anxiang Zeng, and Luo Si. 2018. Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 596–605.
  • [24] Shilin Qu, Fajie Yuan, Guibing Guo, Liguang Zhang, and Wei Wei. 2020. CmnRec: Sequential Recommendations with Chunk-accelerated Memory Network. arXiv preprint arXiv:2004.13401 (2020).
  • [25] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. [n.d.]. Improving language understanding by generative pre-training.
  • [26] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems. 506–516.
  • [27] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2018. Efficient parametrization of multi-domain deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8119–8127.
  • [28] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
  • [29] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. 2020. Neural Collaborative Filtering vs. Matrix Factorization Revisited. arXiv preprint arXiv:2005.09683 (2020).
  • [30] Amir Rosenfeld and John K Tsotsos. 2018. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
  • [31] Asa Cooper Stickland and Iain Murray. 2019. BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. arXiv preprint arXiv:1902.02671 (2019).
  • [32] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019).
  • [33] Yang Sun, Fajie Yuan, Ming Yang, Guoao Wei, Zhou Zhao, and Duo Liu. 2020. A Generic Network Compression Framework for Sequential Recommender Systems. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (2020).
  • [34] Jiaxi Tang and Ke Wang. 2018. Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding. In ACM International Conference on Web Search and Data Mining.
  • [35] Jingyi Wang, Qiang Liu, Zhaocheng Liu, and Shu Wu. 2019. Towards Accurate and Interpretable Sequential Prediction: A CNN & Attention-Based Feature Extractor. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, 1703–1712.
  • [36] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1492–1500.
  • [37] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems. 3320–3328.
  • [38] Fajie Yuan, Guibing Guo, Joemon M Jose, Long Chen, Haitao Yu, and Weinan Zhang. 2016. LambdaFM: learning optimal ranking with factorization machines using lambda surrogates. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 227–236.
  • [39] Fajie Yuan, Xiangnan He, Haochuan Jiang, Guibing Guo, Jian Xiong, Zhezhao Xu, and Yilin Xiong. 2020. Future Data Helps Training: Modeling Future Contexts for Session-based Recommendation. In Proceedings of The Web Conference 2020. 303–313.
  • [40] Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M Jose, and Xiangnan He. 2019. A Simple Convolutional Generative Network for Next Item Recommendation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 582–590.
  • [41] Fajie Yuan, Xin Xin, Xiangnan He, Guibing Guo, Weinan Zhang, Tat-Seng Chua, and Joemon M Jose. 2018. fBGD: Learning embeddings from positive unlabeled data with BGD. (2018).
  • [42] Feng Yuan, Lina Yao, and Boualem Benatallah. 2019. DARec: Deep Domain Adaptation for Cross-Domain Recommendation via Transferring Rating Patterns. arXiv preprint arXiv:1905.10760 (2019).
  • [43] Kui Zhao, Yuechuan Li, Zhaoqian Shuai, and Cheng Yang. 2018. Learning and Transferring IDs Representation in E-commerce. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1031–1039.