# On Sampled Metrics for Item Recommendation

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, July 2020, pp. 1748-1757.

Abstract:

The task of item recommendation requires ranking a large catalogue of items given a context. Item recommendation algorithms are evaluated using ranking metrics that depend on the positions of relevant items. To speed up the computation of metrics, recent work often uses sampled metrics where only a smaller set of random items and the rele...More

Introduction

- Item recommendation from implicit feedback has received a lot of attention from the recommender system research community.
- The catalogue of items to retrieve from is large: tens of thousands in academic studies and often many millions in industrial applications.
- Usually sharp metrics such as precision or recall over the few highest scoring items are chosen.
- Another popular class is smooth metrics such as average precision or normalized discounted cumulative gain (NDCG), which place a strong emphasis on the top-ranked items.
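The metrics named above take a simple form when each evaluation instance has exactly one relevant item, which is the case the analysis focuses on. The sketch below assumes that setting, with `r` the 1-based rank of the relevant item among `n` items; the function names are illustrative and not taken from the paper.

```python
import math

def recall_at_k(r: int, k: int) -> float:
    """Sharp metric: 1 if the single relevant item is in the top k, else 0."""
    return 1.0 if r <= k else 0.0

def ndcg_at_k(r: int, k: int) -> float:
    """Smooth metric with a logarithmic discount, cut off at position k."""
    return 1.0 / math.log2(r + 1) if r <= k else 0.0

def average_precision(r: int) -> float:
    """With a single relevant item, average precision reduces to 1 / rank."""
    return 1.0 / r

def auc(r: int, n: int) -> float:
    """Fraction of the n - 1 irrelevant items ranked below the relevant one."""
    return (n - r) / (n - 1)
```

Average precision and NDCG decay quickly with `r`, which is what "strong emphasis on the top-ranked items" means in practice, while AUC is linear in the rank.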

Highlights

- Over recent years, item recommendation from implicit feedback has received a lot of attention from the recommender system research community
- Our results have shown that a sampled metric can be a poor indicator of the true performance of recommender algorithms under this metric
- Metrics are usually motivated by applications, e.g., does the top 10 list contain a relevant item? Sampled metrics do not measure the intended quantities – not even in expectation
- If an experimental study needs to sample, we propose correction methods that give a better estimate of the true metric, at the cost of increased variance
- Our analysis focused on the case of a single relevant item
- Deriving correction methods without independence is an interesting direction for future research

Methods

- The authors study sampled metrics on real recommender algorithms and a real dataset.
- The authors study the behavior of sampled metrics on three popular recommender system algorithms: matrix factorization and two variations of item-based collaborative filtering.
- To de-emphasize the particular recommender method and hyperparameter choice, the authors will refer to matrix factorization as ‘recommender X’, to the two item-based collaborative filtering variations as ‘recommender Y’ and ‘recommender Z’

Conclusion

- This work seeks to bring attention to some issues with sampling of evaluation metrics.
- Sampled metrics do not measure the intended quantities – not even in expectation.
- Metrics are usually motivated by applications, e.g., does the top 10 list contain a relevant item?
- For this reason, sampling should be avoided as much as possible during evaluation.
- If an experimental study needs to sample, the authors propose correction methods that give a better estimate of the true metric, at the cost of increased variance.
- Deriving correction methods without independence is an interesting direction for future research

Summary

## Introduction:

- Item recommendation from implicit feedback has received a lot of attention from the recommender system research community.
- The catalogue of items to retrieve from is large: tens of thousands in academic studies and often many millions in industrial applications.
- Usually sharp metrics such as precision or recall over the few highest scoring items are chosen.
- Another popular class are smooth metrics such as average precision or normalized discounted cumulative gain (NDCG) which place a strong emphasis on the top ranked items
## Objectives:

The authors want to emphasize that the purpose of the study is not to judge if a particular recommender algorithm is good.
## Methods:

- The authors study sampled metrics on real recommender algorithms and a real dataset.
- The authors study the behavior of sampled metrics on three popular recommender system algorithms: matrix factorization and two variations of item-based collaborative filtering.
- To de-emphasize the particular recommender method and hyperparameter choice, the authors will refer to matrix factorization as ‘recommender X’, to the two item-based collaborative filtering variations as ‘recommender Y’ and ‘recommender Z’
## Conclusion:

- This work seeks to bring attention to some issues with sampling of evaluation metrics.
- Sampled metrics do not measure the intended quantities – not even in expectation.
- Metrics are usually motivated by applications, e.g., does the top 10 list contain a relevant item?
- For this reason, sampling should be avoided as much as possible during evaluation.
- If an experimental study needs to sample, the authors propose correction methods that give a better estimate of the true metric, at the cost of increased variance.
- Deriving correction methods without independence is an interesting direction for future research

- Table 1: Toy example of evaluating three recommenders A, B and C on five instances.
- Table 2: Sampled evaluation for the recommenders from Table 1. On sampled metrics, the relative ordering of A, B, C is not preserved, except for AUC.
- Table 3: Evaluation of three recommenders (X, Y and Z) on the Movielens dataset. Sampled metrics are inconsistent with the exact metrics. Corrected metrics, especially Bias² + γ · Variance with γ ≤ 0.1, produce the correct relative ordering in expectation.
- Table 4: For the 100 repetitions of the experiment in Table 3, how many times the metric for a pair of recommenders shows the correct ordering. For example, for Recall and "X vs Y": how often the sampled metric of X was smaller than the sampled metric of Y. In any of the comparisons, a value of 100 indicates the evaluation was always correct, and 0 indicates it was always wrong. The exact metric would always score 100.

Study subjects and analysis

samples for datasets: 100

Recently, it has become common to sample a small set of irrelevant items, add the relevant items, and compute the metrics only on the ranking generated by this subset [7, 9, 10, 12, 15, 16, 17]. It is common to pick the number of sampled irrelevant items, m, on the order of a hundred while the number of items, n, is much larger, e.g., m = 100 samples for datasets with n ∈ {4k, 10k, 17k, 140k, 2M} items [7, 9, 15], m = 50 samples for n ∈ {2k, 18k, 14k} items [10], or m = 200 samples for n ∈ {17k, 450k} items [17]. This section will highlight that this approach is problematic.
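The sampling procedure described here can be sketched in a few lines. This is a minimal illustration, not the protocol of any cited study: it assumes one relevant item per instance, uniform sampling of irrelevant items with replacement, no score ties, and a hypothetical score vector; `exact_rank` and `sampled_rank` are illustrative names.

```python
import numpy as np

def exact_rank(scores, relevant):
    """1-based rank of the relevant item among all items (higher score is better)."""
    return 1 + int(np.sum(scores > scores[relevant]))

def sampled_rank(scores, relevant, m, rng):
    """1-based rank of the relevant item among itself plus m sampled irrelevant items."""
    n = scores.shape[0]
    irrelevant = np.delete(np.arange(n), relevant)
    # Assumption: uniform sampling with replacement over the irrelevant items.
    sample = rng.choice(irrelevant, size=m, replace=True)
    return 1 + int(np.sum(scores[sample] > scores[relevant]))

rng = np.random.default_rng(0)
scores = rng.normal(size=10_000)  # hypothetical scores for 10,000 items
relevant = 42                     # hypothetical index of the relevant item
r = exact_rank(scores, relevant)
r_tilde = sampled_rank(scores, relevant, m=100, rng=rng)
```

The sampled rank can never exceed m + 1, so any metric computed on it sees a compressed version of the true ranking.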

samples: 5000

Similar observations can be made for NDCG. Recall is even more sensitive to the sample size, and it takes about m = 5,000 samples out of n = 10,000 items for the metric to become consistent. Only AUC is consistent for all m, and the expected metric is independent of sample size.
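The special status of AUC can be checked numerically. The sketch below assumes a single relevant item and irrelevant items drawn uniformly with replacement, so that the number of sampled items outranking the relevant one is binomial; the helper names and the example values of `n`, `m` and `r` are illustrative.

```python
from math import comb

def sampled_rank_pmf(r, n, m):
    """Probabilities of sampled rank 1..m+1, given exact rank r among n items."""
    # Chance that one uniformly sampled irrelevant item outranks the relevant item.
    p = (r - 1) / (n - 1)
    return [comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(m + 1)]

def expected_sampled(metric, r, n, m):
    """Expectation of a metric evaluated on the sampled rank."""
    pmf = sampled_rank_pmf(r, n, m)
    return sum(prob * metric(j + 1) for j, prob in enumerate(pmf))

n, m, r = 10_000, 100, 500
# Sampled AUC is the exact AUC formula applied to a list of m + 1 items.
exp_auc = expected_sampled(lambda i: (m + 1 - i) / m, r, n, m)
# Sampled average precision for a single relevant item is 1 / sampled rank.
exp_ap = expected_sampled(lambda i: 1.0 / i, r, n, m)
```

Under these assumptions `exp_auc` matches the exact AUC `(n - r) / (n - 1)` for any `m`, because AUC is linear in the rank. `exp_ap` is far larger than the exact value `1 / r`, since the sampled rank is at most `m + 1`.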

test users: 6040

6.1 Rank Distributions. For each of the 6040 test users, we rank all items (leaving out the user’s training items) and record at which position the withheld relevant item appears. In total we get 6040 ranks
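A minimal sketch of this rank computation, assuming a dense user-by-item score matrix and a boolean mask of training interactions (both hypothetical data layouts, not the authors' code):

```python
import numpy as np

def withheld_item_ranks(scores, train_mask, test_item):
    """Rank of each user's withheld item among all non-training items.

    scores:     float array of shape (num_users, num_items)
    train_mask: bool array of the same shape, True for the user's training items
    test_item:  int array of shape (num_users,), index of the withheld item
    """
    # Training items are left out by giving them the lowest possible score.
    masked = np.where(train_mask, -np.inf, scores)
    users = np.arange(scores.shape[0])
    target = masked[users, test_item][:, None]
    # 1-based rank: one plus the number of items scored strictly higher.
    return 1 + np.sum(masked > target, axis=1)

scores = np.array([[0.9, 0.8, 0.7, 0.1],
                   [0.2, 0.5, 0.3, 0.4]])
train_mask = np.array([[True, False, False, False],
                       [False, False, False, False]])
ranks = withheld_item_ranks(scores, train_mask, np.array([2, 0]))
```

The function returns one rank per user, so a histogram over its output gives the kind of rank distribution discussed here. It assumes the withheld item is never marked as a training item.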

users: 1600

The plot indicates the different characteristics of the three recommenders. Z is the best in the top 10 but has very poor performance at higher ranks, as it puts the relevant items of over 1600 users in the worst bucket. X is more balanced and puts only a few items at poor ranks; 2310 items are in the top 100 and fewer than 300 are in the bottom half.

samples: 1000

Figure 6 shows the expected Recall@10 for different choices of the sampling size. As we can see, the uncorrected metric performs poorly and needs more than m = 1000 samples (equivalent to a sampling rate of about 1/3) to correctly order recommenders X and Y. The corrected metric using a bias-variance trade-off with γ = 0.1 already has the correct ordering with fewer than m = 60 samples.

While the corrections seem to be effective in expectation, one also needs to consider the variance of these measurements.
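One way a bias-variance correction of this kind could be set up is as a small least-squares problem: choose one corrected value per sampled rank so that squared bias plus γ times variance, averaged over the exact ranks, is minimal. The sketch below is an illustration under stated assumptions, not the authors' implementation: a single relevant item, binomial sampled ranks (uniform sampling with replacement), a uniform prior over exact ranks, and a closed-form solution derived from that objective. All names and the values of `n`, `m`, `gamma` and `k` are hypothetical.

```python
import math
import numpy as np

def sampled_rank_probs(n, m):
    """P[r-1, i-1] = probability that exact rank r yields sampled rank i."""
    p = (np.arange(1, n + 1) - 1) / (n - 1)
    j = np.arange(m + 1)
    coef = np.array([math.comb(m, int(t)) for t in j], dtype=float)
    return coef * p[:, None] ** j * (1.0 - p[:, None]) ** (m - j)

def bias_variance_objective(x, target, P, prior, gamma):
    """Prior-weighted squared bias plus gamma times variance of the estimate."""
    mean = P @ x
    var = P @ x**2 - mean**2
    return float(np.sum(prior * ((mean - target) ** 2 + gamma * var)))

def corrected_metric(target, P, prior, gamma):
    """Values per sampled rank that minimize the objective above (gamma > 0)."""
    A = np.sqrt(prior)[:, None] * P
    b = np.sqrt(prior) * target
    c = prior @ P  # marginal probability of each sampled rank
    lhs = (1.0 - gamma) * (A.T @ A) + gamma * np.diag(c)
    return np.linalg.solve(lhs, A.T @ b)

n, m, gamma, k = 1000, 50, 0.1, 10
P = sampled_rank_probs(n, m)
prior = np.full(n, 1.0 / n)  # assumption: uniform prior over exact ranks
exact_recall = (np.arange(1, n + 1) <= k).astype(float)
naive = (np.arange(1, m + 2) <= k).astype(float)  # uncorrected Recall@k on sampled ranks
corrected = corrected_metric(exact_recall, P, prior, gamma)
```

Here `corrected[i - 1]` is the value assigned to sampled rank `i` in place of the naive Recall@k. A larger `gamma` trades more bias for lower variance, which matters because a single evaluation run only sees one random sample.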

samples: 1000

- Expected sampling metrics for the running example (Sections 3.2 and 4.2) while increasing the number of samples. For Average Precision, NDCG and Recall, even the relative order of recommender performance changes with the number of samples. That means conclusions drawn from a subsample are not consistent with the true performance of the recommender.
- Characteristics of a sampled metric with a varying number of samples. Sampled Average Precision, NDCG and Recall change their characteristics substantially compared to exact computation of the metric. Even large sampling sizes (m = 1000 samples of n = 10,000 items) show large bias. Note this plot zooms into the top 1000 ranks out of n = 10,000 items.
- Evaluating the corrected metric AP on a sample of m = 100 items (left) is equivalent to measuring the metric on the full item set of n = 10,000 (right). Different choices of correction algorithms are plotted.

samples: 1000

- Distribution of predicted ranks for three recommender algorithms on the Movielens 1M dataset.
- Evaluating recommenders with a varying sample size m. Plots show expected Recall@10 for the uncorrected metric and the metric corrected by Bias² + 0.1 · Variance. The uncorrected metric needs m = 1000 samples to order X and Y correctly in expectation, while the corrected metric requires only m = 60.

Reference

- Fabio Aiolli. 2013. Efficient Top-n Recommendation for Very Large Scale Binary Rated Datasets. In Proceedings of the 7th ACM Conference on Recommender Systems (Hong Kong, China) (RecSys ’13). Association for Computing Machinery, New York, NY, USA, 273–280. https://doi.org/10.1145/2507157.2507189
- R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. 1972. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. J. Wiley.
- Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A Generic Coordinate Descent Framework for Learning from Implicit Feedback. In Proceedings of the 26th International Conference on World Wide Web (Perth, Australia) (WWW ’17). 1341–1350. https://doi.org/10.1145/3038912.3052694
- Yoshua Bengio and Jean-Sébastien Senecal. 2003. Quick Training of Probabilistic Neural Nets by Importance Sampling. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, AISTATS 2003, Key West, Florida, USA, January 3-6, 2003.
- Yoshua Bengio and Jean-Sébastien Senecal. 2008. Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model. IEEE Trans. Neural Networks 19, 4 (2008), 713–722.
- Guy Blanc and Steffen Rendle. 2018. Adaptive Sampled Softmax with Kernel Based Sampling. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, Stockholmsmässan, Stockholm Sweden, 590–599.
- Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative Memory Network for Recommendation Systems. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (Ann Arbor, MI, USA) (SIGIR ’18). ACM, New York, NY, USA, 515–524. https://doi.org/10.1145/3209978.3209991
- F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (Dec. 2015), 19 pages. https://doi.org/10.1145/2827872
- Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (Perth, Australia) (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 173–182. https://doi.org/10.1145/3038912.3052569
- Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S. Yu. 2018. Leveraging Meta-path Based Context for Top-N Recommendation with A Neural Co-Attention Model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). ACM, New York, NY, USA, 1531–1540. https://doi.org/10.1145/3219819.3219965
- Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM ’08). 263–272.
- Walid Krichene, Nicolas Mayoraz, Steffen Rendle, Li Zhang, Xinyang Yi, Lichan Hong, Ed Chi, and John Anderson. 2019. Efficient Training on Very Large Corpora via Gramian Estimation. In International Conference on Learning Representations.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
- Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-Based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web (Hong Kong, Hong Kong) (WWW ’01). Association for Computing Machinery, New York, NY, USA, 285–295. https://doi.org/10.1145/371920.372071
- Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, and Tat-Seng Chua. 2019. Explainable Reasoning over Knowledge Graphs for Recommendation. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI) (AAAI ’19). 5329–5336.
- Longqi Yang, Eugene Bagdasaryan, Joshua Gruenstein, Cheng-Kang Hsieh, and Deborah Estrin. 2018. OpenRec: A Modular Framework for Extensible and Adaptable Recommendation Algorithms. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (Marina Del Rey, CA, USA) (WSDM ’18). ACM, New York, NY, USA, 664–672. https://doi.org/10.1145/3159652.3159681
- Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. 2018. Unbiased Offline Recommender Evaluation for Missing-not-at-random Implicit Feedback. In Proceedings of the 12th ACM Conference on Recommender Systems (Vancouver, British Columbia, Canada) (RecSys ’18). ACM, New York, NY, USA, 279–287. https://doi.org/10.1145/3240323.3240355
- Hsiang-Fu Yu, Mikhail Bilenko, and Chih-Jen Lin. 2017. Selection of Negative Samples for One-class Matrix Factorization. In Proceedings of the 2017 SIAM International Conference on Data Mining. 363–371.
