# Neural Collaborative Filtering vs. Matrix Factorization Revisited

RecSys 2020, pp. 240–248.

Abstract:

Embedding based models have been the state of the art in collaborative filtering for over a decade. Traditionally, the dot product or higher-order equivalents have been used to combine two or more embeddings, e.g., most notably in matrix factorization. In recent years, it was suggested to replace the dot product with a learned similarity…

## Introduction

- Embedding based models have been the state of the art in collaborative filtering for over a decade.
- These models combine a user embedding with an item embedding to obtain a single score that indicates the user's preference for the item.
- This can be viewed as a similarity function in the embedding space.
- It has become popular to learn the similarity function with a neural network.
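The two similarity functions contrasted above can be sketched side by side. The following is a minimal numpy illustration; the dimensions and the MLP weights are arbitrary placeholders (in NCF the MLP is trained), chosen only to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # embedding dimension
P = rng.normal(size=(5, d))    # user embeddings (5 users)
Q = rng.normal(size=(7, d))    # item embeddings (7 items)

# Dot-product similarity (matrix factorization): one score per user-item pair.
scores_dot = P @ Q.T           # shape (5, 7)

# MLP-learned similarity: feed the concatenated pair [p, q] through a network.
# The weights here are random stand-ins for trained parameters.
W1 = rng.normal(size=(2 * d, 16))
b1 = np.zeros(16)
w2 = rng.normal(size=16)

def mlp_score(p, q):
    h = np.maximum(0.0, np.concatenate([p, q]) @ W1 + b1)  # ReLU hidden layer
    return h @ w2                                          # scalar score

score = mlp_score(P[0], Q[0])
```

The dot product scores all user-item pairs with a single matrix multiplication, whereas the MLP must be evaluated once per pair.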

## Highlights

- Embedding based models have been the state of the art in collaborative filtering for over a decade
- Our main interest is to investigate if the multilayer perceptron (MLP)-learned similarity is superior to a simple dot product
- With a properly set up matrix factorization model, the experiments do not show any evidence that an MLP is superior
- The neural collaborative filtering (NCF) paper [16] proposes a combined model where the similarity function is a sum of dot-product and MLP, as in Eq (5) – this is called NeuMF2
- The experiments do not support the claim in [16] that a dot product model can be enhanced by feeding some part of its embeddings through an MLP
- Our findings indicate that a dot product might be a better default choice for combining embeddings than learned similarities using MLP or neural matrix factorization (NeuMF)
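The NeuMF model mentioned above sums a generalized matrix factorization (GMF) term, a weighted dot product, with an MLP term, each on its own embeddings. A minimal sketch, with random placeholder weights standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# NeuMF uses separate embeddings for the GMF part and the MLP part.
p_gmf, q_gmf = rng.normal(size=d), rng.normal(size=d)
p_mlp, q_mlp = rng.normal(size=d), rng.normal(size=d)

h_gmf = rng.normal(size=d)            # GMF output weights
W1 = rng.normal(size=(2 * d, 16))
b1 = np.zeros(16)
h_mlp = rng.normal(size=16)           # MLP output weights

# GMF: a weighted dot product (elementwise product, then weighted sum).
gmf_term = h_gmf @ (p_gmf * q_gmf)

# MLP branch on the concatenated embeddings.
hidden = np.maximum(0.0, np.concatenate([p_mlp, q_mlp]) @ W1 + b1)
mlp_term = h_mlp @ hidden

score = gmf_term + mlp_term           # the combined NeuMF-style score
```

Note that with h_gmf fixed to all ones, the GMF term reduces to the plain dot product ⟨p_gmf, q_gmf⟩.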

## Methods

- The NCF paper [16] evaluates on an item retrieval task on two datasets: a binarized version of Movielens 1M [13] and a dataset from Pinterest [12].
- The authors create a second test set that consists of fresh embeddings that were never seen during training, i.e., they sample the embeddings for every case from N(0, σ²_emb I) instead of picking them from P and Q.
- The motivation for this setup is to investigate whether the learned similarity function generalizes to embeddings that were not seen during training.

## Results

- The results are reported in Fig. 2.
- As can be seen in Fig. 2, the dot product substantially outperforms MLP on all datasets, evaluation metrics and embedding dimensions.
- The NCF paper [16] proposes a combined model where the similarity function is a sum of dot-product and MLP, as in Eq (5) – this is called NeuMF2.
- Figure 3 shows the approximation error of the MLP for different choices of embedding dimensions and as a function of training data.
- The authors suggest that with enough training data and wide enough hidden layers, an MLP can approximate a dot product.

## Conclusion

- Following the arguments in [33], it is possible that the studies in [16] and [8] did not properly set up MLP and NeuMF, and that these results could be further improved.
- It is possible that the performance of these models is different on other datasets.
- At this point, the revised experiments from [16] provide no evidence supporting the claim that an MLP-learned similarity is superior to a dot product.
- This negative result holds for NeuMF where a GMF is added to the MLP.

- Table 1: Comparison from [8] of MLP+GMF (NeuMF) with various baselines and our results. The best result is highlighted in bold; the second best is underlined.

## Related Work

### 6.1 Dot Products at the Output Layer of DNNs

At first glance it might appear that our work questions the use of neural networks in recommender systems. This is not the case: as we discuss now, many of the most competitive neural networks use a dot product at the output, not an MLP.

Consider the general multiclass classification task where (x, y) is a labeled training example with input x and label y ∈ {1, …, n}. A common approach is to define a DNN f that maps the input x to a representation (or embedding) f(x) ∈ ℝ^d. At the final stage, this representation is combined with the class labels to produce a vector of scores. Commonly, this is done by multiplying the input representation f(x) ∈ ℝ^d with a class matrix Q ∈ ℝ^{n×d} to obtain a scalar score for each of the n classes. This vector is then used in the loss function, for example as logits in a softmax cross entropy with the label y.

This falls exactly under the family of models discussed in this paper, where p = f(x) ∈ ℝ^d and the classes are the items. In fact, the model described above is a dot product model, because at the output Q f(x) = Q p = [⟨p, q_i⟩]_{i=1}^{n}, which means each input-label (or user-item) combination is a dot product between an input (or user) embedding and a label (or item) embedding. This dot product combination of input and class representation is commonly used in sophisticated DNNs for image classification [23, 14] and for natural language processing [4, 28, 9]. This makes our finding that a dot product is a powerful embedding combiner well aligned with the broader DNN community, where it is common to apply a dot product at the output for multiclass classification.
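The output stage described above can be sketched in a few lines of numpy. The network f is replaced here by a trivial stand-in (any DNN with a d-dimensional output fits in its place), and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 16, 100                # representation dimension, number of classes (items)
Q = rng.normal(size=(n, d))   # class (item) embedding matrix

def f(x):
    # Stand-in for a DNN mapping the input to an embedding in R^d.
    return np.tanh(x)

x = rng.normal(size=d)
logits = Q @ f(x)             # one dot product <f(x), q_i> per class

# These logits would typically feed a softmax cross-entropy loss.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

Each entry of `logits` is exactly a dot product between the input embedding p = f(x) and one class embedding q_i, which is what makes this output layer an instance of the dot product model.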

## Study Data and Analysis

datasets: 3

We draw the embeddings p, q from N(0, σ²_emb I) and set the true label as y(p, q) = ⟨p, q⟩ + ε, where ε ∼ N(0, σ²_label) models the label noise. From this process we create three datasets, each consisting of tuples (p, q, y). One of the datasets is used for training and the remaining two for testing.
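This generation process is straightforward to sketch. The dimension, dataset sizes, and noise scales below are placeholder values, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8                  # embedding dimension (illustrative)
sigma_emb = 1.0        # std of the embedding distribution (illustrative)
sigma_label = 0.1      # std of the label noise (illustrative)

def make_dataset(num_examples):
    # Draw p, q ~ N(0, sigma_emb^2 I) and label y = <p, q> + noise.
    P = rng.normal(0.0, sigma_emb, size=(num_examples, d))
    Q = rng.normal(0.0, sigma_emb, size=(num_examples, d))
    noise = rng.normal(0.0, sigma_label, size=num_examples)
    y = np.einsum("ij,ij->i", P, Q) + noise   # row-wise dot products
    return P, Q, y

train = make_dataset(10_000)   # one dataset for training
test_a = make_dataset(1_000)   # two held-out datasets for testing
test_b = make_dataset(1_000)
```

Because the test embeddings are drawn fresh from the same distribution, they never overlap with the training embeddings, matching the generalization setup described above.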

users: 128000

In all cases, the approximation error is well above what is considered a large difference for problems of comparable scale. For example, for the moderate d = 128, with 128,000 users, the error is still above 0.02, much higher than the 0.01 that already counts as a very significant difference. This experiment shows the difficulty of using an MLP to approximate the dot product, even when it is explicitly trained to do so.
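The experiment above, training an MLP to regress the dot product, can be reproduced in miniature with plain numpy. This is a small-scale sketch, not the paper's setup: the dimension, width, learning rate, and iteration count are placeholder choices, and the point is only to show the training loop, not to match the reported error numbers.

```python
import numpy as np

rng = np.random.default_rng(4)
d, hidden, n = 2, 32, 5000
P = rng.normal(size=(n, d))
Q = rng.normal(size=(n, d))
X = np.concatenate([P, Q], axis=1)   # inputs: concatenated [p, q]
y = np.sum(P * Q, axis=1)            # targets: the exact dot product <p, q>

# One-hidden-layer ReLU MLP, trained with full-batch gradient descent on MSE.
W1 = rng.normal(scale=0.5, size=(2 * d, hidden)); b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.5, size=hidden); b2 = 0.0
lr = 0.01

def forward(X):
    H = np.maximum(0.0, X @ W1 + b1)     # ReLU hidden layer
    return H, H @ w2 + b2

_, pred = forward(X)
mse_before = np.mean((pred - y) ** 2)

for _ in range(500):
    H, pred = forward(X)
    err = pred - y                        # gradient of MSE w.r.t. pred (up to 2/n)
    grad_w2 = H.T @ err / n
    grad_b2 = err.mean()
    dH = np.outer(err, w2) * (H > 0)      # backprop through the ReLU
    grad_W1 = X.T @ dH / n
    grad_b1 = dH.mean(axis=0)
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

_, pred = forward(X)
mse_after = np.mean((pred - y) ** 2)
```

Even in this toy setting the MLP only gradually drives down the regression error; the paper's point is that reaching dot-product-level accuracy requires substantial width and training data.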

## References

- Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning (2019), pp. 242–252.
- Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (2014), ICML’14, JMLR.org, p. II–1908–II–1916.
- Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory 39, 3 (1993), 930–945.
- Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research 3, Feb (2003), 1137–1155.
- Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. Latent cross: Making use of context in recurrent recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (New York, NY, USA, 2018), WSDM ’18, Association for Computing Machinery, p. 46–54.
- Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (New York, NY, USA, 2016), RecSys ’16, Association for Computing Machinery, p. 191–198.
- Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2, 4 (1989), 303–314.
- Dacrema, M. F., Boglio, S., Cremonesi, P., and Jannach, D. A troubling analysis of reproducibility and progress in recommender systems research, 2019.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2018.
- Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (2019), pp. 1675–1685.
- Dziugaite, G. K., and Roy, D. M. Neural network matrix factorization, 2015.
- Geng, X., Zhang, H., Bian, J., and Chua, T. Learning image and user features for recommendation in social networks. In 2015 IEEE International Conference on Computer Vision (ICCV) (2015), pp. 4274–4282.
- Harper, F. M., and Konstan, J. A. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst. 5, 4 (Dec. 2015).
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2016).
- He, X., Du, X., Wang, X., Tian, F., Tang, J., and Chua, T.-S. Outer product-based neural collaborative filtering. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18 (7 2018), International Joint Conferences on Artificial Intelligence Organization, pp. 2227–2233.
- He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (Republic and Canton of Geneva, Switzerland, 2017), WWW ’17, International World Wide Web Conferences Steering Committee, pp. 173–182.
- Hornik, K., Stinchcombe, M., White, H., et al. Multilayer feedforward networks are universal approximators. Neural networks 2, 5 (1989), 359–366.
- Hu, B., Shi, C., Zhao, W. X., and Yu, P. S. Leveraging meta-path based context for top- n recommendation with a neural co-attention model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (New York, NY, USA, 2018), KDD ’18, Association for Computing Machinery, p. 1531–1540.
- Hu, Y., Koren, Y., and Volinsky, C. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (2008), ICDM ’08, pp. 263–272.
- Jawarneh, I. M. A., Bellavista, P., Corradi, A., Foschini, L., Montanari, R., Berrocal, J., and Murillo, J. M. A pre-filtering approach for incorporating contextual information into deep learning based recommender systems. IEEE Access 8 (2020), 40485–40498.
- Koren, Y. The bellkor solution to the netflix grand prize, 2009.
- Koren, Y., and Bell, R. Advances in Collaborative Filtering. Springer US, Boston, MA, 2011, pp. 145–186.
- Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
- Levy, M., and Jack, K. Efficient top-n recommendation by linear regression. In RecSys Large Scale Recommender Systems Workshop (2013).
- Li, D., Chen, C., Liu, W., Lu, T., Gu, N., and Chu, S. Mixture-rank matrix approximation for collaborative filtering. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 477–485.
- Liu, T., Moore, A. W., Gray, A., and Yang, K. An investigation of practical approximate nearest neighbor algorithms. In Proceedings of the 17th International Conference on Neural Information Processing Systems (Cambridge, MA, USA, 2004), NIPS’04, MIT Press, p. 825–832.
- Mattson, P., Cheng, C., Coleman, C., Diamos, G., Micikevicius, P., Patterson, D., Tang, H., Wei, G.-Y., Bailis, P., Bittorf, V., Brooks, D., Chen, D., Dutta, D., Gupta, U., Hazelwood, K., Hock, A., Huang, X., Ike, A., Jia, B., Kang, D., Kanter, D., Kumar, N., Liao, J., Ma, G., Narayanan, D., Oguntebi, T., Pekhimenko, G., Pentecost, L., Reddi, V. J., Robie, T., John, T. S., Tabaru, T., Wu, C.-J., Xu, L., Yamazaki, M., Young, C., and Zaharia, M. Mlperf training benchmark, 2019.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (2013), pp. 3111–3119.
- Ning, X., and Karypis, G. Slim: Sparse linear methods for top-n recommender systems. In 2011 IEEE 11th International Conference on Data Mining (2011), IEEE, pp. 497–506.
- Niu, W., Caverlee, J., and Lu, H. Neural personalized ranking for image recommendation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (New York, NY, USA, 2018), WSDM ’18, Association for Computing Machinery, p. 423–431.
- Paterek, A. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD cup and workshop (2007), vol. 2007, pp. 5–8.
- Qin, J., Ren, K., Fang, Y., Zhang, W., and Yu, Y. Sequential recommendation with dual side neighbor-based collaborative relation modeling. In Proceedings of the 13th International Conference on Web Search and Data Mining (New York, NY, USA, 2020), WSDM ’20, Association for Computing Machinery, p. 465–473.
- Rendle, S., Zhang, L., and Koren, Y. On the difficulty of evaluating baselines: A study on recommender systems. CoRR abs/1905.01395 (2019).
- Shrivastava, A., and Li, P. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Cambridge, MA, USA, 2014), NIPS’14, MIT Press, p. 2321–2329.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems (2017), pp. 5998–6008.
- Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
- Zamani, H., and Croft, W. B. Learning a joint search and recommendation model from user-item interactions. In Proceedings of the 13th International Conference on Web Search and Data Mining (New York, NY, USA, 2020), WSDM ’20, Association for Computing Machinery, p. 717–725.
- Zhao, X., Zhu, Z., Zhang, Y., and Caverlee, J. Improving the estimation of tail ratings in recommender system with multi-latent representations. In Proceedings of the 13th International Conference on Web Search and Data Mining (New York, NY, USA, 2020), WSDM ’20, Association for Computing Machinery, p. 762–770.
