
# Feature hashing for large scale multitask learning

International Conference on Machine Learning (2009): 1113–1120


Abstract

Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this appro…

Introduction

- Eq. (1) is what is commonly referred to as the kernel trick
- It allows the use of inner products between very high-dimensional feature vectors φ(x) and φ(x′) implicitly, through the definition of a positive semi-definite kernel function k, without ever having to compute a vector φ(x) directly
- This can be powerful in classification settings where the original input representation yields a non-linear decision boundary
- Linear separability can then be achieved in the high-dimensional feature space induced by φ
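The kernel trick described above can be illustrated with a small sketch (a minimal example using a degree-2 polynomial kernel for concreteness; the paper itself is not tied to this particular kernel):

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Polynomial kernel k(x, z) = (<x, z> + 1)^degree.
    Equals the inner product <phi(x), phi(z)> of an explicit
    polynomial feature map that is never materialized."""
    return (np.dot(x, z) + 1.0) ** degree

def phi2(x):
    """Explicit degree-2 feature map for a 2-d input, for comparison:
    phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
# The kernel value matches the inner product in the explicit feature space.
assert np.isclose(poly_kernel(x, z), np.dot(phi2(x), phi2(z)))
```

The kernel evaluates a 6-dimensional inner product at the cost of a 2-dimensional one; with richer kernels the implicit feature space can be far larger, or infinite-dimensional.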

Highlights

- Kernel methods use inner products as the basic tool for comparisons between objects
- In Section 2 we introduce specialized hash functions with unbiased inner products that are directly applicable to a large variety of kernel methods
- We follow the convention of setting the classification threshold at test time such that exactly 1% of the not-spam test data is classified as spam. Our implementation of the personalized hash functions is illustrated in Figure 1
- As part of our theoretical analysis we introduce unbiased hash functions and provide exponential tail bounds for hash kernels
- We derive that random subspaces of the hashed space are unlikely to interact, which makes multitask learning with many tasks possible
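The signed hashing construction behind these results can be sketched as follows (a minimal illustration, not the authors' implementation: the md5-based index and sign hashes stand in for the paper's hash functions h and ξ):

```python
import hashlib
import numpy as np

def _h(token, salt, m):
    """Deterministic hash of a string token into {0, ..., m-1}."""
    digest = hashlib.md5((salt + token).encode()).hexdigest()
    return int(digest, 16) % m

def hashed_features(tokens, m=2 ** 10):
    """Signed feature hashing: phi_i(x) = sum over tokens j with h(j) = i
    of xi(j) * x_j.  The random sign xi(j) in {+1, -1} is what makes the
    hashed inner product an unbiased estimate of the original one."""
    phi = np.zeros(m)
    for token in tokens:
        i = _h(token, "index:", m)                           # bucket h(j)
        sign = 1.0 if _h(token, "sign:", 2) == 0 else -1.0   # xi(j)
        phi[i] += sign
    return phi

doc_a = "buy cheap pills now".split()
doc_b = "buy pills online".split()
phi_a, phi_b = hashed_features(doc_a), hashed_features(doc_b)
# With m large relative to the vocabulary, collisions are rare and the
# hashed inner product tracks the exact bag-of-words inner product (= 2 here).
print(np.dot(phi_a, phi_b))
```

The map is linear in the input, so documents can be hashed token by token in a single streaming pass with fixed memory m.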

Results

- The authors used a proprietary email spam classification task of n = 3.2 million emails, properly anonymized, collected from |U| = 433,167 users.
- The data set consists of 40 million unique words.
- The authors follow the convention of setting the classification threshold at test time such that exactly 1% of the not-spam test data is classified as spam. Their implementation of the personalized hash functions is illustrated in Figure 1.
- To obtain a personalized hash function φu for user u, the authors concatenate a unique…
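One way the personalization above can be realized is sketched below (an illustrative reading of the truncated bullet, not the authors' code: the md5-based hashes and the `user + "_" + token` prefixing scheme are assumptions standing in for the paper's construction):

```python
import hashlib
import numpy as np

M = 2 ** 18  # one shared hash space for global and all per-user features

def _bucket(token, salt):
    digest = hashlib.md5((salt + token).encode()).hexdigest()
    return int(digest, 16) % M

def _sign(token):
    digest = hashlib.md5(("sign:" + token).encode()).hexdigest()
    return 1.0 if int(digest, 16) % 2 == 0 else -1.0

def personalized_phi(tokens, user):
    """phi_u(x): hash every token once globally and once prefixed with the
    user id, so a single weight vector over the shared space can hold a
    global spam model plus a per-user correction."""
    phi = np.zeros(M)
    for t in tokens:
        for token in (t, user + "_" + t):  # global copy + personalized copy
            phi[_bucket(token, "idx:")] += _sign(token)
    return phi
```

Because both copies live in one fixed-size space, adding a user costs no extra memory; the paper's tail bounds are what guarantee the per-user subspaces rarely interfere.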

Conclusion

- In this paper the authors analyze the hashing-trick for dimensionality reduction theoretically and empirically.
- As part of the theoretical analysis the authors introduce unbiased hash functions and provide exponential tail bounds for hash kernels.
- These give further insight into hash-spaces and explain previously made empirical observations.
- The authors demonstrate that even with a very large number of tasks and features, all mapped into a joint lower-dimensional hash space, one can obtain impressive classification results with finite memory guarantees

Related Work

- A number of researchers have tackled related, albeit different problems.

(Rahimi & Recht, 2008) use Bochner’s theorem and sampling to obtain approximate inner products for Radial Basis Function kernels. (Rahimi & Recht, 2009) extend this to sparse approximation of weighted combinations of basis functions. This is computationally efficient for many function spaces. Note that the representation is dense.
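The Rahimi & Recht construction can be sketched in a few lines (a minimal illustration for the Gaussian RBF kernel; the dimensions and bandwidth below are arbitrary choices, not values from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 4000, 1.0  # input dim, number of random features, RBF bandwidth

# By Bochner's theorem the RBF kernel is the Fourier transform of a Gaussian
# density, so sampling frequencies W from that Gaussian yields random cosine
# features whose inner product approximates the kernel.
W = rng.normal(0.0, 1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    """Dense random feature map: z(x).z(y) ~= exp(-||x-y||^2 / (2 sigma^2))."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
approx = z(x) @ z(y)
```

Note that z(x) is dense even when x is sparse, which is precisely the contrast with the hashing approach drawn above.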

(Li et al., 2007) take a complementary approach: for sparse feature vectors φ(x), they devise a scheme for reducing the number of nonzero terms even further. While this is in principle desirable, it does not resolve the problem of φ(x) being high dimensional. More succinctly, it remains necessary to express the function in the dual representation rather than as a linear function f(x) = ⟨φ(x), w⟩, where w is unlikely to be compactly represented.

(Achlioptas, 2003) provides computationally efficient randomization schemes for dimensionality reduction. Instead of performing a dense d × m matrix-vector multiplication to reduce a vector of dimensionality d to one of dimensionality m, as required by the algorithm of (Gionis et al., 1999), he only requires…
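Achlioptas' database-friendly scheme can be sketched as follows (a minimal illustration; the dimensions are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 2000, 256  # original and reduced dimensionality

# Achlioptas (2003): projection entries drawn from {+1, 0, -1} with
# probabilities {1/6, 2/3, 1/6}.  Two thirds of the matrix is zero, so
# the multiply needs roughly a third of the work of a dense Gaussian
# projection, and the entries need no floating-point sampling.
R = rng.choice([1.0, 0.0, -1.0], size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])

def project(x):
    """Johnson-Lindenstrauss style projection; the sqrt(3/m) scaling makes
    the squared norm an unbiased estimate of ||x||^2."""
    return np.sqrt(3.0 / m) * (R @ x)

x = rng.normal(size=d)
print(np.linalg.norm(x), np.linalg.norm(project(x)))  # approximately equal
```

Unlike feature hashing, this still materializes an m × d matrix; the hashing trick removes even that cost by computing the projection implicitly from the hash functions.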

References

- Achlioptas, D. (2003). Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of Computer and System Sciences, 66, 671–687.
- Bennett, J., & Lanning, S. (2007). The Netflix Prize. Proceedings of Conference on Knowledge Discovery and Data Mining Cup and Workshop 2007.
- Bernstein, S. (1946). The theory of probabilities. Moscow: Gastehizdat Publishing House.
- Daume, H. (2007). Frustratingly easy domain adaptation. Annual Meeting of the Association for Computational Linguistics (p. 256).
- Ganchev, K., & Dredze, M. (2008). Small statistical models by random feature mixing. Workshop on Mobile Language Processing, Annual Meeting of the Association for Computational Linguistics.
- Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. Proceedings of the 25th VLDB Conference (pp. 518–529). Edinburgh, Scotland: Morgan Kaufmann.
- Langford, J., Li, L., & Strehl, A. (2007). Vowpal wabbit online learning project (Technical Report). http://hunch.net/?p=309.
- Ledoux, M. (2001). The concentration of measure phenomenon. Providence, RI: AMS.
- Rahimi, A., & Recht, B. (2009). Randomized kitchen sinks. In L. Bottou, Y. Bengio, D. Schuurmans and D. Koller (Eds.), Advances in neural information processing systems 21.
- Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash kernels. Proc. Intl. Workshop on Artificial Intelligence and Statistics 12.
- Li, P., Church, K., & Hastie, T. (2007). Conditional random sampling: A sketch-based sampling technique for sparse data. In B. Scholkopf, J. Platt and T. Hoffman (Eds.), Advances in neural information processing systems 19, 873–880.
- Rahimi, A., & Recht, B. (2008). Random features for largescale kernel machines. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in neural information processing systems 20.
Next denote by d(φ, A) the distance between a hash function φ and a set A of hash functions, that is, d(φ, A) = inf_{φ′ ∈ A} d(φ, φ′). In this case Talagrand's convex distance inequality (Ledoux, 2001) holds: if Pr(A) denotes the total probability mass of the set A, then …
