Feature hashing for large scale multitask learning

International Conference on Machine Learning (2009): 1113-1120


Abstract

Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case: multitask learning with hundreds of thousands of tasks.

Introduction
  • Eq. (1) is commonly referred to as the kernel trick.
  • It allows the use of inner products between very high-dimensional feature vectors φ(x) and φ(x′) implicitly, through the definition of a positive semi-definite kernel matrix k, without ever having to compute a vector φ(x) directly (a minimal sketch appears after this list).
  • This can be powerful in classification settings where the original input representation has a non-linear decision boundary.
  • Linear separability can then be achieved in the high-dimensional feature space φ(x).
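To make the kernel trick concrete, here is a minimal Python sketch (not from the paper; the function names and the degree-2 polynomial kernel are illustrative choices) showing that the kernel evaluation reproduces the inner product of an explicit feature map without ever materializing it:

```python
import numpy as np

def phi_poly2(x):
    """Explicit degree-2 (homogeneous) polynomial feature map:
    all pairwise products x_i * x_j, i.e. O(d^2) features."""
    return np.outer(x, x).ravel()

def k_poly2(x, xp):
    """Kernel trick: <phi(x), phi(x')> = <x, x'>^2, computed in O(d)
    without ever forming phi explicitly."""
    return np.dot(x, xp) ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=5), rng.normal(size=5)

explicit = np.dot(phi_poly2(x), phi_poly2(xp))   # via the explicit map
implicit = k_poly2(x, xp)                        # via the kernel
assert np.allclose(explicit, implicit)
```

For a degree-p kernel on d-dimensional inputs the explicit map has on the order of d^p coordinates, which is why implicit evaluation through k is attractive.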
Highlights
  • Kernel methods use inner products as the basic tool for comparisons between objects
  • In Section 2 we introduce specialized hash functions with unbiased inner products that are directly applicable to a large variety of kernel methods (a minimal sketch follows this list)
  • We follow the convention of setting the classification threshold at test time such that exactly 1% of the not-spam test data is classified as spam. Our implementation of the personalized hash functions is illustrated in Figure 1
  • As part of our theoretical analysis we introduce unbiased hash functions and provide exponential tail bounds for hash kernels
  • We derive that random subspaces of the hashed space are unlikely to interact, which makes multitask learning with many tasks possible
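To illustrate the kind of construction highlighted above, the following is a minimal Python sketch of a signed hashed feature map in the spirit of the paper's Section 2: one hash h assigns each token to one of m bins and a second hash ξ supplies a ±1 sign, which is what makes the hashed inner product unbiased. The use of MD5 and all names are illustrative assumptions, not the authors' implementation.

```python
import hashlib

def hashed_features(tokens, m):
    """phi_i(x) = sum over tokens j with h(j) = i of xi(j) * x_j,
    where h(j) is a bin index in [0, m) and xi(j) is a +/-1 sign."""
    phi = [0.0] * m
    for token, value in tokens.items():
        digest = hashlib.md5(token.encode()).digest()
        h = int.from_bytes(digest[:4], "big") % m      # bin index h(j)
        xi = 1.0 if digest[4] % 2 == 0 else -1.0       # sign xi(j)
        phi[h] += xi * value
    return phi

# Toy usage: two bags of words hashed into 2^10 dimensions.
x  = {"free": 2.0, "money": 1.0, "hi": 1.0}
xp = {"free": 1.0, "meeting": 1.0}
m = 1 << 10
k_hash = sum(a * b for a, b in zip(hashed_features(x, m), hashed_features(xp, m)))
print(k_hash)  # equals the exact inner product (here 2.0) in expectation
```

The ±1 sign hash is what makes the expected hashed inner product match the original one; the paper's exponential tail bounds quantify how tightly the hashed value concentrates around it.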
Results
  • The authors used a proprietary email spam-classification task of n = 3.2 million emails, properly anonymized, collected from |U| = 433,167 users.
  • The data set consists of 40 million unique words.
  • The authors follow the convention of setting the classification threshold at test time such that exactly 1% of the not-spam test data is classified as spam. The authors' implementation of the personalized hash functions is illustrated in Figure 1.
  • To obtain a personalized hash function φu for user u, the authors concatenate a unique user id with each word in the email and hash the resulting tokens with the same global hash function (a minimal sketch follows this list).
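A minimal sketch of the personalization idea just described, under the assumption (mine, for illustration) that the user id is simply prefixed to each token before hashing; both the global and the per-user copies land in the same m-dimensional space, so a single weight vector serves all users.

```python
import hashlib

def signed_hash(token, m):
    """Return (bin index in [0, m), +/-1 sign) for a string token."""
    digest = hashlib.md5(token.encode()).digest()
    return int.from_bytes(digest[:4], "big") % m, (1.0 if digest[4] % 2 == 0 else -1.0)

def personalized_features(tokens, user_id, m):
    """Hash every token twice: once as-is (global features) and once
    prefixed with the user id (personalized features), summing both
    contributions into one m-dimensional vector."""
    phi = [0.0] * m
    for tok, value in tokens.items():
        for key in (tok, f"{user_id}_{tok}"):   # global copy + personalized copy
            i, sign = signed_hash(key, m)
            phi[i] += sign * value
    return phi

# A single linear weight vector w of length m then scores an email for
# user u via the inner product <w, personalized_features(email, u, m)>.
email = {"free": 1.0, "money": 1.0, "meeting": 1.0}
vec = personalized_features(email, "user_433", 1 << 18)
```

Because the global and per-user features share one hashed space, the theoretical result that random subspaces rarely interact is what keeps the many per-user tasks from interfering with each other.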
Conclusion
  • In this paper the authors analyze the hashing-trick for dimensionality reduction theoretically and empirically.
  • As part of the theoretical analysis the authors introduce unbiased hash functions and provide exponential tail bounds for hash kernels.
  • These give further insight into hash-spaces and explain previously made empirical observations.
  • The authors demonstrate that even with a very large number of tasks and features, all mapped into a joint lower-dimensional hash space, one can obtain impressive classification results with a finite memory guarantee.
Related Work
  • A number of researchers have tackled related, albeit different problems.

    (Rahimi & Recht, 2008) use Bochner’s theorem and sampling to obtain approximate inner products for Radial Basis Function kernels. (Rahimi & Recht, 2009) extend this to sparse approximation of weighted combinations of basis functions. This is computationally efficient for many function spaces. Note that the representation is dense.
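For context, here is a minimal sketch of the Rahimi & Recht construction mentioned above, specialized (my choice) to the Gaussian RBF kernel: frequencies are sampled from the kernel's Fourier transform, and random cosine features give a dense approximation whose inner products match the kernel in expectation. The names and the gamma parameterization are illustrative.

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Approximate k(x, x') = exp(-gamma * ||x - x'||^2) by
    z(x) = sqrt(2/D) * cos(x @ W + b), with W ~ N(0, 2*gamma) entrywise
    and b ~ Uniform[0, 2*pi); then z(x) . z(x') ~= k(x, x').
    Note that z(x) is dense even when x is sparse."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, D=2000, gamma=0.5)
approx = Z @ Z.T                     # approximate kernel matrix
exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(approx - exact).max())  # small with high probability
```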

    (Li et al., 2007) take a complementary approach: for sparse feature vectors φ(x), they devise a scheme to reduce the number of nonzero terms even further. While this is in principle desirable, it does not resolve the problem of φ(x) being high dimensional. More succinctly, it remains necessary to express the function in the dual representation rather than as a linear function f(x) = ⟨φ(x), w⟩, since w is unlikely to be compactly representable.

    (Achlioptas, 2003) provides computationally efficient randomization schemes for dimensionality reduction. Instead of performing a dense d·m-dimensional matrix-vector multiplication to reduce a vector of dimensionality d to one of dimensionality m, as is required by the algorithm of (Gionis et al., 1999), he only requires a sparse random matrix with entries in {−1, 0, +1}, so that only about a third of the multiplications need to be carried out (a minimal sketch follows).
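A minimal sketch of that database-friendly projection, following the standard Achlioptas (2003) construction (entries are +1 and −1 with probability 1/6 each and 0 with probability 2/3, scaled by sqrt(3/m)); the variable names and sizes are illustrative.

```python
import numpy as np

def achlioptas_projection(d, m, seed=0):
    """Sparse random projection matrix R of shape (d, m): entries are
    +1 / -1 with probability 1/6 each and 0 with probability 2/3,
    scaled by sqrt(3/m) so that E[<x R, x' R>] = <x, x'>."""
    rng = np.random.default_rng(seed)
    R = rng.choice([1.0, 0.0, -1.0], size=(d, m), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / m) * R

rng = np.random.default_rng(2)
x, xp = rng.normal(size=2000), rng.normal(size=2000)
R = achlioptas_projection(2000, m=400)
print(np.dot(x, xp), np.dot(x @ R, xp @ R))  # close with high probability
```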
References
  • Achlioptas, D. (2003). Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of Computer and System Sciences, 66, 671–687.
  • Bennett, J., & Lanning, S. (2007). The Netflix Prize. Proceedings of Conference on Knowledge Discovery and Data Mining Cup and Workshop 2007.
  • Bernstein, S. (1946). The theory of probabilities. Moscow: Gastehizdat Publishing House.
  • Daume, H. (2007). Frustratingly easy domain adaptation. Annual Meeting of the Association for Computational Linguistics (p. 256).
  • Ganchev, K., & Dredze, M. (2008). Small statistical models by random feature mixing. Workshop on Mobile Language Processing, Annual Meeting of the Association for Computational Linguistics.
  • Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. Proceedings of the 25th VLDB Conference (pp. 518–529). Edinburgh, Scotland: Morgan Kaufmann.
  • Langford, J., Li, L., & Strehl, A. (2007). Vowpal wabbit online learning project (Technical Report). http://hunch.net/?p=309.
  • Ledoux, M. (2001). The concentration of measure phenomenon. Providence, RI: AMS.
  • Rahimi, A., & Recht, B. (2009). Randomized kitchen sinks. In L. Bottou, Y. Bengio, D. Schuurmans and D. Koller (Eds.), Advances in neural information processing systems 21.
  • Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash kernels. Proc. Intl. Workshop on Artificial Intelligence and Statistics 12.
  • Li, P., Church, K., & Hastie, T. (2007). Conditional random sampling: A sketch-based sampling technique for sparse data. In B. Scholkopf, J. Platt and T. Hoffman (Eds.), Advances in neural information processing systems 19, 873–880.
  • Rahimi, A., & Recht, B. (2008). Random features for largescale kernel machines. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in neural information processing systems 20.
  • Next denote by d(φ, A) the distance between a hash function φ and a set A of hash functions, that is d(φ, A) = inf_{φ′ ∈ A} d(φ, φ′). In this case Talagrand's convex distance inequality (Ledoux, 2001) holds: if Pr(A) denotes the total probability mass of the set A, then …