A theory of learning from different domains

Machine Learning 79(1–2): 151–175 (2010)


Abstract

Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. Often, however, we have plentiful labeled training data from a source domain but wish to learn a classifier which performs well on a target domain with a different distribution and little or no labeled training data…

Introduction
  • Most research in machine learning, both theoretical and empirical, assumes that models are trained and tested using data drawn from some fixed distribution.
  • This single-domain setting has been well studied, and uniform convergence theory guarantees that a model’s empirical training error is close to its true error under such assumptions (one standard form of this guarantee is sketched after this list).
  • In the motivating spam-filtering example, the challenge is that each user receives a unique distribution of email, so a filter trained on mail from a group of current users must be adapted to each new user.
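
For reference, here is one standard VC-style statement of the uniform convergence guarantee mentioned above (Vapnik 1998). The exact constants vary across statements, so treat this as a sketch of the usual form rather than the paper's precise inequality:

```latex
% One common VC uniform convergence bound: for a hypothesis class H with
% VC dimension d and an i.i.d. sample of size m, with probability at
% least 1 - \delta, every h in H satisfies
\epsilon(h) \;\le\; \hat{\epsilon}(h)
  + \sqrt{\frac{4}{m}\left(d \log\frac{2em}{d} + \log\frac{4}{\delta}\right)}
% where \epsilon(h) is the true error of h and \hat{\epsilon}(h) its
% empirical (training) error.
```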
Highlights
  • Most research in machine learning, both theoretical and empirical, assumes that models are trained and tested using data drawn from some fixed distribution
  • We might have a spam filter trained from a large email collection received by a group of current users and wish to adapt it for a new user
  • In this work we investigate the problem of domain adaptation
  • We explore an extension of our theory to the case of multiple source domains
  • We present a theoretical investigation of domain adaptation, a task in which we have a large amount of training data from a source domain but wish to apply a model in a target domain with a much smaller amount of training data
  • Our bounds relate the target error to the source error through a divergence between the source and target distributions that can be estimated from finite unlabeled samples, with guarantees stated in terms of VC dimension; the main single-source bound is sketched after this list
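
The central single-source result (Theorem 2 in the paper) makes the last highlight precise: the target error of any hypothesis is bounded by its source error, half the HΔH-divergence between the two domain distributions, and the error of the best joint hypothesis:

```latex
% Theorem 2 of the paper: for every h in the hypothesis class H,
\epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; \tfrac{1}{2}\, d_{H \Delta H}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; \lambda,
\qquad
\lambda \;=\; \min_{h' \in H}\bigl[\epsilon_S(h') + \epsilon_T(h')\bigr]
% where \epsilon_S and \epsilon_T are the source and target errors, and
% d_{H \Delta H} is a divergence between the source and target marginal
% distributions that can be estimated from finite unlabeled samples.
```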
Methods
  • The authors explore Theorem 3 further by comparing its predictions to those of an approximation that can be computed from finite labeled source data together with unlabeled source and target data.
  • If the authors had enough labeled target data to estimate these quantities directly, they would not need to adapt a source classifier in the first place.
  • The ideal joint error λ is assumed small enough to be a negligible term in the bound; if it were large, no single hypothesis could perform well on both domains, and adaptation from the source alone could not succeed. A sketch of how the divergence term can be estimated from unlabeled samples follows this list.
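
The divergence term can be approximated by training a classifier to distinguish unlabeled source examples from unlabeled target examples, the "proxy" distance used in the paper's experiments (following Ben-David et al. 2006). Below is a minimal sketch assuming scikit-learn and NumPy are available; the feature matrices for the two unlabeled samples are assumed to be supplied by the caller:

```python
# Minimal sketch of the "proxy" distance between two domains, estimated
# from unlabeled samples as in Ben-David et al. (2006): train a classifier
# to separate source from target examples and convert its held-out error
# into a distance in [0, 2].
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def proxy_a_distance(X_source, X_target, seed=0):
    """Return ~0 when the domains are indistinguishable and ~2 when a
    linear classifier separates them perfectly."""
    # Label each example by the domain it came from, not by its class.
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    err = 1.0 - LinearSVC().fit(X_tr, y_tr).score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)

# Hypothetical usage with feature matrices from two unlabeled samples:
# d_hat = proxy_a_distance(X_books, X_kitchen)
```

The more accurately the domain classifier separates the two samples, the larger the estimated divergence, and hence the looser the adaptation bound.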
Results
  • The authors illustrate the theory on the natural language processing task of sentiment classification (Pang et al. 2002).
  • The point of these experiments is not to instantiate a state-of-the-art adaptation method, but to illustrate how the terms of the bound behave on real data.
  • The data set consists of reviews from the Amazon website for several different types of products; a sketch of the corresponding cross-domain setup follows this list.
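
To make the experimental setup concrete, here is a minimal sketch of a cross-domain sentiment experiment in the spirit of the paper: train a linear classifier on labeled reviews from one product type and evaluate it on another. The load_reviews helper, the bag-of-words features, and the logistic regression model are illustrative assumptions, not the paper's exact protocol:

```python
# Sketch of a cross-domain sentiment experiment in the spirit of the
# paper: fit on a labeled source domain, evaluate on a target domain.
# load_reviews is a hypothetical helper; supply your own data loader.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def load_reviews(domain):
    """Hypothetical loader: return (texts, labels) for one product type,
    with labels 1 = positive review and 0 = negative review."""
    raise NotImplementedError("supply an Amazon-reviews loader here")

src_texts, src_labels = load_reviews("books")
tgt_texts, tgt_labels = load_reviews("kitchen")

# Bag-of-words features, fit on the source domain only.
vec = CountVectorizer(binary=True)
X_src = vec.fit_transform(src_texts)
X_tgt = vec.transform(tgt_texts)

clf = LogisticRegression(max_iter=1000).fit(X_src, src_labels)
print("source accuracy:", clf.score(X_src, src_labels))
print("target accuracy:", clf.score(X_tgt, tgt_labels))
# The gap between the two accuracies grows with the divergence between
# the domains, which is exactly what the theory quantifies.
```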
Conclusions
  • One might ask whether there exist settings where a non-uniform weighting can lead to a significantly lower value of the bound than a uniform weighting.
  • This is true, for example, in the setting studied by Mansour et al. (2009a, 2009b), who derive results for combining pre-computed hypotheses.
  • They show that for arbitrary convex losses, if the Rényi divergence between the target and a mixture of sources is small, it is possible to combine low-error source hypotheses to create a low-error target hypothesis.
  • It would be interesting to investigate algorithms that choose a convex combination of multiple sources to minimize the bound in Theorem 5, as a possible approach to adaptation from multiple sources; the weighted objective such algorithms would work with is sketched below.
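
For reference, a convex combination of sources induces an α-weighted source error, and the multi-source bounds (Theorems 4 and 5 in the paper) control the gap between this weighted error and the target error. The exact constants and sample-complexity terms are in the paper; this sketch shows only the structure of the statement:

```latex
% Alpha-weighted combination of k source errors:
\epsilon_\alpha(h) \;=\; \sum_{j=1}^{k} \alpha_j\, \epsilon_j(h),
\qquad \alpha_j \ge 0,\quad \sum_{j=1}^{k} \alpha_j = 1 .
% The multi-source bound controls the gap to the target error via
% per-source divergence and ideal-joint-error terms, roughly:
\bigl|\,\epsilon_\alpha(h) - \epsilon_T(h)\,\bigr|
  \;\le\; \sum_{j=1}^{k} \alpha_j
  \Bigl( \tfrac{1}{2}\, d_{H \Delta H}(\mathcal{D}_j, \mathcal{D}_T)
         + \lambda_j \Bigr),
\qquad
\lambda_j \;=\; \min_{h' \in H}\bigl[\epsilon_T(h') + \epsilon_j(h')\bigr].
```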
Related Work
  • Crammer et al (2008) introduced a PAC-style model of learning from multiple sources in which the distribution over input points is assumed to be the same across sources but each source may have its own deterministic labeling function. They derive bounds on the target error of the function that minimizes the empirical error on (uniformly weighted) data from any subset of the sources. As discussed in Sect. 8.2, the bounds that they derive are equivalent to ours in certain restricted settings, but their theory is significantly less general.

    Daumé (2007) and Finkel (2009) suggest an empirically successful method for domain adaptation based on multi-task learning. The crucial difference between our domain adaptation setting and analyses of multi-task methods is that multi-task bounds require labeled data from each task, and make no attempt to exploit unlabeled data. Although these bounds have a more limited scope than ours, they can sometimes yield useful results even when the optimal predictors for each task (or domain in the case of Daumé 2007) are quite different (Baxter 2000; Ando and Zhang 2005).
Funding
  • This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. NBCHD030010 (CALO), by the National Science Foundation under grants ITR 0428193 and RI 0803256, and by a gift from Google, Inc. to the University of Pennsylvania
  • Koby Crammer is a Horev fellow, supported by the Taub Foundations
References
  • Ando, R., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.
    Google ScholarLocate open access versionFindings
  • Anthony, M., & Bartlett, P. (1999). Neural network learning: theoretical foundations. Cambridge: Cambridge University Press.
    Google ScholarFindings
  • Bartlett, P., & Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
    Google ScholarLocate open access versionFindings
  • Batu, T., Fortnow, L., Rubinfeld, R., Smith, W., & White, P. (2000). Testing that distributions are close. In: IEEE symposium on foundations of computer science (Vol. 41, pp. 259–269).
    Google ScholarLocate open access versionFindings
  • Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149– 198.
    Google ScholarLocate open access versionFindings
  • Ben-David, S., Eiron, N., & Long, P. (2003). On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66, 496–514.
    Google ScholarLocate open access versionFindings
  • Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2006). Analysis of representations for domain adaptation. In: Advances in neural information processing systems.
    Google ScholarFindings
  • Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In: Proceedings of the international conference on machine learning.
    Google ScholarLocate open access versionFindings
  • Bikel, D., Miller, S., Schwartz, R., & Weischedel, R. (1997). Nymble: a high-performance learning namefinder. In: Conference on applied natural language processing.
    Google ScholarLocate open access versionFindings
  • Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Wortman, J. (2007a). Learning bounds for domain adaptation. In: Advances in neural information processing systems.
    Google ScholarFindings
  • Blitzer, J., Dredze, M., & Pereira, F. (2007b) Biographies, Bollywood, boomboxes and blenders: domain adaptation for sentiment classification. In: ACL.
    Google ScholarFindings
  • Collins, M. (1999). Head-driven statistical models for natural language parsing. PhD thesis, University of Pennsylvania.
    Google ScholarFindings
  • Cortes, C., Mohri, M., Riley, M., & Rostamizadeh, A. (2008). Sample selection bias correction theory. In: Proceedings of the 19th annual conference on algorithmic learning theory.
    Google ScholarLocate open access versionFindings
  • Crammer, K., Kearns, M., & Wortman, J. (2008). Learning from multiple sources. Journal of Machine Learning Research, 9, 1757–1774.
    Google ScholarLocate open access versionFindings
  • Dai, W., Yang, Q., Xue, G., & Yu, Y. (2007). Boosting for transfer learning. In: Proceedings of the international conference on machine learning.
    Google ScholarLocate open access versionFindings
  • Das, S., & Chen, M. (2001). Yahoo! for Amazon: extracting market sentiment from stock message boards. In: Proceedings of the Asia pacific finance association annual conference.
    Google ScholarLocate open access versionFindings
  • Daumé, H. (2007). Frustratingly easy domain adaptation. In: Association for computational linguistics (ACL).
    Google ScholarFindings
  • Finkel, J. R. Manning, C. D. (2009). Hierarchical Bayesian domain adaptation. In: Proceedings of the north American association for computational linguistics.
    Google ScholarLocate open access versionFindings
  • Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–161.
    Google ScholarLocate open access versionFindings
  • Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Schoelkopf, B. (2007). Correcting sample selection bias by unlabeled data. In: Advances in neural information processing systems.
    Google ScholarFindings
  • Jiang, J., & Zhai, C. (2007). Instance weighting for domain adaptation. In: Proceedings of the association for computational linguistics.
    Google ScholarLocate open access versionFindings
  • Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. In: Ver large databases.
    Google ScholarFindings
  • Li, X., & Bilmes, J. (2007). A Bayesian divergence prior for classification adaptation. In: Proceedings of the international conference on artificial intelligence and statistics.
    Google ScholarLocate open access versionFindings
  • Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009a). Domain adaptation with multiple sources. In: Advances in neural information processing systems.
    Google ScholarFindings
  • Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009b). Multiple source adaptation and the rényi divergence. In: Proceedings of the conference on uncertainty in artificial intelligence.
    Google ScholarLocate open access versionFindings
  • McAllester, D. (2003). Simplified PAC-Bayesian margin bounds. In: Proceedings of the sixteenth annual conference on learning theory.
    Google ScholarLocate open access versionFindings
  • Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of empirical methods in natural language processing.
    Google ScholarLocate open access versionFindings
  • Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In: Proceedings of empirical methods in natural language processing.
    Google ScholarLocate open access versionFindings
  • Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746.
    Google ScholarLocate open access versionFindings
  • Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: determining support or opposition from congressional floor-debate transcripts. In: Proceedings of empirical methods in natural language processing.
    Google ScholarLocate open access versionFindings
  • Turney, P. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the association for computational linguistics.
    Google ScholarLocate open access versionFindings
  • Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
    Google ScholarFindings
  • Zhang, T. (2004). Solving large-scale linear prediction problems with stochastic gradient descent. In: Proceedings of the international conference on machine learning.
    Google ScholarLocate open access versionFindings