A theory of learning from different domains
Machine Learning 79, no. 1-2 (2010): 151-175
Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. Often, however, we have plentiful labeled training data from a source domain but wish to learn a classifier which performs well on a target domain with a different distribution and little or no labeled training data…
- Most research in machine learning, both theoretical and empirical, assumes that models are trained and tested using data drawn from some fixed distribution.
- This single domain setting has been well studied, and uniform convergence theory guarantees that a model’s empirical training error is close to its true error under such assumptions.
- We might have a spam filter trained from a large email collection received by a group of current users and wish to adapt it for a new user
- The challenge is that each user receives a unique distribution of email
- In this work we investigate the problem of domain adaptation
- We explore an extension of our theory to the case of multiple source domains
- We presented a theoretical investigation of the task of domain adaptation, a task in which we have a large amount of training data from a source domain, but we wish to apply a model in a target domain with a much smaller amount of training data
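The core result these notes summarize can be sketched from the paper's published statement (its Theorem 2, from which the finite-sample Theorem 3 follows); the paper writes the joint-error term as λ, which the notes below render as Λ:

```latex
% Main adaptation bound (Theorem 2 of the paper): for every hypothesis
% h in a class H, the target error is controlled by the source error,
% a classifier-induced divergence between the two distributions, and
% the error of the best single hypothesis on both domains combined.
\epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; \lambda,
\qquad
\lambda \;=\; \min_{h' \in \mathcal{H}} \bigl[\,\epsilon_S(h') + \epsilon_T(h')\,\bigr]
```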
- Our bounds relate source and target error through a divergence between the two distributions that can be estimated from finite unlabeled samples, with guarantees stated in terms of VC dimension
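The divergence in these bounds can be estimated from unlabeled samples alone by training a classifier to distinguish source from target examples (the "proxy A-distance" idea used in the paper's line of work). A minimal sketch with synthetic data, using a deliberately simple nearest-centroid discriminator as a stand-in for whatever hypothesis class one actually works with:

```python
import numpy as np

def proxy_a_distance(source, target, rng):
    """Train a domain discriminator on mixed source/target points and
    plug its held-out error into d_A = 2 * (1 - 2 * err).  Error near
    0.5 (domains indistinguishable) gives a divergence near 0; error
    near 0 (domains separable) gives a divergence near 2."""
    X = np.concatenate([source, target])
    y = np.concatenate([np.zeros(len(source)), np.ones(len(target))])
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    half = len(X) // 2
    # Nearest-centroid discriminator fit on the first half of the data.
    mu0 = X[:half][y[:half] == 0].mean(axis=0)
    mu1 = X[:half][y[:half] == 1].mean(axis=0)
    pred = (np.linalg.norm(X[half:] - mu1, axis=1)
            < np.linalg.norm(X[half:] - mu0, axis=1)).astype(float)
    err = np.mean(pred != y[half:])
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(0)
same = proxy_a_distance(rng.normal(0, 1, (500, 5)),
                        rng.normal(0, 1, (500, 5)), rng)
shifted = proxy_a_distance(rng.normal(0, 1, (500, 5)),
                           rng.normal(2, 1, (500, 5)), rng)
print(same, shifted)  # identical domains -> near 0; shifted -> near 2
```

The discriminator here is illustrative; in practice one would use the same hypothesis class whose adaptation behavior the bound is meant to describe.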
- The authors explore Theorem 3 further by comparing its predictions to those of an approximation that can be computed from finite labeled source samples and unlabeled source and target samples.
- If the authors had enough target data to do this accurately, they would not need to adapt a source classifier in the first place.
- The authors argue that, for this task, Λ is small enough to be a negligible term in the bound.
- The authors illustrate the theory on the natural language processing task of sentiment classification (Pang et al. 2002).
- The point of these experiments is not to instantiate the bound exactly, but to illustrate its qualitative predictions.
- The data set consists of reviews from the Amazon website for several different types of products.
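The comparison described in these bullets amounts to plugging finite-sample estimates into the bound and reading off a predicted target error. A toy version of that arithmetic (all three quantities below are made-up illustrative values, not numbers from the paper's experiments):

```python
# Hypothetical plug-in estimates (illustrative only).
source_err = 0.12       # empirical error of h on labeled source data
half_divergence = 0.25  # half of the estimated source/target divergence
lam = 0.05              # joint-error term, assumed small for this task

# Theorem-2-style plug-in bound on the target error.
target_err_bound = source_err + half_divergence + lam
print(round(target_err_bound, 4))  # 0.42
```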
- One might ask whether there exist settings where a non-uniform weighting can lead to a significantly lower value of the bound than a uniform weighting.
- This is true, for example, in the setting studied by Mansour et al (2009a, 2009b), who derive results for combining pre-computed hypotheses
- They show that for arbitrary convex losses, if the Rényi divergence between the target and a mixture of sources is small, it is possible to combine low-error source hypotheses to create a low-error target hypothesis.
- It would be interesting to investigate algorithms that choose a convex combination of multiple sources to minimize the bound in Theorem 5 as possible approaches to adaptation from multiple sources
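One simple way to act on that suggestion is a grid search over convex source weights against a weighted objective. The objective below, sum over sources of α·(error + divergence/2), is a simplified stand-in for the actual Theorem 5 bound, and the per-source numbers are hypothetical:

```python
from itertools import product

# Hypothetical per-source empirical errors and estimated divergences
# to the target, for three source domains.
errs = [0.10, 0.30, 0.20]
divs = [0.60, 0.05, 0.30]

def bound(alpha):
    """Simplified weighted objective: each source is penalized for its
    own error and for its divergence from the target."""
    return sum(a * (e + d / 2) for a, e, d in zip(alpha, errs, divs))

# Coarse grid over the probability simplex of convex weights.
steps = 20
best = min(
    ((i / steps, j / steps, (steps - i - j) / steps)
     for i, j in product(range(steps + 1), repeat=2) if i + j <= steps),
    key=bound,
)
print(best, bound(best))
```

Because this simplified objective is linear in α, the minimum lands on a vertex of the simplex (a single source); the actual bound's finite-sample terms, which shrink as more weighted data is used, are what can make genuine mixtures preferable.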
- Crammer et al (2008) introduced a PAC-style model of learning from multiple sources in which the distribution over input points is assumed to be the same across sources but each source may have its own deterministic labeling function. They derive bounds on the target error of the function that minimizes the empirical error on (uniformly weighted) data from any subset of the sources. As discussed in Sect. 8.2, the bounds that they derive are equivalent to ours in certain restricted settings, but their theory is significantly less general.
- Daumé (2007) and Finkel (2009) suggest an empirically successful method for domain adaptation based on multi-task learning. The crucial difference between our domain adaptation setting and analyses of multi-task methods is that multi-task bounds require labeled data from each task, and make no attempt to exploit unlabeled data. Although these bounds have a more limited scope than ours, they can sometimes yield useful results even when the optimal predictors for each task (or domain in the case of Daumé 2007) are quite different (Baxter 2000; Ando and Zhang 2005).
- This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. NBCHD030010 (CALO), by the National Science Foundation under grants ITR 0428193 and RI 0803256, and by a gift from Google, Inc. to the University of Pennsylvania.
- Koby Crammer is a Horev fellow, supported by the Taub Foundations
- Ando, R., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.
- Anthony, M., & Bartlett, P. (1999). Neural network learning: theoretical foundations. Cambridge: Cambridge University Press.
- Bartlett, P., & Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
- Batu, T., Fortnow, L., Rubinfeld, R., Smith, W., & White, P. (2000). Testing that distributions are close. In: IEEE symposium on foundations of computer science (Vol. 41, pp. 259–269).
- Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149–198.
- Ben-David, S., Eiron, N., & Long, P. (2003). On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66, 496–514.
- Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2006). Analysis of representations for domain adaptation. In: Advances in neural information processing systems.
- Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In: Proceedings of the international conference on machine learning.
- Bikel, D., Miller, S., Schwartz, R., & Weischedel, R. (1997). Nymble: a high-performance learning namefinder. In: Conference on applied natural language processing.
- Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Wortman, J. (2007a). Learning bounds for domain adaptation. In: Advances in neural information processing systems.
- Blitzer, J., Dredze, M., & Pereira, F. (2007b). Biographies, Bollywood, boomboxes and blenders: domain adaptation for sentiment classification. In: ACL.
- Collins, M. (1999). Head-driven statistical models for natural language parsing. PhD thesis, University of Pennsylvania.
- Cortes, C., Mohri, M., Riley, M., & Rostamizadeh, A. (2008). Sample selection bias correction theory. In: Proceedings of the 19th annual conference on algorithmic learning theory.
- Crammer, K., Kearns, M., & Wortman, J. (2008). Learning from multiple sources. Journal of Machine Learning Research, 9, 1757–1774.
- Dai, W., Yang, Q., Xue, G., & Yu, Y. (2007). Boosting for transfer learning. In: Proceedings of the international conference on machine learning.
- Das, S., & Chen, M. (2001). Yahoo! for Amazon: extracting market sentiment from stock message boards. In: Proceedings of the Asia pacific finance association annual conference.
- Daumé, H. (2007). Frustratingly easy domain adaptation. In: Association for computational linguistics (ACL).
- Finkel, J. R., & Manning, C. D. (2009). Hierarchical Bayesian domain adaptation. In: Proceedings of the North American association for computational linguistics.
- Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–161.
- Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Schoelkopf, B. (2007). Correcting sample selection bias by unlabeled data. In: Advances in neural information processing systems.
- Jiang, J., & Zhai, C. (2007). Instance weighting for domain adaptation. In: Proceedings of the association for computational linguistics.
- Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. In: Very large databases.
- Li, X., & Bilmes, J. (2007). A Bayesian divergence prior for classification adaptation. In: Proceedings of the international conference on artificial intelligence and statistics.
- Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009a). Domain adaptation with multiple sources. In: Advances in neural information processing systems.
- Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009b). Multiple source adaptation and the Rényi divergence. In: Proceedings of the conference on uncertainty in artificial intelligence.
- McAllester, D. (2003). Simplified PAC-Bayesian margin bounds. In: Proceedings of the sixteenth annual conference on learning theory.
- Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of empirical methods in natural language processing.
- Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In: Proceedings of empirical methods in natural language processing.
- Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746.
- Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: determining support or opposition from congressional floor-debate transcripts. In: Proceedings of empirical methods in natural language processing.
- Turney, P. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the association for computational linguistics.
- Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
- Zhang, T. (2004). Solving large-scale linear prediction problems with stochastic gradient descent. In: Proceedings of the international conference on machine learning.