What's in a Domain? Learning Domain-Robust Text Representations using Adversarial Training.

NAACL-HLT, (2018): 474-479


Abstract

Most real-world language problems require learning from heterogeneous corpora, raising the problem of learning robust models which generalise well to both similar (in-domain) and dissimilar (out-of-domain) instances to those seen in training. This requires learning an underlying task, while not learning irrelevant signals and biases specif…

Introduction
  • Heterogeneity is pervasive in NLP, arising from corpora being constructed from different sources, featuring different topics, register, writing style, etc.
  • To illustrate, Bitvai and Cohn (2015) report learning formatting quirks of specific reviewers in a review text regression task, which are unlikely to prove useful on other texts
  • This classic problem in NLP has been tackled under the guise of “domain adaptation”, a form of unsupervised transfer learning, using feature-based methods to support knowledge transfer over multiple domains (Blitzer et al., 2007; Daume III, 2007; Joshi et al., 2012; Williams, 2013; Kim et al., 2016).
  • Ganin and Lempitsky (2015) proposed a method to encourage domain-general text representations, which transfer better to new domains
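The core of Ganin and Lempitsky's (2015) approach is a gradient reversal layer: the forward pass is the identity, while the backward pass negates (and scales) the gradient flowing from a domain discriminator, so the shared feature extractor is trained to confuse the discriminator and thereby learn domain-general representations. A minimal sketch of that mechanism (the `lam` scaling factor and the list-based gradients are illustrative, not the paper's configuration):

```python
def grad_reverse_forward(x):
    # Forward pass: identity -- features pass through unchanged
    return x

def grad_reverse_backward(grad_from_discriminator, lam=1.0):
    # Backward pass: negate and scale the gradient coming from the
    # domain discriminator, so the upstream feature extractor is
    # pushed to *confuse* the discriminator (domain-general features)
    return [-lam * g for g in grad_from_discriminator]

# Features are untouched going forward; gradients are flipped going back
features = [0.5, -2.0]
assert grad_reverse_forward(features) == features
reversed_grad = grad_reverse_backward([0.4, -1.0], lam=0.1)
```

In an autodiff framework this pair of functions would be packaged as a single custom operator placed between the shared encoder and the domain discriminator.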
Highlights
  • Heterogeneity is pervasive in NLP, arising from corpora being constructed from different sources, featuring different topics, register, writing style, etc
  • We primarily evaluate on the task of language identification (“LangID”: Cavnar and Trenkle (1994)), using the corpora of Lui and Baldwin (2012), which combine large training sets over a diverse range of text domains
  • The raw Conditional Model (COND) and GEN perform better than the baseline
  • As discussed earlier, multi-domain data can introduce noise to the shared representation, causing the performance to drop over TCL, Wikipedia2 and EMEA. This observation demonstrates the necessity of applying adversarial learning to COND
  • We have proposed a novel deep learning method for multi-domain learning, based on joint learning of domain-specific and domain-general components, using either domain conditioning or domain generation
  • Based on our evaluation over multi-domain language identification and multi-domain sentiment analysis, we show our models to substantially outperform a baseline deep learning method, and set a new benchmark for state-of-the-art cross-domain LangID
Methods
  • 3.1 Language Identification: To evaluate the approach, the authors first consider the language identification task.

    Data: The authors follow the settings of Lui and Baldwin (2012), involving 5 training sets from 5 different domains with 97 languages in total: Debian, JRC-Acquis, Wikipedia, ClueWeb and RCV2, derived from Lui and Baldwin (2011). The authors evaluate accuracy on seven holdout benchmarks: EuroGov, TCL, Wikipedia2 (all from Baldwin and Lui (2010)), EMEA (Tiedemann, 2009), EuroPARL (Koehn, 2005), T-BE (Tromp and Pechenizkiy, 2011), and T-SC (Carter et al., 2013).

    Documents are tokenized as a byte sequence (consistent with Lui and Baldwin (2012)), and truncated or padded to a length of 1k bytes.

    Hyper-parameters: The authors performed a grid search over the hyper-parameters, and selected the settings that optimise accuracy over heldout data from each of the training domains.
  • All the models are optimized using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10⁻⁴
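The byte-level preprocessing described above can be sketched as follows. The paper specifies only byte tokenization with truncation or padding to 1k bytes; the padding value of 0 and the function name are assumptions for illustration:

```python
def to_byte_sequence(text, max_len=1000, pad=0):
    """Tokenize a document as a UTF-8 byte sequence, then truncate
    or pad to a fixed length, per the preprocessing described above."""
    byte_ids = list(text.encode("utf-8"))[:max_len]  # truncate long docs
    byte_ids += [pad] * (max_len - len(byte_ids))    # pad short docs
    return byte_ids

seq = to_byte_sequence("Bonjour le monde")
assert len(seq) == 1000
assert seq[:7] == [66, 111, 110, 106, 111, 117, 114]  # bytes of "Bonjour"
```

Working over raw bytes rather than characters keeps the vocabulary fixed at 256 symbols regardless of language or encoding quirks, which suits a 97-language task.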
Results
  • Results and Analysis

    Baseline and comparisons: For comparison, the authors implement a CNN baseline which is trained using all the data without domain knowledge.
  • For COND, the authors observed performance gains on EuroPARL, T-BE and T-SC
  • These three datasets are notable in containing shorter documents, which benefit the most from shared learning.
  • As discussed earlier, multi-domain data can introduce noise to the shared representation, causing the performance to drop over TCL, Wikipedia2 and EMEA.
  • This observation demonstrates the necessity of applying adversarial learning to COND.
  • It is a different story for GEN: vanilla GEN achieves accuracy gains relative to the baseline over 5 domains, but is slightly below COND for 4 domains, a result of parameter-sharing over the private representation
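The CNN baseline against which COND and GEN are compared is, per the paper's citations, a Kim (2014)-style convolutional text classifier over the byte sequence: embed each byte, convolve, apply ReLU and max-over-time pooling, then score each of the 97 languages. A toy sketch of that architecture (all sizes and weights below are random illustrative values, not the authors' hyper-parameters):

```python
import random

random.seed(0)

# Toy sizes for illustration only -- not the paper's configuration
VOCAB, EMB, WIDTH, FILTERS, N_LANGS = 256, 4, 3, 5, 97

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

E = rand_matrix(VOCAB, EMB)                            # byte embeddings
W = [rand_matrix(WIDTH, EMB) for _ in range(FILTERS)]  # conv filters
V = rand_matrix(FILTERS, N_LANGS)                      # output weights

def cnn_scores(byte_seq):
    """Convolution + ReLU + max-over-time pooling + linear output."""
    x = [E[b] for b in byte_seq]                       # (len, EMB)
    pooled = []
    for f in W:
        best = 0.0                                     # ReLU floor
        for i in range(len(x) - WIDTH + 1):
            act = sum(f[j][k] * x[i + j][k]
                      for j in range(WIDTH) for k in range(EMB))
            best = max(best, act)
        pooled.append(best)                            # max over positions
    return [sum(pooled[f] * V[f][l] for f in range(FILTERS))
            for l in range(N_LANGS)]                   # per-language scores

scores = cnn_scores(list("hello world".encode("utf-8")))
assert len(scores) == N_LANGS
```

Max-over-time pooling makes the score independent of where in the document a discriminative byte n-gram occurs, which is why short documents (as in T-BE and T-SC) still produce a full feature vector.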
Conclusion
  • The authors have proposed a novel deep learning method for multi-domain learning, based on joint learning of domain-specific and domain-general components, using either domain conditioning or domain generation.
  • Based on the evaluation over multi-domain language identification and multi-domain sentiment analysis, the authors show the models to substantially outperform a baseline deep learning method, and set a new benchmark for state-of-the-art cross-domain LangID.
  • The authors' approach has potential to benefit other NLP applications involving multi-domain data
Tables
  • Table1: Accuracy [%] of the different models over the seven heldout datasets, and the macro-averaged accuracy out-of-domain over the 7 test domains (“ALLout”). The best result for each dataset is indicated in bold. Key: +d = domain adversarial, +g = domain generation component
  • Table2: Accuracy [%] of different models over five in-domain datasets using cross-validation evaluation and macro-averaged accuracy (“ALLin”)
  • Table3: Accuracy [%] of different models over 4 domains (B, D, E and K) under out-of-domain evaluations on the Multi-Domain Sentiment Dataset. Key: ♣ from Blitzer et al. (2007); ♦ from Ganin and Lempitsky (2015)
Funding
  • This work was supported by the Australian Research Council (FT130101105)

Reference
  • Timothy Baldwin and Marco Lui. 2010. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, pages 229–237.
  • Zsolt Bitvai and Trevor Cohn. 2015. Non-linear text regression with a deep convolutional neural network. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers).
  • John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447.
  • Simon Carter, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation 47(1):195–215.
  • William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval.
  • Hal Daume III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 256–263.
  • Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning 2015, pages 1180–1189.
  • Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17:59:1–59:35.
  • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680.
  • Mahesh Joshi, Mark Dredze, William W. Cohen, and Carolyn Penstein Rose. 2012. Multi-domain learning: When do domains matter? In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1302–1312.
  • David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. 2017. Incorporating dialectal variability for socially equitable language identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 51–57.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.
  • Young-Bum Kim, Karl Stratos, and Ruhi Sarikaya. 2016. Frustratingly easy neural domain adaptation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 387–396.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
  • Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pages 79–86.
  • Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Fifth International Joint Conference on Natural Language Processing, pages 553–561.
  • Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of ACL 2012 System Demonstrations, pages 25–30.
  • Marco Lui and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media, pages 17–25.
  • Jorg Tiedemann. 2009. News from OPUS – a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume 5, pages 237–248.
  • Erik Tromp and Mykola Pechenizkiy. 2011. Graph-based n-gram language identification on short texts. In Proceedings of the 20th Machine Learning Conference of Belgium and The Netherlands, pages 27–34.
  • Jason Williams. 2013. Multi-domain learning and generalization in dialog state tracking. In Proceedings of the SIGDIAL 2013 Conference, pages 433–441.