Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble

EMNLP 2020, pp. 3841-3847 (2020)

Abstract

Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable to cases where we can interact with only input and output layers of the models, also known as “black boxes”. We propo...

Introduction
  • Like many NLP tasks, Thai WS is domain-dependent. For instance, Chormai et al. (2019) recorded an accuracy drop from 91% to 81% when their model, trained on a generic-domain corpus (Kosawat et al., 2009), was tested on a social media corpus (bact’ et al., 2019).
  • Word Segmentation (WS) is an essential process for several Natural Language Processing (NLP) tasks. ...ments (Chormai et al., 2019; Chuang, 2019; Ikeda, 2018).
  • The authors call a model that exposes only its input and output layers a black box.
  • Instead of making changes to the existing model directly, the authors build a separate model to improve the accuracy of predictions made by the black box.
Highlights
  • Like many Natural Language Processing (NLP) tasks, Thai Word Segmentation (WS) is domain-dependent
  • Instead of making changes to the existing model directly, we build a separate model to improve the accuracy of predictions made by the black box
  • We focus on classical learning methods that historically provide good results in WS problems, such as Logistic Regression (LR), Support Vector Machine (SVM), and Conditional Random Field (CRF)
  • We proposed a novel solution for adapting a black-box model to a new domain by formulating it as an ensemble learning problem
  • For Thai Word Segmentation, the results showed that our method is an effective domain adaptation method and performs similarly to the transfer learning method
  • The results from Japanese and Chinese Word Segmentation experiments showed that our method could improve the performance of Japanese and Chinese black-box models
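The ensemble formulation in the highlights can be pictured as a small meta-learner stacked on top of the black box's output. The sketch below is a hypothetical illustration and not the authors' implementation: the black box is represented only by the boundary probability it returns, the `after_space` cue and the toy data are invented, and a hand-written logistic regression stands in for the classical learners named above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(black_box_prob, after_space):
    # Bias term, black-box boundary probability, and one hypothetical context cue.
    return [1.0, black_box_prob, 1.0 if after_space else 0.0]

def train_meta(rows, labels, epochs=5000, lr=0.5):
    # Batch gradient descent on the mean logistic loss.
    w = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    # 1 = the character starts a word, 0 = it does not.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0.0 else 0

# Invented target-domain data: (black-box probability, after_space, gold label).
# The black box under-scores word boundaries that follow a space.
train = [
    (0.05, False, 0), (0.10, False, 0), (0.15, False, 0), (0.20, False, 0),
    (0.80, False, 1), (0.90, False, 1), (0.30, True, 1), (0.40, True, 1),
]
rows = [features(p, s) for p, s, _ in train]
labels = [y for _, _, y in train]
w = train_meta(rows, labels)

black_box_alone = 1 if 0.35 >= 0.5 else 0    # thresholding the black box: 0
stacked = predict(w, features(0.35, True))   # meta-learner corrects it: 1
```

The black box is never modified or inspected here; only its output is reused as a feature, which is what makes this kind of adaptation possible when the weights are inaccessible.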
Methods
  • Two state-of-the-art models for Thai WS were chosen as the competing methods: DeepCut (Kittinaradorn et al., 2019) and AttaCut-SC (Chormai et al., 2019).
  • Both are deep learning models based on the Convolutional Neural Network (CNN).
  • The authors note that the DeepCut developers provide weights trained on the BEST corpus.
  • The authors compared their method with a model pre-trained on BEST-2010 and transferred to the target task.
Results
  • Evaluation on Chinese and Japanese
  • Chinese Word Segmentation (CWS): the authors used an existing CWS model, PyWordSeg (Chuang, 2019), with character-level ELMo embeddings.
  • Japanese Word Segmentation (JWS): the authors used Nagisa (Ikeda, 2018), trained on the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al.).
  • This model categorizes characters into four classes: (i) beginning (B) (ii) middle (M) (iii) ending (E), and (iv) single-word (S) (Kitagawa and Komachi, 2018)
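The four-class scheme above can be made concrete with a short conversion between a segmented sentence and its per-character tags. This is a minimal sketch of the general B/M/E/S convention, not code from Nagisa; the function names and the example sentence are illustrative assumptions.

```python
def words_to_bmes(words):
    """Tag every character: B(eginning), M(iddle), E(nding), S(ingle-character word)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def bmes_to_words(text, tags):
    """Rebuild the word list by closing a word at every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words

words = ["東京", "に", "行きました"]
tags = words_to_bmes(words)
# tags == ["B", "E", "S", "B", "M", "M", "M", "E"]
restored = bmes_to_words("".join(words), tags)
```

Framing segmentation as per-character tagging is what lets a sequence labeller such as a CRF or a neural tagger be applied to the task.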
Conclusion
  • The authors proposed a novel solution for adapting a black-box model to a new domain by formulating it as an ensemble learning problem.
  • The authors conducted extensive experimental studies using nine benchmark corpora from three languages.
  • For Thai Word Segmentation, the results showed that the method is an effective domain adaptation method and performs similarly to the transfer learning method.
  • The results from Japanese and Chinese Word Segmentation experiments showed that the method could improve the performance of Japanese and Chinese black-box models.
Tables
  • Table 1: Parameter settings in DeepCut and AttaCut
  • Table 2: Summary of WS corpora (# Training [# testing]), TH = Thai, CN = Chinese, and JP = Japanese
  • Table 3: Performance comparison on WS160
  • Table 4: Performance comparison on TNHC
  • Table 5: Comparison between our method on CWS
  • Table 6: Comparison between our method on JWS
  • Table 7: Performance and Efficiency (Wisesight, TNHC): Effect of top-k
  • Table 8: Parameter settings in CRF
  • Table 9: Effect of Feature Types
  • Table 10: Performance comparison on BEST-2010
Key findings
  • Instead of making changes to the existing model directly, we build a separate model to improve the accuracy of predictions made by the black box
  • Experimental results showed that our proposed solution achieved accuracy comparable to that of transfer learning solutions for Thai
  • For Chinese and Japanese, we showed that model adaptation using the SEFR technique could improve the performance of black-box models when used in a cross-domain setting
  • Thai WS performance is typically evaluated using F1 scores at the character level
  • To avoid overestimating WS performance, we also evaluated the F1 scores at the word level
  • For the TNHC corpus, SE+DeepCut outperformed DeepCut by 1.7% in character-level F1 (char F1) and by 0.3% in word-level F1 (word F1)
  • Our method improves character-level F1 by 3% on GSD, 4.7% on Modern, and 11.5% on PUD
  • The results from Japanese and Chinese Word Segmentation experiments showed that our method could improve the performance of Japanese and Chinese black-box models
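The gap between character-level and word-level F1 mentioned above is easy to see on a toy example. The sketch below assumes one common pair of definitions: character-level F1 scores the "starts a word" label of each character, and word-level F1 counts a word as correct only when both of its boundaries match. These definitions and all function names are assumptions for illustration, not taken from the paper's evaluation scripts.

```python
def boundaries(words):
    """Per-character labels: 1 if the character starts a word, else 0."""
    labels = []
    for w in words:
        labels.extend([1] + [0] * (len(w) - 1))
    return labels

def spans(words):
    """Set of (start, end) character offsets, one per word."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out

def f1(true_pos, n_pred, n_gold):
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)

def char_f1(gold, pred):
    g, p = boundaries(gold), boundaries(pred)
    true_pos = sum(1 for a, b in zip(g, p) if a == 1 and b == 1)
    return f1(true_pos, sum(p), sum(g))

def word_f1(gold, pred):
    g, p = spans(gold), spans(pred)
    return f1(len(g & p), len(p), len(g))

gold = ["ab", "cd", "e"]
pred = ["ab", "c", "de"]
c = char_f1(gold, pred)   # 2 of 3 predicted word starts are correct: 2/3
w_ = word_f1(gold, pred)  # only "ab" matches on both boundaries: 1/3
```

A single misplaced boundary damages two words at once, so the word-level score drops faster than the character-level one. That is why reporting both guards against overestimation.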
Subjects and analysis
papers: 8
... Theeramunkong) on Thai WS for the past ten years. On the other hand, there are at least eight papers from well-established conferences on Chinese and Japanese WS (Li et al., 2019; Aguirre and Aguiar, ...
... the predictions from Domain-Specific with the remaining from Domain-Generic to form the final predictive results. We conducted extensive experimental studies to ...
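The text above mentions combining the predictions from Domain-Specific with the remaining ones from Domain-Generic. A minimal sketch of such a combination follows. The selection rule is an assumption: the `top_k` characters whose black-box probability is closest to 0.5 are treated as uncertain and handed to the domain-specific model, while all other characters keep the domain-generic prediction. `filter_and_refine`, `specific_predict`, and the sample numbers are hypothetical.

```python
def filter_and_refine(generic_probs, specific_predict, top_k):
    """Merge black-box (domain-generic) and domain-specific boundary predictions.

    generic_probs    -- black-box probability that each character starts a word
    specific_predict -- callable: character index -> 0/1 from the domain-specific model
    top_k            -- number of most uncertain characters to re-predict
    """
    # Filter: rank characters by how close the black-box probability is to 0.5.
    ranked = sorted(range(len(generic_probs)),
                    key=lambda i: abs(generic_probs[i] - 0.5))
    uncertain = set(ranked[:top_k])

    # Refine: re-predict only the uncertain characters, keep the rest as-is.
    final = []
    for i, p in enumerate(generic_probs):
        if i in uncertain:
            final.append(specific_predict(i))
        else:
            final.append(1 if p >= 0.5 else 0)
    return final

generic_probs = [0.98, 0.45, 0.03, 0.55, 0.91]
specific_labels = [1, 1, 0, 0, 1]  # stand-in output of a domain-specific model
generic_only = [1 if p >= 0.5 else 0 for p in generic_probs]
merged = filter_and_refine(generic_probs, lambda i: specific_labels[i], top_k=2)
# generic_only == [1, 0, 0, 1, 1]; merged == [1, 1, 0, 0, 1]
```

Raising `top_k` sends more characters to the second model, which presumably trades speed for accuracy.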

References
  • Stalin Aguirre and Josafa Aguiar. 2019. A Japanese Word Segmentation Proposal. In ACL.
  • Masayuki Asahara, Hiroshi Kanayama, Takaaki Tanaka, Yusuke Miyao, Sumire Uematsu, Shinsuke Mori, Yuji Matsumoto, Mai Omura, and Yugo Murawaki. 2018. Universal Dependencies Version 2 for Japanese. In LREC.
  • bact’, Pattarawat Chormai, Charin, and ekapolc. 2019. Pythainlp/wisesight-sentiment: First release.
  • Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. 2017. Fast and Accurate Neural Word Segmentation for Chinese. In ACL.
  • Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial Multi-Criteria Learning for Chinese Word Segmentation. In ACL.
  • Pattarawat Chormai, Ponrawee Prasertsom, and Attapol Rutherford. 2019. AttaCut: A Fast and Accurate Neural Thai Word Segmenter. CoRR, abs/1911.07056.
  • Yung-Sung Chuang. 2019. Robust Chinese Word Segmentation with Contextualized Word Representations. CoRR, abs/1901.05816.
  • Chen Gong, Zhenghua Li, Min Zhang, and Xinzhou Jiang. 2017. Multi-Grained Chinese Word Segmentation. In EMNLP.
  • Taishi Ikeda. 2018. nagisa: A Japanese tokenizer based on recurrent neural networks. https://github.com/taishi-i/nagisa.
  • Yoshiaki Kitagawa and Mamoru Komachi. 2018. Long Short-Term Memory for Japanese Word Segmentation. In PACLIC.
  • Rakpong Kittinaradorn, Titipat Achakulvisut, Korakot Chaovavanich, Kittinan Srithaworn, Pattarawat Chormai, Chanwit Kaewkasi, Tulakan Ruangrong, and Krichkorn Oparad. 2019. DeepCut: A Thai word tokenization library using Deep Neural Network.
  • K. Kosawat, M. Boriboon, P. Chootrakool, A. Chotimongkol, S. Klaithin, S. Kongyoung, K. Kriengket, S. Phaholphinyo, S. Purodakananda, T. Thanakulwarapas, and C. Wutiwiwatchai. 2009. BEST 2009: Thai Word Segmentation software contest. In SNLP.
  • Xiaoya Li, Yuxian Meng, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li. 2019. Is Word Segmentation Necessary for Deep Learning of Chinese Representations? In ACL.
  • Ji Ma, Kuzman Ganchev, and David Weiss. 2018. State-of-the-art Chinese Word Segmentation with Bi-LSTMs. In EMNLP.
  • Rungsiman Nararatwong, Natthawut Kertkeidkachorn, Nagul Cooharojananone, and Hitoshi Okada. 2018. Improving Thai Word and Sentence Segmentation Using Linguistic Knowledge. IEICE Trans. Inf. Syst.
  • Hao Zhou, Zhenting Yu, Yue Zhang, Shujian Huang, Xin-Yu Dai, and Jiajun Chen. 2017. Word-Context Character Embeddings for Chinese Word Segmentation. In EMNLP.
Authors
Peerat Limkonchotiwat
Wannaphong Phatthiyaphaibun
Raheem Sarwar
Ekapol Chuangsuwanich