SCTB-V2: the 2nd version of the Chinese treebank in the scientific domain

Chenhui Chu,Zhuoyuan Mao,Toshiaki Nakazawa,Daisuke Kawahara,Sadao Kurohashi

LANGUAGE RESOURCES AND EVALUATION（2022）

引用 0|浏览2

暂无评分

摘要

Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental Chinese analysis tasks for Chinese language processing, which are also crucial for various downstream tasks such as machine translation and information extraction. To achieve high accuracy for these tasks, treebanks that contain sentences manually annotated with word segmentation, part-of-speech tags, and phrase structures are essential. Although there are large-scale Chinese treebanks in the news domain, such treebanks are unavailable in the scientific domain. This significantly limits the performance of Chinese language processing for scientific text. To address this problem, we annotate the 2nd version of the Chinese treebank in the scientific domain (SCTB-V2). SCTB-V2 contains 12,175 sentences annotated with word segmentation, part-of-speech tags, and phrase structures. We conducted Chinese analyses and machine translation experiments on SCTB-V2. The results show the effectiveness of SCTB-V2. We release this treebank to promote scientific Chinese language processing research http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20 in%20Scientific%20Domain%20%28SCTB%29 .

查看译文

关键词

Treebank, Chinese, Scientific domain

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要