Extracting Grammatical Error Corrections From Wikipedia Revision History

Jhih-Jie Chen, Yi-Dong Wu, Yu-Chuan Tai, Ching-Yu Yang, Hai-Lun Tu, Jason S. Chang

2019 IEEE International Conference on Big Data (Big Data), 2019


Abstract

This paper describes the process of extracting and filtering Wikipedia revision history as a resource for grammatical error correction (GEC). Edits in Wikipedia revision history vary widely, including grammatical error corrections, information supplements, format amendments, and even vandalism. To extract only GEC-related revisions, we use an automated error annotation toolkit, ERRANT, and extend it to process large data efficiently in parallel. With error-type analysis, we can then identify GEC-related edits and omit unrelated edits (i.e., only the correction parts are retained). The resulting corpus is, to our knowledge, the largest publicly available corpus of parallel possibly erroneous and correct sentences with error type labels.
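As a rough illustration of the extract-then-filter idea, the sketch below aligns each original/revised sentence pair at the token level, keeps only short edits, and aggregates the kept edits with a map step and a reduce step. It is not the authors' pipeline: Python's `difflib` stands in for ERRANT, the length threshold `MAX_EDIT_TOKENS` is a hypothetical proxy for the paper's error-type analysis, and the built-in `map` and `functools.reduce` stand in for a distributed MapReduce framework. All function names are assumptions made for this example.

```python
import difflib
from collections import Counter
from functools import reduce

# Hypothetical threshold: edits touching more tokens than this are treated
# as content changes (e.g. added information) rather than corrections.
MAX_EDIT_TOKENS = 2


def token_edits(orig, corr):
    """Return (op, orig_tokens, corr_tokens) for each differing token span."""
    o, c = orig.split(), corr.split()
    matcher = difflib.SequenceMatcher(a=o, b=c, autojunk=False)
    return [
        (tag, o[i1:i2], c[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def looks_like_gec(edit):
    """Crude stand-in for error-type filtering: keep only short edits."""
    _, o_toks, c_toks = edit
    return len(o_toks) <= MAX_EDIT_TOKENS and len(c_toks) <= MAX_EDIT_TOKENS


def map_pair(pair):
    """Map step: count the kept edit operations for one sentence pair."""
    orig, corr = pair
    kept = [e for e in token_edits(orig, corr) if looks_like_gec(e)]
    return Counter(tag for tag, _, _ in kept)


def reduce_counts(a, b):
    """Reduce step: merge per-pair counts."""
    return a + b


pairs = [
    # A short agreement fix: kept.
    ("He go to school every day .", "He goes to school every day ."),
    # A long information supplement: dropped by the length filter.
    ("The city is large .",
     "The city is large and was founded in 1850 by settlers ."),
]

totals = reduce(reduce_counts, map(map_pair, pairs), Counter())
print(totals)
```

Because each sentence pair is processed independently in the map step, the same structure could in principle be distributed across workers, which is the property a MapReduce-style extension relies on.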

Keywords

Wikipedia, Grammatical Error Correction, MapReduce
