A Comprehensive Dataset of Spelling Errors and Users' Corrections in Croatian Language.

Gordan Gledec, Marko Horvat, Miljenko Mikuc, Bruno Blaskovic

Data (2023)

Abstract

This paper presents a unique and extensive dataset of over 33 million entries, each a pair of the form "spelling error -> correction", collected since 2008 from ispravi.me, the most popular Croatian online spellchecking service. Compiled from the contributions of nearly 900,000 users, the dataset is a valuable resource for researchers and developers working on natural language processing (NLP), spellchecking accuracy, and language learning applications. It may be used to accomplish several goals: (1) improving spellchecking accuracy by incorporating common user corrections and reducing false positives and negatives; (2) helping language learners identify common errors and learn correct spelling through targeted feedback; (3) analyzing data trends and patterns to uncover the most common spelling errors and their underlying causes; (4) identifying and evaluating factors that influence typing input; (5) improving NLP applications such as text recognition and machine translation. Tasks specific to the Croatian language include the creation of a letter-level confusion matrix and the refinement of word suggestions based on historical usage of the service. This comprehensive dataset provides researchers and practitioners with a wealth of information, opening the path for advancements in spellchecking, language learning, and NLP applications for the Croatian language.

Dataset: https://github.com/Ispravi-Me/Dataset-of-Misspelings-and-Corrections

Dataset License: CC BY-NC-SA 4.0
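As an illustration of the letter-level confusion matrix mentioned in the abstract, the following is a minimal Python sketch, not the authors' method. It assumes the "spelling error -> correction" pairs have already been loaded into memory as (error, correction) tuples; the abstract does not describe the dataset's file format, so loading is left out. The function name `substitution_counts` and the sample pairs are hypothetical. Characters are aligned with the standard library's `difflib.SequenceMatcher`, and only same-length replaced spans are counted, which is a simplification that ignores insertions, deletions, and transpositions.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(pairs):
    """Count (typed_letter, intended_letter) substitutions over (error, correction) pairs.

    Hypothetical helper: only 'replace' spans of equal length are counted,
    so insertions, deletions, and transpositions are deliberately ignored.
    """
    counts = Counter()
    for error, correction in pairs:
        matcher = SequenceMatcher(None, error, correction, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                # Pair up the letters position by position within the span.
                for typed, intended in zip(error[i1:i2], correction[j1:j2]):
                    counts[(typed, intended)] += 1
    return counts


# Illustrative pairs only (missing diacritics); not taken from the dataset.
example_pairs = [("cesce", "češće"), ("zivot", "život")]

for (typed, intended), n in substitution_counts(example_pairs).most_common():
    print(f"{typed} -> {intended}: {n}")
```

The resulting counter can be read as a sparse confusion matrix: each key is a (typed, intended) cell and each value is how often that substitution was observed. Normalizing each row by the total count for the typed letter would give estimated substitution probabilities.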

Keywords

spelling errors, corrections, language, comprehensive dataset
