MORPHER: Structural Transformation of Ill-formed Rows

PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023 (2023)

Abstract
Open data portals contain a plethora of data files, with comma-separated value (CSV) files being particularly popular with users and businesses due to their flexible standard. However, this flexibility places a burden on data consumers, as many files contain structural problems, e.g., a different number of cells across data rows, multiple value formats within the same column, or different variants of quoted fields due to user specifications. We refer to rows that contain such structural inconsistencies as ill-formed. Consequently, ingesting such files into a host system, such as a database or an analytics platform, often requires prior data preparation steps. We propose to demonstrate Morpher, a desktop-based system that incorporates our state-of-the-art error detection system, SURAGH [9], and extends it to also clean the files at hand. Morpher facilitates ingesting CSV files by automatically identifying and cleaning ill-formed rows while preserving all data. It comprises three key components: 1) The pattern modeler generates syntax-based patterns for each row of the input file; the system uses these patterns to classify rows into ill-formed and well-formed. 2) The pattern classifier obtains row patterns for ill-formed rows and uses them to distinguish ill-formed but wanted rows from ill-formed and unwanted rows. 3) The pattern wrangler transforms the identified wanted rows into well-formed rows, effectively repairing a wide range of formatting problems.
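To make the idea of syntax-based row patterns more concrete, the following is a minimal sketch of how rows could be abstracted into patterns and then split into well-formed and ill-formed. It is not Morpher's or SURAGH's actual algorithm: the cell classes (`INT`, `FLOAT`, `TEXT`, `EMPTY`), the function names, the sample data, and the rule "the most frequent pattern is the well-formed one" are all assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample file: three regular rows, one footer-like row,
# and one row with an extra cell.
SAMPLE = (
    "1,Alice,3.5\n"
    "2,Bob,4.0\n"
    "Total rows: 2\n"
    "3,Carol,n/a,extra\n"
    "4,Dan,2.25\n"
)


def cell_pattern(cell: str) -> str:
    """Map one cell to a coarse syntactic class (illustrative classes only)."""
    text = cell.strip()
    if text == "":
        return "EMPTY"
    if re.fullmatch(r"[+-]?\d+", text):
        return "INT"
    if re.fullmatch(r"[+-]?\d*\.\d+", text):
        return "FLOAT"
    return "TEXT"


def row_pattern(row):
    """A row pattern is the tuple of its cell classes, so it also encodes the cell count."""
    return tuple(cell_pattern(cell) for cell in row)


def classify_rows(raw: str):
    """Label each row by comparing its pattern to the most frequent pattern.

    Assumption: the dominant pattern represents the well-formed rows.
    """
    rows = list(csv.reader(io.StringIO(raw)))
    patterns = [row_pattern(row) for row in rows]
    dominant, _count = Counter(patterns).most_common(1)[0]
    return [
        (row, "well-formed" if pattern == dominant else "ill-formed")
        for row, pattern in zip(rows, patterns)
    ]


if __name__ == "__main__":
    for row, label in classify_rows(SAMPLE):
        print(label, row)
```

In this sketch, the footer-like row and the row with an extra cell deviate from the dominant pattern and are flagged as ill-formed. Deciding which of the flagged rows are wanted, and repairing them, corresponds to the classifier and wrangler components and is not covered here.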
Keywords
data preparation, data representation, file structure transformation