Flexible Techniques to Detect Typical Hidden Errors in Large Longitudinal Datasets

Renato Bruni,Cinzia Daraio,Simone Di Leo

SYMMETRY-BASEL（2024）

引用 0|浏览6

暂无评分

摘要

The increasing availability of longitudinal data (repeated numerical observations of the same units at different times) requires the development of flexible techniques to automatically detect errors in such data. Besides standard types of errors, which can be treated with generic error correction techniques, large longitudinal datasets may present specific problems not easily traceable by the generic techniques. In particular, after applying those generic techniques, time series in the data may contain trends, natural fluctuations and possible surviving errors. To study the data evolution, one main issue is distinguishing those elusive errors from the rest, which should be kept as they are and not flattened or altered. This work responds to this need by identifying some types of elusive errors and by proposing a statistical-mathematical approach to capture their complexity that can be applied after the above generic techniques. The proposed approach is based on a system of indicators and works at the formal level by studying the differences between consecutive values of data series and the symmetries and asymmetries of these differences. It operates regardless of the specific meaning of the data and is thus applicable in a variety of contexts. We implement this approach in a relevant database of European Higher Education institutions (ETER) by analyzing two key variables: "Total academic staff" and "Total number of enrolled students", which are two of the most important variables, often used in empirical analysis as a proxy for size, and are considered by policymakers at the European level. The results are very promising.

查看译文

关键词

big data,information processing,information reconstruction,data quality,longitudinal data sequences

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要