Record Matching in Digital Library Metadata
Using evidence from external sources to create more accurate matching systems

msra (2008)

Abstract
When data stores grow large, data quality, cleaning, and integrity become issues. The commercial sector spends a massive amount of time and energy canonicalizing customer and product records as its lists of products and consumers expand. An Accenture study in 2006 found that a high-tech equipment manufacturer saved $6 million per year by removing redundant customer records used in customer mailings. In 2000, the U.K. Ministry of Defence embarked on the massive "The Cleansing Project," solving key problems with its inventory and logistics and saving over $25 million over four years.

In digital libraries, such problems manifest most urgently not in the customer, product, or item records, but in the metadata that describes the library's holdings. Several well-known citation lists of computer science research contain over 50% duplicate citations, although none of these duplicates are exact string matches [2]. Without metadata cleaning, libraries might end up listing multiple records for the same item, causing circulation problems and skewing the distribution of their holdings. In addition, when different authors share the same name (for example, Wei Wang, J. Brown), author disambiguation must be performed to correctly link authors to their respective monographs and articles, and not to others. Metadata inconsistencies can be due to varying ordering of fields, different delimiters, omitted fields, multiple representations of the names of people and organizations, and typographical errors.

When libraries import large volumes of metadata from sources that follow a metadata standard, a manually compiled set of rules called a crosswalk may be used to transform the metadata into the library's own format. However, such crosswalks are expensive to create manually, and public ones exist only for a few well-used formats. Crucially, they also do not address how to detect and remove inexact duplicates. As digital libraries mine and incorporate data from a wider variety of sources, especially noisy sources such as the Web, finding a suitable and scalable matching solution becomes critical. Here, we examine this problem and its solutions.

The de-duplication task takes a list of metadata records as input and returns the list with duplicate records removed. For example, the search results shown in the figure here are identical and should have been combined into a single entry. It should be noted that many disciplines of computer science have instances of similar inexact matching problems.
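To make the de-duplication task concrete, the sketch below removes near-duplicate citation records by comparing token sets drawn from normalized title and author fields with Jaccard similarity. This is an illustrative baseline only, not the matching approach the article advocates (which draws on evidence from external sources); the field names, the normalization, the 0.8 threshold, and the sample citations are all assumptions made for this example, and the naive pairwise comparison scales quadratically with the number of records.

```python
import re

# Minimal de-duplication sketch (illustrative only, not the article's method).
# Records are plain dicts; the field names, the normalization, and the 0.8
# similarity threshold are assumptions made for this example.

def normalize(record: dict) -> set:
    """Lower-case the title and author fields and reduce them to word tokens."""
    text = " ".join(record.get(field, "") for field in ("title", "authors"))
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def deduplicate(records: list, threshold: float = 0.8) -> list:
    """Keep the first occurrence of each group of near-duplicate records."""
    kept, kept_tokens = [], []
    for record in records:
        tokens = normalize(record)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(record)
            kept_tokens.append(tokens)
    return kept

if __name__ == "__main__":
    # Two formatting variants of the same (hypothetical) citation: neither is
    # an exact string match of the other, yet they describe one item.
    citations = [
        {"title": "Record Matching in Digital Library Metadata",
         "authors": "A. Author and B. Author"},
        {"title": "Record matching in digital library metadata.",
         "authors": "Author, A.; Author, B."},
    ]
    print(len(deduplicate(citations)))  # prints 1: the two variants are merged
```

Because the two sample records differ in capitalization, punctuation, and author-name formatting but share nearly all of their word tokens, they exceed the similarity threshold and collapse into a single entry, mirroring the inexact-duplicate behavior described above.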