Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
arXiv (2024)
Abstract
Geoparsing is the task of estimating the latitude and longitude (coordinates)
of location expressions in texts. Geoparsing must handle ambiguous
expressions, where the same notation can refer to multiple locations. For
evaluating geoparsing systems, several corpora have been proposed in previous
work. However, these corpora are small and offer limited coverage of
location expressions in general domains. In this paper, we propose Wikipedia
Hyperlink-based Location Linking (WHLL), a novel method to construct a
large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages
hyperlinks in Wikipedia to annotate multiple location expressions with
coordinates. With this method, we constructed the WHLL corpus, a new
large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles,
each containing about 7.8 unique location expressions. 45.6% of location
expressions are ambiguous and refer to more than one location with the same
notation. In each article, the location expressions for the article title and for
hyperlinks to other articles are assigned coordinates. By utilizing
hyperlinks, we can accurately assign coordinates to location expressions, even
ambiguous ones, in the texts. Experimental results show
that there remains room for improvement by disambiguating location expressions.
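
The hyperlink-based annotation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `COORDS`, `LINK_RE`, `annotate`, and `ambiguous_surfaces` are hypothetical, the coordinate values are rough illustrative figures, and the sketch assumes article text in wikitext form with `[[Target|surface]]` links plus a lookup from article titles to coordinates.

```python
import re
from collections import defaultdict

# Hypothetical lookup from article title to (latitude, longitude).
# The values are approximate and only serve the illustration.
COORDS = {
    "Paris": (48.8567, 2.3522),
    "Paris, Texas": (33.6625, -95.5477),
    "France": (47.0, 2.0),
}

# Matches [[Target]] and [[Target|surface text]] wikitext links.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(title, wikitext, coords):
    """Return (surface, target_title, (lat, lon)) tuples for the article
    title and for every hyperlink whose target has known coordinates."""
    annotations = []
    if title in coords:
        annotations.append((title, title, coords[title]))
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the surface form is the target title itself.
        surface = (match.group(2) or target).strip()
        if target in coords:
            annotations.append((surface, target, coords[target]))
    return annotations


def ambiguous_surfaces(annotations):
    """Return surface forms that were linked to more than one target."""
    targets = defaultdict(set)
    for surface, target, _ in annotations:
        targets[surface].add(target)
    return {s for s, linked in targets.items() if len(linked) > 1}
```

Because the link target, not the surface string, selects the coordinates, a mention written as "Paris" but linked to "Paris, Texas" receives the Texas coordinates. Collecting the targets seen per surface form across articles then gives one simple way to flag expressions that refer to more than one location.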