DLRGeoTweet: A comprehensive social media geocoding corpus featuring fine-grained places

Information Processing & Management (2024)

Abstract
Every day, many short text messages on social media are generated in response to real-world events, providing a valuable resource for domains such as emergency response and traffic management. Since users rarely attach exact coordinates to social media posts, accurately recognizing and resolving fine-grained place names, such as home addresses and Points of Interest, from these posts is crucial for understanding the precise locations of critical events, such as rescue requests. This task, known as geoparsing, involves toponym recognition and toponym resolution (also called geocoding). However, existing social media datasets for evaluating geoparsing approaches often lack sufficient fine-grained place names that have associated geo-coordinates or are linked to gazetteers, which makes it difficult to evaluate, compare, and train geocoding methods for such locations. Moreover, the absence of supportive annotation tools compounds this challenge. To address these gaps, we implemented a lightweight Python tool leveraging Nominatim. Using this tool, we annotated a comprehensive X (formerly Twitter) geocoding corpus called DLRGeoTweet. The corpus underwent a rigorous cross-validation process to guarantee its quality. This corpus includes a total of 7,364 tweets and 12,510 places, of which 6,012 are fine-grained. It comprises two global datasets encompassing worldwide events and three local datasets related to local events such as the 2017 Hurricane Harvey. The annotation process spanned more than ten months and required approximately 1,000 person-hours to complete. We then evaluate 15 recent and representative geocoding approaches, including many deep learning-based ones, on DLRGeoTweet. The results highlight the inherent challenges in resolving fine-grained places accurately. Despite increasing constraints on access to Twitter data, our corpus’s focus on short, informal text makes it a valuable resource for geocoding across multiple social media platforms.
Keywords
Annotated Twitter corpus, Geoparsing, Geocoding, Toponym resolution, Toponym disambiguation, Fine-grained places
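
The abstract describes evaluating geocoding approaches against annotated coordinates but does not state which metrics are used. As an illustration only, the sketch below shows one common way such a comparison can be scored: the great-circle (haversine) distance between a predicted and an annotated coordinate, and the share of predictions that fall within a distance threshold. The function names, the `(lat, lon)` tuple layout, and the threshold are assumptions made for this example, not details taken from the paper or from the DLRGeoTweet tooling.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of the spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_within(predicted, gold, threshold_km):
    """Fraction of predictions whose distance error is at most threshold_km.

    predicted and gold are equal-length lists of (lat, lon) tuples
    (a hypothetical layout chosen for this sketch).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        1
        for (plat, plon), (glat, glon) in zip(predicted, gold)
        if haversine_km(plat, plon, glat, glon) <= threshold_km
    )
    return hits / len(gold)
```

For fine-grained places such as addresses and Points of Interest, a small threshold (for example, a few hundred metres expressed in kilometres) would presumably be more informative than the coarse thresholds often used for city-level toponyms; the right value depends on the evaluation protocol.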