News Headline Corpus Construction and High Frequency Word Extraction

Jianhui Ling, Qihang Zhang, Keke Ling, Baili Zhang

Communications in Computer and Information Science, Cognitive Cities (2020)

Abstract
Using high-frequency words in news headlines to compare the cultural values of China and the United States is a fascinating research topic. However, news websites on the Internet carry a huge volume of information. Without intelligent tools, it is difficult to obtain information that meets specific conditions (such as particular content from a particular time period) and to conduct specialized research. This paper therefore proposes a complete solution. First, it designs a targeted crawler for news headlines. The tool can be configured flexibly according to the user's needs and targets specific content, such as particular headlines or Web content that meets given conditions, offering fast crawling with low computing and network overhead. Then, the crawler is used to quickly collect news headlines from the Xinhuanet (Chinese) and VOA (English) websites for a specific time period, and a news headline corpus is constructed for comparative research. Finally, high-frequency words are extracted from the headline corpus using an improved TF-IDF algorithm, which provides good data preparation for the study of differences in Sino-US cultural value orientation.
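The abstract does not spell out the improved TF-IDF variant, so the following is only a minimal sketch of the standard, smoothed TF-IDF weighting applied to headlines, not the authors' algorithm. The function names (`tokenize`, `top_terms`), the choice to sum each term's TF-IDF score over all headlines, and the sample headlines are assumptions made purely for illustration; a Chinese corpus would additionally need a word segmenter in place of the regular-expression tokenizer.

```python
import math
import re
from collections import Counter


def tokenize(headline):
    # Assumption: lowercase alphabetic tokens are enough for English headlines.
    return re.findall(r"[a-z]+", headline.lower())


def top_terms(headlines, k=3):
    """Rank terms by their TF-IDF score summed over all headlines."""
    docs = [tokenize(h) for h in headlines]
    n_docs = len(docs)

    # Document frequency: in how many headlines each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = Counter()
    for tokens in docs:
        if not tokens:
            continue
        tf = Counter(tokens)
        for term, count in tf.items():
            # Smoothed IDF, so terms that occur in every headline keep a positive weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scores[term] += (count / len(tokens)) * idf
    return scores.most_common(k)


# Invented sample headlines, not taken from the Xinhuanet or VOA corpus.
HEADLINES = [
    "Trade talks resume",
    "Trade deal signed",
    "Storm hits coast",
]

if __name__ == "__main__":
    for term, score in top_terms(HEADLINES, k=3):
        print(f"{term}\t{score:.3f}")
```

Summing the per-headline scores is one simple way to turn a per-document weight into a corpus-level ranking. Because headlines are very short, raw term counts within a single headline are almost always 1, which is presumably one reason an adapted weighting is attractive for this kind of corpus.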
Keywords
corpus, news, extraction