Semantic-Web Access to Patent Annotations.

SWAT4LS(2015)

Abstract
SureChEMBL (https://www.surechembl.org) is a patent chemistry resource, originally a commercial product developed by SureChem/Digital Science, and recently made freely available at EMBL-EBI [1]. SureChEMBL uses a live and fully automated cloud-based pipeline that combines text-mining and chemistry tools to extract compounds named or depicted in patent documents and make them readily structure searchable by users. Over 50,000 new patent documents and 80,000 new compounds are entered into the system per month and new chemical annotations are usually available in the SureChEMBL interface within 1-7 days of the patent being released by the patent office. While the current SureChEMBL system addresses several chemistry use-cases, such as the identification of novel scaffolds and chemistry, there is an enormous amount of additional knowledge captured within the patent corpus. Much of this information will never be published elsewhere and may be of great value to the drug-discovery and broader life-science community. The Open PHACTS Discovery Platform is a semantic-web data integration platform, developed for the purpose of providing both the pharmaceutical industry and academic researchers with open access to interoperable drug discovery information [2, 3]. The platform currently includes data from a wide variety of public databases and provides API access to the integrated information. However, the further addition of biological and chemical patent information to the platform was considered to be of great potential utility. We have therefore developed a pipeline to identify and annotate additional entities (namely genes and diseases) within the SureChEMBL patent corpus using the Termite text-mining tool (https://scibite.com/content/termite.html).
Since patent documents are often designed to obfuscate the key subject matter, it was essential to also develop an algorithm to assess the relevance of each gene or disease within a particular patent document, allowing users to restrict results to only highly relevant entities if they wish. An RDF model has been developed to capture the relationships between patent documents and annotated compounds, genes and diseases, and annotations for more than 6 million life-science patents have been made available in this format via the Open PHACTS platform (https://dev.openphacts.org/). A series of API calls has been developed to allow users of the platform to query the data and to integrate it with the extensive range of other data resources included in the platform (e.g., protein, pathway, bioactivity and disease information). In addition, KNIME and Pipeline Pilot nodes have been created to facilitate the construction of workflows using patent data, for example, identifying all of the compounds from patents that mention a particular target or disease with high relevance. This represents the first large-scale, semantically annotated life-science patent knowledgebase, freely available to both industrial and academic researchers.
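
The example workflow mentioned above (finding all compounds from patents that mention a given target or disease with high relevance) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Annotation` record, the integer `relevance` scale, the identifiers, and the function name `compounds_for_entity` are all hypothetical and do not reflect the actual RDF model, relevance algorithm, or Open PHACTS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One entity annotated in one patent (hypothetical, simplified record)."""
    patent_id: str
    entity_id: str
    entity_type: str  # "compound", "gene" or "disease"
    relevance: int    # assumed scale: a higher value means more relevant


def compounds_for_entity(annotations, entity_id, min_relevance):
    """Return sorted compound IDs from patents that mention `entity_id`
    (a gene or disease) with relevance >= `min_relevance`."""
    # Step 1: patents in which the gene/disease is sufficiently relevant.
    relevant_patents = {
        a.patent_id
        for a in annotations
        if a.entity_type in ("gene", "disease")
        and a.entity_id == entity_id
        and a.relevance >= min_relevance
    }
    # Step 2: compounds annotated in those patents.
    return sorted(
        {
            a.entity_id
            for a in annotations
            if a.entity_type == "compound" and a.patent_id in relevant_patents
        }
    )


# Made-up sample data, for illustration only.
EXAMPLE = [
    Annotation("PATENT-1", "GENE:A", "gene", 3),
    Annotation("PATENT-1", "CMPD:1", "compound", 0),
    Annotation("PATENT-1", "CMPD:2", "compound", 0),
    Annotation("PATENT-2", "GENE:A", "gene", 1),
    Annotation("PATENT-2", "CMPD:3", "compound", 0),
]

if __name__ == "__main__":
    print(compounds_for_entity(EXAMPLE, "GENE:A", min_relevance=2))
```

Raising `min_relevance` narrows the result to patents where the entity is central rather than mentioned in passing, which is the kind of restriction the relevance score is meant to support.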