GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles.

ISMB (Supplement of Bioinformatics), (2001): S74-82


Abstract

Systems that extract structured information from natural language passages have been highly successful in specialized domains. The time is opportune for developing analogous applications for molecular biology and genomics. We present a system, GENIES, that extracts and structures information about cellular pathways from the biological lit...

Introduction
  • The fields of molecular biology and medicine have enjoyed an explosive development; as a result, individual researchers find it difficult to keep up with all the new, relevant information.
  • GENIES exploits a term tagging component (Krauthammer et al., 2000) that identifies gene and protein names in text by using both rules and external knowledge sources.
  • Another type of system extracts both functional relations and molecular entities from text.
  • Sekimizu and colleagues (Sekimizu et al., 1998) extract relations associated with seven different verbs found in Medline abstracts.
Highlights
  • The fields of molecular biology and medicine have enjoyed an explosive development; as a result, individual researchers find it difficult to keep up with all the new, relevant information.
  • The article contained 7,790 words and took 1.3 minutes to process on a 500 MHz PC with 128 MB RAM.
  • Thirteen of the relations identified by the expert contained nesting; GENIES captured 8; 7 were correct and
  • Our pilot evaluation was based on only one article; the article was chosen by the expert, rather than by the system developers.
  • We have demonstrated that it is possible to apply the general information-extraction system MedLEE, previously applied to the domain of clinical records, to the domain of literature associated with molecular information.
  • Our pilot evaluation demonstrated high precision (96%) and satisfactory recall (63%).
Methods
  • The authors report on the preliminary evaluation and present its results; the Discussion section addresses their significance.
  • If the object has a modifier, it is represented as a nested frame; for example, the output for activated Il-2 is [protein, Il2, [state, active]].
  • In this example, "activated" is interpreted as a state whose target value is "active".
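
The nested-frame output described above can be sketched in a few lines of Python. This is only an illustration of the bracketed notation quoted in the text, not the GENIES implementation: the `Frame` class, the `MODIFIER_STATES` table, and the `build_protein_frame` helper are hypothetical names introduced here, and the protein name is passed in already normalized to `Il2`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    """A frame: a semantic type, an optional value, and nested modifier frames."""
    type: str
    value: str = ""
    modifiers: List["Frame"] = field(default_factory=list)

    def render(self) -> str:
        """Render the frame in the bracketed notation used in the text."""
        parts = [self.type]
        if self.value:
            parts.append(self.value)
        parts.extend(m.render() for m in self.modifiers)
        return "[" + ", ".join(parts) + "]"


# Hypothetical lookup table: modifier word -> (frame type, target value).
MODIFIER_STATES = {"activated": ("state", "active")}


def build_protein_frame(name: str, modifier: str = "") -> Frame:
    """Build a protein frame; a known modifier becomes a nested frame."""
    frame = Frame("protein", name)
    if modifier in MODIFIER_STATES:
        frame_type, target = MODIFIER_STATES[modifier]
        frame.modifiers.append(Frame(frame_type, target))
    return frame


il2 = build_protein_frame("Il2", "activated")
print(il2.render())  # [protein, Il2, [state, active]]
```

Keeping modifiers as a list of child frames means an unmodified object simply renders without a nested part, for example `[protein, Il2]`.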
Results
  • The article contained 7,790 words and took 1.3 minutes to process on a 500 MHz PC with 128 MB RAM.
  • The expert identified 51 binary relations; GENIES correctly extracted 27 (53%) stemming from the same sentences.
  • Many of the relations were redundant: in the whole article only 19 relations were unique.
  • Of the 19, GENIES retrieved 12 (63%; Figure 4).
  • Thirteen of the relations identified by the expert contained nesting; GENIES captured 8; 7 were correct and
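
The recall percentages reported above follow from the stated counts. The short check below assumes recall is computed as correctly extracted relations divided by expert-identified relations; the `recall` helper is an illustrative name, not part of GENIES.

```python
def recall(correct: int, expected: int) -> float:
    """Fraction of expert-identified relations that were correctly extracted."""
    return correct / expected


# Counts taken from the results above.
all_relations = recall(27, 51)     # all binary relations, including redundant ones
unique_relations = recall(12, 19)  # unique relations only

print(f"{all_relations:.0%}")     # 53%
print(f"{unique_relations:.0%}")  # 63%
```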
Conclusions
  • The authors' pilot evaluation was based on only one article; the article was chosen by the expert, rather than by the system developers.
  • ... was considerable, and in general it is rather difficult to find volunteers for such an evaluation.
  • To address this problem, the authors are currently developing tools to assist the expert in recording and editing interactions.
  • The authors will continue to refine, improve, and evaluate GENIES because it demonstrated its effectiveness for acquiring worthwhile knowledge from journal articles.
Tables
  • Table 1: Semantic classes associated with actions, processes, and other relations
Funding
  • This publication was supported in part by grants LM06274 from the National Library of Medicine and by the Columbia CAT supported by the NYS Science and Technology Foundation
Study subjects and analysis
complete journal articles: 140
Ideally, we would like to evaluate the system with a large number of articles (containing several hundred relations), although that would require an extraordinary amount of work. We have subsequently processed 140 complete journal articles in preparation for a second, more comprehensive evaluation. GENIES processes complete articles, whereas other systems process abstracts only.

References
  • G. (2000). Gene ontology: tool for the unification of biology. The
  • Bairoch A., and Apweiler R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic
  • Brass A. (1999). An ontology for bioinformatics applications.
  • A., and Wheeler D. L. (2000). GenBank. Nucleic Acids Res., 28, 15-8.
  • Blaschke C., Andrade M. A., Ouzounis C., and Valencia A. (1999).
  • Chen R. O., Felciano R., and Altman R. B. (1997). RIBOWEB: linking structural computations to a knowledge base of published experimental data. ISMB, 5, 84-7.
  • Hripcsak G. (2000). Coding neuroradiology reports for the
  • S. B. (1994). A general natural-language text processor for clinical radiology. J. Am. Med. Inform. Assoc., 1, 161-74.
  • Friedman C., and Hripcsak G. (1998). Evaluating natural language processors in the clinical domain. Methods Inf. Med., 37, 334-44.
  • Fukuda K., Tamura A., Tsunoda T., and Takagi T. (1998). Toward information extraction: identifying protein names from biological papers. Pac. Symp. Biocomput., 707-18.
  • S. (1994). Creating a knowledge base of biological research papers. ISMB, 2, 147-55.
  • Hatzivassiloglou V., Duboue P. A., and Rzhetsky A. (2001).
  • S. B., and Clayton P. D. (1995). Unlocking clinical data from narrative reports: a study of natural language processing. Ann.
  • Hripcsak G., Kuperman G. J., and Friedman C. (1998). Extracting findings from narrative reports: software transferability and sources of physician disagreement. Methods Inf. Med., 37, 1-7.
  • G. O. (1998). The Unified Medical Language System: an
  • Humphreys K., Demetriou G., and Gaizauskas R. (2000). Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. Pac.
  • Iliopoulos I., Enright A. J., and Ouzounis C. (2001). TEXTQUEST: Document Clustering of MEDLINE Abstracts For Concept
  • Jain N. L., and Friedman C. (1997). Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports. Proc. AMIA Annu. Fall Symp., 829-33.
  • Jain N. L., Knirsch C. A., Friedman C., and Hripcsak G. (1996).
  • Jenssen T. K., and Vinterbo S. (2000). A set-covering approach to specific search for literature about human genes. Proc. AMIA
  • Kanehisa M., and Goto S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27-30.
  • Karp P. D., Riley M., Paley S. M., Pellegrini-Toole A., and Krummenacker M. (1999). EcoCyc: Encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Res., 27, 55-58.
  • Hripcsak G. (1998). Respiratory isolation of tuberculosis patients using clinical guidelines and an automated clinical decision support system. Infect. Control Hosp. Epidemiol., 19, 94-100.
  • Koike T., and Rzhetsky A. (2000). A graphic editor for analyzing signal-transduction pathways. Gene, 259, 235-244.
  • (2000). Using BLAST for identifying gene and protein names in journal articles. Gene, 259, 245-252.
  • Maroto M., Reshef R., Munsterberg A. E., Koester S., Goulding M., and Lassar A. B. (1997). Ectopic Pax-3 activates MyoD and Myf5 expression in embryonic mesoderm and neural tissue. Cell, 89, 139-48.
  • Stavri P. Z. (1996). The UMLS Knowledge Source Server: a versatile Internet-based research tool. Proc. AMIA Annu. Fall
  • Park J. C., Kim H. S., and Kim J. J. (2001). Bidirectional
  • Pereira F. C. N., and Warren D. (1980). Definite clause grammars for language analysis – a survey of the formalism and comparison with augmented transition networks. Artificial Intelligence, 13, 231-278.
  • Rindflesch T. C., Hunter L., and Aronson A. R. (1999). Mining molecular binding terminology from biomedical text. Proc.
  • Rindflesch T. C., Tanabe L., Weinstein J. N., and Hunter L. (2000).
  • M., Kaplan S. H., Kra P., Russo J. J., and Friedman C. (2000).
  • Sekimizu T., Park H. S., and Tsujii J. (1998). Identifying the Interaction between Genes and Gene Products Based on Frequently
  • R., Panyushkina E., Pronevitch L., and Selkov E., Jr. (1997). The metabolic pathway collection: an update. Nucleic Acids Res., 25, 37-8.
  • M. (2000). Automatic extraction of protein interactions from scientific abstracts. Pac. Symp. Biocomput., 541-52.
  • Yakushiji A., Tateisi Y., Miyao Y., and Tsujii J. (2001). Event