We describe an algorithm called Bio-LDA that uses extracted biological terminology to automatically identify latent topics, and provides a variety of measures to uncover putative relations among topics and bio-terms

Finding Complex Biological Relationships in Recent PubMed Articles Using Bio-LDA

PLOS ONE, no. 3 (2011): e17243


Abstract

The overwhelming amount of available scholarly literature in the life sciences poses significant challenges to scientists wishing to keep up with important developments related to their research, but also provides a useful resource for the discovery of recent information concerning genes, diseases, compounds and the interactions between t...

Introduction
  • Translational research in medicine is concerned with transforming basic laboratory science into effective patient therapies as quickly as possible.
  • At the same time, sophisticated interdisciplinary research has led to the development and application of powerful methods that generate enormous amounts of new data, resulting in increased topical complexity of research articles.
  • This complexity makes it challenging to efficiently discover, evaluate and synthesize the latest information, trends, and findings deposited in published literature in a reasonable amount of time.
  • Developing useful approaches to facilitate knowledge discovery through systematic analysis of abstracts and full-text journal articles is an important and ongoing challenge.
Highlights
  • We develop a Bio-Latent Dirichlet Allocation (Bio-LDA) model, which extends the classic Latent Dirichlet Allocation (LDA) model by incorporating bio-terms as input variables
  • Chem2Bio2RDF [9] consists of about 78 million RDF triples over 25 datasets relating to systems chemical biology, which are grouped into 6 domains: chemical (PubChem Compound, ChEBI, PDB Ligand), chemogenomics (KEGG Ligand, CTD Chemical, BindingDB, MATADOR, PubChem BioAssay, QSAR, TTD, DrugBank, ChEMBL, Binding MOAD, PDSP, PharmGKB), biological (UNIPROT, HGNC, PDB, GI), systems (KEGG Pathway, Reactome, PPI, DIP), phenotype (OMIM, Diseasome, SIDER, CTD diseases) and literature (MEDLINE/PubMed)
  • We describe the architecture and main features of the Bio-Latent Dirichlet Allocation model
  • We demonstrate how Bio-Latent Dirichlet Allocation, in contrast to natural language processing methods, can automatically derive a collection of topics of related biological terms that map to clearly understandable biological themes, and which allow the complexity of topics addressed in individual papers to be represented by probabilities of association with topics
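
To make the idea of representing a paper by its probabilities of association with topics more concrete, here is a minimal sketch of a collapsed Gibbs sampler for classic LDA. It is not the authors' Bio-LDA model: bio-terms are simply treated as ordinary tokens, and the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for classic LDA (illustrative, not Bio-LDA).

    docs is a list of token lists. Returns (theta, vocab, phi), where
    theta[d][k] estimates P(topic k | document d) and phi[k][v] estimates
    P(vocab[v] | topic k).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]   # times word v is assigned to topic k
    topic_total = [0] * K                      # tokens assigned to topic k overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]
    return theta, vocab, phi


if __name__ == "__main__":
    # Toy documents, invented for this sketch.
    toy_docs = [
        ["transplant", "graft", "donor", "tacrolimus"],
        ["graft", "donor", "transplant", "cyclosporine"],
        ["platelet", "heparin", "thrombosis", "risk"],
        ["thrombosis", "heparin", "platelet", "risk"],
    ]
    theta, vocab, phi = lda_gibbs(toy_docs, num_topics=2)
    for d, row in enumerate(theta):
        print(d, [round(p, 3) for p in row])
```

Each row of `theta` is a probability distribution over topics for one document, which is the kind of per-paper representation the highlights describe. A model that distinguishes bio-terms from ordinary words, as Bio-LDA does, would need additional count tables that this sketch does not include.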
Methods
  • PubMed offers a web-based and programmatic search service over its content [1]
  • This interface is limited to small- to medium-scale queries, and text mining using this interface is not possible.
  • The entire content of MEDLINE is available as a set of text files formatted in XML
  • In this project, the 2010 MEDLINE/PubMed baseline database is used as the primary data source, which contains 617 files and 18,502,916 records
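
As an illustration of working with the XML-formatted MEDLINE files mentioned above, the following sketch streams records with Python's standard `xml.etree.ElementTree.iterparse`. The element names follow the MEDLINE citation format as I understand it (`MedlineCitation`, `PMID`, `Article/ArticleTitle`, `Article/Abstract/AbstractText`); the sample document, the helper name `iter_records`, and the output field names are assumptions, and this is not the authors' actual pipeline.

```python
import io
import xml.etree.ElementTree as ET

# A tiny, made-up document in the shape of a MEDLINE baseline file.
SAMPLE = b"""<?xml version="1.0"?>
<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <Journal>
        <JournalIssue><PubDate><Year>2009</Year></PubDate></JournalIssue>
        <Title>Hypothetical Journal</Title>
      </Journal>
      <ArticleTitle>Made-up title one</ArticleTitle>
      <Abstract><AbstractText>Made-up abstract text.</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>2</PMID>
    <Article>
      <Journal>
        <JournalIssue><PubDate><Year>2008</Year></PubDate></JournalIssue>
        <Title>Hypothetical Journal</Title>
      </Journal>
      <ArticleTitle>Record without an abstract</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>
"""


def iter_records(source):
    """Yield one dict per MedlineCitation element.

    iterparse processes the file incrementally, so a large baseline file
    does not have to be loaded into memory at once.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "MedlineCitation":
            continue
        # An abstract may be split over several AbstractText elements.
        abstract = " ".join(
            "".join(node.itertext()).strip()
            for node in elem.iterfind("Article/Abstract/AbstractText")
        )
        yield {
            "pmid": elem.findtext("PMID"),
            "year": elem.findtext("Article/Journal/JournalIssue/PubDate/Year"),
            "journal": elem.findtext("Article/Journal/Title"),
            "title": elem.findtext("Article/ArticleTitle"),
            "abstract": abstract,
        }
        elem.clear()  # release the subtree that has just been processed


if __name__ == "__main__":
    for record in iter_records(io.BytesIO(SAMPLE)):
        if record["abstract"]:  # keep only records that have an abstract
            print(record["pmid"], record["year"], record["title"])
```

`iter_records` also accepts a file path, so the same function could be pointed at each baseline file in turn; filtering by `year` or by the presence of an abstract is left to the caller.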
Results
  • Analyzing the Bio-LDA model results: in the experiments, the authors applied the Bio-LDA model to 336,899 MEDLINE abstracts (330M in size) published in 2009, which contain 308,686 words, 13,338 extracted bio-terms, and 4,450 journals.
  • Representation of Topic 13 (doi:10.1371/journal.pone.0017243.t002):
    Words: patient, transplant, platelet, studi, group, donor, factor, risk, result, graft
    Bio-terms (type): Thrombosis (DISEASE), Venous Thromboembolism (DISEASE), Heparin (DRUG), Tacrolimus (DRUG), Cyclosporine (DRUG), VWF (GENE), Thrombocytopenia (DISEASE), Mycophenolate mofetil (DRUG), IMPACT (GENE), ABO (GENE)
    Journals: Transplant.; Transplantation; Thromb. Res.; Transfusion
Conclusion
  • Three applications, association prediction, association search, and connectivity map generation, are presented, which the authors believe are useful for biomedical and drug discovery applications, especially when combining the Bio-LDA model with a pre-knowledge network, i.e., Chem2Bio2RDF.
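  • The authors believe these experiments demonstrate great value in performing this kind of analysis for enhancing biological knowledge.
  • Through the KL divergence measure, a distance between any two terms can be generated via their probabilities of association with topics. This opens up the possibility of using the method for ranking paths through the data, or as an alternative way of measuring the degree of association between, for example, drugs and genes, or pathways and diseases.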
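
The tables listed for this paper include bio-term entropies (Table 9) and symmetric KL divergence (Table 10), both computed over a bio-term's probabilities of association with topics. The sketch below shows one common way to compute these quantities. It assumes the symmetric form is the sum of the two directed divergences (some authors use the average instead), and the topic distributions are invented for illustration rather than taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """Directed KL divergence D(p || q); eps guards against division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def symmetric_kl(p, q):
    """Symmetric KL divergence, taken here as D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    # Hypothetical topic distributions for one drug and two genes.
    drug = [0.7, 0.2, 0.1]
    gene_a = [0.6, 0.3, 0.1]
    gene_b = [0.1, 0.2, 0.7]
    print("entropy(drug):", entropy(drug))
    print("drug vs gene_a:", symmetric_kl(drug, gene_a))
    print("drug vs gene_b:", symmetric_kl(drug, gene_b))
```

A smaller symmetric KL value means the two bio-terms are spread over topics in a more similar way, so in this toy example `gene_a` would rank as more closely associated with the drug than `gene_b`. A low entropy indicates a bio-term concentrated in few topics.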
Tables
  • Table 1: Statistics of the bio-term extraction
  • Table 2: Representations for selected topics
  • Table 3: Top topics for the selected bio-terms
  • Table 4: (a) Frequency word sets of the LDA model and the Bio-LDA model. (b) Mappings between the Bio-LDA model and the LDA model
  • Table 5: Comparison of word representations of topics in the Bio-LDA model and in the LDA model
  • Table 6: Bio-terms associated with topics
  • Table 7: Calculated association score for Venlafaxine and HTR1A, HTR2A
  • Table 8: Comparison of the co-occurrence method and Bio-LDA in identifying associated bio-terms
  • Table 9: Bio-term entropies for nodes shown in the top 3 paths
  • Table 10: Symmetric KL divergence for the top 3 paths
Funding
  • The authors have no support or funding to report.
Reference
  • Muin M, Fontelo P, Ackerman M (2006) PubMed Interact: an interactive search application for MEDLINE/PubMed. AMIA Annual Symposium Proceedings: 1089.
  • Cohen KB, Hunter L (2004) Natural language processing and systems biology. Artificial intelligence and systems biology. pp 147–174.
  • Feldman R, Regev Y, Hurvitz E, Finkelstein-Landau M (2003) Mining the biomedical literature using semantic analysis and natural language processing techniques. 1: 69–80.
  • Blei D, Ng A, Jordan M (2003) Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993–1022.
  • Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents; Banff, Canada: AUAI Press. pp 487–494.
  • Tang J, Zhang J, Yao L, Li J, Zhang L, et al. (2008) ArnetMiner: extraction and mining of academic social networks; Las Vegas, Nevada, USA: ACM. pp 990–998.
  • Wild DJ (2009) Mining large heterogeneous data sets in drug discovery. Expert Opinion on Drug Discovery 4: 995–1004.
  • Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J (2008) Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics 41: 706–716.
  • Chen B, Dong X, Jiao D, Wang H, Zhu Q, et al. (2010) Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 11: 255.
  • Jentzsch A, Zhao J, Hassanzadeh O, Cheung K, Samwald K, Andersson B (2009) Linking open drug data; Graz, Austria.
  • Hofmann T (1999) Probabilistic latent semantic indexing; Berkeley, California, United States: ACM. pp 50–57.
  • Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends; Philadelphia, PA, USA: ACM. pp 424–433.
  • Si X, Sun M (2009) Tag-LDA for Scalable Real-time Tag Recommendation. Journal of Computational Information Systems.
  • Xu H, Wang J, Hua X, Li S (2009) Tag refinement by regularized LDA; Beijing, China: ACM. pp 573–576.
  • Wang X, Grimson WEL, Westin C-F (2007) Tractography segmentation using a hierarchical Dirichlet processes mixture model.
  • Blei D, McAuliffe J (2010) Supervised Topic Models. Available: http://arxiv.org/abs/1003.0783v1, Accessed 2010 Jul 3.
  • Newman D, Asuncion A, Smyth P, Welling M (2007) Distributed inference for latent Dirichlet allocation. Neural Information Processing Systems (NIPS) 20: 1081–1088.
  • Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T (2004) Probabilistic author-topic models for information discovery; Seattle, WA, USA: ACM. pp 306–315.
  • Wang Y, Bai H, Stanton M, Chen W-Y, Chang EY (2009) PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications; San Francisco, CA, USA: Springer-Verlag. pp 301–314.
  • Blei DM, Franks K, Jordan MI, Mian IS (2006) Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span. BMC Bioinformatics 7.
  • Zheng B, McLean D, Lu X (2006) Identifying biological concepts from a protein-related corpus with a probabilistic topic model. BMC Bioinformatics 7: 58.
  • Mörchen F, Dejori M, Fradkin D, Etienne J, Wachmann B, et al. (2008) Anticipating annotations and emerging trends in biomedical literature; Las Vegas, Nevada, USA: ACM. pp 954–962.
  • Alako B, Veldhoven A, van Baal S, Jelier R, Verhoeven S, et al. (2005) CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics 6: 51.
  • Frijters R, van Vugt M, Smeets R, van Schaik R, de Vlieg J, et al. (2010) Literature Mining for the Discovery of Hidden Connections between Drugs, Genes and Diseases. PLoS Comput Biol 6: e1000943.
  • Bizer C, Cyganiak R (2006) D2R Server - Publishing Relational Databases on the Semantic Web. 5th International Semantic Web Conference. Athens, GA, USA.
  • Anyanwu K, Sheth A (2002) The ρ Operator: Discovering and Ranking on the Semantic Web. SIGMOD Record 31: 42–47.