We describe an algorithm called Bio-LDA that uses extracted biological terminology to automatically identify latent topics, and provides a variety of measures to uncover putative relations among topics and bio-terms
Finding Complex Biological Relationships In Recent Pubmed Articles Using Bio-Lda
PLOS ONE, no. 3 (2011): e17243-e17243
The overwhelming amount of available scholarly literature in the life sciences poses significant challenges to scientists wishing to keep up with important developments related to their research, but also provides a useful resource for the discovery of recent information concerning genes, diseases, compounds and the interactions between t...
- Translational research in medicine is concerned with transforming basic laboratory science into effective patient therapies as quickly as possible.
- At the same time, sophisticated interdisciplinary research has led to the development and application of powerful methods that generate enormous amounts of new data, resulting in an increased topical complexity of research articles.
- This complexity makes it challenging to efficiently discover, evaluate and synthesize the latest information, trends, and findings deposited in published literature in a reasonable amount of time.
- Generating useful approaches to facilitate knowledge discovery through systematic analysis of abstracts and full-text journal articles is an important and ongoing challenge
- We develop a Bio-Latent Dirichlet Allocation (Bio-LDA) model, which extends the classic Latent Dirichlet Allocation (LDA) model by incorporating bio-terms as input variables
- Chem2Bio2RDF consists of about 78 million RDF triples over 25 datasets relating to systems chemical biology, which are grouped into 6 domains, namely chemical (PubChem Compound, ChEBI, PDB Ligand), chemogenomics (KEGG Ligand, CTD Chemical, BindingDB, MATADOR, PubChem BioAssay, QSAR, TTD, DrugBank, ChEMBL, Binding MOAD, PDSP, PharmGKB), biological (UNIPROT, HGNC, PDB, GI), systems (KEGG Pathway, Reactome, PPI, DIP), phenotype (OMIM, Diseasome, SIDER, CTD diseases) and literature (MEDLINE/PubMed)
- We describe the architecture and main features of the Bio-Latent Dirichlet Allocation model
- We demonstrate how Bio-Latent Dirichlet Allocation, in contrast to natural language processing methods, can automatically derive a collection of topics of related biological terms that map to clearly understandable biological themes, and which allow the complexity of topics addressed in individual papers to be represented by probabilities of association with topics
- PubMed offers a web-based and programmatic search service over its content 
- This interface is limited to small- to medium-scale queries, and text mining using this interface is not possible.
- The entire content of MEDLINE is available as a set of text files formatted in XML
- In this project, the 2010 MEDLINE/PubMed baseline database is used as the primary data source, which contains 617 files and 18,502,916 records
- Analyzing the Bio-LDA model results: in the experiments, the authors applied the Bio-LDA model to 336,899 MEDLINE abstracts (330M in size) published in 2009, which contain 308,686 words, 13,338 extracted bio-terms, and 4,450 ...
- Table 2 excerpt, Topic 13 (doi:10.1371/journal.pone.0017243.t002):
- Words: patient, transplant, platelet, studi, group, donor, factor, risk, result, graft
- Bio-terms (type): Thrombosis (DISEASE), Venous Thromboembolism (DISEASE), Heparin (DRUG), Tacrolimus (DRUG), Cyclosporine (DRUG), VWF (GENE), Thrombocytopenia (DISEASE), Mycophenolate mofetil (DRUG), IMPACT (GENE), ABO (GENE)
- Journals: Transplant. Transplantation Thromb. Res. Transfusion
- Three applications, association prediction, association search, and connectivity map generation, are presented, which the authors believe are useful for biomedical and drug discovery applications, especially when the Bio-LDA model is combined with a prior-knowledge network, namely Chem2Bio2RDF.
- The authors believe these experiments demonstrate great value in performing this kind of analysis for enhancing biological knowledge.
- Table 1: Statistics of the bio-terms extraction
- Table 2: Representations for selected topics
- Table 3: Top topics for the selected bio-terms
- Table 4: a) Frequency word sets of LDA model and Bio-LDA model. b) Mappings between Bio-LDA model and LDA model
- Table 5: Compare word representation of topics in the Bio-LDA model to topics in the LDA model
- Table 6: Bio-terms associated with topics
- Table 7: Calculated association score for Venlafaxine and HTR1A, HTR2A
- Table 8: Comparing the co-occurrence method and the Bio-LDA in identifying associated bio-terms
- Table 9: Bio-term entropies for nodes shown in the top 3 paths
- Table 10: Symmetric KL divergence for the top 3 paths
- Funding: The authors have no support or funding to report
- ... associated with topics with a given level of probability, and through the KL divergence measure, a distance between any two terms can be generated via their probabilities of association with topics. This opens up the possibility of using the method for ranking paths through the data, or as an alternative way of measuring the degree of association between, for example, drugs and genes, or pathways and diseases.
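
The summary above mentions bio-term entropies and a symmetric KL divergence computed from each term's probabilities of association with topics. The following is a minimal sketch of how such measures could be computed, not the authors' implementation. It assumes each bio-term is represented by a vector of non-negative topic weights, and it takes the symmetric KL divergence to be the sum of the two directed divergences (other definitions, such as the average, differ only by a constant factor). The function names, the smoothing constant `eps`, and the example weights are all hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative topic weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("topic weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """Directed KL divergence D(p || q) over topics.

    eps is an assumed smoothing constant that keeps the ratio finite
    when a topic has zero probability under q.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Symmetric KL divergence, assumed here to be D(p||q) + D(q||p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def entropy(p):
    """Shannon entropy (in nats) of a term's topic distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Illustrative topic weights for two hypothetical bio-terms (not from the paper).
term_a = normalize([8, 1, 1, 0])
term_b = normalize([1, 1, 8, 0])

distance = symmetric_kl(term_a, term_b)
spread_a = entropy(term_a)
```

Under these assumptions, a smaller `symmetric_kl` value means two terms load on similar topics, and a lower `entropy` means a term is concentrated in fewer topics.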