Understanding human disease knowledge through text mining

doi:10.1017/CBO9780511989421.004

3 - Understanding human disease knowledge through text mining

Published online by Cambridge University Press: 05 February 2016

Raul Rodriguez-Esteban

Edited by

William T. Loging

Raul Rodriguez-Esteban: Roche Inc
William T. Loging: Mount Sinai School of Medicine, New York

Summary

The aim of text mining in biomedicine is to extract valuable information from large amounts of biomedical text. For this purpose it borrows techniques from fields such as natural language processing (NLP), information retrieval (IR), information extraction (IE), and artificial intelligence (AI). However, many of these techniques need to be adapted to the particularities of biomedical text, which spans an unusually diverse range of vocabularies and writing styles, as can be seen in clinical narratives, regulatory reports, and scientific articles. For example, an NLP algorithm built to recognize sentence boundaries in newspapers would need to be adjusted for biomedical text, because periods that do not end a sentence are more frequent in biomedical text than in newspapers and would mislead the algorithm (Tomanek et al., 2007). The particular information needs of biomedicine have also led to the development of specialized text-mining techniques for extracting knowledge specific to the biomedical domain, such as molecular events, perturbations, and interactions.
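
The sentence-boundary problem mentioned above can be illustrated with a small sketch. This is not the method of Tomanek et al. (2007), which is based on conditional random fields; it is a toy rule-based comparison, and the abbreviation list, function names, and sample text are all assumptions made for illustration.

```python
import re

# Hypothetical, hand-picked abbreviations whose trailing period
# usually does not end a sentence in biomedical writing.
NON_TERMINAL_ABBREVIATIONS = ("et al.", "e.g.", "i.e.", "Fig.", "vs.", "approx.")

# Made-up sample text with several periods that are not sentence boundaries.
SAMPLE = (
    "The strain E. coli K-12 was used (Fig. 2). "
    "Expression increased approx. 2.5-fold vs. controls."
)


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def _continues(previous, piece):
    """Guess whether `piece` belongs to the same sentence as `previous`."""
    ends_with_abbreviation = previous.endswith(NON_TERMINAL_ABBREVIATIONS)
    starts_lowercase = piece[0].islower()
    return ends_with_abbreviation or starts_lowercase


def adjusted_split(text):
    """Naive split, then re-join pieces that were cut at a non-boundary period."""
    sentences = []
    for piece in naive_split(text):
        if sentences and _continues(sentences[-1], piece):
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


if __name__ == "__main__":
    print(len(naive_split(SAMPLE)), "pieces from the naive splitter")
    for sentence in adjusted_split(SAMPLE):
        print("-", sentence)
```

On the sample text the naive splitter breaks at the periods in "E.", "Fig.", "approx." and "vs.", whereas the adjusted version recovers the two intended sentences. A fixed abbreviation list does not generalize well, which is one reason to prefer approaches trained on biomedical text.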

Pharmaceutical companies are data-intensive organizations whose success depends on their ability to efficiently process large quantities of data from internal and external sources. Much valuable knowledge is locked within textual sources such as patents, clinical records, conference abstracts, and full-text articles. The growth of these textual sources means that even subject-matter experts cannot keep up with the content appearing in their niche. For example, more than 27,000 articles mentioning diabetes were listed in PubMed during the year 2013. Text mining enables the processing of such documents within practical time frames and impacts every stage of the drug discovery pipeline.

Before the late 1990s, IR was the main research field that dealt with biomedical documents. Its main focus was on improving access to literature records from biomedical databases such as Medline, a comprehensive database of scientific abstracts managed by the US National Library of Medicine (NLM). Then, in 1996, the launch of PubMed made the majority of Medline content available online (Canese, 2006). This event was followed by an increase in research about biomedical documents with a scope broader than IR. Such research came to be called “text mining”, reflecting the emergence of data mining and text mining during the same period (Rodriguez-Esteban, 2008). The first publication dealing with biomedical text that used the name “text mining” came from the National Institutes of Health (NIH) in 1999 (Tanabe et al., 1999).

Type: Chapter
Information: Bioinformatics and Computational Biology in Drug Discovery and Development, pp. 47–62

DOI: https://doi.org/10.1017/CBO9780511989421.004

Publisher: Cambridge University Press

Print publication year: 2016

References

Agarwal, S. and Yu, H. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion. Bioinformatics. 2009;25(23):3174–3180.

Björk, B. C., Welling, P., Laakso, M., et al. Open access to the scientific journal literature: situation 2009. PLoS ONE. 2010;5(6):e11273.

Breiner, D. A. and Rodriguez-Esteban, R. Web Scraping Technology as a Cost-Effective Solution for News Alerting. Special Libraries Association, Pharmaceutical and Health Technology, Spring Meeting, Philadelphia, April 2013.

Canese, K. PubMed Celebrates its 10th Anniversary! NLM Technical Bulletin. 2006;(352):e5.

Caporaso, J. G., Deshpande, N., Fink, J. L., et al. Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pacific Symposium on Biocomputing. 2008;640–651.

Clark, A., Körner, C. and Nielsen, H. P. From punched cards to apps and iPads. Fifty-five years of the P-D-R. Business Information Review. 2013;30(2):96–101.

Clegg, A. B. and Shepherd, A. J. Benchmarking natural-language parsers for biological applications using dependency graphs. BMC Bioinformatics. 2007;8:24.

Cohen, K. B., Johnson, H. L., Verspoor, K., Roeder, C. and Hunter, L. E. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics. 2010;11:492. doi: 10.1186/1471-2105-11-492.

Constantin, A., Pettifer, S. and Voronkov, A. PDFX: Fully-automated PDF-to-XML conversion of scientific literature. Proceedings of the 2013 ACM Symposium on Document Engineering (DocEng 2013). 2013;177–180.

Dahlmeier, D. and Ng, H. T. Domain adaptation for semantic role labeling in the biomedical domain. Bioinformatics. 2010;26(8):1098–1104.

Divoli, A. Biomedical Text Mining Approaches: Applications in Protein Family Annotation (dissertation). Manchester: University of Manchester, 2006.

Eder, J., Sedrani, R. and Wiesmann, C. The discovery of first-in-class drugs: Origins and evolution. Nature Reviews Drug Discovery. 2014;13(8):577–587.

Eriksson, R., Jensen, P. B., Frankild, S., Jensen, L. J. and Brunak, S. Dictionary construction and identification of possible adverse drug events in Danish clinical narrative text. Journal of the American Medical Informatics Association. 2013;20(5):947–953.

Feldman, R. and Dagan, I. Knowledge Discovery in Textual Databases (KDT). First International Conference on Knowledge Discovery (KDD-95). Montreal, Canada, 1995.

Ferraro, J. P., Daumé, H. 3rd, Duvall, S. L., et al. Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation. Journal of the American Medical Informatics Association. 2013;20(5):931–939.

Friedman, C., Kra, P. and Rzhetsky, A. Two biomedical sublanguages: A description based on the theories of Zellig Harris. Journal of Biomedical Informatics. 2002;35(4):222–235.

Golder, S. and Loke, Y. K. The contribution of different information sources for adverse effects data. International Journal of Technology Assessment in Health Care. 2012;28(2):133–137.

Gomes, B., Hayes, W. and Podowski, R. M. Text mining. In In Silico Technologies in Drug Target Identification and Validation, ed. Leon, D. and Markel, S. (pp. 153–194). Boca Raton, FL: CRC Press, 2006.

Jensen, P. B., Jensen, L. J. and Brunak, S. Mining electronic health records: Towards better research applications and clinical care. Nature Reviews Genetics. 2012;13(6):395–405.

Jiang, Y., Lin, C., Meng, W., et al. Rule-based deduplication of article records from bibliographic databases. Database (Oxford). 2014;2014:bat086.

Jones, C. W., Handler, L., Crowell, K. E., et al. Non-publication of large randomized clinical trials: Cross sectional analysis. British Medical Journal. 2013;347:f6104.

Kabiljo, R., Clegg, A. B. and Shepherd, A. J. A realistic assessment of methods for extracting gene/protein interactions from free text. BMC Bioinformatics. 2009;10:233.

Kiritchenko, S., de Bruijn, B., Carini, S., Martin, J. and Sim, I. ExaCT: Automatic extraction of clinical trial characteristics from journal publications. BMC Medical Informatics and Decision Making. 2010;10:56.

Kulkarni, A. V., Aziz, B., Shams, I. and Busse, J. W. Comparisons of citations in Web of Science, Scopus, and Google Scholar for articles published in general medical journals. Journal of the American Medical Association. 2009;302(10):1092–1096.

Leaman, R., Wojtulewicz, L., Sullivan, R., et al. Towards internet-age pharmacovigilance: Extracting adverse drug reactions from user posts to health-related social networks. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. 2010;117–125.

Loging, W., Rodriguez-Esteban, R., Hill, J., Freeman, T. and Miglietta, J. Cheminformatic/bioinformatic analysis of large corporate databases: application to drug repurposing. Drug Discovery Today. 2011;8(3–4):109–116.

Martin, E. P. G., Bremer, E. G., Guerin, M., DeSesa, C. and Jouve, O. Analysis of protein/protein interactions through biomedical literature: Text mining of abstracts vs. text mining of full text articles. In Knowledge Exploration in Life Science Informatics (pp. 96–108). Lecture Notes in Computer Science. New York, NY: Springer, 2004.

McIntosh, T. and Curran, J. R. Challenges for automatically extracting molecular interactions from full-text articles. BMC Bioinformatics. 2009;10:311.

Miwa, M., Thompson, P. and Ananiadou, S. Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics. 2012;28(13):1759–1765.

Miwa, M., Pyysalo, S., Ohta, T. and Ananiadou, S. Wide coverage biomedical event extraction using multiple partially overlapping corpora. BMC Bioinformatics. 2013;14:175.

Primo Peña, E., Vázquez Valero, M. and García Sicilia, J. Comparative study of journal selection criteria used by MEDLINE and EMBASE, and their application to Spanish biomedical journals. The 9th European Conference of Medical and Health Libraries. 2004.

Pyysalo, S., Salakoski, T., Aubin, S. and Nazarenko, A. Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches. BMC Bioinformatics. 2006;7(Suppl 3):S2.

Ramakrishnan, C., Patnia, A., Hovy, E. and Burns, G. A. Layout-aware text extraction from full-text PDF of scientific articles. Source Code in Biology and Medicine. 2012;7(1):7.

Rodriguez-Esteban, R. Methods in Biomedical Text Mining (dissertation). New York, NY: Columbia University, 2008.

Rodriguez-Esteban, R. and Iossifov, I. Figure mining for biomedical research. Bioinformatics. 2009;25(16):2082–2084.

Schmoch, U. Indicators and the relations between science and technology. Scientometrics. 1997;38(1):103–116.

Schuemie, M. J., Weeber, M., Schijvenaars, B. J., et al. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics. 2004;20(16):2597–2604.

Searls, D. B. Mining the bibliome. Pharmacogenomics J. 2001;1(2):88–89.

Shultz, M. Comparing test searches in PubMed and Google Scholar. Journal of the Medical Library Association. 2007;95(4):442–445.

Swinney, D. C. and Anthony, J. How were new medicines discovered? Nature Reviews Drug Discovery. 2011;10(7):507–519.

Tanabe, L., Scherf, U., Smith, L. H., et al. MedMiner: An Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999;27(6):1210–1214, 1216–1217.

The Europe PMC Consortium. Europe PMC: A full-text literature database for the life sciences and platform for innovation. Nucleic Acids Research. 2014;pii: gku1061.

Tikk, D., Thomas, P., Palaga, P., Hakenberg, J. and Leser, U. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Computational Biology. 2010;6:e1000837.

Tomanek, K., Wermter, J. and Hahn, U. Sentence and Token Splitting Based on Conditional Random Fields. Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007). Melbourne, Australia. 2007.

Van Landeghem, S., Hakala, K., Rönnqvist, S., et al. Exploring biomolecular literature with EVEX: Connecting genes through events, homology, and indirect associations. Advances in Bioinformatics. 2012;2012:582765.

Van Noorden, R. Trouble at the text mine. Nature. 2012;483(7388):134–135.

Van Noorden, R. Text-mining spat heats up. Nature. 2013;495(7441):295.

Van Noorden, R. Elsevier opens its papers to text-mining. Nature. 2014;506(7486):17.

Verspoor, K., Cohen, K. B., Lanfranchi, A., et al. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics. 2012;13:207.

Vlachos, A. and Craven, M. Biomedical event extraction from abstracts and full papers using search-based structured prediction. BMC Bioinformatics. 2012;13(Suppl 11):S5.

Weiss, G. M. Mining with rarity: a unifying framework. SIGKDD Explorations Newsletter. 2004;6:7–19.

Winnenburg, R., Wächter, T., Plake, C., Doms, A. and Schroeder, M. Facts from text: Can text mining help to scale-up high-quality manual curation of gene products with ontologies? Briefings in Bioinformatics. 2008;9(6):466–478.

Xu, S., Yoon, H. J. and Tourassi, G. A user-oriented web crawler for selectively acquiring online content in e-health research. Bioinformatics. 2014;30(1):104–114.
