
3 - Understanding human disease knowledge through text mining

Published online by Cambridge University Press:  05 February 2016

Raul Rodriguez-Esteban
Affiliation:
Roche Inc
William T. Loging
Affiliation:
Mount Sinai School of Medicine, New York

Summary

The aim of text mining in biomedicine is to extract valuable information from large amounts of biomedical text. For this purpose it borrows techniques from fields such as natural language processing (NLP), information retrieval (IR), information extraction (IE), and artificial intelligence (AI). However, many of these techniques need to be adapted to the particularities of biomedical text, which spans an unusually diverse range of vocabularies and writing styles, as can be seen in clinical narratives, regulatory reports, and scientific articles. For example, an NLP algorithm that recognizes sentence boundaries in newspapers would need to be adjusted for biomedical text, because periods that do not end a sentence are more frequent in biomedical text than in newspapers and would mislead the algorithm (Tomanek et al., 2007). The particular information needs of biomedicine have also led to the development of specialized text-mining techniques for extracting domain-specific knowledge such as molecular events, perturbations, and interactions.
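The sentence-boundary problem mentioned above can be illustrated with a small sketch. It is a rule-based stand-in written for illustration only, not the conditional random field approach of Tomanek et al. (2007). The abbreviation list, the sample text, and the function names `naive_split`, `is_false_boundary`, and `adapted_split` are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"e.g.", "i.e.", "al.", "Fig.", "vs.", "approx."}

# Invented sample text with several periods that do not end a sentence.
SAMPLE = ("Expression of E. coli genes was measured (Fig. 2). "
          "Levels rose approx. 3.5-fold vs. controls, e.g. in strain K-12.")


def naive_split(text):
    """Newspaper-style rule: a period followed by whitespace ends a sentence."""
    return [piece for piece in re.split(r"(?<=[.!?])\s+", text.strip()) if piece]


def is_false_boundary(fragment):
    """Guess whether the period ending `fragment` is not a sentence end."""
    last_token = fragment.split()[-1].lstrip("([")
    # Known abbreviation, or a single capital letter such as the genus in "E. coli".
    return (last_token in ABBREVIATIONS
            or re.fullmatch(r"[A-Z]\.", last_token) is not None)


def adapted_split(text):
    """Re-join naive fragments whose final period looks like an abbreviation."""
    sentences = []
    for piece in naive_split(text):
        if sentences and is_false_boundary(sentences[-1]):
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


if __name__ == "__main__":
    print(len(naive_split(SAMPLE)), "fragments from the naive rule")
    for sentence in adapted_split(SAMPLE):
        print(sentence)
```

On the sample text the naive rule produces seven fragments, while the adapted rule recovers the two intended sentences. A hand-written list like this will miss many cases (for instance, a true sentence end that happens to follow "et al."), which is one reason statistical sentence splitters trained on biomedical text are used in practice.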

Pharmaceutical companies are data-intensive organizations whose success depends on their ability to efficiently process large quantities of data from internal and external sources. Much valuable knowledge is locked within textual sources such as patents, clinical records, conference abstracts, and full-text articles. The growth of these textual sources means that even subject-matter experts cannot keep up with the content appearing in their niche. For example, more than 27,000 articles mentioning diabetes were listed in PubMed during the year 2013. Text mining enables the processing of such documents within practical time frames and can thereby affect every stage of the drug discovery pipeline.

Before the late 1990s, IR was the main research field that dealt with biomedical documents. Its main focus was on improving access to literature records from biomedical databases such as Medline, a comprehensive database of scientific abstracts managed by the US National Library of Medicine (NLM). Then, in 1996, the launch of PubMed made the majority of Medline content available online (Canese, 2006). This event was followed by an increase in research on biomedical documents with a scope broader than IR. Such research came to be called “text mining” because data mining and text mining emerged as fields during the same period (Rodriguez-Esteban, 2008). The first publication dealing with biomedical text that used the name “text mining” came from the National Institutes of Health (NIH) in 1999 (Tanabe et al., 1999).

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2016


References

Agarwal, S. and Yu, H. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion. Bioinformatics. 2009;25(23):3174–3180.
Björk, B. C., Welling, P., Laakso, M., et al. Open access to the scientific journal literature: situation 2009. PLoS ONE. 2010;5(6):e11273.
Breiner, D. A. and Rodriguez-Esteban, R. Web Scraping Technology as a Cost-Effective Solution for News Alerting. Special Libraries Association, Pharmaceutical and Health Technology, Spring Meeting, Philadelphia, April 2013.
Canese, K. PubMed Celebrates its 10th Anniversary! NLM Technical Bulletin. 2006;(352):e5.
Caporaso, J. G., Deshpande, N., Fink, J. L., et al. Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pacific Symposium on Biocomputing. 2008;640–651.
Clark, A., Körner, C. and Nielsen, H. P. From punched cards to apps and iPads. Fifty-five years of the P-D-R. Business Information Review. 2013;30(2):96–101.
Clegg, A. B. and Shepherd, A. J. Benchmarking natural-language parsers for biological applications using dependency graphs. BMC Bioinformatics. 2007;8:24.
Cohen, K. B., Johnson, H. L., Verspoor, K., Roeder, C. and Hunter, L. E. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics. 2010;11:492. doi: 10.1186/1471-2105-11-492.
Constantin, A., Pettifer, S. and Voronkov, A. PDFX: Fully-automated PDF-to-XML conversion of scientific literature. Proceedings of the 2013 ACM symposium on Document Engineering (DocEng 2013). 2013;177–180.
Dahlmeier, D. and Ng, H. T. Domain adaptation for semantic role labeling in the biomedical domain. Bioinformatics. 2010;26(8):1098–1104.
Divoli, A. Biomedical Text Mining Approaches: Applications in Protein Family Annotation (dissertation). Manchester: University of Manchester, 2006.
Eder, J., Sedrani, R. and Wiesmann, C. The discovery of first-in-class drugs: Origins and evolution. Nature Reviews Drug Discovery. 2014;13(8):577–587.
Eriksson, R., Jensen, P. B., Frankild, S., Jensen, L. J. and Brunak, S. Dictionary construction and identification of possible adverse drug events in Danish clinical narrative text. Journal of the American Medical Informatics Association. 2013;20(5):947–953.
Feldman, R. and Dagan, I. Knowledge Discovery in Textual Databases (KDT). First International Conference on Knowledge Discovery (KDD-95). Montreal, Canada, 1995.
Ferraro, J. P., Daumé, H. 3rd, Duvall, S. L., et al. Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation. Journal of the American Medical Informatics Association. 2013;20(5):931–939.
Friedman, C., Kra, P. and Rzhetsky, A. Two biomedical sublanguages: A description based on the theories of Zellig Harris. Journal of Biomedical Informatics. 2002;35(4):222–235.
Golder, S. and Loke, Y. K. The contribution of different information sources for adverse effects data. International Journal of Technology Assessment in Health Care. 2012;28(2):133–137.
Gomes, B., Hayes, W. and Podowski, R. M. Text mining. In In Silico Technologies in Drug Target Identification and Validation, ed. Leon, D. and Markel, S. (pp. 153–194). Boca Raton, FL: CRC Press, 2006.
Jensen, P. B., Jensen, L. J. and Brunak, S. Mining electronic health records: Towards better research applications and clinical care. Nature Reviews Genetics. 2012;13(6):395–405.
Jiang, Y., Lin, C., Meng, W., et al. Rule-based deduplication of article records from bibliographic databases. Database (Oxford). 2014;2014:bat086.
Jones, C. W., Handler, L., Crowell, K. E., et al. Non-publication of large randomized clinical trials: Cross sectional analysis. British Medical Journal. 2013;347:f6104.
Kabiljo, R., Clegg, A. B. and Shepherd, A. J. A realistic assessment of methods for extracting gene/protein interactions from free text. BMC Bioinformatics. 2009;10:233.
Kiritchenko, S., de Bruijn, B., Carini, S., Martin, J. and Sim, I. ExaCT: Automatic extraction of clinical trial characteristics from journal publications. BMC Medical Informatics and Decision Making. 2010;10:56.
Kulkarni, A. V., Aziz, B., Shams, I. and Busse, J. W. Comparisons of citations in Web of Science, Scopus, and Google Scholar for articles published in general medical journals. Journal of the American Medical Association. 2009;302(10):1092–1096.
Leaman, R., Wojtulewicz, L., Sullivan, R., et al. Towards internet-age pharmacovigilance: Extracting adverse drug reactions from user posts to health-related social networks. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. 2010;117–125.
Loging, W., Rodriguez-Esteban, R., Hill, J., Freeman, T. and Miglietta, J. Cheminformatic/bioinformatic analysis of large corporate databases: application to drug repurposing. Drug Discovery Today. 2011;8(3–4):109–116.
Martin, E. P. G., Bremer, E. G., Guerin, M., DeSesa, C. and Jouve, O. Analysis of protein/protein interactions through biomedical literature: Text mining of abstracts vs. text mining of full text articles. In Knowledge Exploration in Life Science Informatics (pp. 96–108). Lecture Notes in Computer Science. New York, NY: Springer, 2004.
McIntosh, T. and Curran, J. R. Challenges for automatically extracting molecular interactions from full-text articles. BMC Bioinformatics. 2009;10:311.
Miwa, M., Thompson, P. and Ananiadou, S. Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics. 2012;28(13):1759–1765.
Miwa, M., Pyysalo, S., Ohta, T. and Ananiadou, S. Wide coverage biomedical event extraction using multiple partially overlapping corpora. BMC Bioinformatics. 2013;14:175.
Primo Peña, E., Vázquez Valero, M. and García Sicilia, J. Comparative study of journal selection criteria used by MEDLINE and EMBASE, and their application to Spanish biomedical journals. The 9th European Conference of Medical and Health Libraries. 2004.
Pyysalo, S., Salakoski, T., Aubin, S. and Nazarenko, A. Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches. BMC Bioinformatics. 2006;7(Suppl 3):S2.
Ramakrishnan, C., Patnia, A., Hovy, E. and Burns, G. A. Layout-aware text extraction from full-text PDF of scientific articles. Source Code in Biology and Medicine. 2012;7(1):7.
Rodriguez-Esteban, R. Methods in Biomedical Text Mining (dissertation). New York, NY: Columbia University, 2008.
Rodriguez-Esteban, R. and Iossifov, I. Figure mining for biomedical research. Bioinformatics. 2009;25(16):2082–2084.
Schmoch, U. Indicators and the relations between science and technology. Scientometrics. 1997;38(1):103–116.
Schuemie, M. J., Weeber, M., Schijvenaars, B. J., et al. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics. 2004;20(16):2597–2604.
Searls, D. B. Mining the bibliome. Pharmacogenomics J. 2001;1(2):88–89.
Shultz, M. Comparing test searches in PubMed and Google Scholar. Journal of the Medical Library Association. 2007;95(4):442–445.
Swinney, D. C. and Anthony, J. How were new medicines discovered? Nature Reviews Drug Discovery. 2011;10(7):507–519.
Tanabe, L., Scherf, U., Smith, L. H., et al. MedMiner: An Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999;27(6):1210–1214, 1216–1217.
The Europe PMC Consortium. Europe PMC: A full-text literature database for the life sciences and platform for innovation. Nucleic Acids Research. 2014;pii: gku1061.
Tikk, D., Thomas, P., Palaga, P., Hakenberg, J. and Leser, U. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Computational Biology. 2010;6:e1000837.
Tomanek, K., Wermter, J. and Hahn, U. Sentence and Token Splitting Based on Conditional Random Fields. Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007). Melbourne, Australia. 2007.
Van Landeghem, S., Hakala, K., Rönnqvist, S., et al. Exploring biomolecular literature with EVEX: Connecting genes through events, homology, and indirect associations. Advances in Bioinformatics. 2012;2012:582765.
Van Noorden, R. Trouble at the text mine. Nature. 2012;483(7388):134–135.
Van Noorden, R. Text-mining spat heats up. Nature. 2013;495(7441):295.
Van Noorden, R. Elsevier opens its papers to text-mining. Nature. 2014;506(7486):17.
Verspoor, K., Cohen, K. B., Lanfranchi, A., et al. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics. 2012;13:207.
Vlachos, A. and Craven, M. Biomedical event extraction from abstracts and full papers using search-based structured prediction. BMC Bioinformatics. 2012;13(Suppl 11):S5.
Weiss, G. M. Mining with rarity: a unifying framework. SIGKDD Explorations Newsletter. 2004;6:7–19.
Winnenburg, R., Wächter, T., Plake, C., Doms, A. and Schroeder, M. Facts from text: Can text mining help to scale-up high-quality manual curation of gene products with ontologies? Briefings in Bioinformatics. 2008;9(6):466–478.
Xu, S., Yoon, H. J. and Tourassi, G. A user-oriented web crawler for selectively acquiring online content in e-health research. Bioinformatics. 2014;30(1):104–114.
