Book contents
- Frontmatter
- Contents
- List of contributors
- Preface
- Part I Mathematical foundations
- Part II Big data over cyber networks
- Part III Big data over social networks
- Part IV Big data over biological networks
- 12 Inference of gene regulatory networks: validation and uncertainty
- 13 Inference of gene networks associated with the host response to infectious disease
- 14 Gene-set-based inference of biological network topologies from big molecular profiling data
- 15 Large-scale correlation mining for biomolecular network discovery
- Index
- References
15 - Large-scale correlation mining for biomolecular network discovery
from Part IV - Big data over biological networks
Published online by Cambridge University Press: 18 December 2015
Summary
Continuing advances in high-throughput mRNA probing, gene sequencing, and microscopic imaging technology are producing a wealth of biomarker data on many different living organisms and conditions. Scientists hope that increasing amounts of relevant data will eventually lead to a better understanding of the network of interactions between the thousands of molecules that regulate these organisms. Thus progress in understanding the biological science has become increasingly dependent on progress in understanding the data science. Data-mining tools have been of particular relevance since they can sometimes be used to effectively separate the “wheat” from the “chaff”, winnowing the massive amount of data down to a few important data dimensions. Correlation mining is a data-mining tool that is particularly useful for probing statistical correlations between biomarkers and recovering properties of their correlation networks. However, since the number of correlations between biomarkers grows quadratically with the number of biomarkers, the scalability of correlation mining in the big data setting becomes an issue. Furthermore, phase transitions govern correlation mining discoveries, and these must be understood for the discoveries to be reliable and of high confidence. This is especially important at big data scales where the number of samples is fixed and the number of biomarkers becomes unbounded, a sampling regime referred to as the “purely high-dimensional setting”. In this chapter, we will discuss some of the main advances and challenges in correlation mining in the context of large-scale biomolecular networks, with a focus on medicine. A new correlation mining application will be introduced: discovery of correlation sign flips between edges in a pair of correlation or partial correlation networks. The pair of networks could respectively correspond to a disease (or treatment) group and a control group.
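To make the quadratic growth and the sign-flip idea concrete, the following is a minimal brute-force sketch in Python with NumPy. It is not the chapter's method: the function names `num_pairs` and `sign_flip_edges`, the threshold parameter `rho`, and its default value are all assumptions made for illustration, and the sketch ignores the phase-transition and false-discovery questions the chapter raises.

```python
import numpy as np


def num_pairs(p):
    """Number of distinct biomarker pairs among p biomarkers: p(p-1)/2."""
    return p * (p - 1) // 2


def sign_flip_edges(x_case, x_ctrl, rho=0.8):
    """Return pairs (i, j), i < j, whose sample correlation has magnitude
    at least rho in BOTH groups but opposite sign across the two groups.

    x_case, x_ctrl: arrays of shape (n_samples, p), one column per biomarker.
    rho: hypothetical screening threshold (an assumption of this sketch).
    """
    # Dense p-by-p sample correlation matrices; this is the O(p^2) cost
    # that becomes the scalability bottleneck for large p.
    r_case = np.corrcoef(x_case, rowvar=False)
    r_ctrl = np.corrcoef(x_ctrl, rowvar=False)

    strong = (np.abs(r_case) >= rho) & (np.abs(r_ctrl) >= rho)
    flipped = np.sign(r_case) != np.sign(r_ctrl)

    # Keep only the strict upper triangle so each pair is reported once.
    i, j = np.where(np.triu(strong & flipped, k=1))
    return list(zip(i.tolist(), j.tolist()))


if __name__ == "__main__":
    print(num_pairs(1000))  # pairs to examine among 1000 biomarkers

    # Synthetic data: biomarkers 0 and 1 are positively correlated in the
    # "case" group and negatively correlated in the "control" group.
    rng = np.random.default_rng(0)
    n, p = 50, 5
    z = rng.standard_normal(n)
    x_case = rng.standard_normal((n, p))
    x_ctrl = rng.standard_normal((n, p))
    x_case[:, 0] = z
    x_case[:, 1] = z + 0.1 * rng.standard_normal(n)
    x_ctrl[:, 0] = z
    x_ctrl[:, 1] = -z + 0.1 * rng.standard_normal(n)
    print(sign_flip_edges(x_case, x_ctrl))
```

Because this sketch forms both full correlation matrices, it is only practical for modest numbers of biomarkers. In the purely high-dimensional setting described above, a fixed threshold such as `rho` would also need to be chosen with the sample size and number of biomarkers in mind.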
Introduction
Data mining at a large scale has matured over the past 50 years to a point where, every minute, millions of searches over billions of data dimensions are routinely handled by search engines at Google, Yahoo, LinkedIn, Facebook, Twitter, and other media. Similarly, large ontological databases like GO [1] and DAVID [2] have enabled large-scale text data mining for researchers in the life sciences [3].
Chapter in Big Data over Networks, pp. 409-436
Publisher: Cambridge University Press
Print publication year: 2016