Developing Understanding in an Ocean of Facts

The Duke Superfund Research Center’s primary research goal is to develop a better understanding of the world. The omics revolution is generating an ever-increasing deluge of facts. Big data can be very helpful for developing that understanding, but it presents challenges as well. We know from the past that understanding can be limited by too few facts, but we are now discovering that understanding can also be clouded by too many facts, when the great majority of those facts are not relevant to the processes we seek to understand.

Unbiased collection of information, gathered without a specific hypothesis in view, certainly yields a great deal of data, but the potential value of each datum is lower than that of data collected in hypothesis-testing studies. Unbiased data collection offers opportunities to discover unanticipated relationships that testing of specific hypotheses might miss. However, wading through the mass of data to sort the relevant from the irrelevant becomes an ever-greater challenge, and we must guard against misinterpreting spurious relationships as real. Teasing real relationships out of so much data requires safeguards such as the Bonferroni and similar statistical corrections, which impose ever more stringent significance thresholds as the number of tests grows. Such corrections are effective at keeping chance findings from being interpreted as real, but they also hide relationships that fall short of the most extreme. In evolved complex systems such as the biology of organisms, the most important processes are often the most tightly controlled, so relatively small changes in well-controlled systems can be key to understanding biology. Yet these are exactly the signals that conservative Bonferroni-style corrections are apt to screen out.
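The arithmetic behind this trade-off is simple to illustrate. The sketch below (with hypothetical numbers chosen for illustration) shows how the Bonferroni correction divides the family-wise error rate by the number of simultaneous tests, so that a modest but real association that is clearly significant in a single test vanishes below the threshold at omics scale:

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / num_tests

alpha = 0.05     # conventional family-wise error rate
p_value = 1e-4   # a modest but real association (assumed, for illustration)

for num_tests in (1, 100, 20_000):  # 20,000 ~ a genome-wide omics screen
    threshold = bonferroni_threshold(alpha, num_tests)
    print(f"{num_tests:>6} tests: threshold = {threshold:.2e}, "
          f"p = {p_value:.0e} significant? {p_value < threshold}")
```

With one test the association is easily significant; with 20,000 tests the corrected threshold drops to 2.5e-6 and the same p-value no longer clears it, which is precisely the "guarding" effect described above.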

Just because a relationship is real does not mean that it is important. We need to discern which facts drive important biological processes and which are incidental. Attending only to the most extreme relationships is akin to looking for our keys under the statistical streetlamp, not because they are the most likely to be there, but because they are the easiest to see. We need to think of new approaches to developing understanding from ever-bigger data sets, and not be blinded by the fog of a myriad of unimportant facts.

One idea to add to the conversation would be to take advantage of the ever-decreasing cost of omics analyses and repeat tests rapidly in sequence. We could take repeated samples in quick succession after a perturbation to see the sequential pattern of effects as one process in the biological system influences others, set against data that are either static or changing more randomly. We could then identify not just the extreme peaks of the Manhattan plot, but also begin to listen, so to speak, to the physiologically important and well-regulated conversations along the streets and avenues of Manhattan. Small changes in the signaling of well-controlled processes are apt to play important roles in controlling function in the organism. In our quest for better understanding, we need to adopt strategies for swimming through great oceans of data that still allow us to stop and think.
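The statistical intuition behind repeated sampling can be sketched in a few lines. In this hypothetical simulation (the shift size, noise level, and sample counts are all assumed for illustration), a small but consistent post-perturbation shift is invisible in any single noisy measurement, yet emerges clearly when the same variable is sampled repeatedly and the samples averaged, because the standard error shrinks with the square root of the number of samples:

```python
import random

random.seed(0)  # reproducible illustration

TRUE_SHIFT = 0.3  # small real effect (assumed), in units of the noise SD
NOISE_SD = 1.0

def sample(shift: float, n: int) -> list[float]:
    """n repeated noisy measurements of a variable with a true mean shift."""
    return [shift + random.gauss(0.0, NOISE_SD) for _ in range(n)]

for n in (1, 10, 100, 1000):
    mean = sum(sample(TRUE_SHIFT, n)) / n
    print(f"n={n:>4}: observed mean shift = {mean:+.2f}")
```

A single sample is dominated by noise, while the averages over many quick repeats converge toward the true 0.3 shift, which is the kind of well-regulated, modest signal that a one-shot extreme-value screen would discard.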

By Edward D. Levin, Ph.D., Duke University