Big Data Brings Big Changes to Science


Image from Bill Schmarzo’s post “The Big Data Storymap”

The birthplace of Silicon Valley was not a fancy building, nor did it house a large team of fancy scientists. Rather, the innovative force that swept the South Bay of San Francisco began in 1939 with the work of two scientists, David Packard and William Hewlett, in a Palo Alto garage. Almost 75 years later, we are witnessing a new wave of innovation, and this time it is not brewing in a backyard Palo Alto garage. It is precipitating right in the clouds: Big Data clouds.


“Big Data” is a term used to describe a quantity of information so immense and complex that it is almost overwhelming for typical computer algorithms to process. Specially designed computing tools are required to curate and mine big datasets before they can be properly analyzed. Big Data can originate from large population sets, ranging from genome sequencing and healthcare record collection to wireless and cellular usage and social media sources. The flurry of knowledge and technological innovation that can arise from curating large population datasets is especially significant in the biomedical fields.


Systems biology approaches to medicine are inspiring new Big Data studies. One is Craig Venter’s latest venture, Human Longevity, which plans to sequence 40,000 human genomes a year in hopes of uncovering the causes of aging. Another is the upcoming 100K Wellness Project, pioneered by Leroy Hood at the Institute for Systems Biology (ISB) in Seattle, Washington, which aims to understand and capitalize on sequence and metabolite information from human patient volunteers. During his talk at the University of British Columbia in Vancouver, Hood emphasized the significance of predictive, preventive, personalized and participatory medicine, a systems approach to medicine he has coined “P4 Medicine.” He expects it to revolutionize healthcare through studies like the 100K Wellness Project, which create clouds of Big Data.


The 100K Wellness Project aims to monitor individual wellness over time in 100,000 patients, generating data clouds filled with biological information about individual patient and community wellness. These clouds can, in turn, seed countless spin-off companies in diagnostics, analytics and therapeutics: a Big Data Silicon Valley.


Microbiome, proteome, genome, transcriptome, virome, brain, and immunome studies, both underway and set to launch, are all contributing Big Data that can be capitalized on and that can provide insight into the prediction, prevention and management of diseases ranging from cancers to autoimmune and rare heritable diseases. The value of accumulating mass information about the biological unknown at an accelerating pace is undoubtedly priceless, but as with any new advancement in technology, the societal, financial and ethical costs must be weighed.


Big Data made a big appearance at this year’s American Association for the Advancement of Science (AAAS) annual meeting, where entire sessions over the course of the five-day gathering were dedicated solely to the discussion of Big Data. Its sources, its usefulness and applicability, the ethics of sharing, the energy required for its storage, privacy protection and societal implications were all of interest and concern to the scientists, journalists and members of the general public in attendance. With the cost of producing a raw megabase of genetic sequence down to less than 10 cents, the donation and collection of genetic data is expanding faster than our morals and our legislation can keep up. Establishing trust in Big Data collection for personal and research-based use is of particular concern with respect to large genomic studies. Companies like 23andMe have facilitated the collection of genomic data but now find themselves in controversy with the FDA, which has drawn attention to the ethics of balancing privacy and clinical care. Should researchers have access to big datasets in order to further clinical care? Should patients have access to the analysis and results of meta-analyses? Is there an inherent right to donate and/or share data when it concerns public health, and at what cost? These are all ethical questions that have arisen from Big Data collection in the healthcare and biomedical sectors, and they merit further attention.


At the AAAS meeting in February, concerns were raised about the re-identification of individuals from genomic databases. Yaniv Erlich, a computational biologist from the Whitehead Institute for Biomedical Research known for his hacker/Sudoku-type approaches to genomic re-identification, claimed during his session that only 33 bits of genetic information, or 200 single nucleotide polymorphisms out of a million, are needed to identify a person. Kristin Lauter of Microsoft Research offered reassurance in the same session: the privacy protection offered by homomorphic encryption solutions developed at Microsoft surpasses that of previous encryption schemes, and although these solutions are not yet standardized, no polynomial-time attack capable of breaking them has been found. John Wilbanks of Sage Bionetworks gave a personal and encouraging take in the session on the rights of patients with rare heritable diseases, and of their family members, to share data with genetic databases and receive results from them. He fosters the notion that we should invest our personal data in medical research advancements rather than monetize it, trusting that anonymity is kept. More of John Wilbanks’ ethos of pooling genetic data is discussed in his talk at TEDGlobal 2012.
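The 33-bit figure Erlich cited can be sanity-checked with a little arithmetic: to single out one person among N people, you need roughly log2(N) bits. The short sketch below is illustrative only. The function name is made up for this example, and the world population of about 7.2 billion is an assumed round number, not a figure from the talk.

```python
def bits_to_identify(population: int) -> int:
    """Smallest number of bits that gives every person a distinct label.

    Equivalent to ceil(log2(population)), computed exactly with integers.
    """
    if population < 1:
        raise ValueError("population must be at least 1")
    return (population - 1).bit_length()


# Assumed world population (about 7.2 billion); not a figure from the talk.
WORLD_POPULATION = 7_200_000_000

print(bits_to_identify(WORLD_POPULATION))  # 33
```

Because 2^32 is about 4.3 billion and 2^33 is about 8.6 billion, any population between those two values needs 33 bits, which is consistent with the number quoted in the session.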

Large-scale data generation in the physics realm, like the 25 petabytes (1 petabyte = 10^15 bytes) of data generated annually by the Large Hadron Collider, requires teams of specialized, tech-savvy scientists to handle the mass quantity of information. Managing Big Data places new technical demands on scientists. Some skeptics, as Jake VanderPlas discusses in his essay “The Big Data Brain Drain,” fear that the trend towards Big Data research will demand a new breed of scientist beyond the traditional researcher: one who must not only be trained in their field but must also possess advanced computing skills to keep up with the ever-accelerating data tsunami.


Another caveat to the Big Data buzz is that public understanding of, and access to, the data and its analysis is very limited. Pressure from federal agencies, researchers and the general public has prompted a movement towards the development of widely accessible tools that help visualize and simplify the complexity of Big Data sets. Leaders of large data experiments, like David Reitze of the California Institute of Technology and Trevor Mundel, president of Global Health at the Bill & Melinda Gates Foundation, are speaking out about the serious financial and technical efforts made by Big Data collaborators to open up access to the Big Data boom. At the AAAS meeting, Reitze discussed the efforts by the US Advanced Laser Interferometer Gravitational-Wave Observatory (LIGO) Scientific Collaboration to make the petabyte of gravitational wave data generated annually by LIGO broadly accessible online by 2015. Details of how they are going to make the data available, what data will be generated and how it will be analyzed are outlined in a data management plan published online in January of last year.

At the AAAS meeting, Mundel also discussed the pressure for public access to Big Data generated in the biomedical field and how it has driven the Gates Foundation to fund the development of the Global Burden of Disease (GBD) 2.0 tool. The tool is helping to draw attention to, and monitor, the spread and impact of non-communicable diseases, including autoimmune and rare heritable diseases, as well as communicable diseases like viral infections, at regional, national and global levels.


Lastly, in the age of climate woes, we must not forget the environmental impacts of innovation. Our interest in Big Data, and our capacity to support and analyze it, is accelerating with technological innovation. Just as the industrial revolution brought big success in modernization, it also brought the carbon saga plaguing our world’s ecosystems today. Big Data’s potential to revolutionize technology and our knowledge across many fields of science, from astrophysics to biomedicine, demands advancements not only in legislation and in the technical skills of its handlers; it also demands a lot of our energy. In fact, the former U.S. Secretary of Energy, Steven Chu, warned in his plenary talk at the AAAS meeting that the global power used by data centers for computation is greater than 30 GW. With Big Data projects on the rise, the implementation and advancement of renewable energy is critical not only for sustaining and progressing the technology, but also for keeping the world’s energy consumption in check.


We are not quite at “the singularity” that Ray Kurzweil predicts for our technological future, but with Big Data we are nearing a burst of translational information that is bringing big success and revolutionizing science and healthcare. We should embrace the acceleration of technological innovation that Big Data offers, yet still approach it with caution and with consideration of the ethical implications at personal, societal and environmental levels. Big Data is bringing big changes to science, and as William Pollard has so eloquently stated, “Without change there is no innovation, creativity, or incentive for improvement. Those who initiate change will have a better opportunity to manage the change that is inevitable.”