What Will Google Do With Your DNA in the Cloud?

Cloud computing services from powerful tech companies like Google and Amazon put both analytical tools and large volumes of data on servers that can be accessed remotely. Cloud computing is a tool of growing importance for researchers in a wide array of fields, but especially for those studying the human genome and the insight that DNA can provide into health and disease.

Google is one of several major tech companies that are looking to get more genomics researchers to turn to powerful cloud computing platforms to store their data and conduct their research on human genetics. As Antonio Regalado reports for MIT’s Technology Review, Google is actively approaching hospitals and universities and proposing that they store and analyze their patients’ genomes in the cloud — in Google’s cloud, specifically.

The service is offered through Google Genomics, a cloud computing platform that the company launched last March with significantly less fanfare than other announcements received, like the company’s efforts to develop cancer-detecting nanoparticles. But as Regalado notes, Google Genomics could prove more significant than any of Google’s other health-related moonshots.

Empowered by the new capabilities of cloud computing, researchers will soon be able to connect and compare thousands, even millions, of genomes, using staggeringly large amounts of genetic data to make new medical discoveries. Google joins other major tech companies like Amazon, Microsoft, and IBM in competing to store universities’ and hospitals’ troves of genome data and to make that data more useful for research and virtual experiments.

As The Wall Street Journal reported in June, the raw digital data that represents a genome takes up approximately 100 gigabytes of storage, so only about 10 genomes fit on the typical desktop computer. Regalado reports that the polished version of a patient’s genetic code is much smaller — less than a gigabyte — and may make the move to a cloud storage platform even more economical.

Work on Google Genomics began 18 months ago, with the company consulting with scientists to build an API that enables them to move genetic data onto Google’s servers and perform experiments with it, using the same database technology that Google uses to index the Web and to track billions of Internet users. David Glazer, the software engineer who led the effort, told Technology Review: “We saw biologists moving from studying one genome at a time to studying millions. The opportunity is how to apply breakthroughs in data technology to help with this transition.” While skeptics think that genome data is too complex for Google, others are more confident that a sea change is on the horizon.

The Human Genome Project needed 13 years and $3 billion to complete the sequencing of the first human genome, which was actually a hypothetical or reference genome that used stretches of DNA from different people. The goal was to build a sort of “highway map” to the human genome, as Eric D. Green, director of the National Human Genome Research Institute at the National Institutes of Health, explained to The New York Times on the 10th anniversary of the project’s completion.

Since then, the technology has changed immeasurably. The Broad Institute in Cambridge, Massachusetts, told Technology Review that during the month of October, it decoded the equivalent of one human genome every 32 minutes. While that equaled about 200 terabytes of raw data, the amount of data that genomics companies handle is much smaller than what’s produced by larger Internet companies. Regalado reports that in the space of two months, the Broad Institute produces the equivalent of what gets uploaded to YouTube in one day, but the flow of data far exceeds what genetics researchers have dealt with in the past.
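The Broad Institute figures can be sanity-checked with some quick arithmetic. The sketch below is only a back-of-envelope estimate: it assumes a 31-day October, decimal units (1 terabyte = 1,000 gigabytes), and that the 200 terabytes is spread evenly across genomes. The function names are illustrative and do not come from any genomics tool.

```python
# Back-of-envelope check of the Broad Institute figures quoted above.
MINUTES_PER_GENOME = 32   # one genome equivalent every 32 minutes (from the article)
DAYS_IN_OCTOBER = 31
RAW_DATA_TB = 200         # approximate raw output for the month (from the article)


def genomes_per_month(days: int, minutes_per_genome: float) -> float:
    """Number of genome equivalents decoded in a month at a steady rate."""
    return days * 24 * 60 / minutes_per_genome


def gb_per_genome(total_tb: float, genomes: float) -> float:
    """Average raw data per genome, assuming 1 TB = 1000 GB."""
    return total_tb * 1000 / genomes


genomes = genomes_per_month(DAYS_IN_OCTOBER, MINUTES_PER_GENOME)
print(f"{genomes:.0f} genome equivalents in October")                 # 1395
print(f"{gb_per_genome(RAW_DATA_TB, genomes):.0f} GB raw per genome")  # 143
```

Under those assumptions, the month works out to roughly 1,400 genome equivalents at about 143 gigabytes each, which is in the same range as the approximately 100 gigabytes of raw data per genome cited by The Wall Street Journal.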

That means that researchers are looking to store and access their data at central locations. The National Cancer Institute, for instance, announced last month that it would pay $19 million to move copies of the 2.6 petabyte Cancer Genome Atlas into the cloud. The body of data, from several thousand cancer patients, will be housed in both Google Genomics and in Amazon’s data centers.

Sheila Reynolds, a research scientist at the Institute for Systems Biology in Seattle, told Regalado that cloud computing services will enable the creation of “cancer genome clouds,” where researchers can share information and run virtual experiments “as easily as a Web search.”

That opens new possibilities for researchers everywhere, but especially those at universities and hospitals that don’t have the capability to download and work on huge sets of data. Startups like Tute Genomics, Seven Bridges, and NextCode Health build “browsers” that can be used to explore genetic data. Google or Amazon acts as the back end for these genomics companies, and the rise of platforms and services built on top of these commercial clouds highlights the importance of cloud computing for genetics research.

Some think that doctors will eventually rely on an “Internet of DNA,” which they’ll be able to search by sequencing a specific patient’s genome, and potentially a tumor’s genome, and then querying the results against a database of millions of other genomes. Google and Amazon have engaged in a price war, which has brought down the cost of storing and analyzing DNA data in the cloud. The prices for such services are expected to continue to drop. That’s a good thing for researchers who would be hard-pressed to find another tool to handle the scale of the genetics research that cloud computing makes possible.

Genetic research has promise not only for wide fields like cancer but for any disease or condition with genes that might hint at how an illness progresses or who is most likely to develop a condition. When genetic mutations are linked to a disease, specifically targeted treatments can be developed and tested. Placing vast numbers of patients’ genetic data in a central, cloud-based location, where many researchers can access it, is expected to accelerate the process of finding connections among millions of pieces of data.

That, in turn, will change the process of identifying the biomarkers associated with a disease, and help develop effective treatments for diseases from cancer to autism. Google said to Technology Review that it charges $25 per year to store a genome, and more to perform computations on it. Google hopes that not only will the Google Genomics storage and analytical tools represent a good value for universities and hospitals, but that the platform will gain enough data on people’s genomes to create a truly valuable database that will aid researchers in treating, preventing, and curing disease.
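To put the quoted storage price in perspective, here is a rough sketch of what it implies at different scales. It uses only the $25 per genome per year figure from the article. The cohort sizes are hypothetical, and the estimate leaves out computation charges, which Google says cost extra.

```python
# Rough annual storage cost at the price quoted to Technology Review.
PRICE_PER_GENOME_PER_YEAR = 25.0  # USD, storage only (from the article)


def annual_storage_cost(num_genomes: int,
                        price: float = PRICE_PER_GENOME_PER_YEAR) -> float:
    """Yearly storage bill in USD, excluding any computation charges."""
    return num_genomes * price


# Hypothetical cohort sizes, chosen only for illustration.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:,} genomes: ${annual_storage_cost(n):,.0f} per year")
```

At that rate, storing the millions of genomes that researchers hope to compare would cost on the order of $25 million a year per million genomes, before any analysis is run.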
