Disposable Genomes: When Reading DNA Becomes Cheaper Than Storing the Data

Last Updated Apr 23, 2010 9:42 AM EDT

The fascinating thing about the emerging field of commercial genomics is how it ties together so many different areas of research -- from biology to computer science, and just about everything in between. Progress in all these disciplines is deeply interconnected, which is why data storage turns out to be as vital as lower-priced gene mapping to the commercial success of genomics technology.

As a result of this marriage, genome sequencing is getting much cheaper, and will soon reach the point where it can be used in a clinical setting and covered by health insurance. A chart created by Eric Lander of the Broad Institute shows that sequencing costs have dropped by a factor of 14,000 over the past decade, roughly 100 times faster than Moore's Law in semiconductors.
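
As a rough check on that comparison, the arithmetic can be sketched in a few lines of Python. The 18-month doubling period for Moore's Law is an assumption made here for illustration, not a figure from Lander's chart; a 24-month period would make the gap look larger still.

```python
# Back-of-the-envelope comparison of the two improvement rates.
# Only the 14,000x figure comes from the article; the doubling period
# below is an assumption.
YEARS = 10
SEQUENCING_COST_DROP = 14_000   # factor cited from Lander's chart
DOUBLING_PERIOD_YEARS = 1.5     # assumed Moore's Law doubling period


def moores_law_factor(years, doubling_period_years):
    """Improvement factor after `years` if capacity doubles every period."""
    return 2 ** (years / doubling_period_years)


moore = moores_law_factor(YEARS, DOUBLING_PERIOD_YEARS)
ratio = SEQUENCING_COST_DROP / moore

print(f"Moore's Law over {YEARS} years: about {moore:.0f}x")
print(f"Sequencing costs fell about {ratio:.0f} times further")
```

Under that assumption Moore's Law yields an improvement of about 100x over the decade, so a 14,000x drop in sequencing costs is ahead by a factor on the order of 100, consistent with the chart's claim.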

One obvious consequence is a likely explosion of stored genomic data, and a whole new raft of associated costs. At a recent panel discussion I attended -- Exploring Personal Genetics: The Brave New World -- Michael Goldberg, a partner with Mohr Davidow Ventures, made an interesting comment: "Moore's Law has now been married to biology in ways that were only conceptualized in the 1990s."

In response, David Magnus, the director of Stanford's Center for Biomedical Ethics, noted that data-storage costs are another key economic factor for the future of the genomics business. I recently caught up with Magnus on the phone and asked him to elaborate. If storage becomes economically burdensome, he said, mapped genomes may just become disposable. "If you assume the rate of acceleration for another ten to fifteen years, the cost is going to be nothing," he said. "At some point it will be easier to re-sequence rather than store the data on a chip or a server."

In an April 2009 report published in the journal Biotechniques, the authors write:
The cost of storing the gigabytes of raw data produced by each run of the Illumina GAII or AB SOLiD has been estimated to be greater than the cost of generating the data in the first place. It is now common practice to delete the raw image files once they have been processed to produce the relatively small text sequence and quality data files. While the long-term storage of the text sequence files is feasible using current tape and disc technology, maintaining the data in a readily usable form where it may readily be interrogated by users is more of a challenge.

There are currently projects to re-sequence 1000 human genomes, as well as multiple plant and animal varieties to identify genetic variation within species associated with phenotypic variation. The submission of complete re-sequence data to the international repositories would result in the storage of highly redundant data sets, bloating the size of the database and reducing the efficiency of queries. As an increasing number of reference genome sequences become available and the cost of re-sequencing continues to decline, the problem of data redundancy will increase to a point where storage within the primary data repositories becomes impractical.

The theory that sequence repositories will constantly increase in size is likely to be challenged with the increasing availability of reference genome sequences. Once a reference genome sequence has been produced, users are predominantly interested in variation from this reference.
Jay Flatley, the CEO of Illumina, recently told me that a single genome takes up only 40-50 megabytes of storage, but that bloat will become a problem in facilities holding hundreds of thousands of genomes.

"The biggest value to companies is to have the genomes of more than 5000 people because that is where scientists are going to learn the most about genetic variation," Flatley said. It's also where there's the greatest cost demand on the computing infrastructure in terms of storage, tracking software, aggregation, and privacy protection.

So the near future may not lie in more efficient storage of redundant data, but in better reference genome sequences and better access to them. That way doctors can sequence a genome, check it against the reference data, and then dispose of it, knowing it can always be re-sequenced. This would also help protect patients' privacy. As genomes increasingly lend themselves to radical file compression, however, even storage costs in large facilities may eventually become manageable compared with the cost of treating genomes as disposable.
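
To make the reference-based idea concrete, here is a toy sketch in Python of keeping only the differences between a sample and a reference sequence. The function names and the sequences are hypothetical, and the sketch assumes equal-length sequences with single-base substitutions only; real genomes also contain insertions and deletions, and real pipelines use dedicated variant formats rather than anything this simple.

```python
def diff_against_reference(reference, sample):
    """Return (position, base) pairs where the sample differs from the reference.

    Simplifying assumption: both sequences have the same length, so only
    single-base substitutions are handled (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def rebuild_from_reference(reference, variants):
    """Reconstruct the sample by applying the stored variants to the reference."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


# Toy data, not real genomic sequence.
reference = "ACGTACGTACGTACGT"
sample = "ACGTACCTACGTACGA"

variants = diff_against_reference(reference, sample)
print(variants)  # [(6, 'C'), (15, 'A')]
print(rebuild_from_reference(reference, variants) == sample)  # True
```

Here two small records stand in for the whole sample, which is the intuition behind the compression argument: the more a genome resembles its reference, the less there is to store.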

Photo Source: Davos_'s photostream
  • Rachel Lehmann-Haupt
