Scientists say they’ve found a comparatively error-free way to store data in DNA, and have already used it to encode Shakespeare’s sonnets and MP3 music files.
The researchers, from the EMBL-European Bioinformatics Institute (EMBL-EBI), say their method makes it possible to store more than 100 million hours of high-definition video in a cupful of DNA.
“We already know that DNA is a robust way to store information because we can extract it from bones of woolly mammoths, which date back tens of thousands of years, and make sense of it,” says Nick Goldman of EMBL-EBI. “It’s also incredibly small, dense and does not need any power for storage, so shipping and keeping it is easy.”
While reading DNA is fairly straightforward, writing it has been more of a challenge. Using current methods, it’s only possible to manufacture DNA in short strings. In addition, both writing and reading DNA are prone to errors, particularly when the same DNA letter is repeated.
“We knew we needed to make a code using only short strings of DNA, and to do it in such a way that creating a run of the same letter would be impossible,” says EMBL-EBI associate director Ewan Birney.
“So we figured, let’s break up the code into lots of overlapping fragments going in both directions, with indexing information showing where each fragment belongs in the overall code, and make a coding scheme that doesn’t allow repeats. That way, you would have to have the same error on four different fragments for it to fail – and that would be very rare.”
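The idea Birney describes can be illustrated with a minimal Python sketch. The article does not specify the actual code, so everything here is an assumption made for illustration: the function names, the fixed six base-3 digits per byte, the starting letter "A", the step size, and the choice to keep each fragment's index as a plain number rather than encoding it in DNA. Each base-3 digit picks one of the three letters that differ from the previous letter, so a run of the same letter cannot occur. The sequence is then cut into overlapping fragments, with every other fragment reverse-complemented to stand in for "going in both directions".

```python
# Illustrative sketch only: names and parameters are hypothetical,
# not taken from the researchers' implementation.
NUCLEOTIDES = "ACGT"
TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, so six base-3 digits cover one byte
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def bytes_to_trits(data):
    """Write each byte as six base-3 digits, most significant first."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits


def trits_to_bytes(trits):
    """Inverse of bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for t in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)


def trits_to_dna(trits, previous="A"):
    """Each digit selects one of the three letters that differ from the
    previous letter, so no letter is ever repeated back to back."""
    letters = []
    for t in trits:
        choices = [n for n in NUCLEOTIDES if n != previous]
        previous = choices[t]
        letters.append(previous)
    return "".join(letters)


def dna_to_trits(dna, previous="A"):
    """Inverse of trits_to_dna."""
    trits = []
    for letter in dna:
        choices = [n for n in NUCLEOTIDES if n != previous]
        trits.append(choices.index(letter))
        previous = letter
    return trits


def reverse_complement(dna):
    """Read the opposite strand: complement each letter, then reverse."""
    return dna.translate(COMPLEMENT)[::-1]


def make_fragments(dna, step=25):
    """Cut dna into fragments of length 4 * step, each shifted by step.
    Odd-numbered fragments are reverse-complemented. Returns a list of
    (index, fragment) pairs; the index says where the fragment belongs."""
    length = 4 * step
    fragments = []
    for index, start in enumerate(range(0, len(dna), step)):
        piece = dna[start:start + length]
        if index % 2 == 1:
            piece = reverse_complement(piece)
        fragments.append((index, piece))
    return fragments


if __name__ == "__main__":
    message = b"Shall I compare thee to a summer's day?"
    dna = trits_to_dna(bytes_to_trits(message))
    print(dna[:40])
    print(len(make_fragments(dna)), "fragments")
    print(trits_to_bytes(dna_to_trits(dna)) == message)
```

Because each fragment is four steps long and starts one step after the previous one, every letter past the first three steps appears in four different fragments, which matches the redundancy Birney mentions. A real decoder would also need to compare the overlapping copies and resolve disagreements, which this sketch leaves out.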
To demonstrate that their method works, the researchers sent encoded data to California-based company Agilent Technologies: an .mp3 of Martin Luther King’s speech, ‘I Have a Dream’; a .jpg photo of EMBL-EBI; a .pdf of Watson and Crick’s seminal paper, “Molecular structure of nucleic acids”; a .txt file of all of Shakespeare’s sonnets; and a file that describes the encoding.
“We downloaded the files from the web and used them to synthesise hundreds of thousands of pieces of DNA – the result looks like a tiny piece of dust,” says Emily Leproust of Agilent.
Agilent mailed the sample to EMBL-EBI, where the researchers were able to sequence the DNA and decode the files without errors.
“We’ve created a code that’s error tolerant using a molecular form we know will last in the right conditions for 10,000 years, or possibly longer,” says Goldman. “As long as someone knows what the code is, you will be able to read it back if you have a machine that can read DNA.”