PC204 – Challenge #1

The Case of the Watermarked Bacteria

In May 2010, Science published a paper by Craig Venter and others at the J. Craig Venter Institute (JCVI) describing the creation of the first self-replicating synthetic bacterial cell. In short, they took a bacterium's DNA sequence, modified it, synthesized physical DNA from the modified sequence, and transplanted that DNA into a cell, which then reproduced under the control of the new DNA to create a new bacterium. The link to the original paper is here.

One interesting feature of this synthetic bacterium is that it includes four "watermarks," special sequences of DNA that prove the bacterium was created artificially and is not natural. However, the authors didn't reveal how the watermarks were encoded. The DNA sequences were published, but how to extract their meaning was left as a puzzle. (See figure S1 on page 15 of supplementary material, here.)

JCVI also released quotations used in the watermarks. These are:

"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE."
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."
"WHAT I CANNOT BUILD, I CANNOT UNDERSTAND."
(Obviously that's not the only text contained in the watermarks.)

The goal of this challenge is to write a Python program that figures out what the watermarks say. The first person in this year's class to submit a program that successfully decodes the watermarks wins a super-cool UCSF Chimera T-shirt.

This is actually kind of a tough problem to solve, so we'll give you several hints. First, just as with classical DNA codons, the watermark sequences need to be read as DNA triplets. But unlike in a biological system, you don't need to worry about frameshifts. So all you need to do to crack the code is figure out the mapping between the characters and the DNA triplets based on the quotations given above. Second, you can't assume that the start of the triplets corresponds to the start of one of the quotations. Third, the first watermark doesn't contain one of the quotations, but it does contain HTML, complete with a DOCTYPE string. (That's right, JCVI actually put HTML into the DNA of a living organism.)
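
To make the hints concrete, here is a minimal sketch of one possible way to start (not the intended solution). It splits a sequence into triplets and slides a known quotation along them, keeping only the positions where the pairing is consistent. It assumes the code is one-to-one (each character always maps to the same triplet and vice versa), which the hints suggest but do not state outright. The function names and the tiny `toy_code` table are invented for illustration and have nothing to do with the real watermark code.

```python
def split_triplets(seq):
    """Split a DNA string into consecutive 3-letter chunks, dropping any incomplete tail."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def consistent_mapping(triplets, text):
    """Pair triplets with characters; return a triplet->char dict, or None on a conflict."""
    forward = {}   # triplet -> character
    backward = {}  # character -> triplet
    for triplet, char in zip(triplets, text):
        # setdefault stores the value the first time and returns the stored one after that
        if forward.setdefault(triplet, char) != char:
            return None
        if backward.setdefault(char, triplet) != triplet:
            return None
    return forward


def find_alignments(seq, text):
    """Try the known text at every triplet offset; return (offset, mapping) for each fit."""
    triplets = split_triplets(seq)
    hits = []
    for start in range(len(triplets) - len(text) + 1):
        mapping = consistent_mapping(triplets[start:start + len(text)], text)
        if mapping is not None:
            hits.append((start, mapping))
    return hits


# Made-up code table, used only to build a test sequence for this sketch.
toy_code = {"A": "GCT", "B": "TTA", "N": "CAG", " ": "ATA", "X": "GGG"}
toy_seq = "".join(toy_code[c] for c in "X BANANA X")

hits = find_alignments(toy_seq, "BANANA")
print(hits)
```

Short quotations with few repeated letters will fit at many offsets by chance, so longer known text with repeated characters is what narrows the candidates down.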

You'll find that once you get a partial solution, you can add your own "quotation" to the list of known text in the watermarks and then re-run your program to get a more complete solution. Lastly, a few triplets only appear once in the watermarks, so there's no way to unambiguously determine the correct triplet-to-character mapping for these.
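
The iterate-and-extend idea can be sketched as follows, again as an illustration under assumptions rather than a reference solution. `decode` renders a list of triplets with whatever partial mapping you have, marking unknown triplets with a placeholder so you can guess at the surrounding words, and `singleton_triplets` lists the triplets that occur only once across all watermarks. Both helper names and the toy data are hypothetical.

```python
from collections import Counter


def decode(triplets, mapping, unknown="?"):
    """Translate triplets with a partial triplet->char mapping; mark gaps with `unknown`."""
    return "".join(mapping.get(t, unknown) for t in triplets)


def singleton_triplets(watermarks):
    """Return the triplets that appear exactly once across all watermark triplet lists."""
    counts = Counter(t for triplets in watermarks for t in triplets)
    return sorted(t for t, n in counts.items() if n == 1)


# Toy data (not real watermark content): a partial mapping and one short "watermark".
partial = {"TTA": "B", "GCT": "A", "CAG": "N"}
toy_watermark = ["TTA", "GCT", "CAG", "GCT", "CAG", "GCT", "ATA", "GGG"]

decoded = decode(toy_watermark, partial)
rare = singleton_triplets([toy_watermark])
print(decoded)  # BANANA??
print(rare)     # ['ATA', 'GGG', 'TTA']
```

When a partly decoded stretch reads like a recognizable word or HTML tag, you can add your guess to the list of known text, re-run the alignment step, and merge any newly confirmed triplets into the mapping.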

The four watermarks are here.

Send your successful program to pc204@cgl.ucsf.edu