# Andy's Math/CS page

## Tuesday, November 06, 2007

### Goofball DNA-Bonanza

Since this blog's readership is, I'm guessing, essentially a (very-)proper subset of the readership of Shtetl-Optimized, you may be familiar with his recent musings and observations on DNA rewriting rules (to which I refer readers before looking further at my post). I've been finding this stuff a fun alternative to my usual research (the last time I got this worked up over DNA might've been when I heard about ANDi the monkey), and have come up with three simple results in the past couple of days. The latest is a bit too long for Scott's comments section, so I wrote it up in LaTex and am putting it here.

I encourage all readers to generate, pose, and/or answer additional questions on related themes! Collaborative online research is lots of fun in my experience, and something I'm considering trying to do more of with this blog.

Let me explain the third result after reviewing the other two. But before I begin, I want to emphasize that I make no claims as to the biological realism of any of the models under consideration, though I'm happy to try to work out results in more realistic models if they still seem like interesting puzzles.

First, I showed that, in synthesizing a string x via Scott's model of DNA rewrite rules, the substring-reversal rule can be eliminated at the cost of increasing by constant factors the number of steps and the size of intermediate strings.

Then, I showed that a factor-2 increase in steps allows one to hold the intermediate string to within a factor 2 of the final string length at all times. This result, by constrast, holds whether one is trying to synthesize x de novo or to transform some other string x' into x (although I only wrote up the first case, the other is essentially the same analysis). In the process, I was especially pleased to introduce an analytical tool called 'garbage monsters'; Scott and I hope the terminology will catch on in mainstream biology.

The newest result is the following: suppose you want to implement an error-correcting code x --> C(x) via DNA rewrite rules, where the choices of operations made can depend on x; then unfortunately, something on the order of n/(lg n) operations are needed in the encoding process to ensure that the code is actually robust. A matching upper bound can be easily extracted from Scott's observations in his comments section, and it's a dispiriting bound because for that price, you could synthesize C(x) without even manipulating the string x that you're given! (A similar, simpler result holds for the decoding process, but I haven't written this up.)

For completeness, I'll reproduce my sketch proofs for the first two results below... but first let me note that monkeys continue to make news (thanks again, Chaya!).

***First Result, Comment 21 on Scott's post***

"So how powerful is the reversal operation, really?

Suppose we can produce string x in k applications of operations (1)-(4). I claim we can do it in 4k + 1 operations, without reversals. (Of course, this will show that the Lempel-Ziv approach is also a constant-factor approximation in our setting.)

First, note that, if you can produce x in k steps with (1)-(4), you can also produce x^R, i.e. the reversal of x, in k steps with (1)-(4), by ‘flipping’ each operation.

Second, note that you can produce x.x^R, their concatenation, in 2k steps by running both productions in parallel, interleaving steps on the two halves. If the i’th step in the original production of x was s, then the 2i’th step in the new production will be s.s^R.

Now, in this ‘joint’ production we eliminate reversals as follows: say in the production of x, at the i’th stage we wish to move from

s = a.b.c to s’ = a.b^R.c …

well, in the joint production, at stage 2i we’ve got the string

a.b.c.c^R.b^R.a^R

By two copy and two deletion operations, we reach

a.b^R.c.c^R.b.a^R,

as needed to advance both sides’ progress.

Finally, when the joint production’s finished, delete the right-hand side to yield x. That’s 4k + 1 steps."

***Second Result, Comments 30-31 on Scott's post***

"I believe I can show that, if we’re producing a string x of length n, we can keep the intermediate strings to length at most 2n, while incurring at most a factor-2 blowup in the number of steps used.

As Scott suggested to me in conversation, let’s keep track of which symbols in the intermediate workstring will and won’t make it into the final string x without deletion. If they won’t, call them ‘nonterminals’. (If a symbol will be replicated, and at least one of its descendants will be included in the final string, it’s not considered a nonterminal.)

First note that, if in a production of x there appears at some stage a block of consecutive nonterminals, we can modify the production so that they are treated as a ‘unit’ in subsequent steps, i.e. no string operation takes an endpoint that’s properly inside the block. They get shuffled around together and eventually destroyed together. The production’s length is not affected.

In a stage of such a production, for analysis purposes, we call a maximal block of nonterminals a ‘garbage monster’. Each time a new garbage monster is created (and many may be created by a single copy operation–even if there were no nonterminals in the original substring), we give it an unused, goofy monster name.

When string manipulations cause two garbage monsters to become adjacent, we combine them under one of their two names (and say that that one ‘eats’ the other one).

Key claim: in our modified type of production, at most one garbage monster gets eaten or destroyed at every step! Draw some pictures to see if you agree.

Now, with this fact in hand, we can modify the production again and argue it gives what we want.
Here’s the strategy: given a production of x of the type we described, modify the production so that between each step, we stop and delete all garbage monsters, one deletion per monster.

This is a bit wasteful since some of these might’ve been eaten by other garbage monsters; but if our production had k steps to begin with, only at most k monsters got eaten total, and all garbage monsters must die somehow, so we only add at most k new steps.

Finally, observe that the string is culled to length at most n after every new round of ‘executions’, and since the workstring’s length is at most doubled by any string operation, its length is bounded by 2n throughout."