# Andy's Math/CS page

## Sunday, September 16, 2007

### Information is not a Substance

I wanted to share a cool idea from the young field of 'network coding', which explores techniques for efficiently disseminating information over networks like the web. I believe it originated in this seminal paper, which I learned about indirectly by reading Michael Mitzenmacher's very interesting research blog, My Biased Coin, which reflects his professional activity in network coding and is a good source (ref. #7 on this link is a good first article to read).

Consider the following directed network:

Here, each of the three top nodes has a stream of data it needs to communicate to the node of corresponding color on the bottom. Each edge shown can transmit one bit per second in the direction indicated, and the purple nodes in the middle are willing to assist in communication.

The perverse structure of this network seems to pose a problem: each node on top has two pink edges headed towards the wrong parties on the bottom. Since the bottom nodes cannot communicate among themselves, the pink edges appear useless. It would seem that all information flow must happen through the black edges, whose bottleneck would allow only one party to get a bit across each second.

WRONG! (puzzle: stop reading and figure out a better strategy.)

Here's the idea: say the top nodes want to send bits x, y, z respectively in a round. Each sends its bit on every available edge. The top purple node receives all three bits and computes their XOR, that is, x + y + z (mod 2). It sends this to the bottom purple node, which then broadcasts it to the three receivers.

Then, for instance, bottom-red receives y, z, and x + y + z, so it can add up all the bits it receives to get y + z + (x + y + z) = x, just what it wanted. Similarly for the other nodes. Repeating this process, the network transmits at three times the rate we were led to expect by our faulty initial reasoning (a line of thought which restricted us to the best so-called 'multicommodity flow' on the network).
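The round above is small enough to check exhaustively. Here is a minimal Python sketch (the function name and variable labels are mine, not from any network-coding library) verifying that every receiver recovers exactly its intended bit:

```python
from itertools import product

def network_round(x: int, y: int, z: int) -> tuple:
    """One round of the network: sources red/green/blue hold bits x, y, z."""
    # Each source sends its bit on every outgoing edge, so the top purple
    # relay hears all three bits and forwards only their XOR downstream.
    relay = x ^ y ^ z
    # Each receiver hears the two "wrong" bits directly over the pink edges,
    # plus the XOR broadcast, and cancels the unwanted bits by XORing again.
    red_out   = y ^ z ^ relay   # = x
    green_out = x ^ z ^ relay   # = y
    blue_out  = x ^ y ^ relay   # = z
    return red_out, green_out, blue_out

# Every receiver recovers its intended bit, for all 8 possible inputs.
assert all(network_round(x, y, z) == (x, y, z)
           for x, y, z in product((0, 1), repeat=3))
```

Note that only one bit crosses the black bottleneck per round, yet all three parties get a bit across.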

If these were roads, and we were sending formed substances to their destinations, our first approach would be valid. But information is not a substance--it obeys zany laws of its own, and can at times be merged, fractionalized, and delocalized in unexpected and useful ways (often involving that mischievous XOR). Better understanding the information capacity of networks more complex than the one above is a fascinating and important research frontier--so keep following Mitzenmacher's blog!


• Hi Andy,
I think this concept is similar to the concept of CDMA. The idea of Information not being a substance is introduced by the "data processing" operation (here 'xor'ing at the purple node). It is something like in a general setting, where i have a no idea of the information being sent the initial idea was right. Later when we introduce the idea that information is "binary-coded", we can do intelligent operation on the code words.

I actually have a practical worry about this. If we were to design this "fast" system, we would incur a lot of cost because of too many purple wires. That is, my question is:
Say I have 128 sources and destinations. By introducing a "few" more purple nodes in between and a more efficient "data-processing algorithm", can we save on the 127 purple lines per node?

By  Mad 'Spy' Hatter, at 3:56 AM

• If this communication scheme proves superior, it could perhaps put a spanner in the works for those who try to charge for network traffic based on content and sender/receiver. Not only is the content of transmissions between nodes unintelligible locally, but so are the exact identities of the recipients.

By  Anonymous, at 7:44 AM

• I agree, this is quite similar to what little I have read about CDMA.

If we were designing a network, and we were charged on a per-edge basis, then we probably wouldn't want a perverse setup as in the problem here. If all desired communications are between fixed, disjoint pairs of nodes as in the example, we'd want to just add direct edges joining these pairs.

On the other hand, if every source wants to talk to every destination every instant, every destination must have at least 128 incoming edges, no matter how many purple nodes we add.

But I think that in many settings, neither of these gives the right cost model or the right desired-communications model. We all want to talk to many different people (but not all of them, all the time), and we have a low-cost public channel--the air, with its many frequencies--at our disposal, along with more expensive private channels.

So let me know more about the detailed problem you're working on, and we can think about possible solutions.

By  Andy, at 1:09 PM

• anonymous--that's an interesting idea. Of course, we also have to factor in our desire for privacy and efficiency. In this network, everyone learns everything and too many packets get transmitted. But I bet there is a happy medium that can achieve some amount of anonymization cheaply (though I'm sure it depends on the surveillance model). I'll think about it.

By  Andy, at 1:15 PM

• Thanks for the plug. Please all, come visit MyBiasedCoin! (And comment. I enjoy comments.)

For more on network coding, you might check out the Network Coding Home Page (easily found on Google), and there are plenty of papers in both NetCod (the Network Coding Workshop) and ISIT (the International Symposium on Information Theory). The annual Allerton conference, too, is likely to have some good network coding work.

• First of all, I really like the title of your post, and the explanation on why you say that information is not a substance. (I would add that information _is_ a substance, but it acts in strange ways. If we want to know in what ways, we better ask Shannon).

Here's another interesting (and open!) problem where it's important to understand that information is not a substance. This was presented by Mihai Pătraşcu in the China theory week open problems session last week. The problem is simply: what is the communication complexity of approximating Hamming distance? Specifically: Alice and Bob each get an n-bit string. They want to communicate with each other in order to decide on a number that is between the Hamming distance of the two strings and the Hamming distance plus epsilon*n. Find a way to do this with as few bits as possible.

The best known upper bound is O(1/epsilon^2), obtained by choosing O(1/epsilon^2) random locations (using shared randomness) and calculating the distance at those locations. The approximation is correct (with high probability) by the Chernoff bound. The best known lower bound is Omega(1/epsilon), via a simple argument. (Exercise, or mail me.) What's the truth?
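A sketch of this sampling protocol in Python (the function name, the constant 4 in the sample count, and the fixed seed standing in for shared randomness are illustrative choices, not part of the problem statement):

```python
import random

def approx_hamming(x: str, y: str, eps: float, seed: int = 0) -> float:
    """Estimate the Hamming distance of equal-length bit strings x and y
    by sampling O(1/eps^2) shared-random positions."""
    n = len(x)
    k = min(n, int(4 / eps**2))       # number of sampled positions
    rng = random.Random(seed)         # stands in for shared randomness
    idxs = [rng.randrange(n) for _ in range(k)]
    # Alice sends her bits at the sampled positions (k bits of
    # communication); Bob compares them against his own string locally.
    mismatches = sum(x[i] != y[i] for i in idxs)
    return n * mismatches / k         # scaled-up estimate of the distance
```

By the Chernoff bound, the estimate is within eps*n of the true distance with high probability, using only O(1/eps^2) bits of communication.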

So, if information were a substance, then I think a lower bound of 1/epsilon^2 would be easy, because you would actually have to move 1/epsilon^2 units to the other side to get enough material across. (This is very informal, and probably false in many ways.) However, since you can do funky things with the bits, it's hard to get a lower bound.

By  elad verbin, at 11:09 AM

• Thanks for the feedback and problem, Elad!

Question: do you really require that the number decided on be at least the Hamming distance w.h.p., or just that it be at least (1 - eps)*(Hamming dist.)?

Doesn't make much difference, but in the first case one has to amend the sampling-based algorithm by slightly increasing one's estimate.

I wanted to point out to readers that this problem is an instance where randomness really, provably helps a lot. To see this, consider the same problem for deterministic protocols, or even this simpler problem (call it the 'close-pairs problem'):

GIVEN x, y to Alice and Bob resp., each of length n;

Accept iff |x - y| <= eps*n.

Now, I claim that any deterministic k(n)-round protocol for the close-pairs problem yields a deterministic k(n)-round protocol for checking equality of two strings x', y' of length c*n, where 0 < c < 1 is some absolute constant we could easily determine.

Why? To check if x' = y', let Alice and Bob run the close-pairs protocol on C(x'), C(y'), where C is a good error-correcting code.

There is a known (and easily proved) lower bound for deterministically checking equality: the obvious protocol where Alice just sends x' to Bob is optimal. Thus, via the reduction above, we derive a linear lower bound (with a slightly weaker constant) for the deterministic close-pairs problem.
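To see the reduction concretely at toy scale, here is a Python sketch using the classic Hamming(7,4) code as a stand-in for "a good error-correcting code" (for a real linear lower bound one needs a code family with constant rate and constant relative distance; the tiny parameters here are just for illustration):

```python
from itertools import product

def hamming74(d):
    """Encode 4 data bits into a 7-bit Hamming codeword (min distance 3)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (p1, p2, d1, p3, d2, d3, d4)

def dist(a, b):
    """Hamming distance between two equal-length tuples."""
    return sum(u != v for u, v in zip(a, b))

# With n = 7 and eps*n = 1: equal messages encode to identical codewords
# (distance 0 <= 1), while distinct messages encode to codewords at
# distance >= 3 > 1.  So the close-pairs answer on the codewords decides
# equality of the original 4-bit messages exactly.
for xp, yp in product(product((0, 1), repeat=4), repeat=2):
    close = dist(hamming74(xp), hamming74(yp)) <= 1
    assert close == (xp == yp)
```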

By  Andy, at 11:58 AM

• To address the question of the 'substantiality' of information, I think that in the classical setting of one-way communication it behaves much like a substance, indeed amazingly so. But adding more communicating parties and more directions of communication makes information behave progressively more strangely. The network coding example here is one example; the variety of open 'direct-sum' conjectures in communication complexity are another (although there, the issue is to some extent that we still expect communication to behave nicely, but we cannot yet prove that it does so.)

By  Andy, at 12:04 PM
