Information Theory and CS
'Information Theory' used to bug me as an undergrad. Its title suggested enormous importance, yet rarely did it overtly appear in my study of computer science, arguably the study of information in all its forms and uses. It seemed applicable only to one-way communication, and only in scenarios with stated probability distributions--too restrictive to justify grappling with that weird entropy formula.
Was its early popularity, along with the high-flying claims and ambitions of 'cybernetics', just a fad, a yearning for a master concept to unify the bewildering experience of modern life?
I won't attempt cultural history, but on the mathematical side, I've since found that info theory is more interesting and applicable than I had ignorantly thought. I'm not yet convinced that it needs developing in the CS curriculum, but it can be seen as covertly operative there and, moreover, it can be developed in a way that plays to the strengths and orientation of a computer scientist. Now I'm going to try to justify those statements. Part I will cover how to think about information theory from an algorithmic perspective; Part II will cover uses of information theory (implicit and explicit) in CS.
Part I. Understanding Entropy
First, a disclaimer: I'm not attempting a complete exposition of the concept, and I presume some familiarity on the reader's part, not only to fill in the considerable blanks, but also to read with a critical eye for any inaccuracies. This is a blog, folks. That said, I think what I have to say will likely be useful even to those with some familiarity with the concept, since thinking it out certainly was useful for me.
As I see it, there are three ways to think about entropy (with my personal favorite given third for rhetorical effect):
i) Pedantically, beginning with the formula. Advantage: the concept has a clear arithmetic form from the start, and basic facts can be derived with nothing fancier than calculus. Disadvantage: why should we care about this quantity? It's retrospectively motivated by the little miracles of its various properties, and then the big miracle of Shannon's Coding Theorem, but shouldn't the learning process bear some resemblance to the discovery process?
ii) Axiomatically: we want to talk about the 'uncertainty' of an event, which, obviously, should be nonnegative; two events, grouped, should have at most the sum of their individual uncertainties, with equality in the case of independent events; etc. That is, like analytic philosophers we enumerate the intuitive rules with which we already talk about uncertainty, and then we try to fit some (hopefully uniquely determined) explicit concept to these constraints.
This approach is a powerful avenue of discovery for mathematics, especially for those who enjoy mathematical ideas with anthropomorphic content. However, there can be no presumption that an intuitive concept of everyday life can have rigorous coherence, let alone application to technical problems.
iii) Define entropy as the answer to a technical question of obvious importance: at what minimum expected rate, asymptotically, can we communicate outcomes of a series of independent, identically distributed copies of a random variable X over a reliable binary channel? Let the inverse of this rate, i.e., the amortized encoding cost in bits, be called EnC(X). (Technical note: all 'minimums' and 'maximums' in this sketch should really be infimums and supremums.)
Let H(X) be the arithmetic entropy formula; we'll (sketch how to) show that EnC(X) = H(X); this is the Coding Theorem.
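For readers who want the formula in front of them: H(X) = -sum over outcomes x of Pr[X = x] * log2 Pr[X = x], measured in bits. A two-line Python sketch, with illustrative distributions of my own choosing:

```python
from math import log2

def H(dist):
    """Shannon entropy, in bits, of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

print(H({'heads': 0.9, 'tails': 0.1}))   # a biased coin: about 0.469 bits
print(H({i: 0.25 for i in range(4)}))    # a fair four-sided die: exactly 2.0 bits
```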
EnC(X) <= H(X) is actually a fairly easy consequence of analyzing the familiar idea of encoding likelier outcomes with shorter codewords; the only difficulty is that there is an O(1) 'fudge factor' in the encoding rate that can only be suppressed by passing to the amortized case. But note: here, as elsewhere in this treatment, the entropy formula emerges naturally in an average-case analysis of a concrete algorithm.

How can we show the more interesting half, the matching lower bound on EnC(X)? By introducing a seemingly distinct problem: that of 'extracting' many unbiased random bits from many independent copies of the random variable X. We allow the extractor to output a variable number of bits, possibly zero. Let ExtR(X) be the maximum asymptotically achievable extraction rate (i.e., the maximum c such that there exists an extraction scheme that, for large m, takes as input m i.i.d. copies of X and outputs an expected number of about c*m unbiased bits). We sketch how to show the inequalities EnC(X) >= ExtR(X) >= H(X), from which it follows that EnC(X) = H(X). But first, let's pause and see how this approach (proving the Coding Theorem first, using the alternative extraction characterization of entropy) renders the basic entropy inequalities transparent and algorithmic.
H(X) >= 0: Since entropy is shown equal to a minimum expected encoding cost, it must be nonnegative.
H(f(X)) <= H(X) for all functions f: any encoding scheme for X is easily convertible into a scheme for f(X) (essentially, IS such a scheme already) of the same expected rate.

H(X, Y) <= H(X) + H(Y): essentially, we can concatenate encoding schemes for X and Y to get one for (X, Y), once we figure out a way to cheaply disambiguate where the first encoding ends and the second begins. Prefix-free codes, e.g. Huffman codes, are an elegant way to do this, and can be shown to match the asymptotic rates of arbitrary codes.

H(X, Y) = H(X) + H(Y) for independent X, Y: here we use the bit-extraction characterization of entropy: if X and Y are independent, we can concatenate the outputs of extraction schemes for X and Y, and the output will be unbiased, with an expected number of bits equal to the sum of the components' expected outputs. So H(X, Y) >= H(X) + H(Y) in this case.
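To make the prefix-free idea concrete, here is a minimal Huffman-construction sketch (my own illustration; nothing in the argument above depends on this particular code). Because no codeword is a prefix of another, concatenated codewords decode unambiguously, which is exactly what lets us glue an X-scheme and a Y-scheme into an (X, Y)-scheme; and the expected codeword length already lands within one bit of H:

```python
import heapq
from itertools import count
from math import log2

def huffman_code(dist):
    """Build a prefix-free (Huffman) code for a distribution {outcome: probability}."""
    tiebreak = count()  # keeps heap comparisons away from the (unorderable) dicts
    heap = [(p, next(tiebreak), {x: ''}) for x, p in dist.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        merged = {x: '0' + c for x, c in code0.items()}
        merged.update({x: '1' + c for x, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

dist = {'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.1}
code = huffman_code(dist)
H = -sum(p * log2(p) for p in dist.values())
avg_len = sum(p * len(code[x]) for x, p in dist.items())

print(code)                   # e.g. {'a': '0', 'b': '10', 'd': '110', 'c': '111'}
print(H, avg_len)             # expected length is within one bit of H
print(code['a'] + code['d'])  # encoding of the pair (a, d): decodes unambiguously
```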
H(B) = 1 for unbiased bits B: <= follows from the naive encoding; >=: otherwise you could use k bits to extract an expected number of k' > k bits, not possible (some outcome must be too likely).
In addition to furnishing these results, the extractor characterization also yields intuition about randomness itself: for some purposes, a random variable can be considered a vessel for its 'essential randomness' H(X), which can be converted into standard form as unbiased bits, and also can be (though I won't elaborate on this) converted into simulated independent draws from any other random variable Y, with the simulation 'cost' in X-samples given by the ratio of entropies H(Y)/H(X). Entropy gives a system of 'exchange rates' for randomness in which there's no possible arbitrage.
Now we continue. ExtR(X) >= H(X): this follows from analysis of an extraction algorithm modeled on the classic Von Neumann extractor. The idea that allows extraction of random bits is that the possible outcomes of a finite series of copies of X can be partitioned into classes according to the frequency distribution of the various values of X, and each member of a class has the same probability of occurring, so an X-sequence gives us access to a kind of variable-size uniform distribution.
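As a warm-up aside of mine, here is the classic Von Neumann extractor just mentioned: read a biased coin in pairs, emit 0 for the pair 01, 1 for 10, and nothing for 00 or 11; the two surviving pairs are equally likely, so the output bits are unbiased. (This simple version does not achieve the full rate H(X); that's what the frequency-class idea above is for.)

```python
import random

def von_neumann_extract(bits):
    """Classic Von Neumann extractor: read i.i.d. biased bits in pairs;
    emit 0 for the pair (0, 1), 1 for (1, 0), nothing for (0, 0) or (1, 1).
    The two surviving pairs each have probability p*(1-p), so output bits are unbiased."""
    out = []
    for a, b in zip(bits[::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out

biased = [1 if random.random() < 0.8 else 0 for _ in range(10000)]
unbiased = von_neumann_extract(biased)
print(len(unbiased), sum(unbiased) / len(unbiased))  # roughly 1600 bits, about half of them ones
```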
Uniform distributions can be extracted from in a way illustrated by an example: given a uniform distribution on 13 elements, we assign bit-outputs as follows: 13 = 2^3 + 2^2 + 2^0, so let outcomes 1 thru 8 correspond to the 3-bit outputs 000 thru 111, let 9 thru 12 correspond to 00 thru 11, and let 13 correspond to the zero-bit output. Applying this idea to the more complex setting, and reasoning about the likely frequency distributions with many copies of X (this is the hardest analysis in the development), we get ExtR(X) >= H(X).
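In code, the 13-element trick looks like this (a hypothetical helper of my own, with outcomes indexed 0 thru 12 rather than 1 thru 13): write n in binary, carve the n outcomes into blocks of sizes 2^k, and map each block onto the bit strings of length k. Conditioned on landing in a given block, the output is uniform over strings of that length.

```python
def extract_from_uniform(n, outcome):
    """Given an outcome drawn uniformly from {0, ..., n-1}, return a bit string.
    Decompose n into powers of two (e.g. 13 = 8 + 4 + 1); outcomes falling in the
    block of size 2^k map to the 2^k bit strings of length k."""
    offset = 0
    for k in reversed(range(n.bit_length())):      # largest power of two first
        if n & (1 << k):
            if outcome < offset + (1 << k):
                return format(outcome - offset, 'b').zfill(k) if k > 0 else ''
            offset += 1 << k
    raise ValueError("outcome out of range")

# The 13-element example: outcomes 0..7 -> 3 bits, 8..11 -> 2 bits, 12 -> nothing.
for outcome in range(13):
    print(outcome, repr(extract_from_uniform(13, outcome)))
```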
EnC(X) >= ExtR(X): Finally, an impossibility result (on encoding schemes). The idea: given an encoding scheme for X, if you can run an extractor on m copies of X to generate an expected c*m random bits, you can run the same extractor on the random variable C[X] that is the binary *encoding* of X (first decode, then extract). If EnC(X) were less than ExtR(X), we could take c close to ExtR(X) and m large: the encodings of m copies of X would have an expected length of roughly EnC(X)*m bits, yet we would be extracting an expected c*m > EnC(X)*m unbiased bits from them--and, as in the argument for H(B) = 1 above, you cannot extract more expected unbiased bits than you read in. So EnC(X) >= ExtR(X), completing the proof that EnC(X) = H(X).
Part II. Using Entropy
Computer scientists often refer to "information-theoretic proofs" to describe impossibility proofs that show that a computer system cannot achieve some goal by arguing that it doesn't have access to the right information. For example, we can't reliably compute the OR of n bits by looking at only (n-1) bits (even adaptively chosen) because, if these all come up 0, then we literally don't know the answer. Such arguments can be couched in terms of an 'adversary' who forces us to fail, but they can also be formulated by letting the adversary be a certain kind of random process, and applying entropy analysis.
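Here is that adversary as a toy program (the strategy interface is my own contrivance, not a standard one): it answers 0 to every query, and if any of the n positions goes unqueried, it exhibits two inputs consistent with everything seen whose ORs differ, so the algorithm's answer is wrong on one of them.

```python
def run_against_adversary(strategy, n):
    """Adversary for computing the OR of n bits with at most n-1 queries:
    answer 0 to every query. If some position is never queried, both the
    all-zeros input (OR = 0) and the input with a 1 only in that position
    (OR = 1) are consistent with the answers seen, so any guess is wrong
    on one of them. `strategy` takes (n, query) and returns its guess for the OR."""
    queried = set()
    def query(i):
        queried.add(i)
        return 0
    guess = strategy(n, query)
    unqueried = [i for i in range(n) if i not in queried]
    if unqueried:
        consistent_inputs = ([0] * n,
                             [1 if i == unqueried[0] else 0 for i in range(n)])
        return guess, consistent_inputs   # guess is wrong on exactly one of these
    return guess, None

# A strategy that reads only n-1 bits and guesses the OR of what it saw:
def reads_n_minus_1(n, query):
    return max(query(i) for i in range(n - 1))

print(run_against_adversary(reads_n_minus_1, 5))
```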
The most common tool in "information-theoretic" arguments is the Pigeonhole Principle (PHP): if you put (n+1) pigeons into n holes, some hole must contain at least 2 pigeons. This tool diagnoses bottlenecks where two inputs with important differences are "treated the same" by a resource-bounded algorithmic process, and hence a mistake is made. Let's see why the PHP can be seen as an entropy argument.
Let pigeon p be assigned to hole h(p). If the PHP were false, we could find an injective assignment, hence there's a function g such that g(h(p)) = p for all p.
Now let P be the uniform distribution on the (n+1) pigeons; H(P) = log(n+1). On the other hand, for any function f and random variable X, we have H(f(X)) <= H(X), so H(P) = H(g(h(P))) <= H(h(P)) <= log(n), since h(P) is a distribution on n holes and the entropy of such a distribution is maximized by the uniform distribution. This contradiction proves the PHP. I'll leave it as an exercise to prove using entropy that sorting takes an expected log(n!) steps on the uniform distribution; this holds not just for comparison-based sorts but for any (possibly randomized) algorithm whose steps are binary-valued queries. Note here two strengths of entropy: its insensitivity to certain algorithmic details and its natural ability to give average-case lower bounds.
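To see that chain of inequalities with concrete numbers, here is a quick sanity check computed straight from the entropy formula (the particular hole assignment is arbitrary; any map of n+1 pigeons into n holes will do):

```python
from math import log2
from collections import Counter

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

n = 10
pigeons = list(range(n + 1))
h = {p: p % n for p in pigeons}            # some assignment of n+1 pigeons to n holes

P = {p: 1 / (n + 1) for p in pigeons}      # uniform distribution on pigeons
hole_counts = Counter(h[p] for p in pigeons)
hP = {hole: c / (n + 1) for hole, c in hole_counts.items()}   # distribution of h(P)

print(H(P), log2(n + 1))   # H(P) = log2(n+1), about 3.459
print(H(hP), log2(n))      # H(h(P)) <= log2(n), about 3.322: h(P) cannot determine P
```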
Can information theory apply to 2-way communication? I'll argue that it applies to the standard model of communication complexity as developed by Yao and expounded well in the book by Kushilevitz and Nisan. Two parties hold two strings x and y, and want to cooperate to compute some f(x,y) with minimal communication (each party is computationally unbounded). We can think of this scenario as a 1-player model, where the data is behind a curtain and the player, on each step, can make any query about the data that depends on only one of the two halves. This way, information flow is 1-way and entropy considerations can be used--maybe not to any added effect, but it can be done.
Does information theory have its limitations in CS? Absolutely. The uncertainty associated with a random variable is fundamentally different from the uncertainty associated with an intractable computational problem. A dominant methodology in complexity theory (and CS more generally), the 'black-box methodology', is arguably founded on the imperfect metaphor of intractability-as-entropy; its counsel: when you don't have the resources or ingenuity to compute some logically determined quantity, treat it as undetermined, like a random variable. The methodology is not invalid, since it amounts to a precaution rather than a faulty assumption; but there are sharp limits on the theorems that can be proven within it.
Oracle worlds show us these limits. The most basic illustration is the result that P != NP relative to a random oracle. One defines a simple search problem over the oracle's contents, argues that it's in NP relative to every oracle, and then gives a simple info-theoretic argument that any fixed polynomial-time algorithm fails with probability 1 over the choice of oracle. Since black-box methods relativize, this (along with an oracle equating P and NP) shows that we can't get either separation or collapse here without deeper, non-black-box techniques. But notice that information theory is gracious enough (and powerful enough) to show us its own limits in this domain! In fact, most basic oracle separations are viewable as applied information theory.
That's all for now, folks. Hope this convinces somebody to take another pass at an important concept of 20th-century mathematics.