### Another Public Service Announcement

We've all heard that randomness and probability are becoming more and more important in computer science. A number of good references exist for learning the basic techniques of probabilistic analysis. Many of these techniques are radically simple and incredibly powerful, even for solving problems where probability does not seem to play a role (such as design or optimization problems where what is wanted is a highly 'uniform' or 'balanced' structure, e.g. in exhibiting Ramsey graphs, expanders/extractors, low-discrepancy sets, etc.). As such they are a very good investment. One can go very far, for instance, granted only familiarity with the meaning and some applications of the following simple concepts:

-linearity of expectation

-the second moment (a.k.a. variance) method

-union bounds

-Chernoff-style deviation bounds

As I've said in the past, The Probabilistic Method by Alon and Spencer is the most delightful way to develop an appreciation for this stuff. I have blogged before on some of these ideas, and will happily do so again, but today I want to discuss an aspect of probabilistic analysis that receives somewhat less thorough discussion in the references I've consulted.

Namely: given two random variables X, Y, how should we quantify their similarity or difference? As an example of a case where such a question might be important, suppose we have a randomized algorithm that correctly solves some problem with high probability if given access to fair coin flips, but we don't have a 'perfect' source of random bits; we want to find conditions under which an imperfect or pseudorandom source will be 'close enough' to the uniform distribution to play the role of the algorithm's random bits, without destroying its correctness.

Is there one 'right' notion of similarity, or do we need multiple notions depending on application? The answer is, 'both'. There is a very widely useful quantity, called the 'statistical distance', which is effective in many cases, but to recognize these cases it is important to have several perspectives on the matter.

I will present the definition and three variants. I encourage readers to submit any others, and also to point me to official nomenclature and references for the variants (I'm working from memory).

Let X, Y be random variables. Let's say they take on (real-number) values from the finite set {a_1, a_2, ... a_n}, just to keep the discussion nice and safe. Suppose X, Y take on value a_i with probabilities p_i, q_i respectively.

The 'statistical distance' d(X, Y) is then defined as

d(X, Y) = .5 * (Sum from i=1 to i= n: |p_i - q_i|).

(That's an absolute value in there.)

Also denoted ||X - Y||, the statistical distance should not be confused with the L_1 distance, which is just twice the statistical distance when applied to probability vectors.

Readers should verify the following: d(X, Y) is a metric, taking values between 0 and 1. d(X, Y) = 0 if and only if X, Y are identically distributed, and d(X, Y) = 1 if and only if X, Y never take on the same values.

Now consider the three following questions we could ask about X and Y:

1) Say we want to build a 'distinguisher' to tell apart the two variables: that is, a function A: {a_1, ... a_n} --> {0, 1} that maximizes the quantity

|P[A(X) = 1] - P[A(Y) = 1]|. (I believe this is the 'distinguishing probability' of A, but don't quote me.) Call Q1(X, Y) the maximum quantity achievable in this way.

(Note that we could allow A to use additional randomness; it wouldn't affect the value Q1. This is important in applications.)

2) Now say a friend secretly chooses one of the two variables X or Y, each with probability 1/2. She then takes a random sample from the chosen variable and presents it to us. We are asked to guess which sample it came from, and we wish to find a rule that'll maximize our probability of guessing right.

Let Q2(X, Y) denote the maximum achievable probablity of correctness (which again is unaffected by the use of randomness in guessing).

3) Finally, say we wish to design a random variable Z = (Z_1, Z_2), outputting a pair of real values. We require that Z_1 be distributed identically to X

(i.e. d(X, Z_1) = 0), and that Z_2 be distributed identically to Y, but we do not require independence. Our goal is design the pair (Z_1, Z_2) (called a 'coupling' of X and Y) to maximize the probability that Z_1 = Z_2. Denote by Q3(X, Y) this maximum achievable probability.

Now, it is a pleasant fact that all of Q1, Q2, Q3 are completely determined by the value d(X, Y), and conversely, they also each determine d(X, Y) (and each other). Moreover, the quantities are all linearly related, so if you forget the precise relationships (as I invariably do, which motivated this post), they're easy to rederive by considering the extreme cases and interpolating. For the record, here are the relationships:

Q1(X, Y) = d(X, Y)

Q2(X, Y) = .5 + .5 * d(X, Y)

Q3(X, Y) = 1 - d(X, Y)

I recommend verifying these facts once or twice and then taking them for granted. Note that the first two (of which Q1 is more frequently used) are measures of distance, while Q3 is a measure of similarity. Both perspectives come in handy.

Note too that, while d(X, Y) is defined by an equation, Q1-Q3 are defined by an ability to do something; Q1-Q2 in particular are framed around an agent's ability to distinguish between cases, and it is natural that computer scientists often prefer to take such a view.

In future posts, I hope to elaborate somewhat on these perspectives (couplings in particular, which are probably the most interesting).

-linearity of expectation

-the second moment (a.k.a. variance) method

-union bounds

-Chernoff-style deviation bounds

As I've said in the past, The Probabilistic Method by Alon and Spencer is the most delightful way to develop an appreciation for this stuff. I have blogged before on some of these ideas, and will happily do so again, but today I want to discuss an aspect of probabilistic analysis that receives somewhat less thorough discussion in the references I've consulted.

Namely: given two random variables X, Y, how should we quantify their similarity or difference? As an example of a case where such a question might be important, suppose we have a randomized algorithm that correctly solves some problem with high probability if given access to fair coin flips, but we don't have a 'perfect' source of random bits; we want to find conditions under which an imperfect or pseudorandom source will be 'close enough' to the uniform distribution to play the role of the algorithm's random bits, without destroying its correctness.

Is there one 'right' notion of similarity, or do we need multiple notions depending on application? The answer is, 'both'. There is a very widely useful quantity, called the 'statistical distance', which is effective in many cases, but to recognize these cases it is important to have several perspectives on the matter.

I will present the definition and three variants. I encourage readers to submit any others, and also to point me to official nomenclature and references for the variants (I'm working from memory).

Let X, Y be random variables. Let's say they take on (real-number) values from the finite set {a_1, a_2, ... a_n}, just to keep the discussion nice and safe. Suppose X, Y take on value a_i with probabilities p_i, q_i respectively.

The 'statistical distance' d(X, Y) is then defined as

d(X, Y) = .5 * (Sum from i=1 to i= n: |p_i - q_i|).

(That's an absolute value in there.)

Also denoted ||X - Y||, the statistical distance should not be confused with the L_1 distance, which is just twice the statistical distance when applied to probability vectors.

Readers should verify the following: d(X, Y) is a metric, taking values between 0 and 1. d(X, Y) = 0 if and only if X, Y are identically distributed, and d(X, Y) = 1 if and only if X, Y never take on the same values.

Now consider the three following questions we could ask about X and Y:

1) Say we want to build a 'distinguisher' to tell apart the two variables: that is, a function A: {a_1, ... a_n} --> {0, 1} that maximizes the quantity

|P[A(X) = 1] - P[A(Y) = 1]|. (I believe this is the 'distinguishing probability' of A, but don't quote me.) Call Q1(X, Y) the maximum quantity achievable in this way.

(Note that we could allow A to use additional randomness; it wouldn't affect the value Q1. This is important in applications.)

2) Now say a friend secretly chooses one of the two variables X or Y, each with probability 1/2. She then takes a random sample from the chosen variable and presents it to us. We are asked to guess which sample it came from, and we wish to find a rule that'll maximize our probability of guessing right.

Let Q2(X, Y) denote the maximum achievable probablity of correctness (which again is unaffected by the use of randomness in guessing).

3) Finally, say we wish to design a random variable Z = (Z_1, Z_2), outputting a pair of real values. We require that Z_1 be distributed identically to X

(i.e. d(X, Z_1) = 0), and that Z_2 be distributed identically to Y, but we do not require independence. Our goal is design the pair (Z_1, Z_2) (called a 'coupling' of X and Y) to maximize the probability that Z_1 = Z_2. Denote by Q3(X, Y) this maximum achievable probability.

Now, it is a pleasant fact that all of Q1, Q2, Q3 are completely determined by the value d(X, Y), and conversely, they also each determine d(X, Y) (and each other). Moreover, the quantities are all linearly related, so if you forget the precise relationships (as I invariably do, which motivated this post), they're easy to rederive by considering the extreme cases and interpolating. For the record, here are the relationships:

Q1(X, Y) = d(X, Y)

Q2(X, Y) = .5 + .5 * d(X, Y)

Q3(X, Y) = 1 - d(X, Y)

I recommend verifying these facts once or twice and then taking them for granted. Note that the first two (of which Q1 is more frequently used) are measures of distance, while Q3 is a measure of similarity. Both perspectives come in handy.

Note too that, while d(X, Y) is defined by an equation, Q1-Q3 are defined by an ability to do something; Q1-Q2 in particular are framed around an agent's ability to distinguish between cases, and it is natural that computer scientists often prefer to take such a view.

In future posts, I hope to elaborate somewhat on these perspectives (couplings in particular, which are probably the most interesting).

Labels: general math, probability

## 3 Comments:

For more properties of this metric (also known as the total variation metric), see my post. For some uses of this metric in probability theory, see my thesis.

How's it going, Andy?

By Aryeh, at 3:19 PM

Hey Leo! Yeah, your post seems to totally preempt mine... but at least that suggests it was a good topic.

I'm having a great summer, thanks for asking--you?

btw readers, I don't mean to suggest that there are no other useful measures of distance/similarity beyond statistical (TV) distance and its variants... for example, one needs a different set of tools (generally more information-theoretic and entropy-based) when one wants to describe correlation and mutual information between random variables.

I also wanted to mention that there is a nifty anologue of statistical distance for quantum states, that has a similar characterization as the degree of 'distinguishability' of the two states by quantum measurements. I'm just learning about this topic, which is covered in the Nielsen-Chuang book 'Quantum Computation and Quantum Information'.

PS adding Leo's blog to the blogroll.

By Andy D, at 4:17 PM

Thanks for the add, Andy! Here is a nice summary of the different distancs between probability measures and their inter-relations. Note that TV does occupy a central role:

http://www.math.hmc.edu/~su/papers.dir/metrics.pdf

By Aryeh, at 7:12 AM

Post a Comment

<< Home