Article Source
Bourgain’s Embedding
In which we prove that every metric can be embedded into L1 with logarithmic distortion.
Today we prove the following theorem.
Theorem 1 (Bourgain) Let be a semimetric defined over a finite set . Then there exists a mapping such that, for every two elements ,
where is an absolute constant. Given , the mapping can be found with high probability in randomized polynomial time in .
Together with the results that we proved in the last lecture, this implies that an optimal solution to the LeightonRao relaxation can be rounded to an approximate solution to the sparsest cut problem. This was the best known approximation algorithm for sparsest cut for 15 years, until the AroraRaoVazirani algorithm, which will be our next topic.
The theorem has a rather short proof, but there is an element of “magic” to it. We will discuss several examples and we will see what approaches are suggested by the examples. At the end of the discussion, we will see the final proof as, hopefully, the “natural” outcome of the study of such examples and failed attempts.
1. Preliminary and Motivating Examples
A first observation is that embeddings of finite sets of points into L1 can be equivalently characterized as probabilistic embeddings into the real line.
Fact 2 For every finite set , dimension , and mapping , there is a finitelysupported probability distribution over functions such that for every two points :
Conversely, for every finite set and finitely supported distribution over functions , there is a dimension and a mapping such that
Proof: For the first claim, we write for the th coordinate of , that is , and we define to be the uniform distribution over the functions of the form .
For the second claim, if the support of is the set of functions , where function has probability , then we define .
It will be easier to reason about probabilistic mappings into the line, so we will switch to the latter setting from now on.
Our task is to associate a number to every point , and the information that we have about is the list of distances . Probably the first idea that comes to mind is to pick a random reference vertex , and work with the mapping , possibly scaled by a multiplicative constant. (Equivalently, we can think about the deterministic mapping , in which the vertex is mapped to the sequence , for some enumeration of the elements of .)
This works in certain simple cases.
Example 1 (Cycle) Suppose that is the shortestpath metric on a cycle, we can see that, for every two points on the cycle, is within a constant factor of their distance . (Try proving it rigorously!)
Example 2 (Simplex) Suppose that for every , and . Then, for every , we have , so, up to scaling, the mapping incurs no error at all.
But there are also simple examples in which this works very badly.
Example 3 (12 Metric) Suppose that for every we have (any distance function satisfying this property is always a metric) and that, in particular, there is a special vertex at distance 2 from all other vertices, while all other vertices are at distance 1 from each other. Then, for vertices both different from we have, as before
but for every different from we have
and so our error is going to be instead of the that we are trying to establish.
Maybe the next simplest idea is that we should pick at random several reference points . But how do we combine the information into a single number to associate to ? If we just take the sum of the distances, we are back to the case of sampling a single reference point. (We are just scaling up the expectation by a factor of .)
The next simplest way to combine the information is to take either the maximum or the minimum. If we take the minimum, we see that we have the very nice property that we immediately guarantee that our distances in the L1 embedding are no bigger than the original distances, so that it “only” remains to prove that the distances don’t get compressed too much.
Fact 3 []{#factfrechetupper} Let be a semimetric and be a nonempty subset of points. Define as
Then, for every two points we have
Proof: Let be the point such that and be the point such that . (It’s possible that .) Then
and, similarly,
Is there a way to sample a set such that, for every two points , the expectation is not too much smaller than ? How large should the set be?
Example 4 (12 Metric Again) Suppose that for every we have , and that we pick a subset uniformly at random, that is, each event has probability and the events are mutually independent.
Then for every :
because with probability the set contains exactly one of the elements , and conditioned on that event we have (because one of is zero and the other is at least one), which is at least .
If we pick uniformly at random, however, we incur an distortion in the case of the shortest path metric on the cycle. In all the examples seen so far, we can achieve constant distortion if we “mix” the distribution in which is a random set of size 1 and the one in which is a chosen uniformly at random among all sets, say by sampling from the former probability with probability and from the latter with probability .
Example 5 (FarAway Clusters) Suppose now that has the following structure: is partitioned into clusters , where (so ), and we have for vertices in the same cluster, and for vertices in different clusters.
If are in the same cluster, then and
If are in different clusters , then and
If we want to stick to this approach of picking a set of reference elements according to a certain distribution, and then defining the map , then the set must have the property that for every two sets , there is at least a probability that intersects exactly one of , and we would like to be as large as possible, because the distortion caused by the mapping will be at least .
This suggest the following distribution :
 Sample uniformly at random in
 Sample by selecting each , independently, to be in with probability and to be in with probability .
This distribution guarantees the above property with .
Indeed, the above distribution guarantees a distortion at most in all the examples encountered so far, including the tricky example of the clusters of different size. In each example, in fact, we can prove the following claim: for every two vertices , there is a scale , such that conditioned on that scale being chosen, the expectation of is at least a constant times . We could try to prove Bourgain’s theorem by showing that this is true in every semimetric.
Let us call the conditional distribution of conditioned on the choice of a scale . We would like to prove that for every semimetric and every two points there is a scale such that
which, recalling that for every set , is equivalent to arguing that
For this to be true, there must be distances such that and such that, with constant probability according to , we have and (or viceversa). For this to happen, there must be a constant probability that avoids the set and intersects the set . For this to happen, both sets must have size .
This means that if we want to make this “at least one good scale for every pair of points” argument work, we need to show that for every two vertices there is a “large” distance and a “small” distance (whose difference is a constant times ) such that a largeradius ball around one of the vertices and a smallradius ball around the other vertex contain roughly the same number of elements of .
Consider, however, the following example.
Example 6 (Joined Trees) Consider the graph obtained by taking two complete binary trees of the same size and identifying their leaves, as in the picture below.
Consider the shortestpath metric in the above graph. Consider the “root” vertices and . Their distance is , but, at every scale , both and are highly concentrated around and, it can be calculated that, at every scale , we have
This is still good, because averaging over all scales we still get
but this example shows that the analysis cannot be restricted to one good scale but has, in some cases, to take into account the contribution to the expectation coming from all the scales.
In the above example, the only way to get a ball around and a ball around with approximately the same number of points is to get balls of roughly the same radius. No scale could then give a large contribution to the expectation ; every scale, however, gave a noticeable contribution, and adding them up we had a bounded distortion. The above example will be the template for the full proof, which will do an “amortized analysis” of the contribution to the expectation coming from each scale , by looking at the radii that define a ball around and a ball around with approximately elements.
2. The Proof of Bourgain’s Theorem
Given Fact 2 and Fact 3, proving Bourgain’s theorem reduces to proving the following theorem.
Theorem 4 For a finite set of points , consider the distribution over subsets of sampled by uniformly picking a scale and then picking independently each to be in with probability . Let be a semimetric. Then for every ,
where is an absolute constant.
Proof: For each , let be the distance from to the th closest point to (counting ). That is,
and
and define similarly. Let be the scale such that both and are smaller than , but at least one of or are .
Define
and similarly
We claim that there is an absolute constant such that for every scale , we have
We prove the claim by showing that there are two disjoint events, each happening with probability , such that in one event , and in the other event .

The first event is that avoids the set and intersects the set . The former set has size , and the latter set has size ; the sets are disjoint because we are looking at balls or radius around and ; so the event happens with a probability that is at least an absolute constant. When the event happens,

The second event is that avoids the set and intersects the set . The former set has size , and the latter set has size ; the sets are disjoint because we are looking at balls or radius around and ; so the event happens with a probability that is at least an absolute constant. When the event happens,
So we have established (1). Averaging over all scales, we have
There is one remaining point to address. In Fact 2, we proved that a distribution over embeddings on the line can be turned into an L1 embeddings, in which the number of dimensions is equal to the size of the support of the distribution. In our proof, we have used a distribution that ranges over possible functions, so this would give rise to an embedding that uses a superpolynomial number of dimensions.
To fix this remaining problem, we sample sets and we define the embedding . It remains to prove that this randomized mapping has low distortion with high probability, which is an immediate consequence of the Chernoff bounds. Specifically, we use the following form of the Chernoff bound:
Lemma 5 Let be independent nonnegative random variables such that, with probability 1, . Let . Then
Let us look at any two vertices . Clearly, for every choice of , we have so it remains to prove a lower bound to their L1 distance. Let us call the random variable denoting their L1 distance, that is
We can write where , so that is the sum of identically distributed nonnegative random variables, such that
Applying the Chernoff bound with and , we have
which is, say, if we choose for an absolute constant .
By taking a union bound over all pairs of vertices,