Bourgain’s Embedding
In which we prove that every metric can be embedded into L1 with logarithmic distortion.
Today we prove the following theorem.
Theorem 1 (Bourgain) Let $d: V \times V \rightarrow \mathbb{R}$ be a semimetric defined over a finite set $V$. Then there exists a mapping $F: V \rightarrow \mathbb{R}^m$ such that, for every two elements $u,v \in V$,

$$ || F(u) - F(v) ||_1 \leq d(u,v) \leq || F(u) - F(v) ||_1 \cdot c \cdot \log |V| $$

where $c$ is an absolute constant. Given $d$, the mapping $F$ can be found with high probability in randomized polynomial time in $|V|$.
Together with the results that we proved in the last lecture, this
implies that an optimal solution to the Leighton-Rao relaxation can be
rounded to an $O(\log n)$-approximate
solution to the sparsest cut problem. This was the best known
approximation algorithm for sparsest cut for 15 years, until the
Arora-Rao-Vazirani algorithm, which will be our next topic.
The theorem has a rather short proof, but there is an element of “magic” to it. We will discuss several examples and we will see what approaches are suggested by the examples. At the end of the discussion, we will see the final proof as, hopefully, the “natural” outcome of the study of such examples and failed attempts.
1. Preliminary and Motivating Examples
A first observation is that embeddings of finite sets of points into L1 can be equivalently characterized as probabilistic embeddings into the real line.
Fact 2 For every finite set $V$, dimension $m$, and mapping $F: V \rightarrow \mathbb{R}^m$, there is a finitely-supported probability distribution $\mathcal{D}$ over functions $f: V \rightarrow \mathbb{R}$ such that for every two points $u,v \in V$:

$$ \mathbb{E}_{f \sim \mathcal{D}} \, | f(u) - f(v) | = || F(u) - F(v) ||_1 $$

Conversely, for every finite set $V$ and finitely supported distribution $\mathcal{D}$ over functions $f: V \rightarrow \mathbb{R}$, there is a dimension $m$ and a mapping $F: V \rightarrow \mathbb{R}^m$ such that

$$ || F(u) - F(v) ||_1 = \mathbb{E}_{f \sim \mathcal{D}} \, | f(u) - f(v) | $$
Proof: For the first claim, we write $F_i(v)$ for the $i$-th coordinate of $F(v)$, that is, $F(v) = (F_1(v),\ldots,F_m(v))$, and we define $\mathcal{D}$ to be the uniform distribution over the $m$ functions of the form $x \mapsto m \cdot F_i(x)$.

For the second claim, if the support of $\mathcal{D}$ is the set of functions $\{ f_1,\ldots,f_m \}$, where function $f_i$ has probability $p_i$, then we define $F(v) := ( p_1 \cdot f_1(v), \ldots, p_m \cdot f_m(v) )$. □
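To make the second claim concrete, here is a minimal Python sketch (illustrative only, not part of the original notes; the toy set and the functions are invented for the example) that builds $F$ from a finitely supported distribution and checks the stated identity:

```python
# Sketch of the second claim of Fact 2: given functions f_1..f_m with
# probabilities p_1..p_m, the map F(v) = (p_1*f_1(v), ..., p_m*f_m(v)) satisfies
# ||F(u)-F(v)||_1 = E_f |f(u)-f(v)|.  (Toy example, not from the original notes.)
V = [0, 1, 2, 3]
funcs = [lambda x: x, lambda x: (x - 1) ** 2, lambda x: (3 * x) % 4]
probs = [0.5, 0.25, 0.25]

def F(v):
    return [p * f(v) for p, f in zip(probs, funcs)]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

for u in V:
    for v in V:
        expectation = sum(p * abs(f(u) - f(v)) for p, f in zip(probs, funcs))
        assert abs(l1(F(u), F(v)) - expectation) < 1e-9
print("Fact 2 (second claim) checked on a toy example.")
```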
It will be easier to reason about probabilistic mappings into the line, so we will switch to the latter setting from now on.
Our task is to associate a number $f(v)$ to every point $v$, and the information that we have about $v$ is the list of distances $d(v,w)$ to the other points $w$. Probably the first idea that comes to mind is to pick a random reference vertex $r \in V$, and work with the mapping $v \mapsto d(r,v)$, possibly scaled by a multiplicative constant. (Equivalently, we can think about the deterministic mapping $V \rightarrow \mathbb{R}^{|V|}$, in which the vertex $v$ is mapped to the sequence $( d(u_1,v), \ldots, d(u_n,v) )$, for some enumeration $u_1,\ldots,u_n$ of the elements of $V$.)
This works in certain simple cases.
Example 1 (Cycle) Suppose that $d$ is the shortest-path metric on a cycle. We can see that, for every two points $u,v$ on the cycle, $\mathbb{E}_{r \in V} | d(r,u) - d(r,v) |$ is within a constant factor of their distance $d(u,v)$. (Try proving it rigorously!)
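The claim is also easy to check numerically. The following sketch (illustrative, not from the original notes) computes the ratio between $\mathbb{E}_r | d(r,u) - d(r,v) |$ and $d(u,v)$ over all pairs of points of an $n$-cycle:

```python
# Numerical check of Example 1: on an n-cycle, the expectation over a random
# reference point r of |d(r,u) - d(r,v)| stays within a constant factor
# (roughly between 1/2 and 1) of d(u,v).  (Illustrative sketch.)
n = 64

def d(a, b):
    return min((a - b) % n, (b - a) % n)   # shortest-path metric on the n-cycle

ratios = []
for u in range(n):
    for v in range(u + 1, n):
        avg = sum(abs(d(r, u) - d(r, v)) for r in range(n)) / n
        ratios.append(avg / d(u, v))
print("min ratio:", min(ratios), " max ratio:", max(ratios))
```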
Example 2 (Simplex) Suppose that $d(u,v) = 1$ for every $u \neq v$, and $d(u,u) = 0$. Then, for every $u \neq v$, we have $\mathbb{E}_r | d(r,u) - d(r,v) | = \frac 2n$, so, up to scaling, the mapping incurs no error at all.
But there are also simple examples in which this works very badly.
Example 3 (1-2 Metric) Suppose that for every $u \neq v$ we have $d(u,v) \in \{ 1,2 \}$ (any distance function satisfying this property is always a metric) and that, in particular, there is a special vertex $z$ at distance 2 from all other vertices, while all other vertices are at distance 1 from each other. Then, for vertices $u,v$ both different from $z$, we have, as before, $\mathbb{E}_r | d(r,u) - d(r,v) | = \frac 2n$, but for every $v$ different from $z$ we have $\mathbb{E}_r | d(r,z) - d(r,v) | \geq 1$, and so our error is going to be $\Omega(n)$ instead of the $O(\log n)$ that we are trying to establish.
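The compression described in this example can be seen directly in a small simulation (illustrative code, not from the original notes; the special vertex is taken to be vertex 0):

```python
# Example 3 numerically: a 1-2 metric where vertex z is at distance 2 from all
# other vertices and all other pairs are at distance 1.  A single random
# reference point compresses distance-1 pairs to about 2/n, while pairs
# involving z keep an expected difference of about 1: an Omega(n) gap.
n, z = 200, 0

def d(a, b):
    if a == b:
        return 0
    return 2 if z in (a, b) else 1

def avg_diff(u, v):
    return sum(abs(d(r, u) - d(r, v)) for r in range(n)) / n

print("pair away from z:  d = 1   E|d(r,u)-d(r,v)| =", avg_diff(1, 2))
print("pair involving z:  d = 2   E|d(r,z)-d(r,v)| =", avg_diff(z, 1))
```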
Maybe the next simplest idea is that we should pick at random several reference points $r_1,\ldots,r_k$. But how do we combine the information $d(r_1,u),\ldots,d(r_k,u)$ into a single number to associate to $u$? If we just take the sum of the distances, we are back to the case of sampling a single reference point. (We are just scaling up the expectation by a factor of $k$.)
The next simplest way to combine the information is to take either the maximum or the minimum. If we take the minimum, we see that we have the very nice property that we immediately guarantee that our distances in the L1 embedding are no bigger than the original distances, so that it “only” remains to prove that the distances don’t get compressed too much.
Fact 3 Let $d: V \times V \rightarrow \mathbb{R}$ be a semimetric and $A \subseteq V$ be a non-empty subset of points. Define $f_A(v) := d(A,v) = \min_{a \in A} d(a,v)$. Then, for every two points $u,v$ we have

$$ | f_A(u) - f_A(v) | \leq d(u,v) $$

Proof: Let $a$ be the point such that $d(a,u) = d(A,u)$ and $b$ be the point such that $d(b,v) = d(A,v)$. (It's possible that $a = b$.) Then

$$ f_A(u) = d(a,u) \leq d(b,u) \leq d(b,v) + d(u,v) = f_A(v) + d(u,v) $$

and, similarly,

$$ f_A(v) = d(b,v) \leq d(a,v) \leq d(a,u) + d(u,v) = f_A(u) + d(u,v) $$ □
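Here is a small brute-force check of Fact 3 (illustrative code, not from the original notes): we build a random shortest-path metric and verify that the map $f_A(v) = d(A,v)$ never stretches any distance.

```python
# Check of Fact 3: for every nonempty A, the map f_A(v) = min_{a in A} d(a,v)
# is 1-Lipschitz with respect to d.  The metric below is the shortest-path
# closure (Floyd-Warshall) of a random symmetric weight matrix.
import itertools
import random

n = 30
dist = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        dist[i][j] = dist[j][i] = random.uniform(1, 10)
for k in range(n):              # Floyd-Warshall: enforce the triangle inequality
    for i in range(n):
        for j in range(n):
            dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])

def f(A, v):
    return min(dist[a][v] for a in A)

for _ in range(100):
    A = random.sample(range(n), random.randint(1, n))
    for u, v in itertools.combinations(range(n), 2):
        assert abs(f(A, u) - f(A, v)) <= dist[u][v] + 1e-9
print("Fact 3 verified on a random metric.")
```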
Is there a way to sample a set $A$ such that, for every two points $u,v$, the expectation $\mathbb{E}_A | f_A(u) - f_A(v) |$ is not too much smaller than $d(u,v)$? How large should the set $A$ be?
Example 4 (1-2 Metric Again) Suppose that for every $u \neq v$ we have $d(u,v) \in \{ 1,2 \}$, and that we pick a subset $A \subseteq V$ uniformly at random, that is, each event $r \in A$ has probability $\frac 12$ and the events are mutually independent.

Then for every $u \neq v$:

$$ \mathbb{E}_A | d(A,u) - d(A,v) | \geq \frac 14 \cdot d(u,v) $$

because with probability $\frac 12$ the set $A$ contains exactly one of the elements $u,v$, and conditioned on that event we have $| d(A,u) - d(A,v) | \geq 1$ (because one of $d(A,u), d(A,v)$ is zero and the other is at least one), which is at least $\frac 12 \cdot d(u,v)$.
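A quick simulation of this example (illustrative, not from the original notes; it uses the same 1-2 metric with a special vertex as in Example 3):

```python
# Example 4 numerically: with A a uniformly random subset of V, the expectation
# E|d(A,u) - d(A,v)| is at least a constant fraction of d(u,v) for a 1-2 metric.
import random

n, z = 100, 0

def d(a, b):
    if a == b:
        return 0
    return 2 if z in (a, b) else 1

def dA(A, v):
    return min(d(a, v) for a in A)

def estimate(u, v, trials=2000):
    total, count = 0.0, 0
    while count < trials:
        A = [x for x in range(n) if random.random() < 0.5]
        if not A:                 # astronomically unlikely for n = 100
            continue
        total += abs(dA(A, u) - dA(A, v))
        count += 1
    return total / trials

print("pair at distance 1:", estimate(1, 2))   # roughly 0.5
print("pair at distance 2:", estimate(z, 1))   # roughly 1.0
```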
If we pick $A$ uniformly at random, however, we incur an $\Omega(n)$ distortion in the case of the shortest path metric on the cycle. In all the examples seen so far, we can achieve constant distortion if we “mix” the distribution in which $A$ is a random set of size 1 and the one in which $A$ is chosen uniformly at random among all sets, say by sampling from the former distribution with probability $\frac 12$ and from the latter with probability $\frac 12$.
Example 5 (Far-Away Clusters) Suppose now that $d$ has the following structure: $V$ is partitioned into clusters $B_1,\ldots,B_k$, where $|B_i| = i$ (so $n = \frac{k(k+1)}{2}$), and we have $d(u,v) = 1$ for vertices in the same cluster, and $d(u,v) = n$ for vertices in different clusters.

If $u,v$ are in the same cluster, then $d(u,v) = 1$ and, reasoning as in Example 4 (applied to the uniformly random half of the mixture),

$$ \mathbb{E}_A | f_A(u) - f_A(v) | \geq \Omega(1) $$

If $u,v$ are in different clusters $B_i, B_j$, then $d(u,v) = n$, but $| f_A(u) - f_A(v) |$ can be large only when $A$ intersects exactly one of the two clusters; once $i$ and $j$ are at least logarithmic in $n$, this happens with probability $O\left( \frac{i+j}{n} \right)$ under the mixed distribution, and

$$ \mathbb{E}_A | f_A(u) - f_A(v) | \leq O( i + j ) \ll d(u,v) $$

so the mixed distribution does not give constant distortion on this example.
If we want to stick to this approach of picking a set $A$ of reference elements according to a certain distribution, and then defining the map $f_A(v) := d(A,v)$, then the set $A$ must have the property that for every two clusters $B_i, B_j$, there is at least a probability $p$ that $A$ intersects exactly one of $B_i, B_j$, and we would like $p$ to be as large as possible, because the distortion caused by the mapping will be at least $\Omega\left( \frac 1p \right)$.
This suggests the following distribution $\mathcal{D}$:
- Sample a scale $t$ uniformly at random in $\{ 0, 1, \ldots, \log_2 n \}$;
- Sample $A \subseteq V$ by selecting each $v \in V$, independently, to be in $A$ with probability $2^{-t}$ and to be out of $A$ with probability $1 - 2^{-t}$.
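In code, sampling a set from this distribution looks like the following sketch (illustrative, not part of the original notes):

```python
# Sampling from the distribution D: pick a scale t uniformly from {0,...,log2 n},
# then keep each vertex independently with probability 2^(-t).
import math
import random

def sample_A(V):
    t = random.randint(0, int(math.log2(len(V))))   # uniformly random scale
    p = 2.0 ** (-t)
    return {v for v in V if random.random() < p}    # each vertex kept with prob 2^-t

print(sorted(sample_A(list(range(16)))))
```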
This distribution guarantees the above property with $p = \Omega\left( \frac{1}{\log n} \right)$.
Indeed, the above distribution guarantees a distortion at most $O(\log n)$ in all the examples encountered so far, including the tricky example of the clusters of different size. In each example, in fact, we can prove the following claim: for every two vertices $u,v$, there is a scale $t$, such that conditioned on that scale being chosen, the expectation of $| d(A,u) - d(A,v) |$ is at least a constant times $d(u,v)$.
We could try to prove Bourgain’s theorem by showing that this is true in
every semimetric.
Let us call $\mathcal{D}_t$ the conditional distribution of $\mathcal{D}$ given the choice of the scale $t$. We would like to prove that for every semimetric $d$ and every two points $u,v$ there is a scale $t$ such that

$$ \mathbb{E}_{A \sim \mathcal{D}_t} | d(A,u) - d(A,v) | \geq \Omega( d(u,v) ) $$

which, recalling that $| d(A,u) - d(A,v) | \leq d(u,v)$ for every set $A$, is equivalent to arguing that

$$ \mathbb{P}_{A \sim \mathcal{D}_t} \left[ \, | d(A,u) - d(A,v) | \geq \Omega( d(u,v) ) \, \right] \geq \Omega(1) $$

For this to be true, there must be distances $d_1 > d_2$ such that $d_1 - d_2 \geq \Omega( d(u,v) )$ and such that, with constant probability according to $\mathcal{D}_t$, we have $d(A,u) \geq d_1$ and $d(A,v) \leq d_2$ (or vice-versa). For this to happen, there must be a constant probability that $A$ avoids the set $\{ w : d(u,w) < d_1 \}$ and intersects the set $\{ w : d(v,w) \leq d_2 \}$. For this to happen, both sets must have size roughly $2^t$.
This means that if we want to make this “at least one good scale for every pair of points” argument work, we need to show that for every two vertices $u,v$ there is a “large” distance $d_1$ and a “small” distance $d_2$ (whose difference is a constant times $d(u,v)$) such that a large-radius ball around one of the vertices and a small-radius ball around the other vertex contain roughly the same number of elements of $V$.
Consider, however, the following example.
Example 6 (Joined Trees) Consider the graph obtained by taking two complete binary trees of the same size and identifying their leaves, as in the picture below.
Consider the shortest-path metric $d$ in the above graph. Consider the “root” vertices $u$ and $v$. Their distance $d(u,v)$ is $\Theta(\log n)$, but, at every scale $t$, both $d(A,u)$ and $d(A,v)$ are highly concentrated around $t$, and it can be calculated that, at every scale $t$, we have

$$ \mathbb{E}_{A \sim \mathcal{D}_t} | d(A,u) - d(A,v) | = \Theta(1) $$

This is still good, because averaging over all scales we still get

$$ \mathbb{E}_{A \sim \mathcal{D}} | d(A,u) - d(A,v) | \geq \Omega\left( \frac{ d(u,v) }{ \log n } \right) $$

but this example shows that the analysis cannot be restricted to one good scale but has, in some cases, to take into account the contribution to the expectation coming from all the scales.
In the above example, the only way to get a ball around $u$ and a ball around $v$ with approximately the same number of points is to get balls of roughly the same radius. No scale could then give a large contribution to the expectation $\mathbb{E}_{A \sim \mathcal{D}} | d(A,u) - d(A,v) |$; every scale, however, gave a noticeable contribution, and adding them up we had a bounded distortion. The above example will be the template for the full proof, which will do an “amortized analysis” of the contribution to the expectation coming from each scale $t$, by looking at the radii that define a ball around $u$ and a ball around $v$ with approximately $2^t$ elements.
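For concreteness, here is a small simulation of Example 6 (illustrative code, not part of the original notes; the depth $k$ and the number of samples are arbitrary choices). It builds the joined-trees graph and estimates the contribution of each scale to the expectation for the two roots:

```python
# Example 6 numerically: two complete binary trees of depth k glued at their
# leaves.  The roots are at distance 2k, but no single scale t contributes more
# than O(1) to E|d(A,u) - d(A,v)|; the contributions of all scales add up.
import random
from collections import deque

def joined_trees(k):
    # Level i (0 <= i <= 2k) has 2^i vertices for i <= k and 2^(2k-i) for i > k.
    edges = {}
    def add_edge(a, b):
        edges.setdefault(a, []).append(b)
        edges.setdefault(b, []).append(a)
    for i in range(k):                       # top tree, levels 0..k
        for j in range(2 ** i):
            add_edge((i, j), (i + 1, 2 * j))
            add_edge((i, j), (i + 1, 2 * j + 1))
    for i in range(k, 2 * k):                # bottom tree, levels k..2k
        for j in range(2 ** (2 * k - i - 1)):
            add_edge((i + 1, j), (i, 2 * j))
            add_edge((i + 1, j), (i, 2 * j + 1))
    return edges

def bfs(edges, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        for y in edges[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

k = 7
edges = joined_trees(k)
V = list(edges)
u, v = (0, 0), (2 * k, 0)                    # the two roots
du, dv = bfs(edges, u), bfs(edges, v)
print("n =", len(V), " d(u,v) =", du[v])
for t in range(len(V).bit_length()):         # scales t = 0, 1, ..., ~log2 n
    samples = []
    for _ in range(300):
        A = [x for x in V if random.random() < 2.0 ** (-t)]
        if A:
            samples.append(abs(min(du[x] for x in A) - min(dv[x] for x in A)))
    avg = sum(samples) / max(len(samples), 1)
    print("scale", t, " E|d(A,u)-d(A,v)| ~", round(avg, 2))
```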
2. The Proof of Bourgain’s Theorem
Given Fact 2 and Fact 3, proving Bourgain’s theorem reduces to proving the following theorem.
Theorem 4 For a finite set of points $V$ with $|V| = n$, consider the distribution $\mathcal{D}$ over subsets of $V$ sampled by uniformly picking a scale $t \in \{ 0, 1, \ldots, \log_2 n \}$ and then picking independently each $v \in V$ to be in $A$ with probability $2^{-t}$. Let $d: V \times V \rightarrow \mathbb{R}$ be a semimetric. Then for every $u,v \in V$,

$$ \mathbb{E}_{A \sim \mathcal{D}} | d(A,u) - d(A,v) | \geq \frac{ d(u,v) }{ c \cdot \log n } $$

where $c$ is an absolute constant.
Proof: For each $t$, let $r^u_t$ be the distance from $u$ to the $2^t$-th closest point to $u$ (counting $u$ itself). That is,

$$ | \{ w : d(u,w) < r^u_t \} | < 2^t \ \ \ \mbox{ and } \ \ \ | \{ w : d(u,w) \leq r^u_t \} | \geq 2^t $$

and define $r^v_t$ similarly. Let $t^*$ be the scale such that both $r^u_{t^*}$ and $r^v_{t^*}$ are smaller than $d(u,v)/3$, but at least one of $r^u_{t^*+1}$ or $r^v_{t^*+1}$ is at least $d(u,v)/3$. (Such a scale exists because $r^u_0 = r^v_0 = 0$.)

Define

$$ \hat r^u_t := \min \{ r^u_t , \, d(u,v)/3 \} $$

and similarly

$$ \hat r^v_t := \min \{ r^v_t , \, d(u,v)/3 \} $$

We claim that there is an absolute constant $c_0 > 0$ such that for every scale $t \in \{ 1, \ldots, t^*+1 \}$, we have

$$ \mathbb{E}_{A \sim \mathcal{D}_t} | d(A,u) - d(A,v) | \geq c_0 \cdot \left( \hat r^u_t - \hat r^v_{t-1} + \hat r^v_t - \hat r^u_{t-1} \right) \ \ \ \ (1) $$
We prove the claim by showing that there are two disjoint events, each happening with probability $\Omega(1)$, such that in one event $| d(A,u) - d(A,v) | \geq \hat r^u_t - \hat r^v_{t-1}$, and in the other event $| d(A,u) - d(A,v) | \geq \hat r^v_t - \hat r^u_{t-1}$.
- The first event is that $A$ avoids the set $\{ w : d(u,w) < \hat r^u_t \}$ and intersects the set $\{ w : d(v,w) \leq \hat r^v_{t-1} \}$. The former set has size $< 2^t$, and the latter set has size $\geq 2^{t-1}$; the sets are disjoint because they are contained in balls of radius at most $d(u,v)/3$ around $u$ and around $v$, respectively; so the event happens with a probability that is at least an absolute constant (a density-$2^{-t}$ random set misses a fixed set of fewer than $2^t$ points with probability at least $(1 - 2^{-t})^{2^t} \geq \frac 14$ and, independently, hits a disjoint set of at least $2^{t-1}$ points with probability at least $1 - e^{-1/2}$). When the event happens,

$$ | d(A,u) - d(A,v) | \geq d(A,u) - d(A,v) \geq \hat r^u_t - \hat r^v_{t-1} $$
- The second event is that $A$ avoids the set $\{ w : d(v,w) < \hat r^v_t \}$ and intersects the set $\{ w : d(u,w) \leq \hat r^u_{t-1} \}$. The former set has size $< 2^t$, and the latter set has size $\geq 2^{t-1}$; the sets are disjoint because they are contained in balls of radius at most $d(u,v)/3$ around $v$ and around $u$, respectively; so, as before, the event happens with a probability that is at least an absolute constant. When the event happens,

$$ | d(A,u) - d(A,v) | \geq d(A,v) - d(A,u) \geq \hat r^v_t - \hat r^u_{t-1} $$
So we have established (1). Averaging over all scales, we have

$$ \mathbb{E}_{A \sim \mathcal{D}} | d(A,u) - d(A,v) | \geq \frac{1}{1 + \log n} \sum_{t=1}^{t^*+1} c_0 \cdot \left( \hat r^u_t + \hat r^v_t - \hat r^u_{t-1} - \hat r^v_{t-1} \right) = \frac{ c_0 }{ 1 + \log n } \cdot \left( \hat r^u_{t^*+1} + \hat r^v_{t^*+1} \right) \geq \frac{ c_0 }{ 1 + \log n } \cdot \frac{ d(u,v) }{3} $$

where the equality holds because the sum telescopes and $\hat r^u_0 = \hat r^v_0 = 0$, and the last inequality holds because at least one of $\hat r^u_{t^*+1}, \hat r^v_{t^*+1}$ equals $\frac{d(u,v)}{3}$.
There is one remaining point to address. In Fact 2, we proved that a distribution over embeddings on the line can be turned into an L1 embedding in which the number of dimensions is equal to the size of the support of the distribution. In our proof, we have used a distribution that ranges over roughly $2^n$ possible functions, so this would give rise to an embedding that uses a superpolynomial number of dimensions.
To fix this remaining problem, we sample $m = O(\log^3 n)$ sets $A_1,\ldots,A_m$ independently from $\mathcal{D}$, and we define the embedding $F(u) := \frac 1m \cdot \left( d(A_1,u), \ldots, d(A_m,u) \right)$.
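Putting the pieces together, the resulting randomized construction can be sketched as follows (illustrative code, not part of the original notes; the choice of $m$ and the toy metric are arbitrary):

```python
# Sketch of the full construction: sample m sets from the distribution D of
# Theorem 4 and map each point u to (1/m) * (d(A_1,u), ..., d(A_m,u)).
# By Fact 3 this never stretches distances; the analysis in the text shows that
# with m on the order of log^3 n it compresses them by at most O(log n) w.h.p.
import math
import random

def bourgain_embedding(V, d, m=None):
    n = len(V)
    if m is None:
        m = max(1, int(math.ceil(math.log2(n) ** 3)))
    sets = []
    while len(sets) < m:
        t = random.randint(0, int(math.log2(n)))
        A = [v for v in V if random.random() < 2.0 ** (-t)]
        if A:                                # skip the (rare) empty sample
            sets.append(A)
    # one coordinate per sampled set: (1/m) * d(A_i, u)
    return {u: [min(d(a, u) for a in A) / m for A in sets] for u in V}

# toy usage: the shortest-path metric of an 8-cycle
n = 8
def d(a, b):
    return min((a - b) % n, (b - a) % n)

F = bourgain_embedding(list(range(n)), d)
u, v = 0, 4
l1_dist = sum(abs(x - y) for x, y in zip(F[u], F[v]))
print("d(u,v) =", d(u, v), "  ||F(u)-F(v)||_1 =", round(l1_dist, 3))
```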
It remains to prove that this randomized mapping has low distortion with
high probability, which is an immediate consequence of the Chernoff
bounds. Specifically, we use the following form of the Chernoff bound:
Lemma 5 Let $X_1,\ldots,X_m$ be independent nonnegative random variables such that, with probability 1, $X_i \leq M$. Let $X := X_1 + \cdots + X_m$. Then, for every $\epsilon > 0$,

$$ \mathbb{P} \left[ X \leq \mathbb{E} X - \epsilon \right] \leq e^{ - \frac{ \epsilon^2 }{ 2 m M^2 } } $$
Let us look at any two vertices $u,v$. Clearly, for every choice of $A_1,\ldots,A_m$, we have

$$ || F(u) - F(v) ||_1 = \frac 1m \sum_{i=1}^m | d(A_i,u) - d(A_i,v) | \leq d(u,v) $$

so it remains to prove a lower bound to their L1 distance. Let us call $X$ the random variable denoting their L1 distance, that is

$$ X := \frac 1m \sum_{i=1}^m | d(A_i,u) - d(A_i,v) | $$

We can write $X = X_1 + \cdots + X_m$, where $X_i := \frac 1m \cdot | d(A_i,u) - d(A_i,v) |$, so that $X$ is the sum of identically distributed nonnegative random variables such that

$$ X_i \leq \frac{ d(u,v) }{m} \ \ \ \mbox{ and } \ \ \ \mathbb{E} X \geq \frac{ d(u,v) }{ c \log n } $$

Applying the Chernoff bound with $M = \frac{ d(u,v) }{m}$ and $\epsilon = \frac{ d(u,v) }{ 2 c \log n }$, we have

$$ \mathbb{P} \left[ X \leq \frac{ d(u,v) }{ 2 c \log n } \right] \leq e^{ - \frac{ m }{ 8 c^2 \log^2 n } } $$

which is, say, at most $\frac{1}{n^3}$ if we choose $m = c' \log^3 n$ for a suitable absolute constant $c'$.
By taking a union bound over all pairs of vertices, we conclude that, with probability at least $1 - \frac 1n$, we have

$$ \frac{ d(u,v) }{ 2 c \log n } \leq || F(u) - F(v) ||_1 \leq d(u,v) $$

simultaneously for all pairs $u,v$, which is the high-probability guarantee required by Theorem 1. □