
$\newcommand\SS{P}\newcommand\TT{Q}$I will call a Gaussian probability measure $\SS$ on $\mathbb{R}^d$ isotropic if its covariance matrix is diagonal with non-vanishing determinant; i.e. $\Sigma_{i,i}>0$ for $i=1,\dots,d$ and $\Sigma_{i,j}=0$ whenever $i\neq j$ for each $i,j=1,\dots,d$.
Note: My definition of "isotropic" includes "the usual isotropic Gaussian measures," which, from my limited understanding, are assumed to have a covariance of the form $\sigma I_d$ for some $\sigma>0$.

Let $\mathcal{P}$ be the set of isotropic Gaussian probability measures on $\mathbb{R}^d$ and let $\mathcal{Q}$ be the set of probability measures on $\mathbb{R}^d$ with Lebesgue density, equipped with the TV distance.

Consider the information projection (or I-projection) defined by \begin{align} \pi:\mathcal{Q} &\rightarrow \mathcal{P} \\ \pi(\TT) &:= \operatorname*{argmin}_{\SS\in \mathcal{P}}\, D(\SS\parallel\TT) \end{align}

I'm looking for references on the following "elementary properties" of the I-projection:

  • Is the I-projection $\pi$ Lipschitz, at least locally?
  • Are there error bounds on $D(\pi(\TT)\parallel\TT)$ when $\TT$ is a Gaussian measure on $\mathbb{R}^d$ with non-singular covariance?
  • $$\begin{align} \pi:\mathcal{P} &\rightarrow \mathcal{Q} \\ \pi(\mathbb{P}) &:= \operatorname{argmin}_{\mathbb{Q} \in \mathcal{Q}}\, \mathbb{D}(\mathbb{Q} \| \mathbb{P}) \end{align}$$ $${}$$ $$ \begin{align} \pi:\mathcal{P} &\rightarrow \mathcal{Q} \\ \pi(\mathbb{P}) &:= \operatorname*{argmin}_{\mathbb{Q} \in \mathcal{Q}}\, \mathbb{D}(\mathbb{Q} \parallel \mathbb{P}) \end{align} $$ I edited the question to change the first display above to the second. What I did with "argmin" required only adding an asterisk to the code. Changing \| to \parallel is${}\ldots$ Commented Mar 8, 2023 at 19:18
  • $\ldots$a kind of thing to which many mathematicians who use LaTeX seem callously insensitive. The result is that instead of seeing $\mathbb D\|\mathbb P$ you see $\mathbb D\parallel\mathbb P$. Commented Mar 8, 2023 at 19:21
  • Does your notion of "isotropic" imply mean-zero? Commented Mar 8, 2023 at 19:22
  • Some people seem to take "isotropic" in this context to mean that the variance is a scalar multiple of the identity matrix. That implies it's nonsingular unless the scalar is zero. But you seem to find it necessary to add that it's nonsingular, so maybe by "isotropic" you mean something else. Can you explain? Commented Mar 8, 2023 at 20:35
  • Is $D(Q\parallel P)$ the Kullback–Leibler divergence? And why do you need all these \mathbb's? Commented Mar 9, 2023 at 2:07

1 Answer


Assuming you want to minimize the Kullback–Leibler divergence $$D(P\parallel Q)=\int dP\,\ln\frac{dP}{dQ}$$ over all isotropic Gaussian $P$, "the" minimizer is in general not unique and, accordingly, not Lipschitz even on the set of measures $Q$ where it is unique.
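As an aside (my own sketch, not part of the answer, and with example numbers I made up): when $Q$ is itself a nonsingular Gaussian $N(m,S)$, the Gaussian–Gaussian divergence has a closed form, and zeroing its gradient over diagonal-covariance $P=N(\mu,\Lambda)$ gives the explicit projection $\mu=m$, $\Lambda_{ii}=1/(S^{-1})_{ii}$. A NumPy check:

```python
import numpy as np

# Closed form (standard): D(N(mu_p, C) || N(mu_q, S)) =
#   (1/2) [ tr(S^{-1} C) + (mu_q - mu_p)^T S^{-1} (mu_q - mu_p) - d + ln(det S / det C) ]
def kl_gauss(mu_p, cov_p, mu_q, cov_q):
    d = len(mu_p)
    q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(q_inv @ cov_p) + diff @ q_inv @ diff - d
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

mu_q = np.array([1.0, -2.0])
cov_q = np.array([[2.0, 0.8], [0.8, 1.0]])  # correlated, so Q is not isotropic

# Candidate minimizer over diagonal P = N(mu, Lambda): mu = mu_q and
# Lambda_ii = 1 / (S^{-1})_ii, from setting the gradient to zero.
lam = 1.0 / np.diag(np.linalg.inv(cov_q))
best = kl_gauss(mu_q, np.diag(lam), mu_q, cov_q)

# Sanity check: perturbing the candidate diagonal only increases the divergence.
for eps in (0.1, -0.1):
    pert = lam + np.array([eps, 0.0])
    assert kl_gauss(mu_q, np.diag(pert), mu_q, cov_q) >= best
```

This also illustrates the second bullet of the question in the Gaussian case: the projection error $D(\pi(Q)\parallel Q)$ is computable in closed form from $S$.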

The idea of a counterexample is quite simple: Suppose that $d=1$. Let $Q_h$ be the probability measure with pdf $q_h$ given by the formula $$q_h(x)=c_h\big((1+h)\,f_{1,a}(x)\,1(x>0) +(1-h)\,f_{-1,a}(x)\,1(x<0)\big)$$ for real $x$, where $f_{t,a}$ is the pdf of the normal distribution $N(t,a^2)$, $a>0$ is small enough (the condition $0<a<\sqrt{2/\pi}$ should do), $h$ is a real number very close to $0$, and $c_h(\approx1/2)$ is the normalizing factor.

Since $a$ is rather small, $Q_h$ is somewhat close to the mixture of the rather narrow normal distributions $N(1,a^2)$ and $N(-1,a^2)$ with slightly unequal weights, $c_h\,(1+h)$ and $c_h\,(1-h)$ respectively. So, a minimizer $P_h$ of the Kullback–Leibler divergence $D(P\parallel Q_h)$ in $P$ should be sufficiently close to $N(1,a^2)$ or $N(-1,a^2)$ depending on whether the small perturbation $h$ is $>0$ or $<0$, respectively. Thus, an infinitesimally small change from, say, $h>0$ to $-h<0$ will result in quite a nonnegligible change from $P_h\approx N(1,a^2)$ to $P_{-h}\approx N(-1,a^2)$. (If $h=0$, then there will be two minimizers.)
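To make this concrete, here is a small numerical sketch (mine, not from the answer) with $a=1/2$ and $h=\pm 0.05$, minimizing over the mean only, with $\sigma$ fixed at $a$ for simplicity; the minimizing mean jumps between roughly $+1$ and $-1$ as $h$ changes sign:

```python
import numpy as np

a = 0.5                             # satisfies 0 < a < sqrt(2/pi) ~ 0.798
xs = np.linspace(-8.0, 8.0, 40000)  # even point count, so x = 0 is not a grid point
dx = xs[1] - xs[0]

def npdf(x, mu, sig):
    return np.exp(-(x - mu) ** 2 / (2.0 * sig ** 2)) / (sig * np.sqrt(2.0 * np.pi))

def q_h(h):
    """pdf of Q_h: reweighted halves of N(1, a^2) and N(-1, a^2)."""
    raw = np.where(xs > 0, (1 + h) * npdf(xs, 1.0, a), (1 - h) * npdf(xs, -1.0, a))
    return raw / (raw.sum() * dx)   # normalization plays the role of c_h

def kl(mu, q):
    """D(N(mu, a^2) || Q_h) by quadrature (sigma fixed at a)."""
    p = npdf(xs, mu, a)
    return float(np.sum(p * (np.log(np.maximum(p, 1e-300)) - np.log(q))) * dx)

best_mu = {}
mus = np.linspace(-1.5, 1.5, 121)
for h in (0.05, -0.05):
    q = q_h(h)
    best_mu[h] = min(mus, key=lambda m: kl(m, q))
print(best_mu)  # the minimizing mean flips sign with h
```

The gap between the two basins is about $\ln\frac{1+h}{1-h}\approx 2h$, so even an infinitesimal sign change of $h$ moves the minimizer by about $2$ in mean.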

I can write down details later, if you want them.


Responding to the comment of the OP about a possible relation of your question to a result by Csiszár: Your question concerns the existence and uniqueness, for any given probability measure (PM) $Q$, of a PM $P_{\mathcal S,Q}\in\mathcal S$ such that $D(P_{\mathcal S,Q}\parallel Q)\le D(P\parallel Q)$ for all $P\in\mathcal S:=\mathcal P$, the latter being the set of all isotropic Gaussian PMs.

In contrast, Csiszár's result is that for any given PM $P$ (rather than $Q$) there is a unique PM $\tilde P^{\mathcal S,P}$ such that $D(\tilde P^{\mathcal S,P}\parallel Q)+D(P\parallel\mathcal S)\le D(P\parallel Q)$ for all $Q\in\mathcal S$ (rather than $P\in\mathcal S$), where $D(P\parallel\mathcal S):=\inf_{Q\in\mathcal S}D(P\parallel Q)$. (So, this looks like some kind of Pythagorean inequality.) A corollary to this result by Csiszár is that $D(\tilde P^{\mathcal S,P}\parallel Q)\le D(P\parallel Q)$ for all $Q\in\mathcal S$ (rather than $P\in\mathcal S$).
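For readability, the inequality and its corollary can be displayed (this just restates the formulas above): $$D(\tilde P^{\mathcal S,P}\parallel Q)+D(P\parallel\mathcal S)\le D(P\parallel Q)\quad\text{for all }Q\in\mathcal S,$$ and, since $D(P\parallel\mathcal S)\ge0$, it follows that $$D(\tilde P^{\mathcal S,P}\parallel Q)\le D(P\parallel Q)\quad\text{for all }Q\in\mathcal S.$$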

So, if $D$ were a metric, your question would be about the existence and uniqueness of a PM in $\mathcal S$ closest to $Q$. On the other hand, again if $D$ were a metric, the mentioned corollary of Csiszár's result would say that the segment $\tilde P^{\mathcal S,P}Q$, obtained from the segment $PQ$ by projecting $P$ onto $\mathcal S$, is no longer than $PQ$, for any $Q$. The latter property would be equivalent to your "closest" property if $D$ were a Euclidean metric. But $D$ is not a metric at all. So, Csiszár's result says something different from what your question is about.

(The comparison of your question to Csiszár's result got more complicated than necessary because you interchanged the usual order of the arguments $P$ and $Q$ of $D(P\parallel Q)$, which was also used by Csiszár. So, I have edited your post, and mine, accordingly.)

  • I'm a bit confused, since Theorem 1, equation 2, of "Information Projections Revisited" by I. Csiszár states that the I-projection is unique if the set of probability measures which we project to is log-convex. What am I missing here, since these statements seem to be at odds... Commented Mar 9, 2023 at 13:59
  • What conditions do we need to deduce existence, uniqueness, and continuity of the projection map? I made a follow-up post: mathoverflow.net/questions/442370/… Commented Mar 9, 2023 at 15:41
  • To me this example seems to only suggest that, if $\pi$ is well-defined wrt $\mathcal{Q}$, then it is locally Lipschitz and not globally Lipschitz, with a possibly very bad constant. Is that your point? Commented Mar 9, 2023 at 17:40
  • @PiotrK : I have added a response to your comment about a relation of your question to the result by Csiszár. Commented Mar 9, 2023 at 18:06
  • @PiotrK : Yes, I think it is likely that $\pi$ is locally Lipschitz on the set where it is well defined. Commented Mar 9, 2023 at 18:08
