
We consider an estimation problem where the parameter $\theta$ is assigned a prior with negative log-density $g_\alpha$, depending on some hyperparameter $\alpha$ (e.g. the variance of a Gaussian prior), and the observation $x$ has negative log-likelihood $f(x|\theta)$. The MAP estimator is then given by: $$\theta_\alpha^* (x) = \arg\min_{\theta \in \mathbb{R}^n} f (x | \theta) + g_\alpha(\theta).$$ We assume that $f$ and $g_\alpha$ have nice properties, e.g. such that the minimization problem above is strongly convex and admits a unique minimizer.

I am interested in known properties of the MSE of the MAP estimator, given by

$$\mathrm{MSE}_\alpha (\theta_\alpha^*(x)) = \mathbb{E}_{\theta_\alpha} [ \mathbb{E}_{x|\theta_\alpha} [ \| \theta_\alpha^*(x) - \theta_\alpha\|^2 ]],$$

where $\theta_\alpha$ is drawn from the prior associated with $g_\alpha$. In particular, assuming one uses the MAP estimator associated with the prior $g_\beta$ instead of the true $g_\alpha$, do we necessarily have $$\mathrm{MSE}_\alpha (\theta_\alpha^*(x)) \leq \mathrm{MSE}_\alpha (\theta_\beta^*(x)),$$ that is, does a mismatched prior always lead to degraded performance in terms of MSE?

The result is true in the Gaussian case, since there the MAP estimator coincides with the MMSE estimator. What about the general case?
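For concreteness, here is a minimal worked sketch of the scalar Gaussian case (assuming, for illustration, $\theta \sim \mathcal{N}(0, \alpha)$ and $x \mid \theta \sim \mathcal{N}(\theta, 1)$). The MAP/MMSE estimator built with prior variance $\beta$ is linear, $$\theta_\beta^*(x) = \frac{\beta}{\beta+1}\, x,$$ and its MSE under the true prior variance $\alpha$ is $$\mathrm{MSE}_\alpha(\theta_\beta^*(x)) = \frac{\alpha}{(\beta+1)^2} + \frac{\beta^2}{(\beta+1)^2},$$ which is minimized exactly at $\beta = \alpha$, so in this case any mismatch increases the MSE.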


1 Answer


Consider $\theta\sim e^{-V}$ for convex $V:\mathbb{R}\to \mathbb{R}$ and $x\mid\theta\sim N(\theta ,1)$. Minimizing the negative log-posterior gives $$\theta_*(x)=\text{argmin}_\theta\ (x-\theta)^2/2 + V(\theta). $$ So $\theta_*(x)=\text{prox}[V](x)$, the proximal operator of $V$ at $x$.
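For instance, for the Laplace potential $V(\theta)=\lambda|\theta|$, this prox is the familiar soft-thresholding map $\theta_*(x)=\operatorname{sign}(x)\max(|x|-\lambda,0)$, a simple non-Gaussian instance of the setup.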

Now consider $E[\theta | x]$, which is the minimizer of the MSE by definition of the conditional expectation. To answer the question in the negative, a natural next avenue is to see whether we can write $E[\theta | x]$ as the proximal operator $\text{prox}[W](x)$ for a prior $e^{-W}$ different from the original true prior $e^{-V}$.

A characterization of which functions can be written as proximal operators of a convex function was obtained by Moreau; see the references in "When is a mapping the proximity operator of some convex function?". So we only need to check that $x\mapsto E[\theta|x]$ is non-expansive and the subgradient of a convex function; here in dimension 1 it is enough to prove that it is non-expansive and nondecreasing. If everything is differentiable, we just need to check that the derivative is in $[0,1]$.

$E[\theta | x]$ has an explicit formula (Tweedie's formula): $$ E[\theta | x] = x + l'(x), \qquad l(x) = \log f(x), \qquad f(x)= \int\varphi(x-\theta)e^{-V(\theta)}\,d\theta, $$ where $\varphi$ is the standard normal pdf. Equivalently, $$ E[\theta | x] = x + l'(x) = x - \frac{1}{f(x)}\int (x-\theta)\varphi(x-\theta)e^{-V(\theta)}\,d\theta. $$ Differentiating, the derivative is $$ 1 - \frac{1}{f(x)^2}\Big(\int (x-\theta)\varphi(x-\theta)e^{-V(\theta)}\,d\theta\Big)^2 - 1 + \frac{1}{f(x)}\int (x-\theta)^2\varphi(x-\theta)e^{-V(\theta)}\,d\theta. $$ For a fixed $x$, this is the variance of the random variable $x-\theta$ with respect to the density $\theta\mapsto \varphi(x-\theta)e^{-V(\theta)}/f(x)$, so it is non-negative. It is at most 1 (as required by the necessary and sufficient condition to write $E[\theta|x]$ as a prox) thanks to the Brascamp-Lieb inequality applied to densities of the form $C \exp(\theta x - \theta^2/2 - V(\theta))$ with $V$ convex.
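A minimal numerical sketch of this argument, assuming the illustrative convex potential $V(\theta)=\theta^4/4$ and the same unit-variance Gaussian likelihood: it computes $x\mapsto E[\theta\mid x]$ by quadrature and checks that its finite-difference derivative stays within $[0,1]$, consistent with the bound above.

    import numpy as np

    def posterior_mean(x, V, grid):
        """E[theta | x] by quadrature, for x | theta ~ N(theta, 1) and prior density proportional to exp(-V)."""
        w = np.exp(-0.5 * (x - grid) ** 2 - V(grid))  # unnormalized posterior density on the theta grid
        return np.trapz(grid * w, grid) / np.trapz(w, grid)

    V = lambda t: t ** 4 / 4.0               # convex, non-Gaussian potential (illustrative choice)
    grid = np.linspace(-10.0, 10.0, 20001)   # integration grid for theta

    xs = np.linspace(-5.0, 5.0, 401)
    m = np.array([posterior_mean(x, V, grid) for x in xs])
    slope = np.gradient(m, xs)               # finite-difference derivative of x -> E[theta | x]
    print(slope.min(), slope.max())          # expected to lie within [0, 1]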

  • I don't understand why it suffices to show that the MMSE is the MAP of another optimization problem to provide a counter-example. What does this tell us about the MSE of $\theta_\beta^*(x)$ for $\beta \neq \alpha$ in my question? Commented Jan 25 at 8:40
  • With two densities $e^{-V}$ and $e^{-W}$ we can always parametrize them. For instance take $\alpha=0$, $\beta=1$ and parametrize the prior function as $g_t = (1-t)e^{-V} + t e^{-W}$. The question asks if the MSE inequality holds for all $\alpha$ and $\beta$. The answer provides a counter-example with the MSE inequality reversed, $\alpha=0$ and $\beta=1$: using the prior $e^{-W}$ instead of the true prior $e^{-V}$ gives a smaller MSE. Commented Jan 25 at 14:13
  • OK, I understand, thanks. In my case, the dependence on $\alpha$ is, for instance, through the variance of the regularization. Do you know if the result has a chance to be true in this case? Commented Jan 25 at 16:52
  • You could set $\alpha$ as the variance of the distribution $e^{-V}$ and $\beta$ the variance of the distribution $e^{-W}$, and parametrize $g_t$ to be some interpolation between $e^{-V}$ and $e^{-W}$ for $t\in[\alpha,\beta]$ such that the variance of $g_t$ equals $t$. For a specific family $g_\alpha$ you should ask another question. Commented Jan 25 at 17:24
  • Thanks, I posted a follow-up question here: mathoverflow.net/questions/486610/… Commented Jan 25 at 18:26
