
Background

Say we have an optimization problem $$\min_x f(x) = g(x) + h(x)$$

where $g$ is convex and differentiable, and $h$ is convex but not necessarily differentiable. If $g$ is, for example, the mean squared error function, we can use proximal algorithms to solve problems of this type via the proximal operator, given by $$\text{prox}_h(x) = \text{argmin}_z \frac{1}{2}\left\| x-z\right\|^2_2+h(z)$$
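To make this concrete, here is a minimal sketch (my own illustration, not tied to any particular library) of the proximal operator for $h(z) = \lambda \|z\|_1$, which has the well-known closed form of elementwise soft-thresholding:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of h(z) = lam * ||z||_1.

    Solves argmin_z 0.5 * ||x - z||_2^2 + lam * ||z||_1,
    whose closed-form solution is elementwise soft-thresholding.
    """
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```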

Problem

Suppose we know the proximal operators (i.e., the solutions to the corresponding optimization problems) for two problems of the same form, where $h_1$ and $h_2$ are convex but not necessarily differentiable:

  1. $\min_x \frac{1}{2}\left\| y-x\right\|^2 + h_1(x)$, so that we know $\text{prox}_{h_1}(x)$ for this problem.
  2. $\min_x \frac{1}{2}\left\| y-x\right\|^2 + h_2(x)$, so that we know $\text{prox}_{h_2}(x)$ for this problem.

If we now combine the two to give us a new optimization problem $$\min_x \frac{1}{2}\left\| y-x\right\|^2 +h_1(x)+ h_2(x)$$

My questions

  1. Can we work out a proximal operator $\text{prox}_{h_1, h_2}$ to solve this new problem using the two operators we already have?
  2. Do we require any assumptions on separability? Do we need the new penalty $h' = h_1 + h_2$ to be a separable function (i.e., are $h_1$ and $h_2$ separable from each other)? What happens if this is not the case?

Example

To give a more concrete example, we can look at the sparse-group lasso, given by $$\min_{\beta}\left( \frac{1}{2n}\left\|y-X\beta \right\|^2_2 + (1-\alpha)\lambda\sum_{l=1}^m \sqrt{p_l}\left\|\beta^{(l)} \right\|_2 + \alpha \lambda \left\| \beta\right\|_1 \right),$$

where $h_1(\beta)=(1-\alpha)\lambda\sum_{l=1}^m \sqrt{p_l}\left\|\beta^{(l)} \right\|_2$ is the group lasso penalty and $h_2(\beta)=\alpha \lambda \left\| \beta\right\|_1$ is the lasso penalty. Therefore, could we use the proximal operators for the lasso and group lasso to derive a proximal operator for SGL? I understand that we can also derive the proximal operator directly for SGL, but I'm adding this as an example to make the question more concrete.
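The lasso prox is the soft-thresholding operator sketched above; for the group lasso penalty, a minimal sketch of $\text{prox}_{h_1}$ (my own illustration; the `groups` index layout is hypothetical) applies the same kind of shrinkage blockwise:

```python
import numpy as np

def prox_group_lasso(v, groups, lam):
    """Prox of h1(b) = lam * sum_l sqrt(p_l) * ||b^(l)||_2.

    `groups` is a list of index arrays, one per group l, with p_l = len(idx).
    Each block is shrunk toward zero (blockwise soft-thresholding).
    """
    out = np.zeros_like(v, dtype=float)
    for idx in groups:
        block = v[idx]
        norm = np.linalg.norm(block)
        thresh = lam * np.sqrt(len(idx))
        if norm > thresh:
            out[idx] = (1.0 - thresh / norm) * block
    return out
```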

In this case, do we have separability between the lasso and group lasso penalties? Do we need it for this approach to work?

  • Since you already have an accepted answer, just another comment: there is no general way that I know of to produce the prox of the sum of two functions from the individual proxes. I don't think such a thing is possible in general. Commented Aug 27, 2022 at 8:19

1 Answer


I think what you want is the alternating direction method of multipliers (ADMM), which is a type of proximal method. The formulas you've written down are really close to ADMM. Rather than try to minimize the difficult objective $f(x) = g(x) + h(x)$ directly, we instead introduce an auxiliary variable $y$ and minimize

$$f(x, y) = g(x) + h(y)$$

subject to the constraint that $x = y$. Now, there are several ways we might enforce this constraint; here we'll use the augmented Lagrangian

$$\mathscr L_\rho(x, y, \lambda) = g(x) + h(y) - \langle \lambda, x - y\rangle + \frac{\rho}{2}\|x - y\|^2.$$

The vector $\lambda$ is a Lagrange multiplier and $\rho$ is a scalar penalty factor. We can complete the square to write this in a slightly more revealing form:

$$\ldots = g(x) + h(y) + \frac{\rho}{2}\left\|x - y - \rho^{-1}\lambda\right\|^2 - \frac{1}{2\rho}\|\lambda\|^2.$$
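To check the algebra behind this step, expand the shifted square:

$$\frac{\rho}{2}\left\|x - y - \rho^{-1}\lambda\right\|^2 = \frac{\rho}{2}\|x - y\|^2 - \langle \lambda, x - y\rangle + \frac{1}{2\rho}\|\lambda\|^2,$$

so subtracting $\frac{1}{2\rho}\|\lambda\|^2$ recovers the $\lambda$- and $\rho$-dependent terms of the augmented Lagrangian.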

At this point we can see that finding a minimizer of the augmented Lagrangian just with respect to $x$ or $y$ separately amounts to evaluating a proximal operator of $g$ or $h$ with a shifted argument. You've already assumed that both of these proximal operators are computable.

Finally, there's the question of what to do about the Lagrange multipliers $\lambda$. For that you can use a gradient ascent step at every iteration. I found the fact that this works without any line search to be deeply mysterious, and very few references dig into this detail at all. The book Constrained Optimization and Lagrange Multiplier Methods by Dimitri Bertsekas does a good job of explaining why, with some nice theory behind it, and the author makes the book available for free on his website.

This is all described in further detail in section 4.4.1 of Proximal Algorithms by Parikh and Boyd. I also wrote a bit about using ADMM to solve inverse coefficient problems for PDE with total variation regularization here which, although distinct from the lasso problem you've described, still has some of the same $\ell^1$ kind of character to it.
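For concreteness, here is a minimal sketch of the resulting iteration (my own code, not taken from either reference; `prox_g(v, t)` and `prox_h(v, t)` are assumed to evaluate the proximal operators of $t\,g$ and $t\,h$):

```python
import numpy as np

def admm_two_prox(prox_g, prox_h, x0, rho=1.0, n_iter=500):
    """Minimize g(x) + h(x) by ADMM, given the two proximal operators.

    prox_g(v, t) is assumed to return argmin_z g(z) + (1/(2t)) * ||z - v||^2
    (likewise for prox_h), so t = 1/rho below.  Sign conventions follow the
    augmented Lagrangian written above.
    """
    x = x0.copy()
    y = x0.copy()
    lam = np.zeros_like(x0)            # Lagrange multiplier for x = y
    for _ in range(n_iter):
        # x-update: prox of g at a shifted argument
        x = prox_g(y + lam / rho, 1.0 / rho)
        # y-update: prox of h at a shifted argument
        y = prox_h(x - lam / rho, 1.0 / rho)
        # multiplier update: a gradient-ascent step on the dual
        lam = lam - rho * (x - y)
    return x, y
```

At convergence $x \approx y$, and either variable can be taken as the solution of the original problem.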

  • This is really useful, I'll work my way through the content. Thank you! Commented Aug 27, 2022 at 9:03
