
Are there better criteria than the Armijo criterion for damped Newton iteration with backtracking line search, when the objective is standard self-concordant? (See Boyd and Vandenberghe.)

Let $F(x)$ be a (standard self-concordant) convex function and let $\alpha, \beta \in (0,1)$ be given parameters. Assume that efficient and accurate routines are available for evaluating $F,F',F''$. The Newton step is $$ n_k = [F''(x_k)]^{-1}F'(x_k), $$ and the damped Newton iteration for minimizing $F$ reads $$ x_{k+1} = x_k - s_k n_k. $$ The number $s_k>0$ is the step size. It is found by backtracking line search: step sizes $s_k = \alpha^j$ are tried for $j=0,1,\ldots$ until one is found that satisfies the Armijo criterion $$ F(x_k - s_k n_k) \leq F(x_k) - s_k\beta\,[F'(x_k)]^Tn_k. $$
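To make the setup concrete, here is a minimal sketch of that iteration in Julia (which is what my code is written in); `F`, `Fp`, `Fpp` stand for routines evaluating $F$, $F'$, $F''$, and the function name and parameter values are placeholders, not my actual code:

```julia
using LinearAlgebra

# Minimal sketch of the damped Newton iteration with backtracking line search
# described above. Works for scalar or vector x; Fpp(x) must return a scalar
# or a matrix, Fp(x) a scalar or a vector.
function damped_newton(F, Fp, Fpp, x0; α = 0.5, β = 0.01, maxiter = 100, jmax = 60)
    x = x0
    for k in 1:maxiter
        n = Fpp(x) \ Fp(x)              # Newton step n_k = [F''(x_k)]^{-1} F'(x_k)
        g = dot(Fp(x), n)               # [F'(x_k)]^T n_k
        s = 1.0
        j = 0
        # Backtracking with the Armijo criterion; jmax guards against the
        # floating-point breakdown discussed below.
        while F(x - s * n) > F(x) - s * β * g && j < jmax
            s *= α
            j += 1
        end
        x = x - s * n
        # (Stopping criterion deliberately omitted; that is part of the question.)
    end
    return x
end
```

For a barrier-type objective the backtracking loop would also need a feasibility guard (reject trial points where $F$ is infinite or undefined), which I have left out.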

This has a problem in double precision arithmetic. To see this, note that $$ F(x_k) - F(x_k - s_k n_k) \leq F(x_k)-F(x_{\infty}) \sim \|x_k-x_{\infty}\|^2. $$ As usual, denote by $\epsilon \approx 2.22 \times 10^{-16}$ the machine epsilon of double precision arithmetic, and note that if $F(x_k)-F(x_{\infty}) < \epsilon$ then you can expect that, in machine arithmetic, $F(x_k)=F(x_{\infty})$; the Armijo test therefore breaks down once $\|x_k-x_{\infty}\| \approx \sqrt{\epsilon}$, i.e. while you are still quite far from convergence. In that regime you expect $[F'(x_k)]^Tn_k = O(\epsilon)$, and the Armijo test probably accepts the step size $s_k=1$ every time. You probably need to stop taking damped Newton steps at that point, or else these large steps could lead to oscillation, so it is very hard to locate the minimizer $x_{\infty}$ to an accuracy better than $O(\sqrt{\epsilon})$.
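Here is a tiny illustration of the collision, using the log-barrier-style toy $F(x) = x - \log x$ (minimizer $x=1$, minimum value $F(1)=1$; this is just an illustrative choice, not my actual objective). The "exact" column is computed in BigFloat:

```julia
# Log-barrier-style toy: F(x) = x - log(x) has its minimizer at x = 1, F(1) = 1.
F(x) = x - log(x)
for d in (1e-4, 1e-8, 1e-12)
    xk = 1.0 + d
    # The exact decrease is about d^2/2; once d is below sqrt(eps) the two
    # Float64 values of F collide and the computed decrease is 0.
    computed = F(xk) - F(1.0)
    exact = Float64(F(big(xk)) - F(big(1.0)))
    println("d = $d   computed decrease = $computed   exact ≈ $exact")
end
```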

With a derivative-free method (e.g. Brent's method) this is an unavoidable limitation, but a Newton method has derivative information available, so it should be possible to do better, and I wonder whether this is already in the literature.

Off the top of my head, one could replace the Armijo test with something like: $$ -[F'(x_k - s_kn_k)]^Tn_k \leq \beta [F'(x_k)]^Tn_k. $$ I don't know the implications of such a choice, or whether better choices already exist elsewhere.
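For concreteness, a sketch of what the backtracking loop with that test would look like (the helper name and parameter values are placeholders); note that only gradient values enter the test, so the $F(x_k) - F(x_k - s_kn_k)$ subtraction never occurs:

```julia
using LinearAlgebra

# Backtracking with the gradient-based acceptance test above (hypothetical
# helper, not from any particular package).
function backtrack_gradient(Fp, x, n; α = 0.5, β = 0.9, jmax = 60)
    g0 = dot(Fp(x), n)                  # [F'(x_k)]^T n_k, positive for a Newton step
    s = 1.0
    for j in 1:jmax
        if -dot(Fp(x - s * n), n) <= β * g0
            return s
        end
        s *= α
    end
    return s                            # give up and return the smallest trial step
end
```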

I'm also not sure how to stop the overall Newton iteration. Right now I'm using $$ F(x_{k+1}) = F(x_k) \text{ and } \alpha\|F'(x_k)\| \leq \|F'(x_{k+1})\| $$ Is there a better way?

Can anyone point me in the right direction with maybe references or methods for the line search and the stopping criterion?

  • Isn't $s_k=1$ already optimal when you're very close to convergence? I am not an expert, but it is not my experience that Newton oscillates and stagnates around $\sqrt\epsilon$ in practical problems. In fact, I have seen it recommended to stop doing line searches when one is close to the exact solution. Commented May 31, 2024 at 18:31
  • I'm doing barrier methods so $F(x) = F(t,x) = tc^Tx+G(x)$ for some barrier $G$, and then you make $t$ incredibly large, and you get cancellation between these two terms. Nevertheless, what you say is true, if you've picked your t-step size correctly, you don't need to dampen (no line search), and with short t-steps, one Newton step suffices. However, this doesn't perform well in practice. Instead, you try to take large t steps, but then you have to be able to detect that Newton has or has not converged. So you can take out the line search if you want, I still need a convergence criterion. Commented May 31, 2024 at 18:43
  • when the objective is standard self-concordant... Does that mean that you have a priori knowledge of the constant $C$ in the inequality $|f'''|\le C|f''|^{3/2}$ along each line and wish to develop your stopping criteria using this constant alone, or do you mean something else by this? Commented Aug 9, 2024 at 21:58
  • Apologies for the late reply. A standard self-concordant function has C=2, if memory serves. This works out well for standard barriers found in convex optimization. Commented Aug 20, 2024 at 22:31

1 Answer


Is there a better way to decide when to stop the Newton iteration? The two strategies you mention (small decrease in the objective, small decrease in the gradient) are pretty common. Another strategy you'll see used a lot (e.g. in PETSc) is a relative reduction in the norm of the gradient compared to its starting value: $$\|F'(x_k)\| \le \text{tol}\cdot\|F'(x_0)\|$$ where the end user has to select the tolerance. This is a frequent pain point for physical scientists and engineers. If you pick too small a tolerance, the vagaries of floating-point arithmetic might mean that the required reduction is never attainable in practice, or attainable only after an unreasonably large number of iterations. If you pick it too large, the results are likely inaccurate, and the only way to sanity-check them after the fact is by manual inspection. Another common strategy is an absolute tolerance on the norm of the gradient, but using that criterion effectively requires some advance knowledge of a reasonable scale for that norm.
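As a sketch (the tolerance values below are placeholders, not recommendations), the relative test with an optional absolute floor might look like:

```julia
using LinearAlgebra

# Relative-reduction test on the gradient norm, with an optional absolute
# floor so the test can also pass when the initial gradient is already tiny.
function gradient_converged(gk, g0; rtol = 1e-8, atol = 0.0)
    return norm(gk) <= max(rtol * norm(g0), atol)
end
```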

There's another approach which is hinted at in the Boyd book that you reference. For a convex problem, the Newton decrement $$\Delta_k \equiv \frac{1}{2}\langle F'(x_k), n_k\rangle = \frac{1}{2}\langle F'(x_k), F''(x_k)^{-1}F'(x_k)\rangle$$ is non-negative. The Newton decrement tells you the expected reduction in the objective from taking one more Newton step, assuming that the second-order Taylor expansion of $F$ is a good approximation around $x_k$. You might imagine using the Newton decrement to build a convergence criterion.
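For example, a sketch of a decrement-based test, assuming you have already solved for the Newton step $n_k$ (the names and tolerance are placeholders):

```julia
using LinearAlgebra

# Newton decrement as defined above: Δ_k = (1/2)⟨F'(x_k), n_k⟩, the decrease
# predicted by the quadratic model, with n_k obtained from F''(x_k) n_k = F'(x_k).
newton_decrement(gk, nk) = 0.5 * dot(gk, nk)
decrement_converged(gk, nk; tol = 1e-10) = newton_decrement(gk, nk) <= tol
```

In Boyd and Vandenberghe's notation this is the test $\lambda(x_k)^2/2 \le \epsilon$ that they use to terminate Newton's method.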

Now here's where I go out on a limb a bit. The objectives for many real optimization problems usually have a bit more structure and can be written as $$F(x) = G(x) + H(x)$$ where $G$ and $H$ are both convex and $G$ is strictly positive. For example, I've read your papers so I know you like the p-Laplace problem: $$F(u) = \underbrace{\frac{1}{p}\int_\Omega|\nabla u|^p dx}_{\text{the positive part}} - \int_\Omega f\cdot u\;dx.$$ The alternative criterion is to use the ratio $$\Delta_k / G(x_k) < \text{tol}$$ to decide convergence. The advantage of this approach is that it only depends on the current guess $x_k$ and search direction $n_k$ -- you don't need to take another step of Newton's method to decide if you've converged yet. The disadvantage is that it requires explicit knowledge of the problem structure. I haven't seen this discussed elsewhere; I wrote about it in section 4.3 of this paper.

Are there better line search criteria? The stopping criterion for the line search that you mention is commonly referred to as the Armijo rule. Many software packages use just the Armijo rule to decide when to stop the line search and find empirically that it is good enough. Strictly speaking, you can only prove convergence of Newton's method with a backtracking line search assuming that the line search obeys both the Armijo rule and a condition on the decrease in curvature: $$-\langle F'(x_k - s_kn_k), n_k\rangle \le \beta\langle F'(x_k), n_k\rangle,$$ which is exactly the alternative test that you propose. This is all illustrated really nicely in figures 3.3-3.5 of the Nocedal and Wright book. In practice, it appears that just using the Armijo rule is enough for many practically relevant problems.
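Here is a sketch of a routine that checks both conditions for a trial step size (illustrative only; the helper name and parameter values are made up, and satisfying both conditions in general calls for a bracketing line search rather than plain backtracking, see the line-search algorithms in Chapter 3 of Nocedal and Wright):

```julia
using LinearAlgebra

# Check both the Armijo (sufficient decrease) condition and the curvature-type
# condition above for a trial step size s along the damped Newton step -n.
function armijo_and_curvature(F, Fp, x, n, s; β_armijo = 1e-4, β_curv = 0.9)
    g0 = dot(Fp(x), n)                          # [F'(x_k)]^T n_k
    armijo    = F(x - s * n) <= F(x) - s * β_armijo * g0
    curvature = -dot(Fp(x - s * n), n) <= β_curv * g0
    return armijo && curvature
end
```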

Breakdown of the Armijo rule. I think a more refined analysis here would say that $$F(x_k) - F(x_k - s_kn_k) \le \frac{1}{2}\|F''(x_\infty)\|\cdot\|x_k - x_\infty\|^2,$$ so the point at which you see subtractive cancellation also depends on the curvature of the objective near the true minimizer. I'm not quite sure about this, but I wonder if you're reasoning in fixed-point terms about a floating-point problem. You would expect subtractive cancellation when the relative difference between $F(x_k)$ and $F(x_k - s_kn_k)$ is less than the machine $\epsilon$, not the absolute difference. So it isn't sufficient to say that you'll get problems when $$\|x_k - x_\infty\|^2 < \epsilon / \|F''(x_\infty)\|;$$ you'd also need $|F(x_k)|$ to be of order 1 or greater. I'm a little less sure about this part, but I suspect that subtractive cancellation in the Armijo rule should be precluded by the outer-level stopping criterion for Newton's method.
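To see the relative-versus-absolute point numerically, take a log-barrier-style toy objective and the same objective shifted by a large constant (a made-up example, not your barrier): the decrease is identical in exact arithmetic, but in double precision it is swamped once it falls below $\epsilon\,|F|$, which happens much earlier for the shifted version.

```julia
# Toy objective F1 and the same objective shifted by a large constant F2.
# The exact decrease F(1 + d) - F(1) is the same for both, but in Float64 it
# is swamped once it drops below eps * |F|, which happens much earlier for F2.
F1(x) = x - log(x)            # minimum value 1
F2(x) = x - log(x) + 1e8      # minimum value 1e8 + 1
for d in (1e-4, 1e-6, 1e-8)
    xk = 1.0 + d
    println("d = $d   decrease(F1) = $(F1(xk) - F1(1.0))   decrease(F2) = $(F2(xk) - F2(1.0))")
end
```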

  • Thanks Daniel for all this information. I'm indeed working on solvers for p-Laplacians, but using the barrier method, so the objective isn't directly your F(u); my objective doesn't have this type of decomposition. The reason why I've asked this question here is that I've done lots of numerical experiments and observed the Armijo rule flaking out on me. I've also tried using the Newton decrement; I know it's in principle the right thing in self-concordant calculus, but I've also had floating point issues with that quantity. You are right, I'm trying to get rid of the "tol", it's undesirable. Commented May 31, 2024 at 17:19
  • I've also tried other tests. In the line search, with $\phi(s) = F(x_k-sn_k)$, I tried things like: use $s=1$ if $\phi'(1) \leq 0$, else backtrack until $\phi'(s)\phi'(0.5s)<0$ or maybe $\phi'(s)\phi'(2s)<0$. I'm just not sure what any of these do to the convergence analysis for self-concordant functions. Commented May 31, 2024 at 17:25
  • Also, I'm aware of the difference between floating point and fixed point, I'm just following the fashion of glossing over it. If you look at Boyd and Vandenberghe, they just do Newton iterations until the absolute error is less than $\epsilon$ and they don't care about relative vs absolute. Commented May 31, 2024 at 17:30
  • I mentioned the fixed vs floating point thing because I've definitely chased my own tail on multiple occasions only to remember how floating-point arithmetic actually works. I don't have a great answer for that part of your question and it would be interesting to manufacture a case to test whether you get $\|x_k - x_\infty\|$ to order $\epsilon$ or $\sqrt\epsilon$. It might be as simple as running the code you already have written in single- and then double-precision arithmetic. If you can do it in quad-precision even better but that can be annoying to code. Commented May 31, 2024 at 17:34
  • I've definitely had the $\epsilon$ vs $\sqrt{\epsilon}$ issue. I can run my code in BigFloat, it's in Julia. The code is here. That being said, you can probably find the behavior with this barrier function: $f(t,x) = tx-\log(x)$. The central path is $x^*(t) = 1/t$ but $f(t,1/t) = 1+\log t$ while $f(t,c/t) = c+\log t-\log c$ so the difference is $c-1-\log c \approx 0.5(c-1)^2$, so put $c = 1+\sqrt{\epsilon}$ I guess? Hopefully this example works, my wife wants me to go to dinner, I'll double check later. Commented May 31, 2024 at 17:56
