Acceleration in Optimization (Part IV)

15 minute read


Difficulty: Advanced Undergraduate Students

In the last post, we saw an example of an unaccelerated method that implemented Nesterov’s Estimating Sequence Scheme. Let us now give an example of an accelerated method.


Accelerated Gradient Method for ${\cal P}$


  • Choose $x_{0}\in{\rm dom}\ h$ and define $\Gamma_{0}(\cdot)\equiv0$. Set $A_{0}=0$, $v_{0}=x_{0}$, and $y_{0}=x_{0}$.

  • For $k=1,\ldots$

    • Find $a_{k}$ satisfying the quadratic equation $a_{k}^{2}/(A_{k-1}+a_{k})=2(1+\mu A_{k-1})/L$.

    • Set $A_{k}=A_{k-1}+a_{k}$ and compute

    \[\begin{aligned} y_{k} & \gets\frac{A_{k-1}}{A_{k}}x_{k-1}+\frac{a_{k}}{A_{k}}v_{k-1},\\ x_{k} & \gets{\rm argmin}\ _{u\in\mathbb{R}^{n}}\left\{ \ell_{f}(u;y_{k})+h(u)+\frac{L}{2}\|u-y_{k}\|^{2}\right\} .\end{aligned}\]
    • Define the function
    \[\Gamma_{k}(\cdot)\gets\frac{A_{k-1}}{A_{k}}\Gamma_{k-1}(\cdot)+\frac{a_{k}}{A_{k}}\left[\ell_{f}(\cdot;x_{k})+h(\cdot)\right].\]
    • Compute
    \[v_{k}\gets{\rm argmin}\ _{u\in\mathbb{R}^{n}}\left\{ A_{k}\Gamma_{k}(u)+\frac{1}{2}\|u-x_{0}\|^{2}\right\} .\]
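
To make the updates concrete, here is a minimal numerical sketch of the method (not part of the series, written only for illustration). It assumes $f(u)=\frac{1}{2}u^{T}Qu-b^{T}u$ with $L=\lambda_{\max}(Q)$ and $h(u)=\frac{\mu}{2}\|u\|^{2}$, so that the proximal step defining $x_{k}$ and the subproblem defining $v_{k}$ both have closed-form solutions, and $a_{k}$ is taken to be the positive root $a_{k}=\big(c+\sqrt{c^{2}+4cA_{k-1}}\big)/2$ with $c=2(1+\mu A_{k-1})/L$. The names `accelerated_gradient`, `Q`, and `b` are illustrative.

```python
import numpy as np


def accelerated_gradient(Q, b, mu, x0, num_iters=100):
    """Sketch of the accelerated gradient method above for
    f(u) = 0.5*u'Qu - b'u and h(u) = 0.5*mu*||u||^2 (illustrative choices)."""
    L = np.linalg.eigvalsh(Q).max()      # Lipschitz constant of grad f

    def grad_f(u):
        return Q @ u - b

    A, x, v = 0.0, x0.copy(), x0.copy()
    s = np.zeros_like(x0)                # running sum of a_j * grad f(x_j)
    for _ in range(num_iters):
        # a_k is the positive root of a^2/(A_{k-1} + a) = 2*(1 + mu*A_{k-1})/L.
        c = 2.0 * (1.0 + mu * A) / L
        a = (c + np.sqrt(c * c + 4.0 * c * A)) / 2.0
        A_next = A + a

        # y_k is a convex combination of the previous x and v iterates.
        y = (A / A_next) * x + (a / A_next) * v

        # x_k: proximal gradient step from y_k; the prox of h/L is a scaling here.
        x = (y - grad_f(y) / L) / (1.0 + mu / L)

        # v_k minimizes A_k*Gamma_k(u) + 0.5*||u - x0||^2, which has a closed form
        # because h is a simple quadratic.
        s += a * grad_f(x)
        v = (x0 - s) / (1.0 + mu * A_next)

        A = A_next
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((20, 10))
    Q, b, mu = M.T @ M, rng.standard_normal(10), 0.1
    x_star = np.linalg.solve(Q + mu * np.eye(10), b)   # exact minimizer of f + h
    x_hat = accelerated_gradient(Q, b, mu, x0=np.zeros(10))
    print("distance to minimizer:", np.linalg.norm(x_hat - x_star))
```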

As in the previous post, let us prove that the iterates of the above method satisfy both $({\cal R}_{k}^{1})$ and $({\cal R}_{k}^{2})$.

An Important Technical Bound

Before doing so, we derive an important technical inequality. In view of the optimality condition for $x_{k+1}$, let us define $\phi'(x_{k+1})$ and observe that

\[\begin{aligned} (**)\qquad\phi'(x_{k+1}) & :=\nabla f(x_{k+1})-\nabla f(y_{k+1})+L(y_{k+1}-x_{k+1})\\ & \in\nabla f(x_{k+1})+\partial h(x_{k+1})=\partial\phi(x_{k+1}).\end{aligned}\]
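
Indeed, the optimality condition of the proximal subproblem defining $x_{k+1}$ states that
\[0\in\nabla f(y_{k+1})+\partial h(x_{k+1})+L(x_{k+1}-y_{k+1}),\]
which, after rearranging and adding $\nabla f(x_{k+1})$, is exactly the inclusion in $(**)$.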

Using the inclusion $(**)$ and expanding the square, we first have the identity

\[\begin{aligned} & \left\langle \phi'(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle \\ & =L\|y_{k+1}-x_{k+1}\|^{2}-\left\langle \nabla f(y_{k+1})-\nabla f(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle \\ & \overset{(**)}{=}\frac{1}{L}\left\|\phi'(x_{k+1})-\left[\nabla f(x_{k+1})-\nabla f(y_{k+1})\right]\right\|^{2}-\left\langle \nabla f(y_{k+1})-\nabla f(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle \\ & =\frac{1}{L}\left[\|\phi'(x_{k+1})\|^{2}-2\left\langle \phi'(x_{k+1}),\nabla f(x_{k+1})-\nabla f(y_{k+1})\right\rangle +\|\nabla f(x_{k+1})-\nabla f(y_{k+1})\|^{2}\right]\\ & \qquad-\left\langle \nabla f(y_{k+1})-\nabla f(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle \\ & =\frac{1}{L}\|\phi'(x_{k+1})\|^{2}-\frac{1}{L}\|\nabla f(x_{k+1})-\nabla f(y_{k+1})\|^{2}+\left\langle \nabla f(y_{k+1})-\nabla f(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle .\end{aligned}\]

On the other hand, since $f$ is convex and $\nabla f(\cdot)$ is $L$-Lipschitz continuous from condition $(\beta)$ (see three posts ago), the gradient $\nabla f(\cdot)$ is $(1/L)$-co-coercive, and hence

\[\left\langle \nabla f(y_{k+1})-\nabla f(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle \geq\frac{1}{L}\|\nabla f(x_{k+1})-\nabla f(y_{k+1})\|^{2}.\]

Combining the previous two relations gives us the important bound

\[(\alpha')\qquad\left\langle \phi'(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle \geq\frac{1}{L}\|\phi'(x_{k+1})\|^{2}.\]

Verifying the Key Conditions

We now proceed by induction on conditions $({\cal R}_{k}^{1})$ and $({\cal R}_{k}^{2})$. Since they both trivially hold at iteration $0$, suppose they hold for some $k\geq0$. Using $({\cal R}_{k}^{2})$ and the fact that convex functions are minorized by their first-order approximations, we have

\[\begin{aligned} A_{k+1}\Gamma_{k+1}(u)+\frac{1}{2}\|u-x_{0}\|^{2} & =A_{k}\Gamma_{k}(u)+a_{k+1}\left[\ell_{f}(u;x_{k+1})+h(u)\right]+\frac{1}{2}\|u-x_{0}\|^{2}\\ & \overset{({\cal R}_{k}^{2})}{\leq}A_{k}\phi(u)+a_{k+1}\left[\ell_{f}(u;x_{k+1})+h(u)\right]+\frac{1}{2}\|u-x_{0}\|^{2}\\ & \leq(A_{k}+a_{k+1})\phi(u)+\frac{1}{2}\|u-x_{0}\|^{2}=A_{k+1}\phi(u)+\frac{1}{2}\|u-x_{0}\|^{2},\end{aligned}\]

for every $u\in\mathbb{R}^{n}$, which is exactly $({\cal R}_{k+1}^{2})$. For $({\cal R}_{k+1}^{1})$, let $\phi'(x_{k+1})\in\partial\phi(x_{k+1})$ be as in $(**)$ and recall that convexity of $\phi$ implies that

\[(\ddagger)\qquad\phi(u)\geq\phi(x_{k+1})+\left\langle \phi'(x_{k+1}),u-x_{k+1}\right\rangle \quad\forall u\in\mathbb{R}^{n}.\]

We now use the induction hypothesis $({\cal R}_{k}^{1})$, the fact that the function $u\mapsto A_{k}\Gamma_{k}(u)+\frac{1}{2}\|u-x_{0}\|^{2}$ is $(1+\mu A_{k})$-strongly convex with minimizer $v_{k}$, condition $(\ddagger)$ above, and the updates for $y_{k+1},$ $A_{k+1},$ and $\Gamma_{k+1}$ to obtain

\[\begin{aligned} & \inf_{u\in\mathbb{R}^{n}}\left\{ A_{k+1}\Gamma_{k+1}(u)+\frac{1}{2}\|u-x_{0}\|^{2}\right\} \\ & \overset{\Gamma_{k+1}}{=}\inf_{u\in\mathbb{R}^{n}}\left\{ A_{k}\Gamma_{k}(u)+a_{k+1}\left[\ell_{f}(u;x_{k+1})+h(u)\right]+\frac{1}{2}\|u-x_{0}\|^{2}\right\} \\ & \overset{({\cal R}_{k}^{1})}{\geq}\inf_{u\in\mathbb{R}^{n}}\left\{ A_{k}\phi(x_{k})+\frac{1+\mu A_{k}}{2}\|u-v_{k}\|^{2}+a_{k+1}\left[\ell_{f}(u;x_{k+1})+h(u)\right]\right\} \\ & \overset{h}{\geq}\inf_{u\in\mathbb{R}^{n}}\left\{ A_{k}\phi(x_{k})+\frac{1+\mu A_{k}}{2}\|u-v_{k}\|^{2}+a_{k+1}\left[\phi(x_{k+1})+\left\langle \phi'(x_{k+1}),u-x_{k+1}\right\rangle \right]\right\} \\ & \overset{(\ddagger)}{\geq}\inf_{u\in\mathbb{R}^{n}}\left\{ A_{k+1}\phi(x_{k+1})+\frac{1+\mu A_{k}}{2}\|u-v_{k}\|^{2}+\right.\\ & \qquad\qquad\left.a_{k+1}\left\langle \phi'(x_{k+1}),u-x_{k+1}\right\rangle +A_{k}\left\langle \phi'(x_{k+1}),x_{k}-x_{k+1}\right\rangle \right\} \\ & \overset{y_{k+1}}{=}\inf_{u\in\mathbb{R}^{n}}\left\{ A_{k+1}\phi(x_{k+1})+\frac{1+\mu A_{k}}{2}\|u-v_{k}\|^{2}+\right.\\ & \qquad\qquad\left.a_{k+1}\left\langle \phi'(x_{k+1}),u-x_{k+1}\right\rangle +\left\langle \phi'(x_{k+1}),A_{k+1}y_{k+1}-a_{k+1}v_{k}-A_{k}x_{k+1}\right\rangle \right\} \\ & =\inf_{u\in\mathbb{R}^{n}}\left\{ A_{k+1}\phi(x_{k+1})+\frac{1+\mu A_{k}}{2}\|u-v_{k}\|^{2}+\right.\\ & \qquad\qquad\left.a_{k+1}\left\langle \phi'(x_{k+1}),u-v_{k}\right\rangle +A_{k+1}\left\langle \phi'(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle \right\} .\end{aligned}\]

It is then straightforward to show that the minimum of the last problem is attained at

\[u_{k}=v_{k}-\frac{a_{k+1}}{1+\mu A_{k}}\phi'(x_{k+1}).\]
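
Indeed, the objective of this last problem is strongly convex in $u$, and its gradient vanishes at $u_{k}$:
\[(1+\mu A_{k})(u_{k}-v_{k})+a_{k+1}\phi'(x_{k+1})=0.\]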

Substituting $u=u_{k}$ into the previous set of inequalities, and using the quadratic identity associated with $a_{k+1}$ and the technical bound $(\alpha')$, we have that

\[\begin{aligned} & \inf_{u\in\mathbb{R}^{n}}\left\{ A_{k+1}\Gamma_{k+1}(u)+\frac{1}{2}\|u-x_{0}\|^{2}\right\} \\ & \geq A_{k+1}\phi(x_{k+1})-\frac{a_{k+1}^{2}}{2(1+\mu A_{k})}\|\phi'(x_{k+1})\|^{2}+A_{k+1}\left\langle \phi'(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle \\ & \overset{a_{k+1}}{=}A_{k+1}\left[\phi(x_{k+1})+\left\langle \phi'(x_{k+1}),y_{k+1}-x_{k+1}\right\rangle -\frac{1}{L}\|\phi'(x_{k+1})\|^{2}\right]\\ & \overset{(\alpha')}{\geq}A_{k+1}\phi(x_{k+1}),\end{aligned}\]

which is exactly $({\cal R}_{k+1}^{1})$. Consequently, the accelerated gradient method is an implementation of the Estimating Sequence Scheme.

Growth Rate of $A_{k}$

Before finishing with some remarks, let us establish the growth rate of $A_{k}$. We claim that, for every $k\geq1$,

\[A_{k}\geq\max\left\{ \frac{k^{2}}{2L},\frac{2}{L}\left[1+\sqrt{\frac{\mu}{2L}}\right]^{2(k-1)}\right\},\]

which proves that this method is indeed an accelerated one (see the previous posts). To show this, first note that the quadratic associated with $a_{k+1}$ implies that

\[\begin{aligned} A_{k+1} & \leq A_{k+1}(1+\mu A_{k})\overset{a_{k+1}}{=}\frac{L}{2}(A_{k+1}-A_{k})^{2}=\frac{L}{2}(A_{k+1}^{1/2}-A_{k}^{1/2})^{2}(A_{k+1}^{1/2}+A_{k}^{1/2})^{2}\\ & \leq2A_{k+1}L(A_{k+1}^{1/2}-A_{k}^{1/2})^{2}\end{aligned}\]

and, hence, dividing by $A_{k+1}$ and unrolling the resulting recursion (recall that $A_{0}=0$) yields $A_{k}^{1/2}\geq k/\sqrt{2L}$, which gives the first bound. Now suppose $\mu>0$. By the same argument, we have

\[\mu A_{k}A_{k+1}<A_{k+1}(1+\mu A_{k})\leq2A_{k+1}L\left[A_{k+1}^{1/2}-A_{k}^{1/2}\right]^{2}\]

and, hence, $A_{k+1}^{1/2}\geq A_{k}^{1/2}\left[1+\sqrt{\mu/2L}\right]$. Since $A_{1}=2/L$, the second bound on $A_{k}$ follows.
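
As a quick sanity check on this growth rate (purely illustrative and not part of the argument, with arbitrarily chosen values of $L$ and $\mu$), the following snippet iterates the defining quadratic for $a_{k}$ and verifies the claimed lower bound on $A_{k}$:

```python
import math

L, mu = 10.0, 0.5     # arbitrary illustrative values; any L > 0 and mu >= 0 work
A = 0.0
for k in range(1, 51):
    # a_k is the positive root of a^2/(A_{k-1} + a) = 2*(1 + mu*A_{k-1})/L.
    c = 2.0 * (1.0 + mu * A) / L
    a = (c + math.sqrt(c * c + 4.0 * c * A)) / 2.0
    A += a
    bound = max(k ** 2 / (2.0 * L),
                (2.0 / L) * (1.0 + math.sqrt(mu / (2.0 * L))) ** (2 * (k - 1)))
    assert A >= bound - 1e-9, (k, A, bound)
print("growth-rate bound verified for k = 1, ..., 50")
```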

Remark IV.1. The accelerated gradient method shares a few similarities with the dual gradient method. For example, its minorant $\Gamma_{k}$ is a convex combination of the first-order approximations $\ell_{f}(\cdot;\bar{x})+h(\cdot)$ at various centering points $\bar{x}$, the main iterate sequence $\{x_{k}\}_{k\geq0}$ is generated by a primal (proximal) gradient step, and the definition of $v_{k}$ is the same as in the dual gradient method.

Remark IV.2. On the other hand, the main difference between the two methods is that $\Gamma_{k}$ in the accelerated method is an unbalanced convex combination of the first-order approximations, with larger weights placed on later iterations. Also, the prox center $y_{k}$ used in the computation of $x_{k}$ is a convex combination of $x_{k-1}$ and $v_{k-1}$ rather than just $v_{k-1}$. Finally, $A_{k}$ grows at the rate required of an accelerated method (as defined a few posts ago), which makes the convergence of this method much faster than that of the dual gradient method.

Remark IV.3. Compared to the analysis of the dual gradient method, the analysis of the accelerated method makes essential use of convexity, for example, through the minorization $\ell_{f}(\cdot;x_{k+1})+h(\cdot)\leq\phi(\cdot)$ and the subgradient inequality $(\ddagger)$, in the proofs of both $({\cal R}_{k}^{1})$ and $({\cal R}_{k}^{2})$.
