Acceleration in Optimization (Part III)

11 minute read

Published:

Difficulty: Advanced Undergraduate Students

In the previous post, we introduced one of Nesterov’s classic optimization schemes, and claimed that accelerated methods were just special instances of that scheme. For this post, let us give a simple example of an unaccelerated method that implements the Estimating Sequence Scheme and describe some of its properties.


Dual Gradient Method


  • Choose $x_{0}\in{\rm dom}\ h$ and define $\Gamma_{0}(\cdot)\equiv0$. Set $A_{0}=0$, $v_{0}=x_{0}$, and $y_{0}=x_{0}$.

  • For $k=1,\ldots$

    • Set $x_{k}\gets{\rm argmin}_{0\leq i\leq k-1}\phi(y_{i}).$

    • Define the function

    \[\Gamma_{k}(\cdot)\gets\frac{A_{k-1}\Gamma_{k-1}(\cdot)}{A_{k}}+\frac{L^{-1}\left[\ell_{f}(\cdot;v_{k-1})+h(\cdot)\right]}{A_{k}}.\]
    • Set $A_{k}\gets A_{k-1}+L^{-1}$ and compute
    \[\begin{aligned} v_{k} & \gets{\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ A_{k}\Gamma_{k}(u)+\frac{1}{2}\|u-v_{0}\|^{2}\right\} ,\\ y_{k} & \gets{\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ \ell_{f}(u;v_{k})+h(u)+\frac{L}{2}\|u-v_{k}\|^{2}\right\}. \end{aligned}\]

For completeness, let us show that the iterates of the above method satisfy both $({\cal R}_{k}^{1})$ and $({\cal R}_{k}^{2})$, so that the method is indeed an implementation of the Estimating Sequence Scheme.

Verifying the Key Conditions

Condition $({\cal R}_{k}^{2})$ follows immediately from the fact that the convex combination of two minorants is still a minorant. For condition $({\cal R}_{k}^{1})$, we proceed by induction. Since $({\cal R}_{0}^{1})$ is trivial, suppose $({\cal R}_{k}^{1})$ holds. Using the relations in $(\beta)$, the updates for $\Gamma_{k}$ and $y_{k}$, and the upper curvature condition $(\alpha)$, we have

\[\begin{align*} & \inf_{u\in\mathbb{R}^{n}}\left\{ A_{k+1}\Gamma_{k+1}(u)+\frac{1}{2}\|u-x_{0}\|^{2}\right\} \\ & =\inf_{u\in\mathbb{R}^{n}}\left\{ A_{k}\Gamma_{k}(u)+\frac{1}{L}\left[\ell_{f}(u;v_{k})+h(u)\right]+\frac{1}{2}\|u-x_{0}\|^{2}\right\} \\ & \overset{(\beta)}{\ge}A_{k}\phi(x_{k})+\frac{1}{L}\inf_{u\in\mathbb{R}^{n}}\left\{ \ell_{f}(u;v_{k})+h(u)+\frac{L}{2}\|u-v_{k}\|^{2}\right\} \\ & \overset{y_{k}}{=}A_{k}\phi(x_{k})+\frac{1}{L}\left[\ell_{f}(y_{k};v_{k})+h(y_{k})+\frac{L}{2}\|y_{k}-v_{k}\|^{2}\right]\\ & \overset{(\alpha)}{\geq}A_{k}\phi(x_{k})+\frac{1}{L}\phi(y_{k})\geq A_{k}\phi(x_{k})+\frac{1}{L}\phi(x_{k})\\ & =A_{k+1}\phi(x_{k}), \end{align*}\]

which is exactly condition $({\cal R}_{k}^{1})$. Notice that we did not invoke any properties related to the convexity of $f$ .

Remark III.1 The update for $y_{k}$ is similar to the primal (proximal) gradient method in $(\dagger)$ from the last post. However, it has the undesirable property that $A_{k}=1/(Lk)$, meaning that it is an unaccelerated method. In the next post, we will look at a better way to construct the estimate sequence $\Gamma_{k}$ (using the convexity of $f$) to obtain an accelerated method.

Remark III.2. The affine estimate function $\Gamma_{k}$ is a balanced convex combination of the first-order approximants $\ell_{f}(\cdot;v_{k-1})+h(\cdot)$ at the points $\{v_{i}\}_{i\geq0}$. To achieve acceleration, we will consider an unbalanced convex combination of similar approximants which have larger associated values of $A_k$.

Remark III.3. By completing the square, we can write the update for $y_k$ as

\[\begin{align*} y_{k} & \gets{\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ \ell_{f}(u;v_{k})+h(u)+\frac{L}{2}\|u-v_{k}\|^{2}\right\} \\ & ={\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ \frac{\|\nabla f(v_{k})\|^{2}}{2L^{2}}+\left\langle \frac{1}{L}\nabla f(v_{k}),u-v_{k}\right\rangle +\frac{h(u)}{L}+\frac{1}{2}\|u-v_{k}\|^{2}\right\} \\ & ={\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ \frac{h(u)}{L}+\frac{1}{2}\|u-v_{k}+\frac{1}{L}\nabla f(v_{k})\|^{2}\right\} \\ & ={\rm prox}_{h/L}\left(v_{k}-\frac{1}{L}\nabla f(v_{k})\right), \end{align*}\]

which can similarly be done for the primal (proximal) gradient method. A similar technique can be applied to the update for $v_k$.

Remark III.4. (For advanced students) The naming of dual gradient method as a “dual” method remains a mystery to probably all but Nesterov himself. One interpretation that I have about its name comes the duality theory corresponding to perturbation functions that I wrote about in my previous set of posts. Specifically, if a primal optimization problem can be reformulated as the minimization of a perturbation function $\Phi(x,u)$ over $x$ with $u=0$, then its dual problem is given as the maximization of the conjugate $-\Phi^{*}(\tilde{x},\tilde{u})$ over $\tilde{u}$ with $x=0$ and

\[\text{dual}\to\sup_{\tilde{u}}-\Phi^{*}(0,\tilde{u})\leq\inf_{x}\Phi(x,u)\gets\text{primal}.\]

Since the dual/conjugate problem is about finding the “best” support vector for $\Phi(\cdot,\cdot)$, one can see a correspondence to the dual gradient method in the sense that we are looking for the “best” *support function** $\Gamma_k$ for $f$.

Remark III.5. Nesterov also gives another analogy comparing the primal and dual gradient methods. The primal method improves the current optimization state using a local model of the objective function, and is analogous to how practioners work in industry. On the other hand, the dual method uses its generated sequence of iterates to construct an increasingly more accurate model of the objective, and is analogous to scientists and researchers.

Leave a Comment