Acceleration in Optimization (Part II)
Published:
Difficulty: Advanced Undergraduate Students
In our last post, we introduced a few important optimization concepts. With the understanding that the reader has (at least) a cursory knowledge of these concepts, let’s present the main idea behind acceleration.
We start by introducing one of Nesterov’s classic optimization schemes and then see how acceleration for obtaining $\varepsilon$-solutions can be understood within this scheme. For brevity, let us introduce the notation
\[\ell_{\psi}(\cdot;\bar{x}):=\psi(\cdot)+\left\langle \nabla\psi(\bar{x}),\cdot-\bar{x}\right\rangle\]to denote the linear approximation of a function $\psi:\mathbb{R}^{n}\mapsto\mathbb{R}$ at a given point $\bar{x}\in{\rm dom}\ \psi$.
Estimating Sequence Scheme for $({\cal P})$
Choose $x_{0}\in{\rm dom}\ h$ and some function $\Gamma_{0}:\mathbb{R}^{n}\mapsto\mathbb{R}$. Set $A_{0}\gets0$.
For $k=1,\ldots$
- Find $(x_{k},A_{k})\in{\rm dom}\ h\times\mathbb{R}_{++}$ and function $\Gamma_{k}:\mathbb{R}^{n}\mapsto\mathbb{R}$ such that
Clearly, for any optimum $x^{*}$ of $({\cal P})$, the generated iterates of the above scheme satisfy
\[A_{k}\phi(x_{k})\overset{({\cal R}_{k}^{1})}{\leq}A_{k}\Gamma_{k}(x^{*})+\frac{1}{2}\|x^{*}-x_{0}\|^{2}\overset{({\cal R}_{k}^{2})}{\leq}A_{k}\phi(x^{*})+\frac{1}{2}\|x^{*}-x_{0}\|^{2},\]for every $k\geq1$. Re-arranging, we have
\[(*)\qquad\phi(x_{k})-\phi(x^{*})\leq\frac{1}{2A_{k}}\|x^{*}-x_{0}\|^{2}.\]We now make a bold claim in view of the above result.
Claim II.1. All “classical” (or “Nesterov-like”) accelerated methods are instances (or minor variants) of the Estimating Sequence Scheme with
\[A_{k} \approx \begin{cases} \frac{k^{2}}{L}, & \mu=0,\\ \frac{1}{L}\left(1+\sqrt{\frac{\mu}{L}}\right)^{k-1}, & \mu>0. \end{cases}\]As a consequence, accelerated methods obtain an $\varepsilon$-solution of $({\cal P})$ in $O(T_{\varepsilon})$ iterations, where
\[T_{\varepsilon}=\begin{cases} \frac{\sqrt{L}\|x_{0}-x^{*}\|}{\sqrt{\varepsilon}}, & \mu=0,\\ \sqrt{\frac{L}{\mu}}\log\left(\frac{L\|x_{0}-x^{*}\|^{2}}{\varepsilon}\right), & \mu>0. \end{cases}\]Remark II.2. Bounds such as $(*)$ share some similarities with non-accelerated methods. For example, in the primal (proximal) gradient method, which has updates of the form
\[(\dagger)\qquad x_{k}\gets{\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ \lambda_{k}\left[\ell_{f}(u;x_{k-1})+h(u)\right]+\frac{1}{2}\|u-x_{k-1}\|^{2}\right\} ,\]a similar bound as in $(*)$ can be made in which $A_{k}\approx\sum_{i=1}^{k}\lambda_{i}$ under the assumption that $\lambda_{i}\in(0,2/L)$ for every $i\geq1$. As an aside, note that if $h(\cdot) \equiv 0$ then the above update reduces to classic gradient descent with stepsize $\lambda_k$.
Remark II.3. The function $\Gamma_{k}$ acts like a special support function for $\phi$ whose approximation of $\phi$ at $x_{k}$ becomes increasingly more accurate as $A_{k}\to\infty$. Indeed, $\Gamma_{k}$ supports $\phi$ in the sense that $({\cal R}_{k}^{2})$ implies $\Gamma_{k}$ minorizes $\phi$ (i.e., bounds it from below at every point in its domain) while $({\cal R}_{k}^{1})$ implies that
\[\begin{aligned} \phi(x_{k}) & \leq\Gamma_{k}(x_{k})+\frac{1}{2A_{k}}\|x_{k}-x_{0}\|^{2}\\ & \leq \phi(x_{k})+\frac{1}{2A_{k}}\|x_{k}-x_{0}\|^{2}\\ & \leq \phi(x_{k})+\frac{1}{A_{k}}\|x_{k}-x^{*}\|^{2}+\frac{1}{A_{k}}\|x^{*}-x^{0}\|^{2},\end{aligned}\]in which the term $\|x_{k}-x^{*}\|^{2} / A_{k}+\|x^{*}-x^{0}\|^{2} / A_{k}$ can be made small as $A_{k}\to\infty$. However it does require the very strong global condition $({\cal R}_{k}^{1})$ to hold.
Remark II.4. Defining $v_k := {\rm argmin}_{u\in\mathbb{R}^{n}}\{A_{k}\Gamma_{k}(u)+\|u-x_{0}\|^{2}/2\}$, an interesting set of bounds implied by the above scheme is
\[\begin{align*} (\beta)\qquad A_{k}\phi(x_{k})+\frac{1+\mu A_{k}}{2}\|u-v_{k}\|^{2} & \overset{({\cal R}_{k}^{1})}{\leq}A_{k}\Gamma_{k}(v_{k})+\frac{1}{2}\|v_{k}-x_{0}\|^{2}+\frac{1+\mu A_{k}}{2}\|u-v_{k}\|^{2}\\ & \leq A_{k}\Gamma_{k}(u)+\frac{1}{2}\|u-x_{0}\|^{2}\\ & \overset{({\cal R}_{k}^{2})}{\leq}A_{k}\phi(u)+\frac{1}{2}\|u-x_{0}\|^{2}. \end{align*}\]for every $u\in\mathbb{R}^{n}$, where the second inequality follows from the fact that $A_k \Gamma_k + \|\cdot\|/2$ is $(1+\mu A_k)$-strongly convex. In particular, the above bounds at $u=x^{*}$ yield the relations
\[\|x^{*} - v_k\| \leq \|x^{*}-x_0\|, \quad \|v_k - x_0\| \leq 2\|x^{*}-x_0\|.\]Remark II.5. The previous remark gives some insight into how acceleration can be qualified. Specifically, the rate of acceleration is directly related to the quality of the generated estimate sequence $\{\Gamma_{k}\}_{k\geq1}$ in terms of its approximation of $f$ on the sequence of points $\{x_{k}\}_{k\ge 1}$. The better the approximation, the faster the algorithm converges. For convex (or strongly convex) functions, we will see that their structure plays an important role in creating good estimate sequences.
Remark II.6. In Nesterov’s paper, the function $\psi_{k}$ is defined to be $A_{k}\Gamma_{k}+\|\cdot-x_{0}\|^{2}/2$ in order to simplify some technical expressions. For this series of posts, we opt to use $\Gamma_{k}$ because it is easier to interpret some of the results relative to $\Gamma_k$.
Leave a Comment